Leveraging LLM’s multi-head attention for document retrieval #RAG


Existing RAG solutions do not focus on queries that require fetching multiple documents with substantially different contents. The embeddings of such documents can be far apart in the embedding space, which complicates their retrieval. To address this, the paper [1] introduces Multi-Head RAG (MRAG), a novel scheme that uses the activations of the Transformer’s multi-head attention layer, instead of the decoder layer, as keys for fetching multi-aspect documents.

Key contributions:

  • Novel idea: using the activations of the multi-head attention part of the decoder block as embeddings captures the potential multi-aspectuality of the data without increasing space requirements compared to standard RAG
  • An evaluation methodology, plus a full data construction and query processing pipeline that implements the multi-aspect embedding idea
  • The datasets used support broad evaluation, covering both fully automatically generated synthetic data and specific industry use cases that show the benefits of MRAG
  • MRAG and its benchmarking principles can be seamlessly integrated with both existing RAG solutions and benchmarking frameworks such as RAGAS
  • Instead of the single activation vector generated by the last feed-forward decoder layer for the last token, MRAG uses the H separate activation vectors generated by the last attention layer for the last token, before they are processed by the matrix W_o (the linear layer that combines the outputs of all attention heads)
  • This can be formulated as a set of embeddings S = {e_k ∀ k}, where e_k = head_k(x_n), which is simply the set of all attention-head outputs on the last token x_n of the input
  • As processing with multiple heads does not change the size of the output vector, S has the same space requirements as standard RAG
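
To make the last two points concrete, here is a minimal sketch of cutting the last token's concatenated attention output (taken before W_o) into H per-head vectors. The function name `split_into_heads` and the toy numbers are assumptions for illustration; in practice the vector would come from a real embedding model.

```python
def split_into_heads(concat_heads, num_heads):
    """Split the concatenated per-head attention output into num_heads vectors."""
    d = len(concat_heads)
    if d % num_heads != 0:
        raise ValueError("vector length must be divisible by the number of heads")
    head_dim = d // num_heads
    # Head k owns the contiguous slice [k * head_dim, (k + 1) * head_dim).
    return [concat_heads[k * head_dim:(k + 1) * head_dim] for k in range(num_heads)]


# Toy values (hypothetical): hidden size 8, H = 4 heads.
last_token_attention = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
multi_aspect_embedding = split_into_heads(last_token_attention, num_heads=4)

# The H pieces together hold exactly as many numbers as the single vector.
total_size = sum(len(e) for e in multi_aspect_embedding)
```

The total number of stored values is unchanged, which is why the set S needs no more space than a single standard embedding.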

The figure below summarizes the pipeline design

i) MRAG Pipeline Overview

a) Data preparation

  • Populate a data store with multi-aspect MRAG text embeddings and their corresponding documents or text chunks
  • Create the multi-aspect embedding of each text chunk using a selected decoder-based embedding model
  • Users of the pipeline can plug in their model of choice and use their own input data
  • The authors also offer a dedicated synthetic data generator that can be used to construct multi-aspect input documents for evaluation purposes
  • For MRAG, each multi-aspect embedding consists of h single-aspect embeddings, each pointing to the original text chunk. The data store therefore contains h embedding spaces, each capturing a different aspect of the text.
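
The data store layout described above can be sketched as h parallel lists, one per embedding space, where every entry points back to its text chunk. This is only an illustration: `build_store` and `toy_embed` are hypothetical names, and `toy_embed` is a stand-in for a real decoder-based embedding model.

```python
def build_store(chunks, embed_fn, num_heads):
    """Return num_heads embedding spaces; each holds (vector, chunk_id) pairs."""
    store = [[] for _ in range(num_heads)]
    for chunk_id, text in enumerate(chunks):
        heads = embed_fn(text)
        if len(heads) != num_heads:
            raise ValueError("embed_fn must return one vector per head")
        for k, vec in enumerate(heads):
            # Every single-aspect embedding points to the same original chunk.
            store[k].append((vec, chunk_id))
    return store


def toy_embed(text):
    """Hypothetical stand-in for a model: 2 'heads' with 2 numbers each."""
    vowels = sum(ch in "aeiou" for ch in text.lower())
    return [[float(len(text)), float(vowels)], [float(text.count(" ")), 1.0]]


chunks = ["alpha beta", "gamma"]
store = build_store(chunks, toy_embed, num_heads=2)
```

A real deployment would more likely use a vector database per space, but the mapping from (space, vector) to chunk stays the same.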

b) Query execution

  • First, generate a multi-aspect embedding of the input query using the selected embedding model
  • Then, find the nearest multi-aspect embeddings and their corresponding text chunks in the data store using a special multi-aspect retrieval strategy
  • Finally, the retrieved data can optionally be assessed with novel metrics that measure how well it meets the multi-aspect requirements
  • Like data preparation, this stage is flexible: users can plug in their models of choice and use their own queries

ii) Constructing Multi-Aspect Embeddings

  • MRAG can leverage any embedding model with multi-head attention support to construct the multi-aspect embeddings for a given input text
  • The paper uses two embedding models from the MTEB leaderboard [2], namely SFR-Embedding-Model [3] and e5-mistral-7b-instruct [4]
  • In the authors’ experiments, multi-aspect embeddings extracted from the last multi-head attention layer worked best

iii) Retrieval Strategies for Multi-Aspect Data

MRAG retrieval strategy consists of three steps:

a) importance scores assignment

  • First, during data preparation, importance scores are assigned to all h embedding spaces, capturing the fact that different spaces (and the corresponding heads) may be more or less relevant for the used data

b) Getting closest text chunks

  • Then, during query execution, MRAG starts by applying the traditional RAG retrieval separately for each embedding space, returning a list of c closest text chunks for each embedding space (a total of h lists)
  • Then, a special voting strategy picks the overall top k out of all h · c chunks, using the pre-computed importance scores
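
The per-space retrieval step can be sketched as below, assuming exact nearest-neighbor search with cosine distance and a store laid out as one list of (vector, chunk_id) pairs per embedding space. The names `retrieve_per_space` and `cosine_distance` and all values are illustrative assumptions.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 1.0
    return 1.0 - dot / (nu * nv)


def retrieve_per_space(store, query_heads, c):
    """For each embedding space, return the ids of the c closest chunks."""
    lists = []
    for space, q in zip(store, query_heads):
        ranked = sorted(space, key=lambda item: cosine_distance(q, item[0]))
        lists.append([chunk_id for _, chunk_id in ranked[:c]])
    return lists


# Toy store with h = 2 spaces and 3 chunks (hypothetical values).
store = [
    [([1.0, 0.0], 0), ([0.0, 1.0], 1), ([1.0, 1.0], 2)],
    [([0.0, 1.0], 0), ([1.0, 0.0], 1), ([1.0, 1.0], 2)],
]
query_heads = [[1.0, 0.0], [1.0, 0.0]]
per_space_lists = retrieve_per_space(store, query_heads, c=2)
```

Here the two spaces disagree on the best chunk (chunk 0 versus chunk 1), which is exactly the situation the voting step has to resolve.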

c) Integration with Data Stores

  • MRAG can be seamlessly used with different classes of data stores and nearest neighbor (NN) search approaches.
  • It can be combined with both the exact and the approximate NN to find the matching (embedding, chunk)-pairs.

Algorithm below details the construction of importance scores:

  • It is a heuristic based on extensive empirical evaluation; it gives high-quality results across the tested datasets and tasks
  • The score s_i of a given head h_i consists of two parts, a_i and b_i. a_i is the average of the L2 norms of all embeddings in vector space i; it represents how important a given head is: the larger the norms, the more attention was given to this attention head. b_i is the average of cosine distances between all embeddings in vector space i (or a randomly sampled subset, if the user wants to reduce pre-compute time)
  • b_i is a proxy for the “spread” of vector space i: the larger b_i, the larger the average angle between different embeddings in this space
  • Deriving s_i as the product a_i · b_i rewards heads with high average attention and high average spread, and penalizes heads with low average attention or low average spread (both a_i and b_i are appropriately scaled)
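
A minimal sketch of the score computation follows. It uses all pairs rather than a random sample and omits the scaling of a_i and b_i, since the summary does not say how the scaling is done; both choices are assumptions, and `importance_score` is a hypothetical name.

```python
import math
from itertools import combinations


def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))


def cosine_distance(u, v):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    nu, nv = l2_norm(u), l2_norm(v)
    if nu == 0.0 or nv == 0.0:
        return 1.0
    return 1.0 - sum(a * b for a, b in zip(u, v)) / (nu * nv)


def importance_score(embeddings):
    """s_i = a_i * b_i for one embedding space (unscaled, all pairs)."""
    if not embeddings:
        raise ValueError("embedding space is empty")
    # a_i: average L2 norm of the embeddings in this space.
    a = sum(l2_norm(e) for e in embeddings) / len(embeddings)
    # b_i: average pairwise cosine distance (the "spread" of the space).
    pairs = list(combinations(embeddings, 2))
    b = sum(cosine_distance(u, v) for u, v in pairs) / len(pairs) if pairs else 0.0
    return a * b


narrow_space = [[3.0, 4.0], [4.0, 3.0]]   # large norms, small spread
spread_space = [[1.0, 0.0], [0.0, 1.0]]   # small norms, large spread
s_narrow = importance_score(narrow_space)  # 5 * 0.04, about 0.2
s_spread = importance_score(spread_space)  # 1 * 1.0 = 1.0
```

In this toy case the narrow space has the larger norms but is still outscored, because the product punishes its low spread.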

Voting strategy used

  • It combines the constructed lists of text chunks from the individual embedding spaces into a single list of top k chunks
  • The algorithm used is outlined below
  • Each text chunk from list i of vector space i has a certain position on this list; this position is denoted by p.
  • The weight of this chunk is s_i · 2^(−p), where s_i is the previously defined importance score of space i.
  • Multiplying s_i by 2^(−p) exponentially lowers the significance of less relevant text chunks. Finally, all chunks from all lists are sorted by their weights and the top k chunks form the final list.
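
The voting step can be sketched as follows. Two details are assumptions rather than something the summary states: positions p start at 0, and a chunk that appears in several lists has its weights summed. The name `vote` and the input values are illustrative.

```python
def vote(per_space_lists, scores, k):
    """Merge per-space ranked lists into one top-k list of chunk ids."""
    weights = {}
    for ranked, s in zip(per_space_lists, scores):
        for p, chunk_id in enumerate(ranked):
            # Weight s_i * 2^(-p); duplicates across lists are summed (assumption).
            weights[chunk_id] = weights.get(chunk_id, 0.0) + s * 2.0 ** (-p)
    ordered = sorted(weights, key=lambda cid: weights[cid], reverse=True)
    return ordered[:k]


# Two spaces with importance scores 0.2 and 1.0 (hypothetical values).
per_space_lists = [[0, 2], [1, 2]]
scores = [0.2, 1.0]
top_chunks = vote(per_space_lists, scores, k=2)
# Weights: chunk 1 -> 1.0, chunk 2 -> 0.1 + 0.5 = 0.6, chunk 0 -> 0.2
```

Chunk 0 is ranked first in its own space but still drops out, because that space has a low importance score.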

i) Multi-Aspect Dataset Generation

  • The authors selected conceptually different categories of documents
  • They primarily focused on publicly available Wikipedia articles and selected 25 categories, with 50 documents sampled from each category
  • Each article overview must have at least 800 characters, matching commonly used chunk sizes in RAG schemes

ii) Multi-Aspect Query Generation

  • The evaluation requires queries that touch upon a given number of n aspects. For example, a query with 10 aspects must contain a question about 10 different documents from 10 different categories
  • Such queries were created by selecting n categories, sampling a document from each selected category (ensuring there are no duplicates overall), and then generating a story that combines these documents using an LLM (GPT-3.5 Turbo)
  • 25 queries were constructed for each of 1, 5, 10, 15 and 20 aspects (125 queries in total)
  • An example multi-aspect query sent to the LLM, which requires retrieving 10 documents from 10 different categories, is pictured in the top part of the figure below
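
The sampling part of query generation can be sketched as below. The LLM step that turns the sampled documents into a story is left out, and the function name `sample_aspects`, the `used` set for avoiding duplicates, and the toy data are all assumptions for illustration.

```python
import random


def sample_aspects(docs_by_category, n, rng, used=None):
    """Pick n distinct categories and one not-yet-used document from each."""
    used = set() if used is None else used
    categories = rng.sample(sorted(docs_by_category), n)
    picked = []
    for cat in categories:
        candidates = [d for d in docs_by_category[cat] if d not in used]
        if not candidates:
            raise ValueError(f"no unused documents left in category {cat!r}")
        doc = rng.choice(candidates)
        used.add(doc)  # keeps documents unique across all generated queries
        picked.append((cat, doc))
    return picked


docs_by_category = {"A": ["a1", "a2"], "B": ["b1"], "C": ["c1", "c2"]}
rng = random.Random(0)  # fixed seed so the sketch is reproducible
picked = sample_aspects(docs_by_category, n=2, rng=rng)
```

Passing the same `used` set across calls would enforce the no-duplicates constraint over the whole query set.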

iii) Metrics

  • For a query Q, a retrieval strategy S, and n documents from n categories to retrieve, Q_rel denotes the ideal set of documents that should be retrieved for Q. S(Q, n) is then the set of documents actually retrieved.
  • The Retrieval Success Ratio is defined as the ratio of successfully retrieved relevant documents
  • For the case where a RAG scheme does not retrieve the exact desired document but still retrieves some other document from the same category, the authors define a second measure, the Category Retrieval Success Ratio. It is the same as the metric above, with one difference: S(Q, n) is now the set of all retrieved documents that belong to the categories of the ideal desired documents
  • Finally, the two metrics are combined into the Weighted Retrieval Success Ratio. By varying the weight w, the user can adjust the relative importance of exact document matches and category matches
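
The three metrics can be sketched as below. The exact formulas are not given in this summary, so the definitions here are assumptions: the category ratio counts how many of the desired categories are hit, and the weighted form (w · exact + category) / (w + 1) is one natural way to combine the two. Check the paper for the authoritative definitions.

```python
def retrieval_success_ratio(retrieved, relevant):
    """Share of the ideal documents Q_rel that were actually retrieved."""
    relevant = set(relevant)
    return len(set(retrieved) & relevant) / len(relevant)


def category_success_ratio(retrieved, relevant, category_of):
    """Share of the desired categories covered by the retrieved documents."""
    wanted = {category_of[d] for d in relevant}
    hit = {category_of[d] for d in retrieved} & wanted
    return len(hit) / len(wanted)


def weighted_success_ratio(exact, category, w):
    """Assumed combination: exact matches count w times as much."""
    return (w * exact + category) / (w + 1)


# Hypothetical documents and categories.
category_of = {"a1": "A", "b1": "B", "b2": "B", "c1": "C", "d1": "D", "e1": "E"}
relevant = ["a1", "b1", "c1", "d1"]
retrieved = ["a1", "b2", "c1", "e1"]

exact = retrieval_success_ratio(retrieved, relevant)                 # 2 of 4
category = category_success_ratio(retrieved, relevant, category_of)  # A, B, C of 4
weighted = weighted_success_ratio(exact, category, w=1)
```

In this example the document b2 misses the exact match b1 but still counts toward the category ratio.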

i) Comparison Baselines

  • two main baselines: Standard RAG and Split RAG
  • Standard RAG represents a modern RAG pipeline in which each document uses the activations of the last decoder layer as its embedding
  • Split RAG is a blend of Standard RAG and MRAG: it splits the activation of the last decoder layer in the same way as MRAG and applies a voting strategy
  • The purpose of Split RAG is to show that MRAG’s benefits come from using the multi-head output as the embedding, not merely from using multiple embedding spaces
  • Fusion RAG [29] is additionally considered as an optional mechanism that can further enhance the benefits of MRAG at the cost of additional tokens

ii) Samples & Summaries

  • The boxplots below show the retrieval success ratio over 25 queries for MRAG and Standard RAG, where each query includes 10 different aspects

iii) Results Analysis

  • The results above show that MRAG consistently outperforms Standard RAG (> 10% increase in the retrieval success ratio on average for exact document matches).
  • The improvement is even larger on category matches (> 25% increase in the retrieval success ratio on average)
  • For a given number of documents fetched, MRAG’s histogram indicates a better distribution of retrieval success ratios (across all 25 queries).
  • The figure below shows the relative weighted performance improvement of MRAG over Standard RAG as the number of aspects present in the queries varies
  • MRAG consistently outperforms Standard RAG by 10–20% on average, not only across the number of documents fetched, but also across the number of aspects present in the replies, for both models.
  • The table below shows the retrieval success ratio (exact document match) for 25 queries with a single aspect
  • The table shows that MRAG performs on par with Standard RAG on queries from the multi-aspect dataset where only a single aspect is expected

iv) Further Improvements with Additional Tokens

  • The authors combined MRAG with Fusion RAG, which represents RAG schemes that use an LLM (at additional token cost) for more accurate retrieval.
  • Fusion RAG uses an LLM to create a fixed number of questions about the RAG query. Each question is separately passed through an embedding model using Standard RAG
  • The MRAG approach is applied to each of these questions; the combined scheme is denoted Fusion MRAG
  • The figure below shows the relative retrieval improvements of MRAG over Standard RAG for the SFR embedding model compared with Split RAG (the blue plots), and the relative retrieval improvements of Fusion MRAG over both Fusion RAG and MRAG (the red plots).
  • Both Fusion RAG and Fusion MRAG perform better than Standard RAG, gaining 10 to 30% in accuracy on average

v) Real-World Workloads

  • The authors considered two real-world use cases from in-house industry data analytics projects, namely the synthesis of legal documents and the analysis of causes of chemical plant accidents
  • The figure below shows the average improvement of the retrieval success ratio of MRAG and Split RAG over Standard RAG for the two real-world workloads: constructing legal documents (left) and discovering causes of industry accidents (right).
  • In terms of retrieval success ratio over the corresponding databases, MRAG offers advantages over the other schemes
  • In summary, the paper proposes Multi-Head RAG (MRAG), a novel scheme that leverages the activations from the multi-head attention layer of decoder models instead of the traditional feed-forward layer
  • A comprehensive evaluation methodology, including specific metrics, synthetic datasets, and real-world use cases, demonstrates MRAG’s effectiveness
  • The results indicate a significant improvement in the relevance of retrieved documents, with up to 20% better performance compared to modern RAG baselines
  • MRAG is both cost-effective and energy-efficient: it does not require additional LLM queries, multiple model instances, increased storage, or multiple inference passes over the embedding model


