Building RAG from Day One: A Framework (Part 1)
Building RAG doesn't have to be hard... or does it?
Building RAG is hard (?) because the challenges are multidimensional: improving the application from a developer's perspective while working around its limitations from a product standpoint. In this series, I've attempted to classify these challenges into a framework, focusing on the trade-offs and the effort required to make informed decisions.
Refreshing the Basics:
To start, let's revisit the standard RAG setup, which might be the simplest approach you'll need:
Document Preparation: Start with your document. If it's lengthy, split it into multiple smaller documents, embed them, and store the embeddings in a vector database.
(Image from LangChain)
Retrieval: When a user query comes in, retrieve the relevant documents from the vector database based on the query.
Generation: Feed the retrieved documents to the LLM (Large Language Model) to generate the final answer.
(Image from LangChain)
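To make these three steps concrete, here is a minimal sketch of the basic loop in Python. It uses chromadb purely for illustration (any vector database works similarly), and the chunk texts, query, and the my_llm.generate call are hypothetical placeholders rather than a prescribed setup:

```python
# A minimal sketch of the basic RAG loop: prepare, retrieve, generate.
# chromadb is used for illustration; any vector database works similarly.
import chromadb

client = chromadb.Client()  # in-memory vector store
collection = client.create_collection("docs")

# 1. Document preparation: split a long document into smaller chunks and store them.
chunks = [
    "Refunds are issued within 30 days of purchase.",
    "Support is available Monday to Friday, 9am-5pm.",
]
collection.add(documents=chunks, ids=[f"chunk-{i}" for i in range(len(chunks))])

# 2. Retrieval: fetch the chunks most similar to the user query.
query = "How long do I have to request a refund?"
results = collection.query(query_texts=[query], n_results=2)
context = "\n".join(results["documents"][0])

# 3. Generation: feed the retrieved context to the LLM of your choice.
prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\nQuestion: {query}"
)
# answer = my_llm.generate(prompt)  # hypothetical call to whichever LLM you use
print(prompt)
```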
If you're unfamiliar with the term Retrieval-Augmented Generation (RAG), there's a great blog post that provides a clear and detailed introduction: Understanding RAG. This post walks you through the concept, its applications, and why it's a powerful approach in the realm of AI and language models. It's an excellent starting point for anyone looking to grasp the basics of RAG.
Philosophy: Keep It Simple
When I first encountered the term RAG (Retrieval-Augmented Generation), it was exactly what it sounded like: augmenting the generation of an LLM (Large Language Model) with retrieved content. In this blog, I'll stick to that philosophy, which I often refer to as: RAG as a Search Engine with LLM.
The title speaks for itself, and our end goal is clear: Improve our search engine as much as possible and feed the results to the LLM to generate the answer.
Acknowledging Challenges:
Given a user query, our goal is to retrieve the correct information and feed it to the LLM to generate an accurate answer.
Document Preparation: We assume that, in this basic setup, there will always be a document available to answer the user's query.
LLM Challenges: In this context, "generate" often means that the LLM can extract or copy relevant text to respond to the query. More complex queries that require the LLM to reason over the document will be addressed in my next blog. As of September 2024, most open-source LLMs with around 2 billion parameters can extract the correct information most of the time.
This brings us to a critical question: How can we best search for and retrieve documents? The success of RAG hinges on the reliability of this search process.
Step 1: Ensuring Reliability
In production environments, reliability is crucial, especially when deploying systems like Retrieval-Augmented Generation (RAG). To build and maintain confidence in your RAG system, it’s important to adopt a metrics-driven approach. RAGAS (Retrieval-Augmented Generation Assessment) provides an effective set of metrics for evaluating and enhancing your system's performance.
Metrics-Driven Development (MDD) emphasizes the importance of using data to inform decisions throughout the development process. This involves continuously monitoring key metrics, such as those provided by RAGAS, to gain insight into how your application performs over time. By focusing on these metrics, you can identify areas for improvement, ensure consistent reliability, and guide the ongoing development of your system (see docs.ragas.io).
For further details on how to implement RAGAS and integrate it into your development process, you can refer to the RAGAS GitHub repository.
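As a rough illustration, the snippet below sketches what a RAGAS evaluation run might look like. The sample question, answer, contexts, and ground truth are made up, and the exact API and column names have changed between releases, so treat this as a sketch and check docs.ragas.io for the current interface:

```python
# A minimal sketch of metrics-driven evaluation with RAGAS (API may vary by version).
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_precision, faithfulness

# Hypothetical evaluation samples collected from your RAG pipeline.
samples = {
    "question": ["What is the refund window?"],
    "answer": ["Refunds are issued within 30 days of purchase."],
    "contexts": [["Our policy: refunds are issued within 30 days of purchase."]],
    "ground_truth": ["Refunds are available within 30 days of purchase."],
}
dataset = Dataset.from_dict(samples)

# Note: RAGAS typically needs an LLM backend (e.g., an OpenAI API key) configured
# to compute these metrics.
result = evaluate(dataset, metrics=[faithfulness, answer_relevancy, context_precision])
print(result)
```

Tracking these scores on a fixed evaluation set after every change to the pipeline is what makes the development process metrics-driven.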
Step 2: Build Your Search Engine
In a Retrieval-Augmented Generation (RAG) system, the search engine plays a crucial role as the backbone that retrieves relevant documents based on user queries. Optimizing this process involves setting up a robust pipeline to ensure that the most relevant information is retrieved and ranked effectively.
A fundamental approach to achieving this is the Retrieval - Re-rank pipeline, which consists of two primary stages:
Retrieval: The first stage involves retrieving a broad set of documents or passages that are likely to contain relevant information. This step relies on embeddings that represent the semantic content of both queries and documents, allowing the search engine to identify documents that are similar in meaning to the query.
Re-ranking: Once the initial set of documents is retrieved, the next step is to re-rank them to ensure the most relevant ones are prioritized. Re-ranking involves applying a more fine-tuned model or additional criteria to order the documents so that the most contextually appropriate information is at the top.
This basic pipeline ensures that the system not only retrieves relevant documents but also presents them in an order that maximizes the chances of generating accurate and useful responses from the language model.
For a more detailed guide on setting up a Retrieval - Re-rank pipeline, you can refer to the example provided here. This guide offers practical insights and code examples to help you implement and optimize this crucial component of your RAG system.
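As a complement to that guide, here is a minimal sketch of the two-stage pipeline using the sentence-transformers library. The model names and the toy corpus are assumptions chosen for illustration; any comparable bi-encoder and cross-encoder can be substituted:

```python
# A minimal sketch of a Retrieval - Re-rank pipeline with sentence-transformers.
from sentence_transformers import CrossEncoder, SentenceTransformer, util

corpus = [
    "Refunds are issued within 30 days of purchase.",
    "Support is available Monday to Friday, 9am-5pm.",
    "Enterprise plans include a dedicated account manager.",
]

# Stage 1 - Retrieval: a bi-encoder embeds query and documents independently.
bi_encoder = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1")
corpus_embeddings = bi_encoder.encode(corpus, convert_to_tensor=True)

query = "How long do I have to request a refund?"
query_embedding = bi_encoder.encode(query, convert_to_tensor=True)
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=3)[0]

# Stage 2 - Re-ranking: a cross-encoder scores each (query, document) pair jointly.
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
pairs = [(query, corpus[hit["corpus_id"]]) for hit in hits]
scores = cross_encoder.predict(pairs)

# Present documents in order of cross-encoder relevance, most relevant first.
for score, (_, doc) in sorted(zip(scores, pairs), reverse=True, key=lambda x: x[0]):
    print(f"{score:.3f}  {doc}")
```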
Enhancing the Pipeline
To truly optimize your Retrieval-Augmented Generation (RAG) system, fine-tuning the models involved is a critical step to ensure that the system performs exceptionally well on your specific data and use cases.
Why Fine-Tuning Matters
While public embedding models like those from OpenAI offer a strong starting point, they are generally trained on diverse datasets and may not fully capture the nuances of your domain-specific data. Fine-tuning allows you to customize these models, making them more attuned to the particular needs of your application. This ensures that the model not only retrieves relevant documents but also ranks them in a way that aligns with your specific priorities.
Fine-Tuning the Embedding Model (Bi-encoder)
The embedding model, often a bi-encoder, is responsible for converting both queries and documents into vector representations. Fine-tuning this model on your domain-specific data helps it better understand the context and subtleties of your queries and documents. By doing so, the retrieval process becomes more accurate, as the model will be better equipped to identify documents that are truly relevant to the queries.
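As an illustration, fine-tuning a bi-encoder on domain-specific (query, relevant passage) pairs might look like the sketch below. The base model, training pairs, and hyperparameters are all assumptions; MultipleNegativesRankingLoss is one common choice when you only have positive pairs, since it treats the other passages in the batch as negatives:

```python
# A minimal sketch of fine-tuning a bi-encoder with sentence-transformers.
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed base model

# Hypothetical (query, relevant passage) pairs drawn from your domain.
train_examples = [
    InputExample(texts=["How do I reset my password?",
                        "To reset your password, open Settings > Account > Reset."]),
    InputExample(texts=["What is the refund window?",
                        "Refunds are issued within 30 days of purchase."]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# In-batch negatives: every other passage in the batch is treated as irrelevant.
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
model.save("my-domain-bi-encoder")
```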
Fine-Tuning the Re-rank Model (Cross-encoder)
After retrieval, the re-rank model, typically a cross-encoder, comes into play. This model compares the query directly with each retrieved document and assigns a relevance score. Fine-tuning the re-rank model ensures that it prioritizes documents in a way that better reflects the needs and priorities of your specific application. This step is crucial for refining the order of retrieved documents, ensuring that the most pertinent information is always presented first.
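Similarly, the cross-encoder re-ranker can be fine-tuned on labeled (query, passage) pairs. The sketch below assumes binary relevance labels and illustrative hyperparameters:

```python
# A minimal sketch of fine-tuning a cross-encoder re-ranker with sentence-transformers.
from torch.utils.data import DataLoader
from sentence_transformers import InputExample
from sentence_transformers.cross_encoder import CrossEncoder

# Hypothetical (query, passage) pairs: label 1.0 = relevant, 0.0 = not relevant.
train_examples = [
    InputExample(texts=["What is the refund window?",
                        "Refunds are issued within 30 days of purchase."], label=1.0),
    InputExample(texts=["What is the refund window?",
                        "Our office is open Monday to Friday."], label=0.0),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# num_labels=1: the model outputs a single relevance score per (query, passage) pair.
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2", num_labels=1)
model.fit(train_dataloader=train_dataloader, epochs=1, warmup_steps=10)
model.save("my-domain-cross-encoder")
```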
Is It Good Enough?
The basic Retrieval - Re-rank pipeline, while simple, is highly effective in solving the majority of tasks. This aligns with the 80-20 rule: 80% of your task can be accomplished with this straightforward setup, leaving the remaining 20% for fine-tuning and polishing.
However, the true power of this pipeline lies in your ability to creatively address its limitations and continuously improve it. For instance, even with a well-optimized model, new questions and challenges will arise over time. To adapt, you can incorporate additional documents into your pipeline, ensuring that your system remains robust and capable of handling these new queries.
By continuously refining the pipeline and expanding the document set, you not only enhance the system’s reliability but also ensure that it evolves with the needs of its users. This ongoing process of improvement and adaptation is key to maintaining a high-performing RAG system in the long term.
What’s Next?
Don’t worry: in the next part, I’ll continue to address the limitations of the current framework. I’ll dive into:
Document Preparation: Handling large volumes of documents that can’t be manually managed or prepared is a common challenge. I’ll explore strategies to automate and optimize this process.
LLM Capabilities: To answer more complex questions, sometimes a single document isn’t enough. I’ll discuss how to enable your LLM to retrieve multiple documents to gather insights or go through multiple steps (multi-hop) to form a complete answer.
Be sure to subscribe to stay updated on these advancements. See you in the next part!