How Width.ai Builds In-Domain Conversational Systems using Ability Trained LLMs and Retrieval Augmented Generation (RAG)

February 29, 2024

Chatbots with access to company-specific data or company-defined knowledge continue to grow in popularity, as users discover that popular models like ChatGPT and Bard are trained only up to a specific date and don’t natively have access to the internet or your database. I’m sure you’ve gotten messages like this one:

Example of the issues with real time data in ChatGPT

While this isn’t a surprise given the way these models are trained, or even much of an issue, the pressing question for companies is how to attach their specific data and knowledge to one of these models so they can take advantage of the abilities of these LLMs rather than the information they were trained on. Internally, we call this distinction knowledge-driven training vs ability-driven training.

For these LLMs to be really useful to companies, they need access to real-time data and knowledge, and an understanding of in-domain and industry-specific concepts and outputs.

What Is Retrieval-Augmented Generation?

LLMs are trained on large amounts of text from websites, books, research papers, and other such public content. The semantics and real-world knowledge present in all that text are dispersed across the weights of the LLM's neural network. This representation as network weights is called parametric memory.

But what if you want accurate answers based on your own private documents and data? Since they weren't included in the LLM training, you aren't likely to get the answers you want.

One option is to convert your private documents and data into a private dataset and finetune the LLM on it. But the problem is that you'll have to finetune the LLM frequently if new documents and data are being created all the time.

So you need an approach that can dynamically supply information to your LLM on demand. Retrieval-augmented generation is a solution that retrieves relevant information from your documents and data to supply to your LLM.

The Modern High Level RAG Pipeline

As of 2023, RAG is often implemented as a pipeline of separate software components, like the one shown above. The components include:

Knowledge base: The knowledge base (KB) is the information from which relevant sections are retrieved and provided to the LLM as inputs. Unlike the internal knowledge dispersed throughout the LLM's weights (parametric memory), the KB acts like external, non-parametric memory that people can easily read, modify, and replace without retraining the LLM.
Data sources: Your KB can consist of a variety of data sources and formats, like documents, relational databases, application programming interfaces (APIs), or web content.
Large language model: The LLM acts based on the details provided in a prompt which includes a primary task or question, contextual information, and optional few-shot examples to demonstrate desired results. Any of these details may come from the KB.
Embeddings: Embeddings are vector representations of the information in the KB. They're preferred over raw formats because the relevance of information searches is much better with vectors. To create embeddings for some text, we use pre-trained or finetuned embedding models — like SentenceTransformers, RoBERTa, or T5 — or cloud-based embedding APIs like OpenAI's embeddings endpoint.
Vector database: Depending on the size of your KB, vectorization can take a long time and produce millions of embeddings that are difficult to search quickly. A vector database alleviates this by saving the embeddings on storage media for later reuse and efficiently searching millions of embeddings with fast similarity algorithms.

A typical modern RAG workflow with these components operates as follows:

Vectorize your KB: All the information in your KB is vectorized using an embedding model or API, and the generated embeddings are stored in the vector database. The embeddings may cover multiple levels of information organization, like sentences, paragraphs, sections, chapters, entire documents, database rows, subsets of database columns, and so on.
Set up your LLM: Typically, you just use a pre-trained LLM like GPT-4 or Llama 2. But for some specialized domains or knowledge bases, you may have to fine-tune your LLM. For example, for generating structured formats like JSON, you should strengthen the LLM's knowledge of those formats because they're considerably different from natural language. Or, if you want a chatbot capable of multi-turn dialogue, you should fine-tune the LLM using reinforcement learning from human feedback.
Receive prompts: With the embeddings and LLM ready, your RAG pipeline can start responding to prompts from users or other systems.
Generate queries: More modern RAG systems generate search queries based on the provided user query. In chatbots, this also allows you to create search queries based on the entire conversation.
Create embeddings for the search queries: For each query, vectorize it using the same embedding model or API.
Retrieve information relevant to the prompts: Pass the prompt's embedding to the vector database. Using vector similarity algorithms, it finds information in the KB whose embeddings are semantically relevant to the prompt. The information may be useful contextual details, better task descriptions, detailed system prompts, or relevant few-shot examples.
Augment the prompts with the retrieved information: The retrieved information is combined with the original prompt according to LLM-specific prompt engineering rules. Sometimes, it may even replace the original prompt entirely. You may also augment system prompts this way.
Send the prompts to the LLM: Each augmented prompt — consisting of a task and optional contextual information, few-shot examples, or system prompt — is sent to the LLM for processing. Parameters like the temperature, probability threshold, or the top K count may vary depending on the prompt, the domain, and even aspects of the retrieved information. If you're confident of the retrieved context's relevance and factuality, you might want the temperature low to make the responses more extractive and less hallucinatory.
LLM generates responses: The LLM generates one or more responses for each prompt. You can optionally evaluate and rerank them.
Use responses as additional context: You can include a generated response in the next prompt. You can also retrieve additional information from your KB for it, leading to increasingly detailed information for the LLM after each step.
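
The workflow above can be sketched in a few lines of Python. This is only a minimal illustration under loud assumptions: a bag-of-words counter stands in for a real embedding model, a plain list stands in for the vector database, and the final LLM call is left out. The names `embed`, `build_index`, `retrieve`, and `augment` are hypothetical helpers, not part of any library.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy embedding: a bag-of-words count vector (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def build_index(chunks):
    """Vectorize the KB: keep (embedding, chunk) pairs as a toy 'vector database'."""
    return [(embed(c), c) for c in chunks]

def retrieve(index, query, top_k=2):
    """Embed the query and return the top_k most similar chunks."""
    q = embed(query)
    ranked = sorted(index, key=lambda pair: cosine(q, pair[0]), reverse=True)
    return [chunk for _, chunk in ranked[:top_k]]

def augment(question, context_chunks):
    """Combine the retrieved context with the original question."""
    context = "\n".join(f"- {c}" for c in context_chunks)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

# Hypothetical knowledge base entries, for illustration only.
kb = [
    "Wire transfers above 10,000 USD require a second approval.",
    "Debit cards can be frozen from the mobile app.",
    "Savings accounts accrue interest monthly.",
]
index = build_index(kb)
question = "How do I freeze my debit card?"
prompt = augment(question, retrieve(index, question, top_k=1))
print(prompt)  # this augmented prompt is what would be sent to the LLM
```

In a real deployment, `embed` would call an embedding model or API and `retrieve` would query a vector database, but the order of the steps stays the same.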

The architecture and workflow above have become more popular because they are convenient to deploy and perform well at scale, at the cost of slightly lower relevance than the original end-to-end approach covered later in this article. In the following sections, we delve deeper into aspects of this pipeline.

Retrieve Knowledge, Prompts, or Both?

An LLM's response is guided by multiple pieces of information in the prompt like:

The primary task or question
The system prompt that guides the LLM's behavior, tone, and personality
Contextual information and its relevance to the task or question
Few-shot examples to demonstrate desired processing or results

Conventional RAG focuses only on retrieving contextual information that is relevant to the task.

However, there's nothing special about context. You can use the same retrieval approach to select relevant tasks, questions, system prompts, or few-shot examples. We demonstrate the need for this and the outcomes in the sections below.

Retrieval for Primary Tasks and Questions

All promptable LLMs and multi-modal models like DALL-E or Stable Diffusion are sensitive to the structures and semantics of prompts. That's why there are so many prompt engineering tips and tricks in circulation.

RAG can help improve and standardize the prompts by maintaining a knowledge base of predefined task prompts that are known to work well for the selected LLM. Instead of forcing users or systems to send well-formed prompts, use RAG to select predefined task prompts that are semantically similar to the requested tasks but work better.

For example, for extractive summarization using GPT-4, a prompt like "select N sentences that convey the main ideas of the passage" works better than something like "generate an extractive summary for the passage."
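
As a rough sketch of this idea, the snippet below maps a loosely worded request onto a stored task prompt. The `TASK_PROMPTS` library and its descriptions are hypothetical, and token-set overlap (Jaccard similarity) is used only as a stand-in for the embedding similarity a real system would use.

```python
import re

# Hypothetical library of task prompts assumed to work well for the chosen LLM.
TASK_PROMPTS = {
    "extractive summary of a passage":
        "Select N sentences that convey the main ideas of the passage.",
    "translate text into another language":
        "Translate the text below into the target language, preserving meaning and tone.",
    "extract named entities from text":
        "List every person, organization, and location mentioned in the text.",
}

def tokens(text):
    """Lowercased set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))

def jaccard(a, b):
    """Token-set overlap; a real system would compare embeddings instead."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def select_task_prompt(user_request):
    """Return the stored prompt whose description best matches the request."""
    request_tokens = tokens(user_request)
    best = max(TASK_PROMPTS, key=lambda d: jaccard(request_tokens, tokens(d)))
    return TASK_PROMPTS[best]

print(select_task_prompt("generate an extractive summary for the passage"))
```

The user can phrase the request however they like, and the pipeline still sends the LLM the wording that is known to work.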

Retrieval for System Prompts

System prompts guide the LLM's behavior during a conversation, tone, word choices, or response structure. Some examples are shown below:

Given a prompt, you can use RAG to retrieve relevant system prompts. This is useful for guiding general-purpose LLM systems that can carry out a variety of tasks. However, since system prompts are unlikely to be semantically close to user prompts, you can:

First, classify each user prompt under a prompt category
Then retrieve system prompts that are semantically related to that prompt category
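
The two steps above might look like the following sketch. The categories, keywords, and system prompts are all made up for illustration, a keyword count stands in for a real classifier, and a dictionary lookup stands in for the semantic retrieval step.

```python
# Hypothetical categories, keywords, and system prompts, for illustration only.
CATEGORY_KEYWORDS = {
    "billing": {"invoice", "refund", "charge", "payment"},
    "technical_support": {"error", "crash", "install", "login"},
}

SYSTEM_PROMPTS = {
    "billing": "You are a polite billing assistant. Be precise about amounts and dates.",
    "technical_support": "You are a patient support engineer. Give numbered troubleshooting steps.",
    "general": "You are a helpful assistant. Keep answers short.",
}

def classify(user_prompt):
    """Step 1: assign a category. A keyword count stands in for a real classifier."""
    cleaned = user_prompt.lower().replace("?", " ").replace(".", " ")
    words = set(cleaned.split())
    scores = {cat: len(words & kws) for cat, kws in CATEGORY_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "general"

def system_prompt_for(user_prompt):
    """Step 2: retrieve the system prompt stored for that category."""
    return SYSTEM_PROMPTS[classify(user_prompt)]

print(system_prompt_for("Why was there a second charge on my invoice?"))
```

Classifying first sidesteps the problem that a user message and a system prompt rarely look alike when compared directly.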

You can also use this to scale up to a multi-customer system where each customer has their own “custom” RAG pipeline. Each customer gets a prompting system for their use case, with their custom data already separated out through filtering and indexing in the vector database.

Retrieval of Few-Shot Examples

Another common use of RAG is to retrieve few-shot examples that are relevant to the prompt. Few-shot examples demonstrate to the LLM how it should transform various inputs into desired outputs. This is one of the best use cases of RAG in chatbot systems. Quality chatbot systems store successful conversations as a way to guide the LLM towards other successful conversations. This is probably the best way to ensure things like tone, style, and length are followed in chatbot systems. These conversations can be given a goal state to reach, and our system can store just those conversations that reach that point.

Width.ai chatbot framework using RAG — Our chatbot framework that stores messages for use as few-shot examples.
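
To make the goal-state idea concrete, here is a minimal sketch of a store that keeps only conversations that reached their goal and retrieves the closest ones as few-shot examples. The `Conversation` and `FewShotStore` classes are hypothetical illustrations rather than our production framework, and word overlap stands in for embedding similarity.

```python
import re
from dataclasses import dataclass, field

@dataclass
class Conversation:
    messages: list       # list of (role, text) tuples
    reached_goal: bool   # True if the conversation hit the goal state

@dataclass
class FewShotStore:
    examples: list = field(default_factory=list)

    def add(self, conversation):
        """Keep only conversations that reached the goal state."""
        if conversation.reached_goal:
            self.examples.append(conversation)

    def retrieve(self, query, top_k=1):
        """Rank stored conversations by word overlap with their first user message."""
        q = set(re.findall(r"[a-z]+", query.lower()))

        def overlap(conv):
            first_user = next((t for role, t in conv.messages if role == "user"), "")
            return len(q & set(re.findall(r"[a-z]+", first_user.lower())))

        return sorted(self.examples, key=overlap, reverse=True)[:top_k]

def format_few_shot(conversations):
    """Render retrieved conversations as few-shot text for the prompt."""
    return "\n\n".join(
        "\n".join(f"{role}: {text}" for role, text in conv.messages)
        for conv in conversations
    )

store = FewShotStore()
store.add(Conversation([("user", "I want to reset my password"),
                        ("assistant", "Sure, I sent a reset link to your email.")], True))
store.add(Conversation([("user", "Cancel my order"),
                        ("assistant", "I cannot help with that.")], False))
store.add(Conversation([("user", "Where is my order?"),
                        ("assistant", "It ships tomorrow.")], True))

print(format_few_shot(store.retrieve("how can I reset the password")))
```

The unsuccessful conversation is never stored, so it can never be shown to the LLM as an example to imitate.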

Retrieval From Different Data Sources

How we retrieve relevant data, and how the system decides which sources to access, are the most important parts of a RAG system. Retrieval is usually the most upstream step in our architecture, so all downstream operations are affected by what we do here. It’s critical that we pull context from the proper sources to even have a chance of answering the query!

Chunking for Documents

Chunking breaks up the information in your KB into fragments for vectorization. Ideally, you don't want to lose important information or context while doing so. Chunking is required in most systems because LLMs have context limits. Additionally, Liu et al. showed that the accuracy of LLM-generated answers degrades with longer contexts and when the relevant information sits in the middle of a long context.

Some chunking techniques are:

Naive limit-based chunking: You just split the information based on the token limit, ignoring any kind of syntactic or semantic consistency.
Overlapped chunking: This is slightly better than naive chunking because you keep some context from the previous and next chunks by overlapping some of their starting and ending tokens.
Chunking based on document structure: The chunks here follow the inherent structure of your documents. For example, structural elements like paragraphs, sections, chapters, tables, lists, and code blocks are treated as separate chunks.
Hierarchical chunking using summaries: Another context-preserving technique is similar to overlapped chunking but instead of the original tokens, a summary of the previous chunk is included in the current chunk.
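
Two of the techniques above can be sketched as follows. This is a simplified illustration: whitespace-separated words stand in for model tokens, and blank lines stand in for richer document structure. The function names are hypothetical.

```python
def overlapped_chunks(tokens, size, overlap):
    """Split a token list into windows of `size` tokens that share `overlap` tokens.

    With overlap=0 this reduces to naive limit-based chunking.
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reached the end
    return chunks

def structure_chunks(document):
    """Chunk by document structure: each blank-line-separated paragraph is a chunk."""
    return [p.strip() for p in document.split("\n\n") if p.strip()]

words = "the quick brown fox jumps over the lazy dog today".split()
for chunk in overlapped_chunks(words, size=4, overlap=2):
    print(" ".join(chunk))

print(structure_chunks("Intro paragraph.\n\nSecond paragraph.\n\n\n\nThird."))
```

A larger overlap preserves more context across chunk boundaries but also produces more chunks to embed and store.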


Retrieval Using Actions

Many LLMs now support actions through techniques like OpenAI function calling, ReAct prompting, or Toolformer. Actions are like callbacks issued by an LLM whenever its semantic state matches the criteria set by the callback function's metadata.

You can retrieve relevant information when an action is initiated by the LLM. The benefit here is that the action may carry better information for a similarity search than the original prompt. This is particularly useful for multi-turn dialogue environments like customer service chatbots.
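
The sketch below shows the general shape of action-triggered retrieval. It does not call any real LLM API: the JSON string simulates what a model might emit as a structured action, and that JSON shape, the `lookup_policy` function, and the `KB` dictionary are all assumptions made for illustration.

```python
import json

# Hypothetical KB keyed by topic; a real system would run a vector search here.
KB = {
    "card_freeze": "Debit cards can be frozen from the mobile app under Cards > Freeze.",
    "wire_limits": "Wire transfers above the daily limit need a second approval.",
}

def lookup_policy(topic):
    """Retrieval triggered by the action, using the action's arguments as the query."""
    return KB.get(topic, "No matching policy found.")

# Registry of the actions the LLM is allowed to invoke.
ACTIONS = {"lookup_policy": lookup_policy}

def handle_action(action_json):
    """Dispatch a structured action emitted by the LLM to the matching function."""
    action = json.loads(action_json)
    handler = ACTIONS.get(action["name"])
    if handler is None:
        raise KeyError(f"unknown action: {action['name']}")
    return handler(**action["arguments"])

# Simulated LLM output after a multi-turn conversation (the shape is an assumption).
llm_action = '{"name": "lookup_policy", "arguments": {"topic": "card_freeze"}}'
context = handle_action(llm_action)
print(context)  # this text would be added to the next prompt as context
```

By the time the model emits the action, it has already distilled the conversation into a clean argument such as a topic, which is often a better search key than the user's last message.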

Embeddings Design

In this section, we explore some aspects of embeddings that influence the quality of information retrieval.

Embedding Length, Corpus Variety, and Model Size

These three factors influence the quality of semantic similarity matching.

Generally, longer embeddings are better because they can embed more context from the information they were trained on. So the 1,536-dimensional embeddings produced by the OpenAI API are likely to be qualitatively better than the 768-dimensional embeddings from a BERT model.

The volume and variety of data that a model is trained on also matter. The OpenAI models and the T5 model were trained on far larger and more varied corpora than BERT. Larger models also hold more information in their parameters, so their embeddings tend to be of better quality.

Lastly, the nature of the training data influences the quality of the embeddings. A specialized model like Med-BERT may produce better embeddings for medical tasks than general embeddings even when the embeddings are shorter.

Symmetric and Asymmetric Similarity

When the lengths of the query and the retrieved information are of similar orders of magnitude, the similarity is said to be symmetric. For example, if the information being retrieved is predefined system prompts, you can treat it as a symmetric similarity problem.

On the other hand, if the query is orders of magnitude shorter than the retrieved information, it's an asymmetric similarity problem. For example, question-answering based on documents is likely to be asymmetric because the questions are typically short while the answers from documents are much longer.

Intuitively, we realize that symmetric and asymmetric similarity models must be trained differently. For symmetric similarity, since there is roughly the same volume of information in a query and a document, an embedding can hold roughly equal context from each. But for asymmetric similarity, an embedding should store far more context from the document than the query so that it can match embeddings from other long documents.

SentenceTransformer Models

The SentenceTransformers framework provides a large number of embedding models for both symmetric and asymmetric searching. These models differ in their model architectures, model sizes, relevance quality, performance, training corpora, language capabilities, training approaches, and more.

Llama 2 Embeddings

An open-source LLM like the 70-billion-parameter model of Llama 2 is far more versatile and powerful than the models available with SentenceTransformers. Its embeddings can yield better quality, though you should verify this on your own data.

OpenAI Embeddings

OpenAI's embeddings API is another good choice for generating your embeddings. Since it's a managed and metered API, each request adds network latency, and costs accumulate over time. However, it's a good choice if your knowledge base is small.

Vector Database Selection

When selecting a vector database for your RAG pipeline, keep the following aspects in mind.

In-Memory, Standalone, and Managed Databases

There are multiple ways to deploy a vector database. Some run as components inside your application process and are thus limited by the system's memory. Some can be deployed as standalone distributed processes. Others are managed databases with APIs.

You must select a database that is suited to the scale and quality of service you need for your RAG workflows:

If you're deploying RAG in production and require high scalability for the volume of information or users, choose a managed database service like Pinecone, Weaviate Cloud Service, or Qdrant Cloud.
If you prefer self-managed solutions, go for open-source, production-ready, standalone, distributed, highly scalable databases like Weaviate, Milvus, or Qdrant.
If you're just prototyping RAG or implementing it as a minor component and don't expect a high scale of information or users, a simple in-memory database like Chroma or FAISS will be best.

Similarity Algorithms

All vector databases implement some kind of approximate nearest neighbor algorithm and suitable indexes for similarity search. Select an algorithm and index type according to your use case and the nature of the information. For example:

Milvus supports multiple algorithms and indexes like hierarchical navigable small world (HNSW) and Annoy.
Weaviate supports HNSW and a flat (brute-force) index.
Qdrant only offers HNSW.
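
For intuition, it helps to see the exact search that these approximate indexes try to speed up. The sketch below is a brute-force baseline, not an implementation of HNSW or Annoy: it scores every stored vector against the query, which is fine for a few thousand vectors but too slow for millions. The vectors and document keys are made up.

```python
import heapq
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def exact_top_k(query, vectors, k):
    """Score every stored vector against the query and keep the k best (O(n) per query)."""
    scored = ((cosine(query, vec), key) for key, vec in vectors.items())
    return [key for _, key in heapq.nlargest(k, scored)]

# Made-up three-dimensional embeddings, for illustration only.
vectors = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.9, 0.1, 0.0],
    "doc_c": [0.0, 1.0, 0.0],
}
print(exact_top_k([1.0, 0.05, 0.0], vectors, k=2))
```

An approximate index trades a small chance of missing one of these true nearest neighbors for a much lower search cost per query.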

Ability Trained LLMs

The idea behind ability trained LLMs vs knowledge trained LLMs focuses on what the goal of the LLM is in RAG. We want to use it to understand and contextualize information provided to the model for generation, not pull information from its training for generation. These are two very different use cases, and two very different prompting structures. Here’s a better way to think about it.

All understanding of the task, and of what information is available to perform it, is based on what is provided. This way, the model generates responses based only on this specific information and not on its underlying knowledge, which can lead to hallucinations. This generally requires some level of extraction by the model to understand what is relevant. Although you might not actually perform a separate extraction step, the model has to do this implicitly when the context is large.

This is what it looks like when we rely on the LLM for the knowledge used to answer the query. Everything is focused on the prompt and the knowledge the model is trained on.

This means the key focus of the LLM in RAG is understanding the context provided and how it correlates with the query. For fine-tuning, that means the focus should be on improving the LLM's ability to extract and understand provided context, not on improving its knowledge. This is how we best improve RAG systems: by minimizing the data variance that causes hallucinations or poor responses. Our LLM better understands how to handle context from multiple sources and of multiple sizes, which becomes more common as these systems move to production use cases. As a result, we can spend less time over-optimizing chunking algorithms and preprocessing to fit a specific data variance, because our model is better at understanding the inputs and how they correlate to a goal state output.
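
On the prompting side, an ability-focused prompt might be assembled as in the sketch below. The wording of the instructions and the `build_grounded_prompt` helper are illustrative assumptions, not a fixed template that we or any LLM vendor prescribe.

```python
def build_grounded_prompt(question, context_chunks):
    """Build a prompt that asks the model to rely on the supplied context only."""
    if context_chunks:
        context = "\n".join(
            f"[{i}] {chunk}" for i, chunk in enumerate(context_chunks, start=1)
        )
    else:
        context = "(no context retrieved)"
    return (
        "Use only the numbered context passages to answer.\n"
        "If the passages do not contain the answer, reply exactly: I don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer (cite passage numbers):"
    )

print(build_grounded_prompt(
    "What is the daily transfer limit?",
    ["The daily transfer limit is set per account.", "Limits reset at midnight."],
))
```

Numbering the passages and asking for citations makes it easier to check whether a response actually came from the provided context.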

Frameworks for RAG

You can use these frameworks to simplify your RAG pipeline implementation.

LangChain

LangChain implements all the components you need for RAG:

Vector stores: LangChain supports most of the vector store implementations out there, including Milvus, Weaviate, Qdrant, Pinecone, Chroma, and FAISS.
Data sources and retrieval: It supports retrieval from a variety of data sources and datasets like Google Drive, PubMed, and more.
Document loaders: LangChain's document loader framework can handle a variety of data formats and APIs like S3, PDF, HTML, and more.
Embedding models: LangChain supports all the popular embedding models and APIs like SentenceTransformers, Hugging Face models, and OpenAI.
LLMs: LangChain supports LLM and chat models and APIs like Hugging Face models, OpenAI GPT, and Llama.

LlamaIndex

LlamaIndex is a framework for implementing RAG using your private or domain-specific data. It too supports all the components you need for RAG.

We really like LlamaIndex and the way they’re navigating the RAG space. One of the key things we look for in these frameworks is the ability to add a high level of customization. Production RAG implementations always need customization (regardless of what you’ve heard) and LlamaIndex currently supports the most customization and integration of various systems. They are also moving very fast in the space and constantly pushing out new updates to support how high level companies use RAG.

The Original End-to-End Differentiable Approach

The original RAG approach, proposed in 2020 by Lewis et al., worked differently from the pipeline described so far.

A key difference was that the entire RAG was a single complex model where every step was a differentiable part of the whole, including the vectorization and similarity search steps as shown below.

The differentiable RAG model (Source: Lewis et al.)

Differentiability meant that you could train the entire model on a language dataset and knowledge base. The weight adjustments to minimize the final loss could then be backpropagated to every step of the model. This effectively created custom functions for vectorization, retrieval, and text generation that were highly fine-tuned for relevance.
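
To see why one loss can train both retrieval and generation, consider the following numeric sketch. It follows our reading of the sequence-level formulation in Lewis et al., where the probability of an answer is a sum over retrieved documents of the retrieval probability times the generation probability. The scores and probabilities are made-up numbers, and a real implementation would compute them with neural networks and an autodiff framework.

```python
import math

def softmax(scores):
    """Turn retrieval similarity scores into a probability distribution p(z|x)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def marginal_likelihood(retrieval_scores, answer_probs):
    """p(y|x) = sum over retrieved documents z of p(z|x) * p(y|x, z)."""
    doc_probs = softmax(retrieval_scores)
    return sum(pz * py for pz, py in zip(doc_probs, answer_probs))

# Made-up numbers: similarity scores for three retrieved documents, and the
# generator's probability of the target answer when conditioned on each one.
scores = [2.0, 1.0, 0.0]
answer_probs = [0.9, 0.2, 0.1]

loss = -math.log(marginal_likelihood(scores, answer_probs))
print(round(loss, 4))
```

Because the loss depends on both the retrieval scores and the generator's probabilities, lowering it can adjust the retriever (to score helpful documents higher) and the generator (to use them better) at the same time.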

The benefits of this approach are:

High relevance
End-to-end trainability
Replaceable knowledge base

The cons are:

Limited knowledge base due to computation and memory constraints
Requires fine-tuning

You can still choose this approach if you have a compact knowledge base and a limited set of domain-specific tasks. Its quality of relevance can't be easily matched by the general pipeline approach.

Case Study: Document Summarization

Width.ai document summarization pipeline

We use the above GPT-based RAG pipeline for our document summarization services. Relevant prompts and few-shot examples are retrieved at runtime based on the provided document and summarization goal.

Case Study: Customer Service Chatbot

We also use RAG in our customer service chatbots for banking clients. Based on customer queries, relevant answers and few-shot examples are retrieved from our vector database during action invocations.

Ready to build your own RAG System?

In this article, you studied various design and implementation aspects of RAG. At Width, we have implemented and deployed RAG in production for banking clients and law firms where retrieving the latest information is an absolute necessity. If you want to streamline your workflows using LLMs on your company's private documents and data, contact us!

References

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, Douwe Kiela (2020). "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks." arXiv:2005.11401 [cs.CL]. https://arxiv.org/abs/2005.11401
