Building a Chatbot: Super-Helpful Personalized Customer Service Using Llama 2

Matt Payne
October 31, 2023

Generic, “chatbot sounding” chatbots risk damaging your customer retention, especially when it's easy to tell you're talking to a bot with limited knowledge and ability. The number one concern customers bring to us when building chatbots is that the bot won't be able to take on the persona of their current customer service team.

However, modern chatbot frameworks powered by large language models (LLMs) and well-crafted prompting provide opportunities to deliver high-quality customer service that interacts just like a customer service agent, because the system is trained just like one.

In this article, we explore building a chatbot that's powered by open-source LLMs to provide highly relevant and even personalized customer service. This chatbot has an exact game plan built into it for how to talk to customers and overcome deflections, pain points, and generic responses from customers. It also has access to real-time customer data and accounts to fully engage customers and work out solutions to customer service problems.

What Customer Service Problems Are We Trying to Solve?

Typical customer service chatbots have weaknesses both in the experience they provide to customers and in the business value they generate.

Some weaknesses in the customer chatbot experience include:

  • Providing very generic answers - These usually come from bucket-based systems that trap the customer in a poor conversation loop that is impossible to get out of.
  • Supporting only a limited set of questions - Again, this usually comes from either a bucket-based chatbot system or limited prompting. Customers fall into a similar loop and quickly click away.
  • Ignoring details provided by a customer in their questions, sometimes even key ones - This can happen when models are not well trained on follow-up questions or full conversation sequences. The model rushes to give the user some level of feedback without fully understanding the problem. This is common in chats, as customers usually start with much higher-level queries. I explained it a bit in this chatbot video that focuses on filtering down to specific products in an ecommerce store.
  • Not using dynamic information or customer-specific details - Simple chatbots have a limited ability to combine the information they have access to with dynamic information provided by the user. User inputs that are missing information, entered incorrectly, or simply too generic can leave the chatbot unable to perform operations it is otherwise set up to do.
  • Supporting a limited number of languages

These weaknesses force many customers to quickly switch to human service agents, or simply exit the service.

But if chatbots can provide far more helpful and personalized customer service, both customer satisfaction and business value can go up. Chatbots powered by large language models (LLMs) hold some promise in their ability to achieve this.

We explore such an LLM-powered chatbot system architecture and its step-by-step implementation in the sections below.

Customer Service Chatbot System Architecture

The overall system architecture of an LLM-powered customer service chatbot is shown below:

Width.ai Chatbot Architecture

We'll explore each component in later sections. But at a high level, the chatbot application programming interface (API) service consists of a chatbot pipeline that generates replies to customer queries. To do so, it uses knowledge bases and external data sources to expand the information it can draw on.

You can provide this chatbot service to customers through multiple channels, including text, voice, and text-to-speech.

Question-Answering Chatbot Pipeline

The LLM-powered chatbot pipeline is the heart of this system. Its question-answering (QnA) architecture is shown below:

Response generation with chatbot

The three LLM QnA components correspond to the three sources of information that are helpful to customers:

  • General frequently asked questions (FAQs)
  • More detailed knowledge documents containing details about products and services
  • Dynamic information that either changes frequently or is customer-specific

Intent recognition routes each customer query to the most suitable chatbot pipeline. In the next section, we give some details about the LLM we're using for this implementation.

Selecting an Open-Source LLM

As of August 2023, Meta's Llama 2 has emerged as one of the most versatile and popular LLMs among the open-source ones with ChatGPT-level quality.

It's also one of the few open-source models that have been fine-tuned using both common supervised fine-tuning and the less common (and more effort-intensive) reinforcement learning from human feedback (RLHF). As a result, the quality and relevance of its chat responses are good right out of the box.

Llama 2 training workflow (Source: Meta AI)

Plus, unlike many other open-source LLMs, its permissive license attracts businesses to use it for commercial purposes.

An example chat with a pretrained Llama 2 model is shown below:

Example response from stock Llama 2 chat model (Source: llama2.ai)

We'll start by further fine-tuning this LLM on base knowledge and the ability to extract information from the provided context.

Fine-Tune the LLM on Your FAQs

FAQs, like the banking examples below, can clarify many common doubts in the minds of your customers.

Customer FAQs from a popular bank

But a list of FAQs on your website can make for a poor customer experience for many reasons:

  • Lack of time or patience: Your customers may not have the time or patience to scan through your FAQs looking for something close to their information needs. As your list of FAQs grows over time, this problem becomes worse.
  • Doesn't satisfy complex information needs: Your FAQs may independently address various customer situations well but not their combinations. For example, a loan FAQ may have one answer for customers whose pay is below a threshold and another for customers above a certain age threshold. These answers may even conflict with each other in some aspects, confusing any customer who satisfies both conditions.
  • Difficult to search: Most websites don't provide semantic, or even keyword, search over their FAQs to let a customer specify their information need. Customers have to make do with looking for exact text matches using their browser text search. In businesses that use a lot of jargon, the chances of finding useful information through exact text matches are rather low.

Why Fine-Tune LLMs?

LLMs are a great solution to these problems. Since they can semantically understand complex information needs and combine multiple answers on the fly, you can alleviate all three problems by training an LLM-powered chatbot on your FAQs.

Specifically, you can bake in the FAQ knowledge into the LLM's internals using fine-tuning techniques. There are many approaches and third-party services to help you fine-tune an LLM.

Fine-Tuning Using MosaicML

In the steps below, we demonstrate fine-tuning your LLM chatbot using a service called MosaicML.

1. Training Data

Transfer your FAQ training data to AWS S3, Azure Blob Storage, or another object store supported by MosaicML's Streaming framework.

2. Create a Script to Modify Your Data

MosaicML expects the training data to be set up using its Streaming framework. Provide the S3 bucket URL to the StreamingDataset object.

Importing the dataloader (Source: MosaicML)
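If your FAQs live as simple question-answer pairs, a small script can reshape them into prompt/response records before they're handed to the Streaming framework. The sketch below uses only the standard library; the "prompt"/"response" field names and the [COMPANY] placeholder are assumptions you should match to your own training configuration:

```python
import json

SYSTEM_PROMPT = "You are a helpful chatbot for customers of [COMPANY]."

def faq_to_sample(question: str, answer: str) -> dict:
    """Turn one FAQ pair into a prompt/response training record.

    The "prompt"/"response" field names are assumptions; match them to the
    columns your MosaicML training config expects.
    """
    return {
        "prompt": f"{SYSTEM_PROMPT}\n\nCustomer: {question}",
        "response": answer,
    }

def write_jsonl(faq_pairs, path):
    # One JSON record per line. MosaicML's `streaming` package provides an
    # MDSWriter that can convert such records into its sharded format for S3.
    with open(path, "w") as f:
        for q, a in faq_pairs:
            f.write(json.dumps(faq_to_sample(q, a)) + "\n")
```

From there, point the StreamingDataset at the bucket where the converted shards are uploaded.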

3. Create a Docker Image for Running Llama 2

Since MosaicML orchestrates the training using Docker, create a container to host the Llama 2 model and run it using the Hugging Face transformers framework.

4. Create a MosaicML Configuration for Llama 2

Create a configuration file similar to the one shown below, changing the details to match your environment and Docker images.

MosaicML configuration file (Source: MosaicML)

Infrastructure provisioning is as simple as two or three lines in that file. How many GPUs or TPUs do you need? And what type? MosaicML handles the rest behind the scenes.

5. Start the Fine-Tuning Run

Use the MosaicML command-line utility to start the fine-tuning run:

command to start the fine-tuning run
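The command itself is a one-liner. Assuming the configuration file from step 4 is named `llama2-faq-finetune.yaml` (the filename is illustrative; check `mcli run --help` for your CLI version):

```shell
mcli run -f llama2-faq-finetune.yaml
```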

6. Monitor the Fine-Tuning

Use the same utility to monitor your runs:
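For example, assuming the current `mcli` syntax, something like:

```shell
# List runs and their statuses, then stream logs from a specific run.
mcli get runs
mcli logs <run-name>
```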

7. Access the Trained Model

Once the run ends, MosaicML uploads the trained model to the Hugging Face Hub or another repository.

Results From the Fine-Tuned Chatbot

How much difference does fine-tuning make? Below is the reply to a banking-related query from a stock Llama 2 seven-billion-parameter chatbot:

Banking-related reply from stock Llama 2 chatbot

We fine-tune the model on a bank's small FAQ (about 250 questions) using this strategy:

  • Model selection: For FAQs to wield more influence over a large model's predictions, you need a large dataset. If you have a small dataset that you can't realistically expand through data augmentation, it's better to fine-tune a small model instead. Since we have a small dataset with just 250 questions here, we fine-tuned the Llama 2 seven-billion-parameter chat model.
  • System prompt design: We used the system prompt to ensure that our FAQs are prioritized over general information. We did this by including the bank's name as follows: "You are a helpful chatbot for customers of [COMPANY]." How does that help? That's explained in the next point. We suggest you include your business name similarly in the system prompt.
  • Dataset relevance tuning: The system prompt is just one half of this strategy. The other half lies in modifying your FAQ. For your business-specific name and details to influence the predictions, you should include your business name in the FAQ. We did this by prefixing every question and answer in the FAQ with "In [COMPANY],..."
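The prompt-and-prefix strategy above can be applied programmatically before fine-tuning. This is a rough sketch; the company name and exact prompt wording are illustrative:

```python
def build_system_prompt(company: str) -> str:
    """System prompt that anchors answers to the business."""
    return (
        f"You are a helpful chatbot for customers of {company}. "
        "Combine information from multiple FAQs step by step when a question needs it."
    )

def prefix_faq(company: str, faq_pairs):
    """Prefix every question and answer with 'In [COMPANY], ...' so the
    business name ties the fine-tuning data to the system prompt."""
    return [
        (
            f"In {company}, {q[0].lower() + q[1:]}",
            f"In {company}, {a[0].lower() + a[1:]}",
        )
        for q, a in faq_pairs
    ]
```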

With such careful fine-tuning, the reply for the same question is shown below:

Fine-tuned chatbot with carefully engineered dataset and system prompt

The chatbot reply is now more specific and contains relevant details from the FAQ. Plus, the chain-of-thought instruction in the system prompt tells the LLM to combine information from multiple FAQs when answering complex questions.

Retrieval-Augmented Generation Using Knowledge Documents

FAQs are just one type of business knowledge. Your business probably has many other types of knowledge documents that straddle the line between static and dynamic information.

For example, loan or insurance products are accompanied by policy and offer documents that contain too many details for the average FAQ page, such as this sample insurance document:

Sample insurance document's section

Information from such knowledge documents is often very relevant for answering customer queries meaningfully. In the following sections, we explain how to incorporate their information into LLM chatbots using a technique called retrieval-augmented generation (RAG).

Knowledge Database and Retrieval

The key principle behind RAG is to find fragments from your knowledge documents that are semantically related to a customer's query. These fragments contain the relevant information that the LLM can use as context to meaningfully answer the customer's question.

To implement RAG, you need these three components:

offline data upload for RAG
  1. Vector database (DB)
  2. Embedding service
  3. Knowledge management

Let's understand each of these components a bit more.

Vector Database for Document Embeddings

The vector DB stores and looks up millions of embedding vectors generated from your knowledge documents. A vector may represent an entire document or each chapter, page, paragraph, or even sentence.

The vector database can be a self-hosted database like Qdrant or Weaviate or a managed third-party service with an API, like Pinecone.

Embedding Service

The embedding service has two functions:

  1. Document embeddings: Calculate embeddings for every document or document fragment and store them in the vector database.
  2. Similarity search: When a customer types a query, this service calculates an embedding for the query and looks up semantically relevant documents in the vector DB.

We use the SentenceTransformers framework for calculating and matching embeddings, specifically its asymmetric semantic search models. That's because user queries are typically much shorter than document fragments, requiring asymmetric rather than symmetric matching.
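Under the hood, the similarity search is just a nearest-neighbor lookup over the stored vectors. Here is a minimal sketch of that step, assuming the query and document fragments have already been embedded (for example, with one of SentenceTransformers' msmarco asymmetric search models); in production the vector DB performs this ranking for you:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def top_k(query_vec, doc_vecs, k=3):
    """Return the ids of the k document fragments most similar to the query."""
    scored = sorted(doc_vecs.items(), key=lambda kv: cosine(query_vec, kv[1]), reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]
```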

Knowledge Management

Finally, you need a component that allows your business operations teams to modify (add, update, or delete) your knowledge documents. When a document changes, this component modifies its related embeddings in the vector DB using the embedding service.

Personalization & Real-Time Ability Using Dynamic Information

Documents are considered non-dynamic data: they don't change frequently, and they don't capture real-time state or frequently changing information.

We also bring in dynamic data, such as inventory levels, external API services, and constantly changing prices: key information we want to answer questions about and use in conversation, but which changes too frequently to store in a vector database like documents. If we tried to vector-index inventory data, we would have to update the database and reindex every single time the data changed. It makes more sense to store data like product specifications and product tags in the vector DB and then call an API for inventory and pricing information.

This also lets us add a level of personalization to the chatbot's abilities by pulling in customer-specific data from their account. That data can guide our responses and shape follow-up questions that move the conversation closer to its end goal. This ability to skip some questions and steps improves retention.

Business-Specific Information Elicitation

A good chatbot should aim to identify every customer's exact information needs and guide them toward it one step at a time by asking for specific details. This is like chain-of-thought prompting but with more awareness about your industry and specific business.

For example, a mortgage business' chatbot must elicit key details like the loan amount and term that a customer is looking for. With those details, it must fetch dynamic information like the customer's credit score and suitable interest rate, calculate the monthly payment, and convey all these details as shown below.

information management

Or take the example of an e-commerce chatbot. It must guide a customer into revealing the kind of product and product characteristics they have in mind to find a close match. The choices at each step are dynamic because they're based on the customer's previous choices as shown below.

For such elicitation of a customer's exact need, your chatbot must issue the relevant commands and fields expected by the API services that manage your dynamic information. Some popular approaches to implement this are:

  • ReAct prompting
  • Toolformer architecture

We have demonstrated ReAct prompting’s ability in chatbots elsewhere. In this article, we demonstrate the Toolformer approach.

Call External APIs With Toolformer

Toolformer is a technique that equips LLMs with the ability to invoke external tools by issuing API-like action commands and including their results in the generated text. I love Toolformer for chatbots because placing the dynamic text inline with model-generated text makes it very easy to blend into the response. Many approaches require you to pull the dynamic knowledge into the prompt and then generate everything in one go. Toolformer instead lets the model focus on the conversational text and simply emit short tags when it needs dynamic data. These examples illustrate the idea, with the tool actions and results highlighted:

Toolformer tool invocations embedded in the LLM's generated text (Source: Schick et al.)

In the example above:

  • "QA" is a question-answering tool.
  • "Calculator" is an arithmetic expression calculator.
  • "MT" is a machine translator of words into English.
  • "WikiSearch" returns results from Wikipedia.
  • Each tool's inputs must be expressed as text sequences. The text in parentheses after each tool name is its input.
  • Each tool must output a text sequence. The text that follows "→" is generated by those tools and seamlessly included by the LLM in its generated text stream.

A critical detail to understand about the Toolformer approach is that the LLM itself probabilistically decides when to insert a tool action, which action to insert, and what inputs to give it. The application that receives the LLM's output (a chatbot in this article) is responsible for detecting the tool actions in the text and accordingly executing them.
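Here is a minimal sketch of that application-side step: detecting `[Tool(input)]` actions in the generated text and splicing in their results, mirroring the Toolformer output format. Only a toy Calculator tool is wired up, and the restricted `eval` is a stand-in for a real expression parser:

```python
import re

def run_calculator(expr: str) -> str:
    # Restricted eval for simple arithmetic; a production tool
    # would use a proper expression parser instead.
    value = eval(expr, {"__builtins__": {}}, {})
    return f"{value:.2f}" if isinstance(value, float) else str(value)

TOOLS = {"Calculator": run_calculator}

ACTION_RE = re.compile(r"\[(\w+)\(([^)]*)\)\]")

def execute_tool_actions(text: str) -> str:
    """Replace each [Tool(input)] action with [Tool(input) → result]."""
    def _run(match):
        tool, arg = match.group(1), match.group(2)
        result = TOOLS[tool](arg)
        return f"[{tool}({arg}) → {result}]"
    return ACTION_RE.sub(_run, text)
```

A real chatbot would register one entry in `TOOLS` per external API and may strip the bracketed action from the final reply, keeping only the result.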

Toolformer Training and Fine-Tuning

Equipping an LLM with Toolformer involves both in-context learning and fine-tuning as shown below and explained in the next sections.

Toolformer training workflow (Source: Schick et al.)

1. In-Context Learning

The first stage of Toolformer training involves in-context learning (i.e., generating prompts containing tool action examples). For each tool required by your chatbot, a suitable prompt is provided along with a few in-context input-output pairs as examples.

A typical tool action prompt with two in-context examples is shown below:

Toolformer training step 1 — in-context learning (Source: Schick et al.)

Each input contains a factual statement, and each output contains the same statement annotated with tool actions (and their inputs) that are likely to produce the stated facts.

For a mortgage business, the examples might resemble these:
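These are illustrative only; the `MortgageCalculator` and `CreditScore` tool names are hypothetical stand-ins for your actual APIs:

```
Input: A $300,000 loan over 30 years costs about $1,996 per month at 7% interest.
Output: A $300,000 loan over 30 years costs about
[MortgageCalculator(300000, 30, 0.07) → 1996] $1,996 per month at 7% interest.

Input: Your interest rate depends on your credit score.
Output: Your interest rate depends on your credit score, which is
[CreditScore(customer_id) → 728] 728.
```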

2. Annotated Dataset Generation

The in-context learning prompts above are combined with several samples from a dataset like CCNet. Samples are selected based on some heuristic relevance to an action and then randomly annotated with tool actions at different positions.

3. Tool Executions

For every generated sample above, the training script executes the tool actions it finds in the samples. Running a tool may involve some custom logic (like a loan calculation) or calling an external API (like getting a product price from an inventory API).

The result is a dataset where tool actions, along with the tokens they produced, are inserted at different random positions.

4. Loss-Based Filtering

Next, the training script compares each sample's loss on the tokens that follow the tool action, with and without the tool execution, to identify whether executing the tool reduced that loss. Any sample where the tool action didn't reduce the loss is eliminated.
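The filtering rule can be sketched like this, assuming the per-sample losses over the subsequent tokens have already been computed (the field names are illustrative):

```python
def filter_by_loss(samples, min_gain=0.0):
    """Keep samples where inserting the tool call reduced the loss on the
    tokens that follow it by more than `min_gain`."""
    kept = []
    for sample in samples:
        if sample["loss_without_tool"] - sample["loss_with_tool"] > min_gain:
            kept.append(sample)
    return kept
```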

So you're left with a dataset where tool actions produced expected and semantically correct results.

5. Fine-Tuning

Finally, this remaining minimum-loss dataset is used to fine-tune the LLM to equip it with Toolformer capabilities. The fine-tuning follows the same MosaicML workflow as the FAQ fine-tuning above.

Intent Extraction

The first component to process each customer query is the intent extraction component. Before the query is processed by the LLM-based question-answering pipeline, this component attempts to classify the nature of the query as:

  • General query (suitable for the FAQ-aware LLM)
  • Special query requiring knowledge document retrieval
  • Dynamic query

After determining the nature of the query, it's dispatched to the appropriate pipeline. For the actual classification, this component just uses another LLM instance that's fine-tuned for query classification tasks.
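A minimal sketch of this routing step is shown below. In production, `classify` would wrap the fine-tuned classification LLM; the keyword classifier here is just a runnable stand-in, and all of the names are assumptions:

```python
PIPELINES = ("faq", "knowledge_retrieval", "dynamic")

def route_query(query: str, classify) -> str:
    """Send the query to the pipeline chosen by the classifier,
    falling back to the FAQ pipeline on an unknown label."""
    label = classify(query)
    return label if label in PIPELINES else "faq"

def keyword_classify(query: str) -> str:
    # Stand-in for the fine-tuned classification LLM.
    q = query.lower()
    if any(w in q for w in ("price", "order", "balance", "inventory")):
        return "dynamic"
    if any(w in q for w in ("policy", "terms", "document")):
        return "knowledge_retrieval"
    return "faq"
```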


Databases

This section provides details on the databases that support your chatbot's operations.

Multi-Tenant Database

If you're a mid-sized business, you probably want not just one but multiple customer chatbots to cover each of your major products or business lines. For example, if you're a bank, you may want a chatbot for retail banking and another for your mortgage business. They can share the same front-end user interface but route the queries to the relevant business line's chatbot in the backend.

Similarly, if you're a software-as-a-service business providing an online employee payroll and benefits management service, you'd want to provide a separate chatbot to each client company because their FAQs, knowledge documents, and dynamic information will all be different.

For such multiple chatbot arrangements, you need a multi-tenant relational database where each chatbot's knowledge and interactions are separated by grouping them under a client or business line. Such a database schema is shown below:

This database groups each chatbot's target users, documents, and interactions using a group ID. When a client's website calls the chatbot API, it routes the interaction to the chatbot instance created for that client.
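As an illustration, a stripped-down version of such a schema can be expressed in SQLite. Every table and column name here is an assumption, with `group_id` playing the role of the tenant key described above:

```python
import sqlite3

# Hypothetical multi-tenant schema: every chatbot-owned row carries a group_id.
SCHEMA = """
CREATE TABLE chatbot_groups (
    group_id INTEGER PRIMARY KEY,
    client_name TEXT NOT NULL
);
CREATE TABLE users (
    user_id INTEGER PRIMARY KEY,
    group_id INTEGER NOT NULL REFERENCES chatbot_groups(group_id),
    name TEXT
);
CREATE TABLE documents (
    doc_id INTEGER PRIMARY KEY,
    group_id INTEGER NOT NULL REFERENCES chatbot_groups(group_id),
    title TEXT
);
CREATE TABLE interactions (
    interaction_id INTEGER PRIMARY KEY,
    group_id INTEGER NOT NULL REFERENCES chatbot_groups(group_id),
    user_id INTEGER REFERENCES users(user_id),
    message TEXT
);
"""

def init_db(conn):
    """Create the multi-tenant tables on a fresh connection."""
    conn.executescript(SCHEMA)
```

The chatbot API resolves the caller's client to a `group_id` and scopes every query by it.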

Dynamic Data

Your dynamic data consists of all your existing databases that provide dynamic information like:

  • Current inventory
  • Product details
  • Real-time product availability
  • Current prices
  • Frequently changing information like share or index prices
  • Customer-specific information like their orders and subscriptions

These are accessed and modified through proprietary APIs in response to actions generated by the LLM's Toolformer (or ReAct prompting or equivalent) implementation.

Deployment and Production

In this section, we explain some critical deployment aspects you must take into account when going to production.

How to Deploy Your MosaicML Fine-Tuned LLM

You can deploy the LLM you fine-tuned using MosaicML with these steps.

1. Create a Deployment Configuration

The configuration file tells MosaicML:

  • The model to deploy
  • The infrastructure it requires

An example deployment configuration is shown below (for another model; change it to match Llama 2):
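As a rough illustration of the shape such a file takes (every field name below is an assumption; consult MosaicML's current documentation for the exact schema):

```yaml
# Hypothetical deployment config — verify field names against MosaicML's docs.
name: llama2-7b-faq-chatbot
compute:
  gpus: 1
  gpu_type: a100_40gb
model:
  checkpoint_path: s3://your-bucket/checkpoints/llama2-7b-faq
```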

2. Deploy the Model

Use the mcli utility to deploy the model. MosaicML automatically provisions all the resources it needs for inference:
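Assuming a deployment configuration named `llama2-deploy.yaml`, the command looks something like this (verify against `mcli deploy --help`):

```shell
mcli deploy -f llama2-deploy.yaml
```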

The model is deployed and published at an API endpoint.

3. List Your Active Deployments

MosaicML gives a unique name to each deployed model. You need to know it to send requests to your fine-tuned model. Run this command to list all active deployments:
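Assuming the current `mcli` syntax, that command is:

```shell
mcli get deployments
```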

The output includes the unique name assigned to each deployment.

4. Test Inference

Try submitting prompts to your deployed model using code like this:
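Here is a minimal sketch using only the standard library. The endpoint URL, payload field names, and auth header are assumptions; match them to MosaicML's inference API documentation:

```python
import json
import urllib.request

def build_payload(prompt: str, max_new_tokens: int = 256) -> dict:
    # Field names here are assumptions; match them to your deployment's API.
    return {"inputs": [prompt], "parameters": {"max_new_tokens": max_new_tokens}}

def query_model(endpoint_url: str, api_key: str, prompt: str) -> dict:
    """POST a prompt to the deployed model's endpoint and return the JSON reply."""
    req = urllib.request.Request(
        endpoint_url,
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Authorization": api_key, "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```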

Docker Containers

The LLMs, databases, chatbot API, and other self-hosted services run as a mesh of microservices in Docker containers.

Authentication and Authorization

For customer-specific dynamic information, the chatbot system must authenticate your customer. We implement this using single sign-on so that chatbot services are available seamlessly to a customer who's already signed in.

Additionally, to prevent unknown applications from accessing your chatbot APIs, authorization checks are enabled for each microservice and API.

Cloud Services

The overall system architecture depends on the following cloud services:

  • Amazon Elastic Compute Cloud (EC2): All the compute infrastructure required for the API services and LLMs is provisioned on Amazon EC2. The chatbot API and embedding service typically require 4 vCPUs and 8 GB RAM. The vector DB alone requires a separate server with a similar configuration.
  • Amazon Relational Database Service (RDS): The relational database containing tenant groups, users, chatbot interactions, and knowledge document details is deployed on Amazon's RDS.
  • MosaicML: The Llama 2 chatbot and other models are trained and published for inference on MosaicML's infrastructure.

Provide High-Quality User Experiences Using Modern Language Technologies

In this article, you saw how to implement an LLM-powered chatbot. A major problem with customer service chatbots has been their inability to understand natural language as well as their human counterparts. With the help of LLMs and their powerful natural language processing capabilities, chatbots can overcome these deficiencies and provide far better customer experiences.

Contact us to find out how we can help you equip your business with high-quality customer service chatbots.


References

  • Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, Thomas Scialom (2023). "Toolformer: Language Models Can Teach Themselves to Use Tools." arXiv:2302.04761 [cs.CL]. https://arxiv.org/abs/2302.04761