Turbocharge Dialogflow Chatbots With LLMs and RAG
Are Dialogflow chatbots still relevant when LLM-based chatbots are available? Yes. We'll show you how you can combine them to get the best of both worlds.
Let’s take a look at what intent classification is in conversational AI and how you can build a GPT-3 intent classification model for conversational AI and chatbot pipelines.
Understanding the intent of a user query is a key part of kicking off downstream operations in a dynamic chatbot. These downstream processes can be an embedded vector search, a code process, or anything else that can't be handled by the model we actually use for chat functionality.
Intent classification simply means using a text classification model trained to recognize the intent behind a user's message and map it to our operations. In non-dynamic chatbots, meaning chatbots whose knowledge can be trained into a model and doesn't change often (marketing best practices, company documents, generic conversation), intent classification has less value because there are fewer downstream processes to trigger.
But for chatbots where users can kick off downstream processes based on what they say or the action they seem to be asking for, being able to classify those actions is mandatory for production-level systems. We need to know whether our base conversational model can handle the user's request, or whether a downstream process should retrieve information for the response. How does intent classification work, and how is it used in a chatbot? Let's look at an example.
An ecommerce chatbot that helps users find products when they aren't totally sure what they're looking for is a great example of a dynamic chatbot that requires intent classification. Our conversational model can respond to general questions about the store, the company, or any other information that doesn't change often (non-dynamic). But for data that changes constantly, such as product data, it wouldn't make sense to train a model on that data, as we'd have to retrain every time the data changes.
Instead, we can train a model to classify the intent of the user: respond directly if our conversational model can handle it, or recognize that they're looking for a product and kick off a downstream process to handle the product search. This works for any number of downstream processes we want to support; it just requires creating more classes.
We're going to leverage GPT-3 as our natural language understanding model for classifying user inputs and supporting our downstream conversational model. In some use cases, the conversational model and the intent classification model are the same model: it's trained to generate a response when it can handle the query itself, and to generate a "tag" that tells a downstream process to start when it can't.
For this example use case, we're going to separate the conversational model and the intent classification model. User queries will be passed into the classification model, any required downstream processes will run, and then our conversational model will take the results and generate a response. We'll use an ecommerce support chatbot that handles a few classes as an example.
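To make that flow concrete, here is a minimal sketch of the routing logic. The model name, intent tags, separator, and the `product_search`, `generate_response`, and `escalate_to_support` helpers are all hypothetical placeholders, and the call uses the legacy OpenAI Completions API, so adapt it to your own setup:

```python
import openai

# All names below are hypothetical placeholders -- swap in your own fine-tuned model and helpers.
INTENT_MODEL = "ada:ft-your-org:intent-classifier"

def classify_intent(user_query: str) -> str:
    """Ask the fine-tuned model for an intent tag such as '<<Product Search>>'."""
    response = openai.Completion.create(
        model=INTENT_MODEL,
        prompt=user_query + "\n\n###\n\n",  # same separator used in the fine-tuning data
        max_tokens=20,
        temperature=0,
        stop=" END",                        # end token used in the fine-tuning completions
    )
    return response["choices"][0]["text"].strip()

def handle_query(user_query: str) -> str:
    """Route the query: downstream process for dynamic data, conversational model otherwise."""
    intent = classify_intent(user_query)
    if intent == "<<Product Search>>":
        # Downstream process, e.g. an embedded vector search over the product catalog.
        results = product_search(user_query)            # hypothetical helper
        return generate_response(user_query, results)   # conversational model answers with results
    if intent == "<<Human Support>>":
        return escalate_to_support(user_query)           # hypothetical helper
    return generate_response(user_query)                 # conversational model handles it directly
```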
Our dataset has a very simple structure, as most text classification datasets do. We're going to use a few classes to cover the different customer intents behind user queries.
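The exact class set depends on your use case. As a purely illustrative example (these queries and labels are made up, not taken from a real dataset), a few annotated rows might look like this:

```python
# Hypothetical (query, intent class) pairs -- your classes and queries will differ.
training_rows = [
    ("do you have these in a size 10",    "<<Product Search>>"),
    ("wheres my order its been 2 weeks",  "<<Order Status>>"),
    ("i want to talk to a real person",   "<<Human Support>>"),
    ("whats your return policy",          "<<General Conversation>>"),
]
```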
From there we want to annotate the dataset with our different intent classes. Our labels are simply what we want the model to generate when it decides a user query belongs to a specific class. This annotation can be done manually or sped up through few-shot learning: we create a GPT-3 prompt with a few examples of how to complete the task, run the rest of the dataset through the model, and then review the outputs. This process is much quicker than annotating the entire dataset by hand and gets quicker each time you add examples to the few-shot prompt.
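Here's a rough sketch of that few-shot annotation step. The prompt wording, the class names, and the `text-davinci-003` model choice are assumptions, and the call uses the legacy Completions API:

```python
import openai

FEW_SHOT_PROMPT = """Classify the intent of each customer message.

Message: do you have these in a size 10
Intent: <<Product Search>>

Message: i want to talk to a real person
Intent: <<Human Support>>

Message: whats your return policy
Intent: <<General Conversation>>

Message: {message}
Intent:"""

def annotate(message: str) -> str:
    """Use a few-shot prompt to propose a label for an unannotated query."""
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=FEW_SHOT_PROMPT.format(message=message),
        max_tokens=10,
        temperature=0,
        stop="\n",
    )
    return response["choices"][0]["text"].strip()

# Proposed labels still get a human review pass before they go into the fine-tuning set.
print(annotate("where is my package"))
```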
There is no hard and fast rule for the number of fine-tuning examples to use for your intent classification model. It depends on the size and pretraining of the LLM, the quality of the data, the complexity of the user queries, and the amount of overfitting you're willing to tolerate during fine-tuning. In general, more examples are better than fewer, as long as the training examples are high quality and representative of the real user queries your system will see for the use case. Keep in mind, though, that adding more data won't necessarily improve the model if the data isn't relevant to real customer interactions. It's also important to have a similar number of examples per class to prevent class imbalance. I'd recommend iterating: collect data, annotate it, fine-tune GPT-3, evaluate model and data accuracy, then decide whether to repeat the loop.
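A quick sanity check on class balance before each fine-tuning round (this assumes your annotated data looks like the hypothetical `training_rows` list above):

```python
from collections import Counter

def class_counts(rows):
    """Count examples per intent class so you can spot class imbalance before fine-tuning."""
    return Counter(label for _, label in rows)

print(class_counts(training_rows))
# e.g. Counter({'<<Product Search>>': 1, '<<Order Status>>': 1, ...})
```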
Make sure you use a stop sequence that shows the model what the end of a completion looks like. It should be a token that doesn't show up anywhere else in your completions. Each completion should also start with whitespace.
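Here's one way to apply those rules when building each training example. The "\n\n###\n\n" separator and " END" stop token shown here are common conventions, not requirements:

```python
def to_fine_tune_example(query: str, intent: str) -> dict:
    """Format one training example: a separator marking the end of the prompt,
    a leading space on the completion, and an explicit end token as the stop sequence."""
    return {
        "prompt": query + "\n\n###\n\n",
        "completion": " " + intent + " END",
    }
```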
Now that our dataset is ready to go, we can fine-tune our model to classify these intents. We'll be walking through this at the command line, with a bit of Python as well. You'll first need to install the openai package and set your API key in an environment variable called OPENAI_API_KEY.
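Assuming the legacy openai Python package (which ships the CLI used below; newer versions of the package may differ):

```bash
pip install openai
export OPENAI_API_KEY="sk-..."  # your API key
```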
The training data must be in JSONL format, with each line holding a prompt (input) and a completion (output). The prompt is the provided user query and the completion is our classification. More advanced systems can include a bit of text alongside the classification tag to tell the user what we're doing on our side, something like "Give me one moment while I look this up <<Human Support>>". OpenAI also has an easy-to-use CLI data preparation tool that takes your input data in different formats and converts it to JSONL.
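A couple of lines in that JSONL format might look like this (hypothetical classes, using the same separator and end token as above):

```json
{"prompt": "do you have these in a size 10\n\n###\n\n", "completion": " <<Product Search>> END"}
{"prompt": "i want to talk to a real person\n\n###\n\n", "completion": " Give me one moment while I look this up <<Human Support>> END"}
```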
This tool accepts CSV, TSV, XLSX, JSON, and JSONL files, and suggests changes to make your data compatible with fine-tuning.
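With the legacy OpenAI CLI, that step was roughly the following (the exact command may have changed in newer package versions):

```bash
openai tools fine_tunes.prepare_data -f intent_dataset.csv
# interactively suggests fixes and writes intent_dataset_prepared.jsonl
```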
Now we can create a fine-tuned model (an example command follows the list below). The model you choose should depend on your cost and runtime constraints for your chatbot. Remember, this is just one piece of the entire pipeline and isn't the only model or service you'll be using. Here are a few things to consider:
1. Davinci is the most accurate model when comparing base forms (just the pretrained models). I recommend it if you don't have a lot of training samples overall, or don't have many examples per class. The equation changes completely once you have a lot of data: fine-tuning Ada is much cheaper and faster and, once fine-tuned, can compete with Davinci for your use case. The usual rules about model size and accuracy go out the window when you're fine-tuning and honing in on a specific use case (I proved this here).
2. The prompts used for fine-tuning are not at the same level of complexity as what's required for prompt-based LLM interactions. You don't want a few-shot learning prompt in your fine-tuning data the way you would when prompting the base models.
3. Classification is a fairly easy task for most LLMs, which means they require less data to get "familiar" with the task you're trying to accomplish. I'd focus on covering data variance and showing the model the full range of inputs you expect to see. That is much more valuable than trying to hammer in accuracy on a narrower set of inputs.
4. Do not rely only on well-formed inputs. In my experience, most chatbot users write short fragments with poor grammar, and statements rather than well-punctuated questions. Keep this level of variance in your samples so your model learns a stronger correlation to real user inputs. Only after fine-tuning the model and seeing low accuracy would I go back and add more complete statements.
5. Remove junk characters. This should be standard in any NLP problem, but it's extra important with our shorter-form inputs.
6. Your training samples should be vetted by humans. Ensure the labels are correct and that your classes make sense; the model is going to treat these as gold-standard data.
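With the data prepared, creating the fine-tune with the legacy OpenAI CLI looked roughly like this (Ada is just one base-model choice, per the trade-offs above; newer API versions use different commands):

```bash
openai api fine_tunes.create -t intent_dataset_prepared.jsonl -m ada
# when the job finishes, use the returned model name (e.g. ada:ft-your-org-...) for inference
```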
Take a look at the OpenAI fine-tuning guide for more information on the fine-tuning process.
With your intent classification model in place, you now have one of the key pieces of a full production GPT-3 chatbot. Intent classifiers are among the most valuable parts of the equation, since they kick off downstream processes and help our conversational model better understand where we are in the conversation. We use these exact models as part of our production-grade GPT-3 chatbots.
The use of customizable and in-domain chatbots continues to grow as more companies look to make it easier for users to get help with their services.
Sales-driven chatbots can account for up to 30% of a store's sales. (Source)
Abandoned-cart chatbots can boost revenue by 25%. (Source: Chatbots Mag)
Intent classification will continue to be the foundation of chatbots, no matter the platform, as long as we keep pushing the boundaries of what we ask these pipelines to do. As we ask chatbots to perform more downstream tasks beyond generic conversation, the accuracy of these systems will be relied on more and more.
Now that you have a custom intent detection model with GPT-3, you may be wondering how to make it more efficient and accurate. A few techniques help here: active learning, transfer learning, and hyperparameter tuning.

Active learning means using real traffic to decide what to label next: collect the queries users actually send, prioritize the ones the model is least confident about (or that users flag as handled incorrectly), annotate them, and fold them into the next fine-tune. This feedback loop steadily improves the model on the inputs that matter.

Transfer learning is another way to improve accuracy. By starting from a model that has been pre-trained on a large dataset, you can fine-tune on your own data and achieve better performance with fewer examples; this is exactly what fine-tuning GPT-3 does.

Finally, hyperparameter tuning optimizes the training process itself. Small changes to the learning rate, batch size, number of epochs, and other parameters can improve the accuracy of your model.

Used together with the right data, these techniques can produce a custom intent classification model that is efficient, accurate, and capable of handling a wide range of user queries.
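As a small sketch of the active-learning idea (the confidence threshold, the use of token log probabilities as a confidence proxy, and the legacy Completions API call are all assumptions to adapt to your setup):

```python
import math
import openai

def classify_with_confidence(query: str, model: str):
    """Return the predicted intent tag and a rough confidence derived from token log probabilities."""
    response = openai.Completion.create(
        model=model,                      # your fine-tuned intent model
        prompt=query + "\n\n###\n\n",
        max_tokens=20,
        temperature=0,
        stop=" END",
        logprobs=1,
    )
    choice = response["choices"][0]
    token_logprobs = choice["logprobs"]["token_logprobs"]
    confidence = math.exp(sum(token_logprobs) / max(len(token_logprobs), 1))
    return choice["text"].strip(), confidence

def needs_review(query: str, model: str, threshold: float = 0.8) -> bool:
    """Flag low-confidence predictions for human annotation and inclusion in the next fine-tune."""
    _, confidence = classify_with_confidence(query, model)
    return confidence < threshold
```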
Width.ai builds custom natural language processing software (like chatbots!) for companies looking to leverage models to automate business processes or expand product capabilities. We’ve built chatbots for sales, ecommerce, and for automating coaching. Let’s set up some time to chat about how a chatbot fits with your business or how intent classification can help your chatbot.