For businesses, open-source large language models (LLMs) are attractive alternatives to managed service LLMs like ChatGPT because they're more customizable, more amenable to different customer deployment scenarios, and provide better data security and confidentiality.
In this guide, we explore the Vicuna LLM and its features, performance, and capabilities.
When OpenAI's ChatGPT became famous in early 2023 for its versatile instruction-following capabilities, many other companies in the AI space raced to create similar instruction-following LLMs.
Meta was one of them with its Large Language Model Meta AI (LLaMA), an LLM meant for non-commercial, limited-access research use. But, somehow, its weights, the key secret sauce that powers an LLM, were leaked. Since these were among the few LLM weights from a reputable company available at the time, an entire ecosystem quickly emerged around them. Some of its popular derivative projects include Stanford's Alpaca instruction-following model, Berkeley's Koala chatbot, and the llama.cpp inference runtime.
In this ecosystem, the Large Model Systems Organization (LMSYS) aims to release open-source, publicly accessible LLMs. Its Vicuna models are one such set of ChatGPT-style, instruction-following chatbots created by fine-tuning LLaMA, with a context length of 2,048 tokens.
What makes Vicuna special? Through automated evaluation of chat responses using OpenAI's GPT-4, LMSYS claims that Vicuna achieves more than 90% of the quality of ChatGPT and Google Bard while outperforming LLaMA and Alpaca in more than 90% of cases.
Unfortunately for businesses, Vicuna is only meant for non-commercial research use. But LMSYS also publishes software called FastChat that enables businesses to create customized, Vicuna-like LLMs that are open and commercially usable.
FastChat is a framework for training and evaluating instruction-following LLMs and deploying them in production as software services with standard application programming interfaces (APIs). We explore each of these aspects in depth in later sections.
Vicuna was trained and evaluated using FastChat. With FastChat, you can create and deploy your own customized, domain-specific, LLM-based chatbots to help your customers and employees.
In addition to Vicuna, LMSYS releases other models that are also trained and deployed using FastChat, including the long-context LongChat models and FastChat-T5, an encoder-decoder chatbot based on Flan-T5.
In the next section, we try out Vicuna on some common text-processing use cases.
The use cases below cover some common natural language processing tasks that are useful to businesses. We explore how the Vicuna-13b (13 billion parameters) model fares on them using qualitative assessments.
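You can reproduce this kind of qualitative testing yourself with FastChat's command-line chat interface. A minimal invocation is sketched below, assuming the published Vicuna-13B v1.3 weights on Hugging Face:

```bash
# Install FastChat with its model-serving dependencies
pip3 install "fschat[model_worker]"

# Chat with Vicuna-13B interactively in the terminal
python3 -m fastchat.serve.cli --model-path lmsys/vicuna-13b-v1.3
```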
For the first text generation experiment, we ask Vicuna-13b to explain the attention mechanism of transformer neural networks. Since LLaMA's trained on arXiv and Wikipedia datasets, it should be able to give a satisfactory answer.
The explanation is correct and quite thorough.
Let's see how well it does with more creative text generation.
The LLM has included the elements we asked for and generated a complete and coherent story.
We ask Vicuna to generate an extractive summary of this TED talk on the philosophy and foundations of modern science. Due to context length limits, we restrict the input to the first 1,024 characters of the lecture. Here is the extracted summary:
Vicuna has followed the instructions and strictly extracted key sentences from the given text without any prompt engineering. This is quite impressive: we often find that without very strict and specific instructions, even GPT-4 has trouble sticking to the task and tends to produce abstractive summaries or sentence fragments. We've had to do a ton of prompt engineering in use cases like GPT-4 dialogue summarization to ensure the model extracts only full sentences.
In many situations, an extractive summary might be too restrictive because the extracted sentences lack surrounding context. A pulled sentence like the above "They usually talk about the 17th century" is only valuable if you already understand the full dialogue. This is where an abstractive summary can be useful: abstractive summaries are paraphrased, flow smoothly, and provide a more contextually rich version of all the key points in the text. For the same TED talk, here's Vicuna's abstractive summary:
The Vicuna models are just LLaMA under the hood. They have autoregressive, decoder-only (no encoder) network architectures similar to other transformer-based LLMs like GPT-4.
They consist of several LLaMA decoder layers. Each decoder layer contains a multi-head self-attention block and a multi-layer perceptron (MLP).
Each multi-head self-attention block generates hidden states that embed information on how much attention each token should pay to every other token along different linguistic aspects.
Each MLP network in a layer applies nonlinear activations on the respective attention block's hidden states.
The input to the models is a sequence of up to 2,048 token embeddings. The output is a prediction of the next token; generation proceeds autoregressively, with each new token conditioned on all the tokens that came before it.
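For intuition, here's a minimal, illustrative PyTorch sketch of one decoder layer. It's a simplification of the real architecture: LLaMA layers use RMSNorm, rotary position embeddings, and a gated SwiGLU MLP, which we approximate here with stock PyTorch modules.

```python
import torch
import torch.nn as nn

class DecoderLayerSketch(nn.Module):
    """Simplified LLaMA-style decoder layer: a masked self-attention block and
    an MLP block, each with pre-normalization and a residual connection."""

    def __init__(self, d_model: int = 5120, n_heads: int = 40):  # LLaMA-13B sizes
        super().__init__()
        self.attn_norm = nn.LayerNorm(d_model)  # real LLaMA uses RMSNorm
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp_norm = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(               # real LLaMA uses a gated SwiGLU MLP
            nn.Linear(d_model, 4 * d_model),
            nn.SiLU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Causal mask: token i may only attend to tokens 0..i
        seq_len = x.size(1)
        causal_mask = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device),
            diagonal=1,
        )
        h = self.attn_norm(x)
        attn_out, _ = self.self_attn(h, h, h, attn_mask=causal_mask)
        x = x + attn_out                      # residual around the attention block
        x = x + self.mlp(self.mlp_norm(x))    # residual around the MLP block
        return x
```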
The Vicuna and LongChat models are all based on the LLaMA decoder-only transformer architecture. Vicuna comes in 7-billion, 13-billion, and 33-billion-parameter sizes with a 2,048-token context length, while the LongChat variants (7B and 13B) extend the context length to 16K tokens.
The Vicuna models are created by fine-tuning LLaMA models on crowdsourced ChatGPT conversations sourced from ShareGPT.
Since all the responses on ShareGPT are from ChatGPT, such fine-tuning is effectively a type of knowledge distillation where the smaller model (Vicuna) learns to generate responses that are close to those from the larger model (ChatGPT).
Using this dataset, the LLaMA models undergo supervised fine-tuning where the objective is to reduce the differences between their responses and the ChatGPT-generated responses.
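Concretely, this objective is plain next-token cross-entropy, and the Vicuna recipe computes the loss only on the chatbot's outputs so the model doesn't learn to imitate the user's turns. Here's a minimal sketch with illustrative tensor names:

```python
import torch
import torch.nn.functional as F

def sft_loss(logits: torch.Tensor, token_ids: torch.Tensor,
             assistant_mask: torch.Tensor) -> torch.Tensor:
    """Next-token cross-entropy computed over assistant-response tokens only.

    logits:          (batch, seq_len, vocab) model outputs
    token_ids:       (batch, seq_len) the tokenized conversation
    assistant_mask:  (batch, seq_len) 1 where a token belongs to an assistant turn
    """
    # Shift so that position t predicts token t+1
    shifted_logits = logits[:, :-1, :]
    targets = token_ids[:, 1:]
    mask = assistant_mask[:, 1:].float()

    per_token = F.cross_entropy(
        shifted_logits.reshape(-1, shifted_logits.size(-1)),
        targets.reshape(-1),
        reduction="none",
    )
    # Average the loss over assistant tokens only
    return (per_token * mask.reshape(-1)).sum() / mask.sum()
```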
An example conversation from ShareGPT is shown below:
Sample ChatGPT conversation for fine-tuning Vicuna (Source: ShareGPT)
LLaMA itself is trained on diverse corpora such as Common Crawl, Wikipedia, and arXiv, which include both English and non-English data.
Currently, there are no Vicuna or other FastChat models trained using the state-of-the-art approach of reinforcement learning from human feedback (RLHF).
But you can further fine-tune these LLMs on consumer-grade hardware using the highly optimized quantized low-rank adapters (QLoRA) technique, which enables you to fine-tune even a 65-billion-parameter model on a single 48GB GPU. We've started using LoRA in all of our open-source LLM training.
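Here's a minimal sketch of a QLoRA setup using the Hugging Face transformers, peft, and bitsandbytes libraries. The checkpoint ID and adapter hyperparameters are illustrative, not a tuned recipe:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Quantize the frozen base model to 4-bit NF4 (the "Q" in QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "lmsys/vicuna-13b-v1.3",   # example checkpoint; substitute your own
    quantization_config=bnb_config,
    device_map="auto",
)

# Attach small trainable low-rank adapters to the attention projections
lora_config = LoraConfig(
    r=16,                                  # adapter rank (illustrative)
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],   # LLaMA attention projection layers
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a tiny fraction of weights are trainable
```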
FastChat comes with a built-in response evaluation web application called MT Bench. It queries LLMs using pre-defined prompts and asks GPT-4 to judge which LLM's response is best and why. It can compare either two custom LLMs with each other or a custom LLM against baseline GPT-3.5 or GPT-4 responses. GPT-4 is awesome at evaluating and ranking outputs.
The pre-defined prompts are grouped into eight categories covering a variety of queries: writing, roleplay, extraction, reasoning, math, coding, STEM knowledge, and humanities and social science knowledge.
The two LLMs being evaluated generate responses for the prompts as shown below:
GPT-4 judges their responses and declares a winner along with an explanation:
This entire process is automated. You can incrementally fine-tune your model and evaluate it using MT Bench until it shows up consistently as the winner across all categories of prompts.
You can also add your business-specific or domain-specific test prompts to MT Bench by modifying the questions file. Plus, you can modify the GPT-4 judging prompts, for example, to include business-specific context for the evaluation.
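Under the hood, the MT Bench workflow is driven by a few scripts in FastChat's fastchat/llm_judge directory. A typical run looks like the following sketch, where the model IDs are illustrative and GPT-4 judging requires an OpenAI API key:

```bash
# 1. Generate your model's answers to the MT Bench questions
python gen_model_answer.py --model-path lmsys/vicuna-13b-v1.3 --model-id vicuna-13b

# 2. Ask GPT-4 to judge the generated answers
python gen_judgment.py --model-list vicuna-13b

# 3. Aggregate and display the judgments
python show_result.py --model-list vicuna-13b
```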
FastChat provides many deployment options for Vicuna and many other LLMs, both for research and production purposes. Essentially, FastChat acts as a bridge interface that stays stable while the application backend may change. Its support includes a command-line chat interface, a Gradio-based web UI backed by a distributed multi-model serving system, and OpenAI-compatible REST APIs.
The GPU memory requirements for the Vicuna models are roughly 14GB for Vicuna-7B and 28GB for Vicuna-13B with half-precision weights; 8-bit quantization roughly halves these requirements.
For example, if you want to deploy a Vicuna model behind an OpenAI-compatible API, first start the controller daemon (the commands below follow FastChat's documented serve modules; the model path is an example):
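```bash
# The controller coordinates the model worker processes
python3 -m fastchat.serve.controller
```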
Then start the LLM inference daemon with your preferred LLM:
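```bash
# Replace the model path with your preferred LLM's weights
python3 -m fastchat.serve.model_worker --model-path lmsys/vicuna-13b-v1.3
```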
Finally, start the OpenAI API server daemon:
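```bash
python3 -m fastchat.serve.openai_api_server --host 127.0.0.1 --port 8000
```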
Your Vicuna LLM is now available at http://127.0.0.1:8000. You can use it as a drop-in replacement for OpenAI's GPT models while the rest of your application logic continues to use the OpenAI client library:
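```python
# Point the standard OpenAI client at the local FastChat server.
# This sketch uses the openai Python SDK v0.x style (as in FastChat's docs);
# newer SDKs use OpenAI(base_url=..., api_key=...) instead.
import openai

openai.api_base = "http://127.0.0.1:8000/v1"
openai.api_key = "EMPTY"  # FastChat doesn't check the key; any non-empty string works

completion = openai.ChatCompletion.create(
    model="vicuna-13b-v1.3",  # FastChat registers the model under its path basename
    messages=[{"role": "user", "content": "Explain attention in transformers."}],
)
print(completion.choices[0].message.content)
```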
We summarize some of the inherent and observed shortcomings of Vicuna here: its 2,048-token context length rules out long-document use cases, its LLaMA lineage restricts it to non-commercial use, and, like most LLMs, it can hallucinate facts and struggles with math and complex reasoning tasks.
In this article, we explored the language capabilities of the Vicuna family as well as the training and deployment capabilities of the FastChat platform. Together, they enable you to experiment with and customize LLMs for your business needs while keeping your business data confidential.
For example, if you're a law firm, you can train these custom models on your confidential legal documents without uploading them to OpenAI or other managed LLM services. The same applies to medical reports, where private health information is protected by strict data security requirements.
Contact us to discuss how you can customize and deploy Vicuna and other language models for your business workflows.