How to Train and Deploy Vicuna and FastChat LLMs

Karthik Shiraly
August 21, 2023
[Image: FastChat workflow and how ShareGPT works]

For businesses, open-source large language models (LLMs) are attractive alternatives to managed service LLMs like ChatGPT because they're more customizable, more amenable to different customer deployment scenarios, and provide better data security and confidentiality.

In this guide, we explore the Vicuna LLM and its features, performance, and capabilities.

Overview of Vicuna and FastChat

When OpenAI's ChatGPT became famous in early 2023 for its versatile instruction-following capabilities, many other companies in the AI space also launched attempts to create similar instruction-following LLMs.

Meta was one of them with its Large Language Model Meta AI (LLaMA), an LLM meant for non-commercial, limited-access research uses. But, somehow, its weights — the key secret sauce that powers an LLM — were leaked. Because they were among the few LLM weights available at the time from a reputable company, an entire ecosystem quickly emerged around them. Some of its popular derivative projects include:

  • Stanford's Alpaca: Alpaca fine-tuned LLaMA into an instruction-following LLM similar to ChatGPT.
  • Llama.cpp: This is an optimized C/C++ implementation of LLaMA inference that can run on consumer-grade hardware.

In this ecosystem, the Large Model Systems Organization (LMSYS) aims to release open-source, publicly accessible LLMs. Vicuna is one such set of ChatGPT-style, instruction-following chatbot models, created by fine-tuning LLaMA. It has a context length of 2,048 tokens.

What makes Vicuna special? Through automated evaluation of chat responses using OpenAI's GPT-4, its creators claim that Vicuna:

  • Achieves a remarkable 90% of ChatGPT's quality
  • Outperforms LLaMA and Alpaca in 90% of evaluation tests

Unfortunately for businesses, Vicuna is only meant for non-commercial research uses. But LMSYS also publishes software called FastChat that enables businesses to create customized, Vicuna-like, open and commercial LLMs.

What Is FastChat?

FastChat is a framework for training and evaluating instruction-following LLMs and deploying them in production as software services with standard application programming interfaces (APIs). We explore each of these aspects in depth in later sections.

Vicuna was trained and evaluated using FastChat. With FastChat, you can create and deploy your own customized, domain-specific, LLM-based chatbots to help your customers and employees.

Other FastChat Models

In addition to Vicuna, LMSYS releases the following models that are also trained and deployed using FastChat:

  • FastChat-T5: T5 is one of Google's open-source, pre-trained, general-purpose LLMs. FLAN-T5 fine-tuned it for instruction following. FastChat-T5 further fine-tunes the 3-billion-parameter FLAN-T5 XL model using the same dataset as Vicuna. What makes it special is that it has a larger context length of 4,000 tokens, its encoder is bidirectional, and you can use it commercially under the Apache 2.0 license.
  • LongChat: LongChat is another fine-tuned, instruction-following LLaMA derivative similar to Vicuna. What makes it special is its 16K (16,384-token) context length, which makes it suitable for processing long-form documents.

In the next section, we try out Vicuna on some common text-processing use cases.

How Does Vicuna Fare on Common Use Cases?

The use cases below cover some common natural language processing tasks that are useful to businesses. We explore how the Vicuna-13b (13 billion parameters) model fares on them using qualitative assessments.

Use Case 1: Text Generation

For the first text generation experiment, we ask Vicuna-13b to explain the attention mechanism of transformer neural networks. Since LLaMA was trained on arXiv and Wikipedia datasets, it should be able to give a satisfactory answer.

[Image: Text generation with the Vicuna LLM]

The explanation is correct and quite thorough.

Use Case 2: Text Generation With Humor and Satire

Let's see how well it does with more creative text generation.

[Image: Humorous text generation with Vicuna]

The LLM has included the elements we asked for and generated a complete and coherent story.

Use Case 3: Zero-Shot Extractive Summarization

We ask Vicuna to generate an extractive summary of this TED talk on the philosophy and foundations of modern science. Due to context length limits, we restrict it to the first 1,024 characters of the lecture. Here is the extracted summary:

[Image: Zero-shot extractive summarization with Vicuna]

Vicuna has followed the instructions and strictly extracted key sentences from the given text without any prompt engineering. This is quite impressive because we often find that without very strict and specific instructions, even GPT-4 has trouble sticking to the task and tends to generate abstractive summaries or sentence fragments. We've had to do a ton of prompt engineering in use cases like GPT-4 dialogue summarization to ensure the model extracts only sentences.

Use Case 4: Zero-Shot Abstractive Summarization

In many situations, an extractive summary might be too restrictive because the extracted sentences lack surrounding context. An extracted sentence like “They usually talk about the 17th century” above is only valuable if you already understand the full dialogue. This is where an abstractive summary can be useful. Abstractive summaries are paraphrased and flow smoothly while providing a contextually richer version of all the key points in the text. For the same TED talk, here's Vicuna's abstractive summary:

[Image: Zero-shot abstractive summarization with Vicuna]

Vicuna Under the Hood

The Vicuna models are just LLaMA under the hood. They have autoregressive, decoder-only (no encoder) network architectures similar to other transformer-based LLMs like GPT-4.

They consist of several LLaMA decoder layers. Each decoder layer contains a multi-head self-attention block and a multi-layer perceptron (MLP).

Each multi-head self-attention block generates hidden states that embed information on how much attention each token should pay to every other token along different linguistic aspects.

Each MLP network in a layer applies nonlinear activations on the respective attention block's hidden states.

The input to the models is a sequence of up to 2,048 token embeddings. The output is the next predicted token, with each generated token conditioned on all the tokens that precede it.
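
Conceptually, the causal self-attention step described above can be sketched in a few lines of NumPy. This is an illustrative single-head version with toy shapes, not LLaMA's actual implementation; it omits details such as multiple heads, rotary positional embeddings, and normalization:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head, causal, scaled dot-product self-attention."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # Attention scores: how much each token attends to every other token
    scores = (q @ k.T) / np.sqrt(q.shape[-1])
    # Causal mask: a token may only attend to itself and earlier tokens
    mask = np.triu(np.ones(scores.shape, dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    weights = softmax(scores)   # each row sums to 1
    return weights @ v          # weighted mix of value vectors

rng = np.random.default_rng(0)
seq_len, d_model = 8, 16
x = rng.normal(size=(seq_len, d_model))          # toy token embeddings
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)
```

Thanks to the causal mask, perturbing a later token never changes the outputs for earlier positions, which is exactly what makes autoregressive generation possible.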

The Vicuna and LongChat models, all based on the LLaMA decoder-only transformer architecture, are summarized below:

[Image: Model architectures of Vicuna and LongChat]

Vicuna Training and Fine-Tuning

The Vicuna models are created by fine-tuning LLaMA models on crowdsourced ChatGPT conversations sourced from ShareGPT.

Since all the responses on ShareGPT are from ChatGPT, such fine-tuning is effectively a type of knowledge distillation where the smaller model (Vicuna) learns to generate responses that are close to those from the larger model (ChatGPT).

Using this dataset, the LLaMA models undergo supervised fine-tuning where the objective is to reduce the differences between their responses and the ChatGPT-generated responses.
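
A minimal sketch of that objective, assuming the usual token-level cross-entropy with the prompt tokens masked out so that only the ChatGPT-style response tokens contribute to the loss. The shapes and values below are toy examples, not FastChat's actual training code:

```python
import numpy as np

def sft_loss(logits, target_ids, loss_mask):
    """Cross-entropy over response tokens only; prompt tokens are masked out."""
    # Numerically stable log-softmax over the vocabulary
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    # Negative log-likelihood of each target token
    token_losses = -log_probs[np.arange(len(target_ids)), target_ids]
    # Average only over the (unmasked) response tokens
    return (token_losses * loss_mask).sum() / loss_mask.sum()

rng = np.random.default_rng(0)
vocab, seq = 100, 6
logits = rng.normal(size=(seq, vocab))      # toy model outputs
targets = rng.integers(0, vocab, size=seq)  # toy target token ids
mask = np.array([0, 0, 0, 1, 1, 1])         # first 3 tokens are the prompt
loss = sft_loss(logits, targets, mask)
```

Fine-tuning drives this loss down: when the model assigns high probability to exactly the ChatGPT-generated response tokens, the loss approaches zero.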

An example conversation from ShareGPT is shown below:

[Image: ShareGPT example]

Sample ChatGPT conversation for fine-tuning Vicuna (Source: ShareGPT)

LLaMA itself is trained on diverse corpora like the Common Crawl project, Wikipedia, arXiv, and more, which include both English and non-English data.

Currently, there are no Vicuna or other FastChat models trained using the state-of-the-art approach of reinforcement learning from human feedback (RLHF).

But you can further fine-tune these LLMs on consumer-grade hardware using the highly optimized quantized low-rank adapters (QLoRA) technique, which enables you to fine-tune even a 65-billion-parameter model on a single 48GB GPU. We've started using LoRA in all of our open-source LLM training.
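
To illustrate the low-rank adapter idea at the heart of (Q)LoRA, here's a minimal NumPy sketch: the frozen weight matrix stays untouched while two small matrices A and B are trained, and zero-initializing B makes the adapter a no-op at the start of fine-tuning. This is a conceptual sketch only; it ignores QLoRA's 4-bit quantization of the frozen weights:

```python
import numpy as np

def lora_forward(x, w_frozen, lora_a, lora_b, alpha=16):
    """y = x @ (W + (alpha/r) * A @ B); only A and B receive gradient updates."""
    r = lora_a.shape[1]  # the adapter rank
    return x @ w_frozen + (alpha / r) * (x @ lora_a @ lora_b)

rng = np.random.default_rng(0)
d_in, d_out, r = 32, 32, 4
w = rng.normal(size=(d_in, d_out))      # large frozen weight matrix
a = rng.normal(size=(d_in, r)) * 0.01   # small trainable matrix A
b = np.zeros((r, d_out))                # B starts at zero: adapter is a no-op
x = rng.normal(size=(1, d_in))
y = lora_forward(x, w, a, b)
```

The memory win comes from training only r * (d_in + d_out) adapter parameters per layer instead of the full d_in * d_out weight matrix.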

Vicuna Evaluation

FastChat comes with a built-in response evaluation web application called MT Bench. It queries LLMs using pre-defined prompts and asks GPT-4 to judge which LLM's response is best and why. It can either compare two custom LLMs with each other or a custom LLM with baseline GPT-3.5 or GPT-4 responses. GPT-4 is awesome at evaluating and ranking outputs.

The pre-defined prompts are grouped into categories and cover a variety of queries. The categories are listed below, with an example prompt for each:

  • Writing: Compose an engaging travel blog post about a recent trip to Hawaii, highlighting cultural experiences and must-see attractions.
  • Roleplay: Pretend yourself to be Elon Musk in all the following conversations. Speak like Elon Musk as much as possible. Why do we need to go to Mars?
  • Reasoning: Imagine you are participating in a race with a group of people. If you have just overtaken the second person, what’s your current position? Where is the person you just overtook?
  • Data extraction: Extract the following information from the presented texts: the name of the book, the author, the main character, and the year of publication. Output in the format of "main character, book, author, year of publication,” with one book per line.
  • STEM: In the field of quantum physics, what is superposition, and how does it relate to the phenomenon of quantum entanglement?
  • Humanities: Provide insights into the correlation between economic indicators such as GDP, inflation, and unemployment rates. Explain how fiscal and monetary policies affect those indicators.
  • Coding: Develop a Python program that reads all the text files under a directory and returns the top five words with the most number of occurrences.
  • Math: The vertices of a triangle are at points (0, 0), (-1, 1), and (3, 3). What is the area of the triangle?
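
The pairwise judging flow boils down to assembling a comparison prompt for GPT-4 from a question and the two candidate responses. The sketch below uses a simplified, hypothetical version of an MT Bench-style judging template, not the exact one FastChat ships:

```python
def build_judge_prompt(question, answer_a, answer_b):
    """Assemble a pairwise-comparison prompt for a GPT-4 judge.

    The wording is a simplified, hypothetical MT Bench-style template,
    not FastChat's exact one.
    """
    return (
        "You are an impartial judge. Compare the two AI responses below.\n"
        f"[Question]\n{question}\n"
        f"[Assistant A]\n{answer_a}\n"
        f"[Assistant B]\n{answer_b}\n"
        "State which assistant answered better and explain why. "
        "Reply with the verdict [[A]], [[B]], or [[C]] for a tie."
    )

prompt = build_judge_prompt(
    "What is the area of the triangle with vertices (0, 0), (-1, 1), (3, 3)?",
    "The area is 3.",   # candidate response from model A
    "The area is 6.",   # candidate response from model B
)
```

The assembled prompt would then be sent to GPT-4, whose verdict and explanation are recorded for that question.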

The two LLMs being evaluated generate responses for the prompts as shown below:

[Image: LLM evaluation for Vicuna and Alpaca]
MT Bench LLM responses (Source: MT Bench)

GPT-4 judges their responses and declares a winner along with an explanation:

[Image: GPT-4 output judgment]
GPT-4 LLM judging Vicuna vs. Alpaca responses (Source: MT Bench)

This entire process is automated. You can incrementally fine-tune your model and evaluate it using MT Bench until it shows up consistently as the winner across all categories of prompts.

You can also add your business-specific or domain-specific test prompts to MT Bench by modifying the questions file. Plus, you can modify the GPT-4 judging prompts, for example, to include business-specific context for the evaluation.
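
For example, assuming MT Bench's JSON Lines question format with "question_id", "category", and "turns" fields (check the exact schema in your FastChat version), appending a business-specific question might look like this:

```python
import json
import os
import tempfile

# Assumed schema: each line of the question file is a JSON record with
# "question_id", "category", and a list of "turns". Field names may differ
# across FastChat versions -- check your local question file first.
custom_question = {
    "question_id": 101,
    "category": "legal",  # a hypothetical business-specific category
    "turns": [
        "Summarize the indemnification clause in the contract below.",
        "Now list the obligations it places on the vendor.",
    ],
}

# Using a temporary path here for illustration; in practice you would
# append to MT Bench's own question file.
path = os.path.join(tempfile.mkdtemp(), "question.jsonl")
with open(path, "a", encoding="utf-8") as f:
    f.write(json.dumps(custom_question) + "\n")
```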

Vicuna and FastChat Deployment

FastChat provides many deployment options for Vicuna and many other LLMs, for both research and production purposes. Essentially, FastChat acts as a stable bridge interface: your application keeps talking to the same API while the LLM behind it changes. Its support includes:

  • OpenAI API: You can replace an OpenAI GPT model with your custom LLM in the application backend while the rest of the application continues to use OpenAI client libraries. Since the interface is retained, the client libraries think they're talking to GPT but are talking to your custom LLM.
  • Anthropic API: Similarly, if your application is using Anthropic's Claude LLM, you can drop in your custom LLM in its place without affecting the rest of your application.
  • Bard and PaLM APIs: In place of Google's Bard LLM or PaLM API, you can deploy your custom LLM as a drop-in replacement.
  • Gradio deployer: FastChat enables you to quickly build web application prototypes with user interfaces for your LLMs by deploying them to Gradio. FastChat also provides a handy configuration file for the Nginx web server that enables firewalling for your Gradio servers, dynamic mounting and unmounting, load balancing across multiple servers, and application security features.
  • vLLM deployment: FastChat enables you to deploy your LLM in production with vLLM.
  • Hugging Face command-line interface: FastChat provides a simple command-line interface that uses the Hugging Face Transformers AutoTokenizer and AutoModelForCausalLM APIs to generate a single response for a single prompt supplied via the command-line.
  • Command-line interface: FastChat provides a chat command-line interface for back-and-forth chatting with a model via the command line.

The GPU memory requirements for the Vicuna models are:

[Image: GPU memory required for the Vicuna models]

For example, if you want to deploy a Vicuna model that responds to the OpenAI API, first start the controller daemon:

[Image: FastChat controller deployment]

Then start the LLM inference daemon with your preferred LLM:

[Image: LLM worker daemon deployment]

Finally, start the OpenAI API server daemon:

[Image: OpenAI API server deployment]
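
Putting the three steps together, the daemons are launched as separate fastchat.serve processes. The model path and port below are example values, so substitute your own:

```python
import subprocess

# The three FastChat daemons as separate processes. The model path and
# port are example values -- substitute your own.
controller = ["python3", "-m", "fastchat.serve.controller"]
worker = [
    "python3", "-m", "fastchat.serve.model_worker",
    "--model-path", "lmsys/vicuna-13b-v1.3",
]
api_server = [
    "python3", "-m", "fastchat.serve.openai_api_server",
    "--host", "localhost", "--port", "8000",
]

def launch(cmd):
    """Start a daemon in the background (requires fastchat to be installed)."""
    return subprocess.Popen(cmd)
```

In practice, you'd run each daemon in its own terminal or under a process supervisor and wait for the worker to register with the controller before starting the API server.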

Your Vicuna LLM is now available at the API server's URL. You can use it as a drop-in replacement for OpenAI's GPT models while the rest of your application logic continues to use the OpenAI client library:

[Image: Vicuna available at the API server URL]
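
As a sketch of the drop-in idea, here's how a client could talk to the local server using only the Python standard library, assuming the API server from the previous step is listening on localhost port 8000 (adjust the URL and model name to your deployment):

```python
import json
import urllib.request

# Assumes the FastChat OpenAI-compatible server is listening locally on
# port 8000; adjust to wherever you started the API server.
API_URL = "http://localhost:8000/v1/chat/completions"

def build_request(prompt, model="vicuna-13b-v1.3"):
    """Build an OpenAI-style chat completion request for the local server."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

# To actually call the server once it's running:
# with urllib.request.urlopen(build_request("Hello!")) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

Because the request and response shapes mirror OpenAI's API, existing OpenAI client code only needs its base URL pointed at this server.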

Shortcomings of Vicuna

We summarize some of the inherent and observed shortcomings of Vicuna here:

  • Non-commercial license: For businesses, the biggest shortcoming may be the non-commercial license that Vicuna and LongChat inherit from LLaMA. FastChat-T5 from the same stable can be used commercially for now. But remember that there's always a risk of the license being changed later.
  • Risky crowdsourced dataset: The crowdsourced nature of ShareGPT means inaccurate answers, personal data, and malicious answers can leak into the model. FastChat does attempt to clean up the data, but the cleanup is automated and doesn't assess the appropriateness of the content. Consider fine-tuning on a safer dataset like databricks-dolly-15k for instruction following and then fine-tuning further on your custom data.
  • Small context length: Vicuna is limited to 2,048 tokens. However, LongChat supports 16,384 tokens. The same tricks that LongChat uses can be applied to FastChat-T5 to train your own long-context LLM for commercial use.

Using Vicuna and FastChat for Your LLM Workflows

In this article, we explored the language capabilities of the Vicuna family as well as the training and deployment capabilities of the FastChat platform. They enable you to experiment and customize LLMs for your custom business needs while keeping your business data confidential.

For example, if you're a law firm, you can train these custom models on your confidential legal documents without uploading them to OpenAI or other managed LLMs. The same applies to medical reports, where private health information is protected by strict data security requirements.

Contact us to discuss how you can customize and deploy Vicuna and other language models for your business workflows.