How Good Is the DollyV2 Large Language Model? 2 Use Cases To Review

Matt Payne
August 15, 2023
training functionality for dollyv2

The rise of LLMs with domain-specific fine-tuning in NLP workflows has brought a new focus on ease of fine-tuning and on open-source LLMs. Fine-tuning is the best way to steer a model toward your specific use case, and open-source LLMs make it as cost-effective as possible while clarifying how the model was trained and what data was used, which is valuable for this domain-steering process.

In this article, we dive into DollyV2, a self-hosted and open-source LLM that's completely free for commercial use. We cover:

  • What's special about it?
  • How does it work?
  • What's it good for?

Overview of DollyV2

DollyV2 (or more formally, Dolly 2.0) is a set of three LLMs from Databricks that are tuned to follow human instructions, just like OpenAI's ChatGPT.

But unlike ChatGPT and other models that are managed services or carry restrictive licenses, DollyV2 and its base models are released under the Apache 2.0 license, which leaves them completely unrestricted for any kind of commercial use.

The DollyV2 models use EleutherAI's Pythia models as their backbones. EleutherAI is the research lab also behind the GPT-J and GPT-Neo models.

To create the DollyV2 models, Databricks fine-tuned three of the pre-trained Pythia models on a custom human instruction dataset, databricks-dolly-15k, prepared by their employees through gamified crowdsourcing.

All three DollyV2 models are available from the Databricks Hugging Face Hub. They differ in their parameter sizes and, therefore, generative capabilities:

  • dolly-v2-12b: With 12 billion parameters, this is the largest and most capable model, but it requires a beefy GPU with 25 GB of VRAM or more.
  • dolly-v2-7b: With roughly 7 billion parameters, this model targets GPUs with 16 GB or more.
  • dolly-v2-3b: With just 3 billion parameters, this is the lightest model, usable on low- to mid-range GPUs with 8-12 GB.
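The GPU figures above line up with a quick back-of-the-envelope estimate: at 16-bit precision, each parameter takes 2 bytes, so the weights alone need about twice the parameter count in gigabytes, before activations and runtime overhead. A minimal sketch:

```python
# Back-of-the-envelope GPU memory estimate: each parameter stored in
# 16-bit precision (fp16/bfloat16) takes 2 bytes, so the weights alone
# need roughly 2 * num_parameters bytes, before activations and overhead.
def min_vram_gb(num_params_billions: float, bytes_per_param: int = 2) -> float:
    return num_params_billions * 1e9 * bytes_per_param / 1e9

for name, size in [("dolly-v2-12b", 12), ("dolly-v2-7b", 7), ("dolly-v2-3b", 3)]:
    print(f"{name}: ~{min_vram_gb(size):.0f} GB of VRAM for the weights alone")
```

This is why the 12-billion-parameter model wants 25 GB or more: ~24 GB goes to the weights before anything else.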

Why Would You Want to Use DollyV2?

DollyV2 offers multiple benefits to any company that wants to integrate private LLMs into their NLP systems.

1. Control Over Fine-Tuning

Unlike ChatGPT and similar managed LLMs, you control the fine-tuning process for the pre-trained open-source models. You no longer pay per token or per record, and you have full access to training metrics and evaluation. Your data scientists will appreciate the flexibility and the clarity around what's going on.

2. Flexible Optimization & Infrastructure

You can deploy the DollyV2 models on your preferred cloud or on-premise infrastructure. When you need better latency or throughput, you have the freedom to scale up or scale out on demand by deploying additional cloud infrastructure. For example, if you're an e-commerce company, you can scale up your DollyV2-based customer chatbot during peak shopping periods and scale it down later. You can't do that with managed service LLMs like ChatGPT.

3. Data Confidentiality

If you're a financial institution or a health care service where compliance with data privacy and confidentiality is mandatory, models like DollyV2 are easier to secure than externally hosted managed service LLMs. You can fine-tune DollyV2 without exposing any confidential data to another company. The same holds at inference time: the data never needs to leave your servers.

With managed services like ChatGPT, you are forced to upload your data to their systems and trust them to maintain your data security posture.

4. Unrestricted License

DollyV2's Apache 2.0 license permits you to use the models for any commercial purpose without restriction. You are free to sell products or deploy services that use these models. That's not true of every open-source LLM: some carry royalty obligations or lack a license this permissive for commercial use.

Use Case Demos

In this section, we examine how the most capable DollyV2 model, the dolly-v2-12b, fares in common natural language processing use cases.

1. Chatbots

The illustration below depicts a chat session with the DollyV2 model:

chatbot use case with dollyv2

Some positive observations:

  • The model's instruction-following is good. It didn't generate irrelevant responses, and it doesn't pad its answers with unnecessary language or sentences. One of the main complaints I hear from customers using ChatGPT is that responses run long and mix irrelevant information in with the relevant. Ask ChatGPT for a summary of an article and it will open with "Okay, here is a summary of the article in a bulleted list" when we just want the list to come back.
  • The pronoun references like they and them, also known as anaphora, are resolved perfectly. It understands that "they" in the second question and "them" in the third question are referring to the books in the first reply.

Some negative observations:

  • The book and author names are all hallucinated. Since the DollyV2 models were extensively trained on multiple book corpora, this shouldn't happen, and it suggests that DollyV2's training process may be suboptimal.

Chatbot Implementation Using LangChain

What's not so obvious in the illustration above is that DollyV2 isn't capable of conversations out of the box. It's a stateless model, which means it doesn't remember any previous queries and responses. Every query is treated as a new conversation.

To overcome this, you can use LangChain's conversation chain and buffer memory. It stores the sequence of queries and their replies in your system's memory. With every new query, it includes the entire previous conversation. This is how DollyV2 can understand the previous context and resolve the anaphora correctly.
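The buffer-memory idea itself is simple, and a minimal hand-rolled sketch makes it concrete. LangChain's conversation chain does this bookkeeping for you; here, `generate` is a stand-in for a real call to DollyV2:

```python
# Minimal sketch of the buffer-memory idea: keep every (query, reply) pair
# and prepend the whole history to each new prompt, so a stateless model
# sees the prior context. `generate` stands in for an actual DollyV2 call.
class BufferMemoryChat:
    def __init__(self, generate):
        self.generate = generate   # callable: full prompt -> model reply
        self.history = []          # list of (query, reply) pairs

    def ask(self, query: str) -> str:
        transcript = "".join(f"Human: {q}\nAI: {a}\n" for q, a in self.history)
        reply = self.generate(transcript + f"Human: {query}\nAI:")
        self.history.append((query, reply))
        return reply

# Usage with a stub model that reports how much context it received:
chat = BufferMemoryChat(lambda prompt: f"(saw {prompt.count('Human:')} turns)")
chat.ask("Recommend three books.")
print(chat.ask("Which of them is the shortest?"))  # prints "(saw 2 turns)"
```

The second query arrives at the model with the first exchange prepended, which is exactly how the anaphora in the chat session above gets resolved.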

2. Zero-Shot Summarization for Medical Reports

When reviewing LLMs, I like using the medical report use case because these reports often mix multiple page formats in a single document and carry a high level of noise around the important information. It gives me a good gauge of the model's ability to handle rough inputs.

For the medical report below, we asked it to generate a simple summary:

zero shot summarization of medical documents with dollyv2

It generated the following summary:

output from dollyv2


  • At first, the summary seems extractive.
  • But the almost perfect replication of sentences from the original report suggests that the model may be simply mirroring the input unmodified.

Other prompts also generated summaries that contained grammatical errors or hallucinations (marked in red in the illustrations below):

new prompts for the same dollyv2 task
New temperature for the same task
new temperature and new tasks
Temperature changes to the prompt

Overall, DollyV2's chatbot capabilities work OK. However, its summarization capabilities are not ready for production. They tend to have syntactic flaws, grammatical errors, or hallucinations. We suggest additional fine-tuning on summarization datasets.

Under the Hood of the DollyV2 Models

The DollyV2 and Pythia models are all based on the GPT-NeoX model architecture and are implemented by the Transformers library's GPTNeoXForCausalLM class.

The three DollyV2 models differ in their internals as shown in this table:

architecture information for dollyv2 at the different param sizes

Each hidden layer consists of a multi-head attention unit and a multi-layer perceptron (MLP). Since each model has a different number of attention heads and MLP neurons, the total number of parameters per hidden layer is different. The dolly-v2-12b model has almost four times the number of parameters per layer as the dolly-v2-3b model.
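That per-layer gap can be checked with a rough count that ignores biases and layer norms: in a GPT-NeoX block, the attention projections (Q, K, V, output) contribute about 4h² weights and the MLP (h → 4h → h) about 8h², for roughly 12h² per layer. The hidden sizes below come from the underlying Pythia configs:

```python
# Rough parameters per transformer layer in a GPT-NeoX block, ignoring
# biases and layer norms: attention projections (Q, K, V, output) give
# 4*h^2 weights and the MLP (h -> 4h -> h) gives 8*h^2, so ~12*h^2 total.
# Hidden sizes taken from the underlying Pythia model configs.
def params_per_layer(hidden_size: int) -> int:
    return 12 * hidden_size ** 2

hidden = {"dolly-v2-3b": 2560, "dolly-v2-7b": 4096, "dolly-v2-12b": 5120}
for name, h in hidden.items():
    print(f"{name}: ~{params_per_layer(h) / 1e6:.0f}M parameters per layer")

# The ~4x per-layer gap between the 12b and 3b models noted above:
print(params_per_layer(5120) / params_per_layer(2560))  # prints 4.0
```

Since the 12b model's hidden size is exactly twice the 3b model's, the per-layer parameter count scales by roughly the square of that, hence the factor of about four.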

The layers of the dolly-v2-12b model are depicted below:

training information for GPT Neo
Layers of the dolly-v2-12b model

DollyV2 Training Datasets

Two datasets are in play when training or using DollyV2. One is the dataset used to train the underlying Pythia language model from scratch. The other is the dataset used to fine-tune it to follow instruction prompts. We explore both datasets and their implications on the model's capabilities below.

1. The Pile for Language Model Training

The Pile consists of diverse, mostly English text corpora, weighing around 880 gigabytes in total. It includes:

  • Medical academic papers: It includes PubMed Central and PubMed abstracts.
  • Legal texts: U.S. court opinions from the Free Law project are included.
  • Scientific and technical papers: Research papers from arXiv and patent documents from the U.S. Patent and Trademark Office are included. These cover subjects like computer science, physics, mathematics, applied science, and engineering fields.
  • Humanities: PhilPapers, a corpus of philosophy publications, is included.
  • Fiction and non-fiction books: It includes books from Project Gutenberg, BookCorpus, and Bibliotik. However, the latter two datasets have legal issues around them.
  • Web content: It includes a diverse set of web corpora like the Common Crawl, OpenWebText2, the English Wikipedia, and YouTube subtitles.
  • Multilingual corpora: It contains only a few multilingual datasets, like EuroParl, plus possibly some multilingual text from the web corpora. Overall, The Pile is 97.4% English text and only about 2.6% non-English languages.

2. Databricks-Dolly-15K for Instruction Prompt Training

Databricks-dolly-15k is a dataset for instruction prompting created by Databricks employees through an internal gamified crowdsourcing process.

It contains 15,000 entries that conform to the recommendations of the InstructGPT paper. They include conversational categories like:

  • Chatting
  • Brainstorming
  • Classification
  • Close-ended and open-ended question answering
  • Text generation
  • Information extraction
  • Summarization

Each entry consists of:

  • Instruction
  • Optional context
  • Response
  • Category

Some examples are shown below:

output dataset from Databricks-Dolly-15K for Instruction Prompt Training

Sample entries in the databricks-dolly-15k dataset (Source: databricks-dolly-15k)

The dataset is available under the Creative Commons Attribution-ShareAlike 3.0 Unported License, which allows modification and commercial use as long as derivatives are attributed and distributed under the same license.

The entries in this dataset are used to fine-tune a pre-trained LLM like Pythia for instruction following. Fine-tuning updates the model weights so that the model learns to generate appropriate responses to the user's instructions.
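For fine-tuning, each record's fields get flattened into a single training string. A sketch of what that looks like is below; the "### Instruction:" / "### Response:" markers mirror the style used in Databricks' dolly training code, but treat this template (and the sample record) as illustrative rather than the exact one used in training:

```python
# Sketch of turning one databricks-dolly-15k record (instruction, optional
# context, response) into a flat training prompt. The section markers are
# illustrative; the exact template lives in Databricks' dolly repository.
def format_record(instruction: str, response: str, context: str = "") -> str:
    parts = [
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.",
        f"### Instruction:\n{instruction}",
    ]
    if context:  # the dataset's context field is optional
        parts.append(f"### Context:\n{context}")
    parts.append(f"### Response:\n{response}")
    return "\n\n".join(parts)

# Hypothetical record, shown only to illustrate the layout:
print(format_record(
    instruction="Summarize the passage in one sentence.",
    context="DollyV2 is an instruction-tuned LLM released by Databricks.",
    response="DollyV2 is Databricks' instruction-tuned open-source LLM.",
))
```

During fine-tuning, the model sees strings like this and learns to continue everything after the response marker.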

DollyV2 Training

LLMs can be fine-tuned for instruction-following using one or both of:

  • Supervised fine-tuning
  • Reinforcement learning from human feedback

Supervised fine-tuning on the databricks-dolly-15k dataset is used to create the DollyV2 models from their Pythia backbones. The instructions and contexts in the dataset serve as the inputs, and the responses serve as the expected outputs.

The AdamW optimizer minimizes a token-level cross-entropy loss between the expected and generated responses. This nudges the attention and multi-layer perceptron weights in the model's layers toward generating the expected responses.
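To make the loss concrete, here is a minimal sketch of token-level cross-entropy: for each position, the negative log-probability the model assigns to the expected token, averaged over the response. The toy logits are illustrative, not model output:

```python
import math

# Token-level cross-entropy, the quantity AdamW drives down during
# supervised fine-tuning: for each position, the negative log-probability
# assigned to the expected response token, averaged over all tokens.
def token_cross_entropy(logits_per_step, target_ids):
    total = 0.0
    for logits, target in zip(logits_per_step, target_ids):
        m = max(logits)  # subtract the max for numerical stability
        log_norm = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_norm - logits[target]  # -log softmax(logits)[target]
    return total / len(target_ids)

# A model uniformly unsure over a 4-token vocabulary scores ln(4) ~= 1.386:
print(token_cross_entropy([[0.0, 0.0, 0.0, 0.0]], [2]))
```

Training pushes the logit of each expected token up relative to the rest, driving this average toward zero.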

Since these are large models, training must be distributed across multiple GPUs. To do this efficiently, the DollyV2 training process relies on the DeepSpeed library.

When Not to Use DollyV2 — Current Shortcomings

The available pre-trained DollyV2 models are not ideal for some use cases without fine-tuning:

  • Non-English languages: Neither the DollyV2 nor the backbone Pythia models are trained extensively on any language besides English, and fine-tuning isn't a good strategy for learning the innumerable linguistic characteristics of a different language. To handle another language, you'd have to expand the parameter capacity of these models and train them from scratch on that language's datasets alongside The Pile.
  • Large inputs: DollyV2's context window is just 2,048 tokens, while most managed LLMs average around 4K tokens and some go beyond 32K. Anything involving large inputs may require chunking or similar strategies and can produce subpar results. This is a growing issue as token windows keep increasing in size. While long contexts can degrade performance, as we've outlined and as is outlined in this research paper, customers continue to push the boundaries of the context window. At its current token limit, DollyV2 is better suited to chatbots and Q&A than to long transcript summarization and similar use cases.
  • Model size: DollyV2 has no version approaching the 100B-parameter range. Some use cases that try to compete with ChatGPT out of the box require that scale, and other open-source model families usually offer at least one version of that size.
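The chunking workaround mentioned above is straightforward to sketch: split a long token sequence into overlapping windows that each fit the 2,048-token limit, leaving headroom for the generated response. The reserve and overlap sizes here are illustrative defaults, and the integer token IDs are stand-ins for real tokenizer output:

```python
# Sketch of the chunking workaround for DollyV2's 2,048-token window:
# split a long token sequence into overlapping chunks that each fit,
# leaving room for the generated response. Reserve and overlap values
# are illustrative defaults, not tuned numbers.
def chunk_tokens(tokens, window=2048, reserve_for_output=256, overlap=128):
    chunk_size = window - reserve_for_output   # room left for the input
    step = chunk_size - overlap                # overlap preserves context
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), step)]

chunks = chunk_tokens(list(range(5000)))       # a 5,000-token "document"
print(len(chunks), [len(c) for c in chunks])
```

Each chunk is then summarized or queried separately and the partial results merged, which is workable but is exactly the kind of extra machinery, and potential quality loss, that larger context windows avoid.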

In addition, because DollyV2 is a research-oriented model under development, it may show the following problems listed by Databricks:

  • Can't handle complex prompts
  • Not good at open-ended question answering
  • Doesn't format letters properly
  • Fails at code generation
  • Can't handle mathematical operations
  • Produces factual errors
  • Doesn't handle dates and times well
  • Can't produce lists of specific lengths
  • Has trouble copying writing style or tone
  • Doesn't reproduce a sense of humor
  • Not the best performer in terms of latency and throughput

So What Can You Use DollyV2 For?

Knowing its training dataset details and known shortcomings, here are some use cases for which DollyV2 may work well after some more fine-tuning on related datasets:

  • Scientific paper search: The pretraining on scientific papers and the prevalence of simple queries in typical search use cases means that DollyV2 may work fairly well for this use case.
  • Health care chatbots: The PubMed datasets enable it to answer simple health care questions.
  • Literature search: Its pretraining on large book corpora enables information retrieval from both fiction and non-fiction books.
  • Q&A systems: Question answering over small contexts, such as customer reviews, fits its limited context window perfectly.

Models Like DollyV2 Make Custom LLMs Practical

While ChatGPT is good for many use cases, most businesses will have some unique needs and workflows that ChatGPT can't satisfy. Free, open-source LLMs like DollyV2 are critical to making highly customized models practical for businesses.

At Width, we have years of experience in customizing GPT and other language models for businesses. Contact us!