OpenLLaMA: Evaluating the Open-Source LLM on Language Tasks

Matt Payne
August 28, 2023
OpenLLaMA vs LLaMA

In a February 2023 paper, Meta claimed that its 13-billion-parameter LLaMA large language model (LLM) outperformed GPT-3. Is its open-source clone, OpenLLaMA, equally good? How does it fare on common language tasks? Can you reliably use it for your business? Find out in this article.

What Is OpenLLaMA?

LLaMA is an LLM that Meta licensed for research use. LLaMA grew popular when Meta’s model weights got leaked, possibly unlawfully. That enabled an ecosystem of custom LLaMA models to emerge. Such self-hosted, custom LLaMA models could help businesses automate many workflows, but the associated licensing and legal risks are not worth it.

In this scenario, businesses prefer less risky LLMs whose weights and code are open-sourced with clear licenses. OpenLLaMA is one such family of open-sourced, permissively licensed LLMs created by OpenLM Research. It aims to reproduce the LLaMA model accurately based on information from the LLaMA research paper and other sources.

The OpenLLaMA model weights as well as the EasyLM framework used to model and train them are published under the business-friendly Apache 2.0 license. Businesses are free to use, modify, and distribute them for commercial purposes, provided they follow the license's attribution and notice terms.

In the rest of this article, we explore the OpenLLaMA models, their internals, and their capabilities.

OpenLLaMA Models

As of August 2023, OpenLM Research has published five OpenLLaMA models:

comparison of OpenLLaMA versions based on architecture

All five models are available for both the Hugging Face transformers and Flax libraries.

Keep in mind that these OpenLLaMA models are base language models (like OpenAI's GPT-3), not instruction-following models like ChatGPT. To turn them into conversational assistants, you must fine-tune them on chat datasets, like databricks-dolly-15k, using either supervised fine-tuning or reinforcement learning from human feedback.
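As a rough illustration of what preparing a chat dataset for supervised fine-tuning involves, the sketch below turns one databricks-dolly-15k style record into a single training string. The field names (instruction, context, response) follow that dataset, but the "###" template and the sample record are assumptions made up for this example, not a format OpenLLaMA or EasyLM prescribes.

```python
def format_dolly_record(record):
    """Render one instruction record as a single training text.

    The section markers are an arbitrary, hypothetical template.
    """
    parts = [f"### Instruction:\n{record['instruction']}"]
    # The context field is optional and often empty, so skip it when blank.
    if record.get("context"):
        parts.append(f"### Context:\n{record['context']}")
    parts.append(f"### Response:\n{record['response']}")
    return "\n\n".join(parts)


# Made-up record in the dataset's shape, for illustration only.
sample = {
    "instruction": "Summarize the refund policy.",
    "context": "Refunds are issued within 14 days of purchase.",
    "response": "You can get a refund within 14 days of buying.",
}
print(format_dolly_record(sample))
```

Whatever template you pick, the same template has to be used again at inference time so that prompts look like the training data.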

The second version models are better in two aspects:

  • More accurate tokenization: The first version models were configured with tokenizers that merged multiple whitespaces into one. As a result, whitespace-sensitive tasks like code generation didn't perform well. The second version models don't merge whitespaces.
  • Improved training dataset: In the new models, the ratios of different types of content in the training dataset are tweaked for better results. We cover these details in the Training Dataset section below.
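To see why merged whitespace hurts code generation, here is a small illustration. It does not use the real OpenLLaMA tokenizer; a regular expression stands in for the merging behavior, and the Python snippet being mangled is made up for the example.

```python
import re


def merge_whitespace(text: str) -> str:
    """Collapse every run of two or more spaces into a single space."""
    return re.sub(r" {2,}", " ", text)


def is_valid_python(source: str) -> bool:
    """Return True if the source compiles, False on a syntax or indentation error."""
    try:
        compile(source, "<snippet>", "exec")
        return True
    except SyntaxError:  # IndentationError is a subclass of SyntaxError
        return False


snippet = (
    "def area(w, h):\n"
    "    if w < 0 or h < 0:\n"
    "        raise ValueError('negative size')\n"
    "    return w * h\n"
)

merged = merge_whitespace(snippet)
print(merged)
print("original compiles:", is_valid_python(snippet))
print("merged compiles:", is_valid_python(merged))
```

Once the indentation levels collapse to one space each, the nesting of the function body can no longer be recovered, which is the kind of information a whitespace-merging tokenizer never gets to see.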

But first, what's under the hood of these OpenLLaMA models?

OpenLLaMA Architecture

Like LLaMA, OpenLLaMA replicates the classic transformer decoder architecture but with some improvements. OpenLLaMA implements most of the same architectural improvements described in the LLaMA paper:

  • Layer pre-normalization: Each attention block is normalized at the input rather than the output by using a root mean square normalization (RMSNorm) layer. This improves stability during training by preventing exploding and vanishing gradients.
  • Multi-layer perceptron (MLP) activation: The MLP in each OpenLLaMA transformer block uses the sigmoid linear unit (SiLU, also known as Swish) as its activation and is applied to the output of the attention sub-layer. Combined with the gated structure of the LLaMA MLP, this is the Swish gated linear unit (SwiGLU) that the LLaMA paper describes, so OpenLLaMA follows LLaMA here. This choice helps the models converge faster during training.
  • Rotary embeddings: Instead of using absolute positional embeddings, both OpenLLaMA and LLaMA encode token positions into the embedding vectors by using rotary embeddings. This enables longer context lengths and injects information about relative distances into the attention layers, which helps improve result quality. As someone who is deeply involved in embeddings, I find this pretty interesting. The results in the RoFormer research paper are good (all research papers show good results nowadays), although it doesn't show any comparisons to SBERT, which is a bit more SOTA than BERT or WoBERT.
RoFormer vs BERT embeddings
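The three building blocks above can be sketched in plain Python on small lists. This is a minimal illustration and not the EasyLM implementation: the function names are made up, the vectors are arbitrary, and the rotary function rotates consecutive pairs as in the RoFormer formulation, whereas real implementations work on tensors and may pair dimensions differently.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """RMSNorm: divide x by its root mean square, then apply a learned scale."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [w * v * scale for v, w in zip(x, weight)]


def silu(v):
    """SiLU (Swish): v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def gated_hidden(gate, up):
    """SwiGLU-style gating: SiLU(gate) multiplied elementwise with up."""
    return [silu(g) * u for g, u in zip(gate, up)]


def rotate_pairs(vec, position, base=10000.0):
    """Rotary embedding: rotate each consecutive pair by a position-dependent angle."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x0, x1 = vec[i], vec[i + 1]
        out += [x0 * c - x1 * s, x0 * s + x1 * c]
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [1.0, 0.5, -0.3, 0.8]   # arbitrary query vector
k = [0.2, -0.7, 0.9, 0.4]   # arbitrary key vector

# Both pairs of positions are 2 tokens apart, so the attention scores agree.
s1 = dot(rotate_pairs(q, 5), rotate_pairs(k, 3))
s2 = dot(rotate_pairs(q, 12), rotate_pairs(k, 10))
print(abs(s1 - s2) < 1e-9)
```

The last check is the property the bullet on rotary embeddings refers to: after rotation, the query-key score depends only on the distance between the two positions, not on where they sit in the sequence.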

Training Dataset

The latest OpenLLaMA version two models are trained on the following datasets:

  • Falcon RefinedWeb: Falcon RefinedWeb is a sanitized version of the Common Crawl web dataset, which consists of billions of pages from around the web. I love these sanitized and cleaned-up datasets for base model training because their language format fits a bit better with how models are actually prompted.
  • StarCoder: StarCoder is a massive dataset of programming code pulled from GitHub.
  • RedPajama: Second version models train on some subsets of the RedPajama collection which is an open-source replica of the datasets and content ratios that LLaMA uses. Specifically, the new models are trained on just the Wikipedia, arXiv, books, and StackExchange subsets of RedPajama.

The older first-version models are trained only on RedPajama, using the entire collection.


The OpenLLaMA models are trained with the following configuration:

  • Loss function: The loss function is standard cross-entropy loss.
  • Optimizer: It uses the AdamW adaptive optimizer.
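For readers who want to see what these two pieces compute, here is a minimal scalar sketch of cross-entropy for one next-token prediction and of a single AdamW update. The hyperparameter values are common defaults chosen for illustration; they are assumptions, not the values used to train OpenLLaMA.

```python
import math


def cross_entropy(logits, target_index):
    """Negative log-probability of the target token under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum_exp - logits[target_index]


def adamw_step(param, grad, m, v, step,
               lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.01):
    """One AdamW update for a single scalar parameter (step counts from 1)."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1 - beta1 ** step)             # bias correction
    v_hat = v / (1 - beta2 ** step)
    # Weight decay is applied to the parameter directly, decoupled from the gradient.
    param = param - lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * param)
    return param, m, v


loss = cross_entropy([2.0, 0.5, -1.0], target_index=0)
param, m, v = adamw_step(param=1.0, grad=1.0, m=0.0, v=0.0, step=1)
print(loss, param)
```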

The native format of the model checkpoints is Flax. EasyLM provides a conversion script to convert Flax to PyTorch models that you can load with Hugging Face transformers.
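Once a checkpoint is in Hugging Face format, loading it could look roughly like the sketch below. It assumes the torch, transformers, sentencepiece, and accelerate packages are installed, and it assumes the Hub repository is named openlm-research/open_llama_7b_v2; substitute your own path or converted checkpoint. The function names are made up for this example, and the heavy imports are deferred so the file can be imported on a machine without a GPU.

```python
MODEL_ID = "openlm-research/open_llama_7b_v2"  # assumed Hub repository name


def load_model(model_id: str = MODEL_ID):
    """Load the tokenizer and model; downloads several gigabytes on first use."""
    import torch
    from transformers import LlamaForCausalLM, LlamaTokenizer

    tokenizer = LlamaTokenizer.from_pretrained(model_id)
    model = LlamaForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.float16,  # half precision to reduce memory use
        device_map="auto",          # needs the accelerate package
    )
    return tokenizer, model


def complete(tokenizer, model, prompt: str, max_new_tokens: int = 32) -> str:
    """Generate a completion; the decoded text includes the original prompt."""
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    output_ids = model.generate(input_ids=input_ids, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Typical use (not executed here because it downloads the model):
#   tokenizer, model = load_model()
#   print(complete(tokenizer, model, "Q: What is the largest animal?\nA:"))
```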

In the next section, we find out how OpenLLaMA does on common language tasks.

OpenLLaMA Language Tasks Test Drive

Since OpenLLaMA is a pure language model, the prompts must either ask for text completions or provide few-shot examples if you don't plan on fine-tuning it.

The tasks below demonstrate these use cases against the second version OpenLLaMA-7bv2 model.

1. Zero-Shot Text Generation

For text generation, we started by prompting it to generate something factual:

zero shot text generation use case

We observed that:

  • It printed the answer as a list.
  • The model does not know when to stop and keeps repeating text up to the maximum token length we specify.
  • This output falls victim to a very common issue with list generation in base GPT-3, one that gave me PTSD from my days building prompting frameworks to work around it. The model generates “Risk management” as the first item with some log probability. Each subsequent list item treats the items above it as “correct,” so the log probability of “Risk” as the first word rises with every item, which in turn makes it even more likely that the next item starts with “Risk.” The loop only breaks if the model picks a lower-probability token, and that becomes less likely the longer the list runs. The two ways out are a higher temperature, so that sampling sometimes skips the top log-probability token, or better prompt instructions and a clearer goal output definition. The results are not technically wrong, but they are neither visually appealing nor diverse.

I was able to recreate this pretty easily in base davinci GPT-3. The color coding provided for log probabilities shows that “Risk” as the first word becomes more and more likely as the list goes on. The list eventually just eats itself and produces the same thing over and over. Okay, back to OpenLLaMA.
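To make the temperature point concrete, the sketch below shows how dividing logits by a temperature changes the next-token probabilities. The three tokens and their logits are made-up numbers, not values taken from OpenLLaMA or GPT-3.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities after scaling them by 1 / temperature."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for the first word of the next list item.
tokens = ["Risk", "Compliance", "Audit"]
logits = [4.0, 2.0, 1.5]

for temperature in (0.5, 1.0, 1.5):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, dict(zip(tokens, (round(p, 3) for p in probs))))
```

A higher temperature takes probability away from the dominant token and spreads it over the alternatives, which gives sampling a better chance of escaping the loop. It does not remove the underlying bias, so prompt design still matters.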

Next, I asked OpenLLaMA for something more creative, like a poem:

poem generation with OpenLLaMA

The first few lines of the generated text aren't great but aren't too bad either. It starts out having a poem feel even without any prompt instructions or goal state definition. It does fall into the same trap we saw above where it begins to repeat itself. A few solutions for this:

  • As with most of my blogs from 2021 that used only base GPT-3, strong prompting or fine-tuning is critical to guide the model toward what we want. A model with no instruction-based fine-tuning needs prompts with clear instructions, a goal state output definition, and format rules. Fine-tuning for task-specific use is what we always recommend, and it is even more valuable here.
  • Few-shot prompting helps the model understand how much variation to include in its outputs. The examples can show the specific format the poem should follow and how varied its lines should be.

2. Few-Shot Intent Recognition

For a few-shot language task, we picked customer intent recognition. The idea was to supply the LLM with pairs of customer questions and corresponding intent categories. Would OpenLLaMA behave like a text classifier and output an intent category? Intent recognition is a very popular task in chatbot workflows because it tells us whether we can rely on the model's underlying training to answer the query or whether we need to call an external API such as an inventory service, a mortgage calculator, or another service.

We supplied 11 few-shot examples with "Q:" being the customer queries and "A:" being the corresponding intent categories:

few shot text classification with the goal of intent recognition

Then we added the query we're interested in classifying:

query input for intent classification

The model generated the following:

generation results


  • The good news: It did classify the customer query correctly as "lost_or_stolen_card."
  • The bad news: It repeated the entire input prompt in the results and, after the correct answer, continued to generate pairs of queries and categories. GPT-3 does this as well when we do not specify a stop sequence. It isn't much of a concern, since a simple stop sequence cleans this up.
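The cleanup described above can be done in a few lines of post-processing. The sketch below builds a "Q:" / "A:" prompt, strips the echoed prompt from the raw output, and cuts the text at a stop sequence. The function names, the example question texts, and the simulated model output are all made up for illustration; only the "Q:" / "A:" layout and the lost_or_stolen_card label come from our test.

```python
def build_prompt(examples, query):
    """Lay out few-shot pairs as 'Q: ...' / 'A: ...' lines, then the new query."""
    lines = []
    for question, intent in examples:
        lines.append(f"Q: {question}")
        lines.append(f"A: {intent}")
    lines.append(f"Q: {query}")
    return "\n".join(lines) + "\n"


def clean_completion(prompt, generated, stop="\nQ:"):
    """Drop the echoed prompt, cut at the first stop sequence, strip the 'A:' tag."""
    text = generated[len(prompt):] if generated.startswith(prompt) else generated
    cut = text.find(stop)
    if cut != -1:
        text = text[:cut]
    text = text.strip()
    if text.startswith("A:"):
        text = text[2:].strip()
    return text


examples = [("How do I activate my new card?", "activate_my_card")]
prompt = build_prompt(examples, "I think someone stole my card")

# Simulated raw output: the echoed prompt, the answer, then runaway generation.
raw = prompt + "A: lost_or_stolen_card\nQ: Where is my refund?\nA: request_refund\n"
print(clean_completion(prompt, raw))
```

If your inference stack accepts stop sequences directly, passing one there is cheaper than trimming afterwards, because the model stops generating instead of running to the maximum token length.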

Another surprise was that it was quite sensitive to the syntax of the last query. Instead of the query above, we asked this query with the only change being that it ended with "A:":

extra query showing edge case issue

We expected it to start the completion with the intent category. But instead, it didn't generate any category at all in the beginning and then continued with garbage text:

no answer result in edge case


We conclude that:

  • It's quite sensitive to the prompt syntax. Your few-shot examples must be consistently structured and must end in exactly the output format you want the model to produce.
  • The problem of text being generated till the maximum sequence length is seen here too. This can be cleaned up with stop sequences.
  • Overall, these problems imply that an application must do quite a lot of result validation on OpenLLaMA's results.
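One simple form of result validation is to accept an answer only if it is one of the intent categories you actually support. The sketch below assumes the completion has already been trimmed to the model's answer. Apart from lost_or_stolen_card, the category names and the fallback label are hypothetical.

```python
# Hypothetical set of supported intents; only the first appears in our test.
ALLOWED_INTENTS = {"lost_or_stolen_card", "activate_my_card", "card_arrival"}


def validate_intent(raw_answer, allowed_intents, fallback="unknown"):
    """Return the first line of the answer if it is a known intent, else the fallback."""
    lines = raw_answer.strip().splitlines()
    if not lines:
        return fallback  # the model produced no category at all
    candidate = lines[0].strip().lower()
    return candidate if candidate in allowed_intents else fallback


print(validate_intent("lost_or_stolen_card\nQ: another question", ALLOWED_INTENTS))
print(validate_intent("", ALLOWED_INTENTS))
print(validate_intent("some unrelated text", ALLOWED_INTENTS))
```

Routing the fallback case to a human or to a retry with a different prompt keeps garbage completions, like the one in the edge case above, from reaching downstream systems.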

3. Zero-Shot Code Generation

Since OpenLLaMA is trained on the StarCoder coding dataset, we tested its code generation capabilities with this prompt:

zero shot code generation with OpenLLaMA

It generated this code:

code generation result

The code is syntactically perfect and runs but is functionally wrong. It doesn't parse basic URLs like "https://www.google.com/somepage" correctly. Nonetheless, getting the syntax right is impressive. Like the other tasks, some fine-tuning is recommended, or stronger prompting with frameworks such as ReAct.
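Because generated code can run cleanly and still be wrong, it helps to check it against a trusted reference before using it. The sketch below compares a candidate URL parser with Python's standard urllib.parse on the URL mentioned above. The dictionary shape and the naive_parser stand-in are assumptions made for this example; they are not the code OpenLLaMA produced.

```python
from urllib.parse import urlsplit


def reference_parse(url):
    """Trusted result from the standard library."""
    parts = urlsplit(url)
    return {"scheme": parts.scheme, "host": parts.hostname, "path": parts.path}


def check_candidate(candidate_parser, urls):
    """Return the URLs on which the candidate disagrees with the reference."""
    return [u for u in urls if candidate_parser(u) != reference_parse(u)]


def naive_parser(url):
    """Hypothetical stand-in for generated code: it never separates the path."""
    scheme, rest = url.split("://", 1)
    return {"scheme": scheme, "host": rest, "path": ""}


test_urls = ["https://www.google.com/somepage"]
failures = check_candidate(naive_parser, test_urls)
print(failures)
```

A handful of such checks is usually enough to catch code that only looks right, and the same harness can be rerun after fine-tuning or prompt changes.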

We next examine fine-tuning of OpenLLaMA.

OpenLLaMA Fine-Tuning and Inference Using MosaicML

To get good results from general-purpose LLMs like OpenLLaMA or DollyV2, you almost always need to fine-tune them on your custom datasets for your specific language tasks. For example, if you want a custom chatbot for your banking or financial service business, you must fine-tune them on your specific frequently asked questions or sample customer dialogues.

In this section, we explain a fine-tuning and deployment workflow for custom LLMs using the MosaicML platform. MosaicML orchestrates all the infrastructure provisioning, training dataset streaming, and training session monitoring you'd need to easily adapt an LLM to your needs.

Benefits of MosaicML

MosaicML brings several beneficial features:

  • Automated infrastructure provisioning: MosaicML automatically configures the infrastructure you need for LLM training and inference. You don't have to worry about inconveniences like requesting GPU and server quotas from cloud providers, searching for available GPUs in different zones, and deploying the training software on them.
  • Distributed training: MosaicML has built-in support for large-scale distributed training and progress monitoring.
  • Fine-tuned model inference: Push your fine-tuned models into production and make them available to your software and clients via application programming interface (API) endpoints.
  • Data storage integration: MosaicML supports data ingestion from all the major storage clouds.
  • Hyperparameter tuning: MosaicML has built-in support for optimal hyperparameter searching.

MosaicML is pretty awesome and really does make it easy to train models. One of the most difficult things we work on with clients is the actual deployment of these models. Everyone is eager to train their own LLM for their specific use case, but nobody wants to talk about deployment! Deploying these models and managing the required infrastructure on your own is challenging. MosaicML lets you do that at costs very similar to AWS.

How to Fine-Tune OpenLLaMA Using MosaicML

This section is an end-to-end walkthrough of the MosaicML fine-tuning workflow for your LLM.

1. Set Up Your Training Data

Upload your training data to cloud storage like AWS S3, Azure, or any popular S3-compatible provider.

2. Create a Script to Modify Your Data

Convert your training data into a format that MosaicML's Streaming framework can ingest, so that it can be streamed during training from whichever cloud holds it.
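A conversion script might look roughly like the sketch below. It assumes your data is a JSON Lines file with a "text" field and that the mosaicml-streaming package is installed; the function names, file paths, and column layout are assumptions for this example. The MDSWriter arguments shown match the Streaming releases we are aware of, so check them against the version you install.

```python
import json


def iter_samples(lines, text_field="text"):
    """Yield one {'text': ...} sample per non-empty JSON line."""
    for line in lines:
        line = line.strip()
        if line:
            yield {"text": json.loads(line)[text_field]}


def write_mds(samples, out_dir):
    """Write samples as MDS shards that Streaming can read back."""
    # Imported lazily: requires `pip install mosaicml-streaming`.
    from streaming import MDSWriter

    columns = {"text": "str"}  # one string column per sample
    with MDSWriter(out=out_dir, columns=columns, compression="zstd") as writer:
        for sample in samples:
            writer.write(sample)


# Typical use (hypothetical paths):
#   with open("faq.jsonl", encoding="utf-8") as f:
#       write_mds(iter_samples(f), "./faq-mds")
```

After the shards are written, upload the output directory to the bucket from step 1 so that training jobs can stream from it.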

3. Create a Docker Image for Running EasyLM

MosaicML orchestrates distributed training using Docker containers. Since OpenLLaMA training requires the EasyLM software, create a Docker image that can run the EasyLM training script.

4. Create a MosaicML Configuration for OpenLLaMA

Create a configuration file similar to the one shown below, changing the details to match your environment and Docker images.

MosaicML configuration file (Source: MosaicML)

Infrastructure provisioning is as simple as two or three lines in that file. How many GPUs or TPUs do you need? And what type? MosaicML handles the rest behind the scenes.

The command to fine-tune an OpenLLaMA model (include it in the "command:" section of the configuration file) would look something like this:

EasyLM fine-tuning (Source: EasyLM)

5. Start the Fine-Tuning Run

Use the MosaicML command-line utility to start the fine-tuning run:

fine-tuning start on OpenLLaMA

6. Monitor the Fine-Tuning

Use the same utility to monitor your runs:

run of the mosaicML train job

7. Access the Trained Model

Once the run ends, MosaicML uploads the trained model to the Hugging Face Hub or another repository. The next section covers how you can deploy your fine-tuned model.

Deploy Your OpenLLaMA Model Using MosaicML

You can also deploy your fine-tuned model to production with these steps.

1. Create a Deployment Configuration

The deployment configuration file tells MosaicML details like:

  • Which model should it deploy?
  • What infrastructure is required?

An example deployment configuration is shown below (for another model; change it to match OpenLLaMA):

Deployment configuration file (Source: MosaicML)

2. Deploy the Model Using MosaicML

Use the command-line utility to deploy the model. MosaicML does all the provisioning needed to publish it:

deployment command

The model is automatically deployed and made available at an endpoint.

3. List Your Deployments

MosaicML gives each deployed model a unique name, which you need in order to send requests to the model. Run the utility to list all the deployments:

get deployment command

You'll see your deployments listed:

deployment list in MosaicML

4. Run Inference

You can submit prompts to the deployed model from your applications. The deployed model is identified by its MosaicML name. Use the following code to submit user prompts and get completions from the LLM:

run results

Use Open-Source LLMs Like OpenLLaMA to Streamline Your Business Workflows

In this article, we saw that OpenLLaMA's results are far from perfect. Using strategies like fine-tuning and prompt optimization, we specialize in integrating LLMs into your business workflows to help your employees and customers ask natural language questions and get informative answers and insights. Contact us for a free consultation.


  • Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, Guillaume Lample (2023). "LLaMA: Open and Efficient Foundation Language Models." arXiv:2302.13971 [cs.CL]. https://arxiv.org/abs/2302.13971
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin (2017). "Attention Is All You Need." arXiv:1706.03762 [cs.CL]. https://arxiv.org/abs/1706.03762