A Practical Guide to Training an Open Source LLM on MosaicML

Matt Payne
November 2, 2023

Training a large language model (LLM) from scratch is not a trivial undertaking. The volume of data and the computing resources required involve all kinds of inherent and emergent complexities.

In this article, find out how to train an open source LLM on MosaicML, a platform that specializes in training LLMs and managing their training complexities.

What Is MosaicML?

MosaicML is a platform, with accompanying software, for efficiently training large language models and deploying them for inference. It provides all the logic and tools you need to set up massively distributed training runs on very large datasets.

What LLM Problems Does MosaicML Solve?

Training LLMs involves several problems like:

  • GPU availability
  • Scaling out
  • Stack complexity
  • Fault tolerance

MosaicML addresses all these problems with features like:

  • Automated infrastructure scaling: MosaicML automatically configures the infrastructure you need for LLM training and inference. You don't have to worry about inconveniences like requesting GPU and server quotas from cloud providers, searching for available GPUs in different zones, and deploying the training software on them.
  • Multiple cloud support: MosaicML is cloud-agnostic because it can run in any Kubernetes cluster. It supports popular public clouds like AWS, GCP, Azure, and CoreWeave. It can also run on your on-premises infrastructure.
  • Distributed training: MosaicML has built-in support for large-scale distributed training and progress monitoring.
  • Fine-tuned model inference: Push your fine-tuned models into production and make them available to your software and clients via application programming interface (API) endpoints.
  • Data storage integration: MosaicML supports data ingestion from all the major storage clouds.
  • Hyperparameter tuning: MosaicML has built-in support for searching for optimal hyperparameters.

MosaicML is pretty awesome and really does make it very easy to train a model. One of the most difficult things we work on with clients is the actual deployment of these models. Everyone is eager to train their own LLM for their specific use case, but nobody wants to talk about deployment! Deploying these models and managing the required infrastructure on your own is challenging. MosaicML lets you do that at costs very similar to AWS.

MPT-30B — MosaicML's Flagship Open-Source LLM

To understand MosaicML's capabilities, you should study the MPT-30B family of foundation models that were trained from scratch using MosaicML.

MPT-30B Features

MPT-30B is a decoder-only, transformer-based, autoregressive, causal language model. Its features include:

  • Permissive license for commercial uses: Unlike the first-generation Llama or its derivatives like Vicuna, MPT-30B is licensed under Apache 2.0, making it genuinely open-source and unencumbered for commercial uses. However, keep in mind that the fine-tuned MPT-30B-Instruct and MPT-30B-Chat models have other licenses that may not be ideal for commercial usage.
  • Large context lengths using ALiBi: Instead of positional embeddings, which tend to limit context lengths at inference time, MPT-30B uses the Attention with Linear Biases (ALiBi) technique to support 8,192-token context lengths.
  • Flash attention optimization: MPT-30B training uses the flash attention technique to optimize attention calculations for GPUs.
  • FasterTransformer for inference: Its code supports NVIDIA's FasterTransformer library for faster inference.
  • Inference on a single GPU: The 30-billion-parameter size is chosen to fit on a single NVIDIA A100 80GB data center GPU, which is one of the most popular cloud GPU offerings out there.
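
To see why a 30-billion-parameter model suits a single 80GB GPU, here is a rough back-of-the-envelope estimate of the memory needed for the weights alone. It is only a sketch: it assumes a round 30 billion parameters, 2 bytes per parameter for 16-bit weights (1 byte for 8-bit), and it ignores activations and the key-value cache, which need extra headroom. The function name is made up for illustration.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int) -> float:
    """Approximate memory needed just to hold the model weights, in GB (10^9 bytes)."""
    return num_params * bytes_per_param / 1e9


params = 30e9  # assumed round parameter count for MPT-30B

fp16_gb = weight_memory_gb(params, 2)  # 16-bit weights: 2 bytes per parameter
int8_gb = weight_memory_gb(params, 1)  # 8-bit weights: 1 byte per parameter

print(f"16-bit weights: ~{fp16_gb:.0f} GB")
print(f"8-bit weights:  ~{int8_gb:.0f} GB")
```

Under these assumptions, 16-bit weights take about 60 GB, which leaves roughly 20 GB of an 80GB card for everything else.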

MPT-30B Demo

MosaicML reports that MPT-30B outperforms OpenAI's GPT-3 on a variety of natural language generation benchmarks:

MPT-30B vs. GPT-3 scores (Source: MosaicML)

We qualitatively assessed MPT-30B's replies for common use cases: summarization, customer service chat, creative writing, and code generation. It did well on the first three tasks but not on code generation.

The abstract summaries it generates are excellent. In this example, it succinctly summarized the abstract of a research paper:

Abstract summarization with MosaicML

It summarized an entire play, thanks to its 8,192-token context length:

Play summarization with Open Source LLMs

Its conversation capability, a necessity for customer service chatbots, is also good as shown in this example asking for details about banking products:

MosaicML chatbot

Its creativity impresses in this poignant poem:

However, the programming code it generated wasn't generally impressive:

code generation with MosaicML

While the code it produced was syntactically correct, it was usually functionally incorrect.

MPT-30B Training Dataset

We get an idea of the scale of the training from the datasets MPT-30B was trained on. They consisted of 1 trillion tokens spanning the following large datasets among others:

dataset for training open source LLMs

How to Train MPT-30B From Scratch on MosaicML

In this section, we explain how you can train a massive LLM like the MPT-30B from scratch on the MosaicML platform. We'll explain MosaicML's foundational components like Composer as well as convenience helpers like llm-foundry.

1. Infrastructure Planning

MosaicML manages all infrastructure using Kubernetes (K8s) container orchestration. Since K8s can be deployed on any public cloud or on-prem infrastructure, MosaicML is effectively cloud-agnostic.

You can create any number of K8s clusters with as many GPUs and nodes as your company needs. MosaicML then takes care of right-sizing its infrastructure requests based on the volume of training data.

As a reference point, let's see the training infrastructure needs of the original MPT-30B model:

  • First run on 440 A100 40GB GPUs with a batch size of 1760
  • Second run on 216 A100 40GB GPUs with a batch size of 1728
  • Final run on 256 H100 80GB GPUs with a batch size of 512, an 8,192-token context length, and 50 billion tokens

These GPU numbers are mind-blowing! The three training runs reportedly took two months to finish. Even though training wasn't round-the-clock but was done in three intermittent sessions, that's still an incredible number of GPUs.
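
To put the final run in perspective, the sketch below estimates how many tokens a single batch holds and how many optimizer steps 50 billion tokens would take. It rests on two assumptions that the figures above don't spell out: that the batch size counts sequences, and that every sequence is packed to the full 8,192 tokens.

```python
import math

batch_size = 512        # sequences per batch in the final run (assumed unit)
context_length = 8192   # tokens per sequence
total_tokens = 50e9     # tokens processed in the final run
num_gpus = 256          # H100 GPUs in the final run

tokens_per_batch = batch_size * context_length
steps = math.ceil(total_tokens / tokens_per_batch)
sequences_per_gpu = batch_size / num_gpus

print(f"tokens per batch:  {tokens_per_batch:,}")
print(f"optimizer steps:   {steps:,}")
print(f"sequences per GPU: {sequences_per_gpu:g}")
```

Under these assumptions, each batch carries about 4.2 million tokens, and each GPU handles two full-length sequences per step.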

While MosaicML provides infrastructure for model deployment, it doesn't provide the infrastructure for training. You'll have to provision the GPUs yourself. One option is a public cloud like AWS or Azure, with the accompanying bureaucracy such as requesting quota increases. However, the chances of getting dozens of GPUs there are quite low due to global GPU shortages. A better strategy is to use a specialized GPU cloud like CoreWeave.

2. Prepare Your Training Data

From the dataset table above, it's evident that you'll need anywhere from a few hundred gigabytes to terabytes of training data for such LLMs.

MosaicML can integrate with popular cloud storage solutions like:

  • Amazon S3
  • Google Cloud Storage
  • Azure Blob Storage
  • Any S3-compatible object store
  • Cloudflare R2
  • Backblaze B2

Transfer your training data to any of these storage providers.

3. Set Up Data Sharding

Data sharding on MosaicML (Source: MosaicML Streaming)

The next step is to configure sharding for your massive training data for multi-node distributed training across an entire cluster.

For this, MosaicML provides the StreamingDataset abstraction layer that efficiently shards your data from any supported storage and supplies it to each node in the cluster.
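
To make the idea of sharding concrete, here is a minimal conceptual sketch in plain Python of how a set of shard files could be divided among cluster nodes so that each node only downloads its own share. It is not StreamingDataset's actual partitioning algorithm; the round-robin rule and the function name are assumptions chosen for simplicity.

```python
def assign_shards(num_shards: int, num_nodes: int) -> dict[int, list[int]]:
    """Assign shard indices to nodes round-robin (illustration only)."""
    assignment = {node: [] for node in range(num_nodes)}
    for shard in range(num_shards):
        assignment[shard % num_nodes].append(shard)
    return assignment


# Hypothetical numbers: 10 shard files spread over 3 nodes.
plan = assign_shards(num_shards=10, num_nodes=3)
for node, shards in plan.items():
    print(f"node {node} reads shards {shards}")
```

Every shard is read by exactly one node, and the per-node load differs by at most one shard.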

4. Create a MosaicML Training Configuration

A training YAML file provides essential details like the following:

  • Infrastructure resources to provision
  • Location of the training data
  • Training and optimizer algorithms to enable

For example, here's a configuration file for an MPT-1B (1 billion parameters) model:

You can see that it mentions the preferred type of GPUs and the number of GPUs to allocate for this training run. It also describes the commands to execute the training run.

Similarly, the llm-foundry project also provides a deployment configuration for MPT-30B and many other models.

5. Configure Training Optimization Algorithms

MosaicML provides a library of techniques to improve or speed up your training. You can simply configure these techniques in your deployment YAML. Some notable techniques include:

  • ALiBi to enable longer context windows compared to positional embeddings
  • FusedLayerNorm to improve GPU utilization
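
As an illustration of what ALiBi does, the sketch below computes the per-head slopes and the bias that is added to the attention scores before the softmax. It is written in plain Python for readability and is not MosaicML's implementation. It assumes the number of heads is a power of two, and it folds the causal mask into the same matrix by setting future positions to negative infinity.

```python
def alibi_slopes(num_heads: int) -> list[float]:
    """Per-head slopes: a geometric sequence starting at 2^(-8 / num_heads).

    This simple form assumes num_heads is a power of two.
    """
    start = 2 ** (-8 / num_heads)
    return [start ** (head + 1) for head in range(num_heads)]


def alibi_bias(seq_len: int, slope: float) -> list[list[float]]:
    """Bias for one head: a penalty that grows linearly with query-key distance.

    Entry [q][k] is slope * (k - q) for keys at or before the query, and
    -inf for future keys (the causal mask).
    """
    return [
        [slope * (k - q) if k <= q else float("-inf") for k in range(seq_len)]
        for q in range(seq_len)
    ]


slopes = alibi_slopes(8)
bias = alibi_bias(4, slopes[0])

print(slopes)
for row in bias:
    print(row)
```

Because the bias depends only on the distance between tokens rather than on a learned embedding per position, the same rule applies to sequences longer than those seen during training.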

6. Start the Training Run

Use the mcli command to start the training run based on a given deployment configuration:

The mcli tool initiates the training and prints its progress:

7. Scaling Up and Scaling Out

You can scale up to use more GPUs on the same node, if available. Either modify the "gpus" parameter in the configuration or override it from the command line like this:

MosaicML scales out automatically if the requested number of GPUs exceeds the per-node maximum of the cluster. In this example, if each node has a maximum of eight GPUs, MosaicML automatically provisions three nodes and distributes the training across them:
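
The scale-out rule described above amounts to a ceiling division. The sketch below assumes a hypothetical request of 24 GPUs, which is one value consistent with three nodes of eight GPUs each; the function name is illustrative and not part of MosaicML's tooling.

```python
import math


def nodes_needed(requested_gpus: int, gpus_per_node: int) -> int:
    """Smallest number of nodes that can supply the requested GPUs."""
    return math.ceil(requested_gpus / gpus_per_node)


print(nodes_needed(8, 8))    # fits on one node
print(nodes_needed(24, 8))   # hypothetical 24-GPU request
```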

8. Monitor the Training Run

Use mcli to monitor your runs:

You can also monitor training metrics by enabling TensorBoard logging:

TensorBoard monitoring

9. Early Stopping

Using early stopping, you can configure thresholds so that if there are no improvements in key metrics between epochs, the training session is automatically stopped.
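
The logic behind early stopping can be sketched in a few lines of plain Python. This is a generic patience-based rule, not MosaicML's implementation: the names `patience` and `min_delta` and the validation losses are assumptions for illustration. Training stops once the most recent `patience` epochs fail to beat the best earlier loss by more than `min_delta`.

```python
def should_stop(history: list[float], patience: int, min_delta: float = 0.0) -> bool:
    """True when the last `patience` losses did not improve on the earlier best."""
    if len(history) <= patience:
        return False  # not enough epochs yet to judge
    best_before = min(history[:-patience])
    recent_best = min(history[-patience:])
    return recent_best > best_before - min_delta


def epochs_run(losses: list[float], patience: int, min_delta: float = 0.0) -> int:
    """Number of epochs completed before the stopping rule fires."""
    for epoch in range(1, len(losses) + 1):
        if should_stop(losses[:epoch], patience, min_delta):
            return epoch
    return len(losses)


# Hypothetical validation losses, one per epoch.
val_losses = [2.0, 1.5, 1.2, 1.21, 1.19, 1.2, 1.22, 1.25]
stopped_at = epochs_run(val_losses, patience=3, min_delta=0.05)
print(f"stopped after epoch {stopped_at}")
```

A larger `patience` tolerates longer plateaus, while a larger `min_delta` demands bigger improvements before the run is allowed to continue.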

10. Saving Checkpoints and Trained Models

As the training is going on, MosaicML creates checkpoints as specified in your configuration:

Checkpointing configuration (Source: MosaicML)

MosaicML stores the checkpoints in your local filesystem, a cloud storage endpoint, or the Hugging Face Hub, depending on what you've configured. You can use these checkpoints for inference.

How to Fine-Tune MPT-30B Using MosaicML

Fine-tuning an existing model follows the same training steps above. The only difference is that, in your configuration file, you specify the "pretrained" flags that point to the existing model:

You can see an example of fine-tuning MPT-30B for instruction-following in the mpt-30b-instruct configuration.

Deploy Your Model for Inference

After training or fine-tuning, MosaicML enables you to deploy your model either on its infrastructure or your preferred cloud infrastructure, depending on your data privacy, data security, and other compliance requirements.

MosaicML's starter edition enables you to use one of the following models as-is, without any fine-tuning, hosted on MosaicML's infrastructure behind an API endpoint:

  • MPT-7B-Instruct
  • MPT-30B-Instruct
  • Llama2-70B-Chat

For your own fine-tuned models, you need the MosaicML enterprise edition.

Use these steps to deploy your model:

1. Create a Deployment Configuration

The deployment configuration file tells MosaicML details like:

  • Which model should it deploy?
  • What infrastructure is required?
  • How many replicas do you need for redundancy?

An example deployment configuration is shown below:

2. Deploy the Model Using MosaicML

Use the command-line utility to deploy the model. MosaicML does all the provisioning needed to publish it:

The model is automatically deployed and made available at an endpoint.

3. List Your Deployments

Each deployed model is given a unique name by MosaicML. You need that to send requests to the model. Run the utility to list all the deployments:

You'll see your deployments listed:

4. Run Inference

You can submit prompts to the deployed model from your applications. The deployed model is identified by its MosaicML name. Use the following code to submit user prompts and get completions from the LLM:
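
As a minimal sketch of what such a call might look like, the snippet below builds an HTTP POST request using only Python's standard library. The endpoint URL, the API key, the `Authorization` header, and the JSON field names (`inputs`, `parameters`, `max_new_tokens`, `temperature`) are all assumptions for illustration; check them against the documentation for your own deployment. The request is only constructed here, not sent.

```python
import json
import urllib.request


def build_request(endpoint_url: str, api_key: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a completion request for a deployed model."""
    # Assumed payload schema; adjust the field names to your deployment.
    payload = {
        "inputs": [prompt],
        "parameters": {"max_new_tokens": 100, "temperature": 0.2},
    }
    return urllib.request.Request(
        endpoint_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Authorization": api_key, "Content-Type": "application/json"},
        method="POST",
    )


# Placeholder URL and key: replace with your deployment's values.
request = build_request(
    "https://example.invalid/my-deployment/predict",
    "MY_API_KEY",
    "Summarize the benefits of a savings account in two sentences.",
)
print(request.get_method(), request.full_url)

# To actually send it:
# with urllib.request.urlopen(request) as response:
#     print(json.loads(response.read()))
```

Keeping request construction separate from sending makes the payload easy to inspect and test before any prompt leaves your application.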

Efficiently Train an Open Source LLM on MosaicML

In this article, you learned how an LLM can be either trained from scratch or fine-tuned. You also saw how to deploy these models on cloud or on-prem infrastructure.

We use platforms like MosaicML to help our clients go to market with their advanced services that use LLMs to improve their customer service or optimize their business workflows. Contact us to find out how we can do the same for your business!