Training a large language model (LLM) from scratch is not a trivial undertaking. The volume of data and the computing resources required involve all kinds of inherent and emergent complexities.
In this article, find out how to train an open source LLM on MosaicML, a platform that specializes in training LLMs and managing their training complexities.
What Is MosaicML?
MosaicML is a platform and software suite for efficiently training large language models and deploying them for inference. It provides all the logic and tools you need to set up massively distributed training runs on very large datasets.
What LLM Problems Does MosaicML Solve?
Training LLMs involves several problems like:
GPU availability
Scaling out
Stack complexity
Fault tolerance
MosaicML addresses all these problems with features like:
Automated infrastructure scaling: MosaicML automatically configures the infrastructure you need for LLM training and inference. You don't have to worry about inconveniences like requesting GPU and server quotas from cloud providers, searching for available GPUs in different zones, and deploying the training software on them.
Multiple cloud support: MosaicML is cloud-agnostic because it can run in any Kubernetes cluster. It supports popular public clouds like AWS, GCP, Azure, and CoreWeave. It can also run on your on-premise infrastructure.
Distributed training: MosaicML has built-in support for large-scale distributed training and progress monitoring.
Fine-tuned model inference: Push your fine-tuned models into production and make them available to your software and clients via application programming interface (API) endpoints.
Data storage integration: MosaicML supports data ingestion from all the major storage clouds.
Hyperparameter tuning: MosaicML has built-in support for optimal hyperparameter searching.
MosaicML is pretty awesome and really does make it very easy to train any model. One of the most difficult things we work with clients on is the actual deployment of these models. Everyone is eager to train their own LLM for their specific use case, but nobody wants to talk about deployment! Deploying these models and managing the infrastructure they require is a challenging task on your own. MosaicML lets you do it at costs very similar to AWS.
MPT-30B — MosaicML's Flagship Open-Source LLM
To understand MosaicML's capabilities, you should study the MPT-30B family of foundation models that were trained from scratch using MosaicML.
MPT-30B Features
MPT-30B is a decoder-only, transformer-based, autoregressive, causal language model. Its features include:
Permissive license for commercial uses: Unlike the first-generation Llama or its derivatives like Vicuna, MPT-30B is licensed under Apache 2.0, making it genuinely open source and unencumbered for commercial uses. However, keep in mind that the fine-tuned MPT-30B-Instruct and MPT-30B-Chat models carry other licenses that may not be ideal for commercial usage.
Large context lengths using ALiBi: Instead of positional embeddings, which tend to limit context lengths at inference time, MPT-30B uses the attention with linear biases (ALiBi) technique to achieve 8,192-token context lengths at inference time.
Flash attention optimization: MPT-30B training uses the flash attention technique to optimize attention calculations for GPUs.
FasterTransformer for inference: Its code uses the FasterTransformer optimization for faster inference.
Inference on a single GPU: The 30-billion-parameter size fits on a single Nvidia A100 80GB datacenter GPU, one of the most popular cloud GPU offerings out there.
MPT-30B Demo
MosaicML reports that MPT-30B outperforms OpenAI's original GPT-3 on a variety of natural language generation benchmarks:
We assessed MPT-30B's replies qualitatively for common use cases like summarization, customer service chatbots, creative writing, and code generation. It did well on the first three tasks but not on the last.
The abstract summaries it generates are excellent. In this example, it succinctly summarized the abstract of a research paper:
It summarized an entire play, thanks to its 8,192-token context length:
Its conversation capability, a necessity for customer service chatbots, is also good as shown in this example asking for details about banking products:
Its creativity impresses in this poignant poem:
However, the programming code it generated wasn't generally impressive:
While the code it produced was syntactically correct, it was usually functionally incorrect.
MPT-30B Training Dataset
The datasets MPT-30B was trained on give an idea of the scale involved. They consisted of 1 trillion tokens spanning the following large datasets, among others:
How to Train MPT-30B From Scratch on MosaicML
In this section, we explain how you can train a massive LLM like the MPT-30B from scratch on the MosaicML platform. We'll explain MosaicML's foundational components like Composer as well as convenience helpers like llm-foundry.
1. Infrastructure Planning
MosaicML manages all infrastructure using Kubernetes (K8s) container orchestration. Since K8s can be deployed on any public cloud or on-prem infrastructure, MosaicML is effectively cloud-agnostic.
You can create any number of K8s clusters with as many GPUs and nodes as your company needs. MosaicML then takes care of right-sizing its infrastructure requests based on the volume of training data.
For perspective, MPT-30B itself was trained in three runs:
First run on 440 A100 40GB GPUs with a batch size of 1760
Second run on 216 A100 40GB GPUs with a batch size of 1728
Final run on 256 H100 80GB GPUs with a batch size of 512, an 8,192-token context length, and 50 billion tokens
These GPU numbers are mind-blowing! It reportedly took 2 months to finish these three training runs. Even though training wasn't one continuous round-the-clock job but three intermittent sessions, it's still an incredible number of GPUs.
While MosaicML provides infrastructure for model deployment, it doesn't provide GPUs for training. You'll have to provision GPUs yourself. On public clouds like Amazon or Azure, that comes with bureaucracy like requesting quota increases, and the chances of getting dozens of GPUs are quite low due to global GPU shortages. A better strategy is to use specialized GPU clouds like CoreWeave.
2. Prepare Your Training Data
From the dataset table above, it's evident that you'll need anywhere from a few hundred gigabytes to terabytes of training data for such LLMs.
MosaicML can integrate with popular cloud storage solutions like:
Amazon S3
Google Cloud Storage
Azure Blob Storage
Any S3-compatible object store
Cloudflare R2
Backblaze B2
Transfer your training data to any of these storage providers.
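For example, if your raw data sits on a local disk, you can push it to an S3 bucket with the AWS CLI (the bucket name and paths below are placeholders):

```bash
# Upload the local training corpus to an S3 bucket (placeholder bucket/path)
aws s3 sync ./my-training-data s3://my-llm-training-bucket/raw-data/
```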
3. Shard the Data With StreamingDataset
The next step is to configure sharding for your massive training data for multi-node distributed training across an entire cluster.
For this, MosaicML provides the StreamingDataset abstraction layer that efficiently shards your data from any supported storage and streams it to the cluster nodes.
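As a rough sketch, here's how a training script can consume sharded data with MosaicML's streaming library; the remote path and local cache directory are placeholders, and the data must first be converted to the MDS shard format (for example, with the conversion scripts shipped in llm-foundry):

```python
from torch.utils.data import DataLoader
from streaming import StreamingDataset

# Stream MDS shards from cloud storage, caching them locally on each node.
# The remote path and local cache directory below are placeholders.
dataset = StreamingDataset(
    remote="s3://my-llm-training-bucket/mds-shards/",
    local="/tmp/streaming-cache",
    shuffle=True,
    batch_size=8,
)

# StreamingDataset plugs into a standard PyTorch DataLoader.
dataloader = DataLoader(dataset, batch_size=8)
```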
4. Create the Training Run Configuration
Next, describe the training run in a configuration file. For example, here's what a configuration for an MPT-1B (1 billion parameters) model can look like:
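The sketch below is illustrative rather than MosaicML's exact published file; the image tag, cluster name, and script path are assumptions based on the public llm-foundry examples:

```yaml
name: mpt-1b-pretrain
image: mosaicml/pytorch:latest       # container image (tag is a placeholder)
compute:
  gpus: 8                            # number of GPUs to allocate
  gpu_type: a100_40gb                # preferred GPU type
  cluster: my-k8s-cluster            # the cluster registered with MosaicML
integrations:
  - integration_type: git_repo
    git_repo: mosaicml/llm-foundry   # pull in the llm-foundry training code
    pip_install: -e .[gpu]
command: |
  cd llm-foundry/scripts
  composer train/train.py train/yamls/pretrain/mpt-1b.yaml
```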
You can see that it mentions the preferred type of GPUs and the number of GPUs to allocate for this training run. It also describes the commands to execute the training run.
5. Configure Training Speedup Techniques
MosaicML provides a library of techniques to improve or speed up your training. You can simply configure these techniques in your training YAML (see the sketch after this list). Some notable techniques include:
ALiBi to enable longer context windows compared to positional embeddings
FusedLayerNorm to improve GPU utilization
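As a sketch, such techniques can be switched on through an algorithms section of the training YAML; the key names and parameters below are assumptions, so check the Composer and llm-foundry documentation for the options your versions support:

```yaml
algorithms:
  alibi:
    max_sequence_length: 8192   # longer context windows via ALiBi
  fused_layernorm: {}           # fused GPU kernel for LayerNorm to improve utilization
```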
6. Start the Training Run
Use the mcli command to start the training run based on a given run configuration:
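For example, assuming the run configuration above is saved as mpt-1b-pretrain.yaml:

```bash
# Submit the training run described in the YAML file
mcli run -f mpt-1b-pretrain.yaml
```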
The mcli tool initiates the training and prints its progress as the run executes.
7. Scaling Up and Scaling Out
You can scale up to use more GPUs on the same node, if available. Either modify the "gpus" parameter in the configuration or override it from the command line like this:
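Here's a sketch of both options; the command-line override flag is an assumption, so check mcli run --help for the exact option your mcli version supports:

```bash
# Option 1: edit the run configuration
#   compute:
#     gpus: 16
# Option 2: override on the command line (flag name assumed)
mcli run -f mpt-1b-pretrain.yaml --gpus 16
```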
MosaicML scales out automatically if the requested number of GPUs exceeds the per-node maximum of the cluster. For example, if each node has a maximum of eight GPUs and you request 24, MosaicML automatically provisions three nodes and distributes the training across them:
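A sketch of the compute section for such a request:

```yaml
compute:
  gpus: 24              # more than the 8 GPUs available per node
  gpu_type: a100_40gb
# With 8 GPUs per node, MosaicML schedules this run across 3 nodes.
```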
8. Monitor the Training Run
Use mcli to monitor your runs:
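For example, you can list your runs and stream the logs of a specific run (the run name below is a placeholder):

```bash
# List all runs and their statuses
mcli get runs

# Stream the logs of a specific run
mcli logs mpt-1b-pretrain-xyz123
```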
You can also monitor training metrics by enabling TensorBoard logging:
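A minimal sketch of a TensorBoard logger in the training YAML; the exact keys depend on your Composer and llm-foundry versions, so treat them as assumptions:

```yaml
loggers:
  tensorboard:
    log_dir: ./tb_logs    # where TensorBoard event files are written
```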
TensorBoard monitoring
9. Early Stopping
Using early stopping, you can configure thresholds so that if there are no improvements in key metrics between epochs, the training session is automatically stopped.
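As a sketch, Composer ships an EarlyStopper callback for this; the YAML keys below are assumptions, so consult the Composer documentation for the exact parameters:

```yaml
callbacks:
  early_stopper:
    monitor: loss            # metric to watch
    dataloader_label: eval   # evaluate the metric on the validation split
    patience: 2              # stop after 2 evaluations without improvement
```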
10. Saving Checkpoints and Trained Models
As the training is going on, MosaicML creates checkpoints as specified in your configuration:
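For example, the checkpointing keys in the training YAML can look like this (the storage path is a placeholder):

```yaml
save_folder: s3://my-llm-training-bucket/checkpoints/mpt-1b   # where checkpoints are written
save_interval: 1000ba                                         # save every 1,000 batches
save_num_checkpoints_to_keep: 3                               # keep only the 3 most recent
```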
MosaicML stores the checkpoints in your local filesystem, a cloud storage endpoint, or the Hugging Face Hub repository that you've configured. You can use these checkpoints for inference.
How to Fine-Tune MPT-30B Using MosaicML
Fine-tuning an existing model follows the same training steps above. The only difference is that your configuration file tells the trainer to start from the published pretrained weights (via the "pretrained" model settings) instead of initializing the model from scratch:
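A sketch of the relevant model and data sections, loosely based on llm-foundry's fine-tuning examples (the dataset name is a placeholder, and exact keys may differ across versions):

```yaml
model:
  name: hf_causal_lm
  pretrained: true                                   # start from published weights
  pretrained_model_name_or_path: mosaicml/mpt-30b    # rather than random initialization

train_loader:
  name: finetuning
  dataset:
    hf_name: my-org/my-finetuning-dataset            # placeholder fine-tuning dataset
```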
After training or fine-tuning, MosaicML enables you to deploy your model either on its infrastructure or your preferred cloud infrastructure, depending on your data privacy, data security, and other compliance requirements.
MosaicML's starter edition enables you to deploy a model on MosaicML's infrastructure and obtain an API endpoint, but only if you're using one of these base models without any fine-tuning:
MPT-7B-Instruct
MPT-30B-Instruct
Llama2-70B-Chat
For your own fine-tuned models, you need the MosaicML enterprise edition.
Use these steps to deploy your model:
1. Create a Deployment Configuration
The deployment configuration file tells MosaicML details like:
Which model should it deploy?
What infrastructure is required?
How many replicas do you need for redundancy?
An example deployment configuration is shown below:
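A minimal sketch of such a deployment YAML; the exact schema comes from MosaicML's inference documentation, so treat the keys below as assumptions:

```yaml
name: mpt-30b-finetuned-prod          # deployment name
compute:
  gpus: 1
  gpu_type: a100_80gb                 # MPT-30B fits on a single 80GB A100 for inference
replicas: 2                           # two replicas for redundancy
model:
  checkpoint_path: s3://my-llm-training-bucket/checkpoints/mpt-30b-finetuned/latest   # placeholder
```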
2. Deploy the Model Using MosaicML
Use the command-line utility to deploy the model. MosaicML does all the provisioning needed to publish it:
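Assuming the deployment configuration above is saved as mpt-30b-deploy.yaml:

```bash
# Create the inference deployment described in the YAML file
mcli deploy -f mpt-30b-deploy.yaml
```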
The model is automatically deployed and made available at an endpoint.
3. List Your Deployments
Each deployed model is given a unique name by MosaicML. You need that to send requests to the model. Run the utility to list all the deployments:
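For example:

```bash
# List all inference deployments, their assigned names, and their statuses
mcli get deployments
```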
You'll see your deployments listed:
4. Run Inference
You can submit prompts to the deployed model from your applications. The deployed model is identified by its MosaicML name. Use the following code to submit user prompts and get completions from the LLM:
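Here's a minimal sketch using Python's requests library; the endpoint URL, authorization header, and request/response schema are assumptions for illustration, so check your deployment's details in MosaicML for the exact format:

```python
import os

import requests

# Placeholder endpoint; substitute the URL shown for your deployment.
ENDPOINT = "https://example-mpt-30b-finetuned-prod.inference.example.com/predict"


def get_completion(prompt: str) -> str:
    """Send a prompt to the deployed model and return its completion."""
    response = requests.post(
        ENDPOINT,
        headers={"Authorization": os.environ["MOSAICML_API_KEY"]},  # assumed auth scheme
        json={
            "inputs": [prompt],                                     # assumed request schema
            "parameters": {"temperature": 0.2, "max_new_tokens": 256},
        },
        timeout=120,
    )
    response.raise_for_status()
    return response.json()["outputs"][0]                            # assumed response schema


print(get_completion("Summarize the key features of MPT-30B in two sentences."))
```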
Efficiently Train an Open Source LLM on MosaicML
In this article, you learned how an LLM can either be trained from scratch or fine-tuned on MosaicML. You also saw how you can deploy these models on cloud or on-prem infrastructure.
We use platforms like MosaicML to help our clients go to market with their advanced services that use LLMs to improve their customer service or optimize their business workflows. Contact us to find out how we can do the same for your business!