5 Steps To Production Level GPT-3 Language Translation Software

Matt Payne
August 21, 2023
Let’s take a look at the 5 easy steps to start using a production level GPT-3 language translation pipeline for any use case. 

Human language translation with artificial intelligence was long reserved for those who either worked on Google Translate or had a ton of translation data across all languages to train a large language model. Before the rise of transformer-based neural network models, you were stuck building LSTM-based machine learning models that were much slower and less accurate. Transformer-based models have grown in popularity for a number of natural language processing tasks, including language translation, and if you have the data required to train architectures such as T5 you can make progress over time. 

GPT-3 (Generative Pre-trained Transformer 3) gives you the ability to create language translation software with any amount of training data and get extremely good results right out of the box. The underlying GPT-3 model was trained on billions of words from sources such as Wikipedia and learned to predict the next token in a sequence with its 175 billion parameters (it is no longer the largest language model). Through prompt-based programming you can turn the task-agnostic autoregressive model into a language translation model that picks up on the task pretty well. 

In this article, we’ll take a look at how we build production-level language translation models that can support any number of languages and any amount of training data, to be as flexible as possible. After all, the key benefit of GPT-3 is its ability to pick up on a task from just a few examples of what to do, instead of having to retrain huge language models. Here are the steps to take to reach a production-level architecture. 

Step 1: Data & Task Understanding for Machine Translation

Building an understanding of a few high-level ideas related to your specific use case is incredibly important for working out the requirements you need to build toward. Having even a high-level answer to these questions makes it much smoother to move from simple GPT-3 prompts to a production-level product. Most of these questions apply to almost any generative pre-trained transformer architecture.

How Many Languages Do We Want To Support?

Understanding how many languages you want to support out of the gate is vital. Not only do your input and output systems need to support each language, but from a data variance standpoint we will make decisions based on the answer. The more languages you want to support, the more data variance your model must cover. Data variance is a high-level way of describing how many different variations of the problem could be used as input to the model. This includes the number of languages, the length of the input text, the number of translations to do at one time, and so on. 

GPT-3 playground example of translating to multiple languages.

How Much Training Data Do We Have?

Europarl is a popular machine translation dataset used for tasks such as this one.

The path we take below when actually building our GPT-3 prompt for language translation will to some extent be determined by how much training data we already have. While GPT-3 can get started with any number of translation examples, we usually cover more data variance with more examples that show the model how to complete the task. As we can see in the graphic below, GPT-3 becomes much more accurate as the number of relevant examples in our prompt increases. This is because we have now shown GPT-3 a variety of different inputs and how to complete the task correctly for each. The more data we have to work with, the easier it is to support different languages and input sources at production accuracy. 

How Many Data Sources Do We Have and How Much Do They Differ?

Your data variance can grow exponentially as you increase the number of data sources that will be running through your GPT-3 model. As you can imagine, translating a single question requires a different level of understanding from GPT-3 than translating an entire research paper. While you can build an architecture that supports both, it’s good to understand this when you start so you don’t build a model that is tightly fit to its training data. 

Step 2: Select a GPT-3 Engine


GPT-3 has a number of different underlying engines that you can choose from that can perform tasks like this at a number of different levels. The general idea is that as the models get more expensive per run their ability to understand tasks becomes better. 

Davinci is by far the most popular model used today for almost any GPT-3 task. It is the most capable model and has shown the ability to perform tasks at higher accuracy and with less instruction. This means it needs fewer language examples on average and does not need instructions that are as strongly worded to understand the task. Davinci also offers an instruct version that has been fine-tuned on instruction-style text, giving it an even better understanding of prompt-style language. At 175 billion parameters, Davinci was once the biggest language model. 

Curie is the next model in line and is about 1/10 the cost of Davinci. While it is generally less powerful than Davinci on most tasks, the Curie model is still strong at tasks such as language translation. This is mostly due to the fact that translation is an easier task than things like complex intent, cause and effect, and key topic extraction. The model is very fast compared to Davinci and is worth looking at as an option when balancing cost vs accuracy. 

We almost always recommend using the Davinci model if you want to reach production level accuracy and data variance coverage. As your product lifecycle moves along and you’ve gathered more translation examples you can look to Curie if cost is an issue. Given that both models can be fine-tuned to reduce per run costs there isn’t as much benefit long term to choosing Curie. 

Step 3: Build Prompt Framework for Running GPT-3


GPT-3 uses a text prompt to allow you to instruct the model on what the task is that you’re trying to accomplish. The framework for this in production includes:

  1. Text header is used to initiate the model’s understanding of the task. It helps steer the model toward the language translation task and takes advantage of initial token bias.
  2. Prompt examples are used to help steer the model towards what we consider to be correct. There are many language tasks where the correct result is relative to the user. Different readers of a summarization of a book might argue over what information should be included. These prompt examples will help GPT-3 get an even tighter grasp on the language translation task. 
  3. Prompt instructions are the instructions used to tell GPT-3 what to generate at the end of the text. This will be what we use to decide what language to translate to, how many languages to translate to, and what the general task is. 
  4. Input text is the original text we want to translate. 
Playground example that shows all of the above prompt variables
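
To make the four prompt variables above concrete, here is a minimal sketch of how they might be assembled into a single prompt string. The `TranslationExample` class, the `build_prompt` function, and the exact label layout are illustrative assumptions, not the format used in the playground example or in any particular production system.

```python
from dataclasses import dataclass


@dataclass
class TranslationExample:
    # Hypothetical container for one few-shot translation example.
    source_lang: str
    target_lang: str
    source_text: str
    target_text: str


def build_prompt(header, examples, target_langs, input_text):
    """Assemble header, prompt examples, prompt instructions, and input text."""
    parts = [header.strip(), ""]
    # Prompt examples: each one shows a source sentence and its translation.
    for ex in examples:
        parts.append(f"{ex.source_lang}: {ex.source_text}")
        parts.append(f"{ex.target_lang}: {ex.target_text}")
        parts.append("")
    # Prompt instructions: which language(s) to translate to.
    parts.append(f"Translate the following text to {', '.join(target_langs)}.")
    # Input text: the text we actually want translated.
    parts.append(f"Text: {input_text.strip()}")
    # End on the first target language label so the model continues from there.
    parts.append(f"{target_langs[0]}:")
    return "\n".join(parts)


examples = [TranslationExample("French", "Chinese", "Bonjour", "你好")]
prompt = build_prompt(
    "The following is a professional translation service.",
    examples,
    ["Chinese"],
    "Où est la gare ?",
)
print(prompt)
```

Keeping each variable as a separate argument makes it easier to split test one of them (for example, the header wording) while holding the others fixed.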

The key component of this step is building out prompt language that gives GPT-3 the best ability to understand our task for a high level of data variance. In situations where we have a ton of translation examples of many different languages or data sources, the underlying GPT-3 model can rely on just the prompt examples without needing to understand the instructions as much. For the most part, it’s very difficult to cover a huge range of high accuracy language translations at scale without well-refined prompt language that gives GPT-3 a great shot at producing a quality result in tough situations. 

As the data variance you want to support grows, your prompt will be strained and tested much more often, which can result in users receiving poor results back. Unlike other GPT-3 tasks such as marketing copy generation, you don’t ever want to give back poor results, as it’s difficult for the end user to know your results are bad. How are they supposed to know the translation doesn’t make any sense? We spend a ton of time optimizing the prompt variables to reduce the likelihood of a poor translation, whether the input text is 20 words or 5,000.

Step 4: Fine-Tuning & Prompt Optimization

We use the machine learning model SBERT as a part of our prompt optimization algorithms

Playground examples of language translation are good and all, but does that really work in production? If GPT-3 has a token limit on how many translation examples you can include in a prompt, how do we support thousands of languages or massive translations? It’s no secret that prompt examples relevant to the input text (examples of French to Chinese translation when our input task is French to Chinese translation) greatly improve your accuracy and GPT-3’s ability to understand the task. It makes sense that prompt examples that show GPT-3 how to fulfill the exact task have been shown to boost accuracy by up to 30% in some use cases. So if there is a token limit, how do you include enough examples to cover all these languages?

Dynamically Optimized Prompts

Prompt optimization algorithms are used to put examples that are directly relevant to our exact language translation into the prompt at runtime to optimize the results. These algorithms focus on dynamically creating a prompt that is tailored to the input text to maximize accuracy. This allows you to cover an insanely large amount of data variance in your language translation task and feel much more comfortable with the output translations. We build these prompt optimization algorithms for almost all our GPT-3 tasks, and doing so is the single most important step in moving to a production system.
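
As a rough illustration of the idea, the sketch below picks prompt examples at runtime by filtering a pool to the requested language pair and ranking what is left by similarity to the input text. It is not the algorithm described above: the `select_examples` function and its dictionary keys are hypothetical, a plain bag-of-words cosine similarity stands in for a sentence-embedding model such as SBERT, and whitespace word counts stand in for real token counts.

```python
import math
import re
from collections import Counter


def _vector(text):
    # Bag-of-words counts; a simple stand-in for a sentence embedding.
    return Counter(re.findall(r"\w+", text.lower()))


def _cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def select_examples(pool, input_text, source_lang, target_lang, token_budget):
    """Pick the most similar examples for one language pair within a rough budget."""
    query = _vector(input_text)
    # Keep only examples that demonstrate the exact language pair requested.
    candidates = [
        ex for ex in pool
        if ex["source_lang"] == source_lang and ex["target_lang"] == target_lang
    ]
    # Most similar source sentences first.
    candidates.sort(key=lambda ex: _cosine(query, _vector(ex["source_text"])), reverse=True)
    chosen, used = [], 0
    for ex in candidates:
        # Crude size estimate: whitespace-separated words, not model tokens.
        cost = len(ex["source_text"].split()) + len(ex["target_text"].split())
        if used + cost > token_budget:
            continue
        chosen.append(ex)
        used += cost
    return chosen


pool = [
    {"source_lang": "French", "target_lang": "Chinese",
     "source_text": "Je voudrais un café.", "target_text": "我想要一杯咖啡。"},
    {"source_lang": "French", "target_lang": "Chinese",
     "source_text": "Où est la bibliothèque ?", "target_text": "图书馆在哪里？"},
    {"source_lang": "French", "target_lang": "English",
     "source_text": "Où est la gare ?", "target_text": "Where is the station?"},
]
best = select_examples(pool, "Où est la gare ?", "French", "Chinese", token_budget=10)
print([ex["source_text"] for ex in best])
```

In a real pipeline the similarity function and the budget estimate would be swapped for an embedding model and the tokenizer of the engine in use.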


Language Model Fine-Tuning

Fine-tuning is another important part of the process when talking about moving our accuracy and data variance coverage to production levels. Fine-tuning allows you to create a custom GPT-3 model with one of the engines as a baseline. The key benefit to this approach is that the training examples are not bound by the prompt token limit: examples added during training are not included in the prompt token count at runtime. This means our GPT-3 model can have a baseline understanding of the task at hand before seeing any prompt examples. While fine-tuning does not provide the same per-task “steering strength” as added prompt examples, it certainly moves us closer. 
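
Fine-tuning data for the GPT-3 base engines has typically been supplied as JSON Lines, one prompt and completion pair per line. The sketch below shows one way translation pairs might be converted into that shape. The function names, the prompt wording, the separator, and the stop marker are all assumptions made for illustration, and the exact format should be checked against the current fine-tuning documentation before use.

```python
import json


def to_finetune_records(pairs, separator="\n\n###\n\n", stop=" END"):
    """Convert (source_lang, target_lang, source_text, target_text) tuples to records."""
    records = []
    for source_lang, target_lang, source_text, target_text in pairs:
        # The separator marks where the prompt ends and the completion begins.
        prompt = f"Translate {source_lang} to {target_lang}: {source_text}{separator}"
        # Leading space and an explicit stop marker are assumed conventions here.
        completion = f" {target_text}{stop}"
        records.append({"prompt": prompt, "completion": completion})
    return records


def to_jsonl(records):
    # ensure_ascii=False keeps non-Latin scripts readable in the output.
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


records = to_finetune_records([
    ("French", "English", "Bonjour", "Hello"),
    ("French", "Chinese", "Merci", "谢谢"),
])
print(to_jsonl(records))
```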

Step 5: Confidence Metrics & Testing Suite for Natural Language Processing

The ability to fully understand how well your pipeline is performing in production as well as having a quantifiable way to see improvements in the accuracy of your product is a requirement. These two modules give you a deep insight into the abilities and quality of your language translation application. 

Confidence Metrics

Confidence metrics are custom-built scoring algorithms that give you an idea of the probability that a given translation run is good. We mentioned above that “accuracy” with GPT-3 can be somewhat relative given the generative nature of the outputs. In many use cases, different parties can have completely different ideas of what is correct. This makes it difficult to quantify what a good result is and measure the confidence of the result. For this reason and others, we build our confidence metrics custom to the task at hand using various NLP scoring methods. This allows you to have some level of confidence in the real-time results of your language translation application.
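
The article does not spell out how its confidence metrics are computed, so the following is only a hypothetical example of what a simple one could look like. It combines two signals: the geometric mean of the per-token probabilities (assuming token log-probabilities are available from the model) and a sanity check on the length ratio between input and output. The function name, parameters, and default values are all assumptions.

```python
import math


def confidence_score(token_logprobs, source_text, translation,
                     expected_ratio=1.0, tolerance=1.0):
    """Return a heuristic 0-1 score; higher means the run looks more trustworthy."""
    if not token_logprobs or not translation.strip():
        return 0.0
    # Geometric mean of the token probabilities.
    fluency = math.exp(sum(token_logprobs) / len(token_logprobs))
    # Character-length ratio of output to input.
    ratio = len(translation) / max(len(source_text), 1)
    # Penalize outputs whose length is far from what is expected for the pair.
    length_ok = max(0.0, 1.0 - abs(ratio - expected_ratio) / tolerance)
    return fluency * length_ok


score = confidence_score([-0.1, -0.2, -0.05], "Où est la gare ?", "Where is the station?")
print(round(score, 3))
```

The expected length ratio depends heavily on the language pair (Chinese output is usually much shorter in characters than French input, for example), so it would need to be estimated per pair from known-good translations.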

Bleu score used for evaluating translation

Evaluating Language Models

Testing suites are used to quantify improvements during all stages of development and the pipeline's lifecycle. This helps you split test different language variations in your GPT-3 prompt, different prompt optimization frameworks, or variations in the number of models you use in a pipeline. Understanding what changes in your GPT-3 prompt actually lead to better translations is a huge part of making progress in improving your data variance coverage or the number of languages you can support at an acceptable accuracy level. 

BLEU score is one of the most popular algorithms used for evaluating natural language processing pipelines such as language translation, chatbots, and abstractive summarization. The key idea behind this algorithm is to compare a generated sentence to a human-created sentence and score how similar they are to each other. BLEU leverages n-grams and modified (clipped) n-gram precision to create a final score between 0 and 1 for how close the generated sentence is to the human sentence. A brevity penalty is applied to penalize sentences that are too short, which keeps the 0 to 1 score comparable across sentence lengths. BLEU score has both strengths and weaknesses that should be understood. 

  • Very easy to use and a quick way to reach some level of quantifiability
  • The algorithm is language independent, which is perfect for our multi-language services
  • Works well with multiple ground truth sentences
  • The key weakness is that it does not consider the meaning of words. Contextually similar words such as “police” and “cop” are considered incorrect. Luckily we’ve created a variation of this algorithm to account for this common issue
  • The raw BLEU score does not consider variations of the same word to be the same: “run” and “running” are scored as different
  • The algorithm does not have a method of identifying important words and weighting them differently. Words such as “to” or “it’s” are scored the same as everything else
  • Variations in how proper nouns are handled, which are common in the language industry, are difficult to manage
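
To show how the pieces described above fit together, here is a minimal sentence-level BLEU sketch using only the standard library. It uses whitespace tokenization, uniform weights over 1-grams to `max_n`-grams, multiple references, and no smoothing, so it is a simplified illustration rather than a replacement for an established implementation. The function name `bleu` is an assumption.

```python
import math
from collections import Counter


def _ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, references, max_n=4):
    """Sentence-level BLEU with uniform n-gram weights and no smoothing."""
    cand = candidate.split()
    refs = [r.split() for r in references]
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = _ngrams(cand, n)
        # For each n-gram, the most times it appears in any single reference.
        max_ref = Counter()
        for ref in refs:
            for gram, count in _ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        # Clip candidate counts so repeated words cannot inflate precision.
        clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if clipped == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty uses the reference length closest to the candidate length.
    ref_len = min((abs(len(r) - len(cand)), len(r)) for r in refs)[1]
    bp = 1.0 if len(cand) > ref_len else math.exp(1 - ref_len / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)


print(bleu("the cat sat on the mat", ["the cat sat on the mat"]))
print(bleu("the police arrived", ["the cop arrived"], max_n=1))
```

The second call illustrates the synonym weakness from the list: “police” gets no credit against “cop”, so the unigram precision is only two out of three even though the meaning is the same.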

Word Error Rate from Ketan Doshi

Word Error Rate (WER) is another popular algorithm used in language translation and speech-to-text applications. The algorithm compares the generated sentence to a target sentence word by word. It counts the insertions, deletions, and substitutions needed to turn one into the other and divides by the total number of words in the target sentence. WER is based on a string comparison algorithm known as Levenshtein distance. 
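
A minimal sketch of WER follows, computing the word-level Levenshtein distance with dynamic programming and dividing by the number of words in the target (reference) sentence. The function name `word_error_rate` and the whitespace tokenization are assumptions; languages written without spaces would need a real tokenizer.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the distance between the reference prefix processed so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("the cat sat", "the dog sat"))
```

Unlike BLEU, lower is better here, and the value can exceed 1 when the generated sentence contains many extra words.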


Language translation with GPT-3 is a popular task for people who want to get started with the OpenAI engine. The high-level idea is offered as a playground example where you can get your feet wet and see the baseline power of GPT-3. That being said, going from nice playground examples to a full-scale production pipeline is a very different process and includes a number of steps most users don’t even know exist. The architecture we’ve laid out today for moving from a proof of concept to a production system requires a real understanding of data variance, prompt-based programming, and evaluating large language models to produce human-like text.

About Width.ai

Width.ai focuses on building deep learning pipelines just like this one for any domain or business use case. We’ve built this exact architecture to take basic GPT-3 ideas and turn them into full sized businesses that clients can use. We’ve written a number of deep guides to using GPT-3 for various tasks and are the leading experts in developing applications for its use.
