See GPT-J vs. GPT-3 Go Head-to-Head on Popular Language Tasks

Karthik Shiraly
August 21, 2023

OpenAI's GPT-3 and ChatGPT may have stolen the limelight, but they're by no means the only players in the large language model (LLM) and generative AI ecosystem. You can instead run powerful open-source alternatives like GPT-J more efficiently and reliably on your own servers. They're capable of all the same use cases: blog generation, customer service chatbots, internal search, question answering, and more. As we'll see below, these open-source models perform much better than you might think! Let's take a look at what GPT-J is and how GPT-J vs. GPT-3 fare on language processing tasks.

GPT-3 pipeline for AI marketing copy generation
We build cool LLM pipelines like this

What Is GPT-J?

GPT-J is an open-source, open-access large language model that came out of experiments at EleutherAI to train massive models on the scale of OpenAI's GPT-3. With 6 billion parameters, GPT-J isn't the biggest but is above average and bigger than their older GPT-Neo models. Eleuther's own newer GPT-NeoX with 20 billion parameters and others like BLOOM with 176 billion parameters outclass GPT-J. For comparison, GPT-3 has about 175 billion parameters.

The main benefit of GPT-J is that its model and code are available to everyone to customize and deploy on consumer hardware or private cloud infrastructure.

GPT-J, like GPT-3 and GPT-2, is an autoregressive model consisting of just the decoder of the standard transformer model. Like GPT-3, it's a causal language model (LM), meaning that its predictions are entirely based on the words and context that came earlier. In contrast, masked language models use context from both sides. GPT-J is autoregressive because, at every step, it uses the previous prediction as input for the next prediction.

Unidirectionality is not a weakness. The way causal LMs work is closer to how people use language, making them a better fit for natural-sounding text generation. In contrast, the ability of masked LMs to use context from both sides makes them better for document understanding tasks.
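To make "autoregressive" concrete, here's a minimal greedy decoding loop. The `next_token_logits` function is a toy stand-in for a causal LM's forward pass (not GPT-J's real interface), chosen only so the sketch runs end to end:

```python
def next_token_logits(tokens):
    # Toy stand-in for a causal LM forward pass: scores for the next token
    # depend only on the tokens seen so far (left-to-right context).
    # Here we simply favor "previous token + 1" so the sketch is runnable.
    vocab_size = 50
    scores = [0.0] * vocab_size
    scores[(tokens[-1] + 1) % vocab_size] = 1.0
    return scores

def generate(prompt_tokens, steps):
    """Greedy autoregressive decoding: each prediction is appended to the
    context and becomes input for the next prediction."""
    tokens = list(prompt_tokens)
    for _ in range(steps):
        logits = next_token_logits(tokens)  # condition on everything so far
        next_tok = max(range(len(logits)), key=logits.__getitem__)  # greedy pick
        tokens.append(next_tok)  # feed the prediction back in
    return tokens
```

A real causal LM differs only in where the logits come from; the feed-the-output-back-in loop is the same.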

GPT-J vs. GPT-3 Training

A critical factor behind the power of these LLMs is the data they're trained on and how closely the format of the training text matches the way users interact with the model. One of the reasons ChatGPT (distinct from the GPT-3 discussed here) is so much better at Q&A than smaller models is that it was trained specifically on human-reviewed Q&A pairs in a format very similar to how you use it in the web interface.

GPT-J was trained on an 825-gigabyte collection called the Pile that consists of 22 high-quality datasets from a diverse set of sources that enable its use in multiple domains for various use cases. Some of the sources are:

  • The Common Crawl project for data from around the web
  • PubMed and ArXiv for research papers
  • GitHub for code generation capabilities
  • Stack Exchange and HackerNews for high-quality discussions
  • FreeLaw Project for legal texts

GPT-3 was also trained on the Common Crawl data and on WebText2, Books1, Books2, and Wikipedia among others. This wide range of text sources and various language formats is a key part of why GPT-J can be used across so many use cases. 

By seeing the structure and statistics of text in these datasets, these LLMs learn to mimic natural-sounding and semantically consistent text. A visible consequence is their ability to discern abstract patterns in text, just like people do, which gives them powerful few-shot capabilities on many natural language processing (NLP) tasks.

How Do You Use GPT-J or GPT-3 for Tasks Like Classification or Summarization?

intent classification with GPT-3 and GPT-J

When you read the text in the image above, you immediately notice patterns at multiple levels. You see the obvious patterns in the structure. Then you sense higher-level abstract patterns, like analogies between each line and the next. By applying these analogies, you're able to predict the next word. This is called few-shot learning.

Since they are trained on large volumes of human-produced text, LLMs too learn to sense these patterns. We can use those abilities for natural language tasks through two mechanisms:

  • In-prompt guidance with instructions and examples
  • Fine-tuning

In-Prompt Guidance

Prompt example of intent classification

This technique supplies the LLM with examples of what a correct result looks like for your specific use case, and how to get there with various patterns and instructions. In some natural language generation use cases, where the correctness of the output is more subjective, these few-shot examples are even more critical for helping the model understand the key details of the desired output.

  • If you want it to classify text, supply some text-label pairs.
  • If you want to ask questions and get answers, supply the context and a few pairs of questions and answers.
  • If you want it to summarize, supply the long-form text and summaries as pairs. 

Such inputs with examples and instructions are called "prompts." The patterns and analogies in the examples influence the LLM to predict text that's consistent with them. It's not genuine learning but more like mimicking the abstract relationships in the examples. Surprisingly, that's good enough for many NLP use cases!
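Assembling such a prompt is mechanical. Here's a generic sketch; the "Text:"/"Label:" field names and the blank-line delimiter are our own choices for illustration, not a requirement of any API:

```python
def build_fewshot_prompt(examples, query):
    """Join (text, label) example pairs into a few-shot prompt and append
    the unlabeled query, so the model's natural continuation is a label."""
    lines = [f"Text: {text}\nLabel: {label}" for text, label in examples]
    lines.append(f"Text: {query}\nLabel:")
    return "\n\n".join(lines)

examples = [
    ("How do I activate my new card?", "activate_my_card"),
    ("My card hasn't arrived yet.", "card_arrival"),
]
prompt = build_fewshot_prompt(examples, "When will my card get here?")
```

Because the prompt ends right after "Label:", the model's most consistent completion is one of the labels it has seen in the examples.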

Just remember that, by default, a GPT model won't learn or remember anything from a previous prompt. So if you want to classify or summarize many items, you must supply these examples with every inference. We use techniques like dynamic prompt optimization to provide these LLMs with optimized prompts and examples each time they run.

Another drawback of in-prompt guidance is the length restriction on prompts. Every LLM's input capacity is limited to just a few thousand tokens (tokens here aren't even full words but fragments, often two or three to a full word). GPT-J and most GPT-3 models are limited to 2,048 tokens, while the latest GPT-3 Davinci model supports 4,000 tokens. But what if your use case is something like document summarization or blog generation with too much text to fit in a length-restricted prompt? Providing more than one or two few-shot examples is impossible for those use cases. For them, you'll have to look into fine-tuning or zero-shot prompting with stronger prompt language.

Fine-Tuning GPT-J vs. GPT-3

Like any other neural network, you can fine-tune an LLM's network weights towards specific goals by retraining some of its layers. If you want the LLM to sound like a legal or medical expert, you can fine-tune it on relevant case files or medical records. Given enough data, its network weights will change to generate text consistent with the domain's text.

Both GPT-J and GPT-3 support fine-tuning. Since GPT-J is a regular neural network model, you can fine-tune it like any other network by unfreezing its layers. You have full control over the process, like the layers to unfreeze and optimization algorithms to use.
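Partial fine-tuning boils down to setting `requires_grad` selectively. Here's a minimal PyTorch sketch; the commented GPT-J layer names ("h.26", "h.27", "lm_head") assume the Hugging Face transformers naming for the 28-block GPT-J model and are illustrative, not prescriptive:

```python
import torch.nn as nn

def freeze_all_but(model: nn.Module, trainable_substrings):
    """Freeze every parameter except those whose name contains one of
    the given substrings; a common recipe for partial fine-tuning."""
    for name, param in model.named_parameters():
        param.requires_grad = any(s in name for s in trainable_substrings)
    return model

# With GPT-J loaded via Hugging Face transformers, e.g.
#   model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")
# you might unfreeze only the last transformer blocks and the LM head:
#   freeze_all_but(model, ["h.26", "h.27", "lm_head"])
```

Any optimizer you then build over `filter(lambda p: p.requires_grad, model.parameters())` will update only the unfrozen layers.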

In contrast, GPT-3 doesn't let you access its models directly. It provides a basic application programming interface (API) that you can use to fine-tune. But:

  • The models that can be fine-tuned are not the latest ones like "text-davinci-003" but their older baseline versions.
  • The cost per token for using a fine-tuned model can go up as much as 6x compared to the baseline models. The cost per token for training a fine-tuned model is also quite high.
  • You don't get deep control over the fine-tuning process.
  • Plus, we have observed that fine-tuning requests can often sit in the queue for hours without starting. GPT-3 still suffers from some scaling and availability problems that may become bottlenecks to your business if you rely too heavily on it.

In the next three sections, you'll see how these approaches and models fare on applied NLP tasks.

GPT-J vs. GPT-3 for Prompt-Guided Text Classification

Let's see if GPT-J can hold its own against the four GPT-3 models on few-shot classification using in-prompt guidance.

The Task

The goal was to test GPT-J versus all four GPT-3 models on few-shot intent recognition using the banking77 dataset. The dataset contains short online chat queries like these from bank customers:

categories used in intent classification task

They're categorized under 77 fine-grained categories:

10 labels we're going to use for intent classification

We selected 10 labels at random for this test. A large number of categories makes this a difficult task for even trained models. Can the GPTs do well using just few-shot guidance? Also, notice that the labels aren't proper phrases but programming strings joined by underscores. We expect the GPTs to predict them verbatim.

Few-Shot Prompt Preparation

OpenAI tokenizer tool
OpenAI's tokenizer tool

The challenge here is to achieve the best classification metrics without crossing the prompt's limit of 2,048 tokens. The unknown parameter is the optimal number of examples per label for this dataset. Too few means our metrics may be sub-optimal. But too many means we may be overfitting and getting unreliable metrics.

So we conduct the experiment over multiple cycles using split testing of prompts:

  • We start with three examples per label in the first cycle. 
  • We add validation samples one by one and ask the model to classify after each addition.
  • After all the validation samples are classified, we calculate the metrics for the first cycle.
  • Then we start the next cycle by incrementing the number of training samples per label, repeat the classifications for all validation samples, and recalculate the metrics.
  • We repeat this cycle and increment the samples as long as the cycle's metrics are improving.
  • At some point, either the metrics start to falter, or we hit the prompt limit of 2,048 tokens and can't increment any more.
  • At that point, we have the optimal number of samples per label and use its metrics as the model's metrics.
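The cycle above can be sketched as a simple search loop. Here `evaluate` stands in for running all validation samples through the model at a given examples-per-label count and computing a metric; the accuracy table is a toy stand-in so the sketch runs:

```python
def find_optimal_shots(evaluate, start=3, max_shots=10):
    """Increase examples-per-label until the metric stops improving
    or the prompt budget (max_shots) is exhausted."""
    best_shots, best_metric = start, evaluate(start)
    shots = start + 1
    while shots <= max_shots:
        metric = evaluate(shots)
        if metric <= best_metric:  # metrics faltered; stop here
            break
        best_shots, best_metric = shots, metric
        shots += 1
    return best_shots, best_metric

# Toy stand-in: accuracy peaks at 6 examples per label.
accuracy = {3: 0.61, 4: 0.68, 5: 0.74, 6: 0.77, 7: 0.75}
result = find_optimal_shots(lambda k: accuracy.get(k, 0.0))
```

In practice, `evaluate` would also have to check that the assembled prompt still fits within the 2,048-token limit before running the cycle.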
Examples of intent classification

The selected samples and labels were combined with delimiters to create the few-shot prompt above. You can immediately spot the structural patterns. Quite impressively, even the GPTs can, without any instructions or headers.

inference time labels

Next, we selected about 30 samples from the validation set, and appended each of them to our few-shot prompt, as shown above, to get 30 prompts for inference.

GPT-3 Inference

Code to run through GPT-3 models

We passed the 30 prompts to the completion endpoint of the four GPT-3 models using OpenAI's Python library. The parameters below were set because we want the models to generate just one of the 10 labels and nothing else:

  • Set the temperature very low or even zero to minimize the chances of generating random stuff. 
  • Set max_tokens to the token count of the longest label.
  • Set the stop sequences to the sample delimiter used above in the prompt.
  • Initially, we had set the logit_bias map to 100 for any token in the 10 labels. That's supposed to ensure that only those tokens are generated. But when it gave unexpected output, we disabled the bias map. That worked better.
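Those settings can be assembled like this. The word-count proxy for token length is a simplification (real code would use a tokenizer such as tiktoken), and the model name and prompt text are only examples:

```python
def completion_params(model, prompt, labels, delimiter="\n\n"):
    """Assemble completion-endpoint parameters that constrain the model
    to emit only a label: temperature 0 for determinism, max_tokens
    capped near the longest label's length, and the prompt's sample
    delimiter as a stop sequence."""
    # Rough token-count proxy; real code would count tokens with a tokenizer.
    longest = max(len(label.split("_")) for label in labels)
    return {
        "model": model,
        "prompt": prompt,
        "temperature": 0,
        "max_tokens": longest + 2,  # small safety margin
        "stop": [delimiter],
    }

params = completion_params(
    "text-curie-001",
    "query: When will my card get here?\nintent:",
    ["card_arrival", "activate_my_card"],
)
# With the pre-1.0 openai Python library, these would be passed as
# openai.Completion.create(**params).
```

The stop sequence is what prevents the model from continuing past the label and inventing further fake examples.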

GPT-J Inference

Forefront.ai api call with GPT-J

We passed the same 30 prompts and nearly the same parameters to the GPT-J model API hosted by Forefront, a popular NLP cloud provider.

Metrics and Evaluation

evaluation of the results of gpt-3 vs gpt-j

You've probably heard that open models don't do too well right out of the box. Because they're smaller, many people assume they're less capable of completing any given task and that tweaking and fine-tuning are always needed to compete with the larger models. We fully expected bad labels here. So we were blown away when GPT-J not only generated only valid labels (with underscores) but also matched Babbage's metrics and came closer to Curie than Ada did.

evaluation results

Where did GPT-J fall short? The failed samples above are quite ambiguous, even for us. Even Curie tripped up on the same sample. All said and done, GPT-J, out of the box, fared surprisingly well in head-to-head fine-grained text classification against GPT-3.

GPT-J vs. GPT-3 for Document Summarization

The next experiment was document summarization. Few-shot prompt guidance was impractical because the documents and summaries were too long to provide more than one or two examples. So we fine-tuned GPT-J using Forefront’s fine-tuning interface and GPT-3's Curie using OpenAI's tools.

The Dataset

document summarization dataset

We used the Multi-LexSum legal dataset that provides three different expert-authored summaries for court cases — a long summary, a short one, and a tiny one. The plan was to use the long summaries as our "documents" and the short ones as ideal summaries for fine-tuning. Then given an unseen document (i.e., a long summary), the GPTs must generate something close to its short summary or at least contain all the same information.

GPT-3's fine-tuning guide recommends at least 500 samples, while GPT-J's recommends at least 200. We used the same 500 samples for fine-tuning both. The samples must be JSON Lines (JSONL) records with two fields, prompt and completion, per record. There are particular rules about delimiters at the end of both fields and a space at the start of the completion.
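Preparing those records can look like the sketch below. The `\n\n###\n\n` delimiter and trailing stop sequence follow the conventions OpenAI's fine-tuning guide suggested at the time, and the leading space on the completion is their formatting rule; treat the exact strings as assumptions:

```python
import json

def to_finetune_jsonl(pairs, path, delimiter="\n\n###\n\n", end="\n"):
    """Write (document, summary) pairs as JSONL fine-tuning records:
    each prompt ends with a fixed delimiter, and each completion starts
    with a space and ends with a stop sequence."""
    with open(path, "w") as f:
        for document, summary in pairs:
            record = {
                "prompt": document + delimiter,
                "completion": " " + summary + end,
            }
            f.write(json.dumps(record) + "\n")

to_finetune_jsonl(
    [("Long case summary text ...", "Short summary ...")],
    "train.jsonl",
)
```

The delimiter matters at inference time too: you append it to an unseen document so the fine-tuned model knows the prompt has ended and a summary should begin.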

Fine-Tuning GPT-3

fine-tuning gpt-3 model for document summarization

After uploading the training and validation JSONL files via the OpenAI API, we launched the fine-tuning job using their command-line tool.

Fine-tune queue for openai

Unfortunately, it sat in the queue for four to five hours. But once it started, it finished in 10 minutes.

Fine-Tuning GPT-J

gpt-j fine-tuning in forefront.ai
GPT-J fine-tuning on Forefront

For fine-tuning GPT-J, we uploaded the same training JSONL file and used Forefront's web frontend. We could have used the API in theory, but we found some things not working as expected. Fine-tuning for two epochs with five checkpoints and evaluation took about 15 minutes.

test prompts in Forefront.ai

In the images above, you can observe how the generated summary improved between the first and last checkpoints. It's short only because the generation length parameter was wrongly set to just 64 tokens; ideally, it should be slightly more than the longest expected generation.


evaluation of different legal document summarizations

Let's compare the summaries:

  • Expert-written reference summary: It tells us six details — the plaintiffs, the defendant, the core legal issue, the court of filing, the date of filing, and the current status.
  • Baseline GPT-J: Its summary included only two of the six details: the current status and the date. The other four important details are missing. Plus, it added irrelevant text from some other case, rendering the summary useless in practice.
  • Fine-tuned GPT-J: In contrast, fine-tuned GPT-J includes five out of six details — the plaintiffs, the defendant, the core legal issue, the court of filing, and the date of filing. The only detail missing is the court's decision. Also notice that its summary is quite close to Curie's summaries.

From this qualitative comparison, it's clear that fine-tuning improved GPT-J's ability to summarize and include most of the same salient details as the human expert. You can even add a bit of advanced entity extraction to this pipeline to improve the model's grasp of key information for a more extractive summarization output. This is a perfect example of why we tell customers that data is more important than the size of the model you choose. Many customers scoff at the idea of using a model other than GPT-3 Davinci simply because they believe a bigger model means more accuracy for their use case. While model size certainly affects a model's ability to learn tasks in a low-data environment (few-shot learning), fine-tuning these task-agnostic language models on your specific data and use case completely changes the equation of which model to choose.

Surprisingly, both baseline and fine-tuned Curie generate almost identical summaries. Is baseline Curie that good, or is it possible that Curie had already seen this data during its training?

GPT-J for Text Generation

This last experiment was to explore GPT-J's ability to sound like a domain expert by training it on domain-specific data. This is useful for businesses to create bots that can provide domain-specific autocomplete suggestions, explain complex internal processes to their employees, or explain product features to their customers. Since it fundamentally modifies the model's text generation abilities, fine-tuning is the way to go.

We trained it on the same Multi-LexSum dataset as in the summarization experiment. For fine-tuning text generation, we must leave the JSONL prompt fields empty and supply only the completion fields. We supplied 500 long summaries as completion fields. The idea was to turn the model into a kind of legal expert. The fine-tuning process was the same as for summarization.

However, as the generated text above shows, both the base text generation model and the fine-tuned model hallucinated unrealistic statements. In hindsight, this was because this dataset is not suitable for text generation at all. It's full of case-specific details with no general theory or explanations.

The ideal datasets for this task are things like knowledge bases of products or theoretical works on a subject.

GPT-J Deployment Choices

GPT-3 is OpenAI's proprietary LLM that you can only access via their API. In contrast, an open-access LLM like GPT-J gives you more deployment freedom. Some of the choices include:

  • Self-hosting: The GPT-J-6B model is heavy on memory and GPU. You may need a machine with at least 48 GB of RAM, a minimum of 12-16 GB of GPU RAM for inference, and about 90 GB of GPU RAM for training. These specs are uncommon in consumer hardware, but you can get dedicated hardware that meets these requirements.
  • NLP-focused managed services: NLP-focused managed services like Forefront.ai, NLP Cloud, or GooseAI provide managed GPT-J access. Like GPT-3, they're accessed via APIs, and you don't get direct access to the model. But their convenience, pricing, and fine-tuning workflows may make them good choices for your needs.
  • GPU-focused cloud providers: GPU-focused providers like Paperspace, CoreWeave, or Banana.dev support GPT-J hosting and workflows. They may be good alternatives to self-hosting.
  • General cloud providers: General providers like Amazon's SageMaker and Google Cloud provide cloud servers that are capable of hosting and fine-tuning GPT-J. If your business is already using them, this may be a good approach.

This is one of the key benefits of using these open-source models. Control over the infrastructure used for inference gives you much more flexibility to tune inference speed than managed services offer.

Extending GPT-J vs. GPT-3

An LLM may be a very capable text generator, but its knowledge and abilities are limited by what it saw during training. However, many use cases require the model to be knowledgeable about current real-world events. For example, law firms may require their LLM assistants to stay up to date with that morning's court appointments and rulings. Or customer service chatbots may need to know the latest product prices or traffic conditions.

LLMs can be immensely useful to many business use cases if we can enable them to see the world in real-time. We'll cover two approaches to such real-world data augmentation.

Data Pipelines In Only a Few Minutes With LangChain

LangChain is a spaCy-like library to create data processing pipelines around LLMs. It integrates tightly with LLM-oriented interfaces like prompts and concepts like agents and statefulness to create powerful data pipelines that use LLMs. You can use it with GPT-3, GPT-J, or any other LLM.

For example, it can retrieve data from an external API, inject them into an LLM's prompt or output, or split and recombine long documents for tasks like summarization. But, useful as it is, LangChain doesn't fundamentally change the working of an LLM. The next approach does.
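The split-and-recombine trick for long documents is essentially map-reduce summarization, which LangChain packages as a chain. Here's a dependency-free sketch of the pattern, with a toy stand-in for the LLM call:

```python
def map_reduce_summarize(document, summarize, chunk_size=1000):
    """Map-reduce summarization for documents that exceed the prompt
    limit: summarize each chunk independently (map), then summarize
    the concatenated chunk summaries (reduce). `summarize` is any
    LLM call that fits within the prompt limit."""
    chunks = [document[i:i + chunk_size] for i in range(0, len(document), chunk_size)]
    partial = [summarize(chunk) for chunk in chunks]
    return summarize(" ".join(partial))

# Toy stand-in for an LLM summarizer: keep the first five words.
toy = lambda text: " ".join(text.split()[:5])
summary = map_reduce_summarize("word " * 600, toy, chunk_size=100)
```

A production version would split on sentence or section boundaries rather than raw character offsets, which is exactly the kind of bookkeeping LangChain's text splitters handle for you.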

Equipping LLMs to Sense the World With Toolformer

Toolformer is a paradigm shift because it enables an LLM to learn to use data augmentation tools by itself and make context-aware decisions, like when to use or not use a tool. Such built-in context awareness is far more powerful than rule-oriented approaches like LangChain.

Under the hood, it introduces special tokens (like BERT's CLS token). If the rest of the context activates that token, it executes actions like running scripts, connecting to another neural network, or calling third-party APIs. It then intelligently decides whether the results it receives are relevant to the context and if it should include them in the generated text.

Because Toolformer directly modifies some model fundamentals, you need an open-source model like GPT-J for it. Of course, OpenAI may decide to implement it in the future and provide an API, but are you ready to wait in uncertainty? Moreover, you may never be able to use a third-party LLM with your internal APIs. By opting for self-hosted models like GPT-J, you can integrate your proprietary APIs or data sources immediately to provide unique services to your customers.

Deploy GPT-J for Your Business Workflows

Internal knowledge search engines, customer service chatbots, document summarization, question answering, and text classification are just some of the possibilities that artificial intelligence has opened up via powerful large language models like GPT-J. Contact us for ideas on how to integrate them into your workflows and improve your business efficiency.