The large language model space has been heating up like never before with new entrants, models, and amazing capabilities announced every month. In this comparison of AI21 vs. GPT-3, we explore one such new entrant, AI21 Labs, and how their Jurassic-1 models fare against OpenAI's GPT-3 models.
What Is AI21?
AI21 Labs is a company that has created large language models, Jurassic-1 and Jurassic-2, on the same scale as OpenAI's GPT-3. Its Jurassic-1 Jumbo model clocks in at 178 billion parameters, comparable to OpenAI's DaVinci model. It also offers a Jurassic-1 Grande Instruct model that's been trained to follow instructions along the same lines as InstructGPT.
AI21 vs. GPT-3 — Essentials
We'll start with some essential differences between these two model families before we compare them on language tasks.
1. AI21 vs. GPT-3 Features
AI21's main product is called AI21 Studio. It's a web application that features various language tasks like summarization, paraphrasing, ad copy generation, and more. Even non-technical users will find it easy to use for common text generation and processing tasks. Plus, AI21 Studio also exposes application programming interfaces (APIs) for programmatic access from your custom software.
Here are some key differences in features and functionality between AI21 Studio and GPT-3:
Predefined tasks and prompts: AI21 Studio's presentation of tasks and prompts is more welcoming towards non-technical users. Users are shown easy-to-understand tasks and descriptions, like customer service bot or outline creator, and are taken directly to the playground where they can start using them. While GPT-3 also has such tasks, they're very basic and don't provide much real-world value.
User experience: AI21 Studio's user interface feels like it's designed for non-technical users. In contrast, GPT-3's playground feels more like a quick prototyping front-end meant for developers.
API differences: For most common language tasks with GPT-3, you must use the completions or chat APIs. In contrast, AI21 Studio provides task-specific APIs in addition to a general completion API.
2. AI21 vs. GPT-3 Models
AI21's models are called Jurassic-1 and are based on the same autoregressive transformer architecture as GPT-3. Some brief details about AI21's models:
Jurassic-1 Jumbo, with 178 billion parameters, is equivalent to GPT-3 DaVinci with its 175 billion.
Jurassic-1 Grande, with 17 billion parameters, has no equivalent in the GPT-3 family.
Jurassic-1 Grande Instruct, with 17 billion parameters, is functionally equivalent to the ChatGPT chatbot.
Jurassic-1 Large, with 7.5 billion parameters, is equivalent to GPT-3 Curie with its 6.7 billion.
Jumbo costs $0.25 per 1,000 completion tokens but doesn't charge anything for prompt tokens, which is helpful when using few-shot examples. GPT-4 charges 3-6 cents per 1,000 prompt tokens and 6-12 cents per 1,000 completion tokens. GPT-3 DaVinci costs 2 cents per 1,000 tokens, prompt and completion combined, which is less than one-tenth of Jumbo's completion price.
Other AI21 models are similarly more expensive than their GPT-3 counterparts.
But AI21 is much more generous with free credits, valid for 3 months, for new accounts.
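Because Jumbo bills only completion tokens while DaVinci bills every token, the cheaper option depends on how long the prompt is relative to the completion. The following is a rough sketch using only the per-1,000-token prices quoted above; the function names are illustrative, and real bills may differ if the vendors change their pricing.

```python
# Prices in USD per 1,000 tokens, taken from the figures quoted in this article.
JUMBO_PER_1K_COMPLETION = 0.25  # Jurassic-1 Jumbo: prompt tokens are free
DAVINCI_PER_1K_TOKENS = 0.02    # GPT-3 DaVinci: prompt and completion billed alike


def jumbo_cost(prompt_tokens: int, completion_tokens: int) -> float:
    """Cost of one Jumbo request; the prompt length does not matter."""
    return completion_tokens / 1000 * JUMBO_PER_1K_COMPLETION


def davinci_cost(prompt_tokens: int, completion_tokens: int) -> float:
    """Cost of one DaVinci request; every token is billed."""
    return (prompt_tokens + completion_tokens) / 1000 * DAVINCI_PER_1K_TOKENS


def break_even_prompt_tokens(completion_tokens: int) -> float:
    """Prompt length above which Jumbo becomes the cheaper of the two."""
    ratio = JUMBO_PER_1K_COMPLETION / DAVINCI_PER_1K_TOKENS
    return (ratio - 1) * completion_tokens


if __name__ == "__main__":
    # Hypothetical request sizes: a short prompt and a long few-shot prompt.
    for prompt, completion in [(200, 200), (3000, 200)]:
        print(prompt, completion,
              round(jumbo_cost(prompt, completion), 4),
              round(davinci_cost(prompt, completion), 4))
```

At these prices, a request with a 200-token completion only becomes cheaper on Jumbo once the prompt exceeds roughly 2,300 tokens, so the free prompt tokens matter mainly for long few-shot prompts.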
Is AI21 worth it? Let's find out by running them against each other on multiple language tasks.
Language Tasks for AI21 vs. GPT-3 Evaluation
We evaluate the models on four basic natural language processing and text generation tasks with practical uses for businesses and education:
Extractive general summarization
Legal summarization
Ad copy generation
Math proof generation
1. Zero-Shot Strict Summarization
Zero-shot strict summarization is the ability of a language model to do extractive summarization according to a given set of rules. This is useful in medical and legal fields where extraction is preferred because abstractive summarization may inadvertently change the meaning.
Zero-shot summarization of technical articles is challenging because they contain domain-specific concepts and external knowledge familiar to a specialized audience. Any summary must sound correct to such an audience.
The gist of the article used for this test is that spaCy provides an advanced text processing pipeline that's superior to the more common approach of using multiple frameworks with glue code.
The temperature is set to zero for all models to remove any randomness in the predictions. We qualitatively evaluate the summaries as follows:
Is it coherent? Does it convey the gist or the main idea(s) of the article?
Is it strictly extractive?
Does it follow the other conditions like a word limit and having a proper noun?
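The evaluation in this article is qualitative, but the mechanical parts of these criteria could be checked automatically. The sketch below is one possible approach, not the method used here: it treats a summary as strictly extractive only if it appears verbatim in the passage, and it uses a crude capitalization heuristic in place of real proper-noun detection. The passage and summaries are made-up examples.

```python
import re


def normalize(text: str) -> str:
    """Collapse whitespace so line breaks do not affect matching."""
    return " ".join(text.split())


def is_strictly_extractive(summary: str, passage: str) -> bool:
    """True if the summary appears verbatim in the passage."""
    s = normalize(summary).rstrip(".")
    return bool(s) and s in normalize(passage)


def within_word_limit(summary: str, limit: int = 10) -> bool:
    """True if the summary has at most `limit` whitespace-separated words."""
    return len(summary.split()) <= limit


def has_proper_noun(summary: str) -> bool:
    """Crude heuristic: a word with a capital letter that is not merely
    the capitalized first letter of the sentence."""
    words = re.findall(r"[A-Za-z][A-Za-z0-9'-]*", summary)
    for i, word in enumerate(words):
        letters = word[1:] if i == 0 else word
        if any(c.isupper() for c in letters):
            return True
    return False


def check_summary(summary: str, passage: str, limit: int = 10) -> dict:
    """Run all three checks and report each result by name."""
    return {
        "extractive": is_strictly_extractive(summary, passage),
        "word_limit": within_word_limit(summary, limit),
        "proper_noun": has_proper_noun(summary),
    }


if __name__ == "__main__":
    # Hypothetical passage and candidate summaries, for illustration only.
    passage = ("spaCy provides an advanced text processing pipeline. "
               "It replaces glue code between multiple frameworks.")
    print(check_summary("spaCy provides an advanced text processing pipeline.", passage))
    print(check_summary("The library offers a better pipeline", passage))
```

Coherence and whether the summary conveys the main idea still need a human reader; this only covers the rule-following part.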
GPT-3 Models
The main prompt used was "write a strictly extractive summary with at least one proper noun. Do not use more than 10 words." But the GPT-3 models seemed unable to understand "strictly extractive summary" and generated abstractive summaries.
Only an alternative prompt that explained how to do extractive summarization without using that term — "select one sentence from the passage that summarizes it" — came somewhat close. But even then, it wasn't strictly extractive.
Surprisingly, ChatGPT fared badly on the main prompt and became verbose on the alternative prompt.
Jurassic-1 Models
The j1-grande-instruct model fared quite well on this task. It understood the main prompt and produced a nearly verbatim sentence from the passage. It also understood the alternative prompt perfectly.
However, the non-instruct models produced mostly junk output.
Observations
AI21's instruct model does better at extractive summarization than all the GPT-3 models if the text has a single main theme.
2. Legal Clause Extractive Summarization
Summarizing important sections and extracting information from documents is useful for automated document understanding and verification of contracts, court cases, and other legal documents. For example, we have implemented summarization on many legal use cases like understanding master service agreements and improving legal contracts.
In this experiment, we evaluate whether large language models can generate such information without any few-shot training data. These sections from a contract are supplied to the models:
GPT-3 Models
We instructed the DaVinci model to generate a summary and extract important information. However, it appears that it's only able to do one of them at a time:
Splitting up the instructions into two sentences works better. The summary is abstractive but retains all important information. Interestingly, more information is extracted in response to the standalone prompt (above) and it's different from that for the mixed prompt (below).
ChatGPT generates a fairly extractive summary, but its named entities are not quite what we want.
Jurassic-1 Models
Unfortunately, the Jurassic-1 models didn't do well on this task, both in summary generation and information extraction.
Observations
GPT-3 is capable of processing legal documents out of the box. However, AI21 models show the following problems:
Though the summaries are extractive, they sometimes miss out on entire sections.
The model just isn't capable of extracting information without training.
3. Ad Copy Generation
You can use language models as a copywriter to automatically generate new ads for your products based on their descriptions.
Example 1: Ad Headline
For this task, we start with the following product information as input:
GPT-3 Models
The GPT-3 models did quite well, but Curie and ChatGPT didn't always include the brand name. When we added a new field called "brand" and explicitly instructed the models to mention it, even Curie and ChatGPT started including it:
Jurassic-1 Models
The AI21 models generated fairly good headlines and followed other instructions like word limits:
Example 2: Ad Body
For this experiment, we selected this random product from the cross-market recommendations dataset and asked the models to generate an ad body for it:
GPT-3
Good ads are eye-catching and exciting. When we asked GPT-3 to produce such ads with specific language to use, it did well:
Jurassic-1
We gave the same instructions to the j1-grande-instruct model. While it did OK, it didn't really come up with the kind of excited copy GPT-3 generated and didn't follow some instructions either.
Observations on Ad Generation
Both models did quite well on this task. GPT-3 was effortlessly creative when the prompts asked for it while AI21 remained quite boring throughout.
We also noticed that the AI21 model sometimes duplicated all the results when asked for a particular number of results. If you're planning to automate this stuff, perhaps to distribute ads automatically to Google or Facebook, ensure that you have some post-generation checks.
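A minimal sketch of such post-generation checks might look like the following. It assumes the model output has already been split into a list of headline strings; the function name, the brand "Acme", and the 10-word limit are hypothetical choices for illustration.

```python
def clean_headlines(headlines, brand, max_words=10):
    """Drop duplicate, off-brand, or over-long headlines, keeping order."""
    seen = set()
    kept = []
    for headline in headlines:
        text = " ".join(headline.split())  # normalize whitespace
        key = text.casefold()              # case-insensitive comparison
        if not text or key in seen:
            continue                       # empty or duplicate result
        seen.add(key)
        if brand.casefold() not in key:
            continue                       # brand name is missing
        if len(text.split()) > max_words:
            continue                       # exceeds the word limit
        kept.append(text)
    return kept


if __name__ == "__main__":
    # Hypothetical model output containing a duplicate and an off-brand line.
    candidates = [
        "Acme Boots: Built to Last",
        "acme boots: built to last ",
        "Boots That Last Forever",
        "Acme Boots",
    ]
    print(clean_headlines(candidates, brand="Acme"))
```

If fewer headlines survive than were requested, the pipeline can ask the model again rather than publish duplicates.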
4. Math Proof Generation
Large language models are capable of generating proofs for simple math theorems. In this experiment, we examine how AI21 and GPT-3 perform on proof-by-cases using few-shot training.
First, we provided an example proof from number theory. The proof uses slightly complex concepts and symbols, such as modulo, equivalence, and powers. The models are then asked to prove that the absolute value of any real number outside [-1, 1] exceeds one.
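For reference, a proof by cases of the target statement can be written as below. This is our own sketch of what a correct answer looks like, stated for real numbers strictly below -1 or strictly above 1; it is not the output of either model.

```latex
\textbf{Claim.} Let $x$ be a real number with $x < -1$ or $x > 1$. Then $|x| > 1$.

\textbf{Proof.} We consider the two cases separately.

\textit{Case 1: $x > 1$.} Since $x > 0$, we have $|x| = x$, so $|x| = x > 1$.

\textit{Case 2: $x < -1$.} Since $x < 0$, we have $|x| = -x$. Multiplying
$x < -1$ by $-1$ reverses the inequality, giving $-x > 1$, so $|x| = -x > 1$.

The two cases cover every admissible $x$, hence $|x| > 1$ in all cases.
This completes the proof.
```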
GPT-3 Models
Despite being given just a single example, the DaVinci model is able to follow the familiar pattern of proof by cases and prove the theorem without any complications.
Jurassic-1 Models
The j1-grande-instruct model demonstrates some quirks in its proof:
Though asked to prove that |x| > 1, it proves instead that |x| ≥ x.
It introduces an additional variable k though it's not necessary.
While its method produced an acceptable deduction, it's not exactly what was asked.
Few-Shot Training for AI21
Can AI21 perhaps avoid its quirks with some few-shot training? To test that, we gave it some few-shot examples of proof by cases:
Unfortunately, the proofs it generated every time were all garbage, regardless of the temperature value:
Observations
The AI21 model isn't as good as the GPT-3 model on math proof generation. More few-shot examples may help with the output format but it doesn't look like its core math reasoning is capable enough as of now.
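For readers who want to experiment with more few-shot examples, a prompt of this kind can be assembled programmatically. The sketch below is an assumption about format, not the prompt used in our experiments: the "Theorem:" and "Proof:" labels, the function name, and the sample theorem are all illustrative.

```python
def build_few_shot_prompt(examples, theorem):
    """Join (theorem, proof) pairs and leave the final proof open
    so that the model continues from there."""
    blocks = [f"Theorem: {t}\nProof: {p}" for t, p in examples]
    blocks.append(f"Theorem: {theorem}\nProof:")
    return "\n\n".join(blocks)


if __name__ == "__main__":
    # One hypothetical worked example of a proof by cases.
    examples = [
        (
            "For every integer n, n^2 + n is even.",
            "Write n^2 + n = n(n + 1). "
            "Case 1: n is even, so the product n(n + 1) is even. "
            "Case 2: n is odd, so n + 1 is even and the product is even. "
            "In both cases n^2 + n is even.",
        ),
    ]
    target = "For every real x with x < -1 or x > 1, |x| > 1."
    print(build_few_shot_prompt(examples, target))
```

Keeping every example in the same layout gives the model a consistent pattern to imitate, which addresses output format but not the underlying reasoning.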
AI21 vs. GPT-3: Which LLM Should You Choose?
AI21's web interface is definitely more feature-rich, making it suitable for non-technical users. It did quite well on general summarization tasks.
But when more creativity was required (as for ad copy) or more domain-specific behavior was needed (as for legal documents), it didn't do so well out of the box. So we can conclude that GPT-3 is overall more versatile than AI21 as of now.
Large Language Models Are Increasing in Capabilities
Large language models are getting more capable all the time. At the time of this writing, a new 540 billion parameter model capable of using language and robotic sensors together promises new breakthroughs. This is an incredible time for businesses to come on board this ecosystem and use these versatile models for their workflows. Contact us to find out how you can start using these models to make your work easier.