AI21 vs. GPT-3: Head-to-Head on Practical Language Tasks

August 15, 2023

The large language model space is moving quickly, with new entrants, models, and capabilities announced every month. In this comparison of AI21 vs. GPT-3, we look at one such entrant, AI21 Labs, and how its Jurassic-1 models fare against OpenAI's GPT-3 models.

What Is AI21?

AI21 is a company that has built large language models, called Jurassic-1 and Jurassic-2, on the same scale as OpenAI's GPT-3. Its Jurassic-1 Jumbo model clocks in at 178 billion parameters, comparable to OpenAI's DaVinci model. It also has a Jurassic-1 Grande Instruct model that has been trained to follow instructions along the same lines as InstructGPT.

AI21 vs. GPT-3 — Essentials

We'll start with some essential differences between these two model families before we compare them on language tasks.

1. AI21 vs. GPT-3 Features

AI21's main product is called AI21 Studio. It's a web application that features various language tasks like summarization, paraphrasing, ad copy generation, and more. Even non-technical users will find it easy to use for common text generation and processing tasks. Plus, AI21 Studio also exposes application programming interfaces (APIs) for programmatic access from your custom software.

Here are some key differences in features and functionality between AI21 Studio and GPT-3:

Predefined tasks and prompts: AI21 Studio's presentation of tasks and prompts is more welcoming to non-technical users. Users are shown easy-to-understand tasks and descriptions, like customer service bot or outline creator, and are taken directly to the playground where they can start using them. While GPT-3 also has such tasks, they're very basic and don't provide much real-world value.
User experience: AI21 Studio's user interface feels like it's designed for non-technical users. In contrast, GPT-3's playground feels more like a quick prototyping front end meant for developers.
API differences: For most common language tasks with GPT-3, you must use the completions or chat APIs. In contrast, AI21 Studio provides task-specific APIs in addition to a general completion API.
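
To make the API difference concrete, here is a minimal sketch that only builds request descriptions and sends nothing over the network. The endpoint URLs, JSON field names, and the model names `text-davinci-003` and `j1-grande-instruct` are assumptions based on the vendors' public documentation of the time, and the helper functions are hypothetical; check the current API references before relying on any of them.

```python
# Assumed endpoint shapes; verify against each vendor's current API reference.
OPENAI_COMPLETIONS_URL = "https://api.openai.com/v1/completions"
AI21_BASE_URL = "https://api.ai21.com/studio/v1"


def openai_completion_request(prompt, model="text-davinci-003", max_tokens=64):
    """General completion: the task is described entirely in the prompt."""
    return {
        "url": OPENAI_COMPLETIONS_URL,
        "json": {
            "model": model,
            "prompt": prompt,
            "max_tokens": max_tokens,
            "temperature": 0,  # zero temperature for repeatable output
        },
    }


def ai21_completion_request(prompt, model="j1-grande-instruct", max_tokens=64):
    """General completion on AI21 Studio: same idea, different field names."""
    return {
        "url": f"{AI21_BASE_URL}/{model}/complete",
        "json": {
            "prompt": prompt,
            "maxTokens": max_tokens,
            "temperature": 0,
            "numResults": 1,
        },
    }


def ai21_summarize_request(text):
    """Task-specific API: no prompt, only the text to summarize."""
    return {
        "url": f"{AI21_BASE_URL}/summarize",
        "json": {"source": text, "sourceType": "TEXT"},
    }
```

The point of the sketch is the shape of the payloads: with a general completion API the instructions travel inside `prompt`, while a task-specific API takes only the input text and leaves the instructions to the service.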

2. AI21 vs. GPT-3 Models

AI21's models are called Jurassic-1 and are based on the same kind of autoregressive transformer architecture as GPT-3. Some brief details about AI21's models:

Jurassic-1 Jumbo, with 178 billion parameters, is equivalent to GPT-3 DaVinci with its 175 billion.
Jurassic-1 Grande, with 17 billion parameters, has no equivalent in the GPT-3 family.
Jurassic-1 Grande Instruct, with 17 billion parameters, is functionally equivalent to the ChatGPT chatbot.
Jurassic-1 Large, with 7.5 billion parameters, is equivalent to GPT-3 Curie with its 6.7 billion.

3. AI21 vs. GPT-4 and GPT-3 Pricing

AI21 is more expensive than GPT-4 or GPT-3:

Jumbo costs $0.25 per 1,000 completion tokens but doesn't charge anything for prompt tokens, which is helpful when using few-shot examples. GPT-4 charges 3-6 cents per 1,000 prompt tokens and 6-12 cents per 1,000 completion tokens. GPT-3 DaVinci costs 2 cents per 1,000 tokens, prompt and completion combined, which is less than one-tenth of Jumbo's completion price.
Other AI21 models are similarly more expensive.
But AI21 is much more generous with free credits, valid for 3 months, for new accounts.
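
The per-request arithmetic behind this comparison can be sketched as follows. The prices are the figures quoted above; the dictionary keys and the `request_cost` helper are hypothetical names introduced only for illustration, and the GPT-4 range is represented by its low and high ends.

```python
# USD per 1,000 tokens, taken from the figures quoted in this section.
PRICES = {
    "j1-jumbo": {"prompt": 0.0, "completion": 0.25},
    "gpt-3-davinci": {"prompt": 0.02, "completion": 0.02},
    "gpt-4-low": {"prompt": 0.03, "completion": 0.06},
    "gpt-4-high": {"prompt": 0.06, "completion": 0.12},
}


def request_cost(model, prompt_tokens, completion_tokens):
    """Cost in USD of one request for the given token counts."""
    price = PRICES[model]
    total = (prompt_tokens * price["prompt"]
             + completion_tokens * price["completion"])
    return total / 1000
```

For a request with 1,500 prompt tokens and 500 completion tokens, this gives $0.125 for Jumbo, $0.04 for DaVinci, and $0.075 at the low end of GPT-4. Because Jumbo's prompt tokens are free, the gap narrows for prompt-heavy requests: at these prices Jumbo only becomes cheaper than DaVinci once the prompt is more than about 11.5 times longer than the completion (for example, 1,500 prompt tokens and 100 completion tokens cost $0.025 on Jumbo versus $0.032 on DaVinci).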

Is AI21 worth it? Let's find out by running them against each other on multiple language tasks.

Language Tasks for AI21 vs. GPT-3 Evaluation

We evaluate the models on four basic natural language processing and text generation tasks with practical uses for businesses and education:

Extractive general summarization
Legal summarization
Ad copy generation
Math proof generation

1. Zero-Shot Strict Summarization

Zero-shot strict summarization is the ability of a language model to do extractive summarization according to a given set of rules. This is useful in medical and legal fields where extraction is preferred because abstractive summarization may inadvertently change the meaning.

Zero-shot summarization of technical articles is challenging because they contain domain-specific concepts and external knowledge familiar to a specialized audience. Any summary must sound correct to such an audience.

For this experiment, we asked the models to summarize this section from our technical article on the spaCy text processing framework:

AI21 vs GPT-3: SpaCy text processing framework

The gist here is that spaCy provides an advanced text processing pipeline that's superior to the more common approach of using multiple frameworks with glue code.

The temperature is set to zero for all models to remove any randomness in the predictions. We qualitatively evaluate the summaries as follows:

Is it coherent? Does it convey the gist or the main idea(s) of the article?
Is it strictly extractive?
Does it follow the other conditions like a word limit and having a proper noun?
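
We applied these criteria by hand, but the second and third can be approximated in code. The sketch below is one possible automation, not the procedure used in this comparison: the function names are hypothetical, the sentence splitter is naive, and the proper-noun check is a crude capitalization heuristic (a part-of-speech tagger would be more reliable).

```python
import re


def split_sentences(text):
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p.strip() for p in parts if p.strip()]


def normalize(text):
    """Lowercase and collapse whitespace so formatting differences are ignored."""
    return re.sub(r"\s+", " ", text).strip().lower()


def is_strictly_extractive(summary, passage):
    """True if every summary sentence occurs verbatim in the passage."""
    haystack = normalize(passage)
    return all(
        normalize(sentence).rstrip(".!?") in haystack
        for sentence in split_sentences(summary)
    )


def within_word_limit(summary, limit=10):
    """True if the summary has at most `limit` whitespace-separated words."""
    return len(summary.split()) <= limit


def has_proper_noun(summary):
    """Heuristic: an inner capital (spaCy) or a capitalized non-initial word."""
    for sentence in split_sentences(summary):
        for index, token in enumerate(sentence.split()):
            word = token.strip(".,;:!?\"'()")
            if not word:
                continue
            if any(ch.isupper() for ch in word[1:]):
                return True
            if index > 0 and word[0].isupper():
                return True
    return False
```

Coherence, the first criterion, still needs a human reader; these checks only flag summaries that break the mechanical rules.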

GPT-3 Models

The main prompt used was "write a strictly extractive summary with at least one proper noun. Do not use more than 10 words." But the GPT-3 models seemed unable to understand "strictly extractive summary" and generated abstractive summaries.

Only an alternative prompt that explained how to do extractive summarization without using that term — "select one sentence from the passage that summarizes it" — came somewhat close. But even then, it wasn't strictly extractive.

AI21 vs GPT-3: davinci and curie extractive summary of SpaCy's API

Surprisingly, ChatGPT fared badly on the main prompt and became verbose on the alternative prompt.

AI21 vs GPT-3: ChatGPT extractive summary of SpaCy's API

Jurassic-1 Models

The j1-grande-instruct model fared quite well on this task. It understood the main prompt and produced a nearly verbatim sentence from the passage. It also understood the alternative prompt perfectly.

AI21 vs GPT-3: j1-grande-instruct extractive summary of SpaCy

However, the non-instruct models produced mostly junk output.

Observations

AI21's instruct model does better at extractive summarization than all the GPT-3 models if the text has a single main theme.

2. Legal Clause Extractive Summarization

Summarizing important sections and extracting information from documents is useful for automated document understanding and verification of contracts, court cases, and other legal documents. For example, we have implemented summarization on many legal use cases like understanding master service agreements and improving legal contracts.

In this experiment, we evaluate whether large language models can generate such information without any few-shot training data. These sections from a contract are supplied to the models:

AI21 vs GPT-3: screenshot of sections 4.1 and 4.2

GPT-3 Models

We instructed the DaVinci model to generate a summary and extract important information. However, it appears that it's only able to do one of them at a time:

AI21 vs GPT-3: screenshot of the davinci model standalone prompt

Splitting up the instructions into two sentences works better. The summary is abstractive but retains all important information. Interestingly, more information is extracted in response to the standalone prompt (above) and it's different from that for the mixed prompt (below).

AI21 vs GPT-3: screenshot of the davinci model mixed prompt

ChatGPT generates a fairly extractive summary, but its named entities are not quite what we want:

AI21 vs GPT-3: screenshot of ChatGPT's extractive summary

Jurassic-1 Models

Unfortunately, the Jurassic-1 models didn't do well on this task, both in summary generation and information extraction.

Observations

GPT-3 is capable of processing legal documents out of the box. However, AI21 models show the following problems:

Though the summaries are extractive, they sometimes miss out on entire sections.
The model just isn't capable of extracting information without training.

3. Ad Copy Generation

You can use language models as a copywriter to automatically generate new ads for your products based on their descriptions.

Example 1: Ad Headline

For this task, we start with the following product information as input:

Product information of Lulu Two Strand Braid

GPT-3 Models

Davinci, curie, and ChatGPT ad headline of The Lulu Two Strand Braid

The GPT-3 models did quite well, but Curie and ChatGPT didn't always include the brand name. When we added a new field called "brand" to the input and explicitly instructed the models to mention it, Curie and ChatGPT started including it too:

Curie's ad headline of Lulu Two Strand Braid

Jurassic-1 Models

The AI21 models generated fairly good headlines and followed other instructions like word limits:

J1-grande-instruct ad headline of Lulu Two Strand Braid

Example 2: Ad Body

For this experiment, we selected this random product from the cross-market recommendations dataset and asked the models to generate an ad body for it:

GPT-3

Good ads are eye-catching and exciting. When we asked GPT-3 to produce such ads with specific language to use, it did well:

Jurassic-1

We gave the same instructions to the j1-grande-instruct model. While it did OK, it didn't come up with the kind of excited copy GPT-3 generated, and it didn't follow some of the instructions either.

Observations on Ad Generation

Both models did quite well on this task. GPT-3 was effortlessly creative when the prompts asked for it while AI21 remained quite boring throughout.

We also noticed that the AI21 model sometimes duplicated all the results when asked for a particular number of results. If you're planning to automate this stuff, perhaps to distribute ads automatically to Google or Facebook, ensure that you have some post-generation checks.
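
As a starting point for such post-generation checks, here is a minimal sketch. The function name, its parameters, and the particular checks (duplicates, brand mention, word limit, result count) are our assumptions about what a pipeline might want, not part of either vendor's API.

```python
def check_ad_headlines(headlines, expected_count, brand=None, max_words=None):
    """Return (unique_headlines, problems) for a batch of generated headlines."""
    problems = []
    seen = set()
    unique = []
    for headline in headlines:
        # Compare case-insensitively and ignore extra whitespace.
        key = " ".join(headline.lower().split())
        if not key:
            problems.append("empty headline")
            continue
        if key in seen:
            problems.append(f"duplicate: {headline!r}")
            continue
        seen.add(key)
        unique.append(headline.strip())
        if brand and brand.lower() not in key:
            problems.append(f"missing brand: {headline!r}")
        if max_words is not None and len(key.split()) > max_words:
            problems.append(f"over {max_words} words: {headline!r}")
    if len(unique) != expected_count:
        problems.append(
            f"expected {expected_count} unique headlines, got {len(unique)}"
        )
    return unique, problems
```

A pipeline could regenerate or drop a batch whenever `problems` is non-empty, rather than pushing duplicated or off-brand copy to an ad platform.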

4. Math Proof Generation

Large language models are capable of generating proofs for simple math theorems. In this experiment, we examine how AI21 and GPT-3 perform on proof-by-cases using few-shot training.

First, we provided an example proof from number theory. The proof uses slightly complex concepts and symbols, such as modulo, equivalence, and powers. The models are then asked to prove that the absolute value of any number outside the interval [-1, 1] exceeds one.
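
For reference, a proof by cases of the target statement can be written in a few lines. This is our own sketch of what a correct answer looks like, not the output of either model, and it assumes the statement is read as covering real numbers with x < -1 or x > 1 (so the endpoints are excluded).

```latex
\textbf{Claim.} Let $x \in \mathbb{R}$ with $x < -1$ or $x > 1$. Then $|x| > 1$.

\textbf{Proof.} We consider the two cases separately.

\emph{Case 1: $x > 1$.} Here $x > 0$, so $|x| = x > 1$.

\emph{Case 2: $x < -1$.} Here $x < 0$, so $|x| = -x$. Multiplying $x < -1$
by $-1$ reverses the inequality and gives $-x > 1$, hence $|x| > 1$.

In both cases $|x| > 1$. $\blacksquare$
```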

GPT-3 Models

Despite seeing just a single example, the DaVinci model follows the familiar pattern of proof by cases and proves the theorem without any complications.

Jurassic-1 Models

The j1-grande-instruct model demonstrates some quirks in its proof:

Though asked to prove that |x| > 1, it proves instead that |x| ≥ x.
It introduces an additional variable k though it's not necessary.

While its method produced an acceptable deduction, it's not exactly what was asked.

Few-Shot Training for AI21

Can AI21 perhaps avoid its quirks with some few-shot training? To test that, we gave it some few-shot examples of proof by cases:

Unfortunately, the proofs it generated every time were all garbage, regardless of the temperature value:

AI21 vs GPT-3: j1-grande-instruct temperature

Observations

The AI21 model isn't as good as the GPT-3 model on math proof generation. More few-shot examples may help with the output format but it doesn't look like its core math reasoning is capable enough as of now.

AI21 vs. GPT-3: Which LLM Should You Choose?

AI21's web interface is definitely more feature-rich, making it suitable for non-technical users. It did quite well on general summarization tasks.

But when more creativity was required (like for ad copy) or more domain-specific behavior was asked (for legal documents), it didn't do so well out of the box. So we can conclude that GPT-3 is overall more versatile than AI21 as of now.

Large Language Models Are Increasing in Capabilities

Large language models are getting more capable all the time. At the time of this writing, a new 540 billion parameter model capable of using language and robotic sensors together promises new breakthroughs. This is an incredible time for businesses to come on board this ecosystem and use these versatile models for their workflows. Contact us to find out how you can start using these models to make your work easier.
