Width.ai

BART Text Summarization vs. GPT-3 vs. BERT: An In-Depth Comparison

Karthik Shiraly
·
November 10, 2021

You have probably heard of OpenAI's GPT-3 and ChatGPT by now. They're at the heart of all the news about artificial intelligence (AI) becoming sentient and taking over everyone's job. They're quite amazing, from generating poems on the fly to spitting out software code. But, being cloud-based services, they have disadvantages that make them impractical for some businesses and industries.

Imagine the possibilities for your business if you could, instead, use your own private AI that compares well to GPT-3 on some tasks and is completely under your control. In this article, you'll learn about this versatile AI model and how to use it for language tasks like document summarization.

What Is Text Summarization?

Summarization produces a shorter version of a document, called a summary, that ideally shows the following characteristics:

  • It retains the overall meaning and intention of the original text.
  • It retains only the most salient important details and discards all the other details.
  • It's concise by paraphrasing sentences to remove all unnecessary nouns, verbs, adjectives, adverbs, and filler words.
  • It uses generalization, abstraction, and hypernyms to convey the same meaning while reducing the details.

In natural language processing (NLP), we use two types of summarization:

  • Abstractive text summarization: The summary usually uses different words and phrases to concisely convey the same meaning as the original text.
  • Extractive summarization: The summary contains the most important sentences from the original input text sentences without any paraphrasing or changes. The sentences deemed unnecessary are discarded.

In the real world, often both are used together. You may want abstractive summarization for some sections of the document and extractive summarization for others.

In the next section, you'll learn about a deep neural network that can do abstractive summarization.

What Is BART?

BART model architecture — just standard encoder-decoder transformer (Vasvani et al.)
BART model architecture — just standard encoder-decoder transformer (Vasvani et al.)

BART stands for bidirectional autoregressive transformer, a reference to its neural network architecture. BART proposes an architecture and pre-training strategy that makes it useful as a sequence-to-sequence model (seq2seq model) for any NLP task, like summarization, machine translation, categorizing input text sentences, or question answering under real-world conditions. In this article, we'll focus on its summarization capabilities.

How Does BART Summarize Text?

BART is just a standard encoder-decoder transformer model. Its power comes from three ideas:

  • Like Google's pioneering bidirectional encoder representations from transformers (BERT) model, BART's encoder is bidirectional. It uses the context from both sides of a word.
  • Like OpenAI's generative pre-trained transformer (GPT) model, and unlike BERT, BART uses an autoregressive decoder to generate text. An autoregressive model is one whose choice of the current word depends on what it chose as its previous word. Since that's the underlying principle of languages too, autoregressive models can generate natural-sounding language sequences.
  • Like BERT's use of random masked words to force it to learn the language even when crucial words are missing, BART uses multiple noising transformations to deliberately make its data difficult and then learns to generate text well under those conditions.

Pre-Training

Noising transformations during bart training
Noising transformations to deliberately make the learning difficult (Lewis et al.)

During pre-training, BART learns a language model that contains all the complexities and nuances of a real-world natural language. By deliberately introducing all kinds of noise, deletions, and modifications in its input data to make it difficult to learn, BART gains the ability to generate linguistically correct sequences even when input text is noisy, erroneous, or missing.

This denoising language model is versatile and can be adapted for any downstream text generation or understanding tasks like summarization, question-answering, machine translation, or text classification.

Fine-Tuning for Summarization Tasks

To learn to summarize at a high level, a pre-trained BART model is fine-tuned on a summarization task dataset. Such a dataset consists of pairs of input documents and their manually created summaries, often by subject matter experts that have a great idea of the perfect “goal state” summary for the use case. Summaries are often considered “relative” as what it means to be a good summary mostly depends on the user and the information they were expecting to see in the summary. 

Fine-tuning refines the pre-trained language model's network weights to learn summarization-specific concepts like paraphrasing, saliency, generalization, hypernymy, and more. A fine-tuned bart model has learned the underlying relationships between various patterns in the input text and the goal output text.

Summarization Use Cases

How does the BART summarization model compare to the other summarization models out there? Research groups still compare these models using the old recall-oriented understudy for gisting evaluation (ROUGE) metrics. But ROUGE looks for common words and n-grams between the generated and reference summaries — the more there are, the higher the score. Since abstractive models paraphrase the text, they may not score well, and high scores may not result in good summaries under real-world conditions.

So, we compared these models qualitatively by reviewing their summaries, as readers in the real world are likely to. This is the preferred method of evaluating all types of natural language generation. We evaluated them in four industries where long-form texts are common and summaries are useful:

  • Education
  • Health care
  • Legal industry
  • Scientific and applied research

We also evaluated them on longer documents like research papers and books, a real-world phenomenon seen in all four industries.

Models Compared

We evaluated 12 models: four state-of-the-art (SOTA) abstractive models that can be run locally, four SOTA extractive models that can be run locally, and the four famous GPT-3 models running on OpenAI's cloud. Here are the 12:

  1. BART (abstractive) using the bart-large-cnn model from HuggingFace transformers framework
  2. T5 (abstractive)
  3. PEGASUS (abstractive)
  4. PEGASUS fine-tuned on the CNN Daily Mail news dataset (abstractive)
  5. GPT-3 davinci-003 (abstractive)
  6. GPT-3 curie-001 (abstractive)
  7. GPT-3 babbage-001 (abstractive)
  8. GPT-3 ada-001 (abstractive)
  9. BERT (extractive)
  10. Sentence-BERT (extractive)
  11. XLNet (extractive)
  12. GPT-2 (extractive)

All eight local models were run on Kaggle's CPUs without any GPUs. For the four GPT-3 models, the prompt was "Summarize this in 4 sentences:" followed by the document text.

In the sections below, we analyze the summaries generated by these models in each industry.

Summarization in Education

We started with a relatively simple use case of summarizing essays. Both students and teachers may use such a feature to improve their understanding and writing. It's useful outside the education sector too — to summarize news articles, for example. 

We evaluated the models on this student essay from Kaggle's automated essay scoring dataset:

Sample essay (Source: Kaggle's automated essay scoring dataset)

As you can see, the essay has some spelling mistakes, grammar errors, and structural weaknesses. We can consider it a rather noisy input, the kind that's quite common in the real world. The student has used the first-person point of view.

How do the models fare?

bart summarization for essays
  • BART does a good job of producing a grammatically correct summary that covers both the focus areas — health and personality. It also removes all the spurious, unreadable text (like "@CAPS2"). Surprisingly, it manages to produce a third-person summary from a first-person write-up. The flow could be a bit better though.
  • T5 makes some grammar errors and retains some spurious text. The opinions on health are mostly missing.
  • Both the PEGASUS models generate an unacceptably high number of spelling and grammar errors. They too retain some spurious text.
gpt-3 summarization for essays

The four GPT-3 abstractive models fare well, with all of them changing the perspective to the third person and getting rid of the spurious text:

  • Davinci produces a grammatically correct summary with good flow.
  • Curie is a bit weak on the health front, hand waving away the details with just one word — "harmful."
  • Babbage produces a great summary but generates them as a numbered list for some reason.
  • Ada too has good flow but unfortunately generates a few spelling and grammatical errors.
BERT summarization for essays

The four extractive models, though SOTA, retain too many errors and spurious text to be useful in practice. Use them only if your documents are of good quality.

Summarization of Medical Reports

From simple general essays, we move to the highly specialized domain of medical reports. Their summaries are useful to doctors, nurses, researchers, and insurers.

We use a medical report from the Kaggle medical transcriptions dataset as the input document.

Sample medical report (Source: Kaggle medical transcriptions dataset)
Sample medical report (Source: Kaggle medical transcriptions dataset)

The report contains multiple sections that may be important to different medical specialists. An ideal summary must retain all the sections and produce a summary per section. However, due to limits imposed by the models on the input lengths, we had to crop out the last 10% or so of the input document. These limits are not inherent and can be increased by retraining the models.

Bart summarization of medical documents
  • BART, T5, and fine-tuned PEGASUS don't retain the sections at all and discard too much important information. The only plus point for BART is a grammatically correct summary.
  • Base model PEGASUS does a good job of retaining and summarizing the sections.
Gpt-3 summarization of medical documents

The first three GPT-3 models — Davinci, Curie, and Babbage — retain and summarize much of the relevant details (along with some irrelevant ones) though they discard the sections. Considering the size difference between these models it’s interesting to review the similarity of the outputs. 

output summarization results

In contrast, the Ada model reproduces all the sections and also summarizes each one to some extent, but it does feel like it went off-track and retained too much.

BERT summarization of medical documents

The four extractive models discard too many details. But this is not inherent and can be overcome by increasing parameters like the number of sentences to retain.

Legal Documents Summarization

We now move to another specialized domain: legal documents like contracts, agreements, and clauses.

For our tests, we use a rather lengthy legal contract from the contract understanding Atticus dataset.

legal document summarization input
Sample legal contract (Source: contract understanding Atticus dataset)

It has about 23 clauses covering all aspects of a business relationship. Again, due to model limitations (which can be increased), we had to crop the document to about 40%.

Legal document summarization with BART and T5
  • BART's summary is missing all basic details but generates a grammatically correct summary. This allows us to generate summaries with more abstract information then some of the other options. 
  • T5 is also missing most details but at least tells us the two parties who are signing the contract.
  • Base model PEGASUS fares best and retains some important information.
  • Fine-tuned PEGASUS produces the least useful summary. These legal document inputs are way too different from the fine-tuning dataset. This is a great example of how fine-tuning doesn’t mean your model is better at everything!
gpt-3 summarization of legal documents

The Davinci GPT-3 model produces a very high quality abstract like summary with agreement details. Babbage generates a rather extractive summary while Curie fails on this document, possibly due to its odd format.

BERT summarization of legal documents

All four extractive models do well to tell us the signing parties, but the usefulness of the other details is relative as most summarization tasks of information dense documents.

Summarization of Research Papers

The final comparison is on the challenge of summarizing scientific research papers. We use an astronomy research paper from the arXiv summarization dataset as our test sample.

research paper summarization
Sample research paper (Source: arXiv summarization dataset)

It's a lengthy document full of jargon, numbers, and noisy text like LaTeX symbols. Like the previous two domains, we had to crop it, retaining just about 12% of the full document but that includes the introduction and overview parts of the typical research paper.

BART and PEGASUS summarization of research papers
  • None of the abstractive models produce a good summary. All of them, including BART, even retain some spurious text.
  • Base model PEGASUS' summary is slightly better than the other three. Once again we can see that the training data for fine-tuned PEGASUS does not include these types of documents, and it struggles way more than just using the base model.
Davinci model summarization of research papers
Curie model summarization of research papers
Babbage summarization of research papers
Ada summarization of research papers

All four GPT-3 do a fairly good job of summarizing the paper's chosen topic. Davinci takes the task quite literally and summarizes everything, but leaves out the most important one-line summary. Curie, Babbage, and Ada include a one-line summary of the paper and the first two even retain considerable detail.

BART summarization comparison

All four extractive models fare very poorly. It looks like they are tripping over some spurious text.

Handling Long Documents

Three of the four analyses above used long documents that went beyond the limits imposed by the models. For example, the text supplied to BART must not go beyond 1024 subword tokens and longer documents have to be cut. Important context and details that the summarizers can use are lost forever.

In the real world, most documents like contracts and research papers breach these rather small limits. So, what can you do to retain all the information without cutting out anything? There are a couple of approaches you can take:

  • Chunking algorithms: We build these custom algorithms that learn how to understand key contextual information in the documents and dynamically split the text without running the context. These algorithms are the best way to scale your summarization capabilities in long form inputs. We talk more about this in our gpt-3 summarization article.
  • Build long-context models from short-context ones: BART-LS, for example, tinkers with its network architecture and training to handle long documents.
  • Hierarchical summarization: Models like HAT-BART use hierarchical attention transformers that start by looking at the entire document at a high level and zoom in incrementally down to the word level.
  • Iterative summarization: Conceptually, this is like hierarchical summarization but simpler to implement if the quality doesn't matter as much. It's a divide-and-conquer strategy where you divide the document into multiple chunks that are within the model's limits, summarize each chunk, combine the summaries, again divide into chunks, summarize each one, combine, and repeat till the desired length is reached.

11 Reasons to Use BART

Based on the comparisons above, we can observe many benefits (and potential benefits) of using BART for your summarization needs.

1. Most Resilient to Real-World Noisy Data

BART's not the only model that learns to generate text when text is noisy or missing. BERT deliberately masks tokens, and PEGASUS masks entire sentences. But what sets BART apart is that it explicitly uses not just one but multiple noisy transformations. So, BART learns to generate grammatically correct sentences even when supplied with noisy or missing text. That's why it's the only model in our experiments that manages to get rid of most of the spurious text in every domain.

The BART paper demonstrates by scoring better than other comparable models on most benchmarks:

BERT vs BART model comparison
Source: Lewis et al.

2. Acceptable Results Out-of-the-Box Across Many Domains

As we saw above, BART produces good results on a number of input types for advanced model summarization. The other models are more finicky, faring very well sometimes and very poorly other times. Such inconsistency generally requires more post-processing of the generated text using complex rules. BART doesn't suffer those problems as much.

3. Produces Grammatically Correct Summaries

Related to the previous point, grammatical errors of the other models may be particularly difficult to eliminate during post-processing. BART manages to generate grammatically correct text almost every time, most probably thanks to explicit learning to handle noisy, erroneous, or spurious text.

4. BART's Quality Is Comparable to the Smaller GPT-3 Models

As we saw, BART's summaries are often comparable to GPT-3's Curie and Babbage models. So, businesses can get close to GPT-3's quality and versatility but with lower costs and more flexibility regarding fine-tuning and deployments, since BART can run locally on private servers and GPUs.

5. Data Security and Privacy for Businesses

Some industries are just not comfortable about uploading their internal sensitive documents to a cloud provider like OpenAI or AWS. BART enables them to summarize their documents securely with good quality on their private servers.

6. Overcome Limitations of GPT-3

GPT-3 is an amazing service but also has some limitations:

  • Limits on the lengths of input data that you can't change
  • Rate limits and quotas on calling their application programming interface (API)
  • Highly generalized capabilities can sometimes produce poor-quality summaries compared to a specialized model like BART that's only trained for summarization
  • Higher business expenses because each call to an API incurs a financial cost

A model like BART that you can fine-tune and run on your servers doesn't suffer these problems as badly.

7. Simple to Fine-Tune

The HuggingFace library provides excellent pre-trained models and scripts to help you fine-tune BART on your custom data.

8. Backbone for More Advanced Models

BART and PEGASUS are the most common backbones on which more advanced models implement their improvements. These advanced models, like MoCa, BRIO, and SimCLS, that head the SOTA leaderboards on various datasets are using BART as their backbone.

9. Relatively Fast

Among the abstractive non-cloud models, BART often takes less time compared to T5 or PEGASUS. The summarization is near real-time.

10. Runs on CPUs Without GPUs

BART inference runs fine on CPUs without requiring any GPUs. For training, of course, you'd want a GPU.

11. Efficient Model Size

If we compare model file sizes (as a proxy to the number of parameters), we find that BART-large sits in a sweet spot that isn't too heavy on the hardware but also not too light to be useless:

  • GPT-2 large: 3 GB
  • Both PEGASUS large and fine-tuned: 2.1 GB
  • BART-large: 1.5 GB
  • BERT large: 1.2 GB
  • T5 base: 850 MB
  • XLNet: 440 MB

A Highly Capable Language AI for Your Company

A versatile language model like BART is essentially a private, self-managed GPT3-level model for your company's needs. You can use it for intelligent document understanding, machine translation, summarization, document classification, and much more while keeping all your sensitive business information secure and private. Contact us for expertise in building up your AI and machine learning capabilities!

References

  • Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, Luke Zettlemoyer (2019). “BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension”. arXiv:1910.13461 [cs.CL]. https://arxiv.org/abs/1910.13461v1 
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin (2017). "Attention Is All You Need". arXiv:1706.03762 [cs.CL]. https://arxiv.org/abs/1706.03762