4 Powerful Long Text Summarization Methods With Real Examples

Matt Payne
October 20, 2022

text summarization models

Text summarization is an NLP process that focuses on reducing the amount of text from a given input while at the same time preserving key information and contextual meaning. With the amount of time and resources required for manual summarization, it's no surprise that automatic summarization with NLP has grown across a number of different use cases for many different document lengths. The summarization space has grown rapidly with a new focus on handling super large text inputs to summarize down into a few lines. The increased demand for the summarization of longer documents such as news articles and research papers has driven the growth in the space.

The key changes that have led to the new push in long text summarization are the introduction of transformer models such as BERT and GPT-3 that can handle much longer input sequences of text in a single run and a new understanding of chunking algorithms. Past architectures such as LSTMs or RNNs were not as efficient nor as accurate as these transformer based models, which made long document summarization much harder. The growth in understanding of how to build and use chunking algorithms that keep the structure of contextual information and reduce data variance at runtime has been key as well. 

Difference Between Large & Small Text Summarization

Packing all the contextual information from a document into a short summary is much harder with long text. If our summary has to be say 5 sentences max it is much harder to decide what information is valuable enough to be added with 500 words vs 50,000 words. 

Chunking algorithms are often required, but they do grow the data variance coverage a model must have to be accurate. Chunking algorithms control how much of the larger document we pass into a summarizer based on the max tokens the model allows and parameters we’ve set. The new dynamic nature of the input data means our data variance is much larger than what is seen with smaller text.

Longer documents often have much more internal data variance and swings in the information. Use causes such as blog posts, interviews, transcripts and more have multiple swings in the dialog that make it harder to understand what contextual information is valuable for the summary. Models have to learn a much deeper relationship between specific keywords, topics, and phrases as the text grows. 

There are two main types of summarization that are used as the baseline for any enhanced versions of summarization - Extractive and abstractive. They focus on how the key information found in the input text is reconstructed in the generated summary in their own ways. Both of these methods have their own unique challenges that pop up when looking at using longer text.  

6 Key Methods of Long Text Summarization

There are a number of different methods of summarizing long document text using various architectures and frameworks. We’ll look at some of the most popular ones used today and the various use cases that we have seen them perform exceptionally well for. 

Long Text Extractive Summarization with BERTSUM

Extractive summarization works to extract near exact phrases or key sentences from a given text. The processing pipeline is similar to that of a binary classification model that acts on each sentence. 

BertSum is a fine-tuned BERT model with a focus on extractive summarization. BERT is used as the encoder in this new architecture and has state of the art ability in handling long inputs and word to word relationships. This pipeline of using BERT as a baseline and creating a domain specific fine-tuned model is pretty popular as leveraging the baseline training can provide a large accuracy boost in the new domain and requires fewer data points.

bertsum model
BertSum Research Paper

The key change when going from baseline to BertSum is how the input format changes. We are representing individual sentences instead of the entire text at a single time so we add a [CLS] token to the front of each sentence instead of the entire document. When we run through the encoder we use these [CLS] tokens to represent our sentences. 

Pros & Cons of BertSum

A key pro of the BertSum architecture is the flexibility that comes with BERT in general. The size and speed of these models are often something that becomes part of the equation when looking at how to get through long input text. There are a number of variations of BertSum that use different baseline BERT models such as DistilBERT and MobileBERT which are much smaller and achieve nearly identical results.

accuracy metrics of different bert models
Nearly Identical accuracy numbers yet much smaller (source)

Another pro generally seen with extractive summarization is the ease with which you can evaluate for accuracy. Given the near exact sentence nature of the summarized text, it’s much easier to compare to a ground truth dataset for accuracy than it is with abstractive summarization.

The main con we see with long text summarization using BertSum is the underlying token limit of BERT. BertSum has an input token limit of 512 which is much smaller than what we see today with GPT-3 (4000+ in the newest instruct version). This means for long text summarization we have to do a few special things:

1. Build chunking algorithms to split up the text.

2. Manage a new level of data variance considering each text chunk doesn’t contain all contextual information from above.
3. Manage combining chunk outputs at the end.

I won’t dive into the full level of depth on this point (considering I did it here: gpt-3 summarization), but the even lower token limit of BertSum can cause issues. 

News Article Summarization With BertSum

We use BertSum to create a paragraph length summary of news articles that contain exact important sentences pulled from the longform text. BertSum lets us adjust the length of the summary that we want to use which directly affects what information from the article is deemed important enough to be included. 

We can create a paragraph-long article summary of this article: https://www.cnn.com/2020/05/29/tech/facebook-violence-trump/index.html in a pipeline containing multiple modules. The summary is seen below made up of the main points. 

summary example from article

Blog Post Summarization With BertSum

We’ve used the same exact existing models pipeline seen above to summarize any blog posts or guest posts of any size into a short and sweet summary. We took one of our longest articles with over 6k words and summarized it into a paragraph. We can always adjust the output length of the summary to better fit the amount of information or length of the blog post as well. 

blog post summary from article

Book Summarization With Human Feedback

OpenAi has released a new summarization pipeline focused on using reinforcement learning and recursive task decomposition to summarize entire books with abstractive summarization. They use a fine-tuned GPT-3 model that is created with a large number of summaries created by humans. The model works to summarize small sections of the book and then recursively summarize the generated summaries into a final full book summary. They end up using a number of models trained to work on smaller parts of the larger task that help humans provide quality feedback on the summaries. 

The model ends up achieving close to human level summarization accuracy for 5% of the books and just slightly less than human level 15% of the time. 

book summarization architecture

Pros & Cons of the Book Summarization Approach

The approach taken by OpenAi solves a key problem often seen in abstractive summarization of super long input text. Human feedback evaluation and comparing machine summaries to human ones are some of the most popular ways of evaluating summaries for accuracy. One of the problems here is that a human would have to have read the entire book to write a summary if we are using a model that generates from the entire book. This makes it very cost inefficient and resource intensive to evaluate the generated summaries. With the smaller chunk approach, the human evaluator just has to understand and evaluate the smaller section of the book.

The recursive task decomposition route includes a number of benefits in evaluating the system. It’s much easier to break down and trace the summarization process with chunks over trying to understand how the model goes from full text to summary. We can see greater detail in the intermediate steps before generating the final full summary that helps us map key information. 

Given the broken down approach, you could argue there are no limits to the size of the book length. This is a benefit that contrasts what we saw earlier with the single model BertSum approach.

Summarization Example

summarization output example

Longformer Summarization: Long-Document Transformer

long document transformer architecture

As we saw in the cons for BertSum, transformer based models struggle to handle super long input sequences due to their self-attention mechanism which scales quickly as the sequence length grows. The longformer works to fix this as it uses an attention mechanism that scales linearly with the sequence length as a local windowed attention and task motivated attention. This makes it much easier to process documents with thousands of tokens in a single instance. We use a variation of this called Longformer Encoder-Decoder for the summarization task. 

The longformer architecture is set up in a similar way to how we use BERT in the modern age. The longformer is often used as a baseline language understanding model and then retrained or fine-tuned for various tasks such as summarization, QA, sequence classification, and many more. The architecture consistently outperforms BERT based models such as RoBERTa on long document tasks and has archived state of the art results in a few domains. 

Given many of the struggles we discussed in the above sections that come with summarizing long text with any model it's pretty easy to see how the longformer encoder decoder benefits us. By extending the number of tokens we can use in a single run we can limit how many chunks of text we need to create. This greatly helps us hold the important context from the long document in place without splitting it into any group of chunks, which leads to stronger summaries at the end. The longformer also makes it much easier to combine all the smaller chunk summaries into one final summary as we have less to combine. 

Longformer Summarization With 8k Tokens

longformer code example
Code example

Recently the longformer encoder-decoder architecture was fine-tuned for summarization on the very popular Pubmed summarization dataset. Pubmed is a large document dataset built for summarization of medical published papers. The median token length in the dataset is 2715 with 10% of the documents having 6101 tokens. With the longformer encoder-decoder architecture we will set the maximum length to 8192 tokens and can be done on a single GPU. In the most popular examples, we set the maximum summary length to 512 tokens. 

Long Text Abstractive summarization with GPT-3

gpt-3 cover example

What is Abstractive Summarization?

Abstractive summarization focuses on generating a summary of the input text in a paraphrased format that takes all information into account. This is very different from what we see with extractive summarization as abstractive summarization does not generate a paragraph made up of each “exact best sentence”, but a concise summary of everything. 

What is GPT-3?

GPT-3 is a transformer based model that was released by OpenAi in June of 2020. It is one of the largest transformer models to date with 175 billion parameters. The large language model architecture was trained on a very large amount of data including all of common crawl, all of Wikipedia, and a number of other large text sources. The model is autoregressive and uses a prompt based learning format which allows it to be tuned at runtime for various use cases across NLP.

GPT-3 For Long Text Abstractive Summarization

GPT-3 has a few key benefits that make it a great choice for long text summarization:

1. It can handle very long input sequences 

2. The model naturally handles a large amount of data variance 

3. You can blend extractive and abstractive summarization for your use case.

The last bullet is by far the biggest benefit of using GPT-3. Oftentimes summarization tasks are not strictly extractive or abstractive in the nature of what a completely correct output looks like. Some use cases such as this key topic extraction pipeline the summary is presented as bullets with some parts of exact phrases from the text, but mirror more of a 75/25 abstractive split. This ability to blend the two rigid ideas of summarization is an outcome of the extreme flexibility offered by the prompt based programming of GPT-3. The prompt based logic allows us to start generating these custom summaries without needing to find a pretrained model exactly like we need or creating a huge dataset. 

width.ai zero shot summarizer tool pipeline
Width.ai Zero Shot Summarizer Tool Pipeline

Get Started With Long Text Summarization

Width.ai is an NLP & CV development and consulting company focused on building powerful software solutions (like a text summarizing tool!) for amazing organizations. We help you through the entire process of outlining what you need for your use case, what models will produce the highest accuracy, and how we build exactly what you need. Speak to a consultant today to start outlining your own long text summarization product. 

width.ai logo