Width.ai

Evaluating GPT-4 Zero-Shot Summarization for Streamlining Workflows


Summarization workflow for news article summarization

Documents can often have low signal-to-noise ratios, forcing employees to spend a lot of time reading entire documents just to extract the information that's actually valuable to the business. That's why the summarization of large volumes of text has become one of the most appreciated natural language processing (NLP) capabilities of large language models (LLMs).

In this article, we explore the summarization capabilities of OpenAI's latest generative pre-trained transformer (GPT). Specifically, we compare GPT-4's zero-shot summarization against the older GPT-3 and against state-of-the-art, self-hosted language models.

GPT-4 vs. GPT-3 Capabilities

In this article, the term "GPT-3" covers both the GPT-3.5 models, like gpt-3.5-turbo and text-davinci-003 (the GPT-3 davinci "instruct" model), as well as the older GPT-3 models. All GPT-3 evaluations here were done using gpt-3.5-turbo, which is considered the best model of its generation.

Compared to gpt-3.5-turbo, GPT-4:

  • Supports 2x-8x higher token limits than GPT-3, going up to 32,000 tokens
  • Follows prompt examples less closely than GPT-3, likely because of its RLHF training; this makes it more adventurous and creative but also harder to wrangle into a specific format
  • Has a better understanding of code and how to generate complete scripts; improved code handling is one of the largest iterations from GPT-3 to GPT-4

In the following sections, we compare GPT-4 against GPT-3 and state-of-the-art, self-hosted models on three zero-shot summarization use cases:

  • News summarization examines GPT-4's ability to retain important information in abstractive summaries
  • Long-form blog post summarization explores how summaries compare when the documents exceed token limits
  • Opinion summarization evaluates how well aspect-based insights can be extracted from large volumes of customer reviews

All of these experiments use zero-shot summarization, which means the models are only provided with an instruction and the text to summarize. Unlike few-shot learning, no demonstrative examples are provided to guide the GPT models. We don't use any fine-tuned GPT models either, just the stock ones available at the application programming interface (API) endpoints. However, some of the self-hosted models may have been fine-tuned for summarization.

Abstractive News Summarization With GPT-4

In 2022, Goyal et al. evaluated GPT-3 on news summarization using the GPT-3.5 text-davinci-002 model. They compared its results with a state-of-the-art, self-hosted model called BRIO that's fine-tuned for summarization.

Here, we find out how GPT-4 stacks up against both GPT-3.5 and BRIO on the same datasets and benchmarks.

Approach

As in the paper, we summarize articles from the CNN/Daily Mail news dataset. The dataset provides human-annotated reference summaries that are highly extractive but not fully so (for example, select phrases may make it into the summaries while their parent sentences do not).

Since these articles are quite short and don't exceed the token limits of any of the models, all the details have a chance to make it to the summaries without getting cut out.

We instruct both GPT-4 and GPT-3 to summarize via the chat completion API. The temperature parameter, which influences the creativity of generations, is set to 0.75 for all experiments.
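
The snippet below is a minimal sketch of how such a request can be made; it assumes the pre-1.0 `openai` Python package, and the `summarize` helper and placeholder API key are illustrative rather than the exact code used in our experiments:

```python
import openai

openai.api_key = "YOUR_OPENAI_API_KEY"  # placeholder

def summarize(prompt: str, article_text: str, model: str = "gpt-4") -> str:
    """Request a zero-shot summary; temperature is 0.75 in all our experiments."""
    response = openai.ChatCompletion.create(
        model=model,  # "gpt-4" or "gpt-3.5-turbo"
        temperature=0.75,
        messages=[{"role": "user", "content": f"{prompt}\n\n{article_text}"}],
    )
    return response["choices"][0]["message"]["content"]
```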

For summarization using BRIO, we use Hugging Face's pipeline interface.
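
A minimal sketch of that pipeline call is shown below; the checkpoint name is the publicly hosted CNN/Daily Mail BRIO model, which you should treat as an assumption:

```python
from transformers import pipeline

# Load BRIO through the standard summarization pipeline.
summarizer = pipeline("summarization", model="Yale-LILY/brio-cnndm-uncased")  # assumed checkpoint name

article_text = "..."  # the news article to summarize
result = summarizer(article_text, max_length=130, min_length=30, do_sample=False)
print(result[0]["summary_text"])
```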

We then evaluate the generated summaries using the reference-based and reference-free metrics mentioned in the paper, like ROUGE, BERTScore, and SUPERT.

We also have the summaries evaluated qualitatively by human evaluators, who are asked to select their most and least preferred summaries and provide subjective reasons for those choices.

Prompts For Summarization

Two different instruction prompts are run with both GPT models. The first prompt is the same simple one used by Goyal et al., with the number of output sentences set to 3:

Simple prompt for news summarization (Source: Goyal et al.)

The second prompt is more complex, designed to test GPT-4's ability to follow complex instructions:

Complex prompt for news summarization

Example News Article Provided for Evaluation

One test article is a medical story of a kidney transplant that ended up saving six lives thanks to a computer program that finds groups of genetically compatible donors and recipients who can exchange kidneys safely. An excerpt from the article and the gold reference summary are shown here:

Test article and reference summary from the CNN/Daily Mail dataset (Source: Nallapati et al.)

Generated Summaries

GPT-4 and GPT-3 generate these summaries with the simple prompt:

Summaries generated by GPT-4 and GPT-3 with the simple prompt

GPT-4 ignores the donor's story and focuses almost entirely on the computer program and its creator. In contrast, GPT-3 focuses equally on both the donor and the enabling program.

The complex prompt results in a more balanced focus by GPT-4. However, the concept of a chain of matches is better explained by GPT-3:

Summaries generated by GPT-4 and GPT-3 with the complex prompt

Unlike all the above summaries, BRIO focuses entirely on the donor and attributes the success to big data instead of the specific program and its creator:

Summary generated by BRIO

ROUGE Metrics

ROUGE metrics measure text overlaps. If a generated summary has many of the same words and n-grams as its reference summary, its ROUGE F1-scores will be higher. They're good metrics for extractive summaries but not ideal for abstractive summaries.
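
For reference, this is a sketch of how the ROUGE F1-scores can be computed with the rouge-score package (one reasonable choice of tooling; the exact library used elsewhere may differ):

```python
from rouge_score import rouge_scorer

reference_summary = "..."  # gold reference summary from the dataset
generated_summary = "..."  # summary produced by a model

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(reference_summary, generated_summary)  # target first, prediction second

for name, result in scores.items():
    print(f"{name}: F1 = {result.fmeasure:.3f}")
```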

ROUGE F1-scores for summarization with the simple and complex prompts

Across five random news articles, GPT-3 with the complex prompt scores better than all the other GPT variants. However, GPT-3 with the simple prompt is nowhere close. Which instruction in the complex prompt causes this contrasting behavior remains unclear; a prompt ablation study or a chain-of-thought deconstruction may shed more light.

BRIO outperforms all the GPT models on these metrics, suggesting that BRIO's summaries are more extractive and less abstractive. The paper's authors observed the same pattern:

ROUGE metrics comparison in the paper (Source: Goyal et al.)

BERTScore Metrics

BERTScore is a much more semantic evaluation of the summary because it uses a language model to evaluate how close the generated summary is to the reference summary. The table below shows the mean F1-scores of each model's BERTScore:

BERTScores for summarization

All the models perform comparably, with zero-shot GPT-4 slightly ahead of zero-shot GPT-3 and on par with the fine-tuned, state-of-the-art model.
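
For reference, a minimal sketch of the BERTScore computation using the bert-score package is shown below (candidates and references are aligned lists of strings; the placeholder texts are illustrative):

```python
from bert_score import score

references = ["..."]  # gold reference summaries
candidates = ["..."]  # generated summaries, aligned with the references

# Returns precision, recall, and F1 tensors, one value per candidate/reference pair.
P, R, F1 = score(candidates, references, lang="en")
print(f"Mean BERTScore F1: {F1.mean().item():.3f}")
```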

Long-Form Blog Post Summarization

Summarization of long-form documents like articles, white papers, legal agreements, medical reports, or research papers can help many industries improve employee productivity by optimizing time spent on reading.

Unfortunately, all LLMs have token limits that hinder their use for this task. The future looks promising, though: a GPT-4 variant with a 32,000-token context window (roughly 24,000 words) is already available to some users, and alternative models from Anthropic with even larger context windows are on the horizon.

But for now, most GPT-4 users are limited to 8,192 tokens, or about 6,000 words. To overcome this limit, we must use strategies like chunking. In this section, we evaluate GPT-4's summaries for long-form documents using long-form blog posts as examples.

Approach

We run these summarization experiments on the longest posts in the wikiHow knowledge base, which is available as a summarization dataset with reference summaries.

As of March 2023, the state-of-the-art summarization model for this dataset combines a BERT encoder with separate decoder generators for abstractive and extractive summaries, introduced by Savelieva et al.

We explore how GPT-4 and GPT-3 zero-shot summarization compare with it. Both models are asked to generate abstractive and extractive summaries using suitable prompts, and the generated summaries are evaluated on ROUGE metrics.

We deliberately select posts that are longer than the token limits of these GPT models. This allows us to evaluate recursive summarization through chunking. We use the LangChain map-reduce pipeline and set the chunk size to 3,000 tokens and chunk overlap to 100 tokens to preserve some local context between chunks.
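
A minimal sketch of that map-reduce setup is shown below, assuming the 2023-era langchain package layout; the placeholder post text is illustrative:

```python
from langchain.chat_models import ChatOpenAI
from langchain.chains.summarize import load_summarize_chain
from langchain.docstore.document import Document
from langchain.text_splitter import TokenTextSplitter

llm = ChatOpenAI(model_name="gpt-4", temperature=0.75)  # 0.75 for the abstractive runs

# Split the post into ~3,000-token chunks with a 100-token overlap so that
# some local context is preserved between neighbouring chunks.
splitter = TokenTextSplitter(chunk_size=3000, chunk_overlap=100)
long_post_text = "..."  # a wikiHow post longer than the model's token limit
docs = [Document(page_content=chunk) for chunk in splitter.split_text(long_post_text)]

# Map-reduce: summarize each chunk, then summarize the chunk summaries.
chain = load_summarize_chain(llm, chain_type="map_reduce")
summary = chain.run(docs)
print(summary)
```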

For the abstractive prompts, the temperature parameter is set to 0.75 so that the GPTs are free to rephrase the posts. But for extractive prompts, it's set to zero to prevent any kind of rephrasing and just select sentences in the posts.

Out of the 215,364 posts in the dataset, only nine are longer than 8,192 tokens (GPT-4's token limit) and only four of those nine are general posts rather than highly technical posts. These four are used for the evaluation.
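
A small sketch of how the over-limit posts can be identified with tiktoken follows; the `posts` list is a stand-in for the wikiHow article bodies:

```python
import tiktoken

# cl100k_base is the encoding used by GPT-4 and gpt-3.5-turbo.
encoding = tiktoken.get_encoding("cl100k_base")

def num_tokens(text: str) -> int:
    return len(encoding.encode(text))

posts = ["...", "..."]  # wikiHow post bodies
long_posts = [post for post in posts if num_tokens(post) > 8192]
```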

Prompt for Abstractive Summarization

For abstractive summarization, we didn't provide any custom prompt and just relied on LangChain's prompt:

prompt for abstractive blog summarization

Prompt for Extractive Summarization

For extractive summarization, we provided the following custom prompt:

prompt for extractive summarization

Since GPT-3, in particular, showed a tendency to rephrase sentences despite being told to just "select 3 sentences," it was asked to "just pick 3 sentences" instead and explicitly instructed not to rephrase any sentence. This prompt iteration worked much better than “select”.

Example Post

An example post and its reference summary are shown below:

Example post from the wikiHow dataset
Reference summary provided in the dataset

Generated Summaries

The abstractive summaries generated by GPT-4 and GPT-3 are shown below:

Abstractive summaries generated by GPT-4 and GPT-3

Specific instructions have been generalized into more abstract guidelines in these summaries.

The extractive summaries are shown below:

Extractive summaries generated by GPT-4 and GPT-3

Since our prompt limited the models to just three key sentences, each has picked three that convey the main ideas of the article. Arguably, GPT-3 has done a slightly better job here, with more emphasis on the do's and don'ts than GPT-4.

Evaluation

The mean ROUGE F1-scores for GPT-4 and GPT-3 are shown below:

Mean ROUGE F1-scores for GPT-4 and GPT-3

We see that GPT-4 abstractive scores lower than GPT-3 abstractive, indicating that GPT-4 is much more abstractive for the same prompt.

At the same time, GPT-4 extractive scores are higher than GPT-3 extractive, indicating that GPT-4 follows strict instructions better.

Overall, the mean F1-scores are low because the reference summaries are generated by summarizing each paragraph, while our prompts zero-shot summarize the entire long-form document rather than going paragraph by paragraph.

The metrics from the BERTSum model in the paper are shown below. Since it summarizes paragraph by paragraph, it scores higher:

ROUGE metrics of the BertSum model (Source: Savelieva et al.)

Opinion Summarization

Many businesses find themselves overwhelmed by the deluge of reviews and feedback their products and services receive. Often, these reviews contain nuggets of useful information that are valuable to the business. But the volume of data makes it difficult to isolate the useful information from all the noise.

In this context, opinion and aspect summarization can help businesses extract useful information from product reviews by isolating reviews along multiple aspect axes. This approach was explored by Bhaskar et al. in their paper, Zero-Shot Opinion Summarization with GPT-3. In this section, we extend their exploration to GPT-4.

Approach

Our approach uses one of the paper's summarization pipelines on the hotel reviews from the summaries of popular and aspect-specific customer experiences (SPACE) dataset. This dataset contains about 100 reviews for each of some 50 hotels, with each review spanning about 10-15 sentences. In addition, it provides reference summaries along different aspects of a hotel, like its service or its cleanliness.

The pipeline replicates the topic-wise clustering with chunked GPT-4 summarization (TCG) approach. The topics here are the six aspects on which the reviews are evaluated: rooms, building, cleanliness, location, service, and food.

Its summaries are evaluated using ROUGE and BERTScore metrics and compared with GPT-3's metrics from the paper.

Aspect-Wise Clustering + Recursive Chunked GPT-4 Summarization

This pipeline first labels the review sentences with aspects to yield aspect clusters. Each cluster of sentences focuses on a single aspect. We then summarize each aspect cluster using GPT-4.

For the aspect labeling, the paper uses GloVe vectors. However, we experimented with two alternatives: GPT-4 chat prompts and OpenAI embeddings.

Labeling Sentences With GPT-4 Chat Prompts

We found that the first approach of labeling sentences using GPT-4 chat prompts was sometimes unreliable and often unacceptably slow. The first prompt we tried was based on counts and looked like this:

The count-based labeling prompt

However, for this prompt, GPT-4's unreliability manifested in many ways:

  • If we sent a large number of sentences such that the prompt was close to the token limit, GPT-4 would output only a dozen or so labels. It turns out that the token limit applies not just to the input prompt but across an entire request and response. So, we can only fill about half the token limit in the request.
  • The second problem was the inability of GPT-4 to return an exact number of results. If we sent 50 sentences, we expected 50 labels. However, GPT-4 would sometimes generate 49, 50, or even 53. There was no predicting the exact number.

To overcome these problems, we tried a second prompt that didn't rely on explicit counts but enforced the count implicitly by forcing GPT-4 to fill in a structured JSON result:

The JSON-structured labeling prompt

This worked reliably. But the problem was that it was unacceptably slow. Classifying 50 sentences would take about 30 minutes and we had hundreds of sentences per aspect across 10 hotels. Lesson learned: This approach is not good in any production scenario.

The alternative was to use embeddings.

Labeling With OpenAI Embeddings API

The embeddings API enabled us to label thousands of sentences within a few seconds.

To enrich the semantic meanings of the aspect labels, we converted them to these seven aspect phrases:

  • Hotel rooms
  • Hotel building
  • Hotel cleanliness
  • Hotel location
  • Hotel service
  • Hotel food
  • Information

OpenAI's latest "text-embedding-ada-002" engine generated 1,536-dimensional embeddings for these seven aspect phrases.

We then generated embeddings for all the review sentences using the same engine. Since these are unit-normalized embeddings, finding semantic similarity was as simple as running a matrix dot product between the sentence embeddings matrix and the aspect embeddings matrix.

Each row in the resulting matrix corresponded to a review sentence and contained cosine similarities with the seven aspects. We picked the highest two values in each row as the best label and the next best label. We picked two labels instead of just one because many sentences were actually relevant to multiple aspects.
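
The sketch below illustrates this labeling step; it assumes the pre-1.0 `openai` Python package, and the `embed` helper and placeholder review sentences are illustrative:

```python
import numpy as np
import openai

ASPECT_PHRASES = [
    "Hotel rooms", "Hotel building", "Hotel cleanliness", "Hotel location",
    "Hotel service", "Hotel food", "Information",
]

def embed(texts):
    """Return a (len(texts), 1536) matrix of text-embedding-ada-002 embeddings."""
    response = openai.Embedding.create(model="text-embedding-ada-002", input=texts)
    return np.array([item["embedding"] for item in response["data"]])

review_sentences = ["...", "..."]  # all review sentences for one hotel

aspect_matrix = embed(ASPECT_PHRASES)        # shape: (7, 1536)
sentence_matrix = embed(review_sentences)    # shape: (n_sentences, 1536)

# The embeddings are unit-normalized, so a dot product equals cosine similarity.
similarities = sentence_matrix @ aspect_matrix.T    # (n_sentences, 7)

# Best and next-best aspect index for each sentence.
top_two = np.argsort(similarities, axis=1)[:, -2:][:, ::-1]
```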

All the review sentences of each hotel were then clustered under their best and next-best aspect labels. We ended up with seven aspect clusters per hotel.

GPT-4 Recursive Chunking Summarization

As in the paper, we grouped each aspect cluster into chunks of 30 sentences, summarized each chunk separately, and then applied recursive summarization.

We used the same prompts as the paper:

  • For the first level of summaries that directly examine the reviews, the prompt would be: "Here’s what some reviewers said about a hotel: <SENTENCES> Summarize what the reviewers said of the <ASPECT>:"
  • For the higher level summaries of the summaries, the prompt would be: "Here are some accounts of what some reviewers said about the hotel: <SENTENCES> Summarize what the accounts said of the <ASPECT>:"

We got a final summary of about 30 sentences for each aspect and for each hotel.
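
The sketch below shows one way this recursive loop can be implemented for a single aspect cluster, using the paper's prompts; it assumes the pre-1.0 `openai` Python package, and the helper names are illustrative:

```python
import openai

CHUNK_SIZE = 30  # sentences per chunk, as in the paper

def gpt4(prompt: str) -> str:
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response["choices"][0]["message"]["content"]

def summarize_aspect(sentences: list[str], aspect: str) -> str:
    # First level: summarize each 30-sentence chunk of raw review sentences.
    summaries = []
    for i in range(0, len(sentences), CHUNK_SIZE):
        chunk = " ".join(sentences[i:i + CHUNK_SIZE])
        summaries.append(gpt4(
            f"Here's what some reviewers said about a hotel: {chunk} "
            f"Summarize what the reviewers said of the {aspect}:"
        ))

    # Higher levels: summarize the summaries until a single summary remains.
    # (A full implementation would re-chunk here if the joined summaries
    # exceeded the token limit.)
    while len(summaries) > 1:
        joined = " ".join(summaries)
        summaries = [gpt4(
            f"Here are some accounts of what some reviewers said about the hotel: {joined} "
            f"Summarize what the accounts said of the {aspect}:"
        )]
    return summaries[0]
```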

Example Aspect-Wise Labeling of Reviews

The image below shows examples of sentences grouped under two aspects of hotel reviews — rooms and service — using the embeddings API:

Review sentences grouped under the rooms and service aspects

Some of the sentences are clearly relevant to other aspects, like location, too. That's why we labeled each sentence with not just the most relevant aspect but also the second most relevant one.

Example Aspect-Wise Zero-Shot Summaries Generated by GPT-4

The image below shows some of the aspect-wise summaries generated by GPT-4 through zero-shot summarization:

Example aspect-wise summaries generated by GPT-4

The summaries contain valuable insights extracted from around 100 reviews, covering both positive and negative aspects. It's clear that they've provided many actionable areas of improvement. That kind of useful information would have needed a lot of staff and manual effort in the past.

Evaluation

The ROUGE and BERTScore metrics from the paper for GPT-3 pipelines are shown below:

ROUGE metrics from the zero-shot GPT-3 summarization paper (Source: Bhaskar et al.)

We focus on the results of the TCG pipeline because that's the one we adapted for GPT-4. The scores for the same pipeline, implemented using the OpenAI embeddings API and GPT-4 recursive summarization, are shown below:

ROUGE and BERTScore metrics for our GPT-4 TCG pipeline

Compared to the GPT-3 pipeline, our GPT-4 pipeline doesn't score as high on ROUGE but scores higher on BERTScore.

GPT-4's Zero-Shot Summarization Streamlines Business Workflows

In this article, we explored how GPT-4 fares against GPT-3 and state-of-the-art self-hosted summarization models. GPT-4's zero-shot summarization advancements enable businesses to extract high-quality information and insights just by using suitable prompts without having to spend time fine-tuning custom language models. Talk to us to find out how you can use it in your specific workflows.

References

  • Tanya Goyal, Junyi Jessy Li, Greg Durrett (2022). "News Summarization and Evaluation in the Era of GPT-3." arXiv:2209.12356 [cs.CL]. https://arxiv.org/abs/2209.12356
  • Yixin Liu, Pengfei Liu, Dragomir Radev, Graham Neubig (2022). "BRIO: Bringing Order to Abstractive Summarization." arXiv:2203.16804 [cs.CL]. https://arxiv.org/abs/2203.16804
  • Ramesh Nallapati, Bowen Zhou, Cicero Nogueira dos Santos, Caglar Gulcehre, Bing Xiang (2016). "Abstractive Text Summarization Using Sequence-to-Sequence RNNs and Beyond." arXiv:1602.06023 [cs.CL]. https://arxiv.org/abs/1602.06023
  • Stefanos Angelidis, Reinald Kim Amplayo, Yoshihiko Suhara, Xiaolan Wang, Mirella Lapata (2020). "Extractive Opinion Summarization in Quantized Transformer Spaces." arXiv:2012.04443 [cs.CL]. https://arxiv.org/abs/2012.04443
  • Mahnaz Koupaee, William Yang Wang (2018). "WikiHow: A Large Scale Text Summarization Dataset." arXiv:1810.09305 [cs.CL]. https://arxiv.org/abs/1810.09305
  • Alexandra Savelieva, Bryan Au-Yeung, Vasanth Ramani (2020). "Abstractive Summarization of Spoken and Written Instructions with BERT." arXiv:2008.09676 [cs.CL]. https://arxiv.org/abs/2008.09676
  • Adithya Bhaskar, Alexander R. Fabbri, Greg Durrett (2022). "Zero-Shot Opinion Summarization with GPT-3." arXiv:2211.15914 [cs.CL]. https://arxiv.org/abs/2211.15914