Our SOTA GPT-4 Dialogue Summarization | Zero-Shot, Few-Shot, Aspect-Based

dialogue summarization difference

Day-to-day sales and meeting conversations between your employees and your clients, or between your customers and your support teams, are rich with information that's potentially useful to your business. You can get new ideas to improve your quality of service, bring new products to market, or provide new services that are in demand.

But extracting insights from such information is not easy by any means. Most business conversations are mixed with tangential discussions, personal banter, filler words, and similar noise that trips up regular language processing tools. In this article on dialogue summarization with GPT-4, we explore whether OpenAI's GPT-4 large language model is capable of looking past such noise and giving you the insights you need.

Approaches to Dialogue Summarization With GPT-4

There are three approaches to dialogue summarization using GPT-4: zero-shot, few-shot, and fine-tuned. In this article, we focus on the first two approaches.

Zero-Shot Summarization

In zero-shot summarization, the dialogue to be analyzed is fed directly to GPT-4 with an appropriate prompt. You rely entirely on GPT-4's ability to follow all the instructions in the prompt. For simple goals, the instructions in the prompt are often sufficient. But for more complex goals, even detailed instructions may not be sufficient. That's when you need the few-shot approach.

Few-Shot Summarization

In the few-shot approach, the prompt contains both instructions and input-output examples that demonstrate what you want GPT-4 to do. GPT-4 can infer what you want based on the patterns in those input-output pairs and reproduce those same patterns for your target dialogue.

In the sections below, we explore zero-shot and few-shot summarization on various use cases.

Use Case 1: Zero-Shot GPT-4 Summarization of Meeting Transcripts

For our first use case, we select the transcript of a client meeting. The meeting is about implementing summarization for a client who specializes in sports commentaries.

Test Transcript

The meeting audio was transcribed to text by a speech-to-text model. The full transcript is shown below:

transcript example
We used speaker name substitution to protect the identity of multiple speakers
transcript example
transcript example
transcript example

Manually Extracted Key Topics in the Transcript

Knowing the key topics in this transcript helps us evaluate the generated summaries later. Some key topics are:

  1. The client's primary business goal of summarizing sports commentaries
  2. How the seller's GPT-3 summarization solution works
  3. Discussion on price estimates for the project
  4. How the solution can be integrated with the client's workflows
  5. The client's concerns over data confidentiality and data ownership
  6. The client's concerns over the confidential data of one client leaking into the summaries of another client
  7. The client's request for a simple project proposal they can take to their management

Zero-Shot Extractive Summarization of the Meeting Transcript

By default, GPT-4 tends to rephrase conversation sections when asked to summarize. GPT-4 can do strictly extractive summarization if told to select sentences rather than write or generate anything.

This prompt helps curb GPT-4's tendency to rephrase and makes it simply select key sentences:

"From the meeting transcript above select the key sentences that convey the gist of the meeting and output them verbatim without any changes, paraphrasing, or rephrasing."
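As a rough sketch of how this prompt would be wired up, the snippet below appends the instruction to the transcript and hands the combined prompt to a model call. The `call_gpt4` argument is a hypothetical stand-in for an OpenAI API wrapper, injected as a callable so the flow is easy to test; in production it would make the actual chat completion request.

```python
# Zero-shot extractive summarization sketch. `call_gpt4` is a hypothetical
# injected helper that wraps the actual GPT-4 API call.

EXTRACTIVE_INSTRUCTION = (
    "From the meeting transcript above select the key sentences that convey "
    "the gist of the meeting and output them verbatim without any changes, "
    "paraphrasing, or rephrasing."
)

def extractive_summary(transcript: str, call_gpt4) -> str:
    """Feed the transcript, followed by the extraction instruction, to the model."""
    prompt = f"{transcript}\n\n{EXTRACTIVE_INSTRUCTION}"
    return call_gpt4(prompt)
```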

The summary generated by this prompt for our meeting transcript is evaluated next.

Evaluation of the Extractive Summary

The extracted summary is shown below:

transcript extractive summary with zero shot dialogue summarization

As instructed, it has retained sentences from the transcript verbatim in the summary. It does a good job at covering what the service provider does and who they work with, what the customer is looking for, and what the solution would look like at a high level. It even pulls in some key information on what the service provider says about specific details of what goes into the solution.

In terms of the number of key topics covered, this summary scores about five out of seven.

One topic it missed is data confidentiality. In the transcript, the client spends considerable time on data confidentiality and ownership worries. In this summary, however, those worries are alluded to only indirectly through a reference to potential hurdles with the client's legal team. That's a critical topic to ignore.

The other topic it partially missed is integrating the solution with the client's systems. It extracts one sentence about the solution living in the client’s system but goes no further. This could be tuned through aspect-based summarization prompts that focus a bit more on the specific industry. Let’s look at how we can quickly improve this workflow.

Improving our Zero-Shot Extractive Summarization with a Multi-Step Prompt

We can improve the model's zero-shot understanding of what information is important in our summary through a new prompt workflow. We first ask GPT-4 to provide a list of the key topics discussed in the meeting. You could make this prompt more domain-specific if you want, but I kept it generic to show the model's general understanding of the topics.

multi-step prompt for dialogue summarization

We have life! Our topic extraction picked up on the previously missed data privacy and ownership concerns as a key topic. It also grabs the topics we were already able to pull, meaning we won’t lose any context in the next step.

Now we run the same extractive summarization with a small tweak to account for the key topics: we provide the transcript as well as the list of topics. The full production pipeline runs these two steps back to back.
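The two-step flow can be sketched as a single function that calls the model twice, with the topic list from step one appended to the transcript in step two. The prompt texts here are paraphrased assumptions for illustration, not the exact production prompts, and `llm` is a hypothetical injected GPT-4 call:

```python
# Two-step zero-shot pipeline sketch: topic extraction, then extractive
# summarization conditioned on the extracted topics.

TOPIC_INSTRUCTION = "List the key topics discussed in the meeting above."

def topic_conditioned_summary(transcript: str, llm) -> str:
    # Step 1: pull the key topics from the raw transcript.
    topics = llm(f"{transcript}\n\n{TOPIC_INSTRUCTION}")
    # Step 2: extractive summarization, with the topic list appended so the
    # model knows which information matters.
    prompt = (
        f"{transcript}\n\nKey topics:\n{topics}\n\n"
        "Select the key sentences that cover the topics above and output "
        "them verbatim without any changes, paraphrasing, or rephrasing."
    )
    return llm(prompt)
```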

output of zero shot extractive summarization

We can see the extracted sentences now include information about AWS and data privacy. I’d still like to replace a few of these sentences with more valuable ones, but for zero-shot with a very simple prompt it's pretty good. This prompt could still be improved with an aspect-based approach (we'll see that below) to hone in on the specific industry and use case. Here’s the output summary from an aspect-based prompt we use for our specific transcripts focused on customer discovery.

good extractive summarization output

Next, we'll see how abstractive summarization fares.

Zero-Shot Abstractive Summarization of the Meeting Transcript

For abstractive summarization, we use this simple prompt:

"Write a summary of the conversation above that focuses on the key information discussed."

This is a simple soft prompt that could be run through our prompt optimization framework to create a much more dataset-specific prompt.

Evaluation of the Abstractive Summary

The abstractive summary is shown below:

Zero-Shot Abstractive Summarization of the Meeting Transcript

This is a good summary with six out of seven topics covered. It also focuses on the data confidentiality worries that took up considerable time and includes a summary of the offered solution.

The flow between sentences is also excellent; every sentence is logically related to the previous one. In a zero-shot environment, abstractive summaries generally outperform extractive summaries because the model gets to blend various sentences together and rewrite key ideas. Abstractive summaries let the model “ingest” the data in a way it prefers, whereas extractive summaries must reproduce sentences as is, which can read a bit awkwardly with no examples, even if the model fully understands the key topics. The only shortcoming I would note is that it doesn't go into great detail on the client's own peculiar challenges in integrating the solution with their existing systems.

Modern Chunking For Dialogue Text Summarization

While the transcripts used above did not require chunking with GPT-4, since they cover only about 30 minutes of conversation, longer dialogues such as webinars or extended meetings don’t fit in GPT-4's context window. For these we use chunking to split the text up and summarize it chunk by chunk. We then use a model at the end of the pipeline to combine the smaller summaries into one summary. Here’s the architecture diagram I built for this back in 2021.

Width.ai summarization pipeline
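A minimal sketch of this chunk-summarize-combine pipeline is below. The chunk sizing here is a plain word-count split on dialogue turns; the production version blends word count, keywords, and aspects, and the prompt wording is an assumption. `llm` again stands in for a GPT-4 call:

```python
# Sketch of the chunk -> summarize -> combine pipeline for dialogues that
# exceed the context window. Word-count splitting only; production chunking
# is more sophisticated.

def split_into_chunks(dialogue: str, max_words: int = 1500) -> list[str]:
    """Split on dialogue turns (lines) so no speaker turn is cut mid-sentence."""
    chunks, current, count = [], [], 0
    for turn in dialogue.splitlines():
        words = len(turn.split())
        if current and count + words > max_words:
            chunks.append("\n".join(current))
            current, count = [], 0
        current.append(turn)
        count += words
    if current:
        chunks.append("\n".join(current))
    return chunks

def summarize_long_dialogue(dialogue: str, llm, max_words: int = 1500) -> str:
    # Map step: summarize each chunk independently.
    partials = [
        llm(f"{chunk}\n\nSummarize the dialogue above.")
        for chunk in split_into_chunks(dialogue, max_words)
    ]
    # Reduce step: combine the partial summaries into one.
    combined = "\n".join(partials)
    return llm(f"{combined}\n\nCombine the partial summaries above into one summary.")
```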

One of the questions clients constantly ask me is what to do about document chunking. How large should my chunks be? Should I use a static chunk size or a sliding window based on content? How do I help each chunk contain as much valuable content as possible while limiting information repeated across chunks, or losing an entire key topic because it's split between multiple chunks? Chunking is a critical part of the equation and really drives how well the summarization performs downstream, especially once we consider combining summaries based on the key context in smaller chunks.

Let’s look at a chunking framework that works well for dialogue examples. The goals of this framework are to:

  • Help the chunk better represent the context provided in the rest of the dialogue.
  • Help the model better understand what information has already been discussed in other chunks. This keeps the model from generating the same key topics from other chunks.
  • Keep the semantic focus of this given chunk more correlated to the key topics of the entire dialogue to improve summarization performance.

Note: This approach works best when chunks are a larger percentage of the entire dialogue. I always suggest going with larger chunks over smaller ones, given what is discussed above.

Topic Infused Chunking for Dialogue Summarization

topic infused chunking algorithms for dialogue summarization

We split the dialogue into chunks of set sizes based on a blend of word count, keywords, and aspects.

We then extract the number one key topic from the chunks above and below. This prompt is a bit different from the one used earlier: we want to extract the information in a way that lets the model know there is other relevant information beyond this one key topic, and whether that key topic highly correlates with the current dialogue chunk or not. Another way to do this is to extract a number one key topic from all chunks above and below and combine them into a single “explainer” style topic that reads a bit more like an abstract summary. The goal of this step should make sense: we can now add a bit of information about what is going on in the rest of the dialogue without bogging down or “overpowering” the current chunk's context.

The blending prompt is used to blend the number one key topic into the current dialogue chunk. We use GPT-4 to generate new language that tells the model whether this information has already been discussed, based on what is talked about in the current chunk. Once again, this helps the model understand what information has already been deemed valuable and whether we should extract it from the current chunk when doing extractive summarization.

The outcome is a new current dialogue chunk that carries a bit more information about the topics discussed above and below it. You’ll see that, when done correctly, this greatly affects which key topics are chosen for the specific chunk. It also works much better than the commonly seen approach of a sliding window over all key topics generated from the chunks above plus the current chunk, because we now have context for what comes below this chunk and aren't bogged down with a ton of semantically similar key topics from above.
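The topic-infused chunking steps above can be sketched roughly as follows. The prompt wording is paraphrased from the description, not the exact production prompts, and `llm` is a hypothetical injected GPT-4 call:

```python
# Topic-infused chunking sketch: pull the top key topic from the neighboring
# chunks, blend that context into the current chunk, and summarize the result.

def top_topic(chunk: str, llm) -> str:
    """Extract the single most important topic from a neighboring chunk."""
    return llm(f"{chunk}\n\nState the single most important topic discussed above.")

def infuse_chunk(chunks: list[str], i: int, llm) -> str:
    """Return chunk i with neighboring-topic context blended in."""
    neighbors = []
    if i > 0:
        neighbors.append(f"Discussed earlier: {top_topic(chunks[i - 1], llm)}")
    if i < len(chunks) - 1:
        neighbors.append(f"Discussed later: {top_topic(chunks[i + 1], llm)}")
    context = "\n".join(neighbors)
    # Blending step: rewrite the context notes into language that tells the
    # summarizer which topics are already covered elsewhere in the dialogue.
    blended = llm(
        f"{context}\n\nCurrent chunk:\n{chunks[i]}\n\n"
        "Note which of the topics above are already covered elsewhere."
    )
    return f"{blended}\n\n{chunks[i]}"
```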

Next, we explore the summarization of another kind of dialogue: chats with customer support channels or customer chatbots.

Use Case 2: Zero-Shot GPT-4 Summarization of Customer Chat

The ability to extract key information from customer chats can help companies improve both the quality of their products or services and their customer service.

For our experiments, we run zero-shot extractive and abstractive summarization on conversations between customers and various customer support channels on Twitter.

The Dataset

The TweetSumm dataset from Hugging Face contains about 1,000 customer support conversations with the Twitter support channels of different companies. Each conversation has about three extractive and three abstractive summaries created by human annotators.

Here’s an example conversation:

Example customer support conversation on Twitter (Source: TweetSumm)
Example customer support conversation on Twitter (Source: TweetSumm)

Let's find out how GPT-4's extractive and abstractive summarization fare on this.

Zero-Shot Extractive Summarization of Customer Chat

The reference, human-annotated extractive summaries for the above conversation read like this:

Reference extractive summaries (Source: TweetSumm)

We use the following prompt to tell GPT-4 to select the key details:

"Select the two exact key sentences from the customer support chat above that best describe the key customer support issue discussed."

Normally my goal state output definition would be a bit less granular, to support wide data variance across a number of conversation log use cases, but I do think this prompt could be reused across a number of support issues or channels.
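To make the prompt reusable across support channels, the sentence count can be parameterized. This is a hypothetical helper that just builds the prompt string; the model call itself is unchanged:

```python
# Parameterized builder for the customer-chat extractive prompt.

def chat_extractive_prompt(chat: str, n_sentences: int = 2) -> str:
    """Build the extractive prompt with an adjustable sentence count."""
    return (
        f"{chat}\n\nSelect the {n_sentences} exact key sentences from the "
        "customer support chat above that best describe the key customer "
        "support issue discussed."
    )
```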

Evaluation of the Extractive Summary

The extracted summary by GPT-4 looks like this:

gpt-4 extractive summarization for customer chat

I asked the model to explain its result as a way to further understand the model's selection. This can be used for testing and prompt optimization to iterate the prompt toward a goal state output.

The output is very good. These are clearly the key sentences from the conversation that best describe the issue. I do wish the model had used a sentence from both the customer and the support agent, but if we ask for three sentences it does exactly that.

another exact of summarization of the customer chat

Let’s take a look at the results when focusing on an abstractive summary.

Zero-Shot Abstractive Summarization of Customer Chat

For the abstractive summary, we use the following prompt:

"For the customer chat above, write a summary with a maximum of 2 sentences on the customer's problem, focusing on the key information. On a new line, write a single-sentence summary of the solution focusing on the key information."
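Because the prompt pins the solution to its own line, the output can be split deterministically in post-processing. This parsing helper is a hypothetical downstream step, not part of the prompt itself:

```python
# Parse the two-part abstractive output: problem summary, then a
# single-sentence solution on its own line (as the prompt requests).

def parse_chat_summary(output: str) -> tuple[str, str]:
    """Split model output into (problem summary, solution summary)."""
    lines = [line.strip() for line in output.strip().splitlines() if line.strip()]
    if not lines:
        return "", ""
    if len(lines) == 1:
        return lines[0], ""
    # Everything before the last non-empty line is the problem summary;
    # the final line is the solution.
    return " ".join(lines[:-1]), lines[-1]
```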

Evaluation of the Abstractive Summary

The human-annotated, reference, abstractive summaries for this conversation are shown below:

Reference abstractive summaries (Source: TweetSumm)
Reference abstractive summaries (Source: TweetSumm)

Here’s the abstractive summary generated by GPT-4:

generated abstractive summary of dialogue with gpt-4

It describes the core problem accurately and matches the reference summaries as well.

It has also followed the formatting instructions we gave to write the solution on a separate line.

Use Case 3: Aspect-Based Summarization & Aspect-Based Topic Extraction of Customer Chat

Aspect-based summarization, when done correctly, is a great way to extract different summaries from the same text with zero changes to the prompt structure beyond the aspect to focus on. This lets you handle different summarization use cases without building an entirely new system for each one, while keeping a high level of differentiation between the summaries for different aspects.

Zero-Shot Extractive Aspect-Based Summarization of Customer Chat

Here we outline a prompt format that lets us easily swap the key aspect to focus on without rewriting the prompt language or making the prompt read awkwardly.

The prompt structure looks like this:

Zero-Shot Extractive Aspect-Based Summarization of Customer Chat

We use a variable {aspect} as the adjustable aspect we want to extract. Beyond zero-shot environments, and when the data is available, aspect-based summarization is also a good use case for few-shot prompts: a few examples help the model fully understand how to correlate exact sentences with more difficult or “abstract” aspects. We could use a dynamic prompt if we had a database of example summaries that make sense.
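A minimal version of this template pattern is shown below with `{aspect}` left as a placeholder. The exact production prompt wording is not reproduced here, so this phrasing is an assumption:

```python
# Aspect-based extractive prompt template. Only the {aspect} slot changes
# between use cases; the prompt structure stays fixed.

ASPECT_PROMPT = (
    "{chat}\n\nSelect the exact key sentences from the customer support chat "
    "above that are related to {aspect}."
)

def aspect_prompt(chat: str, aspect: str) -> str:
    """Fill the template with the chat text and the aspect to focus on."""
    return ASPECT_PROMPT.format(chat=chat, aspect=aspect)
```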

Evaluation of the Extractive Aspect-Based Summary

Let’s use an interesting aspect that is specific to this use case. What’s the point in showing generic summarization prompts now that we can quickly substitute in various focus points? Here we focus on console sleep mode.

Evaluation of the Extractive Aspect-Based Summary

This is a pretty awesome zero-shot example. The model is smart enough to know it doesn’t need to find the exact keyword; it even understands that the console turning off should be correlated with a sleep mode! These are the only two sentences related to sleep mode specifically, as we can see: asking for four sentences returns sentences that aren’t nearly as correlated with sleep mode.

Evaluation of the Extractive Aspect-Based Summary with more sentences

When I widen the data variance coverage of the aspect to “various console modes,” we can see it extracts sentences related to other console modes and operations to perform.

Evaluation of the Extractive Aspect-Based Summary with widened data variance


One thing I constantly see from clients is accuracy issues directly correlated with simple grammar mistakes or unclear communication of the instructions in the prompt. As good as GPT-4 is at understanding a wide range of English language “levels,” it is still a language model that prefers clear instructions and correct grammar. If I remove the phrase “exact key” from the prompt, it produces a completely different list of sentences that are not nearly as correlated with the aspect as the example above. Little stuff like that goes a long way.

prompt language adjustment to show the importance of instructions

A big part of production-level prompting is just being clear about what the goal state output should be, and iterating on that when the accuracy doesn’t reach the level we’re looking for. Having these parts of our prompt as concrete as possible even allows us to use aspects that are very abstract and don’t focus on a single identifiable thing, forcing the model to make that decision as well.

Zero-Shot Aspect-Based Topic Extraction

This same aspect-based workflow can be used for key topic extraction: instead of asking for key sentences, we ask for key topics. This is a great way to extract high-level short descriptions of the entire customer discussion. You’ll notice that these key topics follow the flow of the conversation, so you get a grasp of the topics throughout the entire discussion.
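The topic-extraction variant only swaps what we ask for; the aspect slot stays the same. As before, this wording is an illustrative assumption rather than the exact production prompt:

```python
# Aspect-based key topic extraction: same template pattern as the extractive
# prompt, but asking for topics instead of sentences.

def aspect_topic_prompt(chat: str, aspect: str) -> str:
    """Build a key-topic extraction prompt focused on one aspect."""
    return (
        f"{chat}\n\nList the key topics discussed in the customer support "
        f"chat above that are related to {aspect}, in the order they appear."
    )
```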

Zero-Shot Aspect-Based Topic Extraction

I got tricky with this and decided I wanted any key topics except those related to the Kinect, a difficult task when you consider the entire conversation revolves around the Kinect.

negative focused aspect based summarization

While I kept this negative prompting framework private, it follows the concepts from one of my favorite prompting articles by Allen Roush.

Implement Dialogue Summarization With GPT-4

Want to integrate these workflows into your product?

Contact us for expert help on streamlining your business workflows with GPT-4 and custom models.


  • Guy Feigenblat, Chulaka Gunasekara, Benjamin Sznajder, Sachindra Joshi, David Konopnicki, Ranit Aharonov (2021). "TWEETSUMM -- A Dialog Summarization Dataset for Customer Service." arXiv:2111.11894 [cs.CL]. https://arxiv.org/abs/2111.11894