Improving Legal Document Summarization Using Deep Clustering (DCESumm)

Patrick Hennis
August 21, 2023

Automatic summarization of legal text like contracts, agreements, or court judgments using artificial intelligence techniques comes with difficult problems: critical sentences get left out, topics are missed due to chunking problems, sections are only partially covered, and irrelevant sentences and topics are included. Overcoming these problems requires careful analysis and the use of appropriate, use-case-specific pipelines.

In this article, we review a state-of-the-art pipeline called Deep Clustering Enhanced Summarization (DCESumm) for improving legal document summarization using deep clustering techniques.

Problems in Legal Document Summarization

Legal documents are often long and divided into multiple distinct sections. They can also contain specific boxes and ordering that make it difficult to evaluate how informative a sentence is to the overall document, which is a prerequisite for creating a good summary.

The example legal documents below demonstrate some of these problems.

Example 1 - Data Use Agreement

Agreements and contracts are typically divided into distinct sections as shown below:

A data use agreement with sections (Source: New York State Office of Mental Health)

If the user wants every section to be represented in the summary, then section-level summaries must be generated. Some sections can be quite lengthy. Every sentence's relevance and informativeness must be evaluated in the scope of its parent section.

Most legal contracts and agreements are written in very sophisticated language. Obscure words and difficult phrases can bury important information, and they can be hard to read and understand even for humans. When generating summaries for such contracts, the model might not mention these small but relevant details. To fix this, individual summaries should be generated for every section in the contract, and if the sections are too long, they should be broken down into smaller chunks and processed accordingly. Once all the chunks are processed and summarized, they can be combined with a well-written prompt that retains the valuable information from all the chunks.

This approach reduces the risk of losing important information even when the contract is very long or its language is very sophisticated. Because the contract is broken down into chunks, every part of it is processed in a more focused manner.
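The chunk-then-combine idea can be sketched roughly as follows. This is a minimal illustration, not a prescribed implementation: the function names, the word-window chunking, the `max_words` and `overlap` defaults, and the `summarize` callable (a stand-in for whatever model or prompt is used) are all assumptions made for this example.

```python
from __future__ import annotations


def chunk_section(text: str, max_words: int = 200, overlap: int = 20) -> list[str]:
    """Split one contract section into overlapping word-window chunks."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be non-negative and smaller than max_words")
    words = text.split()
    chunks = []
    step = max_words - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reached the end of the section
    return chunks


def summarize_contract(sections: dict[str, str], summarize) -> dict[str, str]:
    """Summarize every chunk of every section, then merge per section.

    `summarize` is a hypothetical callable (str -> str), for example a
    wrapper around an LLM prompt.
    """
    result = {}
    for title, body in sections.items():
        partial = [summarize(chunk) for chunk in chunk_section(body)]
        # Merge step: a plain join here; a combining prompt could be used instead.
        result[title] = " ".join(partial)
    return result
```

The small overlap between neighboring chunks is one way to avoid cutting a clause in half at a chunk boundary; whether it helps depends on the documents.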

Example 2 - Court Judgments

A typical judgment is shown below:

An example court judgment (Source: The Caselaw Access Project)

Some judgments have sections and others don't. Additionally, different topics or themes (e.g., injuries, negligence, or compensation in the above example) may be dispersed throughout the judgment and referred to multiple times. This can cause issues when understanding the key topics of the document and the relevancy they have in the final summary, especially when writing extractive summaries.

For such documents, a good summary involves identifying these themes throughout the document, evaluating their related sentences and the informativeness of each sentence in the entire document, and including only the most informative sentences in the summary.

How DCESumm Improves Legal Summarization — Intuition

How relevant is a sentence to the overall document? How much does it contribute to the overall informativeness of the document? Is this sentence strongly correlated to the key topics or entities? These questions decide whether a sentence belongs in a reasonable summary. DCESumm answers them through two key innovations.

1. Supervised Scoring of Sentence Relevance in Summaries

Sentence scoring workflow for legal document summarization
Sentence scoring (Source: DCESumm)

Most summarization approaches tend to evaluate sentence relevance using simple metrics like cosine similarity or Euclidean distance.
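As a point of comparison, a simple similarity-based scorer might look like the sketch below. It scores each sentence by the cosine similarity between its embedding and the mean of all sentence embeddings. The centroid-based scoring and the function names are assumptions chosen for illustration; this is the kind of baseline that DCESumm moves away from, not part of DCESumm itself.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def centroid_scores(sentence_vectors):
    """Score each sentence by similarity to the document's mean embedding."""
    n = len(sentence_vectors)
    dim = len(sentence_vectors[0])
    centroid = [sum(v[i] for v in sentence_vectors) / n for i in range(dim)]
    return [cosine_similarity(v, centroid) for v in sentence_vectors]
```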

However, an alternative is to use supervised machine learning models trained on labeled summarization datasets. This enables more complex modeling of the relevance scoring. DCESumm does so with a deep neural network.

2. Relevance Score Enhancement With Deep Embedded Clustering

Score enhancement with clustering (Source: DCESumm)

A trained relevance scoring model may score some sentences wrong due to ignoring their global informativeness. So, the target document's unique structure and global context should be used to correct the sentence scores.

For that, DCESumm looks for clusters of high and low relevance, and uses them to boost or lower the individual sentence scores. A low-scoring sentence in a highly relevant cluster is probably more informative than its score says and should have its score boosted. Similarly, a high-scoring sentence in an irrelevant cluster is probably a mistake and should have its score reduced.

In addition, DCESumm implements the clustering itself in a way that improves relevance modeling. Instead of relying only on a traditional distance-based algorithm like K-means, it combines a supervised approach with unsupervised clustering to learn a cluster topology that's customized for the legal domain.

DCESumm Pipeline Architecture

The DCESumm pipeline is shown below:

DCESumm pipeline for legal document summarization
DCESumm pipeline (Source: DCESumm)

Its components are explained in the following sections.

1. Sentence Representations & LEGAL-BERT Model

The first pipeline component takes the target legal document, extracts all its sentences, and converts each sentence to a contextual vector embedding.

For embeddings, practitioners typically use a Bidirectional Encoder Representations from Transformers (BERT) model, one of its derivatives, or even embeddings from large language models like GPT. DCESumm just reuses a pre-trained LEGAL-BERT model for this.

LEGAL-BERT is trained on legal datasets to learn the terms, concepts, phrase patterns, and sentence structures prevalent in legal documents. So, its embeddings work better on legal documents compared to the standard BERT that's trained on general text corpora.

LEGAL-BERT Huggingface

Since LEGAL-BERT produces an embedding for every token in a sentence, a sentence-level embedding is generated by averaging all the token embeddings. Conceptually, this approach is similar to Sentence-Transformers, and an equivalent model from the latter can be used instead of LEGAL-BERT.
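The averaging step can be sketched as below. The token vectors would normally come from the model's output for one sentence; here they are toy two-dimensional values. The `attention_mask` argument, which excludes padding tokens from the average, is an assumption borrowed from common practice rather than something the article specifies.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, skipping padding positions (mask == 0)."""
    kept = [vec for vec, keep in zip(token_embeddings, attention_mask) if keep]
    if not kept:
        raise ValueError("sentence has no unmasked tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Toy example: three token vectors, the last one is padding.
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
sentence_embedding = mean_pool(tokens, mask)  # -> [2.0, 3.0]
```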

2. Sentence Relevance Scoring Neural Network

Sentence relevance scoring neural network
Sentence relevance scoring multi-layer perceptron (Source: Jain et al.)

The next component is a trained multi-layer perceptron (MLP) network that calculates a relevance score for every sentence from its embedding vector.

Architecture-wise, it's just four dense layers with a dropout layer between each pair to improve the generalization. The input to this model is a single sentence embedding. Its output sigmoid neuron calculates a relevance score for the sentence as a probability.
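A forward pass through such a network can be sketched in plain Python. The hidden layer widths, the ReLU activations, and the random (untrained) weights are assumptions for illustration only, since the article does not state them. Dropout is active only during training, so it appears here as a comment.

```python
import math
import random


def dense(x, weights, bias):
    """One fully connected layer: weights has one row per output neuron."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def init_mlp(layer_sizes, seed=0):
    """Random small weights and zero biases for each consecutive pair of sizes."""
    rng = random.Random(seed)
    params = []
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        scale = 1.0 / math.sqrt(n_in)
        weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
        params.append((weights, [0.0] * n_out))
    return params


def relevance_score(embedding, params):
    """Map one sentence embedding to a probability-like relevance score."""
    x = embedding
    for weights, bias in params[:-1]:
        x = relu(dense(x, weights, bias))
        # Dropout would sit here during training; it is a no-op at inference.
    weights, bias = params[-1]
    return sigmoid(dense(x, weights, bias)[0])


# Hypothetical sizes: 8-dim input, then four dense layers ending in one neuron.
params = init_mlp([8, 16, 8, 4, 1])
score = relevance_score([0.1] * 8, params)
```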

Training the Relevance Scoring Network

Billsum Dataset

To train this network, training and test sets for a binary classifier are derived from an extractive summarization dataset like BillSum as follows:

  • Gather the set of legal documents and their corresponding reference summaries.
  • For each training document, extract its sentences.
  • For each sentence, calculate the ROUGE-1, ROUGE-2, and ROUGE-L recall and F1 scores against the parent document's reference summary.
  • Average the three recall × F1 products (one each for ROUGE-1, ROUGE-2, and ROUGE-L). This is the sentence's score.
  • Repeat this for all the sentences in all documents against their corresponding summaries.
  • For each document, select the sentences that score among the top 20%. Also select all the sentences in the summary. Their sentence embeddings are the X-values of the training set.
  • If a selected sentence is present in the summary, its label in the training set must be one. If it's not present, the label is zero.

This gives us a training set where each row has a sentence embedding and a zero or one label to indicate its absence or presence in the summary.
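The labeling steps above can be sketched as follows. This is a simplified illustration: the ROUGE functions are minimal re-implementations on lowercased, whitespace-split tokens (not the official ROUGE toolkit), summary membership is checked by exact string match, and the rows hold sentence text where the real pipeline would hold sentence embeddings. All function names are hypothetical.

```python
from collections import Counter
import math


def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def recall_f1(overlap, cand_total, ref_total):
    if overlap == 0 or cand_total == 0 or ref_total == 0:
        return 0.0, 0.0
    precision = overlap / cand_total
    recall = overlap / ref_total
    return recall, 2 * precision * recall / (precision + recall)


def rouge_n(cand, ref, n):
    c, r = ngrams(cand, n), ngrams(ref, n)
    overlap = sum((c & r).values())  # clipped n-gram matches
    return recall_f1(overlap, sum(c.values()), sum(r.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence (basis of ROUGE-L)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(cand, ref):
    return recall_f1(lcs_length(cand, ref), len(cand), len(ref))


def sentence_score(sentence, summary):
    """Average of recall x F1 over ROUGE-1, ROUGE-2, and ROUGE-L."""
    cand, ref = sentence.lower().split(), summary.lower().split()
    pairs = [rouge_n(cand, ref, 1), rouge_n(cand, ref, 2), rouge_l(cand, ref)]
    return sum(recall * f1 for recall, f1 in pairs) / len(pairs)


def build_training_rows(doc_sentences, summary_sentences, top_fraction=0.2):
    """Return (sentence, label) rows: top-scoring sentences plus summary sentences."""
    summary = " ".join(summary_sentences)
    in_summary = set(summary_sentences)
    ranked = sorted(doc_sentences, key=lambda s: sentence_score(s, summary), reverse=True)
    top_k = max(1, math.ceil(top_fraction * len(doc_sentences)))
    selected = list(dict.fromkeys(ranked[:top_k] + summary_sentences))  # dedupe, keep order
    return [(s, 1 if s in in_summary else 0) for s in selected]
```

High-scoring sentences that are absent from the summary become the negative examples, which is what makes the classifier's task non-trivial.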

The neural network is trained on this dataset. Its learned weights form a model that predicts whether a sentence, represented by its embedding, is likely to be in the document's summary, and if so, with what probability. This probability is its sentence relevance score.

What's nice about this workflow over something like GPT-4 summarization is that you have a concrete value you can reference for how valuable a specific sentence is. When summarizing with GPT models, you don't have that insight to iterate on and improve models. Clear evaluation metrics toward the goal output are critical for iterating on models in production.

3. Score Enhancement With Deep Embedded Clustering

DCESumm's clustering is based on the deep embedded clustering approach. It consists of multiple deep neural networks as explained below.

3a. Autoencoder for Dimension Reduction

The first component is a stacked autoencoder network that reduces a high-dimensional embedding to a low-dimensional feature vector that K-means clustering can handle.

This network consists of the following layers:

  1. Input layer with the same number of neurons as the embedding dimensions (768 by default for LEGAL-BERT)
  2. Dense encoder layer with 64 neurons
  3. Dense encoder layer with 128 neurons
  4. Dense encoder layer with 256 neurons
  5. Dense encoder hidden layer with five neurons
  6. Dense decoder layer with 256 neurons
  7. Dense decoder layer with 128 neurons
  8. Dense decoder layer with 64 neurons
  9. Output layer with same size as the input layer

It's trained on the sentence embeddings, and since it's an autoencoder, the input and output are the same embeddings. Its sole purpose is to produce a high-quality five-dimensional representation at the fifth layer for a given high-dimensional sentence embedding. Once trained, only the encoder stack till the fifth layer is used.
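To make the layer stack concrete, the sketch below derives each dense layer's (input, output) shape from the widths listed above and counts trainable parameters. It only describes the architecture; it does not train anything. The constant and function names are hypothetical.

```python
EMBEDDING_DIM = 768                      # LEGAL-BERT default, per the list above
ENCODER_WIDTHS = [64, 128, 256, 5]       # dense encoder layers, ending in the 5-dim bottleneck
DECODER_WIDTHS = [256, 128, 64, EMBEDDING_DIM]  # dense decoder layers plus output layer


def layer_shapes(input_dim, widths):
    """Return (n_in, n_out) for each consecutive dense layer."""
    shapes, n_in = [], input_dim
    for n_out in widths:
        shapes.append((n_in, n_out))
        n_in = n_out
    return shapes


def parameter_count(shapes):
    """Weights plus biases across all layers."""
    return sum(n_in * n_out + n_out for n_in, n_out in shapes)


encoder = layer_shapes(EMBEDDING_DIM, ENCODER_WIDTHS)
decoder = layer_shapes(ENCODER_WIDTHS[-1], DECODER_WIDTHS)
# After training, only `encoder` is kept: it maps 768 dimensions down to 5.
```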

3b. Cluster Initialization

Autoencoder, dimension reduction, and K-means (Source: Jain et al.)

From its previous phases, the pipeline already has the sentence embeddings of a document. Each embedding is reduced to a five-dimensional feature using the autoencoder. These five-dimensional embeddings are run through standard K-means to obtain an initial set of sentence clusters and cluster centroids. The initial number of clusters is set to 35% of the number of sentences in the document.
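A minimal sketch of this initialization is shown below, using a small hand-written Lloyd's K-means so that it stays self-contained. The rounding rule for the cluster count, the fixed iteration count, and the random seeding are assumptions; a library implementation would normally be used in practice.

```python
import random


def initial_cluster_count(num_sentences, fraction=0.35):
    """35% of the sentence count, rounded, with at least one cluster (rounding is assumed)."""
    return max(1, round(fraction * num_sentences))


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm: assign to nearest centroid, then recompute means."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster goes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids
```

For a 20-sentence document, `initial_cluster_count(20)` gives 7 clusters to start from.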

3c. Cluster Optimization

Cluster optimization on KL-divergence loss (Source: Jain et al.)

Normally, K-means optimizes clusters by repositioning their centroids such that L2 distances to their member points are minimized. The implicit assumption is that the feature space topology is Euclidean.

But since it may not actually be Euclidean, a more general approach is to use gradient-based optimization like stochastic gradient descent (SGD) to incrementally optimize the cluster assignments. This has the added advantage of simultaneously improving the sentence representations. The technique is explained in detail next.

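The K-means centroids serve as the starting point. In the deep embedded clustering formulation (Xie et al.), the similarity between each sentence's low-dimensional representation and each centroid is measured with a Student's t-distribution kernel and normalized, which gives every sentence a probability distribution over clusters. These are the soft cluster assignments that get updated in every iteration.

From them, a second, sharper probability distribution is derived by squaring the soft assignments and renormalizing them by cluster frequency. This is called the auxiliary target distribution, and it plays the role of hard cluster assignments.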

The commonly used difference metric between any two probability distributions is the Kullback-Leibler divergence (KLD). In every iteration of the SGD, the KLD is minimized by nudging the soft cluster assignments towards the hard cluster assignments. Then the representations are updated and cluster centroids are recalculated.

This is repeated till the soft and hard cluster labels are within a threshold. The cluster assignments are then frozen and used for the next step.
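The quantities involved in one optimization step can be sketched as follows. The sketch follows the deep embedded clustering formulation of Xie et al. (cited in the references), in which a Student's t kernel between points and centroids produces the soft assignments and the target distribution is obtained by squaring and renormalizing them. Whether DCESumm uses exactly these formulas is an assumption here. Only the distributions and the loss are computed; the gradient updates to the encoder and centroids are left out.

```python
import math


def soft_assignments(points, centroids, alpha=1.0):
    """q[i][j]: Student's t kernel similarity of point i to centroid j, normalized per row."""
    q = []
    for p in points:
        kernel = [
            (1.0 + squared_dist(p, c) / alpha) ** (-(alpha + 1.0) / 2.0)
            for c in centroids
        ]
        total = sum(kernel)
        q.append([k / total for k in kernel])
    return q


def squared_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def target_distribution(q):
    """p[i][j] proportional to q[i][j]^2 / f_j, where f_j is the soft cluster frequency."""
    freq = [sum(col) for col in zip(*q)]
    p = []
    for row in q:
        weights = [qij ** 2 / fj for qij, fj in zip(row, freq)]
        total = sum(weights)
        p.append([w / total for w in weights])
    return p


def kl_divergence(p, q):
    """KL(P || Q) summed over all points and clusters."""
    return sum(
        pij * math.log(pij / qij)
        for p_row, q_row in zip(p, q)
        for pij, qij in zip(p_row, q_row)
        if pij > 0
    )


points = [[0.0, 0.0], [0.2, 0.0], [5.0, 5.0]]
centroids = [[0.0, 0.0], [5.0, 5.0]]
q = soft_assignments(points, centroids)
p = target_distribution(q)
loss = kl_divergence(p, q)  # the value SGD would try to reduce
```

In this toy example the target distribution is more confident than the soft assignments for every point, which is what pulls the assignments toward cleaner clusters during training.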

3d. Enhanced Sentence Scores from Cluster Scores

At this point, we have a set of optimized clusters, each consisting of a set of low-dimensional sentence embeddings and a cluster centroid in that feature space.

First, the sentence relevance scoring model from before is reused on the updated sentence representations to calculate new sentence relevance scores. The overall cluster score is the median of these new relevance scores.

Finally, the new sentence relevance scores are weighted by the cluster scores, as shown in this formula:

Sentence relevance score enhancement formula (Source: Jain et al.)

If a cluster's relevance score is high, it'll boost the scores of all its constituent sentences. But if it's low, it'll bring down all its sentence scores too.
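The effect can be illustrated with the sketch below. The cluster score is the median of its members' scores, as described above. The weighting itself is an assumption: the exact formula from the paper is not reproduced in this text, so plain multiplication of each sentence score by its cluster score stands in for it. The function name is hypothetical.

```python
from collections import defaultdict
from statistics import median


def enhance_scores(sentence_scores, cluster_labels):
    """Weight each sentence score by the median score of its cluster."""
    by_cluster = defaultdict(list)
    for score, label in zip(sentence_scores, cluster_labels):
        by_cluster[label].append(score)
    cluster_score = {label: median(scores) for label, scores in by_cluster.items()}
    # ASSUMPTION: plain multiplication stands in for the paper's weighting formula.
    return [
        score * cluster_score[label]
        for score, label in zip(sentence_scores, cluster_labels)
    ]


scores = [0.9, 0.8, 0.4, 0.7, 0.1, 0.2]
labels = [0, 0, 0, 1, 1, 1]
enhanced = enhance_scores(scores, labels)
```

Under this assumed weighting, the 0.4 sentence in the relevant cluster ends up ranked above the 0.7 sentence in the irrelevant cluster, which is the relative reordering the text describes.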

4. Final Extractive Summary

For each document, the sentences are sorted by their enhanced relevance scores, and the top N sentences are selected as the extractive summary. N here depends on the average summary length of the selected dataset.
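The selection step can be sketched in a few lines. Restoring the chosen sentences to their original document order is an assumption made for readability; the article only specifies picking the top N by score.

```python
def extractive_summary(sentences, scores, n):
    """Pick the n highest-scoring sentences and return them in document order."""
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen = sorted(ranked[:n])  # ASSUMPTION: restore document order for readability
    return [sentences[i] for i in chosen]


summary = extractive_summary(["a", "b", "c", "d"], [0.1, 0.9, 0.3, 0.8], n=2)
# -> ["b", "d"]
```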

DCESumm Results of Legal Document Summarization

Generated summaries are evaluated using either reference-based metrics or reference-free metrics.

Reference-based metrics compare some characteristics of a generated summary against its reference summary. For example, ROUGE and BLEU scores measure n-gram overlaps between them while ignoring the semantic meanings they convey, which is fine for strictly extractive summaries but not for loosely extractive or abstractive summaries.

Other reference-based metrics like BERTScore and MoverScore do consider the semantic similarity between the summaries by evaluating them using trained language models. However, when evaluating domain-specific vocabulary like that in legal summaries, these models may score wrong due to training on generic datasets.

In general, reference-based metrics don't seem to match the subjective evaluations of summaries by people. So, more modern approaches use reference-free metrics like SUPERT or SummaC to evaluate the linguistic aspects of summaries like their factuality, faithfulness, semantics, and genericity using characteristics that people use for subjective evaluations.

Since DCESumm produces strictly extractive summaries, the paper evaluates them against the reference summaries using simple ROUGE scores.

DCESumm scores better on the test datasets than the baseline methods:

Comparison with baseline models for legal document summarization
Comparison with baseline models (Source: Jain et al.)

It also performs better than other state-of-the-art deep learning models on extractive reference summaries:

Comparison with state-of-the-art models for legal document summarization
Comparison with state-of-the-art models (Source: Jain et al.)

An example generated summary against its reference summary from BillSum is shown below:

Example generated summary (Source: Jain et al.)

State-of-the-Art Legal Document Summarization for Your Business

In this article, we explored a state-of-the-art summarization system that uses innovative techniques to evaluate the relevance and informativeness of sentences.

At Width, we have extensive experience in implementing legal document summarization techniques as well as other natural language processing (NLP) pipelines to improve legal document understanding, including using the latest large language models like GPT-4.

Contact us to learn how your law practice can improve its productivity using such techniques.


  • Jain, D., Borah, M.D. & Biswas, A. "A sentence is known by the company it keeps: Improving Legal Document Summarization Using Deep Clustering." Artif Intell Law (2023). https://doi.org/10.1007/s10506-023-09345-y
  • Ilias Chalkidis, Manos Fergadiotis, Prodromos Malakasiotis, Nikolaos Aletras, Ion Androutsopoulos (2020). "LEGAL-BERT: The Muppets straight out of Law School." arXiv:2010.02559 [cs.CL]. https://arxiv.org/abs/2010.02559
  • Anastassia Kornilova, Vlad Eidelman (2019). "BillSum: A Corpus for Automatic Summarization of US Legislation." arXiv:1910.00523 [cs.CL]. https://arxiv.org/abs/1910.00523
  • Junyuan Xie, Ross Girshick, Ali Farhadi (2015). "Unsupervised Deep Embedding for Clustering Analysis." arXiv:1511.06335 [cs.LG]. https://arxiv.org/abs/1511.06335