In-Depth Guide to Patient Record Summarization With Large Language Models (With Examples)

Matt Payne
January 21, 2024
[Image: patient health medical record]

In this article, we explore our Width.ai patient record summarization pipeline that can:

  • Analyze a variety of patient record layouts and formats
  • Reliably extract essential medical data like names, diagnoses, International Classification of Diseases (ICD) codes, medications, and dates
  • Summarize all the extracted information as a simple timeline

Patient Record Summarization Goals

In this task, we're interested in extracting five essential pieces of information from patient reports:

  1. Names of medical personnel who provide care or prescribe medication
  2. Medical diagnoses or patient care
  3. International Classification of Diseases (ICD) codes for the diagnoses
  4. Medications prescribed to the patient
  5. Dates of visits, diagnoses, and any other medical activity

Any page of a report may contain some or all of this information. We want to extract it from every page and consolidate everything into a single timeline.

Challenges of Patient Records

The patient reports can be PDF or image files. PDFs can be of two types:

  • Scanned PDFs: These are scans of reports with all the text present only as pixels in the scanned images. Such files require an optical character recognition (OCR) step first to convert the text pixels into text characters. The text may be printed or handwritten.
  • Searchable PDFs: These are proper PDFs with directly extractable and searchable text. OCR may not be necessary.
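
A cheap way to route pages between these two paths is to check whether a page's text layer yields any characters (with pypdf's `page.extract_text()`, for instance) before falling back to OCR. A minimal sketch, where the character threshold is an assumption:

```python
def needs_ocr(extracted_text: str, min_chars: int = 25) -> bool:
    """A page whose text layer yields almost no characters is treated
    as a scanned image and routed to the OCR path."""
    return len(extracted_text.strip()) < min_chars

def classify_pdf(page_texts: list[str]) -> str:
    """Label a whole PDF from its per-page text layers."""
    scanned = sum(needs_ocr(t) for t in page_texts)
    if scanned == len(page_texts):
        return "scanned"
    return "searchable" if scanned == 0 else "mixed"
```

Real records are often mixed, with searchable printed pages and scanned handwritten ones in the same file, which is why the check runs per page rather than per document.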

Five sample patient records are analyzed in the sections below.

1. Scanned Report With a Simple Layout and Sparse Details

[Image: scanned medical report]

Although it looks like a simple layout, documents like this can cause issues for basic OCR implementations and even Amazon Textract. They mix column-based key-value pairs, left-to-right key-value pairs, and sections that use a bit of both. The red box highlights one such mixed section.

2. Scanned Report With a Simple Layout but Dense Information

[Image: patient record processing with AI]

One of the most difficult parts of extracting information from these documents is very long tables. Extraction systems have to correlate field values to column titles using both positional information and an NLP understanding of the titles. As a value moves farther from its column title, it becomes harder for the positional signal to link the two without confusing the value with text outside the table. At some point, the model has to wonder whether text that far from the column title belongs to the table at all!

3. Searchable Report With Semi-Complex Layout

[Image: scanned patient record with key-value pairs]

This is one of the most interesting patient records to process into a structured format. It contains deeply nested key-value pairs with multiple layers. Headers come in several different font weights and sizes, which makes them even harder to recognize and extract.

Extracting the text in the schema that the document follows is critical for using this information downstream. If the text is extracted in a way that obscures which values correlate to which related information, we lose the meaning of those values. Let me show you what I mean and how it affects us.

[Image: output of above document processing]

Standard OCR extracts the text left to right with line breaks where necessary. Extracting tables left to right scrambles the read order of the information. An LLM consuming this output downstream would struggle badly.

AWS Layout Extraction is supposed to be the solution to this. It extracts the text in a format that follows the read order of the document and labels each element with its type (text, header, table).

[Image: AWS Layout Extraction output]

If you check the above output you’ll see that it misses a ton of text and has much worse OCR than the raw text extraction above. But the key issue is that the read schema isn’t correct. Text isn’t correlated to any specific headers, and sub-headers aren’t correlated to their parent headers. These documents read like nested structures in which the headers above give meaning to the smaller, more precise values beneath them; the key-value pairs and header-to-sub-header relationships don’t exist in a vacuum. The format that AWS Layout Extraction pulls this data into discards that. It should recognize that “Admission Information” is not on the same level as “Visit Information” but is actually a subsection of it. This once again makes it very hard for downstream LLMs to understand how the data correlates and is meant to be read.

[Image: JSON output]

This is how the data should look: a nested schema that clearly defines which elements are parents, which fields are children, and which text is a key versus a value. This extraction comes from a similar document run through our medical document processing pipeline. Each section gets its own key-value set under its header.
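
To make that concrete, here is a toy version of such a nested extraction in Python (the field names and values are illustrative, not taken from a real record), plus a small walker that recovers every key-value pair together with its header path:

```python
import json

# Hypothetical nested extraction: headers become parent keys, sub-headers
# nest beneath them, and leaves are key-value pairs.
record = {
    "Visit Information": {
        "Admission Information": {
            "Admit Date": "01/14/2020",
            "Admitting Physician": "Dr. Smith",
        },
        "Discharge Information": {
            "Discharge Date": "01/17/2020",
        },
    },
}

def leaf_pairs(node, path=()):
    """Yield (header_path, key, value) for every leaf key-value pair."""
    for key, value in node.items():
        if isinstance(value, dict):
            yield from leaf_pairs(value, path + (key,))
        else:
            yield path, key, value

print(json.dumps(record, indent=2))
```

Because every leaf carries its full header path, a downstream LLM (or plain code) never has to guess which section a value belongs to.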

4. Scanned Report With Complex Layout

[Image: scanned two-column layout]

This patient record has an unusual two-column layout. These types of documents can be challenging because they represent a small slice of the data variance seen in this use case: document processing systems have little training data for the layout, or simply don’t handle the mapping well. For us, it’s the same problem we solved in our resume parser, which achieved over 98% accuracy on two-column resume formats.

5. Report With Handwritten Notes

[Image: patient record with handwriting]

The scanned report above has an additional block of handwritten entries at the bottom. We’ll talk more about it further down.

Additional Technical Challenges

The technical challenges of this task are the same ones that most real-world document digitization and information extraction projects run into:

  • Handwritten information: Notes added by hand must be assumed to be medically consequential and processed accurately. However, health care professionals are busy people and are unlikely to follow any logical layout that's easy to process. A related problem is that of handwritten insertions or corrections above printed text. The system must infer that although the handwriting sits on its own line, it's meant to replace or modify part of the printed line below it. The extracted text must include such insertions at the right positions.
  • Checkboxes and other special marks: These reports contain special marks like check marks, cross marks, and circles which may be medically important and must be transferred correctly to the extracted data. The challenge here is to identify that some text, which may not be marked at all, is actually a field which hasn't been checked or circled and the text adjacent to it or on the same row is the field name. Since there may not be any marked fields, the system must rely on layout patterns to infer such special fields.
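
As a sketch of that inference, suppose OCR returns tokens with coordinates and renders checkboxes as ☐/☑ glyphs (an assumption; real engine output varies). Pairing each box with the nearest label on the same row recovers field names even when nothing on the page is checked:

```python
def pair_checkboxes(tokens, row_tol=6):
    """tokens: OCR output as dicts {"text", "x", "y"}. Treat checkbox glyphs
    as fields and attach the nearest label to their right on the same row."""
    boxes = [t for t in tokens if t["text"] in ("☐", "☑")]
    labels = [t for t in tokens if t["text"] not in ("☐", "☑")]
    fields = []
    for b in boxes:
        same_row = [l for l in labels
                    if abs(l["y"] - b["y"]) <= row_tol and l["x"] > b["x"]]
        if same_row:
            label = min(same_row, key=lambda l: l["x"] - b["x"])
            fields.append({"field": label["text"], "checked": b["text"] == "☑"})
    return fields
```

The row tolerance and glyph set are illustrative; the point is that layout geometry, not the mark itself, is what identifies a field.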

Width.ai Patient Record Summarization Pipeline

The illustration below shows our patient report summarization pipeline based on an LLM with a prompting framework and other deep-learning models.

[Image: Width.ai patient record summarization system]

The pipeline consists of these components:

  • Patient record page splitting
  • Document preprocessing for text extraction
  • Text extraction and layout understanding
  • Text preprocessing for medical use cases
  • Section classification
  • LLM summarization
  • Post-processing

We explore each step in detail in the sections below.

Patient Record Page Splitting

Each PDF is processed page by page. Sections that span multiple pages are recombined later with help from the page classification and combinator module. This is the first place where pipelines start to fall apart. Medical documents can run to thousands of pages and need to be broken down, not only for length and size reasons but also to keep relevant context tightly together. LLMs become more generic as their context windows grow (sorry, 100k-prompt-size models), and we need to avoid that for accurate summaries. Splitting the information up gives us much easier control over these factors without running into edge cases that ruin the pipeline.

Understanding where to split pages is a problem in itself. If you split a document in the wrong spot, there's no way for the downstream models to recover the missing context. That means part of the downstream models' accuracy is directly tied to this step. The record splitting algorithm is baked directly into our framework.
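
As a toy illustration of why split points matter, consider a naive splitter that keeps a page with the previous one when the page clearly starts mid-sentence. Our production record splitting algorithm is far more involved; this only shows the shape of the problem:

```python
def split_into_chunks(page_texts):
    """Keep a page with its predecessor when the page clearly starts
    mid-sentence (lowercase first character or continuing punctuation)."""
    chunks = []
    for text in page_texts:
        stripped = text.lstrip()
        continues = bool(stripped) and (stripped[0].islower() or stripped[0] in ",;)")
        if continues and chunks:
            chunks[-1] += "\n" + text
        else:
            chunks.append(text)
    return chunks
```

A splitter that ignored this and cut strictly on page boundaries would hand downstream models a sentence fragment with no way to recover its other half.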

Document Preprocessing for Text Extraction

As noted earlier, some of the scanned PDFs and images may be of poor quality. Our text extraction pipeline uses a custom vision-language model that does text extraction and OCR simultaneously to reduce spelling mistakes and other misidentification errors common in OCR-only approaches. Ensuring that the input images to this pipeline are of high quality greatly improves the accuracy.

We'll explain some of the common preprocessing techniques we use to improve text extraction.

Preprocessing for Character Recognition

[Image: image enhancement for document processing]

These image preprocessing operations are applied to the scanned patient record PDFs to improve the accuracy of character recognition:

  • Convert to grayscale: All the images are converted to the grayscale colorspace to reduce color-based inaccuracies.
  • Remove noise: Noisy pixels in the scans are removed with hybrid contrast enhancement that minimizes information loss by applying binarization locally while preserving the surrounding grayscale.
  • Synthesize high-resolution images: We use latent diffusion models to generate high-resolution versions of the scanned images, which improves OCR accuracy.
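
A bare-bones version of the first two steps, using only NumPy. The mean-based adaptive threshold here is a naive stand-in for the hybrid contrast enhancement described above, not the production implementation:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def to_grayscale(rgb: np.ndarray) -> np.ndarray:
    """Luma conversion with ITU-R BT.601 weights; rgb is an (H, W, 3) array."""
    return rgb @ np.array([0.299, 0.587, 0.114])

def local_binarize(gray: np.ndarray, block: int = 15, offset: float = 10.0) -> np.ndarray:
    """Threshold each pixel against the mean of its (block x block) neighborhood,
    a crude local binarization that adapts to uneven scan lighting."""
    pad = block // 2
    padded = np.pad(gray, pad, mode="edge")
    means = sliding_window_view(padded, (block, block)).mean(axis=(2, 3))
    return np.where(gray > means - offset, 255, 0).astype(np.uint8)
```

Thresholding against a local mean, rather than one global value, is what preserves text in regions where the scan lighting is uneven.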

Handwriting Localization

Since a page may contain handwritten notes, we use a fine-tuned R-CNN object detection model to localize blocks of handwritten text. Luckily, since handwritten text patterns often differ drastically from printed text, the detection accuracy tends to be high.

[Image: handwritten section recognition]

The primary challenge is when single handwritten words or short phrases are inserted above a printed line. Such insertions may change the meaning of the text in medically consequential ways and must be merged with the main text based on their locations.
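
A location-based merge can be sketched as follows: given the x-coordinates of the printed words and of a handwritten insertion detected just above the line, splice the insertion in before the first printed word to its right. The word tuples are illustrative stand-ins for OCR output:

```python
def merge_insertion(printed_words, insertion):
    """printed_words: [(word, x_left)], insertion: (word, x) detected
    just above the line. Place the handwritten word before the first
    printed word whose left edge lies to its right."""
    word, x = insertion
    for i, (_, wx) in enumerate(printed_words):
        if wx > x:
            return ([w for w, _ in printed_words[:i]]
                    + [word]
                    + [w for w, _ in printed_words[i:]])
    return [w for w, _ in printed_words] + [word]
```

An insertion past the end of the line (a trailing note) simply appends, which matches how such corrections are usually read.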

Preprocessing for Handwriting Recognition

After the handwritten regions are identified, special image preprocessing techniques are applied to improve the accuracy of handwriting recognition. These techniques are based on continuous offline handwriting recognition using deep learning models. The techniques include:

  • Slope and slant correction: Slope (the angle of the text from a horizontal baseline) and slant (the angle of some ascending and descending strokes from the vertical) are detected and corrected.
  • Height normalization: The ascender and descender heights along the handwritten sections are calculated and scaled to a uniform ratio.
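
Both corrections reduce to simple geometry. A sketch of slope estimation (a least-squares line fit through ink pixel coordinates) and nearest-neighbor height normalization, using NumPy only:

```python
import numpy as np

def estimate_slope(ink_points):
    """Fit y = a*x + b through ink pixel coordinates and return the slope
    angle in degrees; rotating by the negative angle levels the baseline."""
    xs = np.array([p[0] for p in ink_points], dtype=float)
    ys = np.array([p[1] for p in ink_points], dtype=float)
    a, _b = np.polyfit(xs, ys, 1)
    return np.degrees(np.arctan(a))

def normalize_height(strip, target=32):
    """Scale a text-line image (H x W array) to `target` pixels tall with
    nearest-neighbor sampling, preserving the aspect ratio."""
    h, w = strip.shape
    scale = target / h
    rows = (np.arange(target) / scale).astype(int).clip(0, h - 1)
    cols = (np.arange(max(1, int(w * scale))) / scale).astype(int).clip(0, w - 1)
    return strip[np.ix_(rows, cols)]
```

In practice the slope fit runs on baseline points rather than all ink pixels, and slant correction applies an analogous shear; this sketch shows only the core arithmetic.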

Handwriting Recognition Results

The results of handwriting recognition with an off-the-shelf text recognition library are noticeably inaccurate. In contrast, our fine-tuned, multi-modal language-image model produces far better results.

Text Extraction and Layout Understanding

Character identification based only on image pixels can be incorrect if the scanned image contains imaging artifacts like missing pixels, noisy text, and blurring, or if it contains poor handwriting (a widespread problem in the medical field).

To avoid such problems, state-of-the-art text extraction uses multi-modal language-vision models. They don't identify characters based on just the image pixels. Instead, they use a lot of additional context like:

  • The surrounding characters
  • The probable parent word, based on a massive vocabulary of words of a powerful large language model (LLM) trained on gigabytes of text including documents related to health care
  • The semantic consistency of the parent word within its surrounding context of words based on the same LLM

This additional context drastically improves the accuracy of the text extraction from both printed and handwritten sections.

Below, we explain our state-of-the-art text extraction approaches.

Deep Learning Model for Text Extraction

Text extraction and layout understanding are done by our vision-language model that's trained for document layout recognition based on OCR images. Given the image of a page, it identifies layout elements like sections, field-value pairs, or tables and produces their bounding boxes and text.

During training and fine-tuning, it's supplied with word embeddings, image patches, position embeddings, coordinate embeddings, and shape embeddings generated for the training documents. The multi-modal transformer model learns to associate spatial and word patterns in the document text with their appropriate layout elements.

Results of Layout Understanding

A sample run of our model on a patient record is shown below:

[Image: document extraction]

In the example above, the model correctly identified all the free-floating name-value fields at the top of the page, including fields like age, which don't have any values.

The model also identified distinct logical sections in the text based on their layout and formatting patterns. These sections are extracted as name-value pairs where the names are section headers and values are the text under the section headers. Sometimes, a paragraph under the previous section is identified as a separate section. However, such mistakes are easily corrected in the post-processing stage.

Handling Special Text

The text extraction must understand not only regular text but also special marks like check marks and circles that may be medically relevant.

Such unusual data outside the purview of regular text extraction is the reason we fine-tune image-language models for our document understanding pipelines. These recognitions can be trained into the image-to-text architectures outlined above to produce a single pipeline.

Text Pre-Processing for Patient Records

So far, we identified the text from images, identified layout elements, and got the text within each element's boundaries. However, the extracted text fragments may not be directly suitable for patient record use cases. For example, the text in signatures, page numbers, seals, or logos is just noisy, irrelevant information that must be discarded.

Text pre-processing uses deep learning models to ignore superfluous text. We do that by borrowing the approach of language model pre-training over clinical notes for our text extraction model. Basically, we teach it to recognize the text patterns unique to health records and ignore the rest.

We adapt its sequential, hierarchical, and pre-training approaches to our multi-modal text extraction model. It involves fine-tuning the layout understanding model with additional embeddings for medically relevant information like section labels and ICD codes.

Structured Information Extraction

The process laid out above produces extracted information structured as nested JSON data, like the example shown earlier.

GPT-4 Summarization of Extracted Information

In this stage, GPT-4 prompts and prompting techniques are used to summarize the extracted information from each relevant section. Since each of the fields we're interested in — names, ICD codes, diagnoses, medications, and dates — have distinct text patterns, we just prompt GPT-4 to identify which is which and add the entry to the summary timeline.
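
Concretely, such a field-identification prompt might look like the sketch below. The exact production prompt, model name, and parameters are assumptions, not our published ones; only the field list comes from the goals above:

```python
# The five fields targeted by the summarization pipeline.
FIELDS = [
    "names of medical personnel",
    "diagnoses",
    "ICD codes",
    "medications",
    "dates",
]

def build_extraction_prompt(section_text: str) -> str:
    """Assemble a single extraction prompt for one record section."""
    return (
        "From the patient record section below, extract these fields as JSON: "
        + ", ".join(FIELDS)
        + ". Use null for any field that is absent; do not invent values.\n\n"
        + section_text
    )

# Hypothetical call through the OpenAI Python client:
# from openai import OpenAI
# client = OpenAI()
# reply = client.chat.completions.create(
#     model="gpt-4",
#     messages=[{"role": "user", "content": build_extraction_prompt(section)}],
# )
```

Instructing the model to emit null for absent fields, rather than omitting them, keeps the downstream timeline-building code simple and guards against invented values.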

Simple Prompts

For example, consider a patient record with relevant information scattered across different locations, whose raw, unstructured extracted text carries no location hints. Despite the lack of location hints, GPT-4 correctly infers which detail is which.

Extractive and Abstractive Summary Prompts

Abstractive summary prompts are useful for isolating the useful information from the rest of the text in a chunk. We use prompts like: "In the medical record fragment below, summarize only the information about names, dates, ICD codes, medications, and diagnoses, and discard all the other text."

Once the abstractive summary is available, extractive prompts let us be more clinical. We use prompts like: "Extract the ICD codes in this medical record without modifying them."

Advanced Prompting Techniques

Prompting techniques like tree-of-thought prompting, chain-of-thought prompting, and planning and executable actions for reasoning over long documents (PEARL) are useful for isolating useful information from unnecessary information in the text.

We use prompting frameworks like these when creating summaries of events that span multiple documents. Many patient record summarization use cases involve building timelines and understanding single events across multiple documents. These plan-and-execute frameworks help us produce one goal-state summary from all the provided documents, whose differing dates, times, and locations correlate to a single event.

These frameworks also allow us to build summarization systems that are dynamic enough to be adjusted based on specific focus information. The system takes in specific areas of focus such as body region, ICD codes, or specific patient information and writes summaries that focus just on these topics. This allows you to write summaries of an entire patient history that focus on exact topics.


Chunking Lengthy Records

Since patient records can be lengthy, running into hundreds of pages, chunking strategies are necessary to break them up. Chunking on section boundaries, summarizing each chunk, and finally recursively summarizing the summaries enables us to create detailed summaries from long patient records.
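
The recursive scheme can be sketched as a small map-reduce loop, with the actual LLM call injected as a function (the summarizer here is a stand-in, not our production prompt):

```python
def recursive_summary(chunks, summarize, group_size=4):
    """Map-reduce summarization: summarize each chunk, then repeatedly
    summarize groups of summaries until a single summary remains."""
    summaries = [summarize(c) for c in chunks]
    while len(summaries) > 1:
        summaries = [
            summarize("\n".join(summaries[i:i + group_size]))
            for i in range(0, len(summaries), group_size)
        ]
    return summaries[0]
```

The group size bounds how much text each call sees, which keeps every prompt inside a tight context window regardless of how many pages the record spans.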


Post-Processing

In the post-processing stage, we consolidate the summary timelines from all sections and arrange them to get a comprehensive set of health care events related to the patient.
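
The consolidation step can be sketched as date parsing, de-duplication, and a chronological sort. The event field names and date format are assumptions about the intermediate output, not our actual schema:

```python
from datetime import datetime

def build_timeline(events):
    """events: dicts like {"date": "MM/DD/YYYY", "summary": ...} gathered
    from all sections; drop duplicates and sort chronologically."""
    seen, timeline = set(), []
    key_fn = lambda e: datetime.strptime(e["date"], "%m/%d/%Y")
    for event in sorted(events, key=key_fn):
        key = (event["date"], event["summary"])
        if key not in seen:
            seen.add(key)
            timeline.append(event)
    return timeline
```

De-duplication matters here because the same visit often appears in several sections of a record and would otherwise show up in the timeline more than once.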

For a sample patient record, the generated summary timeline captures every extracted health care event in chronological order.

Want to leverage our Patient Record Summarization System?

In this article, you saw our state-of-the-art artificial intelligence approach to create a consolidated summary timeline of all the health care events recorded in a patient report. Such a summary spanning years, possibly decades, is extremely helpful for medical professionals, health care administrators, health insurance companies, legal firms, arbitration companies, and in court.

Contact us to find out how we use modern AI techniques to help you incorporate such information summarization workflows in your hospital, lab, or medical practice.


References

  • Jorge Sueiras (2021). "Continuous Offline Handwriting Recognition using Deep Learning Models." arXiv:2112.13328 [cs.CV]. https://arxiv.org/abs/2112.13328
  • Jonas Kemp, Alvin Rajkomar, Andrew M. Dai (2019). "Improved Hierarchical Patient Classification with Language Model Pretraining over Clinical Notes." arXiv:1909.03039 [cs.LG]. https://arxiv.org/abs/1909.03039