Our SOTA GPT-4 Medical Record Summarization Pipeline

Patrick Hennis
August 21, 2023

high level GPT-4 summarization pipeline

A major problem in health care is the amount of time spent on paperwork. Many facilities still rely on paper records and migrating them to electronic health records is not that simple.

Digitized paper records are full of complexities like bad handwriting, handwritten notes and markings, a variety of page layouts from forms to tables, image quality and noise issues, irrelevant text like footers, and more. Overcoming such problems requires complex processing pipelines that combine the latest techniques in large language models, natural language processing, and computer vision.

In this article, we explore one such pipeline called GPT-4 medical record summarization that's capable of reliable digitalization of paper health records, with a lot of potential cost and time savings.

What Are Health Records In Our Use Case?

We'll briefly explain some basics of health records and their contents here with a focus on hospital inpatient health records.

Health Records vs. Medical Records

A medical record is a record of a patient's single encounter with medical care in a hospital.

In contrast, a health record is a comprehensive collection of all aspects of a patient's health over an extended period of time and across multiple care providers. Health care is broader than medical care, covering mental health, nutrition plans, health insurance, and other such aspects.

A health record typically contains many medical records, lab reports, and medical images.

In this article, we use the terms "health record," "medical record," and "medical report" interchangeably.

The Information Inside Health Records

A typical health record consists of different kinds of information:

  • Sections: Each section focuses on a particular area of health care, like medications, pathology, surgery, and so on. A health record will have multiple sections.
The medications section of a health record
The medications section of a health record (Source: AHIMA)
  • Records: These document specific aspects of a patient's medical care, like the administration of anesthesia or details of surgeries.
A medical record in a health record
A medical record in a health record (Source: AHIMA)
  • Assessments: These are also medical records with details of a first consultation or an initial assessment by a medical care provider.
  • Reports: These are typically lab tests and medical imaging reports.
A lab report in a health record
A lab report in a health record (Source: AHIMA)
  • Forms: A form consists of information laid out for patient input. Examples include biographical details and consent forms.
  • Flowsheets: Flowsheets track a patient's progress over time on aspects like vital signs, medications, and lab results.
A flowsheet in a health record
A flowsheet in a health record (Source: AHIMA)
  • Clinical text or clinical notes: Many sections and flowsheets contain medical opinions as free-form, unstructured text filled out by doctors or other care providers.
Sample clinical notes in a health record
Sample clinical notes in a health record (Source: AHIMA)

Electronic, Digitized, and Paper Health Record Formats

Other important aspects are the media and formats in which health records are stored:

  • Paper health records: They're stored on physical paper with printed and handwritten content.
  • Digitized health records: They're stored as digital scans of paper records in file formats, like portable document format (PDF), or image formats like the tag image file format (TIFF). Though these formats have a structure, the information in them is not stored as structured information that can be queried easily.
  • Electronic health records (EHR) and electronic medical records (EMR): These are stored on digital media as structured information in easily queryable databases. They're created and accessed using special EHR/EMR software.

Use Cases of Medical Report Summarization and Other Processing

In this section, we'll briefly go over how health records are used and by whom.

The primary use case of patient records is for clinical decisions and treatment plans by health care professionals. The records enable health care providers to render high-quality patient care because they contain the entire, objective history of patient data without having to rely on that information from patients.

Medical report summarization helps providers optimize their time by focusing on the most important details of a patient's record.

Since GPT-4 is capable of generating structured data, it also enables understaffed and under-resourced medical practices in rural areas to speedily convert all their paper records into queryable EHRs.

Another way the health care industry uses patient records is to inform overall improvements to the health care system through patient surveys, experimental treatments, new medicines, and so on.

Health records are also useful for administrative purposes like automated invoice generation for patients. Patients can also avoid billing fraud by running their patient records through third-party evaluations.

In some countries, health insurance is a major and essential component of the health care system. Patient records help both patients and care providers with tasks like insurance policy evaluations and prior authorizations.

Finally, health records and their data processing come up in the context of protecting patient privacy and confidentiality to comply with regulations like the Health Insurance Portability and Accountability Act (HIPAA). The data management concerns here focus on anonymizing the data in health records and analyzing it at aggregate levels rather than individual levels.

In the sections below, we explain a typical medical report processing pipeline using medical summarization as an example.

Width.ai Medical Record Summarization Pipeline

The illustration below shows our medical report summarization pipeline based on GPT-4 and other deep-learning models.

Width.ai medical record summarization pipeline

The pipeline consists of these components:

  • Medical report splitter
  • Document preprocessing for text extraction
  • Text extraction
  • Text preprocessing
  • Page classification
  • GPT-4 summarization
  • GPT-4 summary combinator

Each of these components uses different substeps and even different deep learning models.

Report Splitting

The digitized health record is processed page by page in the initial stages. Related pages can be recombined later through page classification and combinator module. But first, the record is split into pages or chunks suited to the document's file format. PDF documents are easy to split into pages. For image formats like TIFF, we use object localization models to identify and locate the page boundaries.

Document Preprocessing for Text Extraction

Text extraction is done using custom optical character recognition (OCR) approaches focused on learning both text extraction and positional OCR information. Applying suitable preprocessing to the digitized pages of a health record improves the accuracy of text extraction in the next stage.

In this section, we cover some of the common preprocessing techniques used to improve text extraction.

Preprocessing for Character Recognition

The following image preprocessing operations are applied to the digitized health record images to improve character recognition for both printed and handwritten text:

  • High-Resolution Image Synthesis with Latent Diffusion Models: High resolutions improve the accuracy of character recognition. If the scans are not high resolution (at least 1024p), we use a custom image enhancement integration to scale up the resolution. This research paper outlines Stable Diffusion based methods of image enhancement (Source)
  • Grayscale conversion: All images are converted from RGB colorspace to grayscale.
  • Noise removal: Noisy pixels are removed using hybrid contrast enhancement that combines local binarization with the preservation of the surrounding grayscale.

Preprocessing for Handwriting Recognition

Image preprocessing techniques to improve handwriting recognition are based on the paper, "Continuous Offline Handwriting Recognition Using Deep Learning Models". They include:

  • Slope and slant correction: Slope (the angle of the text from a horizontal baseline) and slant (the angle of some ascending and descending strokes from the vertical) are detected and corrected.
Slant and slope recognition in OCR
Source: Sueiras
  • Height normalization: The ascender and descender heights throughout the handwritten sections are identified and scaled to a standard ratio.
region recognition for height normalization
Source: Sueiras

Text Extraction and Layout Understanding

Text extraction identifies all the printed or handwritten characters on a digitized image, combines them into syntactic elements like words and punctuation marks, and returns them as text fragments like words, phrases, and sentences.

Character identification using just the image pixels can often be inaccurate if the image has text with poor handwriting (a widespread problem in the medical field), image defects, blurring, missing pixels, glare, and similar imaging artifacts.

State-of-the-art text extraction uses multi-modal language-vision models. They don't identify characters based on just the image pixels. Instead, they use a lot of additional information like:

  • The surrounding characters
  • The probable parent word, based on a vocabulary of words in a powerful language model trained on massive volumes of text
  • The semantic consistency of the parent word in its surrounding context of words based on the same language model

All these additional criteria drastically improve the accuracy of the text extraction, including from the tough handwritten sections of a digitized health record.

Below, we explore some state-of-the-art text extraction approaches.

LayoutLMv3 Model

LayoutLMv3 architecture
LayoutLMv3 architecture (Source: Huang et al.)

LayoutLMv3 is an OCR-based image-text model that's been trained for document layout tasks. Given a document image, it identifies layout elements like sections, field-value pairs, or tables and produces their bounding boxes and text as results.

The architecture is a pure transformer model with no convolutional elements. During training and fine-tuning, it must be supplied with the word embeddings, image patches, and position embeddings (from an off-the-shelf OCR package) of the training documents. Its multi-modal transformer model learns to associate patterns in the document text with their appropriate layout elements.

For fine-tuning, we supply a small dataset of annotated digitized health records. The pre-trained model adjusts its layer weights to activate for the layouts, layout elements, and text found in health records.

Document Understanding Transformer Model

OCR-free document understanding transformer
OCR-free document understanding transformer (Source: G. Kim et al.)

The document understanding transformer (Donut) is an alternative text extraction approach that is OCR-free. That means it does not use or generate information at the character level. Instead, it learns to directly generate text sequences from visual features without producing intermediate information like character labels and text bounding boxes.

Donut has a typical encoder-decoder transformer architecture:

  • The encoder is a language-image transformer block that recognizes the text in an image implicitly and generates embeddings for it.
  • The decoder is a transformer block that generates relevant structured text from the encoder’s embeddings.

For example, for the downstream task of document layout understanding, the decoder produces an output structured sequence like “<layout><section><heading>medications</heading><line><fragment>Aspirin</fragment> <fragment>10mg</fragment></line></section></layout>” as its output sequence.

Since Donut doesn't use OCR at all, this approach is faster and lighter with far fewer parameters than OCR-based models. It also reports high accuracy.

Text Extraction Challenges

The text extraction must not only understand the regular text but also special marks like check marks and circles. In the example below, a doctor has selected "Y" as their choice but the text extraction model has ignored it.

Special hand-drawn marks
Special hand-drawn marks (Original source: AHIMA)

Such unusual data outside the purview of regular text extraction is the reason we fine-tune image-language models for our document understanding pipelines. These recognitions can be trained into the image to text architectures outlined above to produce a single pipeline. We recommend processing these recognized marks into text that then can be learned in the language model.

Text Preprocessing for Medical Reports

The extracted text fragments may not be in an ideal state for medical processing use cases. For example, the text in signatures, page numbers, seals, logos, or letterheads just acts as noisy text that doesn't add any relevant information to health report summaries but may affect the quality of the summaries or extracted information fields.

Text preprocessing uses deep learning models to ignore such noisy text. One approach we use is fusing the approach of language model pretraining over clinical notes for our text extraction model to teach it to recognize the text patterns unique to health records.

Unlike the paper, our approach does not use long short-term memory models but instead adapts its sequential, hierarchical, and pretraining approaches to our multi-modal text extraction model. The approach involves fine-tuning the layout understanding model with additional embeddings for medical information like section labels and diagnosis codes.

Page Classification

Determining section labels for each page is an essential step for the accurate processing of digitized paper records.

As we saw earlier, every section in a health record has a distinct structure and set of fields. Not all sections can be processed the same way. The processing heavily depends on the nature of the medical information in a section, its structure, and the goals of the health care professional doing the processing. For example:

  • For some sections, the clinical text requires extractive summarization. For others, abstractive summarization may suffice if it doesn't introduce any medical risks.
  • Forms and assessments may require named entity recognition.
  • Clinical images may be processed to generate Informative diagnoses as text.

The appropriate GPT prompts and models for each section and use case are also different. So, every page is labeled with appropriate section labels and additional goal-specific labels by a page classification model.

Some examples of labels are:

  • Section labels, like medications page or discharge summary
  • Form labels, like patient details and consent forms
  • Flowsheet labels, like nursing care or intravenous therapy flowsheets
  • International Classification of Diseases (ICD) codes

The classification models that label an input health record are implemented in one of two ways explained next.

1. Zero-Shot Classification With GPT-4

GPT-4 is already trained on medical corpora and is capable of scoring high in medical examinations. As such, it's inherently capable of classifying each page of a health record based on that page's text contents. For labels that are simple and obvious, straightforward prompt instructions are sufficient; we don't even have to provide any examples as few-shot guidance.

2. Classification Using Similarity Search

For some use cases, we need special labels that zero-shot classification is unable to classify accurately. To handle them, we maintain a reference set of manually labeled sections and examine how similar an input record's section is to each section in that set. The reference sections that score high on content similarity with the input section are selected and their labels (manually set) are chosen as the labels for the input section.

Implementation-wise, we determine content similarity using vector similarity metrics like cosine similarity. The reference sections as well as the input sections are converted to embedding vectors using either OpenAI embeddings or Sentence-BERT. The reference embeddings are stored in a vector database like Pinecone and queried for vector similarity with an input section. The database returns the most similar reference sections and their labels.

GPT-4 Summarization of Sections and Clinical Text

In this stage, GPT-4 prompts are used to summarize the information on each page.

For some pages, this involves abstractive summarization of the clinical text on the page. GPT-4 rephrases that text to a shorter abstract without losing any critical details.

For other pages, GPT-4 is used for extractive summarization. Key information is extracted verbatim from a page's content.

We show some page examples and their respective prompts in the sections below.

1. Medications Page

The medications page of a sample health record annotated by the text extraction model is shown below:

Text extraction from printed medications page
Text extraction from printed medications page (Original source: AHIMA)

GPT-4 Summarization Prompt for Medications Page

We ask GPT-4 to summarize the medications page with this prompt: "Summarize the list of medications in this extract from a medications page of a health record."

Medications prompt (Source: ChatGPT)
Medications prompt (Source: ChatGPT)

GPT-4 Summary of Medications

GPT-4 generates the following summary:

Generated medications summary
Generated medications summary (Source: ChatGPT)

We can see that the dosages in the summary are missing. This is because the text extraction pipeline we used here did not keep all the information on a line together though the extraction model provides the pixel coordinates to do so. So, this is really the pipeline's drawback rather than the extraction or summarization model's, and it can be easily fixed.

2. Focus Notes Page

This focus notes page of a sample health record contains a lot of difficult-to-read handwritten text and has been annotated by the text extraction model:

Text extraction from focus notes page
Text extraction from focus notes page (Original source: AHIMA)

Note that the text extraction model has misidentified words like "Pt." (for "Patient") as a meaningless "R t." This is because the model used here has not been fine-tuned on medical records.

GPT-4 Summarization Prompt for Focus Notes

We ask GPT-4 to summarize the focus notes page with this prompt: "Summarize the following extract from the focus notes of a health record:"

Focus notes prompt
Focus notes prompt (Source: ChatGPT)

GPT-4 Summary of Focus Notes

GPT-4 generates the following summary:

Focus notes summary (Source: ChatGPT)
Focus notes summary (Source: ChatGPT)

GPT-4 has done a great job of summarizing the focus notes here. Although the extracted text is not in the same order as the page layout, GPT-4 recombined and organized the information in a coherent and structured way by itself while ignoring the unnecessary details.

3. Patient Details Form

The crucial patient details form of a sample health record has been annotated by the text extraction model as follows:

Text extraction from patient details form
Text extraction from patient details form (Original source: AHIMA)

Notice how it has accurately identified both printed and handwritten text.

GPT-4 Summarization Prompt for Patient Details

We ask GPT-4 to summarize the patient details with this prompt: "Summarize the details in this patient details form from a health record:"

Patient details prompt
Patient details prompt (Source: ChatGPT)

GPT-4 Summary for Patient Details

GPT-4 generates the following patient details summary:

Patient details summary (Source: ChatGPT)
Patient details summary (Source: ChatGPT)

Note that even in cases where the field name and field value are not together in the extracted text because of pipeline drawbacks, GPT-4 has intelligently correlated them:

Medical record number in the original record on the left. Its locations in the extracted text. GPT-4 has correctly correlated them again in the summary! (Original source: AHIMA)

GPT-4 Summary Combinator Model

GPT-4 summary combinator model

The combinator module generates the final section-level summaries. For sections that span multiple pages, it combines their page summaries into a single coherent section summary. While doing so, it doesn't just squish multiple summaries together naively. Instead, it condenses their information by removing any duplicated details and generates a concise section summary that does not feel choppy.

The combinator is implemented as a custom fine-tuned transformer model with either GPT-4 or another language model like BERT as the base model. Fine-tuning enables us to generate high-quality final summaries. It also lets users tweak the size and quality of each summary because everyone has a different idea of what an ideal summary looks like for their specific use case.

Medical Report Summarization for Document Processing Company

Using medical report summarization as a use case, this article showed a typical report processing pipeline that uses incredible advances in large language models to streamline health care operations.

In addition to summarization, many other high-quality artificial intelligence (AI) and natural language processing solutions for health records are possible now, like question-answering chatbots and powerful search engines. Contact us to explore how you can streamline operations in your hospital, lab, or medical practice with modern AI technologies.


  • Jorge Sueiras (2021). "Continuous Offline Handwriting Recognition using Deep Learning Models." arXiv:2112.13328 [cs.CV]. https://arxiv.org/abs/2112.13328
  • Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, Furu Wei (2022). "LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking." arXiv:2204.08387 [cs.CL]. https://arxiv.org/abs/2204.08387
  • Geewook Kim, Teakgyu Hong, Moonbin Yim, Jeongyeon Nam, Jinyoung Park, Jinyeong Yim, Wonseok Hwang, Sangdoo Yun, Dongyoon Han, Seunghyun Park (2021). “OCR-free Document Understanding Transformer.” arXiv:2111.15664 [cs.LG]. https://arxiv.org/abs/2111.15664
  • Jonas Kemp, Alvin Rajkomar, Andrew M. Dai (2019). "Improved Hierarchical Patient Classification with Language Model Pretraining over Clinical Notes." arXiv:1909.03039 [cs.LG]. https://arxiv.org/abs/1909.03039