Many health care organizations and health systems still use paper-based medical records and bills that are scanned into PDF or image formats, turning them into a rudimentary kind of electronic medical record (EMR). Converting them into proper EMRs by extracting all their information and turning that data into structured formats is a time-consuming but necessary task. It leads to better decision-making by all health care providers over the patient's lifetime, in addition to satisfying regulators and insurance companies.
In this article, we take a look at our state-of-the-art pipeline for processing these long and complex medical records into a data format that can be used for EMRs. We've refined it over countless builds in this domain and have learned how to handle the challenges specific to these documents.
Challenges of Electronic Medical Records
The EMRs and electronic health records (EHRs) we'll talk about in this article are in PDF or image formats. PDF EHRs can be one of two types:
Scanned PDFs: These are scans of reports with all the text present only as pixels in the scanned images. Such files require an optical character recognition (OCR) step first to convert the text pixels into text characters. The text may be printed or handwritten.
Searchable PDFs: These are proper PDFs with directly extractable and searchable text. OCR may not be necessary, but some processing is required to fit a specific schema.
Some sample EMRs are analyzed in the sections below to help you understand the data quality challenges we face in EMR information extraction.
1. Scanned EMR With a Simple Layout and Sparse Details
Although the layout looks simple, such EMRs can trip up basic OCR implementations and even cloud services like Amazon Textract. A recurring theme in this article is correlating information and grouping it based on how the document reads. These documents are full of titles, section titles, key-value pairs, and other values that must be mapped to other information to be fully understood. Even in the relatively simple document above, you can see how this data relates: the column headers give meaning to the key-value pairs, and the titles give meaning to the column headers.
2. Scanned Report With a Simple Layout but Dense Information
A major challenge of extracting patient data from EMRs is processing tables of text, because we have to correlate each field with its column title using both positional information and natural language understanding of the titles. As columns get spread out, it becomes harder to match fields to values without confusing them with values outside the table. Tables are by far the most challenging part of this process, and because providers use such a wide variety of formats, it's challenging to build pipelines that fully correlate this information.
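To make the positional part of that matching concrete, here is a minimal sketch (not our production code) of how an OCR'd cell can be assigned to a column header by comparing horizontal bounding-box overlap. The box format and field names are assumptions for illustration only.

```python
from typing import Optional

def horizontal_overlap(a: tuple, b: tuple) -> float:
    """Return the overlap (in pixels) between two (x_min, x_max) spans."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def assign_column(cell_box: dict, header_boxes: dict) -> Optional[str]:
    """Assign an OCR'd cell to the column header whose x-span overlaps it most.

    cell_box / header_boxes values are {"x_min": ..., "x_max": ...} dicts,
    a hypothetical format; real OCR output carries full bounding boxes.
    """
    best_header, best_overlap = None, 0.0
    for header_name, box in header_boxes.items():
        overlap = horizontal_overlap(
            (cell_box["x_min"], cell_box["x_max"]),
            (box["x_min"], box["x_max"]),
        )
        if overlap > best_overlap:
            best_header, best_overlap = header_name, overlap
    # Cells with no horizontal overlap likely sit outside the table entirely.
    return best_header

# Example: a value sitting under the "Dosage" column
headers = {"Medication": {"x_min": 40, "x_max": 220}, "Dosage": {"x_min": 240, "x_max": 360}}
cell = {"x_min": 250, "x_max": 330, "text": "10 mg"}
print(assign_column(cell, headers))  # -> "Dosage"
```

Real tables also need vertical row grouping and language understanding of the header text, but the overlap idea above is the positional backbone.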
3. Searchable EMR With Semi-Complex Layout
This kind of EMR poses interesting challenges when converting it into a structured format. The health record data contains deeply nested key-value pairs with multiple layers. Headers appear in different boldness levels and sizes, which makes them even more difficult to recognize and extract.
Extracting the text while retaining location information is critical for downstream use cases. If the text is extracted in a way that doesn't help the models understand correlations between different elements, the extracted data will be inaccurate at best and unusable at worst.
In the example below, standard OCR has extracted the text left to right with line breaks where necessary.
We can see that extracting tables left to right doesn't work: it scrambles the reading order of the information. If we were using a large language model (LLM) downstream to extract information from this document, it could produce inaccurate results.
AWS Layout Extraction is supposed to solve this. It extracts the text in a format that follows the reading order of the document and labels each element based on the nature of the text (like text, header, or table).
However, you can see that it missed a ton of text, and its OCR is much worse than the raw text extraction above.
But a bigger problem is that the reading structure isn't entirely correct. The text isn't correlated with any specific headers, and sub-headers aren't correlated with headers. These documents read more like nested structures, where the headers above give meaning to the smaller, more precise values below them. The key-value pairs and the header-to-sub-header relationships don't exist in a vacuum. We need the data to read like this for any downstream processing and understanding: "Date" fields mean nothing to us unless we understand the context in which they're used.
So, the format that AWS Layout Extraction identifies isn't really useful. It should recognize that "Admission Information" is not at the same level as "Visit Information" and that "Admission Information" is actually a subsection of "Visit Information." This once again makes it hard for downstream models to understand how the data correlates.
This is how the data should look: a nested schema that clearly defines which elements are parents, which fields are children, and which text is a key versus a value. This is an extract from a similar document run through our medical document processing pipeline. You can see that each section gets its own key-value set under its header.
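For illustration only (the values below are invented, not taken from a real record), a nested structure of that kind looks roughly like this, with each header acting as the parent of its own key-value set:

```python
# Hypothetical example of a nested extraction result: headers become parent
# keys, and each section holds its own key-value pairs.
visit_information = {
    "Visit Information": {
        "Admission Information": {
            "Admission Date": "01/02/2020",
            "Admitting Physician": "Dr. Example",
        },
        "Discharge Information": {
            "Discharge Date": "01/05/2020",
            "Discharge Disposition": "Home",
        },
    }
}
```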
4. Scanned Report With Complex Layout
This EMR has an unusual two-column layout. Such documents can be challenging because they represent only a small share of the data variance we see, which means document processing systems don't have much training data to learn the layout, or simply don't handle this mapping well. For us, it's essentially the same problem as our resume parser, which achieved over 98% accuracy on two-column resume formats.
5. Report With Handwritten Notes
The scanned report above has an additional block of handwritten entries at the bottom. We’ll talk more about it further down.
Additional Technical Challenges
The technical challenges of this task are the same ones that most real-world document digitization and information extraction projects run into:
Handwritten information: Notes added by hand must be assumed to be medically consequential and processed accurately. However, health care professionals are busy people and are unlikely to follow any consistent layout that's easy to process. A related problem is that of handwritten insertions or corrections above printed text. The system must infer that although the handwriting is on its own line, it's meant to replace or modify some part of the printed line below it. The extracted text must include such insertions at the right positions.
Checkboxes and other special marks: These reports contain special marks like check marks, cross marks, and circles that may be medically important and must be transferred correctly to the extracted data. The challenge is to identify that some text, which may not be marked at all, is actually a field that simply hasn't been checked or circled, and that the text adjacent to it or on the same row is the field name. Since there may not be any marked fields on a page, the system must rely on layout patterns to infer such special fields.
Data field differences based on page type: Most medical documents are a combination of pages from various sources and visits for the same patient. This means that fields appearing throughout the document have different meanings or relevance depending on the use case. If we were trying to summarize a person's current medical status, we wouldn't care about medical information from 10 years ago or about information from documents that aren't relevant. This context needs to be extracted and understood. Here's a simple example:
A model would extract the allergies section and return it as is, based on our use case. But if later patient visit documents list other allergy medications or the removal of this specific one, we wouldn't want to include it in the results. The same applies when considering the page type that the medication appears on. That means the entire document schema and an understanding of the other fields are valuable when returning information.
EMR Data Extraction Pipeline
Our clinical data extraction pipeline combines vision, natural language processing, and multimodal models to address all these challenges in EMR data extraction. The data extraction process is shown below.
We explain our methodology for health information extraction in detail in the sections that follow.
EMR Page and Section Splitter
Some EMR PDFs can run into hundreds or even thousands of pages. Since the nature of information, layout, and medical specialization can be different on every page, we process each EMR page by page while doing internal page operations to extract the data in a better format.
- The format and schema of the document can change quite a bit from page to page. If the medical records encompass pages from different facilities, you can see an enormous amount of variation between pages.
- Many of the downstream operations we perform on the extracted data depend on specific metadata found at the page level. Many use cases around chronological summaries or Q&A rely on date and location fields. If we split these pages into smaller chunks, we might lose that context when only a single date or address is provided.
- Chunks larger than a single page may incorporate too much information and mix together different visits, locations, and so on. While this isn't the worst case for chronological ordering, splitting relevant context away from a visit means that context is also missing from the next chunk, which can lead to the information not showing up in the summary at all. This is the same idea I talked about in this dialogue summarization article. That said, our most recent page-title-based approach reduces the issues you can run into with the approach above.
At an internal level, we use our table extraction, positional OCR, and row-matching algorithms to better define the schema inside the document.
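As a rough sketch of the page-level splitting idea (using the open-source pypdf library for illustration; the real pipeline works on OCR output as well as searchable PDFs), each page becomes its own chunk carrying page-level metadata such as dates and facility names that downstream summarization and Q&A steps rely on:

```python
from dataclasses import dataclass, field
from pypdf import PdfReader

@dataclass
class PageChunk:
    page_number: int
    text: str
    # Page-level metadata (visit date, facility, page type) filled in by
    # later pipeline stages; kept with the chunk so context is never lost.
    metadata: dict = field(default_factory=dict)

def split_into_page_chunks(pdf_path: str) -> list[PageChunk]:
    """Split a searchable EMR PDF into one chunk per page."""
    reader = PdfReader(pdf_path)
    return [
        PageChunk(page_number=i + 1, text=page.extract_text() or "")
        for i, page in enumerate(reader.pages)
    ]

chunks = split_into_page_chunks("emr.pdf")
```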
Document Preprocessing for Text Extraction
Our text extraction subsystem uses a fine-tuned multimodal vision-language model that combines OCR with language understanding of the text it reads. This reduces the spelling mistakes and other misidentification errors of OCR-only approaches. Ensuring high-quality input images for this subsystem greatly improves its accuracy, so we apply a set of preprocessing steps to the EMRs to improve their image quality.
Let's look at these preprocessing techniques.
Preprocessing for Character Recognition
We apply these image processing operations to scanned EMRs to improve the accuracy of character recognition:
Grayscale conversion: All the images are converted to grayscale to reduce color-based inaccuracies.
Noise removal: We remove noise using hybrid contrast enhancement that keeps information loss to a minimum by applying binarization locally while preserving the surrounding grayscale.
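A minimal OpenCV sketch of these two steps (the exact parameters and the hybrid grayscale-preserving blend in our pipeline differ; this only illustrates grayscale conversion and local binarization):

```python
import cv2

def preprocess_page(image_path: str):
    """Grayscale a scanned page and binarize it locally."""
    image = cv2.imread(image_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # Adaptive (local) thresholding binarizes each neighborhood separately,
    # which copes better with uneven scan lighting than a global threshold.
    # Block size 31 and constant 10 are illustrative values, not tuned ones.
    binary = cv2.adaptiveThreshold(
        gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 31, 10
    )
    return gray, binary
```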
Some pages may contain medically relevant handwritten sections or minor additions, strikethroughs, and markings.
Processing handwritten words or short phrases inserted somewhere above or below a printed line is challenging. Handwritten text can change the meaning of the printed text in medically consequential ways. So, they must be merged with the main text accurately based on their locations.
For all this, we use a fine-tuned region-based convolutional neural network (R-CNN) object detector model to identify the positions of handwritten text areas.
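For context, a detector of this kind can be set up by adapting a pre-trained Faster R-CNN head to a two-class problem (background vs. handwriting). The snippet below is a generic torchvision sketch, not our exact configuration:

```python
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

def build_handwriting_detector(num_classes: int = 2):
    """Faster R-CNN with its box head replaced for a background/handwriting task."""
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
    return model

# Fine-tuning then proceeds with standard torchvision detection training:
# page images plus target dicts of {"boxes": ..., "labels": ...} per page.
model = build_handwriting_detector()
```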
Preprocessing for Handwriting Recognition
After identifying the handwritten regions, more image preprocessing techniques are applied to those regions of interest to improve the accuracy of handwriting recognition. We use techniques based on continuous offline handwriting recognition using deep learning models which include:
Slope and slant correction: Slope (the angle of the text from a horizontal baseline) and slant (the angle of some ascending and descending strokes from the vertical) are detected and corrected.
Height normalization: The ascender and descender heights along the handwritten sections are calculated and scaled to a uniform ratio.
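As an illustration of the slant-correction step above (the slant angle here is assumed to be already known; in practice it is estimated from the stroke directions), a shear transform undoes the slant on a grayscale crop:

```python
import cv2
import numpy as np

def deslant(image: np.ndarray, slant_degrees: float) -> np.ndarray:
    """Shear a grayscale handwritten-line crop horizontally to remove its slant.

    `slant_degrees` is the estimated slant angle; estimating it (e.g. from
    vertical stroke directions) is a separate step not shown here.
    """
    h, w = image.shape[:2]
    shear = np.tan(np.radians(slant_degrees))
    offset = max(0.0, -shear * h)  # keep sheared content inside the frame
    # Affine matrix that shifts pixels horizontally in proportion to their height.
    matrix = np.float32([[1, shear, offset], [0, 1, 0]])
    new_width = int(w + abs(shear) * h)
    return cv2.warpAffine(image, matrix, (new_width, h), borderValue=255)
```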
Handwriting Recognition Results
The results of handwriting recognition with an off-the-shelf text recognition framework are shown below:
Notice that its character recognition is not accurate.
In contrast, our fine-tuned, multimodal, language-image model produces far better results shown below.
Basic text recognition frameworks work purely on pixels. They identify individual characters, combine them into syntactic elements like words and punctuation marks, and return them as text fragments like words, phrases, and sentences.
Character recognition based only on pixels can go wrong if the scanned image contains missing pixels, noisy text, blurring, or poor handwriting (a common problem in health care).
To avoid such problems, state-of-the-art text extraction prefers multi-modal language-vision models. Rather than base their character recognition on just the pixels, they use additional context like:
The surrounding characters
The probable parent word, based on the massive vocabulary of a powerful large language model (LLM) trained on gigabytes of text, including health care datasets
The semantic consistency of the parent word within its immediate neighborhood of words
This additional context greatly improves the accuracy of the text extraction from both printed and handwritten sections.
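To make this concrete, here is how an off-the-shelf vision-language recognizer (Microsoft's TrOCR, via Hugging Face Transformers) is typically run on a cropped handwritten region. Our production model is a fine-tuned in-house model, so treat this only as an illustration of the model class:

```python
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")

def recognize_handwriting(crop_path: str) -> str:
    """Run a cropped handwritten region through a vision-language recognizer."""
    image = Image.open(crop_path).convert("RGB")
    pixel_values = processor(images=image, return_tensors="pt").pixel_values
    generated_ids = model.generate(pixel_values)
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
```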
Section Classification
Based on the extracted text, we can classify each page by the type of data it contains into a category like:
Basic patient information and demographics
Flowsheet
Consultation note
Medication
Progress note
Patient care notes
Clinical decisions
Discharge summaries
Radiology reports
For classification, we use an LLM that's aware of health care concepts. Depending on each client's data security and budget requirements, we either fine-tune a self-hosted open-source LLM or use an LLM service like OpenAI's GPT-4. The page classification helps us understand the overall schema of the medical document and break the information down into buckets based on this schema. The goal is to add understanding to specific fields whose meaning changes based on the page they're on. A date field is a good example: it can be an admission date, a discharge date, a medical issue date, or something else depending on the page. We need this context to extract the information properly.
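A bare-bones sketch of the classification call, assuming the OpenAI option (the prompt wording, model name, and category list below are illustrative; a self-hosted LLM follows the same pattern):

```python
from openai import OpenAI

PAGE_TYPES = [
    "patient demographics", "flowsheet", "consultation note", "medication",
    "progress note", "patient care notes", "clinical decisions",
    "discharge summary", "radiology report",
]

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def classify_page(page_text: str) -> str:
    """Ask the LLM to label a page with one of the known page types."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "You classify pages of medical records. "
                        f"Answer with exactly one of: {', '.join(PAGE_TYPES)}."},
            {"role": "user", "content": page_text[:6000]},  # truncate long pages
        ],
        temperature=0,
    )
    return response.choices[0].message.content.strip()
```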
Layout Understanding
Layout understanding is done by our vision-language model, which is trained for document layout recognition based on OCR images.
It works as follows:
During training and fine-tuning, we provide layout annotations, word embeddings, image patches, position embeddings, coordinate embeddings, and shape embeddings generated for the training documents.
The multi-modal model learns to associate spatial and word patterns in the document text with their appropriate layout elements.
During inference, when shown a page of the EMR, it identifies all the layout elements like free-text sections, field-value pairs, and tables.
It then infers their text and bounding boxes.
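Our layout model is proprietary, but the same idea can be illustrated with the publicly available LayoutLMv3, which likewise combines image patches, word embeddings, and coordinate information. The checkpoint and label handling below are purely illustrative:

```python
from PIL import Image
from transformers import AutoProcessor, LayoutLMv3ForTokenClassification

# apply_ocr=True lets the processor run Tesseract to obtain words and boxes;
# a checkpoint fine-tuned with layout labels would replace the base model here.
processor = AutoProcessor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=True)
model = LayoutLMv3ForTokenClassification.from_pretrained("microsoft/layoutlmv3-base")

def predict_layout_labels(page_image_path: str):
    image = Image.open(page_image_path).convert("RGB")
    encoding = processor(image, return_tensors="pt", truncation=True)
    outputs = model(**encoding)
    # One predicted layout label per token (e.g. header, key, value, table cell).
    return outputs.logits.argmax(-1)
```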
The Results of Layout Understanding
A sample run of our model on an EMR is shown below:
We can observe the following:
The model correctly identified all the name-value pairs at the top of the page, including fields like age, which don't have any values.
The model also identified distinct logical sections in the text based on their layout and formatting patterns.
Sections are extracted as name-value pairs: the section headers become the names, and the text under the headers becomes the values.
Sometimes, a paragraph under a section may be misidentified as a separate section. Such mistakes are corrected in the post-processing stage.
Handling Special Text
In addition to regular or handwritten text, the text extraction subsystem must also understand special marks that may be medically relevant.
In this first example, the health care provider has circled "Y" for Yes.
Recognizing information like this is critical to the use case. Boxes being checked, conditionals being circled, and specific pages being signed by a physician can completely change the information required by downstream systems. In the above example, if you're building a summarization system, you need to understand which medications are circled yes and which are circled no to ensure the correct information appears in the summary. If you ever plan on automating the extraction of relevant information from these medical records, handling this text is a requirement.
In the next example, a printed form contains many check marks that must be accurately identified as checked or unchecked.
Such unusual data falls outside the purview of regular text extraction, which is why we fine-tune image-language models for document understanding. We fine-tune our image-to-text model for special marker recognition to produce a single model with multiple capabilities.
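As a hypothetical example of the kind of output such a model is trained to produce (field names and values invented for illustration), each detected field carries its mark state alongside the usual key-value data:

```python
# Invented example: mark states attached to the extracted fields.
special_marks = [
    {"field": "Allergy to penicillin", "options": ["Y", "N"], "marked": "Y", "mark_type": "circle"},
    {"field": "Patient uses a hearing aid", "marked": True, "mark_type": "checkbox"},
    {"field": "Patient is a smoker", "marked": False, "mark_type": "checkbox"},
]
```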
Text Pre-Processing for Medical Relevance
So far, we have identified the text, carved out layout elements, and extracted the text of each element. However, these extracted text fragments may not be directly usable for information extraction and other downstream use cases. For example, text like page numbers and hospital logos is just noise to be discarded.
For that, we use the approach described in the paper on language model pre-training over clinical notes for our text extraction model. Essentially, we adapted the paper's sequential, hierarchical, and pre-training approaches to teach our model to recognize the text patterns and data sources unique to health records and ignore the rest. We fine-tuned our layout understanding model with additional embeddings for medically relevant information like section labels and ICD codes.
Named Entity Recognition (NER)
Within each layout element, there may be useful named entities like:
Diseases
Physiological conditions
Symptoms
Medications
To identify such named entities, we can use the same medical-aware LLM to perform NER. This is one of the key benefits of a custom pipeline over services like Textract or Azure's document processing offerings: we can customize the training to recognize specific values instead of the system extracting them as plain key-value pairs. This helps us downstream, since we've already identified the specific data we care about.
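A minimal sketch of prompt-based NER with the same LLM (the entity types and prompt wording are illustrative, and in practice the response is validated before parsing):

```python
import json
from openai import OpenAI

client = OpenAI()

def extract_medical_entities(section_text: str) -> dict:
    """Ask the LLM to pull out diseases, conditions, symptoms, and medications."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "Extract named entities from the clinical text. Return JSON "
                        "with the keys: diseases, conditions, symptoms, medications."},
            {"role": "user", "content": section_text},
        ],
        temperature=0,
    )
    # A production pipeline validates the output before parsing it as JSON.
    return json.loads(response.choices[0].message.content)
```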
Structured Information Extraction
The rather unstructured data identified so far, in the form of sections, field-value pairs, and named entities, undergoes validation and structuring into JSON data elements like this:
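The fragment below is a hypothetical illustration of the kind of structure produced at this stage (all values invented):

```python
import json

structured_record = {
    "patient": {"name": "Jane Doe", "date_of_birth": "1970-01-01"},
    "visit": {
        "admission_date": "2020-01-02",
        "discharge_date": "2020-01-05",
        "facility": "Example Medical Center",
    },
    "allergies": [{"substance": "Penicillin", "reaction": "Rash"}],
    "medications": [{"name": "Metformin", "dosage": "500 mg", "frequency": "Twice daily"}],
}
print(json.dumps(structured_record, indent=2))
```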
Some clients require the structured information to comply with EMR data standards like HL7 C-CDA, which requires the information to follow well-defined Extensible Markup Language (XML) schemas. For such clients, we transform the structured information from JSON into the required XML schema.
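A grossly simplified sketch of that JSON-to-XML step using the standard library (a real C-CDA document has a far richer, template-driven structure; this only shows the mechanical conversion):

```python
import xml.etree.ElementTree as ET

def record_to_xml(record: dict) -> str:
    """Convert a simple structured record into a bare-bones XML document."""
    root = ET.Element("ClinicalDocument")
    for section_name, fields in record.items():
        section = ET.SubElement(root, "section", {"name": section_name})
        if isinstance(fields, dict):
            for key, value in fields.items():
                ET.SubElement(section, "field", {"name": key}).text = str(value)
        else:
            section.text = str(fields)
    return ET.tostring(root, encoding="unicode")

print(record_to_xml({"visit": {"admission_date": "2020-01-02"}}))
```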
Similarly, clinical research also follows its own data standards to record clinical trials and clinical information of patients. Our pipeline can handle that use case too.
An approach we're exploring is turning the layout understanding model into a seq2seq model to directly output C-CDA XML, just like the document understanding transformer (Donut) model.
Looking for help building EMR data extraction tools?
In this article, we explored the challenges we face in EHR data extraction and explained the pipeline of deep learning models we use to accurately extract the data.
If you're a medical startup, health care organization, or service provider assisting with EMR processing, we can help you implement reliable, scalable, and accurate EMR data extraction.
Contact us for consultation and insights into any challenges you're facing in your EMR data extraction initiatives.