What is Intelligent Document Processing (IDP) & How You Can Get Started Automating Document Data Extraction

Karthik Shiraly
November 20, 2022

Whatever your industry, your business likely faces problems with document management, like errors when transferring numbers from a financial report or receiving a variety of document layouts. Automating them may seem impossible at first.

But state-of-the-art artificial intelligence, natural language processing, and computer vision have evolved to a point where automation for such hard problems is now reliable and practical.

This article introduces you to intelligent document processing (IDP), its applications in different industries, and the various frameworks we leverage every day to automate document processing. 

What Is Intelligent Document Processing?

IDP is the AI-based automation of extracting useful information from any digital, handwritten, or printed document. Using a mix of two machine learning domains, computer vision and natural language processing (NLP), the pipeline learns the key details of unstructured documents in order to process their information.

Why is it needed? Business documents face several problems that cost time and money:

  • Wide variety of layouts: Documents contain tables, images, differing sections, and other variations that make automation more challenging.
  • Poor image resolution or scanning quality: Documents pick up noise from scanning, from being photographed, or from low-quality PDFs.
  • Archaic documents stored on paper (e.g., land titles, legal documents)
  • Important handwritten details (e.g., court records, account numbers of invoices)
  • Some document types (legal, financial) require an expert understanding of the domain.
  • Manual document processing is slow and often has repetitive tasks.

IDP is a robust solution for the automated processing of such documents with high reliability and accuracy.

Applications of Intelligent Document Processing

Let’s take a look at 4 unique use cases that illustrate the practical benefits of IDP software in different industries.

1. Legal Agreement Processing

legal document processing for master service agreements

Legal service agreements are processed and reviewed by attorneys for key information (terms of service, payment amount, party information) and for risk management on the sell side or the buy side. A high-level review by an attorney may take an hour, while a detailed review may take several hours.

Businesses can cut down the time and costs of legal reviews using intelligent service agreement processing. The pipeline can summarize a 30-page agreement down to just a few sentences within 30 seconds while retaining critical sentences of legal significance. It does this using the text summarization capabilities of large language models.

2. Invoice and Receipt Processing

invoice and receipt processing pipeline

Every business — large, medium, or small — needs invoice and receipt processing to automate its AP & AR processes. But both manual and template-based approaches face several challenges — they’re time-consuming, cost too much, have problems with the variety of layouts, or include too much background noise that makes processing more complex.

In contrast, intelligent automated invoice processing that leverages a custom deep learning pipeline takes just three seconds per invoice, reduces the per-invoice processing cost by up to 85% compared to manual processing, and handles a wide variety of invoice layouts and data formats.

bounding box recognition for receipts

3. Resume Processing & Information Extraction

bounding box recognition for resume processing

The same deep learning framework we use for the above document processing work can be used to extract key entities from resumes. We can go from any resume format to a structured JSON of fields such as work history, name, skills, certifications and more, with 92% accuracy. 

We train this document processing framework on resume formats from the most common sources, such as LinkedIn and Google Docs, as well as custom multi-column examples with higher variance and more fields. Modern intelligent document processing pipelines don’t rely on old-school rule-based or template-based systems, which allows us to scale easily to new resume sources over time.

resume processing JSON extracted data
Go from input resume and recognized text to extracted fields with nested sections. 

resume data extracted into json format
Go from unstructured data to simple-to-use JSON fields that let you skip manual data entry.
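
As a rough illustration of what such structured output might look like, the sketch below models a few of the fields mentioned above as Python dataclasses and serializes them to JSON. The class names, field names, and sample values are hypothetical and chosen only for illustration; they are not the actual schema of our pipeline.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class WorkEntry:
    # One nested item in the work history section (hypothetical fields).
    employer: str
    title: str
    start: str
    end: str


@dataclass
class Resume:
    name: str
    skills: List[str] = field(default_factory=list)
    certifications: List[str] = field(default_factory=list)
    work_history: List[WorkEntry] = field(default_factory=list)


def to_json(resume: Resume) -> str:
    # asdict() recurses into nested dataclasses, so work_history
    # becomes a list of plain dictionaries before serialization.
    return json.dumps(asdict(resume), indent=2)


example = Resume(
    name="Jane Doe",
    skills=["Python", "SQL"],
    certifications=["Example Cloud Certificate"],
    work_history=[WorkEntry("Example Corp", "Data Analyst", "2019-01", "2022-06")],
)
print(to_json(example))
```

Keeping nested sections as their own types makes it easier to add fields later without changing how the top-level record is consumed.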

4. Legal Documents Processing

legal document cover sheet information extraction

The cover sheets of legal documents contain important information like the names of plaintiffs, defendants, attorneys, judges, and case dates. Reviewing and transferring them manually to a case management system can take hours of manual effort.

But by applying intelligent legal document processing, law firms and courts can shave off hours of labor per document. You can extract data from both state and federal court cover sheets, as we’ve scaled the data variance to cover more than 60 commonly used formats. Even difficult cover sheets, like those of the state of California with their multiple boxes and Spanish-language text, can be processed and formatted into structured data.

How Intelligent Document Processing Solutions Work

intelligent document processing solutions workflow

Intelligent document pipelines are composed of five phases:

  • Data capture
  • Document understanding
  • Information validation and evaluation
  • Information storage
  • Process integrations

In the next sections, we explore each phase in depth.
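
To make the five phases concrete, here is a minimal sketch that chains them as plain Python functions. Every function body is a toy stand-in (for example, "understanding" here just splits a "key: value" string), and all names are assumptions made for illustration rather than parts of a real IDP product.

```python
from typing import Callable, Dict, List

Document = Dict[str, object]
Phase = Callable[[Document], Document]


def capture(doc: Document) -> Document:
    # Data capture: stand-in for scanning, decoding, and clean-up.
    doc["text"] = str(doc["raw"]).strip()
    return doc


def understand(doc: Document) -> Document:
    # Document understanding: stand-in for model-based extraction.
    key, _, value = str(doc["text"]).partition(":")
    doc["fields"] = {key.strip().lower(): value.strip()}
    return doc


def validate(doc: Document) -> Document:
    # Validation: flag documents that contain an empty field value.
    doc["valid"] = all(doc["fields"].values())
    return doc


def store(doc: Document) -> Document:
    # Storage: only valid documents are marked as persisted.
    doc["stored"] = bool(doc["valid"])
    return doc


def integrate(doc: Document) -> Document:
    # Process integration: decide which downstream step comes next.
    doc["next_step"] = "notify_approver" if doc["stored"] else "manual_review"
    return doc


PHASES: List[Phase] = [capture, understand, validate, store, integrate]


def run_pipeline(raw: str) -> Document:
    doc: Document = {"raw": raw}
    for phase in PHASES:
        doc = phase(doc)
    return doc
```

Treating each phase as a function with the same signature keeps the phases independently replaceable, which matters once real models take the place of these stand-ins.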

Data Capture & Input Processing

During the data capture & input processing phase, a batch of documents is received and passes through a pre-processing module. A good IDP solution should be capable of receiving and handling a large volume of documents.

These documents can come in many formats from many sources:

  • From document or content management systems in digital formats like Microsoft Office or portable document format (PDF)
  • As email attachments
  • As printouts, faxes, or paper documents by mail that should be scanned into image formats

The pre-processing module contains a mix of noise reduction and document enhancement models that improve the results of the downstream document processing models.
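
Production pipelines typically use learned models for this step, but the idea can be sketched with two classical operations: a 3x3 median filter that removes isolated specks, followed by thresholding to pure black and white. This is a simplified stand-in written with the standard library only, not the enhancement models described above; the function names and the default threshold are assumptions.

```python
from statistics import median_low
from typing import List

Image = List[List[int]]  # grayscale pixel rows, values 0-255


def median_filter(image: Image) -> Image:
    """Replace each pixel with the median of its 3x3 neighbourhood."""
    height, width = len(image), len(image[0])
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            # Collect neighbours, clipping the window at the image border.
            window = [
                image[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
            ]
            # median_low always returns an actual pixel value.
            row.append(median_low(window))
        result.append(row)
    return result


def binarize(image: Image, threshold: int = 128) -> Image:
    """Map every pixel to pure black (0) or pure white (255)."""
    return [[255 if pixel >= threshold else 0 for pixel in row] for row in image]
```

For example, a white 3x3 patch with a single black speck in the centre comes out fully white after `median_filter`, because the speck is outvoted by its neighbours.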

Document Understanding

Document understanding (DU) is the most important phase of IDP during which useful information is extracted from a document. It covers tasks like:

  • Text extraction
  • Layout analysis (positional understanding that focuses on the data relevant to the task)
  • Document classification (e.g., identify a document as an insurance claim or a court record)
  • Data extraction as fields and values
  • Named entity recognition
  • Summarization of long-form text

Unlike template-based processing approaches, IDP aims for a fully automated, in-depth understanding of documents. It achieves that by using deep learning extensively for all the tasks above. That’s why a typical IDP pipeline may have many DL models, each trained for a specific task.

Before delving into the steps of document understanding, we’ll first introduce some of these models and their terminology.

Overview of Deep Learning Models and Terminology

Let’s first review some DL models and terms that you’ll encounter in the rest of this article.

Visual Language Models

Though the document text is the primary information for DU, the visual cues of text fragments — their positions, shapes, or borders — are essential for correctness. Many DU models examine both modalities — textual and visual — simultaneously, making them multimodal visual language models.


Transformers

transformer architecture for intelligent document processing
Transformer architecture (Source: Vaswani et al.)

Transformers are a family of neural networks. Since they take in an input sequence and produce another sequence, they’re called seq2seq models.

A transformer network consists of an encoder block, a decoder block, or both. An encoder accepts a sequence and produces embeddings that capture its context. A decoder then reads those embeddings to generate an output sequence.

Due to their scalability and ability to embed not just short-range but also long-range context, transformers are preferred to older seq2seq models like recurrent networks. Most state-of-the-art visual language models are implemented as transformers.
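
The mechanism behind that long-range context is attention, described in the Vaswani et al. paper cited at the end of this article. The sketch below is a deliberately stripped-down version of scaled dot-product self-attention: it assumes identity projections (a real transformer learns separate query, key, and value projections and uses several attention heads), and it works on plain Python lists.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    # Subtract the maximum for numerical stability.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens: List[Vector]) -> List[Vector]:
    """Scaled dot-product self-attention with identity projections."""
    dim = len(tokens[0])
    outputs = []
    for query in tokens:
        # Score the query against every position, near or far.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        weights = softmax(scores)
        # Each output is a weighted average of all token vectors.
        outputs.append([
            sum(w * value[i] for w, value in zip(weights, tokens))
            for i in range(dim)
        ])
    return outputs
```

Because every position is scored against every other position in one step, distance within the sequence does not limit which tokens can influence each other.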


BERT

Bidirectional Encoder Representations from Transformers (BERT) is a pre-trained transformer model that’s popular for natural language processing (NLP). It consists of only an encoder block.


GPT-3

document processing pipeline with gpt-3

GPT-3 is a transformer-based large language model that’s trained on a variety of online text datasets. GPT-3 is a pure NLP model and does not accept any visual features. Unlike BERT models, which you can run on your own infrastructure, GPT-3 is offered by OpenAI as a managed application programming interface (API). But though it’s an API, it lets you fine-tune the model using your custom data.

Convolutional Neural Networks

Convolutional neural networks (CNNs) are used for computer vision tasks like object detection. They accept images and only work with visual features. Though CNNs like ResNet and EfficientNetV2 remain popular, vision transformers are slowly replacing them.

With this overview done, we can now explore the steps of a DU pipeline.

1. Text Extraction Approaches for Processing Documents

Some document formats store text for easy extraction. But others, like images, require recognizing text from their layout. For that, we use two broad approaches.

Optical Character Recognition (OCR)

OCR pipeline for document processing
Traditional OCR approach (Source: G. Kim et al.)

This is the traditional approach. First, rectangular regions with text are identified by a text detection CNN. Then an OCR model recognizes each character in every region.

Since OCR that uses only visual features is prone to misidentification, the preferred approach is to combine visual features with features from language models to avoid identifying a character that’s unlikely in its surrounding text. LayoutLMv2 is a state-of-the-art example of this approach.
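
The intuition can be shown with a toy rescoring step: each OCR candidate carries a visual probability, and a language prior penalizes strings that are unlikely as words. This is not how LayoutLMv2 works internally; it is a much simpler illustration of the same idea, and the word list, probabilities, and function names are all made up for the example.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical word frequencies standing in for a real language model.
WORD_PRIOR: Dict[str, float] = {"invoice": 0.6, "total": 0.3}


def language_score(word: str, floor: float = 1e-6) -> float:
    # Unknown words get a small floor probability instead of zero.
    return max(WORD_PRIOR.get(word.lower(), 0.0), floor)


def pick_best(candidates: List[Tuple[str, float]], lm_weight: float = 1.0) -> str:
    """Choose the candidate with the best combined log score.

    Each candidate is (text, visual_probability) with probability > 0.
    """
    def combined(item: Tuple[str, float]) -> float:
        word, visual_prob = item
        return math.log(visual_prob) + lm_weight * math.log(language_score(word))

    return max(candidates, key=combined)[0]
```

With candidates `[("lnvoice", 0.55), ("Invoice", 0.45)]`, the visual score alone favours the misread "lnvoice", while the combined score selects "Invoice".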

OCR-Free Approaches

OCR-free document understanding transformer (Source: G. Kim et al.)

A newer OCR-free approach uses a transformer to directly map visual features to a text sequence without producing intermediate data like text regions and character classes. 

The document understanding transformer (Donut) uses this approach. It consists of an encoder and a decoder block:

  • The encoder is a visual-language model trained on datasets like IIT-CDIP. Given an input image, it recognizes the text internally and generates embeddings.
  • The decoder is another transformer fine-tuned for the desired task. It interprets the encoder’s embeddings to produce relevant output sequences as structured text.

For example, if the downstream task is document classification, the decoder produces “<classification><class>court-record</class></classification>” as its output sequence.

The reported benefits of this approach are speed and accuracy: with no separate OCR engine in the loop, the pipeline has fewer parameters overall than comparable OCR-based models, and its reported accuracy is also higher.
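
Downstream code usually wants a dictionary rather than a tagged string. The sketch below converts an output sequence like the classification example above into nested dictionaries. It is an independent, simplified parser written for this article, not Donut’s own conversion code, and it assumes well-formed tags, no tag nested inside a tag of the same name, and no repeated sibling tags.

```python
import re
from typing import Dict, Union

Parsed = Union[str, Dict[str, "Parsed"]]

# Matches <name>body</name>, where the closing tag repeats the opening name.
TAG = re.compile(r"<(?P<name>[\w-]+)>(?P<body>.*?)</(?P=name)>", re.DOTALL)


def parse_sequence(sequence: str) -> Parsed:
    """Convert nested <field>value</field> text into nested dictionaries."""
    matches = list(TAG.finditer(sequence))
    if not matches:
        # No tags left: this is a leaf value.
        return sequence.strip()
    # Repeated sibling names would overwrite each other in this sketch.
    return {m.group("name"): parse_sequence(m.group("body")) for m in matches}


print(parse_sequence("<classification><class>court-record</class></classification>"))
```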

2. Document Classification and Layout Analysis

Some documents, like invoices, come in a wide variety of layouts. People have no trouble handling this variety because we combine visual cues, positions, surrounding context, and our linguistic knowledge to understand them.

For accurate DU, a deep learning model has to replicate such human understanding. If it classifies a document as an invoice, it should identify text in a table as a probable line item. If it’s a form, handwritten text inside a box may be a field’s value and the adjacent box is probably the field’s name.

Like OCR approaches, layout analysis can be explicit or implicit. Some pipelines do it explicitly using separate models — one for table detection, one for field detection, another for document classification, and so on.

Other models like the ones we’ve already seen — LayoutLMv2 and Donut — are end-to-end pipelines that implicitly understand the relative positioning of text fragments. 

The latter’s encoders implicitly differentiate between text inside a table, text in a box, and free text because they all have different visual features around them. Similarly, free text on a form has different visual features from that on a court record. Since the embeddings generated for each of them are different, their decoders have no trouble generating different output sequences too.

3. Information Extraction Sets up Automated Document Processing

Information extraction identifies useful pieces of information in a document and labels them with the correct field names. For example, if a form has a mailing address field, it identifies the text in the adjacent box as the mailing address.

Models like Donut’s decoder are fine-tuned for this task using datasets like the Consolidated Receipt Dataset (CORD). Once fine-tuned, the decoder can identify information in nested groups to produce sequences like "<items><item>{name, count, price}</item></items>" from a receipt. Similarly, custom fields can also be identified through fine-tuning.

Alternatively, we can use GPT-3 to extract data.

4. Named Entity Recognition (NER)

NER identifies names of people, places, companies, medicines, and similar based on the surrounding text.

An NER model is typically a sequence labeling model, trained with a labeled dataset on top of a pre-trained language model like BERT. When it spots a named entity in an input sequence, it marks those tokens with the entity’s label in the output sequence. But a drawback is that if you want a new entity type, you have to update your training dataset and retrain the model.
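
One common way to represent such a model’s output is BIO tagging, where each token is marked as the beginning (B-) of an entity, inside (I-) one, or outside (O) any entity. The sketch below assumes the tags have already been predicted and only shows how they are grouped into entity spans; the label names PER and LOC are illustrative.

```python
from typing import List, Tuple

Entity = Tuple[str, str]  # (entity text, entity label)


def decode_bio(tokens: List[str], tags: List[str]) -> List[Entity]:
    """Group B-/I- tagged tokens into labeled entity spans."""
    entities: List[Entity] = []
    current: List[str] = []
    label = ""
    for token, tag in zip(tokens, tags):
        continues = tag.startswith("I-") and bool(current) and tag[2:] == label
        if continues:
            current.append(token)
            continue
        if current:
            # The previous entity has ended, so emit it.
            entities.append((" ".join(current), label))
            current, label = [], ""
        if tag.startswith("B-"):
            current, label = [token], tag[2:]
        # Stray I- tags with no matching B- are ignored in this sketch.
    if current:
        entities.append((" ".join(current), label))
    return entities
```

Adding a new entity type means adding new B- and I- labels to the training data, which is why a fixed label set forces the retraining mentioned above.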

In contrast, GPT-3’s task-agnostic knowledge allows for dynamic NER. If you send the right prompts for the entities you want, GPT-3 will select them in your document. This allows you to scale your NER pipeline without fully labeling a document set and without fixing a set of labels in advance.
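
A prompt-based setup might look roughly like the sketch below, which builds the prompt text and parses a reply. The prompt wording and the "type: value" reply format are assumptions made for this example, and the call to the model itself is deliberately left out.

```python
from typing import Dict, List


def build_ner_prompt(document: str, entity_types: List[str]) -> str:
    """Assemble a plain-text prompt asking for the listed entity types."""
    wanted = "\n".join(f"- {name}" for name in entity_types)
    return (
        "Extract the following entities from the document. "
        "Answer with one 'type: value' pair per line.\n"
        f"Entities:\n{wanted}\n\n"
        f"Document:\n{document}\n\n"
        "Answer:\n"
    )


def parse_answer(completion: str) -> Dict[str, str]:
    """Turn 'type: value' lines from the model's reply into a dictionary."""
    pairs: Dict[str, str] = {}
    for line in completion.splitlines():
        key, separator, value = line.partition(":")
        # Skip lines without a colon or without a value.
        if separator and value.strip():
            pairs[key.strip()] = value.strip()
    return pairs
```

Asking for a new entity type is then a one-line change to `entity_types` rather than a retraining job.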

5. Fine-Tuning on Custom Document Data

An early stage pipeline struggles to handle new types of documents correctly. You almost always want to fine-tune it with a custom dataset that is specific to your use case.

End-to-end models like Donut’s decoder are easier and cheaper to fine-tune because you have just one model in the pipeline. Alternatively, you can fine-tune GPT-3 on your custom data.

Another good approach is using models capable of few-shot learning. For example, instead of using BERT, use Sentence-BERT (SBERT), whose sentence embeddings can be compared directly using a similarity measure.
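
One simple few-shot recipe is nearest-centroid classification over sentence embeddings: average the embeddings of a handful of examples per class, then assign a new document to the class with the most similar centroid. In the sketch below, the vectors are toy two-dimensional values standing in for real embeddings from a model such as SBERT, and the function names are our own.

```python
import math
from typing import Dict, List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def centroid(vectors: List[Vector]) -> Vector:
    # Average the vectors dimension by dimension.
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def classify(query: Vector, examples: Dict[str, List[Vector]]) -> str:
    """Return the label whose example centroid is most similar to the query."""
    centroids = {label: centroid(vecs) for label, vecs in examples.items()}
    return max(centroids, key=lambda label: cosine(query, centroids[label]))


# Toy embeddings: two labeled examples per class.
EXAMPLES = {
    "invoice": [[1.0, 0.1], [0.9, 0.0]],
    "resume": [[0.0, 1.0], [0.2, 0.9]],
}
print(classify([0.8, 0.2], EXAMPLES))
```

Supporting a new document type then only requires a few labeled examples, not a full retraining run.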

6. Other Common Tasks

In addition to the above tasks, other common tasks in DU are:

  • Keyword and key-phrase detection
  • Summarization of long-form text
  • Sentiment analysis

Information Validation

Evaluating the correctness and quality of the extracted information is essential for business operations. Without that, serious problems like undetected fraud in invoices, bad loans, or unfavorable contract conditions can blow up your business.

Validation is done using a variety of techniques:

  • Every model should generate task-specific metrics like confidence scores and F1-scores. They should be logged against the document number that was processed. If a model scores too low or fails, that document should be stored in a failed queue for human analysis.
  • If there are many metrics, make them intelligible by combining them into a single master score that’s indicative of overall quality.
  • Users should evaluate tasks like summarization that involve subjective perceptions of results. Use tools like RankME to collect feedback from users and adjust the pipeline.
  • You may need human-in-the-loop workflows for custom data labeling and evaluating the fine-tuning of your models. 
  • Verify extracted data against external sources (e.g., verify that addresses and phone numbers on a loan application are not fake).
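
The first two techniques can be sketched as a weighted master score plus a threshold that routes low-scoring documents to a failed queue. The metric names, weights, and threshold below are hypothetical and would need tuning for a real pipeline.

```python
from typing import Dict, List, Tuple

# Hypothetical weights and threshold; tune them for your own pipeline.
WEIGHTS: Dict[str, float] = {
    "ocr_confidence": 0.5,
    "field_confidence": 0.3,
    "classifier_confidence": 0.2,
}
REVIEW_THRESHOLD = 0.8


def master_score(metrics: Dict[str, float]) -> float:
    """Weighted average of the per-model scores (each between 0 and 1)."""
    total_weight = sum(WEIGHTS[name] for name in metrics)
    weighted = sum(WEIGHTS[name] * value for name, value in metrics.items())
    return weighted / total_weight


def route(documents: Dict[str, Dict[str, float]]) -> Tuple[List[str], List[str]]:
    """Split document IDs into accepted and failed (human review) queues."""
    accepted: List[str] = []
    failed: List[str] = []
    for doc_id, metrics in documents.items():
        queue = accepted if master_score(metrics) >= REVIEW_THRESHOLD else failed
        queue.append(doc_id)
    return accepted, failed
```

Logging the individual metrics alongside the master score keeps the failed queue debuggable, since a reviewer can see which model pulled the score down.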

Information Storage

The extracted information is often exported to formats like JavaScript Object Notation (JSON), PDF, Excel, or reports.

Plus, depending on your business needs, IDP stores the extracted data in a variety of destinations:

  • Amazon S3
  • Databases
  • Data warehouses or data lakes
  • Third-party systems like enterprise resource planning (ERP), case management, or customer relationship management (CRM) systems
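
As a small example of the database option, the sketch below stores each document’s extracted fields as a JSON string in SQLite, using only the Python standard library. The table name, column names, and function names are assumptions for illustration; a production system would more likely target a managed database or warehouse.

```python
import json
import sqlite3
from typing import Dict


def save_extraction(conn: sqlite3.Connection, doc_id: str, fields: Dict[str, str]) -> None:
    """Store one document's extracted fields as a JSON string."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS extractions (doc_id TEXT PRIMARY KEY, fields TEXT)"
    )
    # INSERT OR REPLACE keeps reprocessing of the same document idempotent.
    conn.execute(
        "INSERT OR REPLACE INTO extractions (doc_id, fields) VALUES (?, ?)",
        (doc_id, json.dumps(fields)),
    )
    conn.commit()


def load_extraction(conn: sqlite3.Connection, doc_id: str) -> Dict[str, str]:
    """Read the fields back; assumes save_extraction has created the table."""
    row = conn.execute(
        "SELECT fields FROM extractions WHERE doc_id = ?", (doc_id,)
    ).fetchone()
    return json.loads(row[0]) if row else {}
```

Usage is a matter of opening a connection, for example `sqlite3.connect(":memory:")` for a quick test, and calling `save_extraction` once per processed document.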

Process Integrations

An IDP pipeline is always a part of some larger business process that may involve:

  • Approvals by authorized employees (e.g., approving high-value invoices)
  • Email and other communication channels
  • Data analysis by business intelligence and data science teams
  • Report generation
  • Signatures
  • General business systems like ERP, accounting, human resource management, or CRM
  • Industry-specific systems like case management, core banking, or health care information systems
  • Existing robotic process automation (RPA) systems

A good IDP solution integrates with your existing business workflows and business systems seamlessly to streamline them.

Get the Intelligent Document Processing Software Solution You Need

In this article, you explored the use cases and internals of IDP. Are you interested in streamlining your business processes using IDP solutions? Or extracting insights from your paper and electronic documents? Or perhaps you’re stuck with legacy documents and looking for reliable digital transformation.

With our expertise in IDP, we have the answers you need. Contact us!


  • Geewook Kim, Teakgyu Hong, Moonbin Yim, Jeongyeon Nam, Jinyoung Park, Jinyeong Yim, Wonseok Hwang, Sangdoo Yun, Dongyoon Han, Seunghyun Park (2021). “OCR-free Document Understanding Transformer”. arXiv:2111.15664 [cs.LG]. https://arxiv.org/abs/2111.15664 
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin (2017). “Attention Is All You Need”. arXiv:1706.03762 [cs.CL]. https://arxiv.org/abs/1706.03762