How Intelligent Document Processing Uses Machine Learning To Remove Manual Processes From Your Business

Karthik Shiraly
November 10, 2021

Every day, millions of reports are produced and forms are filled out around the world. Businesses and governments have to process them just as quickly. Some need financial data from business reports, while others need to transfer data out of forms into digital databases.

But the reports may be PDFs or custom invoices with no convenient way to extract the data quickly. Is there no way to avoid hiring manual data entry services? This is the world's unstructured data problem.

Unstructured data is data in a form that is not suited to computers. Some estimates say as much as 80% of all data is like this. That's way too many industry, business, legal, and government reports to ignore.

It's not just conventional documents either. Legal case documents, employee contracts, product labels, SKU forms — these are all unstructured data that different organizations need to process digitally.

This is where intelligent document processing (IDP) shines. 

Intelligent document processing automates data extraction from handwritten, printed, and digital documents of any kind. Using artificial intelligence, machine learning, and deep learning, it interprets text and writing much as a human does.

For example, when it sees an unknown name, it can identify if it’s the name of a person or the name of an organization. When it sees a set of numbers, it can identify if it’s a monetary amount or a phone number or an address. When it sees tabulated text, it can identify rows, columns, and cell values. The identified text is converted into structured data that computers can process easily.
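To make "structured data" concrete, here is a minimal sketch that tags short strings as monetary amounts or phone numbers. It is only an illustration: the regular expressions and the `classify_token` and `to_records` names are assumptions made for this example, whereas a real IDP system learns these distinctions from training data rather than from hand-written patterns.

```python
import re

# Hypothetical hand-written patterns, used here only for illustration.
# A production IDP system would learn these distinctions from data.
MONEY = re.compile(r"^[$€£]\s?\d[\d,]*(\.\d{2})?$")
PHONE = re.compile(r"^\+?\d[\d\s\-()]{6,}\d$")


def classify_token(text: str) -> str:
    """Return a coarse type label for a short piece of extracted text."""
    text = text.strip()
    if MONEY.match(text):
        return "monetary_amount"
    if PHONE.match(text):
        return "phone_number"
    return "unknown"


def to_records(tokens: list) -> list:
    """Turn raw strings into structured records that other software can query."""
    return [{"text": t, "type": classify_token(t)} for t in tokens]


if __name__ == "__main__":
    for record in to_records(["$1,250.00", "+1 415-555-0100", "Acme Corp"]):
        print(record)
```

The output is a list of dictionaries, which is the kind of structured form that can be stored in a database or passed to another system.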

How Intelligent Document Processing Can Help You

IDP is transformative no matter the business vertical or horizontal you're in. These two case studies will help you understand how.

IDP Automation Solutions for Financial Services

Stripe is a popular payments service that enables websites to accept payments from their users. More than 2 million websites across 44 countries use it.

As an operator in the highly regulated financial industry, Stripe has to follow strict Know Your Customer (KYC) regulations. 

They ask users — individuals and businesses — to upload a variety of documents that prove their identities and addresses. The documents should meet a number of rules and quality conditions.

Imagine the complexities Stripe faces:

  • Every country issues different sets of documents for identity, address, and articles of incorporation.
  • Their layouts and the details inside are different.
  • They use different languages.
  • Photo quality will vary.

Such KYC workflows are certainly not unique to Stripe. Indeed, their KYC data volume may be relatively small. Banks and governments handle KYC data volumes that are an order of magnitude larger. Your company probably has, or wants, comparable volumes.

IDP can efficiently streamline and automate such workflows. It automatically checks the quality of uploads, extracts KYC details from them, and stores the data for future searches. In this way, it boosts your company's operational efficiency and onboarding scalability.

IDP Automation Solutions for Healthcare

Healthcare is another heavily regulated industry with a high documentation burden.

Surveys of healthcare workers reveal that most of them spend more time on paperwork than on patient care. Other independent research confirms these findings.

But why is this, despite the industry-wide shift to electronic health records?

Well, it turns out that time-consuming manual data entry is still the norm. While the medium of entry may now be digital, workers still type out text and fill fields. Additionally, information is sometimes recorded on printed progress notes and cover sheets.

In all fairness, it's not like the industry has ignored the problem. Knowledge process outsourcing (KPO) and robotic process automation (RPA) are used extensively. But KPO and RPA have limited capability and efficiency. KPO offloads tasks to other people but is not scalable. RPA scales simple automated tasks but is not intelligent.

Enter IDP. It resolves this impasse by bringing intelligence, scalability, and efficiency at once. Further, by freeing up healthcare workers to focus on patient care, it improves the quality of healthcare that patients experience.

Other Industries

Those are just two illustrative examples. Other industries where IDP is transformative include banking, insurance, law, education, engineering, and government.

IDP is used across these verticals for use cases like:

  • Document transcription
  • Invoice processing
  • Insurance claims processing
  • Regulatory compliance
  • Legal document processing
  • Customer relationship management (CRM) integration and customer experience automation
  • Enterprise resource planning (ERP) integration
  • Supply chain automation
  • Intelligent automation of existing KPO and RPA workflows

The IDP Process

IDP is actually an approach for digital transformation through automation. You can implement the stages that benefit your specific business problem and ignore or defer the stages that don't.

In that spirit, we can break down IDP into the following stages:

  1. Data acquisition and data capture: This stage covers the automation of everything involved in acquiring and storing raw unstructured data such as scans or photos. It includes software and hardware automation.
  2. Data pre-processing: The raw images are transformed to make them easier to process by downstream tasks.
  3. Document understanding: This is the heart of IDP. It covers text detection, recognition, layout analysis, and information extraction. Machine learning is heavily used. We'll come back to this stage to understand it in depth.
  4. Information validation: Is the extracted information valid according to business rules? Software rules automatically verify the data and alert operators if errors are found. 
  5. Human-in-the-loop improvements: Subsets of the extracted data are selected randomly for manual reviews. Errors detected here are turned into training examples to incrementally improve the accuracy of the document understanding stage.
  6. Information storage: The validated, structured information is stored in a data warehouse for retrieval and integrations.
  7. Process integrations: Every business will have some process where the extracted data gets used by other systems. It could be simple report generation or more complex ERP integration. IDP covers automated transformation and transfer of data to those systems.
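The information validation and human-in-the-loop stages above can be sketched in a few lines. This is an illustration under stated assumptions, not a prescribed implementation: the invoice fields (`customer_name`, `total`, `line_items`), the specific business rules, and the 10% review fraction are all hypothetical.

```python
import random


def validate_invoice(record: dict) -> list:
    """Apply simple business rules; return a list of error messages (empty if valid)."""
    errors = []
    if not record.get("customer_name"):
        errors.append("customer_name is missing")

    total = record.get("total")
    if not isinstance(total, (int, float)) or total < 0:
        errors.append("total must be a non-negative number")
        return errors

    # Cross-check: the line items should add up to the stated total.
    line_sum = sum(
        item["quantity"] * item["unit_price"]
        for item in record.get("line_items", [])
    )
    if abs(line_sum - total) > 0.01:
        errors.append("line items do not add up to total")
    return errors


def sample_for_review(records: list, fraction: float = 0.1, seed=None) -> list:
    """Randomly pick a subset of extracted records for manual review."""
    if not records:
        return []
    rng = random.Random(seed)
    count = max(1, round(len(records) * fraction))
    return rng.sample(records, count)


if __name__ == "__main__":
    invoice = {
        "customer_name": "Jane Doe",
        "total": 25.5,
        "line_items": [
            {"quantity": 2, "unit_price": 10.0},
            {"quantity": 1, "unit_price": 5.5},
        ],
    }
    print(validate_invoice(invoice))
```

Records that fail validation would be flagged for an operator, while errors found in the randomly sampled records would become new training examples for the document understanding stage.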

How Does Document Understanding Work?

Document processing became fairly intelligent only after 2011 thanks to advanced algorithms in artificial intelligence, deep learning, computer vision, and natural language processing (NLP).

These advances enabled document understanding — the ability of machines to extract information the way people do.

Document understanding comprises three tasks:

  • Text understanding
  • Layout analysis
  • Information extraction

Here, we'll look under the hood of each of these tasks. If you want to dig even deeper, you may find this survey of document understanding techniques interesting.

Text Understanding

Text understanding detects and recognizes all the printed and handwritten characters in a document.

For document scans and photos, text detection is used to first identify the regions where text is present. Convolutional neural networks (CNNs) are heavily used for coarse-grained and fine-grained text detection.

Object detection is a coarse-grained method. How does it work? Blur your eyes while reading this article. Notice how the text regions still stand out from everything around them? Object detection works the same way. It tells you the positions of rectangular regions where text is detected.

It's fast and works well for typical layouts. But avoid it for complex documents or stylized text.

Text segmentation is more fine-grained. It examines each pixel, classifies whether it belongs to a unit of text, and includes it in a map of labeled text pixels called the mask. The unit of text depends on the training. Some use lines of text, some use words, and some use characters.

It handles complex layouts and stylized text (such as product labels) better. But be aware that creating training sets to fine-tune accuracy can be time-consuming. U-net is a popular neural network model that can be used for text segmentation.
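To show how a segmentation mask relates to the rectangular regions that object detection reports, the sketch below groups connected text pixels and returns one bounding box per group. The mask is hand-written for illustration; in practice it would come from a trained segmentation network such as U-net. The `mask_to_boxes` name and the choice of 4-connectivity are assumptions made for this example.

```python
from collections import deque


def mask_to_boxes(mask: list) -> list:
    """Return (top, left, bottom, right) for each 4-connected group of 1s in the mask."""
    rows = len(mask)
    cols = len(mask[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    boxes = []

    for r in range(rows):
        for c in range(cols):
            if mask[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill from this unvisited text pixel.
            top, left, bottom, right = r, c, r, c
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and mask[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


if __name__ == "__main__":
    # 1 marks a pixel labeled as text, 0 marks background.
    example_mask = [
        [0, 1, 1, 0, 0],
        [0, 1, 1, 0, 1],
        [0, 0, 0, 0, 1],
    ]
    print(mask_to_boxes(example_mask))  # [(0, 1, 1, 2), (1, 4, 2, 4)]
```

Because the mask is built pixel by pixel, the groups can follow curved or irregular text that a single coarse rectangle would capture poorly.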

Character instance segmentation is an excellent approach if you have a large variety of documents, fonts, and languages to process, as in the Stripe KYC example.

Text detection is followed by text recognition to actually identify the characters laid out throughout the document. It typically uses optical character recognition (OCR). Sometimes, you may need intelligent character recognition (ICR) instead. ICR is advanced OCR that can handle handwritten text, emojis, glyphs from different fonts, or different scripts.

Layout Analysis

Layout analysis enables your IDP solution to see documents the way people do.

It's needed for document classification. Your business may process a variety of document types. The IDP solution needs to know the extraction model to apply to a particular document. It classifies the document by type based on its layout and applies the relevant extraction model.
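The routing step described here can be sketched as a lookup from document type to extraction function. Everything in this example is hypothetical: the document type names, the placeholder extractors, and the `route` function are assumptions for illustration, and the document type is passed in directly rather than predicted by a layout classifier.

```python
from typing import Callable, Dict


def extract_invoice(text: str) -> dict:
    # Placeholder: a real extractor would run an invoice-specific model.
    return {"doc_type": "invoice", "raw_text": text}


def extract_passport(text: str) -> dict:
    # Placeholder: a real extractor would run a passport-specific model.
    return {"doc_type": "passport", "raw_text": text}


# Registry mapping each known document type to its extraction model.
EXTRACTORS: Dict[str, Callable[[str], dict]] = {
    "invoice": extract_invoice,
    "passport": extract_passport,
}


def route(doc_type: str, text: str) -> dict:
    """Apply the extraction model registered for the classified document type."""
    extractor = EXTRACTORS.get(doc_type)
    if extractor is None:
        raise ValueError(f"no extraction model registered for {doc_type!r}")
    return extractor(text)


if __name__ == "__main__":
    print(route("invoice", "Invoice #001 ..."))
```

Keeping the registry separate from the classifier means a new document type can be supported by adding one entry, without touching the routing logic.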

Page segmentation methods detect high-level layout elements such as text, figures, and tables. You'll find them sufficient for most use cases.

Logical structure methods identify more fine-grained elements, such as paragraphs or headings. You may need these for workflows that rely on text formatting, such as treating headings as topic tags while storing in a database.

Layout analysis outputs a set of layout elements and their types, positions, dimensions, and structural characteristics. They are used during information extraction.

Information Extraction

This is the crucial stage where everything comes together to output structured data. Given an invoice, it extracts details like the customer's name, address, quantities, and amounts. Given a hand-filled paper form, it extracts all field names and text written in boxes.

How does it work?

A variety of deep learning architectures are available. Some use convolutional neural networks. Some use combinations of convolutional and recurrent or transformer networks. Some use graph convolutional networks. 

But generally, they all work on the same intuition. When you see an invoice, you instantly recognize it as such because most invoices have a characteristic visual layout. The same is the case with paper forms. These are called visual features. 

You can also recognize a sequence of numbers in a form as a telephone number even if you can't read the language in its box. These are called textual features.

Every network architecture is trained to correlate these visual and textual features with structured information. When the network sees a dark box with printed text in a paper form, it knows that it must be a field name and the text in the adjacent box must be its value.
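The field-name and adjacent-value intuition can be imitated with a purely geometric rule, which helps show what the network has to learn. The sketch below is a hand-written heuristic, not a neural network: the `Box` structure, the `is_label` flag, and the "nearest box to the right on the same row" rule are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Box:
    text: str
    x: float        # left edge
    y: float        # top edge
    width: float
    height: float
    is_label: bool  # True for a field-name box, False for a value box


def pair_fields(boxes: list) -> dict:
    """Pair each label box with the nearest value box to its right on the same row."""
    labels = [b for b in boxes if b.is_label]
    values = [b for b in boxes if not b.is_label]
    pairs = {}
    for label in labels:
        candidates = [
            v for v in values
            # The value starts at or after the label's right edge...
            if v.x >= label.x + label.width
            # ...and its vertical center falls within the label's vertical span.
            and label.y <= v.y + v.height / 2 <= label.y + label.height
        ]
        if candidates:
            nearest = min(candidates, key=lambda v: v.x)
            pairs[label.text] = nearest.text
    return pairs


if __name__ == "__main__":
    form = [
        Box("Name", 0, 0, 100, 20, True),
        Box("Jane Doe", 100, 0, 200, 20, False),
        Box("Phone", 0, 30, 100, 20, True),
        Box("555-0100", 100, 30, 200, 20, False),
    ]
    print(pair_fields(form))  # {'Name': 'Jane Doe', 'Phone': '555-0100'}
```

A fixed rule like this breaks as soon as a form places values below their labels or uses multi-line boxes, which is why trained networks that combine visual and textual features are preferred.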

In the next section, we'll flesh out this intuition using a state-of-the-art neural network architecture for information extraction.

TRIE Neural Network

TRIE is a recent architecture introduced in 2020 by the research paper End-to-End Text Reading and Information Extraction for Document Understanding. We’ll use it to understand how these architectures typically work.

TRIE happens to be an end-to-end model, meaning that it does all three tasks together: text understanding, layout analysis, and information extraction. Based on factors like the availability of training data and the performance requirements of your specific business problem, we can recommend whether to go for one end-to-end model or three independent models.

TRIE has three blocks. The text reading block is for text understanding. The multimodal context block does a type of layout analysis. The information extraction block produces structured data.

The text reading block consists of two networks.

First is an object detection network to detect text regions and positions. It's based on the feature pyramid network architecture. It outputs all the rectangular regions where text is detected. 

Additionally, this network outputs a feature vector for each text region. It's called image embedding and the process is called encoding. It describes that region's characteristic visual features mathematically.

The second network in the same block is a character recognition network to identify the detected text. It's a recurrent neural network with long short-term memory (LSTM) units. Its inputs are the image embeddings of text regions. For each region, it outputs a sequence of characters.

In summary, the text reading block outputs positions of text regions, their image embeddings, and their character sequences.

These three outputs are fed into the multimodal context block. It fuses them to produce a richer set of visual and textual features that improve the information extraction step.

The information extraction block consists of a bidirectional LSTM recurrent neural network. Its inputs are the character sequences along with the rich visual and textual features from the multimodal context block. 

Its outputs are field-value pairs of data. In this way, a document image is converted to structured data.
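The data flow between the three blocks can be summarized as a skeleton. This is not the TRIE implementation: every function below is a placeholder that returns fixed or trivially computed values, and the names, vector sizes, and pairing rule are assumptions chosen only to show what each block receives and produces.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class TextRegion:
    box: Tuple[int, int, int, int]  # position of the text region (x1, y1, x2, y2)
    image_embedding: List[float]    # visual feature vector for the region
    characters: str                 # recognized character sequence


def text_reading_block(image: object) -> List[TextRegion]:
    # Placeholder for the detection and recognition networks:
    # returns fixed regions regardless of the input image.
    return [
        TextRegion((10, 10, 90, 30), [0.1, 0.2], "Total"),
        TextRegion((100, 10, 160, 30), [0.3, 0.4], "42.00"),
    ]


def multimodal_context_block(regions: List[TextRegion]) -> List[List[float]]:
    # Placeholder fusion: concatenate position, image embedding, and text length.
    return [
        [float(v) for v in r.box] + r.image_embedding + [float(len(r.characters))]
        for r in regions
    ]


def information_extraction_block(
    regions: List[TextRegion], context: List[List[float]]
) -> Dict[str, str]:
    # A trained network would use the context vectors to tag each region.
    # This placeholder only checks that there is one context vector per region,
    # then pairs regions in reading order as (field, value).
    if len(regions) != len(context):
        raise ValueError("expected one context vector per text region")
    return {
        regions[i].characters: regions[i + 1].characters
        for i in range(0, len(regions) - 1, 2)
    }


def run_pipeline(image: object) -> Dict[str, str]:
    regions = text_reading_block(image)
    context = multimodal_context_block(regions)
    return information_extraction_block(regions, context)


if __name__ == "__main__":
    print(run_pipeline(image=None))  # {'Total': '42.00'}
```

In the real architecture all three blocks are neural networks trained together, so improvements in text reading and in information extraction can reinforce each other.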

Get the Smart Document Processing Solution You Need

Interested in streamlining your business processes using IDP? Or in extracting insights from your paper and electronic documents? Or maybe you’re stuck with legacy documents and unsure about which document processing solution to go for. 

We have the answers. Let’s talk!