Intelligent Document Processing: How We Extract High-Quality Data from Business Documents
Find out how intelligent document processing (IDP) lets you automate the extraction of key information from digital, handwritten, and printed documents for faster, more reliable business workflows.
Whatever your industry, your business likely faces document management problems, like errors when transferring numbers from a financial report or coping with a wide variety of document layouts. Automating such tasks may seem impossible at first.
But state-of-the-art artificial intelligence, natural language processing, and computer vision have evolved to a point where automation for such hard problems is now reliable and practical.
This article introduces you to intelligent document processing (IDP), its applications in different industries, and the various frameworks we leverage every day to automate document processing.
IDP is the AI-based automation of extracting useful information from any digital, handwritten, or printed document. Using a mix of computer vision and natural language processing (NLP), the pipeline learns the key details of unstructured documents and turns them into structured information.
Why is it needed? Manual handling of business documents creates several problems that cost time and money: data entry errors, slow turnaround, and an endless variety of formats and layouts.
IDP is a robust solution for the automated processing of such documents with high reliability and accuracy.
Let’s take a look at four use cases that illustrate the practical benefits of IDP software in different industries.
Attorneys review legal service agreements for key information (terms of service, payment amounts, party details) and for risk management on the sell side or buy side. A high-level review may take an hour, while a detailed review can take several hours.
Businesses can cut down the time and costs of legal reviews using intelligent service agreement processing. The pipeline can summarize a 30-page agreement down to just a few sentences within 30 seconds while retaining critical sentences of legal significance. It does this using the text summarization capabilities of large language models.
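As a simplified illustration, here is what that summarization call can look like with the pre-1.0 openai Python SDK; the prompt wording, the model name, and the `agreement_text` variable are assumptions for this sketch, not our production pipeline.

```python
# A minimal sketch of LLM-based agreement summarization, assuming the
# pre-1.0 openai Python SDK and an `agreement_text` string extracted earlier.
import openai

openai.api_key = "YOUR_API_KEY"  # assumption: supplied via secure config in practice

def summarize_agreement(agreement_text: str) -> str:
    prompt = (
        "Summarize the following service agreement in a few sentences, "
        "keeping terms of service, payment amounts, and party names:\n\n"
        + agreement_text
    )
    # Very long agreements would need chunking to fit the model's context window.
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt,
        max_tokens=256,
        temperature=0.0,  # deterministic output for legal review
    )
    return response.choices[0].text.strip()
```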
Every business, whether large, medium, or small, needs invoice and receipt processing to automate its accounts payable and accounts receivable (AP and AR) processes. But both manual and template-based approaches face several challenges: they are time-consuming, cost too much, struggle with the variety of layouts, and are easily thrown off by background noise in scanned documents.
In contrast, intelligent automated invoice processing that leverages a custom deep learning pipeline takes just three seconds per invoice, reduces the per-invoice processing cost by up to 85% compared to manual processing, and handles a far wider range of layouts and data variance than template-based systems.
The same deep learning framework we use for the document processing work above can be used to extract key entities from resumes. We can go from any resume format to a structured JSON of fields such as name, work history, skills, and certifications, with 92% accuracy.
We train this document processing framework on resume formats from the most common sources, such as LinkedIn and Google Docs, as well as custom multi-column examples with higher variance and more fields. Modern intelligent document processing pipelines don’t rely on old-school rule-based or template-based systems, which allows us to scale to new resume sources easily over time.
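For illustration, the structured output for one resume might look like the following; the exact field names and values here are hypothetical, not our production schema.

```python
# Illustrative shape of the structured output for one resume; field names
# and values are made up for the example.
resume_record = {
    "name": "Jane Doe",
    "work_history": [
        {"employer": "Acme Corp", "title": "Data Analyst", "years": "2019-2023"},
    ],
    "skills": ["Python", "SQL"],
    "certifications": ["PMP"],
}
```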
The cover sheets of legal documents contain important information like the names of plaintiffs, defendants, attorneys, and judges, as well as case dates. Reviewing them and transferring the data to a case management system by hand can take hours.
But by applying intelligent legal document processing, law firms and courts can shave hours of labor off each document. You can extract data from both state and federal court cover sheets; we’ve scaled the data variance to cover more than 60 commonly used formats. Even difficult cover sheets, like California’s forms with multiple boxes and Spanish-language text, can be processed and formatted into structured data.
Intelligent document pipelines are composed of five phases: data capture and input processing, document understanding, validation, export and storage, and integration with business workflows.
In the next sections, we explore each phase in depth.
During the data capture & input processing phase, a batch of documents is received and passes through a pre-processing module. A good IDP solution should be capable of receiving and handling a large volume of documents.
These documents can come in many formats (digital, scanned, photographed, or handwritten) and arrive from many different sources.
This module contains a mix of noise reduction and document enhancement models used to improve the results of the downstream document processing models.
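As a rough illustration, the snippet below sketches a basic pre-processing step with OpenCV; the production models for noise reduction and enhancement are learned rather than hand-coded, so treat this as a simplified stand-in.

```python
# A minimal pre-processing sketch using OpenCV; learned denoising and
# enhancement models would replace these classical steps in production.
import cv2

def preprocess(path: str):
    image = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    # Remove scanner/camera noise before downstream text recognition.
    denoised = cv2.fastNlMeansDenoising(image, h=10)
    # Binarize with Otsu's threshold to sharpen text against the background.
    _, binary = cv2.threshold(denoised, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return binary
```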
Document understanding (DU) is the most important phase of IDP; it’s where useful information is extracted from a document. It covers tasks like text recognition, layout analysis, information extraction, named entity recognition, and document classification.
Unlike template-based processing approaches, IDP aims for a fully automated, in-depth understanding of documents. It achieves that by using deep learning extensively for all the tasks above. That’s why a typical IDP pipeline may have many DL models, each trained for a specific task.
Before delving into the steps of document understanding, let’s first review some of these deep learning (DL) models and the terms you’ll encounter in the rest of this article.
Though the document text is the primary information for DU, the visual cues of text fragments — their positions, shapes, or borders — are essential for correctness. Many DU models examine both modalities — textual and visual — simultaneously, making them multimodal visual language models.
Transformers are a family of neural networks. When they take in an input sequence and produce another sequence, they’re called sequence-to-sequence (seq2seq) models.
A transformer network consists of an encoder or a decoder block or both. An encoder accepts a sequence and produces an embedding vector. The embedding is then read by a decoder to generate an output sequence.
Due to their scalability and ability to embed not just short-range but also long-range context, transformers are preferred to older seq2seq models like recurrent networks. Most state-of-the-art visual language models are implemented as transformers.
Bidirectional Encoder Representations from Transformers (BERT) is a pre-trained transformer model that’s popular for NLP tasks. It consists of only an encoder block.
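To make the encoder and embedding idea concrete, here is a minimal sketch that embeds a text fragment with a pre-trained BERT encoder via the Hugging Face transformers library; the sample text and pooling choice are just for illustration.

```python
# A minimal sketch: encode a text fragment into a fixed-size embedding with BERT.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Total amount due: $1,250.00", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool the token embeddings into one vector representing the fragment.
embedding = outputs.last_hidden_state.mean(dim=1)
print(embedding.shape)  # torch.Size([1, 768])
```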
GPT-3 is a transformer-based large language model trained on a variety of online text datasets. GPT-3 is a pure NLP model and does not accept any visual features. Unlike BERT models that you can run on your own infrastructure, GPT-3 is a managed application programming interface (API) from OpenAI. Even so, it lets you fine-tune the model on your custom data.
Convolutional neural networks (CNNs) are used for computer vision tasks like object detection. They accept images and only work with visual features. Though CNNs like ResNet and EfficientNetV2 remain popular, vision transformers are slowly replacing them.
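As an illustration of the detection API, here is a minimal sketch with a pre-trained Faster R-CNN from torchvision (assuming a recent torchvision version); the COCO-pretrained checkpoint is only a stand-in, since document pipelines train similar detectors on text regions, tables, and form fields instead.

```python
# A minimal object-detection sketch with a pre-trained CNN detector.
import torch
from torchvision.io import ImageReadMode, read_image
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.transforms.functional import convert_image_dtype

# COCO-pretrained weights just to show the API; document pipelines retrain
# this kind of detector on text regions, tables, and form fields.
model = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

image = convert_image_dtype(read_image("scanned_page.png", mode=ImageReadMode.RGB), torch.float)
with torch.no_grad():
    detections = model([image])[0]  # dict with "boxes", "labels", "scores"
print(detections["boxes"].shape, detections["scores"][:5])
```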
With this overview done, we can now explore the steps of a DU pipeline.
Some document formats store text in a form that’s easy to extract. Others, like scanned images, require recognizing the text from pixels. For that, we use two broad approaches.
This is the traditional approach. First, rectangular regions containing text are identified by a text detection CNN. Then an optical character recognition (OCR) model recognizes each character in every region.
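The sketch below shows this detect-then-recognize flow using the open-source Tesseract engine via pytesseract; a production pipeline would use custom detection and recognition models, and the file name is just a placeholder.

```python
# A minimal detect-then-recognize sketch with Tesseract via pytesseract.
import pytesseract
from PIL import Image
from pytesseract import Output

image = Image.open("invoice_page.png")
data = pytesseract.image_to_data(image, output_type=Output.DICT)

# Each entry pairs a detected text region (bounding box) with the recognized text.
for text, left, top, width, height, conf in zip(
    data["text"], data["left"], data["top"], data["width"], data["height"], data["conf"]
):
    if text.strip():
        print(f"({left}, {top}, {width}, {height}) -> {text!r} (conf={conf})")
```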
Since OCR that relies only on visual features is prone to misidentification, the preferred approach is to combine visual features with features from language models, so the model avoids predicting a character that’s unlikely given the surrounding text. LayoutLMv2 is a state-of-the-art example of this approach.
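For the combined visual-plus-language approach, here is a hedged sketch using the LayoutLMv2 classes from the Hugging Face transformers library. The processor runs Tesseract OCR internally and the model requires detectron2 to be installed; the checkpoint and label count are illustrative, and the base checkpoint would still need fine-tuning before its token labels mean anything for your documents.

```python
# A hedged sketch of LayoutLMv2 token classification (e.g., labeling form fields).
from PIL import Image
from transformers import LayoutLMv2ForTokenClassification, LayoutLMv2Processor

processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncased")
model = LayoutLMv2ForTokenClassification.from_pretrained(
    "microsoft/layoutlmv2-base-uncased", num_labels=5  # label count is an assumption
)

image = Image.open("form_page.png").convert("RGB")
# The processor runs OCR and packs words, bounding boxes, and image features together.
encoding = processor(image, return_tensors="pt")
outputs = model(**encoding)
predicted_labels = outputs.logits.argmax(-1)  # one label id per token (random until fine-tuned)
```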
A newer OCR-free approach uses a transformer to directly map visual features to a text sequence without producing intermediate data like text regions and character classes.
The document understanding transformer (Donut) uses this approach. It consists of an encoder and a decoder block: a vision encoder (a Swin Transformer) that converts the document image into embeddings, and a text decoder (based on BART) that generates a structured output sequence directly from those embeddings.
For example, if the downstream task is document classification, the decoder produces “<classification><class>court-record</class></classification>” as its output sequence.
The benefit of this approach is that it’s easier and faster to train because it has far fewer parameters than OCR-based models. Its reported accuracy is also higher.
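Below is a hedged inference sketch for Donut using Hugging Face’s publicly available checkpoint fine-tuned on the CORD receipt dataset (the same dataset mentioned later for information extraction); the image path is a placeholder.

```python
# A hedged Donut inference sketch: image in, structured JSON out.
import torch
from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

checkpoint = "naver-clova-ix/donut-base-finetuned-cord-v2"
processor = DonutProcessor.from_pretrained(checkpoint)
model = VisionEncoderDecoderModel.from_pretrained(checkpoint)

image = Image.open("receipt.jpg").convert("RGB")
pixel_values = processor(image, return_tensors="pt").pixel_values

# The task prompt tells the decoder which output schema to generate.
task_prompt = processor.tokenizer(
    "<s_cord-v2>", add_special_tokens=False, return_tensors="pt"
).input_ids

with torch.no_grad():
    outputs = model.generate(pixel_values, decoder_input_ids=task_prompt, max_length=512)

sequence = processor.batch_decode(outputs)[0]
print(processor.token2json(sequence))  # nested fields such as items and totals
```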
Some documents, like invoices, come in a wide variety of layouts. People have no trouble handling this variety because they combine visual cues, positions, surrounding context, and linguistic knowledge to understand them.
For accurate DU, a deep learning model has to replicate such human understanding. If it classifies a document as an invoice, it should identify text in a table as a probable line item. If it’s a form, handwritten text inside a box may be a field’s value and the adjacent box is probably the field’s name.
Like OCR approaches, layout analysis can be explicit or implicit. Some pipelines do it explicitly using separate models — one for table detection, one for field detection, another for document classification, and so on.
Other models like the ones we’ve already seen — LayoutLMv2 and Donut — are end-to-end pipelines that implicitly understand the relative positioning of text fragments.
The encoders of these end-to-end models implicitly differentiate between text inside a table, text in a box, and free text because each has different visual features around it. Similarly, free text on a form has different visual features from free text on a court record. Since the embeddings generated for each are different, the decoders have no trouble generating different output sequences either.
Information extraction identifies useful information in documents and labels it with the correct field names. For example, if a form has a mailing address field, it identifies the text in the adjacent box as the mailing address.
Models like Donut’s decoder are fine-tuned for this task using datasets like the Consolidated Receipt Dataset (CORD). Once fine-tuned, the model can identify information in nested groups and produce sequences like "<items><item>{name, count, price}</item></items>" from a receipt. Custom fields can also be identified through fine-tuning.
Alternatively, we can use GPT-3 to extract data.
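As a rough illustration, here is what prompt-based extraction can look like with the pre-1.0 openai Python SDK; the model name, prompt wording, and field names are illustrative rather than our production setup.

```python
# A hedged sketch of prompt-based field extraction with GPT-3.
import json
import openai

openai.api_key = "YOUR_API_KEY"  # assumption: supplied via secure config in practice

def extract_fields(document_text: str) -> dict:
    prompt = (
        "Extract the invoice number, invoice date, and total amount from the "
        "document below. Respond with a JSON object using the keys "
        '"invoice_number", "invoice_date", and "total_amount".\n\n'
        + document_text
    )
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt,
        max_tokens=200,
        temperature=0.0,
    )
    # A real pipeline would parse and validate this output more defensively.
    return json.loads(response.choices[0].text)
```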
Named entity recognition (NER) identifies the names of people, places, companies, medicines, and the like based on the surrounding text.
An NER model is a seq2seq model trained with a labeled dataset and a pre-trained language model like BERT. When it spots a named entity in an input sequence, it includes the entity’s label in the output sequence. But there’s a drawback: if you want to add a new entity type, you have to update your training dataset and retrain the entire model.
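Here is a minimal sketch of BERT-based NER using the Hugging Face pipeline API; the public checkpoint shown is an example, not our production model.

```python
# A minimal NER sketch with a BERT-based token-classification pipeline.
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="dslim/bert-base-NER",          # public example checkpoint
    aggregation_strategy="simple",        # merge word pieces into whole entities
)

text = "This agreement is between Acme Corp and John Smith, signed in New York."
for entity in ner(text):
    print(entity["entity_group"], entity["word"], round(entity["score"], 2))
```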
In contrast, GPT-3’s task-agnostic knowledge allows for dynamic NER. If you send the right prompts for the entities you want, GPT-3 will pick them out of your document. This lets you scale your NER pipeline without fully labeling a document set or committing to a fixed label set.
An early-stage pipeline struggles to handle new types of documents correctly. You almost always want to fine-tune it with a custom dataset that’s specific to your use case.
End-to-end models like Donut are easier and cheaper to fine-tune because there’s just one model in the pipeline. Alternatively, you can fine-tune GPT-3 on your custom data.
Another good approach is to use models capable of few-shot learning; for example, use Sentence-BERT (SBERT) instead of plain BERT.
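As an illustration, the sketch below classifies a new text fragment by comparing SBERT embeddings against a handful of labeled examples, using the sentence-transformers library; the labels and example texts are made up for the demo.

```python
# A hedged few-shot classification sketch with SBERT embeddings.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# A handful of labeled examples is often enough to bootstrap a new document type.
examples = {
    "invoice": "Invoice number 1041, total due $1,250.00 by 30 June",
    "court cover sheet": "Plaintiff: Jane Doe. Defendant: Acme Corp. Case No. 22-0113",
}
example_embeddings = {label: model.encode(text) for label, text in examples.items()}

query = model.encode("Amount payable: $980.50, invoice #2208")
scores = {label: float(util.cos_sim(query, emb)) for label, emb in example_embeddings.items()}
print(max(scores, key=scores.get))  # expected: "invoice"
```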
In addition to the above, a DU pipeline may include other tasks that depend on the document type and use case.
Evaluating the correctness and quality of the extracted information is essential for business operations. Without that, serious problems like undetected fraud in invoices, bad loans, or unfavorable contract conditions can blow up your business.
Validation is done using a variety of techniques, from automated consistency checks and business rules to human review of low-confidence results.
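As a rough illustration, the sketch below shows the kind of rule-based checks a validation step might run, assuming (for this example only) that the extraction phase returns a dict of fields along with per-field confidence scores.

```python
# A minimal sketch of rule-based validation checks for an extracted invoice.
def validate_invoice(extracted: dict, confidences: dict, min_conf: float = 0.9) -> list:
    issues = []
    # Arithmetic consistency: line items should sum to the stated total.
    line_total = sum(item["price"] * item["count"] for item in extracted.get("items", []))
    if abs(line_total - extracted.get("total_amount", 0.0)) > 0.01:
        issues.append("Line items do not sum to the invoice total")
    # Route low-confidence fields to a human reviewer.
    issues += [f"Low confidence for field '{f}'" for f, c in confidences.items() if c < min_conf]
    return issues
```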
The extracted information is often exported to formats like JavaScript Object Notation (JSON), Excel, or PDF reports.
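A minimal export step might look like the following, assuming pandas (with openpyxl) is available; the record fields are illustrative.

```python
# A minimal export sketch: write extracted records to JSON and Excel.
import json
import pandas as pd

records = [{"invoice_number": "1041", "invoice_date": "2023-06-01", "total_amount": 1250.00}]

with open("extracted.json", "w") as f:
    json.dump(records, f, indent=2)

pd.DataFrame(records).to_excel("extracted.xlsx", index=False)
```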
Plus, depending on your business needs, IDP can store the extracted data in a variety of destinations, such as databases, document management systems, or other business applications.
An IDP pipeline is always part of some larger business process that may involve steps like AP and AR approvals, legal review, or case management updates.
A good IDP solution integrates seamlessly with your existing workflows and business systems to streamline them.
In this article, you explored the use cases and internals of IDP. Are you interested in streamlining your business processes with IDP solutions? Or in extracting insights from your paper and electronic documents? Or perhaps you’re stuck with legacy documents and looking for a reliable path to digital transformation?
With our expertise in IDP, we have the answers you need. Contact us!