Solving Modern Document Classification Challenges With Deep Learning | SOTA Document Classification Use Cases

December 7, 2022

Document classification is a common task in business as every document has to undergo some business workflow, and sending it the wrong way can be expensive, especially in regulated industries.

In this article, we explore accurate classification algorithms using the latest innovations in deep learning, computer vision, natural language processing (NLP), and machine learning models.

What Is Document Classification?

Document classification is a machine learning task to identify the class or type of document. For example, given a large set of scanned documents, your business may need to sort them into invoices, receipts, contracts, pay slips, and expenditure reports. The types of documents are often domain- and task-specific.

A class is determined based on a document’s text content, visual features, or both. Other aspects like metadata, location, or file name may also determine its class.

Two common concepts related to document categorization are:

Multi-class classification: In this scheme, many classes exist but a document can belong to only one class. The classes are all mutually exclusive. Example: In invoice processing, an invoice can belong to only one vendor.
Multi-label classification: Many classes exist to describe different aspects that are not mutually exclusive. Example: An invoice will have a vendor and a due date. It belongs to only one vendor and has just one due date, but it can have both labels simultaneously. Its vendor and due date are not mutually exclusive aspects. Assigning several labels to every document by hand is extremely time-consuming.
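To make the difference concrete, here is a minimal sketch in plain Python. It assumes a model has already produced one raw score per class; the scores, class names, and the 0.5 threshold are made up for illustration. Multi-class prediction picks the single highest softmax probability, while multi-label prediction thresholds each label independently.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1 (multi-class)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(score):
    """Turn one raw score into an independent probability (multi-label)."""
    return 1.0 / (1.0 + math.exp(-score))

def predict_multi_class(scores, classes):
    """Exactly one class wins: pick the highest softmax probability."""
    probs = softmax(scores)
    best = max(range(len(classes)), key=lambda i: probs[i])
    return classes[best]

def predict_multi_label(scores, labels, threshold=0.5):
    """Any number of labels can apply: keep each one that clears the threshold."""
    return [label for label, s in zip(labels, scores) if sigmoid(s) >= threshold]

doc_types = ["invoice", "receipt", "contract"]   # mutually exclusive classes
tags = ["urgent", "handwritten", "multi-page"]   # hypothetical independent labels
raw_scores = [2.1, 0.3, -1.0]                    # made-up model outputs

single = predict_multi_class(raw_scores, doc_types)
multiple = predict_multi_label(raw_scores, tags)
```

With these scores, `single` is one class, while `multiple` can hold zero, one, or several labels.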

In the following sections we’ll dive into various automatic document classification techniques.

Introduction to Transformers

*Transformer architecture (Source: Vaswani et al.)*

Since we’ll refer to transformers often in this article, here’s a short overview of them.

Transformers are a family of deep neural networks designed for sequence-to-sequence tasks (e.g., language translation or question-answering). Using techniques like self-attention and multi-head attention blocks, transformers understand the long-range context in a text (or other data) and scale well during training.

A transformer consists of an encoder network or a decoder network or both. You can train and use them separately or as a single end-to-end network. An encoder generates rich representations, called embeddings, from input sequences. A decoder combines an encoder’s embeddings with output from previous steps to generate the next output sequences.
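As a rough illustration of self-attention, the sketch below implements scaled dot-product attention in plain Python. It is a simplification under stated assumptions: real transformers first project the input through learned query, key, and value matrices and run several heads in parallel, whereas here the toy token vectors are used directly as queries, keys, and values.

```python
import math

def softmax(row):
    """Normalize a list of scores into weights that sum to 1."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this position to every position, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Each output is a weighted average of all value vectors.
        mixed = [sum(w * v[j] for w, v in zip(weights, values))
                 for j in range(len(values[0]))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Three toy token vectors (made up); learned projections are omitted.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens, tokens, tokens)
```

Every position attends to every other position, which is what gives transformers their long-range context.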

Long Document Classification System With a GPT-3 Based Pipeline

GPT-3 gives us a pre-trained, task-agnostic understanding of language (usable for unsupervised document classification, sentiment analysis, summarization, and more) that we can guide toward text classification via few-shot learning and fine-tuning. It offers a few key benefits for long document classification over other architectures.

Prompt size: The most recent version of the Davinci instruct model, “text-davinci-002”, allows roughly 4,000 tokens per request, shared between the prompt and the completion. This means you need fewer chunks and can fit more of the document into each GPT-3 API call (or even the entire document in one prompt!).
Pre-trained model: The pre-trained GPT-3 model gives you a baseline understanding of language and next-token prediction. In most NLP use cases this means you can start classifying text with far less data than is required for models that have no pre-training or do not leverage transfer learning.
Prompt-based programming: The prompt allows you to use few-shot learning techniques to further leverage a low-data environment to start classifying text. Few-shot learning means providing GPT-3 with a few examples of how to complete the task before adding our input. For example, a prompt can contain a single labeled box score from a sporting event, followed by our input text, and ask GPT-3 to complete the class.
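A few-shot prompt of this kind can be assembled with simple string handling. The sketch below only builds the prompt text and makes no API call; the `Text:` / `Class:` layout, the class list, and the sample texts are illustrative assumptions, not a format GPT-3 requires.

```python
def build_few_shot_prompt(instruction, examples, document):
    """Assemble the instruction, labeled examples, and the unlabeled input."""
    parts = [instruction.strip(), ""]
    for text, label in examples:
        parts += [f"Text: {text}", f"Class: {label}", ""]
    # The prompt ends on an open "Class:" so the model completes the label.
    parts += [f"Text: {document}", "Class:"]
    return "\n".join(parts)

# One made-up labeled example (a box score) and one made-up input.
examples = [
    ("Home 112, Away 108. Top scorer: 31 pts, 9 reb.", "sports"),
]
prompt = build_few_shot_prompt(
    "Assign each text to one of: sports, finance, legal.",
    examples,
    "Net revenue rose 4% year over year.",
)
```

The shorter each example is, the more of them fit into the token budget alongside the document text.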

We’ll use this model as a part of a pipeline for classifying long documents (20+ pages) leveraging our proven long document GPT-3 pipeline. We can also use this few-shot environment to dramatically speed up the manual document classification process required to create training data.

Long Document Classification Architecture

We’ve designed an architecture that allows us to scale our long document classification to longer documents without needing to adjust input size for large variance between different input documents.

Text Extraction via OCR

The text extraction module is used to extract document text from PDFs, images, and Word documents. While the most common use case is extracting text from unstructured documents where the text simply flows in a natural left to right direction, we can fine-tune this module to extract text in a more structured format based on the type of document. Documents such as legal documents, invoices, and other documents with tabular formats have special positional structures that should be accounted for and the text should be structured.

*Example legal document cover sheet from our case study.*

Input Preprocessing Focused on Information Reduction

The goal of this module is to reduce our document text input size by removing information that is not relevant to classification. There are a few reasons we do this:

Depending on how much text we remove, we can remove the chunking algorithm step, which does add complexity to our classification. In most other documents this module at least reduces the number of chunks required.
It gives us more room in our prompt for few-shot learning examples. We have an input maximum of 4,000 tokens and the more we can reduce the amount of text in our input, the more few-shot examples can be included.

Few-shot prompts are more accurate in most tasks

This module will be a fine-tuned deep learning algorithm focused on understanding what information in our input text generally has the lowest correlation to the correct class. The more information we remove from the input text, the less future data variance we can cover. The tradeoff is that our downstream classification model can use more few-shot examples and understand our current evaluation dataset better.

Imagine we are reducing 1,600-word chunks to a single sentence theme. It will be harder for future inputs with new data variance to reach the same accuracy considering how much information we had to remove to reach a single sentence in most cases. The tradeoff here is we can include more examples in our classification model and can fit this exact dataset well. I recommend starting much wider in the amount of information we keep in this step when the dataset is small and data variance coverage is low.

Chunking Algorithm for Document Classification

Depending on the size of the document after preprocessing, you still might need to chunk the document. At a high level, the chunking algorithm breaks the large document text into parts that are classified separately; an output algorithm then combines the per-chunk results into a single classification. These chunking algorithms can be as simple as splitting the document into smaller pieces based on a set number of tokens, or as intensive as using multiple NLP models to decide where to split the document text based on keywords or context.
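The simplest variant, splitting on a fixed token count, might look like the sketch below. The overlap between neighboring chunks is an added assumption (it keeps sentences that straddle a boundary visible to both chunks), and the token list is a placeholder for the output of whatever tokenizer your model uses.

```python
def chunk_tokens(tokens, max_tokens, overlap=0):
    """Split a token list into windows of at most max_tokens.

    Consecutive windows share `overlap` tokens.
    """
    if max_tokens <= 0 or not 0 <= overlap < max_tokens:
        raise ValueError("need max_tokens > 0 and 0 <= overlap < max_tokens")
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks

# Placeholder tokens; a real pipeline would use the model's own tokenizer.
tokens = [f"t{i}" for i in range(10)]
chunks = chunk_tokens(tokens, max_tokens=4, overlap=1)
```

Each chunk is then classified on its own and the results are merged in post-processing.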

Build GPT-3 Prompt With Prompt Optimization

Prompt optimization allows us to dynamically build our GPT-3 prompts based on the given input text. The idea is to adjust the prompt language and prompt examples used in our prompt based on the input text. The prompt optimization algorithm chooses these based on a trained understanding of what specific prompt examples from a database lead to us having the highest probability of a successful output from GPT-3. This can be based on a keyword match, semantic match, or similar length.

The benefit of this algorithm is that we can provide GPT-3 with information that is relevant to our input text and that better shows GPT-3 how to reach a correct classification for similar text. This method has been shown to improve the results of GPT-3 models by up to 30% in some tasks! It makes sense that GPT-3 would be able to better classify a basketball box score if the few-shot examples provided are also box scores instead of a static prompt that has a bunch of different text examples.
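Here is a minimal sketch of the keyword-match variant. It is not the trained selection algorithm described above; it simply ranks a small, made-up bank of labeled examples by Jaccard word overlap with the input and keeps the top k. The function names and the example bank are assumptions for illustration.

```python
def keyword_overlap(a, b):
    """Jaccard similarity between the lowercase word sets of two texts."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)

def select_examples(input_text, example_bank, k=2):
    """Return the k labeled examples most similar to the input text."""
    ranked = sorted(
        example_bank,
        key=lambda ex: keyword_overlap(input_text, ex[0]),
        reverse=True,
    )
    return ranked[:k]

# Made-up (text, label) pairs standing in for a prompt example database.
example_bank = [
    ("final score points rebounds assists quarter", "box score"),
    ("invoice total due net 30 vendor", "invoice"),
    ("points per quarter team score", "box score"),
    ("agreement between the parties hereby", "contract"),
]
chosen = select_examples("team points and rebounds by quarter", example_bank, k=2)
```

Swapping `keyword_overlap` for an embedding-based similarity would turn this into the semantic-match variant.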

Fine-Tune GPT-3 for Automated Document Classification

Fine-tuning the davinci model is a great way to steer the task-agnostic GPT-3 model toward our document classification task. Unlike a few-shot prompt, a fine-tuning dataset is not limited to what fits in a single prompt, which means we can provide GPT-3 with a large number of training examples showing how to classify our long documents. Each training example is prompt-sized, which lets us fit much more of our preprocessed document text into each example than we have room for in a runtime prompt that also carries few-shot examples. It’s best practice to make your fine-tuning examples similar in length to what the model will see at runtime, especially if you’re going to forgo the runtime few-shot examples and just rely on the fine-tuned model.

Fine-tuning won’t provide as much value as prompt optimization until we reach a certain number of training examples per document class. With fine-tuning, no single training example is leveraged heavily by the model when classifying an input document whereas few-shot learning examples in the prompt are the focus of how the model understands completing our task correctly. This is why it's better to use few-shot learning while the dataset has low data variance coverage and doesn’t give deep context for each document class.

Post-Processing Our Output

We have two main tasks that help us clean up and create our results for classifying these long documents.

Confidence score generation: We leverage a custom built algorithm that produces confidence metrics around our chosen classification.
Chunk classification combination: We need to combine the classifications of our individual chunks into a single document classification.

Custom built confidence metrics allow us to put a model confidence value on the outputs of our GPT-3 model. This is an incredibly useful tool to gain insight into how the model classified the document, and allows you to perform real-time tasks such as regenerate poor results, use user feedback, and understand how well your document classification model is performing in production.

Once we’ve generated our classifications for each chunk of the long-form document, we need to combine them to create a single document classification. There are a number of different ways we can do this, and similar to prompt optimization, there isn’t one best-fit way for every use case. The most common way is to simply choose the class that was chosen for the most chunks, and return a confidence value based on the share of chunks that chose it. A more complex version weights each chunk’s vote by its confidence score and by its size relative to the entire document, so that larger chunks count for more.
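Both combination strategies fit in a few lines. The sketch below is one plausible reading of them, not the exact production algorithm: the weighted version multiplies each chunk's confidence by its token count, which is an assumed weighting, and all labels and numbers are made up.

```python
from collections import Counter, defaultdict

def majority_vote(chunk_labels):
    """Most common chunk label, with the share of chunks that agreed."""
    label, count = Counter(chunk_labels).most_common(1)[0]
    return label, count / len(chunk_labels)

def weighted_vote(chunk_results):
    """Weight each chunk's vote by confidence * chunk size.

    chunk_results: list of (label, confidence in [0, 1], size in tokens).
    """
    totals = defaultdict(float)
    for label, confidence, size in chunk_results:
        totals[label] += confidence * size
    best = max(totals, key=totals.get)
    grand_total = sum(totals.values())
    return best, (totals[best] / grand_total if grand_total else 0.0)

labels = ["contract", "invoice", "contract"]
results = [("contract", 0.6, 300), ("invoice", 0.9, 1600), ("contract", 0.7, 200)]

simple = majority_vote(labels)
weighted = weighted_vote(results)
```

Note that the two strategies can disagree: here most chunks say "contract", but the single large, high-confidence chunk outweighs them in the weighted vote.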

Document Classification With BERT

In this section, we explore bidirectional encoder representations from transformers (BERT) for document classification tasks.

What Is BERT?

BERT is best thought of as an approach for training transformer encoders for language tasks. It proposes special techniques at training time like:

Masking random tokens in the input
Using sequences from both the left and right of a word as its context
Encoding different information into the input embeddings
Relying on fine-tuning for downstream tasks
Inserting special tokens as placeholders to accommodate the needs of downstream tasks like classification
Using large datasets for unsupervised pre-training with objectives of creating a masked language model and learning next-sentence prediction

These techniques turn an encoder into a versatile pre-trained language representation model that you can quickly fine-tune for specific language tasks.
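To make the masking objective concrete, here is a small sketch of how training inputs can be corrupted. The 15% selection rate and the 80/10/10 split between `[MASK]`, a random token, and the unchanged token follow the BERT paper; the function name, toy vocabulary, and fixed seed are illustrative assumptions.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted tokens, labels).

    labels[i] is the original token where a prediction is required,
    and None everywhere else.
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for token in tokens:
        if rng.random() >= mask_prob:
            corrupted.append(token)   # position not selected: left as-is
            labels.append(None)
            continue
        labels.append(token)          # the model must recover this token
        roll = rng.random()
        if roll < 0.8:
            corrupted.append(MASK)                 # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted.append(rng.choice(vocab))    # 10%: replace with a random token
        else:
            corrupted.append(token)                # 10%: keep the original token
    return corrupted, labels

tokens = "the invoice is due in thirty days".split()
vocab = ["contract", "receipt", "paid", "vendor"]  # toy vocabulary
# A higher mask_prob than BERT's 0.15 so the short demo shows some masking.
corrupted, labels = mask_tokens(tokens, vocab, mask_prob=0.5)
```

Because the model sees the words on both sides of each masked position, it learns bidirectional context.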

The BERT research paper also provided two transformer encoders that were trained using this methodology, called BERT-large and BERT-base. Depending on the context, BERT may refer to either the approach or the pre-trained models.

BERT Architecture

BERT does not propose any new network architecture and just reuses the original transformer encoder architecture. Its capabilities stem from its training strategies. The two pre-trained BERT models use the same architecture but differ in their internals:

BERT-large: Has 340 million network parameters across 24 layers (i.e., encoder blocks) and 16 self-attention heads to produce 1024-sized embeddings.
BERT-base: Has 110 million parameters across 12 layers and 12 self-attention heads to produce 768-sized embeddings.

BERT for Document Classification

The DocBERT model fine-tunes both BERT models for document classification. It does this by attaching a simple, fully connected, softmax layer that reports the probabilities of classes for an input embedding.

*BERT classification via the [CLS] special token (Devlin et al.)*

The input to the softmax is the final hidden state corresponding to the [CLS] input token that marks the start of a sequence. This hidden state acts as a latent representation of the input sequence, making it useful for classification tasks.
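The classification head itself is small. The sketch below shows a fully connected layer followed by softmax applied to a [CLS] hidden state. Everything numeric here is made up: a real hidden state has 768 or 1024 dimensions and comes from the encoder, and the weights are learned during fine-tuning.

```python
import math

def softmax(scores):
    """Normalize raw scores into class probabilities."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify_from_cls(cls_state, weights, biases, class_names):
    """Fully connected layer + softmax over the [CLS] hidden state."""
    logits = [
        sum(w * h for w, h in zip(row, cls_state)) + b
        for row, b in zip(weights, biases)
    ]
    return dict(zip(class_names, softmax(logits)))

# Toy 4-dimensional hidden state and hand-picked weights (one row per class).
cls_state = [0.5, -1.0, 2.0, 0.0]
weights = [[1.0, 0.0, 1.0, 0.0],
           [0.0, 1.0, 0.0, 1.0]]
biases = [0.0, 0.0]
probs = classify_from_cls(cls_state, weights, biases, ["invoice", "contract"])
```

During fine-tuning, the gradient from this layer flows back through the whole encoder, which is what "end-to-end" refers to.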

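This classification model (transformer encoder + softmax) is fine-tuned end-to-end on four document classification datasets: Reuters-21578, the arXiv Academic Paper Dataset (AAPD), IMDB reviews, and Yelp 2014 reviews.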

*DocBERT PyTorch model in action - classification results on a business news feed*

Performance Problems of DocBERT

Although both BERT models achieved the best F1-scores on all four datasets, their enormous sizes made them expensive and slow for both fine-tuning and inference. The next best model achieved comparable F1-scores (usually within 3-4% of both BERTs) and ran inference 40x faster with fewer than 4 million parameters. The inefficiency of the BERT models is unacceptable for many use cases.

Knowledge Distillation and Teacher-Student Learning From BERT

*F1-scores of BERT models, baseline BiLSTM, and KD BiLSTM (Adhikari et al.)*

Can BERT’s capability be transferred to a lighter model to achieve both high accuracy and fast inference? The DocBERT paper explores knowledge distillation (KD) to transfer DocBERT-large’s capability to a lightweight bidirectional long short-term memory (BiLSTM) network.

The BiLSTM is first trained normally on a labeled dataset. DocBERT-large is also fine-tuned on the same dataset. Since the latter’s F1-scores are higher, it’s designated as the teacher and the BiLSTM as the student.

Next, a transfer dataset is created, and the class probabilities inferred by DocBERT on it are set as soft targets for the student. The student aims to fine-tune its trained weights on this transfer dataset so that it matches the teacher’s class probabilities with the least error.

Using this technique, the KD-BiLSTM improved on its own baseline scores and got close to DocBERT-base’s scores while being 25x smaller and 40x faster than it!
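The idea of matching the teacher's class probabilities can be expressed as a loss function. The sketch below uses KL divergence between the teacher's soft targets and the student's softmax output, which is a common choice for distillation; the exact objective used in the DocBERT paper may differ, and all numbers are made up.

```python
import math

def softmax(scores):
    """Normalize raw scores into class probabilities."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_probs):
    """KL divergence from the teacher's soft targets to the student's prediction."""
    student_probs = softmax(student_logits)
    return sum(
        t * math.log(t / s)
        for t, s in zip(teacher_probs, student_probs)
        if t > 0
    )

teacher_probs = [0.7, 0.2, 0.1]            # soft targets inferred by the teacher
far = distillation_loss([0.0, 0.0, 2.0], teacher_probs)   # student disagrees
near = distillation_loss([1.9, 0.7, 0.0], teacher_probs)  # student nearly agrees
exact = distillation_loss([math.log(p) for p in teacher_probs], teacher_probs)
```

The loss shrinks toward zero as the student's distribution approaches the teacher's, so minimizing it on the transfer dataset pulls the student toward the teacher's behavior.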

BERT Document Classification for Visual Document Understanding

*LayoutLM embeddings (Xu et al.)*

The discussion so far gives the impression that BERT is only for language tasks. But that’s not true, and BERT has been used for tasks that combine computer vision and NLP. One such application is visual document understanding that replicates human understanding of complex documents like invoices, contracts, or court records.

Document classification using both visual and linguistic information is often needed. For example, process automation may have to classify documents to send them to different business workflows.

LayoutLM is a visual document understanding model that combines BERT pre-training with the visual aspects of text blocks. Each token’s text embedding is combined with 2-D position embeddings derived from its bounding box on the page, and the result is fed to the BERT encoder. The classification layer attached to it learns to identify the document using both visual and textual aspects, just like people do.

Hierarchical Transformers for Long Document Classification

In this section, we explore techniques to overcome the limitations of pre-trained BERT models when processing long documents.

BERT’s Limitations With Long Documents

A drawback of standard transformer models is that the cost of self-attention grows quadratically with the sequence length. That’s why the pre-trained BERT models cap their input at 512 tokens and truncate everything else: doubling the input length roughly quadruples the attention compute and memory.

Another drawback is the positional encoding scheme that blends position information into the input embeddings. It’s trained only for sequences of up to 512 tokens. For longer documents, it has to be retrained. So, in practice, 512 tokens has become a hard limit of the BERT models.

Consequences of This Limitation

Long documents like legal agreements or business plans have multiple sections. Reviewers may need high-level labels like “warning” or “safe” to help them focus on the critical sections.

Many text classification tasks like sentiment analysis may apply different labels to different sections in the same document. But this isn’t possible using BERT.

Hierarchical Network Architecture as a Solution

Hierarchical transformers solve this with a simple algorithm:

Break up long documents into 512-sized chunks.
Get embeddings for each chunk from BERT.
Pass this sequence of chunk embeddings to a sequential layer like a long short-term memory (LSTM) or another transformer.
This second network combines the chunk embeddings into a document embedding.
Classify the document based on the document embedding.

For their experiments, the authors (Pappagari et al.) use the smaller BERT-base model for efficiency. With an LSTM as the second network, the model is called RoBERT, for recurrence over BERT; the LSTM is a small network that produces 100-dimensional document embeddings. With a transformer as the second network, it’s called ToBERT, for transformer over BERT; that transformer is similarly small, with just two transformer blocks.
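The chunk, embed, and combine steps can be sketched end to end as follows. This is a structural illustration only: `toy_chunk_embedding` is a hypothetical stand-in for BERT, and simple averaging stands in for the LSTM or transformer that RoBERT and ToBERT actually use to combine chunk embeddings. The sample text is made up.

```python
def split_into_chunks(tokens, size=512):
    """Step 1: break the token list into chunks of at most `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def toy_chunk_embedding(chunk, dim=8):
    """Step 2 stand-in for BERT: a tiny histogram of token lengths."""
    vec = [0.0] * dim
    for token in chunk:
        vec[len(token) % dim] += 1.0
    return [v / len(chunk) for v in vec]

def combine_embeddings(chunk_embeddings):
    """Steps 3-4 stand-in for the LSTM/transformer: average the chunk vectors."""
    n = len(chunk_embeddings)
    return [sum(column) / n for column in zip(*chunk_embeddings)]

def embed_document(tokens, size=512):
    """Chunk the document, embed each chunk, and merge into one document vector."""
    chunks = split_into_chunks(tokens, size)
    return combine_embeddings([toy_chunk_embedding(c) for c in chunks])

# A made-up 1,200-token document, longer than BERT's 512-token limit.
tokens = ("the service provider shall respond within four hours " * 150).split()
chunks = split_into_chunks(tokens)
doc_embedding = embed_document(tokens)
```

The final step, classifying the document, would attach a classification layer to `doc_embedding`. Unlike averaging, a recurrent or transformer combiner can also take the order of the chunks into account.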

Datasets

Both RoBERT and ToBERT are fine-tuned on three text classification datasets:

A call center’s phone conversation transcripts dataset
Twenty newsgroups dataset
Fisher phone conversation transcripts

Results

Arranging sequential networks in a hierarchy allows them to overcome their sequence length limits. The BERT-based model scored higher accuracies on some datasets over other support vector machine (SVM) and convolutional models.

Case Study: Multi-Label Classification of Service-Level Agreements

Service level and other legal agreements can run into dozens of pages filled with dense legalese. To help reviewers save time, you can summarize their contents.

For higher confidence that nothing critical is being missed out, you can also run topic identification and sentiment analysis on each section and show them as section labels using hierarchical transformers. They help reviewers focus on the most critical portions of such documents.

Multi-Page Automatic Document Classification

Real-world document processing can be quite messy. Using a mortgage industry case study, we’ll see the type of problems that crop up in paperwork-intensive industries and explore how document classification solves them.

The Loan Documents Arrangement Problem

A loan audit involves reviewing a set of documents called the loan document package. A typical package can have hundreds of scanned pages like land titles, identity documents, income documents, signed declarations, and more. The pages are supposed to be arranged in a particular order to make them easy to process.

But in reality, they are often haphazardly grouped. Identifying and grouping loan documents is a major bottleneck for banks and mortgage companies. So they rely on business process outsourcing to automate some of it and complete the rest manually. But because the documents can be complex, mistakes aren’t uncommon. This raises the costs and time required for processing.

Problems of Semi-Automated Approaches

Some semi-automated techniques are in use out there but are largely unsatisfactory. Using document templates for parsing the information can be faulty and laborious. Custom rule-based pipelines fail when they run into edge cases. Once a pipeline makes mistakes, people stop trusting the entire pipeline and revert to manual verification.

The industry needs automated solutions that can robustly and reliably process most documents with little human involvement.

Document Classification Using Machine Learning

A clever solution to the problem is identifying just the typical starting and ending pages of each document type. They often have distinctive layouts that are easily identified. Each document type will have two classes: “[type]-start” and “[type]-end.” If a page isn’t one of these start or end classes, then you just classify it as “other.”

1. Text Recognition and Preprocessing

Each scanned page is processed by an optical character recognition (OCR) engine to extract its text. Any unwanted text is discarded.

2. Doc2vec Vectorizer

The page text is then processed by a doc2vec machine learning algorithm to produce a dense feature vector that represents all the text and its patterns on that page.

3. Logistic Regression for Classification

Using the feature vector, a logistic regression machine learning model infers the class of that page along with confidence scores and other metrics.

Post-Processing

Logical rules are applied to the model’s output sequence to check if all pages of a document type are together. A pipeline like this reduces human effort considerably and the remaining edge cases can be easily managed by the outsourced staff.
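As an example of such logical rules, the sketch below walks the per-page labels and groups pages into documents using the start and end classes, flagging anything that breaks the expected order for manual review. The rule set is an assumption for illustration (for instance, it does not handle single-page documents), and the label names are made up.

```python
def group_pages(page_labels):
    """Group per-page labels such as 'title-start', 'other', 'title-end'.

    Returns (documents, problems). documents holds (type, first_page, last_page)
    tuples with 0-based page indexes; problems lists positions that break the
    expected start ... end order.
    """
    documents, problems = [], []
    open_type, open_start = None, None
    for page, label in enumerate(page_labels):
        if label.endswith("-start"):
            if open_type is not None:
                problems.append((page, f"'{open_type}' was never closed"))
            open_type, open_start = label[:-len("-start")], page
        elif label.endswith("-end"):
            doc_type = label[:-len("-end")]
            if doc_type == open_type:
                documents.append((doc_type, open_start, page))
            else:
                problems.append((page, f"unexpected end of '{doc_type}'"))
            open_type, open_start = None, None
        elif open_type is None:
            # An "other" page that is not between a start and an end page.
            problems.append((page, "page outside any document"))
    if open_type is not None:
        problems.append((len(page_labels), f"'{open_type}' was never closed"))
    return documents, problems

labels = ["title-start", "other", "title-end", "income-start", "income-end", "other"]
documents, problems = group_pages(labels)
```

Only the entries in `problems` need to go to a human reviewer; everything in `documents` is already grouped.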

High-Class Document Classification With Zero-Shot GPT-3

As we’ve seen above, the two main constraints in document classification are a high number of classes (especially relative to the number of examples we have per class) and a shortage of training samples for learning the difference between classes. Unsurprisingly, these two constraints compound each other and cause issues in production-level document classification systems.

We’re going to look at a document classification pipeline that allows us to classify documents into a large number of classes in a constrained zero shot data environment.

Why Do We Want to Do This?

Data labeling of documents can be an extremely expensive and time-consuming task. Industries that require specific expertise to review documents, such as legal or financial, are even more expensive given the cost of these reviewers. Labeling each document with a single class out of a large number of classes (high-class) is especially time-consuming, as many classes in this environment are very similar to each other and only small details in the documents tell them apart.

Zero-shot classification is a good fit for working around these constraints. True zero-shot classification requires no fine-tuning and no prompt examples to guide the model to a correct output. We will provide the classes available, the input document we want to classify, and the prompt instructions to perform the task. Since we are not providing prompt examples, the instructions we use will be very important, as they are the only information used to help the task-agnostic GPT-3 model understand how to correctly complete our task.

Zero-shot learning methods are also a great way to speed up the data labelling process for a future fine-tuned or few-shot model. These models will almost certainly be more accurate than the zero-shot solution in the long term. If we understand that our zero-shot model is 80% accurate, we can use this to put a label on all documents we want to use in the future and have a manual reviewer quickly check them before training. This assisted review is much more efficient than full manual review.

Let’s look at an example of an NLP pipeline that leverages a GPT-3 model to perform zero shot classification.

High-Class Document Classification Pipeline

This pipeline focuses on extracting information from documents in a format that provides us the most contextual information relative to the classes we have available. From there the prompt language and instructions are critical to be able to form relationships between document text and classes with little prior understanding of the relationship.

Text Extraction Through Document Understanding

The key step to this pipeline is how we extract our text in a format that provides GPT-3 with an idea of what information is important. It doesn’t make much sense to extract all header, body, abstract, and other common document fields as the same unstructured text, given that different text clearly has varying value when used differently in documents. Downstream, we can tag important information in ways that tell GPT-3 that this text was more valuable in the document. This idea is the same as what you can do with tags such as <h1> when using marketing copy as a variable in a GPT-3 prompt.

If we were classifying receipts, we would want to extract the text in a more structured format with an understanding of what text aligns with the various important fields (total cost vs. line item costs, etc.).

There are many pre-trained architectures that you can leverage to extract text from documents in a better format, helping zero-shot GPT-3 understand what information in the document is valuable. For example, models trained on key information extraction datasets such as Kleister-NDA can pull key entities out of legal documents, so you can start putting tags around key information without fine-tuning a model yourself.

*Entity recognition from various documents from Kleister*

If you’re willing to fine-tune this text extraction module for better accuracy, architectures like LayoutLMv2 are well suited to this document understanding task. LayoutLMv2 integrates a spatial-aware self-attention mechanism into the Transformer architecture, which allows the model to understand relative positional relationships among blocks of text.

Input Preprocessing and Information Reduction

The goal of this module is to prepare our extracted document text in a format that helps GPT-3 classify it, by reducing the amount of irrelevant information in the document text and applying tags based on the relationships learned in the previous step. We’ll use input preprocessing tools ranging from simple steps, like removing stop words and fixing grammar, to complex summarization algorithms that create large extractive summaries that keep large portions of the document.

Stop words can easily be removed using spaCy
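In spaCy, stop words are typically filtered by checking each token's `is_stop` attribute. The dependency-free sketch below shows the same idea with a tiny hand-written stop-word set; the set and the sample sentence are assumptions for illustration, and a real pipeline would use the library's much larger list.

```python
# A deliberately tiny stop-word set; spaCy ships a far larger one.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "is", "in", "for", "on", "this"}

def remove_stop_words(text, stop_words=STOP_WORDS):
    """Drop words whose lowercase form (ignoring edge punctuation) is a stop word."""
    kept = [
        word for word in text.split()
        if word.lower().strip(".,;:!?") not in stop_words
    ]
    return " ".join(kept)

cleaned = remove_stop_words("This agreement is made on the date of signing.")
```

Fewer tokens per document leaves more room in the prompt for instructions and class descriptions.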

Build Our Zero-Shot GPT-3 Prompt

A good zero shot GPT-3 prompt has a few key features that allow us to turn this task agnostic model into a document classification model.

Prompt Instructions

These are used to provide GPT-3 with clear instructions on how to complete the task. This allows us to steer the task agnostic GPT-3 model towards our classification task and provide key information that helps differentiate classes such as what variables to focus on, what info in the document should be deemed valuable, and any other rules we believe are important.

Here’s a simple prompt instruction that tells GPT-3 what task to accomplish and what the goal of the task is to us.

Prompt Language

Prompt language is used to provide GPT-3 context around what text is being used in the prompt. This can be variables, rules, or even tags that structure the information a bit more than you would otherwise have. We write Python code in this step that creates this prompt language and automatically adds it to our prompt when building the layout.

During the development process it’s best practice to split test a number of prompt language combinations with varying levels of granularity. Granularity means how specific you are when explaining what your input is. The risk with more granular prompt language is that it might not be completely correct across our entire data variance.

In the example above, I say “various text sources” which is less granular than saying something like “from blog posts” and even more granular “from blog post titles and abstracts.” But if our dataset contains text from blogs, research articles, and reports, the granular text would not line up well with the language differences across the sources. This means GPT-3 will try to apply the same rules across different types of text because we said in our prompt language that it is all the same.

The prompt language also includes our classes we want to use for classifying this document text. We can set up a prompt variable for this that simply lists the available classes in the prompt. I recommend providing a bit of context around what the class entails for each class, considering we’re using a zero-shot environment and it's difficult as is to correlate the input documents to classes. This can be as simple as a short description of what the class is alongside the keyword. We’ve seen that this extra information can go a long way in classification and we’ve used it for use cases like classifying products to the Google Product Taxonomy by leaving the upstream categories in when laying out what classes are available. It’s much easier to correctly categorize “Apparel & Accessories > Costumes & Accessories > Masks” than just “Masks.”

Document Input

The text that was extracted and preprocessed from the previous steps is added to our prompt. In some use cases this can be from multiple sources.
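Putting the three parts together (instructions, prompt language with class descriptions, and the document input), a zero-shot prompt builder might look like the sketch below. The layout, the wording of the instruction, the class names with their descriptions, and the tagged sample document are all hypothetical; nothing here is a format GPT-3 requires.

```python
def build_zero_shot_prompt(instructions, classes, document, source_description):
    """Assemble a zero-shot prompt: instructions, described classes, then the input."""
    lines = [instructions.strip(), "", "Available classes:"]
    for name, description in classes.items():
        # A short description per class gives the model context beyond the keyword.
        lines.append(f"- {name}: {description}")
    lines += ["", f"Document ({source_description}):", document.strip(), "", "Class:"]
    return "\n".join(lines)

# Hypothetical classes with one-line descriptions.
classes = {
    "Non-disclosure agreement": "parties agree to keep shared information confidential",
    "Service-level agreement": "defines uptime, response times, and remedies",
}
prompt = build_zero_shot_prompt(
    "Assign the document below to exactly one of the available classes.",
    classes,
    "<h1>Mutual Non-Disclosure Agreement</h1> The parties agree to protect confidential information.",
    "text from various sources",
)
```

Because the prompt contains no labeled examples, the only `Class:` marker is the open one at the end that the model is asked to complete.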

Process GPT-3

Now that we’ve created our zero-shot prompt for high-class document classification, we can run it through GPT-3. If we have a fine-tuned model, we can leverage that instead of the base model. It might sound like it doesn’t make much sense to talk about fine-tuned models when we’re constrained to zero-shot, considering we normally focus on zero-shot when we don’t have enough data to use few-shot or fine-tuning. But there are a number of ways we can still leverage fine-tuning to increase our accuracy.

Using an existing document classification dataset to create a fine-tuned model can actually increase our accuracy in a different document classification use case with different classes. If the input documents are relevant and can be fed to GPT-3 in the same format we can leverage them as a way to show GPT-3 how to accomplish a similar task. This is a great way to get your zero-shot prompt off the ground and give the task agnostic model more of an understanding of your specific task in a transfer learning type setup.

A recent method proposed by Google focuses on fine-tuning language models on various tasks phrased as instructions and then evaluating them on unseen tasks. The fine-tuning uses a number of different setups (zero-shot, few-shot, and chain-of-thought, or CoT), which allows for better generalization to these unseen tasks. This is a great way to give a model like GPT-3 a better understanding of how task-specific instructions map to outputs, and the dataset includes many classification use cases.

*“Scaling Instruction-Finetuned Language Models”*

Here’s a quick overview of the 1,836 different tasks used for fine-tuning.

*The 1,836 tasks used for instruction fine-tuning*

Post Processing GPT-3 Output

In the post processing stage, we can generate confidence metrics focused on understanding how confident GPT-3 is in the class that was chosen. We leverage the logprobs that are generated for each token and a custom algorithm that understands the correlation between logprobs and the model’s confidence in the output.
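The custom algorithm is not described here, but a simple baseline shows how logprobs relate to confidence. The sketch below assumes you have the token log probabilities for each candidate class completion (the values are made up), converts each sequence to a probability, and normalizes across the candidates.

```python
import math

def sequence_probability(token_logprobs):
    """Probability of a completion = exp(sum of its token log probabilities)."""
    return math.exp(sum(token_logprobs))

def class_confidences(candidate_logprobs):
    """Normalize the candidate class probabilities so they sum to 1."""
    raw = {
        label: sequence_probability(logprobs)
        for label, logprobs in candidate_logprobs.items()
    }
    total = sum(raw.values())
    return {label: p / total for label, p in raw.items()}

# Made-up per-token logprobs for three candidate class completions.
candidate_logprobs = {
    "invoice": [-0.10, -0.05],
    "receipt": [-2.30, -0.40],
    "contract": [-4.00, -1.20],
}
confidences = class_confidences(candidate_logprobs)
best = max(confidences, key=confidences.get)
```

A low top confidence, or a small gap between the top two classes, is a natural trigger for regenerating the result or routing the document to manual review.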

Want to implement document classification for your business?

You explored some advanced techniques for document classification in this article, techniques that were invented to solve the real-world problems most industries face. Width.ai builds custom document processing solutions for use cases (just like these!) that you can leverage internally or as a part of your product. Schedule a call today and let’s talk about if document processing software is right for you. Contact us!

References

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin (2017). “Attention Is All You Need”. arXiv:1706.03762 [cs.CL]. https://arxiv.org/abs/1706.03762
Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova (2019). “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”. arXiv:1810.04805 [cs.CL]. https://arxiv.org/abs/1810.04805
Ashutosh Adhikari, Achyudh Ram, Raphael Tang, Jimmy Lin (2019). “DocBERT: BERT for Document Classification”. arXiv:1904.08398 [cs.CL]. https://arxiv.org/abs/1904.08398
Raghavendra Pappagari, Piotr Żelasko, Jesús Villalba, Yishay Carmiel, Najim Dehak (2019). “Hierarchical Transformers for Long Document Classification”. arXiv:1910.10781 [cs.CL]. https://arxiv.org/abs/1910.10781
Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou (2019). “LayoutLM: Pre-training of Text and Layout for Document Image Understanding”. arXiv:1912.13318 [cs.CL]. https://arxiv.org/abs/1912.13318
