Improve Your Product Catalog Optimization with Ai in 7 Easy Steps
What are the key benefits of product catalog optimization and how you can remove 90% of the manual labor required with Ai
Despite ubiquitous digitization in every industry, text on paper and other physical media has only been increasing every year. The global text processing market was valued at $7.46 billion in 2020 and is projected to grow by 16.7% every year. Industries like legal tech, insurance, accounting, and even retail have critical business needs for robust text processing systems in their business processes.
In this article, we'll explain how we implement robust, production-ready text extraction systems for multiple industries. Our system can extract text from a variety of formats — including image files like PNG, JPEG (JPG), TIF, document formats like PDF or text files, and archival formats like IMG and ISO. We explain how you can extract text from images using Python without Tesseract (a popular framework for text recognition) and why you have to take that approach to build an excellent text extraction system.
Let's explore some business use cases that work perfectly with text extraction models.
Document information extraction is about extracting text from any type of document and interpreting the text — their meanings, semantics, and real-world contexts — just like people do.
For example, legal documents contain not just the main legal content like judgments and agreements but often include important filing information, dates, handwritten details on cover sheets, or handwritten corrections. Ideally, a law firm's document management system should recognize all that and make them searchable because they may record important facts on the ground, like critical hearing dates or crucial corrections.
Our robust document understanding system enables your law firm to extract information from legal documents. We save your law firm time, money, and the possibility of faulty data entries.
Our custom-built data extraction pipeline automatically extracts key data points from scanned documents, receipts, purchase orders, and more. This removes the manual labor required for tasks such as data entry and invoice processing. The pipeline works on a huge number of invoice formats, and we integrate it into your business workflow in a few easy steps.
We have automated warehouse workflows and improved storefront operations by deploying our text extraction system for our retail and e-commerce customers. They can capture and extract product labels, bar codes, and other information that's critical for both back-office and storefront management in the retail and e-commerce industry.
Accurate transcription of medical documents is necessary to deliver high-quality healthcare, avoid legal liabilities, and resolve insurance problems smoothly. Our system can accurately extract text information from medical records, patient forms, prescriptions, handwritten opinions, medical imagery, and more.
We use the same text extraction system for all of these use cases, though they seem so different. That's because our system generalizes well but, at the same time, is also flexible and customizable. For new customer data, we just need a few dozen documents, regardless of file format, to fine-tune our system and have it produce accurate results.
Let's start exploring how we have implemented our text extraction pipeline, starting with some basic concepts you should know for a foundational understanding.
Text detection refers to estimating which pixels in an image belong to text content.
Optical character recognition (OCR) refers to identifying characters using only the pixels in an image.
Text recognition refers to recognizing higher-level entities like characters, words, sentences, paragraphs, language, and other concepts of text organization using any kind of real-world knowledge such as language models and document layouts.
Information extraction refers to understanding the semantics and purpose of a piece of text.
Text extraction often refers to the overall question of how to extract text using all three subtasks — detection, recognition, and information extraction.
Scene text refers to text that's incidentally present in a photo, such as text on product labels, billboards, traffic signs, vehicles, and so on. In contrast, dense text refers to text in images where text is the primary content and the focus, such as text in books, invoices, and documents.
Tesseract is a popular open-source OCR engine. It consists of the tesseract-ocr engine and language-specific wrappers like pytesseract for Python. Older versions of Tesseract used a combination of image processing and statistical models, but the latest versions use deep learning algorithms.
Unfortunately, as the image above depicts, we find Tesseract too unreliable and inaccurate for any production use cases. It shows low recall (i.e., high rate of missed detections) and high character error rates (CER). It’s frequently unable to recognize clear printed characters that are easily recognized by people. It's practically incapable of recognizing handwritten text. Even simple words are misrecognized and broken up into meaningless fragments. Tesseract invariably requires heavy post-processing pipelines to improve its results.
While it's easy to use, its simplicity comes at the cost of accuracy. That's why we use more accurate alternatives in production. However, we still use Tesseract for prototyping as it provides a good baseline to judge the success metrics of other techniques.
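Character error rate is the metric we keep coming back to when comparing a baseline against other techniques, so here is a minimal sketch of how it can be computed. It assumes the common definition (Levenshtein edit distance divided by the length of the reference text). The function names are ours for illustration and do not come from any OCR library.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: fewest insertions, deletions, and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance / number of characters in the reference."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# One substituted character out of seven ("o" read as "0").
cer = character_error_rate("invoice", "inv0ice")
print(f"CER: {cer:.3f}")
```

Running the same function over the output of a baseline and of a candidate model on the same annotated images gives a like-for-like comparison.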
An alternative approach to Tesseract is to use traditional text processing techniques that typically involve the following stages:
These steps are tested and adjusted while visually examining a representative set of input images in hopes that they'll work well for any unseen image. However, in practice, they don't generalize very well, have lower recall than Tesseract, and often require manual adjustments to parameters.
People use them because they are simple, easy to implement, run fast on any hardware, and set a reference baseline for success metrics. A complex technique that can't improve on the metrics of these techniques is not worth implementing. They're also handy for quickly creating training datasets.
Unlike manually adjusted image processing code, statistical or probabilistic models learn how to isolate and recognize text from data samples and generalize better on unseen images. However, these models lack the expressive power of random forests or neural networks, so they are only slightly better at generalization than hand-tuned image processing. OpenCV, the computer vision framework, provides simple models like extremal region (ER) features and hidden Markov models (HMM) for text detection and text recognition.
ER text detection and recognition are based on the concept of extremal regions — contiguous regions of pixels that have a maximum or minimum intensity over all their surrounding boundaries. The idea is that text characters are likely to be extremal regions because people should be able to read them. The ER detector is robust against blur, contrast, illumination, orientation, color, and texture variations.
The ER algorithm consists of two classifier stages. The first stage is a character detector: the probability of each ER being a character is estimated using novel features that are quick to compute, and only ERs with locally maximal probability across all thresholds are passed on to the second stage. The second stage is a character classifier that improves the classification using more computationally expensive features. Finally, character ERs are grouped into words using an exhaustive search based on common features like text lines.
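To make the idea of an extremal region concrete, here is a toy sketch of the thresholding step only. It is not OpenCV's implementation and it skips both classifier stages. It assumes dark text on a light background and 4-connectivity, and the function name is hypothetical. At a given threshold, every connected group of pixels no brighter than the threshold is a candidate region, because all pixels just outside it are brighter.

```python
from collections import deque


def extremal_regions(image, threshold):
    """Return connected components (4-connectivity) of pixels <= threshold."""
    height, width = len(image), len(image[0])
    seen = [[False] * width for _ in range(height)]
    regions = []
    for row in range(height):
        for col in range(width):
            if seen[row][col] or image[row][col] > threshold:
                continue
            # Breadth-first flood fill from this unvisited dark pixel.
            queue = deque([(row, col)])
            seen[row][col] = True
            region = []
            while queue:
                r, c = queue.popleft()
                region.append((r, c))
                for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                    if (0 <= nr < height and 0 <= nc < width
                            and not seen[nr][nc]
                            and image[nr][nc] <= threshold):
                        seen[nr][nc] = True
                        queue.append((nr, nc))
            regions.append(sorted(region))
    return regions


# Two dark vertical strokes on a light background (grayscale values 0-255).
image = [
    [200, 200, 200, 200, 200],
    [200,  30, 200,  40, 200],
    [200,  30, 200,  40, 200],
    [200, 200, 200, 200, 200],
]
regions = extremal_regions(image, threshold=100)
print(len(regions), "candidate regions")
```

Sweeping the threshold over a range of values and keeping the regions that stay stable is what turns these components into character candidates for a classifier.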
The HMM algorithm models text recognition as a probabilistic model. The sequence of pixels forms the observations while the sequence of characters they belong to forms a hidden Markov chain that follows some linguistic rules. The idea is to find which character is the most probable given the current state (current character) and current observation (pixels).
HMM uses parameters like transition and emission probabilities that can be learned from data. Transition probabilities are the conditional probabilities of new states given the previous states and are learned from the distribution of characters in a language corpus or dictionary. For example, the probability of the character "e" following "t" is higher than "x" following "t".
Emission probabilities are the conditional probabilities of observations (pixels) given current states. They are learned by annotating character images with character labels. Because probability distributions lack expressive power, the images have to be simple ones containing just one character.
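The sketch below shows how these two sets of probabilities combine when decoding, using the standard Viterbi algorithm. It is a toy, not OpenCV's HMM decoder: the three-character alphabet, the observation symbols "tall" and "blob" (standing in for pixel features), and every probability value are made up for illustration.

```python
def viterbi(observations, states, start_prob, transition_prob, emission_prob):
    """Return the most probable hidden character sequence as a string."""
    # best[t][s] = (probability of the best path ending in s at step t, previous state)
    best = [{s: (start_prob[s] * emission_prob[s][observations[0]], None)
             for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            column[s] = max(
                (best[-1][p][0] * transition_prob[p][s] * emission_prob[s][obs], p)
                for p in states
            )
        best.append(column)
    # Backtrack from the most probable final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return "".join(reversed(path))


states = ["t", "e", "x"]
start_prob = {"t": 0.6, "e": 0.2, "x": 0.2}
# "e" is far more likely than "x" after "t", as in the example above.
transition_prob = {
    "t": {"t": 0.1, "e": 0.8, "x": 0.1},
    "e": {"t": 0.4, "e": 0.3, "x": 0.3},
    "x": {"t": 0.3, "e": 0.4, "x": 0.3},
}
# The "blob" observation is visually ambiguous between "e" and "x".
emission_prob = {
    "t": {"tall": 0.8, "blob": 0.2},
    "e": {"tall": 0.1, "blob": 0.9},
    "x": {"tall": 0.1, "blob": 0.9},
}
decoded = viterbi(["tall", "blob"], states, start_prob, transition_prob, emission_prob)
print(decoded)
```

The second glyph looks equally like "e" or "x", so the emission probabilities alone cannot decide. The transition probabilities break the tie in favor of "e".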
These rudimentary models are considered machine learning because they're learning their parameters from data. They generalize better and provide a less brittle baseline for success metrics compared to manual image processing logic. Since OpenCV is widely available for all types of hardware from smartphones to servers, they're easy to implement and run fast.
The disadvantage is that they lack expressive power and can't generalize as well as more complex models. They suffer from both low recall and low precision. Text is often not detected at all even when present, especially on product labels where text art is heavily used. Every bounding box in the output marks a place where the model believes it has found text, and it often misses characters or produces false positives.
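Precision and recall for a detector can be measured by matching predicted boxes against annotated ones. The sketch below is one common way to do it, under assumptions of ours: axis-aligned boxes given as (x_min, y_min, x_max, y_max), greedy one-to-one matching, and an intersection-over-union threshold of 0.5. The function names are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    overlap_w = max(0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    overlap_h = max(0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = overlap_w * overlap_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union else 0.0


def detection_precision_recall(predicted, ground_truth, iou_threshold=0.5):
    """Greedily match each prediction to its best unmatched ground-truth box."""
    matched = set()
    true_positives = 0
    for pred in predicted:
        best_index, best_iou = None, iou_threshold
        for index, truth in enumerate(ground_truth):
            if index in matched:
                continue
            overlap = iou(pred, truth)
            if overlap >= best_iou:
                best_index, best_iou = index, overlap
        if best_index is not None:
            matched.add(best_index)
            true_positives += 1
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    return precision, recall


ground_truth = [(0, 0, 10, 10), (20, 0, 30, 10), (40, 0, 50, 10)]
predicted = [(0, 0, 10, 10), (100, 100, 110, 110)]  # one hit, one false positive
precision, recall = detection_precision_recall(predicted, ground_truth)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here one of two predictions is correct (precision 0.5) and two of the three annotated text fragments are missed (recall about 0.33), which is the failure pattern described above.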
Unlike the rudimentary learning techniques of OpenCV, deep neural networks (DNN) have enormous expressive power, enabling them to generalize to a variety of industries, document layouts, text semantics, and other real-world varieties of text. OpenCV itself comes with a DNN module and pretrained simple DNN models for text detection and recognition.
We'll explore more modern techniques that unleash the full power of state-of-the-art deep learning for our robust text extraction system.
Our technology stack consists of:
For text detection, our pipeline uses a custom deep learning model based on the TextSnake deep learning algorithm because it generalizes perfectly for all our customer use cases — legal documents, product labels for warehouse automation, invoice processing, and more.
Most text detection algorithms assume that all characters in a text fragment lie in a straight line. Their detected bounding boxes are rectangles that are either axis-aligned or angled.
But the assumption of a straight line isn't true in many real-life situations. Text may be curved along free-form contours with irregular font sizes and orientations. Many product labels, shop signs, building signs, road signs, and handwritten text share these traits.
The novelty of the TextSnake model is that it generalizes to any text geometry and contour. It sees a text fragment as a sequence of circular bounding disks with each disk having a center, an orientation, and a radius. By estimating these geometrical properties, it outputs a contoured bounding area that snakes its way along and around a text fragment.
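As a rough illustration of the disk representation, the sketch below rebuilds a contour polygon from an ordered sequence of disks. It is a simplification and not the post-processing used by TextSnake or MMOCR. We assume each disk is a tuple (cx, cy, radius, theta), where theta is the angle of the text center line in radians, and the function name is hypothetical.

```python
import math


def contour_from_disks(disks):
    """Build a polygon around a text fragment from disks ordered along its center line.

    Each disk contributes two boundary points, one on each side of the center
    line, at a distance of one radius along the line's normal.
    """
    side_a, side_b = [], []
    for cx, cy, radius, theta in disks:
        # Unit normal to the center line direction (cos(theta), sin(theta)).
        nx, ny = -math.sin(theta), math.cos(theta)
        side_a.append((cx + radius * nx, cy + radius * ny))
        side_b.append((cx - radius * nx, cy - radius * ny))
    # Walk along one side, then back along the other to close the polygon.
    return side_a + side_b[::-1]


# Three disks along a straight, horizontal text fragment.
disks = [(0, 0, 2, 0.0), (4, 0, 2, 0.0), (8, 0, 2, 0.0)]
polygon = contour_from_disks(disks)
print(polygon)
```

For curved text, theta and the radius change from disk to disk, so the same loop yields a polygon that bends and widens with the text.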
Our TextSnake model uses a fully convolutional network (FCN) backbone with an overlay feature pyramid network and cross-connections between layers. Its fully convolutional layers and the feature pyramids enable it to accurately upsample feature maps so that information lost in the downsampling layers is rebuilt again to yield pixel-level labels for the original image. So this architecture is effectively doing instance segmentation of text fragments.
The output layer consists of classification nodes for the text region and text center, and regression nodes for the geometry parameters of the circular disks that constitute a text fragment — their centers, radii, and orientations.
MMOCR's pretrained TextSnake model is already accurate for most scene and document text detection use cases. In some rare use cases, like warehouse automation or handwritten documents with specific peculiarities, we fine-tune the pre-trained model's weights on our custom image datasets.
The custom training dataset is prepared by marking text fragments with a custom annotation tool. Since we're only interested in detection, we don't need character-level labeling. A typical fine-tuning dataset consists of between 20 and 100 annotated images, depending on how inaccurate the pre-trained baseline was.
For fine-tuning, the layers of the pre-trained FCN model are frozen except for the last few upsampling, classification, and regression layers. We use smoothed-L1 as the regression loss function and cross-entropy loss for the text region and text center classification.
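For reference, the two loss functions named above can be written for a single prediction as follows. This is a sketch of the textbook definitions, not our training code. The `beta` parameter (the point where smoothed-L1 switches from quadratic to linear) and the binary, per-pixel form of the cross-entropy are assumptions of ours.

```python
import math


def smooth_l1(prediction, target, beta=1.0):
    """Quadratic for small errors, linear for large ones (less sensitive to outliers)."""
    diff = abs(prediction - target)
    if diff < beta:
        return 0.5 * diff * diff / beta
    return diff - 0.5 * beta


def binary_cross_entropy(probability, label, eps=1e-12):
    """Cross-entropy for one pixel: label is 1 (text) or 0 (background)."""
    p = min(max(probability, eps), 1 - eps)  # clamp to avoid log(0)
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))


# Regression example: predicted disk radius vs. annotated radius.
print(smooth_l1(prediction=7.5, target=7.0))   # small error, quadratic branch
print(smooth_l1(prediction=10.0, target=7.0))  # large error, linear branch
# Classification example: a text pixel predicted with 90% confidence.
print(binary_cross_entropy(probability=0.9, label=1))
```

In training, these per-element values are averaged over the pixels of an image and summed into a single objective.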
The results of TextSnake's detection on a legal document are shown in the image above. You can see how it effortlessly detected and highlighted handwritten and irregularly oriented text.
When we read text, we don't rely merely on what each character looks like. Instead, we implicitly use our linguistic knowledge — spellings, grammar, and semantics — and real-world knowledge to identify words and their contexts. That's why we can read and understand even sentences with typos. The human behavior of combining multiple types of knowledge — called modalities — for a task can be simulated using deep neural networks.
In the context of text recognition, basic OCR relies only on visual features and ignores other modalities, which leaves it weak against spelling mistakes and bad handwriting. But robust text recognition goes beyond basic OCR, combining visual features with language models and real-world knowledge graphs.
Most deep learning models have attempted robust text recognition by combining convolutional neural networks (CNNs) for visual features and recurrent neural networks (RNNs) for sequential language modeling. But RNNs have a limited range for linguistic context and are slow to train. And while CNNs excel at local visual context, they can't model long-range context.
Transformer models are the next evolutionary step after CNNs and RNNs because they don't have the limitations of either. Our text recognition pipeline uses transformers to combine visual features and language modeling, enabling us to replicate the way people read text. We explain our transformer-based OCR model, or TrOCR, next.
TrOCR has two connected neural networks — the encoder block and the decoder block. The concept of self-attention is important here — it tells the neural network how much importance to give to each feature and its surrounding context relative to other features. Both the blocks consist of multiple self-attention layers such that each prediction influences the next prediction just like an RNN.
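The core of a self-attention layer is scaled dot-product attention, sketched below in plain Python. This is a deliberately stripped-down illustration. A real transformer layer adds learned query, key, and value projections, multiple heads, and normalization, none of which are shown, and here the same toy feature vectors are used as queries, keys, and values.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    dim = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
                  for k in keys]
        weights = softmax(scores)  # how much this position attends to each other one
        output = [sum(w * v[d] for w, v in zip(weights, values))
                  for d in range(len(values[0]))]
        outputs.append(output)
        all_weights.append(weights)
    return outputs, all_weights


# Three toy feature vectors, e.g. embeddings of three image patches.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
outputs, weights = self_attention(features, features, features)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of weights shows how strongly one patch attends to every patch. The first and third patches have identical features, so they attend to each other more than to the second.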
The encoder block transforms the input image and position embeddings into intermediate attention vectors that encode all the local and broader visual features and their contexts. The decoder block interprets these intermediate attention vectors to generate an output sequence — in this case, a text sequence. In this way, TrOCR's encoder and decoder work as a sequence-to-sequence translator that converts input image patches to a text sequence.
TrOCR's encoder and decoder blocks are initialized using respective pre-trained transformer models. The encoder is initialized with a pre-trained BEiT image transformer. The decoder is initialized with a pre-trained RoBERTa language model. The entire TrOCR model is then pre-trained on a large text recognition dataset. Finally, two fine-tuned models are produced through multiple training iterations: one fine-tuned on a small printed text dataset and the other on a handwritten text dataset.
Precision, recall, and F1 score are consistently 85% and above for all types of documents and layouts, including handwritten text, and character-level accuracy (the complement of CER) stays in the same range.
The final component in our pipeline is a text extraction model that infers semantic meanings of text fragments based on their visual, positional, and linguistic attributes. For example, given a sales invoice, it infers that some text refers to products, some to prices, some to an address, and so on.
A deep learning model generalizes to all types of documents and document layouts, avoiding the problems of using document templates, layout settings, or hardcoding. Our model is based on the ideas from the paper "Spatial Dual-Modality Graph Reasoning for Key Information Extraction" (SDMGR).
SDMGR is essentially a multi-class classification network that outputs an entity type label — like the product, person, address, and so on — for each text entity. Our model has three important characteristics.
The first characteristic is the reuse of the visual and linguistic features produced by the TrOCR text recognition component. The original SDMGR model proposes a convolutional segmentation network and a bidirectional long short-term memory recurrent network to model the visual and linguistic features.
But because our TrOCR model has already extracted visual and linguistic features using far more generalizable pre-trained models like BEiT and RoBERTa, we don't need to do feature extraction again. We also reuse the attention vectors produced by TrOCR.
The second characteristic of our SDMGR is a fusion module that combines the positional, visual, and linguistic features of text fragments. This is unlike named entity recognition (NER) models that use only linguistic features. The fusion module is a secondary, small multi-layer perceptron network to learn the optimum weights for combining these features and attention vectors.
The third, and most important, characteristic of our SDMGR model is a graph reasoning module based on an attention mechanism to model the spatial relationships between text entities. This multi-layer perceptron network treats the text entities as nodes. The relationships between text entities are modeled as edges. For each pair of entities, their horizontal and vertical distances and text shapes (widths and heights) are embedded in a lower-dimensional space and modeled as edge weights to optimize.
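To show what goes into an edge before it is embedded, the sketch below computes raw spatial features for every ordered pair of text boxes. The particular feature set (offsets plus width and height ratios relative to the first box) and the function names are assumptions of ours for illustration. The learned embedding and the attention-based reasoning are not shown.

```python
def edge_features(box_i, box_j):
    """Raw spatial relation of box_j relative to box_i.

    Boxes are (x_min, y_min, x_max, y_max). Returns horizontal and vertical
    offsets plus shape ratios normalized by the height of box_i.
    """
    width_i, height_i = box_i[2] - box_i[0], box_i[3] - box_i[1]
    width_j, height_j = box_j[2] - box_j[0], box_j[3] - box_j[1]
    if height_i <= 0:
        raise ValueError("box_i must have a positive height")
    dx = box_j[0] - box_i[0]
    dy = box_j[1] - box_i[1]
    return [dx, dy, width_i / height_i, height_j / height_i, width_j / height_i]


def build_edges(boxes):
    """One feature vector per ordered pair of distinct text entities."""
    return {
        (i, j): edge_features(boxes[i], boxes[j])
        for i in range(len(boxes))
        for j in range(len(boxes))
        if i != j
    }


# Two entities on the same invoice line, e.g. a "Total:" label and its amount.
boxes = [(0, 0, 100, 20), (120, 0, 160, 20)]
edges = build_edges(boxes)
print(edges[(0, 1)])
```

A label and the value to its right on the same line produce a small vertical offset and a positive horizontal one, which is the kind of regularity a graph reasoning module can learn to exploit.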
Using an entity annotation tool, we annotate the text entity positions and labels for 20 to 100 representative documents from our customer dataset.
We then take the pre-trained SDMGR model from MMOCR and fine-tune it on our TrOCR's extracted features, attention vectors, and custom document layouts. The idea is to tune our SDMGR's fusion and graph reasoning networks for the linguistic features and spatial positioning that are unique to our customer dataset. The classification loss function we minimize is the cross-entropy loss of predicted labels.
Our model achieves excellent F1 scores of 80-90% or higher for each entity type.
Once the text has been extracted from an image, post-processing modules often include NLP models that handle what we called information extraction above. Extracting text is often only the first step in a workflow pipeline, and processes such as named entity recognition, keyword extraction, and dependency matching are required to derive meaning from the raw text.
In this article, you have seen how complicated a problem text extraction is and why you need robust approaches for production deployments. With our expertise in deep learning and computer vision, we can provide your business with robust, production-ready systems for all your text processing needs. Get in touch.