
7 NLP Techniques for Extracting Information from Unstructured Text using Algorithms

Matt Payne
·
August 15, 2023

Transformer architecture that makes up many of the NLP models below

As the field of artificial intelligence advances, so does the capability of machine learning to interpret and extract information from human language. This is particularly relevant in natural language processing (NLP), where machines are tasked with making sense of unstructured text data. A number of natural language processing techniques can be used to extract information from text or other unstructured data, and in this blog post we will explore a few of them. These techniques can extract information such as entity names, locations, quantities, and more. With the help of natural language processing, computers can make sense of the vast amount of unstructured text data generated every day, and humans can reap the benefits of having this information readily available. Industries such as healthcare, finance, and ecommerce are already using natural language processing techniques to extract information and improve business processes. As machine learning technology continues to develop, we will only see more and more information extraction use cases covered.

Let's take a look at a few natural language processing techniques for extracting information from unstructured text:

1. Named Entity Recognition using spaCy


Named entity recognition (NER) is the task of identifying and classifying named entities in textual data. A named entity can be a person, organization, location, date, time, or even a quantity. spaCy is a popular natural language processing library that can be used for named entity recognition and a number of other NLP tasks. It comes with pretrained models that can identify a variety of named entities out of the box, and it offers the ability to train custom models on new data or new entities.

For the most part, NER models are trained on a per-token basis. That is, for each word in a sentence, the model predicts whether or not that word is part of a named entity we want to find. In spaCy, this is done using a neural network that takes a sequence of words as input and, for each word, predicts whether or not it belongs to a named entity. It uses the information from the surrounding words to make a more informed prediction.

We can use features such as part-of-speech tags, dependency parse trees, and entity type information to help the model make more accurate predictions as it learns the relationship between language and named entities.
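As a minimal sketch, here's what running spaCy's pretrained English pipeline over a sentence looks like (this assumes the `en_core_web_sm` model has been downloaded):

```python
import spacy

# Assumes the pretrained English pipeline is installed:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("Apple is reportedly buying a U.K. startup for $1 billion in September.")

# Each entity span carries its text and a predicted label
for ent in doc.ents:
    print(ent.text, ent.label_)
# e.g. "Apple" ORG, "U.K." GPE, "$1 billion" MONEY, "September" DATE
```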

2. Sentiment analysis with GPT-3

sentiment analysis example on gpt-3

GPT-3 is an autoregressive language model that can be used for a wide variety of tasks, including sentiment analysis. When given a sentence, GPT-3 analyzes the sentiment and generates a prediction, taking into account both the context of the sentence and the word choices. For example, a text document that contains strongly negative language such as "hate" or "I'm not a fan of them" is likely to be predicted as having a negative sentiment. GPT-3 can not only predict the sentiment of a sentence, it can also generate an explanation for its prediction. This makes it a powerful tool for sentiment analysis: the explanation helps you understand why a particular sentence was assigned a certain sentiment, and it can also help in troubleshooting data science errors.
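As a rough sketch of what this looks like with the OpenAI completions API (using the pre-v1 `openai` Python client; the model choice and prompt wording here are illustrative, not the only way to do this):

```python
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder, use your own key

prompt = """Classify the sentiment of the text as Positive, Negative, or Neutral,
and explain the classification in one sentence.

Text: "I'm not a fan of them, the battery died within a week."
Sentiment:"""

# text-davinci-003 is one of the GPT-3 completion models
response = openai.Completion.create(
    model="text-davinci-003",
    prompt=prompt,
    max_tokens=60,
    temperature=0,  # deterministic output suits classification
)
print(response["choices"][0]["text"].strip())
```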

Sentiment analysis is already used for social media monitoring, market research, customer support, product reviews, and many other places where people share their opinions. We've had a ton of success building applications like this one for Twitter.

3. Topic modeling with Latent Dirichlet Allocation

LDA architecture

Latent Dirichlet allocation (LDA) is a topic modeling technique used to discover hidden topics in text such as long documents or news articles. It does this by representing each document as a mixture of topics, and each topic as a mixture of words. LDA is an unsupervised learning algorithm, which means it does not require labeled training data. This makes it a powerful tool for quickly discovering hidden structure in data. LDA lets you find out what topics are being talked about in a document and how often those topics are mentioned, as well as which words are associated with each topic.
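Here's a small sketch using scikit-learn's LDA implementation (the toy documents are purely for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "The stock market fell as investors worried about interest rates.",
    "The team won the championship after a dramatic overtime goal.",
    "Bond yields rose while the central bank signaled more rate hikes.",
    "The striker scored twice and the fans celebrated the victory.",
]

# Bag-of-words counts are the standard input for LDA
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(counts)

# Print the top words associated with each discovered topic
words = vectorizer.get_feature_names_out()
for topic_idx, topic in enumerate(lda.components_):
    top = [words[i] for i in topic.argsort()[-5:][::-1]]
    print(f"Topic {topic_idx}: {', '.join(top)}")
```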

4. Part of Speech Tagging using spaCy

POS tagging with spaCy

Part-of-speech (POS) tagging is the process of assigning a grammatical category to each word in a sentence. The categories include verb, noun, adjective, adverb, and so on. Each word is tagged with the category that is most appropriate for that word in the context of the sentence. For example, the word "fly" would be tagged as a verb in the sentence "I like to fly." POS tagging is a helpful NLP technique because it provides context for words and helps you build a better understanding of key information in unstructured text. This context is useful in many tasks such as named entity recognition, sentiment analysis, and topic modeling, or it can stand alone as extracted information.

spaCy has a POS tagging model that can be used in an NLP pipeline for quick information extraction. The model is pretrained on a large corpus of text, and it uses that training data to learn how to POS tag words. spaCy's POS tagging also supports custom training data, which means you can train the model to tag words in a specific domain such as medical texts or legal documents. We've used the POS tagging model on its own to write entity extraction rules that enhance the ability of our NER or deep learning models.
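A minimal example with the same pretrained spaCy pipeline:

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the model is installed

doc = nlp("I like to fly to New York.")

# pos_ is the coarse universal tag, tag_ the fine-grained Penn Treebank tag
for token in doc:
    print(token.text, token.pos_, token.tag_)
# "fly" is tagged as a VERB here, matching the example above
```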

5. Text classification with Scikit-Learn

Text classification is the task of assigning a class label to a piece of text based on a learned relationship between information in the text and the class. This can be done for a variety of purposes such as spam detection, sentiment analysis, topic classification, and so on. There are a number of different algorithms that can be used for text classification, but in this section we'll focus on the popular scikit-learn library and two different methods of text classification. 

scikit-learn logo

Scikit-Learn is a machine learning library that can be used for a variety of tasks, including text classification. It offers a number of text classification algorithms, and it also allows for the creation of custom algorithms and pipelines. In this section we'll focus on two of the most common text classification algorithms: support vector machines (SVMs) and naive Bayes. Both of these algorithms learn their classification rules from a training set: a collection of documents, sentences, or paragraphs that have been labeled with the correct class. The classification algorithm then learns a relationship that maps examples to their classes.

Example of classification with Support Vector Machines in Scikit-Learn

Support vector machines (SVMs) are a type of supervised machine learning algorithm that can be used for tasks such as text classification. The algorithm works by finding the hyperplane that maximizes the margin between the classes; in other words, it finds the decision boundary that best separates the different document classes. Once the hyperplane has been found, the algorithm can be used to classify new pieces of text. The key benefit of support vector machines is that they can handle text classification tasks with a large number of classes and still produce strong accuracy metrics. This benefit comes at the cost of increased training time, as the algorithm has to find the margin-maximizing hyperplane for each class.
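Here's a sketch of a linear SVM text classifier in scikit-learn (the tiny spam/ham training set is illustrative only; real use needs far more labeled data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = [
    "Win a free iPhone now, click here!",
    "Limited offer, claim your prize today",
    "Are we still meeting for lunch tomorrow?",
    "Here are the notes from yesterday's call",
]
labels = ["spam", "spam", "ham", "ham"]

# TF-IDF features plus a linear SVM is a strong text classification baseline
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(texts, labels)

print(model.predict(["Claim your free prize"]))  # expected: ['spam']
```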

Example of Bayes classification on the iris dataset. Scikit-Learn

Naive Bayes is another popular text classification algorithm. It is a probabilistic algorithm that makes predictions based on learned probabilities of the data. Predictions use Bayes' theorem, which relates the probability of a class given the data to the probability of the data given the class. In other words, the probability that a piece of text belongs to a certain class is proportional to the probability of the text given the class times the prior probability of the class. Naive Bayes is popular because it is simple to implement and often very accurate for many common use cases. The key assumption the algorithm makes is that all of the features are independent of each other; this assumption is often not true, but the algorithm still tends to perform well.
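The naive Bayes version of the same pipeline is nearly identical, swapping in `MultinomialNB` over raw word counts:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = [
    "Win a free iPhone now, click here!",
    "Limited offer, claim your prize today",
    "Are we still meeting for lunch tomorrow?",
    "Here are the notes from yesterday's call",
]
labels = ["spam", "spam", "ham", "ham"]

# MultinomialNB estimates P(word | class) from word counts,
# then applies Bayes' theorem to score each class
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["free prize inside"]))  # expected: ['spam']
```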

6. Key Topic Extraction with GPT-3

Text document created containing our key topics discussed in an interview about LeadFuze

Key topic extraction is a popular use case focused on extracting the key topics discussed in a given input text. It differs slightly from topic modeling in that we can use GPT-3 to focus on specific points in the text instead of broad topics: valuable points made in an interview, key points made in an informational blog post, or important questions asked in an interview transcript. The output is usually much more refined key phrases or sentences rather than one-word keywords or a list of broad topics. The pipeline used for this often combines both NLP and natural language understanding.

GPT-3's few-shot learning allows for rapid prototyping without the need for large training datasets or model training. This is perfect for key topic extraction, since it can be difficult to find a large training dataset of key topics that fits your idea of "correct". With GPT-3, you can simply provide a few examples of what you consider to be key topics and the model will learn from those examples. This lets you refine the model for your specific use case and give GPT-3 a clear idea of what you decide is important in the unstructured text. We've used GPT-3 for a number of key topic extraction use cases and have successfully applied it to input documents such as financial interviews, legal documents, and other long-form documents.
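A sketch of the few-shot prompt pattern (the example texts and formatting here are our own illustration, again using the pre-v1 `openai` client):

```python
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

# One worked example teaches the model what we consider a "key topic"
prompt = """Extract the key topics discussed in the text as short phrases.

Text: "Q3 revenue grew 12%, driven by the new subscription tier, though
churn in the enterprise segment remains a concern."
Key topics:
- Q3 revenue grew 12%
- Growth driven by the new subscription tier
- Enterprise churn is a concern

Text: "We're migrating billing to a new provider next quarter and
deprecating the legacy invoicing API."
Key topics:
-"""

response = openai.Completion.create(
    model="text-davinci-003",
    prompt=prompt,
    max_tokens=100,
    temperature=0,
)
print("-" + response["choices"][0]["text"])
```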

7. Structured Data Tables From Long-Form Text With GPT-3

Ingredient extraction with GPT-3.

GPT-3 can be used to extract information from unstructured text and convert it into a table format. This is a helpful technique when you have a document that contains a lot of information, but it is not organized in a way that is easy to read or understand. For example, you may have unstructured text that contains a list of products and their prices, but the document is not organized into a table. With GPT-3, you can provide a few examples of what you want the table to look like, and the model will learn from those examples, so the task can be done with very little training data. This allows you to quickly and easily convert long-form text into a table that is much easier to read and understand.
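The same few-shot pattern works here; a rough sketch (the products and prices are invented for illustration):

```python
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

# The worked example shows the model the exact table format we want back
prompt = """Convert the text into a table with columns: Product | Price

Text: "We sell the Basic plan at $10/month and the Pro plan at $25/month."
Product | Price
Basic plan | $10/month
Pro plan | $25/month

Text: "The store offers a water bottle for $24.50 and a thermos for $31."
Product | Price
"""

response = openai.Completion.create(
    model="text-davinci-003",
    prompt=prompt,
    max_tokens=80,
    temperature=0,
)
print(response["choices"][0]["text"].strip())
```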

 

We've used this technique as a great way to create initial structured data from product descriptions before hooking the model up to a database to store the extracted information. This seems to lead to fewer errors and a cleaner pathway from unstructured text to a database full of results.

 


These are just a few of the ways that you can use natural language processing to extract information from unstructured data. As you can see, there are a variety of NLP techniques that can be used to extract different types of information depending on what you're looking for. As machine learning continues to develop, we will only see more and more uses for natural language processing, natural language understanding, and natural language generation in our everyday lives. Talk to Width.ai today to see how we can take your unstructured text data and turn it into valuable insights!