Width.ai

Keyword Extraction Using GPT-3 In The Education Industry | Case Study

Matt Payne
·
August 21, 2023

EAS is an education industry SaaS company that needed a way to automatically process raw, unstructured college course information into separate database fields for use throughout their tool.


The format used for college course names, school codes, course levels and more varies heavily across universities in the United States, and the raw data arrives with no standardized order or relationship that would allow rule-based parsing. On top of that, we wanted to extract the key topics discussed in each class from the raw descriptions, so that search engines and other tools can be built on those keywords.



These are all real examples of criminal justice course titles and course codes from different schools that must be extracted correctly. The variance between schools is clear.


A look at the raw data labeled “Text”. We extract exact keywords from the text, as well as “higher order” keywords that are contextually similar to the input. These higher order keywords come from a learned relationship that we optimize for in GPT-3.



The extremely high variance in university course information is the most difficult challenge in building our extraction tool. The other very difficult problem we solved is producing contextually similar keywords without simply raising the temperature parameter of the GPT-3 model. Because the exact keywords are extracted by the same model, we have to balance the temperature so that argmax-style extraction still works while the model keeps enough freedom to generate the higher order keywords.
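To make the temperature trade-off concrete, the sketch below shows how temperature reshapes a next-token probability distribution. It is an illustration only: the scores and the function name `softmax_with_temperature` are assumptions made for this example, not part of the production model, and with GPT-3 this step happens inside the API rather than in client code.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw token scores into sampling probabilities at a given temperature."""
    if temperature <= 0:
        # Temperature 0 is treated here as greedy (argmax) decoding.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


# Hypothetical scores for three candidate next tokens.
logits = [4.0, 2.0, 1.0]
greedy = softmax_with_temperature(logits, 0.0)  # all mass on the top token
low = softmax_with_temperature(logits, 0.3)     # still sharply peaked
high = softmax_with_temperature(logits, 1.0)    # flatter, more exploratory
```

A low temperature keeps nearly all of the probability on the top-scoring token, which suits copying exact keywords out of the text. A higher temperature shifts probability to lower-ranked tokens, which is what lets related but unseen keywords appear.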

Our NLP Solution

We built a custom NLP software pipeline that takes in these raw data inputs, passes them through our custom GPT-3 model, and writes the results into the correct database columns with each data point correctly classified.


1. Deployable software pipeline optimized for runtime using Google Cloud services and GPUs.

  • Sub 4 second runtime for the entire pipeline. 


2. Custom Built GPT-3 Model

  • Extracts keywords and higher order keywords from the text. Higher order keywords not only achieve high contextual accuracy but are so good they show up in course “matches” from other universities. 
  • Extracts named entities that are required for each column. These include course title, school name, course code, school department, course description, and anything else deemed “important” by the model's training.
  • Uses an optimized prompt built by our in-house generative model optimization algorithm. GPT-3 hyperparameters are tuned as well.
  • Named entities that require classification are classified as well, for example a course code classified to a major level.

3. Deployed databases to handle our tuned data.

  • Easy install and management via Docker.
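As a rough illustration of the entity extraction step in item 2, the sketch below assembles a few-shot prompt and maps the model's "field: value" output back to database columns. Everything in it is an assumption made for illustration: the column names, the prompt layout, the helper names `build_prompt` and `parse_completion`, and the sample course data are hypothetical, and the prompt produced by our optimization algorithm is not shown here. The call to the GPT-3 API is deliberately left out.

```python
# Hypothetical column names, loosely based on the entities listed above.
FIELDS = ["course_title", "school_name", "course_code", "school_department"]


def build_prompt(examples, raw_text):
    """Assemble a few-shot prompt: labeled examples followed by the new input."""
    parts = []
    for text, labels in examples:
        lines = [f"Text: {text}"]
        lines += [f"{field}: {labels[field]}" for field in FIELDS]
        parts.append("\n".join(lines))
    # End on the first field label so the model continues with its value.
    parts.append(f"Text: {raw_text}\n{FIELDS[0]}:")
    return "\n\n".join(parts)


def parse_completion(completion):
    """Map 'field: value' lines from the model back to database columns."""
    row = {field: None for field in FIELDS}
    # The prompt already ended with the first field label, so restore it.
    for line in (f"{FIELDS[0]}:" + completion).splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() in row:
            row[key.strip()] = value.strip()
    return row


# Made-up example data, not taken from the client's records.
examples = [(
    "CRJ 101 Intro to Criminal Justice, Example State University, Criminology",
    {
        "course_title": "Intro to Criminal Justice",
        "school_name": "Example State University",
        "course_code": "CRJ 101",
        "school_department": "Criminology",
    },
)]
prompt = build_prompt(examples, "CJ 210 Policing in America, Example College")

# A completion shaped the way the prompt asks for (written by hand here).
completion = (
    " Policing in America\nschool_name: Example College\n"
    "course_code: CJ 210\nschool_department: Criminal Justice"
)
row = parse_completion(completion)
```

Ending the prompt on the first field label nudges the model to answer in the same "field: value" layout as the examples, which keeps the parsing step simple. Any field the model omits stays `None` and can be flagged for review.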


Results

This NLP software pipeline significantly reduces the workload needed to process this incoming data. The keyword extraction allows us to use this data inside the SaaS product for search engines and data clustering. This extremely time-consuming and important process is now completely automated for the client.



Our GPT-3 model achieves over 91% accuracy when extracting the entities from the raw text. On the keyword extraction front, our model competes with the accuracy of the small spaCy model, which is quite impressive considering this GPT-3 tool relies on few-shot learning. In terms of pipeline runtime, our entire process runs in under 4 seconds.
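For readers who want to run a similar measurement, here is a minimal sketch of one way to score entity extraction: exact-match accuracy over every labeled field. This is an assumed definition used for illustration. The function name `entity_accuracy` and the sample rows are hypothetical, and the sketch does not describe how the figure above was computed.

```python
def entity_accuracy(predicted_rows, gold_rows):
    """Fraction of (row, field) pairs where the prediction matches the label."""
    correct = 0
    total = 0
    for predicted, gold in zip(predicted_rows, gold_rows):
        for field, expected in gold.items():
            total += 1
            guess = (predicted.get(field) or "").strip().lower()
            if guess == expected.strip().lower():
                correct += 1
    return correct / total if total else 0.0


# Made-up labels and predictions: 3 of the 4 fields match.
gold_rows = [
    {"course_code": "CRJ 101", "course_title": "Intro to Criminal Justice"},
    {"course_code": "CJ 210", "course_title": "Policing in America"},
]
predicted_rows = [
    {"course_code": "CRJ 101", "course_title": "Intro to Criminal Justice"},
    {"course_code": "CJ 210", "course_title": "Policing"},
]
score = entity_accuracy(predicted_rows, gold_rows)
```

Exact matching is strict: a partially correct title counts as a miss, so a looser comparison would report a higher number on the same predictions.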

