Most marketplaces and wholesalers receive product data from vendors in two ways: flat files or PDF spec sheets.
While this provides you with a list of the SKUs, it creates a few problems when you want to onboard these products into your catalog.
The flat files generally do not contain product data that is ready for listing. They are usually sparse, lack descriptions and images, and really just tell you which SKUs to sell. It is on the customer to enrich this data themselves based on spec sheet info or manufacturer page info.
If the customer is provided just a PDF spec sheet, they have to manually comb through it and pull out product information to create product records. If the vendor does not provide a folder of images, they also have to go out and track down images for these products.
Both are extremely time consuming processes, and they are what content teams spend most of their time doing. The problem gets even worse if you sell products through third party channels like Walmart and Amazon. Each channel has specific content requirements that must be met, which means you might have to write two or three product records for the same product!
At a high level it seems simple: automate the process to save the money and time spent enriching the data you already get from vendors. But there are a few key points you might not think about when considering automating the process. We’ve done it hundreds of times, so these feel like second nature to us.
1. Automating this process allows you to onboard more products per month, which means you can carry more products and generate more revenue. Automating product data enrichment isn’t only about cutting the cost of what you’re currently doing. Many orgs only onboard the number of products per month their content team can handle. We hear this all the time on our initial calls: “we onboard 2,000 SKUs per month, but we’d love to onboard more”. They understand that by onboarding more products they’ll be able to offer customers more options. By automating product data enrichment you can remove the limit set by the amount of manual resources available.
2. Higher on-site conversion rates and customer satisfaction from high quality data. Clear, comprehensive data is essential for both SEO and boosting on-site conversion rates. High quality product data helps guide customers in choosing your product over competitors’, helping them feel confident in making a purchase from you.
3. Consistency of listings is an issue that arises with manual creation of product data. Different people on the team have different opinions of what product data should look like, especially when accounting for variance across product categories. Most teams feel like they’re aligned on how product data should look, but when you dig deeper, the product data being pushed has a ton of differences. This is also driven by differences in vendor provided product data: some vendors provide attributes, some provide descriptions, some provide images, and when different people handle these differences, they can leave things out or include details that are missing from other vendors’ data. Here’s an example:
Here’s a vendor file with quality data. We’ve got a name, a short description and a few attributes. A manual content team member would have no issue turning this into a quality listing with keywords correctly placed and all relevant information included.
Here’s an example of what products usually look like in a vendor flat file: a shorter description and missing attributes. The “weight” attribute is even missing its unit.
A manual content team member would not know the color of the product and could leave it out of the listing. The more important omission is the “made with PFAS chemicals” note that is required for this product. Since it’s not included in the vendor file, the person enriching the data might not know they need to include it for this product. One content member includes it for a product that has better data, another doesn’t.
Automating product data enrichment actually leads to more consistent data creation as these systems are designed to follow specific formats and ensure that certain data is included. You set the business rules for creating content listings, and all listings will follow the same standard. The question is, what do we do in the second scenario if data is missing? We’ll look at that with our workflow in a bit.
As mentioned before, this issue grows even more when looking at third party channel selling.
Let’s say we have the same high quality data from a vendor that we can use with minimal changes for our own product listings.
We want to sell this product on Amazon, Walmart, and eBay. Each of these marketplaces has different listing requirements that we need to meet in order to list, like:
Our SaaS product Pumice.ai was built to automate the process of product data enrichment and SKU onboarding. We’ve built API endpoints for all parts of the product data enrichment process that allow you to go from a sparse flat file or PDF spec sheet to enriched content that is ready to be listed in your own catalog or on third party channels. We have built-in rules for converting your content into accepted product listings for each marketplace.
All of our endpoints use customized AI models to enrich your content with completely customizable logic. Let’s walk through going from junk vendor data to clean, enriched product data.
Your account rep will give you access to the models associated with your plan. Our pricing and plans are based on the number of endpoints you need access to for the enrichment you’re trying to do. There’s no point in paying for the bullet point generator if you don’t need bullet points! Once you’ve got your plan in place and are onboarded, you receive the API keys for your endpoints. We also assist with implementation and integration to make sure you have everything you need to start the enrichment process!
A quick note on this process: you can either call each API endpoint individually, or our implementation team can set up a custom workflow API that calls all endpoints with a single request. This is common for people who know exactly what their workflow needs to look like and don’t want to change it once it’s set in place. It also makes data management a bit easier.
There are a few ways to pass data to Pumice.ai to kick off enrichment jobs. The first is a JSON body, with each API call containing the data you want to use. Our endpoints support both image and text inputs and use them to perform the AI operations.
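To make the JSON-body option concrete, here is a minimal sketch of building such a request body. The field names (`product`, `custom_rules`) and values are hypothetical illustrations, not the documented Pumice.ai schema:

```python
import json

# Build a request body for an enrichment call.
# NOTE: the field names below are invented for illustration --
# consult the endpoint documentation for the real schema.
payload = {
    "product": {
        "title": "Wooden Pencil 12pk",
        "description": "Pack of 12 pencils.",
        "image_urls": ["https://example.com/pencil.jpg"],
    },
    "custom_rules": [
        "Keep titles under 80 characters",
    ],
}

# Serialize to JSON for the POST body.
body = json.dumps(payload)
print(body)
```

From here the body would be sent as the request payload of the endpoint call, with your API key in the headers.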
The next is a CSV upload, where you define the columns required for that specific endpoint and receive a CSV back with the enriched data. We have UIs for each endpoint, and you get a UI for your custom workflow if you have one.
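As a sketch of preparing a CSV upload, the snippet below builds one in memory. The column names (`sku`, `title`, `description`) are assumptions for illustration; the real required columns depend on the specific endpoint:

```python
import csv
import io

# Two sparse vendor rows to upload for enrichment.
# Column names here are hypothetical -- match them to the
# columns the endpoint actually requires.
rows = [
    {"sku": "PNC-12", "title": "Wooden Pencil 12pk", "description": "Pack of 12 pencils."},
    {"sku": "PNC-24", "title": "Wooden Pencil 24pk", "description": "Pack of 24 pencils."},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["sku", "title", "description"])
writer.writeheader()
writer.writerows(rows)

csv_text = buf.getvalue()
print(csv_text)
```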
The final option is uploading a spec sheet to our spec sheet processor. This endpoint breaks the PDF spec sheet down into individual products in a CSV output format. You can use spec sheets either to enrich products from a flat file or to create products.
Let’s look at a common workflow.
The first step in most customer workflows is gathering structured data to enrich product records. It’s impossible to enrich vendor provided product records correctly without access to external data for each record. Some people try using ChatGPT to enrich product data by just asking the model to write a new title or description for the product, but because ChatGPT does not have access to the actual manufacturer product data, the model will hallucinate the newly created product data. Remember, LLMs only have access to the data they were trained on, and ChatGPT might only have been trained on data from 2024 and before. This means the product data you get from ChatGPT could be entirely unrelated to the actual product you’re trying to enrich. Let’s look at an example to understand.
Here we’ve loaded up a product into ChatGPT from a vendor flat file for a mirror. We’ve got an image, a title, and a short description. We’re going to have ChatGPT create the new data.
At first glance the results don’t look too bad. All the data was generated and it seems related.
Here’s the problem: it’s completely hallucinated. The category does not follow the vendor’s taxonomy. The finish of the wood is not walnut, it’s a honey brown finish. The height and width are completely wrong. There is no mention of the wood species or the veneer material. All of this data lives on the vendor’s website and is live right now.
It’s not really the language model’s fault. As mentioned before, it does not have access to live data, and it does not know the vendor’s taxonomy or required specifications. Many customers we work with were previously trusting ChatGPT results on their product data that were completely hallucinated. Back to the main topic: our data augmentation.
Pumice has two different ways to augment the data provided in vendor flat files for enrichment. The first is our AI smart scraper, which goes out to manufacturer websites and finds the product page associated with each row in the vendor flat file. It does this with the product’s MPN, UPC, or name, finds the correct URL for the product, and extracts the required information from the web page. The extraction is completely customizable with a text prompt describing what information you want to grab from the page for enrichment. This allows us to get additional information about the product and run our AI processes on real product data.
But what if the product data the tool grabs is not the correct data, and is actually for another product? Our smart scraper has a built-in data validation agent that verifies the newly extracted product data against the original record.
Let’s say we have the product above in our vendor flat file with a simple title, an MPN, a product description, a single image, and the manufacturer domain. We take the title, MPN, and domain and first use our universal search API to find product pages on the manufacturer domain that match either the MPN or the title. We can see we get one URL back.
We then take that product page URL and run it through our smart scraper API with the text prompt “extract all product data including title, description, attributes, and image urls”. The resulting JSON returns all the information on the product page that fits what we asked for, with image_urls returned as a list. We can see the returned JSON below.
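As a rough sketch (with invented values and field names, since the real response schema may differ), a scraper response of that shape could look like this:

```python
# Hypothetical shape of a smart-scraper JSON response for the
# mirror example -- every value here is invented for illustration.
scraped = {
    "title": "Solid Wood Wall Mirror",
    "description": "Rectangular wall mirror with a honey brown finish.",
    "attributes": {"finish": "Honey Brown", "material": "Oak veneer"},
    "image_urls": [
        "https://example.com/mirror-front.jpg",
        "https://example.com/mirror-side.jpg",
    ],
}

# image_urls comes back as a list, as noted above.
assert isinstance(scraped["image_urls"], list)
print(scraped["attributes"]["finish"])
```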
Now that we have some data to work with, we can use it to enrich the existing product record in our flat file. You can extract any data that exists on the product page.
If you have spec sheets from vendors alongside your flat files, you can use those as an additional data source: most of our customers that get PDF spec sheets run them through the spec sheet processor described earlier to pull out structured product data.
One of the key features Pumice offers across all of our AI enrichment endpoints is what we call custom rules. Custom rules allow you to customize the outputs generated by the different endpoints to fit specific schemas, business logic, or information that you need. These are natural language rules provided as a list to the endpoint, and the AI models make sure the outputs follow these instructions. Examples of our most popular ones look like:
These custom rules are critical to ensuring your product data is standardized across the board in your catalog. If you’re enriching data to sell through third party channels, we have prebuilt rules already in place for each channel.
Here’s an example of using custom rules with our product title generator to generate titles in a specific format and to make sure specific keywords appear in the title. How can you ensure data quality and make sure the generated data actually follows the rules? We have a checker agent that takes in your rules and runs against newly created product data to verify the rules are followed. If they are not, it regenerates the data with the missing rules applied.
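That check-and-regenerate loop can be illustrated with a toy sketch. The real checker is an AI model; the simple string checks and `regenerate` stand-in below only mimic the control flow, and all names, rules, and values are hypothetical:

```python
# Toy illustration of the rule-check-and-regenerate control flow.
# The real checker agent is an AI model; these string checks are
# a local stand-in, not the actual Pumice.ai behavior.
def follows_rules(title: str, keywords: list[str], max_len: int = 80) -> bool:
    """Check a hypothetical rule set: length cap + required keywords."""
    return len(title) <= max_len and all(k.lower() in title.lower() for k in keywords)

def regenerate(title: str, keywords: list[str]) -> str:
    """Stand-in for a regeneration call that adds missing keywords."""
    missing = [k for k in keywords if k.lower() not in title.lower()]
    if not missing:
        return title
    return title + " " + " ".join(missing)

title = "12pk Pre-Sharpened Pencils"
keywords = ["wooden", "pencils"]

if not follows_rules(title, keywords):
    title = regenerate(title, keywords)

print(title)
```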
All of our endpoints allow you to include examples of existing high quality data for the models to follow when creating new enriched data. These are examples of good titles, descriptions, bullet points, etc. When these examples are provided, they deepen the models’ understanding of how to complete the task correctly for your use case, rather than relying on the model’s generic idea of how to write a title or description.
We recommend providing examples alongside your custom rules, especially when your rules are a bit vague, to strengthen the relationship between the input product data and the guidance provided through rules and examples. Rules like “use relevant keywords” or “use this tone” should be supplemented with examples so the models know what the relevant keywords actually are!
Now we’ve got our augmented data, we’ve got some rules and examples lined up, and we know how we want to call the APIs. Let’s look at our Pumice.ai endpoints for product data enrichment.
We have a number of endpoints built for product data enrichment that will allow us to go from the junk data we saw before plus the extracted manufacturer data to clean product data.
Our product categorization endpoint takes in text data and images and assigns the best fit category with a confidence score based on the provided taxonomy. You can use either a custom taxonomy or a popular prebuilt taxonomy like Google Product, Amazon, or Shopify. We upload your taxonomy to the Pumice backend and provide you with a custom API key for that specific model. The taxonomy must follow the Google Product Taxonomy format, and examples of products matched to categories can be provided.
We do not limit the size of the taxonomy or the number of levels used, and we’ve been able to achieve 97% accuracy at the 5-level-deep mark in use cases like this. We also offer fine-tuning for this endpoint, which allows us to train the model on the exact relationship between your product data and the taxonomy and greatly improves accuracy in difficult use cases or very large taxonomies. We also support multilingual datasets and have seen strong results for these inputs (case study).
Our standard API handles 10 products per second, while our bulk endpoint can take millions of products per day.
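To illustrate the shape of a categorization call, here is a hypothetical request and response. The field names and confidence value are invented for illustration; the category path follows the `>`-delimited format used by the Google Product Taxonomy:

```python
# Hypothetical categorization request -- field names are
# illustrative, not the documented API schema.
request = {
    "title": "Solid Wood Wall Mirror",
    "image_urls": ["https://example.com/mirror-front.jpg"],
    "taxonomy": "google_product",  # or a custom uploaded taxonomy
}

# Invented example of a best-fit category plus confidence score.
response = {
    "category": "Home & Garden > Decor > Mirrors",
    "confidence": 0.97,
}

# The " > "-delimited path can be split into taxonomy levels.
levels = response["category"].split(" > ")
print(levels)
```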
The title generation endpoint takes in product text and images alongside any custom rules and examples to generate a new title.
Description generation is practically the same as the above. The most common rules we see with description generation are:
- Controlling the length or number of paragraphs in the description.
- Ensuring specific keywords are not mentioned in the generated description. When using manufacturer product data, it’s important to make sure information that is specific to the manufacturer does not leak into your enriched data. Manufacturer product pages often contain shipping and warranty information that applies to them, but isn’t the same as yours. We can use a rule here to remove any warranty or shipping information.
- Adding HTML tags
- Adding SEO keywords
Accurate product descriptions with the detailed information customers are looking for are critical for SEO and on-site conversion rates. You don’t want descriptions full of vague information that doesn’t help move the customer toward buying this product over what is offered by other vendors.
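The “no shipping or warranty leakage” rule above can be illustrated locally with a naive sketch. The real endpoint enforces this with an AI model; this regex pass over invented text only shows the intent:

```python
import re

# Naive local illustration of stripping manufacturer-specific
# shipping/warranty text before it leaks into a description.
# The real rule is enforced by an AI model, not a regex.
scraped_text = (
    "Solid wood wall mirror with a honey brown finish. "
    "Ships within 2 business days. "
    "Backed by our 5-year manufacturer warranty."
)

# Split into sentences, then drop any sentence mentioning
# shipping or warranty terms.
sentences = re.split(r"(?<=[.!?])\s+", scraped_text)
clean = " ".join(
    s for s in sentences
    if not re.search(r"\b(ship|warranty)\w*", s, re.I)
)
print(clean)
```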
Our attribute extraction endpoint looks at the provided product data and directly extracts attributes from it. If the value for an attribute does not exist in the data, this endpoint will not extract it.
This endpoint also allows you to provide a list of attribute keys for the model to use, if you have specific attribute key-value pairs you want to extract. If you do not provide a list of attribute keys, the model will decide them based on the data provided.
Here’s the product data we extracted before.
We provided a list of attribute keys along with the custom rule to only use those keys. We can see it returned attribute values for the provided keys where the data exists in the product text: terms like “Premium wood”, “Pre-sharpened Pencils”, and “Yellow” appear directly in the text. But we also see a few N/As in the results, which means there was no information directly in the provided product data for those keys. The model couldn’t find anything that tells us the lead color or grip type. So what do we do?
We have a second attribute endpoint that solves this. Our attribute prediction endpoint works to “predict” what attribute values make the most sense for the provided attribute keys. It looks at images and text just like before, and can be given an attribute key list. We take any attribute keys whose values could not be directly extracted from the product data and pass them to this endpoint to predict the values.
We provide the list of attribute keys to use along with attribute value options. With either endpoint you can provide a list of possible values for the model to choose from for each key. This is very popular if you have a set list of attribute values for fields like color, size, etc.
We can see we get new values for the keys the extraction endpoint couldn’t fill. The model looks at the image, recognizes it’s a pencil, identifies its shape, and returns it. The model also recognizes that the lead color is graphite.
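The two-pass extract-then-predict flow can be sketched like this. The attribute values and the prediction response are invented for illustration:

```python
# Sketch of the extract-then-predict flow: collect the keys the
# extraction endpoint returned as "N/A", send only those to the
# prediction endpoint, then merge. All values are illustrative.
extracted = {
    "Material": "Premium wood",
    "Point Type": "Pre-sharpened",
    "Color": "Yellow",
    "Lead Color": "N/A",
    "Grip Type": "N/A",
}

# Keys the extraction endpoint could not fill from the text.
missing_keys = [k for k, v in extracted.items() if v == "N/A"]

# ... pass missing_keys to the attribute prediction endpoint ...
# Hypothetical prediction response:
predicted = {"Lead Color": "Graphite", "Grip Type": "Hexagonal barrel"}

# Merge predicted values over the N/A placeholders.
merged = {**extracted, **{k: predicted[k] for k in missing_keys if k in predicted}}
print(merged)
```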
We also have an endpoint where you supply the same product data, text and images, and the model generates 5 bullet points. This endpoint supports examples and custom rules as well, which let you customize the information or format of the bullet points. Bullet points are most commonly appended to the end of descriptions for SEO purposes, or used for third party channels that require feature bullets, like Amazon. You can even use this endpoint to generate question and answer pairs potential customers might ask about the product, to rank for those question-based Google keywords.
Now we’ve generated all of our data and can return it in JSON or CSV format. We went from sparse, empty data to fully enriched data with multiple fields.
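Putting it together, a fully enriched record might look like the following. Every value here is an invented illustration of the fields produced by the endpoints above:

```python
import json

# Illustrative final record after categorization, title and
# description generation, attribute work, and bullet points.
# All values and the category path are invented examples.
enriched = {
    "sku": "PNC-12",
    "category": "Office Supplies > Writing Instruments > Pencils",
    "title": "Wooden #2 Pencils, Pre-Sharpened, Yellow, 12 Pack",
    "description": "<p>Premium wood #2 pencils, pre-sharpened and ready to use.</p>",
    "attributes": {"Color": "Yellow", "Lead Color": "Graphite"},
    "bullet_points": [
        "Pre-sharpened and ready to use",
        "Classic yellow hexagonal barrel",
    ],
}

# Return as JSON; the same record could be written out as a CSV row.
print(json.dumps(enriched, indent=2))
```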
Outside of our core product enrichment endpoints we have a number of additional endpoints customers use to further prep their data for use in a catalog or third party channel.
As mentioned above, this agent takes in your rule set and any other valuable data you want to provide, like examples, and reviews the generated outputs to confirm the product data aligns. We use a custom reasoning model for this, which gives us great feedback when outputs are incorrect based on the rules. The endpoint will rewrite the input product data if it does not fit the rules. This is very useful for ensuring data accuracy in more complex cases, or when you want to double or triple check the data. Customers that sell products commonly flagged on Amazon, such as guns or food, use this agent to make sure everything is in order before product data is pushed.
We also have a number of endpoints built to automate business operations around images. High quality images improve on-site conversions and buyer trust.
Effective data enrichment is useless unless we can onboard the products. Once your enriched data is ready you can merge it into your catalog. If you’re using the UI with CSVs, you can upload your records into your PIM or catalog management software. We recommend using the same naming convention when uploading to Pumice as the one required by your product data software. If you’re using the direct endpoint APIs, you can hook them straight into whatever software you use for your product catalog.
One of the key reasons we’re an API driven product is we want customers to be able to keep their existing workflows and data governance in place with minimal changes when adding Pumice. Everything stays the same for every other part of the pipeline, just with added automation! Our implementation team works with you to make sure you can integrate everything you need.
Some customers like to set up a queue where newly enriched data goes before being merged into the catalog. This lets them manually review specific fields or categories of products for data integrity if they still want to.
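That review-queue idea can be sketched as a simple filter before the catalog merge. The confidence threshold and flagged categories here are invented for illustration:

```python
# Sketch of a pre-merge review queue: hold back low-confidence
# records or sensitive categories for human review, merge the rest.
# Threshold and category names are illustrative assumptions.
records = [
    {"sku": "A1", "category": "Mirrors", "confidence": 0.98},
    {"sku": "B2", "category": "Food", "confidence": 0.95},
    {"sku": "C3", "category": "Pencils", "confidence": 0.71},
]

REVIEW_CATEGORIES = {"Food"}  # categories that always get human review

auto_merge = [
    r for r in records
    if r["confidence"] >= 0.9 and r["category"] not in REVIEW_CATEGORIES
]
review_queue = [r for r in records if r not in auto_merge]

print([r["sku"] for r in auto_merge])    # ['A1']
print([r["sku"] for r in review_queue])  # ['B2', 'C3']
```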
Are you ready to explore automating your product data enrichment and SKU onboarding with AI? We’d love to show you a demo of what our product can do for you. Contact us today for a demo here. Our usage-based pricing allows you to ramp up gradually to the level of automation you’re comfortable with.