SOTA SKU Image Classification for Product Matching | How we outperformed the Fashion CLIP model

Nitin Rai
October 16, 2023

Width.ai created a new retail product image classification model that outperforms the SOTA results from CLIP and Fashion CLIP on the most popular dataset in the domain. These models are commonly used in product matching use cases where photos are taken with lower resolution and zero control for noise and image angles.

If you are unfamiliar with product matching on retail shelves, I highly recommend you start with this product matching architecture blog post to understand the entire pipeline and why we use a similarity-based approach for SKU image matching over a class-based system.

Width.ai product matching architecture

I also recommend taking a look at our previous version of this product SKU image classification model that does a great job of outlining the process for reaching these benchmarks. This workflow is where we went from CLIP to Fashion CLIP with customizations to the architecture to fit this domain.

current system for product similarity

Let’s start with a bit of an introduction to the key pieces of the equation.

What is Fine-Tuning?

Fine-tuning in machine learning involves slightly adjusting the parameters of a pre-trained model for a new, related task. It's beneficial for several reasons:

  1. Resource Efficiency: Fine-tuning reduces the need for extensive computational resources and time as it leverages pre-existing learned patterns.
  2. Transfer Learning: It capitalizes on the concept of applying knowledge from one problem to a related one, such as using a model trained on general images to recognize specific items.
  3. Limited Data Handling: Fine-tuning helps prevent overfitting when dealing with smaller datasets, as the model has already learned general features from its initial larger dataset training.
  4. Fast Iteration: Fine-tuning lets you quickly iterate on models within an evaluation framework to understand where they stand on a specific task.

The CLIP model, known for its capacity to comprehend and link text and images, is trained on a vast internet corpus. However, its generalized training may not fully equip it to handle specific or specialized content. To maximize CLIP's potential for a particular task or domain, fine-tuning is essential.

Fine-tuning is a pretty standardized part of building product matching systems. Most use cases have unique variables such as setting, camera quality, and product count that create variations of how to accomplish the task. The goal of fine-tuning in this specific part of product matching is to better match the cropped products in the environment to product records in the database. This database has the classes of SKUs we know exist.

Dataset for Evaluation

The dataset we focused on optimizing for is RP2K: A Large-Scale Retail Product Dataset for Fine-Grained Image Classification. It contains more than 500,000 images of retail products on shelves, belonging to more than 2,000 different products.

RP2K dataset outline

This dataset is awesome for product matching use cases because the images come from a high-noise environment and are cropped directly from the shelf. Most of the datasets companies use for training and testing their product matching systems come from Open Food Facts, which contains stale product images with very little noise. While these are pictures of the products you want to recognize, they do not look like the products that come from the shelf in terms of angle, size, and brightness. You might see very high accuracy on the Open Food Facts dataset, then move to a real product use case and see your accuracy evaporate.

RP2K Sample Images

You can see there is a ton of variation in color, angle, camera quality, blur, and other features that will show up in real retail shelf recognition. Logos are far easier to see in simple handheld product photos or ecommerce product images. We used to try to augment clean images with this kind of noise so that the clean training dataset better matched how the images actually look when compared for similarity. Here’s an example of an image used on Open Food Facts, and it’s clearly cleaner than what would be seen on a shelf.

Example image from Open Food Facts

Evaluating our results on the RP2K dataset gives a much better representation of how product matching performs in real-world use.

Model Performance & Comparison

Before we get into the details of our new model's results, let's look at the results of the most common models used for product similarity in the product matching use case. We want to see how well CLIP and Fashion CLIP find matching SKUs in the database. Take a look at the chart below to see how they perform on this task.

model comparison

We can clearly see that, for the kind of images present in the RP2K dataset and in real product matching use cases, both the baseline CLIP model and the fine-tuned Fashion CLIP model struggle to perform well. In our evaluation, CLIP reached 41% Top-1 accuracy while Fashion CLIP reached 50%. This means that when we provide an input cropped product image, the correct SKU is returned as the top result 41% and 50% of the time, respectively. As you can imagine, these numbers go up as we relax the criterion to the correct result appearing anywhere in the top 3 or top 5 returned results. That being said, it's concerning that the correct result appears in the top 3 results under 65% of the time with either model, and CLIP never moves past 60%.
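To make the Top-1, Top-3, and Top-5 numbers concrete, here is a minimal sketch of how Top-K accuracy can be computed once each query crop has a ranked list of candidate SKU labels. The function name `top_k_accuracy` and the toy labels are hypothetical and only for illustration; this is not the exact evaluation code behind the chart.

```python
import numpy as np

def top_k_accuracy(ranked_labels, true_labels, k):
    """Fraction of queries whose true label appears among the first k ranked labels."""
    ranked = np.asarray(ranked_labels)[:, :k]   # keep only the k best candidates
    truth = np.asarray(true_labels)[:, None]    # column vector for broadcasting
    return float((ranked == truth).any(axis=1).mean())

# Hypothetical ranked SKU labels (best match first) for 4 query crops.
ranked_labels = [
    ["cola", "soda", "tea", "juice", "water"],   # correct at rank 1
    ["tea", "cola", "soda", "juice", "water"],   # correct at rank 2
    ["juice", "water", "cola", "tea", "soda"],   # correct at rank 3
    ["water", "juice", "tea", "soda", "cola"],   # correct at rank 5
]
true_labels = ["cola", "cola", "cola", "cola"]

for k in (1, 3, 5):
    print(f"Top-{k} accuracy: {top_k_accuracy(ranked_labels, true_labels, k):.2f}")
```

Top-K accuracy can only stay the same or increase as K grows, which is why the Top-3 and Top-5 figures are always at least as high as Top-1.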

Let's break down the evaluation method step by step (Top-K Ranking):

1. Creating Embeddings for Training Images:

  • First, we have a set of images in our training dataset, each belonging to a specific class or SKU.
  • We process each image through a neural network to obtain a unique numerical representation called an "embedding" for that image.
  • These embeddings are then stored, along with their respective class labels. So, for each training image, we know both its embedding and its class label.

2. Computing Embeddings for Test Images:

  • Next, we have a separate set of images in our test dataset, and we want to evaluate how well our model can recognize and classify these test images.
  • Similar to the training images, we process each test image through the same neural network to obtain an embedding for each of them.

3. Calculating Cosine Distances:

  • Now, for each test image, we calculate the cosine distance between its embedding and all the embeddings of the training images that we computed and stored earlier.
  • The cosine distance (one minus the cosine similarity) measures how dissimilar two embeddings are, so lower values indicate greater similarity.

Cosine similarity of product images
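For reference, the cosine distance from step 3 can be written out as below. The symbols are chosen here purely for illustration: a is the embedding of a test image and b is the embedding of a training image.

```latex
d_{\cos}(a, b) \;=\; 1 - \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert},
\qquad 0 \le d_{\cos}(a, b) \le 2
```

A distance of 0 means the two embeddings point in exactly the same direction, and the distance depends only on direction, not on the length of either vector.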

4. Finding the Nearest Neighbor:

  • Among all the training image embeddings, we identify the one with the smallest cosine distance to the embedding of the test image. Essentially, we are finding the "nearest neighbor" in the training set for each test image.

5. Comparing Class Labels:

  • After finding the nearest neighbor, we check the class label associated with that training image.
  • We then compare this predicted class label with the actual class label of the test image.

6. Evaluating Correctness:

  • If the predicted class label matches the actual class label of the test image, we consider this a correct classification or ranking for that test image.
  • We repeat this process for all test images and keep track of how many times the predicted label matches the true label.

In summary, this evaluation method assesses the performance of a model by checking how well it can recognize and classify images in a test dataset based on the similarity of their embeddings to those of the training dataset. It measures accuracy by comparing the predicted labels to the true labels of the test images. This approach is often used in tasks like image retrieval or image classification to evaluate the quality of a model's representations and its ability to generalize to new data.
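The six steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not our production evaluation code: the embeddings are tiny hand-written 2-D vectors standing in for the output of a real image encoder, and the names `l2_normalize` and `top1_accuracy` are hypothetical.

```python
import numpy as np

def l2_normalize(x):
    """Scale each row to unit length so a dot product equals cosine similarity."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def top1_accuracy(train_emb, train_labels, test_emb, test_labels):
    """Nearest-neighbor Top-1 accuracy using cosine distance."""
    train_n = l2_normalize(np.asarray(train_emb, dtype=float))
    test_n = l2_normalize(np.asarray(test_emb, dtype=float))
    # Step 3: cosine distance between every test and training embedding,
    # shape (n_test, n_train).
    dist = 1.0 - test_n @ train_n.T
    # Step 4: index of the nearest training embedding for each test image.
    nearest = dist.argmin(axis=1)
    # Steps 5 and 6: compare the neighbor's label with the true label.
    predicted = np.asarray(train_labels)[nearest]
    return float((predicted == np.asarray(test_labels)).mean())

# Toy stand-ins for stored training embeddings (steps 1 and 2 would
# normally run each image through the embedding model).
train_emb = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
train_labels = ["sku_a", "sku_a", "sku_b", "sku_b"]
test_emb = [[0.8, 0.2], [0.2, 0.8], [0.6, 0.4]]
test_labels = ["sku_a", "sku_b", "sku_b"]

acc = top1_accuracy(train_emb, train_labels, test_emb, test_labels)
print(f"Top-1 accuracy: {acc:.3f}")
```

In this toy example the third test embedding sits closer to the `sku_a` training vectors even though its true label is `sku_b`, so it counts as a miss. For a database with millions of SKUs, the full distance matrix would typically be replaced with an approximate nearest-neighbor index.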

These embedding models can sometimes give the illusion of excellent performance if the same image is used both during the initial embedding calculation and in a subsequent reverse image search. In such cases, there would often be a 100% match, making it challenging to accurately assess the model's true capabilities.
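The effect is easy to reproduce. In the sketch below, which uses random vectors and random labels as hypothetical stand-ins for real embeddings, searching a gallery with the very same images it was built from gives a perfect score, even though the embeddings carry no information about the labels at all. The helper name `nearest_labels` is an assumption made for illustration.

```python
import numpy as np

def nearest_labels(gallery, gallery_labels, queries, exclude_self=False):
    """Label of the nearest gallery embedding (cosine distance) for each query."""
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    dist = 1.0 - q @ g.T
    if exclude_self:
        # Only meaningful when queries and gallery are the same images:
        # forbid each image from matching itself.
        np.fill_diagonal(dist, np.inf)
    return gallery_labels[dist.argmin(axis=1)]

rng = np.random.default_rng(0)
emb = rng.normal(size=(100, 16))         # random "embeddings"
labels = rng.integers(0, 20, size=100)   # random labels for 20 fake SKUs

# Same images in gallery and query set: every image finds itself.
leaky = float((nearest_labels(emb, labels, emb) == labels).mean())

# Excluding the self-match reveals the true, near-chance performance.
held_out = float(
    (nearest_labels(emb, labels, emb, exclude_self=True) == labels).mean()
)

print(f"with self-matches: {leaky:.2f}, without: {held_out:.2f}")
```

Keeping the images used to build the gallery separate from the images used as queries avoids this inflated score.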

Fortunately, the dataset we selected for evaluation has a balanced distribution of classes, approximately 2,388 classes in both the training and test sets. Moreover, on average, there are five different image samples available for each class. This balanced and diverse dataset helps us avoid the issue of overestimating the model's performance due to exact image matches during testing, allowing for a more reliable evaluation.

Our results with our new model

Our brand new model reached 89% Top-1 retrieval accuracy, compared to the 41% and 50% above. This comes from breakthroughs in how we utilize the weights of the model and how we set up our hyperparameters. Being able to train this architecture without large amounts of data augmentation, and to evaluate it on a dataset that fits the real-world use case, makes it easier for us to iterate this accuracy forward than with the other two models.

Our results vs. the two most common models

Top-1 Accuracy

The new model has a much deeper understanding of products in a retail environment, where the images are not clean and the quality is not always high. This is the point where most of these products fall apart if they don’t already have enough data to train on through real-world image collection. This model gives us an elite starting point that only improves from there.

Interested in implementing product matching?

Width.ai builds custom product matching systems used in retail environments to recognize and match SKUs, products, and other items. We’ve reached 90%+ accuracy in a ton of domains and have scaled these systems up to 3.5 million SKUs. We’d love to chat about your product matching or warehouse automation use case!