92.44% Product Similarity through fine-tuning CLIP Model + Custom Pipeline for Image Similarity

Matt Payne
April 15, 2023
product photo examples

Image similarity matching, for example product matching, is common in ecommerce SKU management, product recognition in retail settings, and seller onboarding. We see these use cases often in our PIM automation platform Pumice.ai, where the input products tend to be image heavy rather than text heavy. They traditionally grow in difficulty as the number and granularity of SKUs increase, because the variations between SKU images become smaller. Anyone can tell the difference between a pair of shoes and a shirt, but what about the same shirt product with 15 SKUs?

We’ve implemented a custom product image similarity architecture that combines a fine-tuned CLIP model with a custom supporting pipeline to reach 92.44% top-1 accuracy and 99.3% top-3 accuracy on a massive product image dataset. This pipeline focuses on finding similar SKUs between an input product image and a full database of more than 10 million products. We used images from all categories of the Google Product Taxonomy with a focus on apparel, tech, and home goods.

Post-processing examples of product images from the dataset

Short Literature Survey

Image similarity models are becoming increasingly important in various ecommerce applications such as image retrieval, image clustering, and product recommendation systems. In recent years, many image similarity models have been proposed, each with its own strengths and weaknesses. In this quick literature survey, we will review some of the most widely used and effective image similarity models.

  1. Siamese Networks: Siamese networks are a type of neural network architecture with two identical sub-networks that share the same parameters. These networks are used for finding similarities between two images. The network learns to encode images into a feature space, and then computes a similarity score between the two images based on the distance between their feature vectors. Siamese networks have been widely used in image retrieval, image matching, and face recognition applications.

  2. Triplet Networks: Triplet networks are similar to Siamese networks but compare three images instead of two. The network takes three images as input: an anchor image, a positive image (similar to the anchor image), and a negative image (dissimilar to the anchor image). The network learns to encode these images into a feature space such that the distance between the anchor and positive image is smaller than the distance between the anchor and negative image. Triplet networks have been widely used for face recognition and image retrieval tasks.

  3. Convolutional Neural Networks (CNNs): CNNs are a type of neural network architecture that is particularly well suited to image processing tasks. They have been used extensively for image classification, object detection, and segmentation tasks. CNNs can also be used for image similarity tasks by taking the feature vectors from one of the last layers of the network (typically the layer just before the classification head) and computing similarity scores between them.

  4. Generative Adversarial Networks (GANs): GANs are a type of neural network architecture used to generate new images that are similar to a given set of images. They consist of two sub-networks, a generator network that generates new images and a discriminator network that distinguishes between real and generated images. GANs can be used for image similarity tasks by training the generator network to produce images that are similar to a given set of images.

  5. Metric Learning Approaches: Metric learning approaches learn a distance metric that can be used to compare images. These approaches learn a function that maps images to a feature space such that the distance between images in the feature space corresponds to their similarity. Common metric learning losses include contrastive loss, triplet loss, and quadruplet loss.
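
To make the triplet idea from the list above concrete, here is a minimal sketch of a hinge-style triplet loss in plain Python. The 3-dimensional embeddings and the margin value are made up for illustration; a real model would produce much larger vectors and would compute this loss inside a deep learning framework.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss.

    The loss is zero once the positive is closer to the anchor than the
    negative by at least `margin`; otherwise it grows with the violation.
    """
    gap = euclidean(anchor, positive) - euclidean(anchor, negative)
    return max(0.0, gap + margin)


# Hypothetical 3-d embeddings, for illustration only.
anchor = [1.0, 0.0, 0.0]
positive = [0.9, 0.1, 0.0]  # same SKU, slightly different photo
negative = [0.0, 1.0, 0.0]  # different SKU

well_separated = triplet_loss(anchor, positive, negative)
violated = triplet_loss(anchor, negative, positive)  # roles swapped on purpose
print(well_separated, violated)
```

When the positive already sits much closer to the anchor than the negative, the loss is zero and the triplet contributes no gradient. Swapping the roles produces a large loss, which is the signal that pulls matching images together during training.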

In conclusion, the above-mentioned image similarity models have been widely used and have shown promising results in various image-related tasks. The choice of model depends on the specific application and the data available. However, the recent advancements in deep learning have led to the development of more sophisticated models, and it is likely that further progress will be made in this area in the future.

What is CLIP?

CLIP Architecture

CLIP (Contrastive Language-Image Pretraining) is a groundbreaking AI model developed by OpenAI that has significantly impacted the fields of computer vision and natural language processing. The model is designed to understand images and their corresponding textual descriptions, making it a powerful tool for various applications, such as image classification, object detection, and zero-shot learning.

At its core, CLIP is a multi-modal deep learning model that combines the strengths of both computer vision and natural language processing. It leverages a unique training approach called Contrastive Pretraining, which aims to maximize the similarity between image and text embeddings for corresponding pairs while minimizing the similarity for non-corresponding pairs. This technique enables the model to learn a rich and meaningful representation of images and their textual descriptions.

The architecture of CLIP consists of two main components: an Image Encoder and a Text Encoder. The Image Encoder can be a ResNet or a Vision Transformer, responsible for converting images into fixed-size feature vectors. On the other hand, the Text Encoder is a Transformer model with GPT-2-style modifications, responsible for converting textual descriptions into fixed-size feature vectors.

One of the most remarkable aspects of CLIP is its zero-shot capability. Unlike traditional models that require training on specific classes or labels, CLIP can generalize to unseen labels. At classification time, each candidate label is written as a text prompt, and the model computes the cosine similarity between the image embedding and the text embedding of each prompt. The prompt with the highest similarity is chosen as the prediction, which lets CLIP recognize classes and objects it was never explicitly trained to label.
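
The zero-shot step can be sketched in a few lines of plain Python. The prompts and the 3-dimensional embeddings below are hypothetical stand-ins; in a real pipeline they would come from the CLIP text and image encoders.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_classify(image_embedding, prompt_embeddings):
    """Pick the prompt whose embedding is most similar to the image embedding.

    Returns (best_prompt, scores), where scores maps each prompt to its
    cosine similarity with the image.
    """
    scores = {
        prompt: cosine_similarity(image_embedding, embedding)
        for prompt, embedding in prompt_embeddings.items()
    }
    best_prompt = max(scores, key=scores.get)
    return best_prompt, scores


# Hypothetical embeddings, for illustration only.
prompt_embeddings = {
    "a photo of a shoe": [0.9, 0.1, 0.0],
    "a photo of a shirt": [0.1, 0.9, 0.1],
    "a photo of a jacket": [0.0, 0.2, 0.9],
}
image_embedding = [0.2, 0.8, 0.2]

label, scores = zero_shot_classify(image_embedding, prompt_embeddings)
print(label)
```

No class-specific training happens here: adding a new label only requires embedding one more prompt.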

The success of CLIP can be attributed to its massive training dataset, consisting of 400 million image-text pairs, and its efficient use of computational resources. The model's robustness to distribution shift and its ability to handle complex image patterns make it a valuable asset in various AI applications. Training the largest ResNet-based CLIP model, RN50x64, required significant compute power, taking 18 days on 592 NVIDIA V100 GPUs. The largest Vision Transformer model, ViT-L/14, took 12 days on 256 NVIDIA V100 GPUs. These models were trained using mixed precision, gradient checkpointing, half-precision Adam statistics, and other techniques to reduce memory usage and accelerate training.

What is product similarity?

Product similarity, or product matching, is an embedding-focused task that measures how similar products are, either one to one or one to many (one product against a database). It often combines image and text similarity models to fully capture the relationship between products. This allows you to match products with very different names and attributes that should be labeled as the same underlying SKU.
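
As a rough illustration of the one-to-many case, the sketch below ranks a tiny in-memory database by cosine similarity to a query embedding. The SKU ids and vectors are hypothetical, and the brute-force scan is an assumption made for readability; at the scale of millions of products a nearest-neighbor index would normally replace it.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def top_k_matches(query, database, k=3):
    """Rank database entries by cosine similarity to the query.

    `database` maps a SKU id to its embedding. Returns a list of
    (sku, similarity) pairs, highest similarity first.
    """
    q = normalize(query)
    scored = []
    for sku, embedding in database.items():
        similarity = sum(a * b for a, b in zip(q, normalize(embedding)))
        scored.append((sku, similarity))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical SKU ids and embeddings, for illustration only.
database = {
    "jacket-blue": [0.9, 0.1, 0.2],
    "jacket-black": [0.8, 0.3, 0.1],
    "sneaker-white": [0.1, 0.9, 0.3],
    "mug-red": [0.0, 0.2, 0.95],
}
query = [0.88, 0.12, 0.18]

matches = top_k_matches(query, database, k=2)
for sku, similarity in matches:
    print(sku, round(similarity, 3))
```

The two jacket SKUs both score highly, which mirrors the hard part of the problem: the closer the SKUs are visually, the smaller the similarity gap the model has to resolve.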

two jacket example of difficult product matching

These two jackets, offered by different marketplaces, are the same product, yet they would be nearly impossible to match with old-school methods such as keyword extraction, NLP rules, or semantic similarity:

  1. They don’t offer the same colors which affects attribute comparison models
  2. Titles are very different
  3. Descriptions don’t exist to provide deeper context
  4. Prices listed are not close to each other

What we built: SOTA Image Similarity Pipeline

We built a custom SOTA image similarity pipeline that outperforms the base CLIP model on the task of matching an input product image to the same SKU image in a database of millions of products. We combined a fine-tuned CLIP model with an in-house custom feature understanding model to reach new highs in this domain.


evaluation results of our fine-tuned model

To evaluate the model we used the standard top-1, top-3, and top-5 accuracy metrics: how often the correct match was the best match, in the top 3, or in the top 5. Additionally, we used a delta similarity metric to gauge the distribution of similarities between input images and product images. The results above show how these metrics progress as the model trains. To compute the delta similarity, we first calculated the similarity between each input image and its corresponding product image. We then randomly selected additional product images and calculated the average similarity between the input image and those randomly selected product images. Finally, we subtracted this average from the similarity between the input image and its corresponding product image to obtain the delta similarity.

This allowed us to further analyze the distribution of similarities between input images and product images and gain insight into the performance of the model.
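
Both metrics can be sketched from a similarity matrix. This is an illustration under stated assumptions, not the evaluation code behind the numbers in this post: row i holds the similarities between input image i and every product image, the correct product for row i sits at index i, and the randomly selected product images exclude the correct one. The matrix values are made up.

```python
import random


def top_k_accuracy(sim, k):
    """Fraction of inputs whose correct product (index i for row i) ranks in the top k."""
    hits = 0
    for i, row in enumerate(sim):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(sim)


def delta_similarity(sim, num_negatives, seed=0):
    """Per-input gap between the correct match and randomly sampled other products.

    For each row: similarity to the correct product minus the mean
    similarity to `num_negatives` randomly chosen other products.
    """
    rng = random.Random(seed)
    deltas = []
    for i, row in enumerate(sim):
        others = [j for j in range(len(row)) if j != i]
        sampled = rng.sample(others, num_negatives)
        mean_other = sum(row[j] for j in sampled) / num_negatives
        deltas.append(row[i] - mean_other)
    return deltas


# Hypothetical similarity matrix: 4 input images against 4 product images.
sim = [
    [0.95, 0.40, 0.30, 0.20],
    [0.35, 0.90, 0.45, 0.25],
    [0.30, 0.85, 0.80, 0.20],  # correct match is only ranked second here
    [0.10, 0.20, 0.30, 0.70],
]

top1 = top_k_accuracy(sim, 1)
top3 = top_k_accuracy(sim, 3)
deltas = delta_similarity(sim, num_negatives=3)
print(top1, top3, deltas)
```

A larger delta means the correct product stands further apart from unrelated products, so the delta can keep improving even after top-k accuracy has plateaued.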

product dataset results after fine-tuning

Data Analysis

training and test datasets

A t-SNE plot of the image features shows that product images are much more varied than input images, which is what we would expect.

t-sne plot of image features in product similarity

Fine-tuning & Model Optimization

CLIP uses a symmetric cross-entropy loss function as part of its contrastive learning approach. The model is trained to maximize the cosine similarity of the image and text embeddings of the real pairs in a batch while minimizing the cosine similarity of the embeddings of the incorrect pairings. The symmetric cross-entropy loss is computed over the similarity scores of the image and text embeddings.
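
For reference, here is a minimal plain-Python sketch of the symmetric cross-entropy loss that CLIP uses during pretraining. It is not the loss used for the fine-tuning described in this post. It assumes image i in the batch pairs with text i, that the embeddings are already L2-normalized, and that the temperature is fixed at 0.07 (CLIP learns this value during training). The 2-dimensional embeddings are made up.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    log_sum = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_sum for x in row]


def symmetric_cross_entropy(image_embs, text_embs, temperature=0.07):
    """CLIP-style contrastive loss for a batch where image i pairs with text i.

    Embeddings are assumed to be L2-normalized, so each dot product is a
    cosine similarity.
    """
    n = len(image_embs)
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in text_embs]
        for img in image_embs
    ]
    # Image-to-text direction: each row should favor its own diagonal entry.
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image direction: the same computation over columns.
    columns = [list(col) for col in zip(*logits)]
    loss_t2i = -sum(log_softmax(columns[i])[i] for i in range(n)) / n
    return (loss_i2t + loss_t2i) / 2


# Hypothetical normalized embeddings for a batch of two pairs.
image_embs = [[1.0, 0.0], [0.0, 1.0]]
aligned_text = [[1.0, 0.0], [0.0, 1.0]]
swapped_text = [[0.0, 1.0], [1.0, 0.0]]

aligned_loss = symmetric_cross_entropy(image_embs, aligned_text)
swapped_loss = symmetric_cross_entropy(image_embs, swapped_text)
print(aligned_loss, swapped_loss)
```

The loss is close to zero when every image is most similar to its own text and large when the pairings are swapped, which is the behavior the paragraph above describes.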

We chose a different loss function for this specific fine-tuning use case and tuned our hyperparameters to keep the model from overfitting to this dataset. As the per-epoch results above show, the model avoided overfitting once we settled on a set of hyperparameters that we felt reached the minimum of our loss function.

This fine-tuned CLIP model, alongside our custom feature understanding model, quickly learned how to compare products for underlying SKU similarity and left us with more room in the tank for future iterations to improve the accuracy further. This pipeline doesn’t even leverage our custom text similarity architecture, which currently reaches over 90% accuracy for product similarity use cases in Pumice.ai.

Want to implement a custom image similarity pipeline?

Width.ai builds custom NLP & computer vision software for businesses to leverage in use cases like the ones talked about above. Want to schedule a time to talk to us about how we can build something like this for you? Schedule a time to chat on our Contact Us page.