How We Improve Product Similarity Search with FashionCLIP over Traditional CLIP

Nitin Rai
August 4, 2023

In the rapidly evolving world of e-commerce and online shopping, product similarity search has become a crucial component for enhancing user experience and driving sales.

Traditional methods of image-based search have served us well, but with the advent of cutting-edge AI models like Fashion CLIP, there is a new frontier in product similarity search that promises to revolutionize how we find and discover products.

In this blog post, we’ll explore the differences between traditional CLIP and FashionCLIP and how we’ve implemented this domain-specific CLIP model to improve our product similarity search architecture.

Where is Product Similarity Search Used?

Width.ai Product Recognition Pipeline

Product similarity search is used as one of the key pieces of a product recognition pipeline. Once we’ve recognized that a product (or multiple on a retail shelf) exists in an image or video frame via a detection network, we crop the product out to prepare it for search. From there we can compare this cropped product image to a database of product images to find the most similar SKU.
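As a minimal sketch of the crop step, assuming the detection network returns pixel-coordinate bounding boxes (the `crop_detections` helper and box format here are illustrative, not our production code):

```python
import numpy as np

def crop_detections(frame, boxes):
    """Crop each detected product out of a frame.

    frame: H x W x C image array.
    boxes: list of (x1, y1, x2, y2) pixel coordinates from the detector.
    Returns one crop per box, ready to be embedded for similarity search.
    """
    crops = []
    for x1, y1, x2, y2 in boxes:
        # Clamp the box to the frame so a detector overshoot never breaks the crop.
        x1, y1 = max(0, x1), max(0, y1)
        x2, y2 = min(frame.shape[1], x2), min(frame.shape[0], y2)
        crops.append(frame[y1:y2, x1:x2])
    return crops

# A 100x200 dummy "shelf frame" with two detected products.
frame = np.zeros((100, 200, 3), dtype=np.uint8)
crops = crop_detections(frame, [(10, 20, 60, 90), (150, 0, 210, 50)])
print([c.shape for c in crops])  # [(70, 50, 3), (50, 50, 3)]
```

Each crop is then embedded and compared against the SKU database in the later stages of the pipeline.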

This is a much better approach than training a purely class-based object detection model, as it requires less training data and scales much more easily.

We’ve scaled this system in customer use cases to 3.2 million unique SKUs with even just one example image per SKU. Product similarity search is the piece that allows us to perform this comparison against the database.

Understanding the Foundations: CLIP

  1. Shared Embedding Space: CLIP is an AI model developed by OpenAI that enables cross-modal understanding between images and text. It learns to map images and their corresponding descriptions into a shared embedding space, putting our image data in the same vector format used in NLP-based systems.
  2. Efficient Image-Text Retrieval: CLIP allows for efficient image-text retrieval by encoding images and text into a common vector space. However, its generic nature can lead to suboptimal performance in domain-specific tasks like product search.
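To make the shared vector space concrete, here is a minimal NumPy sketch of similarity lookup over precomputed embeddings. The random vectors stand in for real CLIP outputs, and `top_matches` is an illustrative helper, not a CLIP API:

```python
import numpy as np

def top_matches(query_emb, index_embs, k=3):
    """Rank index embeddings by cosine similarity to a query embedding."""
    # Normalize so the dot product equals cosine similarity.
    q = query_emb / np.linalg.norm(query_emb)
    idx = index_embs / np.linalg.norm(index_embs, axis=1, keepdims=True)
    scores = idx @ q
    order = np.argsort(-scores)[:k]
    return order, scores[order]

rng = np.random.default_rng(0)
index = rng.normal(size=(5, 512))                # stand-in for 5 image embeddings
query = index[2] + 0.01 * rng.normal(size=512)   # a near-duplicate of item 2
order, scores = top_matches(query, index, k=1)
print(order[0])  # 2 -- the near-duplicate is the best match
```

Because images and text land in the same space, the same lookup works whether the query embedding came from an image or from a text description.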

Limitations in Recognizing Products

Retail product recognition is a nuanced domain with complex attributes, including the styles, colors, and patterns on product packaging. Traditional CLIP may struggle to capture these fine-grained details, leading to search results that are not very relevant to what we want to match against.

Finding the right product can be tricky. For example, if we search for a "blue floral dress", traditional CLIP might recognize the color "blue" and the word "dress," but it may not understand the crucial part about the "floral pattern." So, we might see dresses that are blue but don't have the floral design we want.

CLIP Model: blue floral dress search results

FashionCLIP Model: blue floral dress search results

Introducing Fashion CLIP

  1. A Purpose-Built Model: Fashion CLIP is tailored specifically for the product-related domain, taking into account attributes like product styles, colors, textures, and patterns. This specialization enables it to understand product items better.
  2. Fine-Tuning on Product Datasets: To improve its performance in the product-related domain, Fashion CLIP is fine-tuned on large-scale (image, text) pairs obtained from the Farfetch dataset. This fine-tuning allows the model to learn the specific visual and textual features relevant to product items.
Latent space for clothing in CLIP and FashionCLIP

FashionCLIP aligns images and text in its vector space during training, improving their coherence and enabling accurate image-text matching in the product discovery domain.

  3. Leveraging Pretraining for Better Embeddings: The model uses a ViT-B/32 Transformer architecture as its image encoder and a masked self-attention Transformer as its text encoder. Starting from a pre-trained checkpoint, these encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss on a fashion dataset containing 800K products. The result is more meaningful and contextually rich embeddings for fashion products, leading to improved similarity search results.
Schematic overview of multi-modal retrieval (left) and zero-shot classification tasks (right).
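The contrastive objective described above can be sketched in a few lines of NumPy. This is an illustrative symmetric InfoNCE computation over toy embeddings, not FashionCLIP's actual training code:

```python
import numpy as np

def clip_contrastive_loss(img_embs, txt_embs, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matched (image, text) pairs.

    Row i of each matrix is one pair; the loss pulls matched pairs
    together and pushes every mismatched pair apart.
    """
    # L2-normalize so the logits are scaled cosine similarities.
    img = img_embs / np.linalg.norm(img_embs, axis=1, keepdims=True)
    txt = txt_embs / np.linalg.norm(txt_embs, axis=1, keepdims=True)
    logits = (img @ txt.T) / temperature

    # Cross-entropy in both directions; targets are the diagonal (i matches i).
    def xent(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    return 0.5 * (xent(logits) + xent(logits.T))

# Perfectly aligned pairs give a near-zero loss; shuffled text raises it.
embs = np.eye(4, 8)  # 4 orthogonal toy "embeddings"
aligned = clip_contrastive_loss(embs, embs)
shuffled = clip_contrastive_loss(embs, embs[::-1])
print(aligned < shuffled)  # True
```

Minimizing this loss is what aligns image and text embeddings in the shared space during fine-tuning.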

In our study, we compare FashionCLIP and CLIP on fashion-related tasks using different datasets. Specifically, we focus on Zero-Shot Classification and assess the models' performance on out-of-distribution datasets (KAGL, DEEP, and FMNIST).

FashionCLIP exhibits a noticeable performance boost over CLIP in these scenarios. Detailed experimental setups and results can be found in the accompanying paper, ensuring transparency and reliability in our findings.
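Zero-shot classification with a CLIP-style model reduces to comparing an image embedding against one text-prompt embedding per label. A toy NumPy sketch, where random vectors stand in for real encoder outputs:

```python
import numpy as np

def zero_shot_classify(image_emb, prompt_embs, labels):
    """Pick the label whose text-prompt embedding is closest to the image."""
    img = image_emb / np.linalg.norm(image_emb)
    prompts = prompt_embs / np.linalg.norm(prompt_embs, axis=1, keepdims=True)
    return labels[int(np.argmax(prompts @ img))]

labels = ["a photo of a dress", "a photo of a sneaker", "a photo of a handbag"]
rng = np.random.default_rng(7)
prompts = rng.normal(size=(3, 512))
image = prompts[1] + 0.05 * rng.normal(size=512)   # an image near the "sneaker" prompt
print(zero_shot_classify(image, prompts, labels))  # a photo of a sneaker
```

No classifier head is trained; swapping the label set only requires re-encoding the prompts, which is why the same model transfers across out-of-distribution datasets.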

FashionCLIP shows significant improvement over CLIP in weighted macro F1 scores on out-of-domain datasets.

FashionCLIP vs CLIP in many different categories

In a t-SNE projection of the embeddings, FashionCLIP achieves a higher silhouette score than CLIP (0.115 vs. 0.0745), clustering fashion items into denser, non-overlapping categories.

Advantages of Fashion CLIP in Product Similarity Search

  1. Enhanced Image-Text Alignment: Fashion CLIP's domain-specific knowledge results in better alignment between product images and textual descriptions, resulting in more accurate and relevant search results.
  2. Improved Query Understanding: Fashion CLIP comprehends fashion-related language, including fashion terminology, brand names, and item attributes, allowing for more precise and targeted search queries.
  3. Addressing Ambiguity: Fashion CLIP can handle ambiguous search queries effectively, reducing the risk of irrelevant results and ensuring a seamless user experience.

How We’ve Used Fashion CLIP

example gif of on shelf search and product recognition
  1. Fashion E-Commerce Integration: By integrating Fashion CLIP into online retail marketplaces, businesses can provide customers with an enhanced product discovery experience, leading to increased engagement and conversions.
  2. Personalized Styling and Recommendations: The improved product similarity search with Fashion CLIP enables personalized styling and recommendation systems that match customers with products tailored to their unique preferences.
product comparison on the shelf

Another great thing about embeddings calculated with FashionCLIP is that they take into account the semantic meaning of the image. For example:

On the left is the query image and on the right is the closest matching image in our index. Even though they are different products, the text DALIA is common between them.

In the first step of recognizing products from shelf images, we utilize our product recognition architecture for product detection.

This real-time and accurate deep learning model efficiently identifies and localizes multiple products within cluttered shelf environments. Its strong performance sets the foundation for subsequent stages in the product recognition pipeline.

product recognition on the shelf

We built an image index using FashionCLIP embeddings and a database for our product images. Below is an example of what the top-K matching images look like for a query image.

image index with fashion clip
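A minimal sketch of a top-K lookup over such an index, assuming the embeddings are precomputed and L2-normalized offline (the `top_k_skus` helper and SKU ids here are illustrative, not our production index):

```python
import numpy as np

def top_k_skus(query_emb, index_embs, sku_ids, k=5):
    """Top-k SKU lookup over a pre-normalized embedding index.

    argpartition keeps the lookup O(n) even over millions of SKUs,
    fully sorting only the k best candidates.
    """
    scores = index_embs @ (query_emb / np.linalg.norm(query_emb))
    cand = np.argpartition(-scores, k)[:k]   # unordered top-k
    cand = cand[np.argsort(-scores[cand])]   # order them best-first
    return [(sku_ids[i], float(scores[i])) for i in cand]

rng = np.random.default_rng(1)
index = rng.normal(size=(1000, 512))
index /= np.linalg.norm(index, axis=1, keepdims=True)  # normalize once, offline
skus = [f"SKU-{i:04d}" for i in range(1000)]
query = index[42]                                      # exact copy of SKU-0042
print(top_k_skus(query, index, skus, k=3)[0][0])       # SKU-0042
```

At the multi-million-SKU scale mentioned earlier, a brute-force scan like this is typically replaced by an approximate nearest-neighbor index, but the scoring logic is the same.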

Future Prospects

  1. Potential for Growth: With ongoing research and development, Fashion CLIP holds the potential for further improvements and expansion into related domains beyond fashion/products, contributing to a more advanced and seamless online shopping experience.
  2. Conclusion: Fashion CLIP's unique capabilities have the potential to reshape the fashion industry's approach to product similarity search, unlocking new possibilities for businesses and providing customers with a more enjoyable and efficient shopping journey.

Limitations, Bias, and Fairness in FashionCLIP

  1. Data-Driven Assumptions: The fashion data used in training FashionCLIP may contain explicit assumptions, like associating certain clothing attributes with specific gender identities. This can impact the model's understanding and representation of fashion items, potentially leading to biased results.
  2. Textual Modality: Due to the prevalence of long captions in the Farfetch dataset, FashionCLIP may perform better with longer queries than shorter ones. This characteristic should be considered when designing search queries to ensure optimal results.
  3. Image Modality: FashionCLIP demonstrates a bias towards standard product images with centered subjects and white backgrounds. This bias may affect its performance when presented with more diverse or unconventional fashion images.

Interested in implementing FashionCLIP or Product Recognition?

Let's chat about how we can implement these exact pipelines for product search or retail shelf product recognition. Contact us today!