Drive More Conversions: State of the Art Visual Search In Ecommerce | Building Visual Search (Ecommerce Focused)

A complete walkthrough guide on how to use visual search in ecommerce stores to create more sales and real examples of companies already using it.
ecommerce visual search

Humans are highly visual thinkers and are very good at identifying images, and in many cases it is much easier to provide a product image than to describe the product in a search box. The image above could be described in many different ways, which makes it harder for a search engine to identify the exact product; supplying a similar image is far easier. Visual search allows customers to use images to search for information rather than typing words.

Visual search presents users with products whose visual attributes are relevant to a provided image, and it can be used alongside standard text search to include brand names, product features, and other key parts of standard ecommerce search.

Global visual search market (Data Bridge Market Research)

What is visual search in ecommerce?

Let's go ahead with an example and try to find a product match for the below pair of shoes without a description.

yeezy shoe

Unless you are already familiar with the brand or product line these shoes come from, this could be a relatively difficult search. These Adidas Yeezys have no branding on the outside and no recognizable logo.

The difficulty in searching for these shoes could result in losing a customer. Finding them in an ecommerce store could take several different search queries, plus time spent off site researching the shoe. Research suggests that 47% of users give up searching for a product after just one attempt, and only 23% try 3 or more searches, so it’s crucial that users have an optimized experience of finding the product they want.

This is where visual search can help. With a simple image upload, the customer can find all the available options, offers, and even colors much faster. They can shop without needing to figure out which keyword, phrase, or description of the product will return the best match.

Ecommerce visual search isn’t limited to exact product matches. In the example below we find products with features and attributes that are highly similar to our search image. This learned relationship is extremely powerful and gives us the ability to change how we compare product images for similarity.


query with google

In short, visual search takes a product image from the user as input and displays related content. It leverages machine learning and various other data sources to determine the content and context of the input image.

Why is visual search important?

Visual search engages potential customers in a new way and gives them more options for reaching the products they want in the least amount of time. As the visual search space continues to grow, with the market estimated to exceed $14 billion by 2023, offering this way of searching will soon feel non-negotiable. Brands that do not offer it to drive higher conversion rates will find it tough to compete in the growing, mobile app driven ecommerce marketplace.

Key benefits include:

  • Shortening the path to purchase.
  • Increasing engagement through product discovery, color, and choice.
  • Keeping customers in your store without them having to leave to research.
  • Giving customers more information about and descriptions of the product.
  • Serving the customer complementary products and multiple suggestions.

Visual search engines reduce the friction in buying decisions by quickly providing multiple options to choose from. This acts as a stepping-stone to an omnichannel experience.

Real-time Customer Insights

Visual search helps companies get meaningful market demand signals in real time. It shortens the gap between need and demand.

The data streams help in acquiring unprecedented insights about upcoming trends from fashion to features, thus saving time and improving efficiency.

According to reports, more than 600 million visual searches are done each month on Pinterest alone. Studies show that Pinterest ads enjoy an 8.5% conversion rate, and this is expected to continue to grow.

Valuable Metrics Used To Track Visual Search Success

  • Conversion rate
  • Click through rate (CTR)
  • Basket size
  • Product view numbers
  • Bounce and exit rate

Conversion Rate

Conversion rates see a large boost as the number of ways a potential customer can reach their goal increases. As noted above, a large percentage of users drop off after just one search, so it's vital to help them in any way possible to reach the product they are looking for and convert.

conversion rates
Conversion rate

Click through rate (CTR)

Not knowing a category or product name limits what a customer can search for, which in turn skews the metrics an ecommerce site collects. Click through rate is the percentage of impressions of a specific link that result in a click, so making the right product or category link easy to find is crucial for any ecommerce site. Visual search helps when the customer has little information to describe what they’re looking for and only has an image of the product.

CTR = (click-throughs / impressions) x 100
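As a quick worked example, the formula above can be written as a small Python function. The click and impression counts below are made-up numbers for illustration only, not measurements from any store.

```python
def click_through_rate(click_throughs: int, impressions: int) -> float:
    """Return CTR as a percentage: (click-throughs / impressions) x 100."""
    if impressions <= 0:
        raise ValueError("impressions must be positive")
    return 100 * click_throughs / impressions


# Hypothetical numbers: the same 1,000 result impressions, two click counts.
text_search_ctr = click_through_rate(click_throughs=30, impressions=1000)
visual_search_ctr = click_through_rate(click_throughs=45, impressions=1000)
```

Tracking the same calculation separately for text and visual search sessions is one simple way to compare the two entry points.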

Click through rates see a boost for the same reason as conversions. Users find the exact product they are looking for with much less friction, especially in cases where their understanding of how to reach the product is low. The Yeezy shoe example above is a great way to show how hard it can be to find what you’re looking for based on text search alone, but how easy it becomes with an image.

Basket Size

Helping potential customers find the exact products they are looking for tends to increase the average basket size. Users are less likely to give up on looking for a specific product when they already have a similar image and an idea of what they want. Finding the right product keeps the customer motivated and can even encourage them to stretch their budgets. Nearly 60% more Gen Z’ers have bought something after randomly seeing an item they liked, and visual search technology speeds this up. Young shoppers (about 62%) want visual search more than any other technology in ecommerce.

Bounce and exit rate 

Bounce rate is a pretty good metric for understanding how relevant a search result is to the user. Almost anything that improves a search engine's ability to understand the intent of the user will improve the bounce rate.

bounce rate

Potential of visual search

Online stores are in a constant battle to keep the attention of new and existing customers, so making the most of visitors who reach your site with the intent to purchase is hugely valuable.

As customer demographics continue to shift towards buying online and making quick buying decisions, visual search technology is growing rapidly. Apps such as Instagram and TikTok are being integrated with shopping interfaces, which means the path from targeted user to checkout is shorter than ever. It is often very difficult to find product descriptions or information on these apps beyond a product photo, which makes visual search technology even more valuable. According to Accenture, social media influences more than 37% of purchase decisions.

A PwC survey found that 37% of consumers worldwide already use mobile devices to pay for purchases, about 44% for product research, and 38% for comparing products.

intent lab info graphic
The Intent Lab

visenze 2018

It can be hard for people to discover new products and brands through text alone. On Pinterest, the majority of searches are for unbranded products.

pinterest study


When customers can make searches and instantly find products without even a description, impulse buying from a social media post becomes much easier.

impulse buying stats

Studies by Gartner estimate that brands that add visual search to their platforms could see revenue rise by about 30%.

visual search usage by category
Visual Search (Ecommerce focused)

Challenges to overcome

  • Customer Behavior - Customers are not accustomed to using image search so there can be a learning gap. Users have to become accustomed to the shorter product path on an ecommerce site.
  • Targeted Ads - Targeted customer ads can be challenging to streamline as there are no specific keywords to cluster the customers.

Width.ai Technical Visual Search Principle

Content-based image retrieval has received considerable attention from researchers in recent years due to the growing importance of online photos in social media and search engines.

Subsequently, ecommerce online shopping sites started to make use of visual search due to the advantages provided by it.

  • Convenient Interaction
  • Connection between offline and online
  • Easy searches

Image Search

Here let's see how we used image vectors on various models to improve the accuracy of visual search. 

Find all images close to the input image from the database.

upload image to search with

The database holds product images, and our goal is to find shoes that are similar to the input image. Within seconds, our architecture searches the database and finds 30 product images. The model even detected the brand's logo and returned similar results.

results of search

This model learns how to recognize and extract features from images. This allows us to deploy the architecture on any image dataset without needing to train on any specific data.

Our architecture even allows you to use text to search for product images. Here the search query is passed in and the product database is searched for similar images to what is described.

red hair search query
visual search principle

In the above example the Statue of Liberty is given as a query to retrieve multiple relevant images from the database. A pixel-to-pixel comparison will not work for retrieving similar images, so instead we represent the input image with a visual fingerprint, or visual signature, that captures all the information needed for the retrieval task. The visual fingerprint usually takes the form of one or more vectors. Once we have the right vector representation for the task, we compute the same representation for every image in the database.

Once we have all the vector representations we compare the query with the representation of all the images and check which ones are relevant and display the relevant images as output.
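The comparison step can be sketched in a few lines of Python. This is a minimal illustration that assumes cosine similarity as the comparison measure and uses tiny hand-written vectors; the file names, the `search` helper, and the 3-dimensional "fingerprints" are all hypothetical, and a production system would use learned embeddings and an approximate nearest neighbor index instead of a full scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query_vector, index, top_k=3):
    """Score every indexed image against the query and return the best ids."""
    scored = [
        (cosine_similarity(query_vector, vector), image_id)
        for image_id, vector in index.items()
    ]
    scored.sort(reverse=True)
    return [image_id for _, image_id in scored[:top_k]]


# Toy 3-dimensional fingerprints; real embeddings have hundreds of dimensions.
index = {
    "statue_front.jpg": [0.9, 0.1, 0.0],
    "statue_night.jpg": [0.8, 0.2, 0.1],
    "red_sneaker.jpg": [0.0, 0.1, 0.9],
}
results = search([1.0, 0.0, 0.0], index, top_k=2)
```

Here the two statue images rank above the sneaker because their vectors point in nearly the same direction as the query vector.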

Initially, the image retrieval task had a few limitations:

  • Training data – Representations learned for classification (e.g. on ImageNet) generalize across a whole category, so they are not very good at discriminating between different instances within that category.
  • Image resolution – Earlier approaches used smaller images, which makes the details disappear; newer retrieval methods require high-resolution images to keep the important details.

Previous Image retrieval methods can be categorized into four types:

Real Examples Of Visual Search for Ecommerce

Understanding customer behavior

online and offline worlds
Online and offline worlds - Alibaba
online and offline worlds for search
Difference between searches: Text Search —> Search for discovery, Precise Search —> Image search

Customer pain points

how do you find the right keywords?

In recent years visual search has played a major part at ecommerce companies that are forward thinking about artificial intelligence. Let's see how Alibaba and Pinterest currently benefit from visual search.

Visual Search Engine at Pinterest 

Pinterest has over 250 million users that visit the site to see cool examples of fashion, food recipes, home decor, travel, and more from a content corpus of billions of Pins.

A chair recognized in a Pinterest post


What Pinterest found was most people care just about one item or product in an image and would like to get information specific to that item, not the rest of the image.

image with an object to grab
Queries often contain multiple objects, products and interesting visual regions

pinterest image with multiple products


Pinterest solves these challenges by finding the most relevant object in the image. They focus on object-to-object matching using customer input, with a backbone made up of object detection algorithms. The goal is to let the customer crop the image with the help of object detection models and then run a visual search on the extracted product. The results would look something like the image below.

Similar pipeline to Google Lens - Pinterest

Determining Visual Similarity

Given two images, both are passed through a deep learning architecture that contains multiple layers. Each layer produces a different representation of the image, which lets the network understand the images at a deeper level. Visual feature vectors are extracted for both images and the similarity between them is computed.

image similarity

Object to scene visual search tool

pipeline for finding products in an image

To return whole images rather than just cropped objects, an object detector first extracts the objects from each whole image and an object index is built from them. Visually similar objects are then found in that index, and the whole image each matching object came from is shown as output.
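A rough sketch of that idea is shown below. It assumes each detected object has already been turned into an embedding and stored next to the id of the image it was cropped from; the 2-dimensional vectors, file names, and the `scenes_for_object` helper are hypothetical and only illustrate the lookup, not Pinterest's actual implementation.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Each entry: (object embedding, id of the whole image the object came from).
object_index = [
    ([0.90, 0.10], "living_room_1.jpg"),  # chair
    ([0.10, 0.90], "living_room_1.jpg"),  # lamp
    ([0.85, 0.20], "studio_7.jpg"),       # chair
    ([0.20, 0.80], "bedroom_3.jpg"),      # lamp
]


def scenes_for_object(query_embedding, index, top_k=2):
    """Find the nearest indexed objects and return their parent images."""
    ranked = sorted(index, key=lambda entry: euclidean(query_embedding, entry[0]))
    scenes = []
    for _, scene_id in ranked:
        if scene_id not in scenes:  # one result per whole image
            scenes.append(scene_id)
        if len(scenes) == top_k:
            break
    return scenes


chair_scenes = scenes_for_object([1.0, 0.0], object_index)
```

The search runs over object embeddings, but what comes back is the list of whole images that contain a matching object.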

Let's discuss the architecture of one of the latest papers by Pinterest: Learning a Unified Embedding for Visual Search at Pinterest - (https://arxiv.org/pdf/1908.01707.pdf)

This paper focuses on three types of search products:

  • Lens
  • Flashlight
  • Shop-the-look

model architecture for pinterest
Model Architecture

The above figure shows the architecture of a multi-task learning network. The proposed approach, which uses a classification network as a form of proxy-based metric learning, is both flexible and simple for multi-task learning. The architecture also has a binarization module, which makes the embedding memory efficient, and a subsampling module, which supports a large number of classes.

A common base network is shared between all the tasks up to the generation of an embedding. Once the embedding is generated, it is passed to separate task-specific branches.

Task-specific branches are fully connected layers whose weights act as class proxies, trained with a softmax cross-entropy loss. As mentioned above, the subsampling and binarization modules help with scalability and reduce storage cost.
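To make the idea concrete, here is a minimal pure-Python sketch of one shared embedding feeding two task heads, where each head's weight rows play the role of class proxies. Everything in it is an assumption for illustration: the vectors, the head names, and in particular the `binarize` function, which simply keeps the sign of each dimension and may differ from the paper's actual binarization module.

```python
import math


def proxy_logits(embedding, proxies):
    """Dot product of the embedding with each class proxy (one row per class)."""
    return [sum(e * p for e, p in zip(embedding, proxy)) for proxy in proxies]


def softmax_cross_entropy(logits, target):
    """Numerically stable softmax cross-entropy for a single example."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[target]


def binarize(embedding):
    """Assumed scheme: store one bit per dimension (1 if positive, else 0)."""
    return [1 if value > 0 else 0 for value in embedding]


def hamming_distance(bits_a, bits_b):
    """Number of positions where two bit vectors differ."""
    return sum(a != b for a, b in zip(bits_a, bits_b))


# One shared embedding, scored by two hypothetical task-specific heads.
embedding = [0.8, -0.3, 0.5]
task_heads = {
    "lens": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    "shop_the_look": [[0.0, 0.0, 1.0], [0.0, 1.0, 0.0], [-1.0, 0.0, 0.0]],
}
labels = {"lens": 0, "shop_the_look": 0}

losses = {
    task: softmax_cross_entropy(proxy_logits(embedding, proxies), labels[task])
    for task, proxies in task_heads.items()
}
bits = binarize(embedding)
```

In training, each image would normally contribute a loss only through the head for the dataset it belongs to; both heads are evaluated on one embedding here purely to show the shape of the computation. Binary codes can then be compared with a Hamming distance, which is much cheaper to store and compute than float vectors.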

The training process uses three unique datasets that contain a wide range of variations.

  • Flashlight dataset: 800K images across 15K semantic classes
  • Lens dataset: 540K images across 2K semantic classes
  • Shop-The-Look dataset: 340K images across 189 semantic classes, plus 50K instance product labels

Evaluation of the results

model results

The new architecture was shown to outperform the ImageNet baseline and previous embedding work.

Evaluation based on human judgement also showed good results. Raters were asked whether results from the unified embedding or from the existing specialized embeddings were more relevant.

results from pinterest
product search results
The difference in search results in comparison to previous methods 

From the high quality results and solid performance metrics, it's clear the method laid out by Pinterest reduces storage and serving costs while maintaining accuracy.

Pinterest Visual Search Statistics

  • Pinterest visual search is growing about 140% year over year (Business Insider).
  • Pinterest had the largest increase in conversion rates between the large to small shopping carts (Heap Analytics).
  • A Raymond James study found that there would be a 21% drop in traditional search if visual search is given as an option at Pinterest.
  • There are over 600 million visual searches on Pinterest every month.

Visual Search Tools at Alibaba

alibaba image search

Alibaba uses both online and offline search for their visual search engine. The dual nature of the product took off instantly with customers.

online and offline visual search with alibaba
Image Search Workflow

The offline process covers the steps used to build the indexes:

  • Item selection
  • Indexing construction
  • Offline feature extraction

Once the offline process has finished executing, the online inventory is updated every day. The online process covers the key steps that return the final results when a user uploads a query image.

The process is similar to the offline process and includes:

  • Category prediction
  • Feature extraction
  • Online detection

Eventually, the result lists are retrieved by indexing and re-ranking.

Model Architecture

The model is trained for a vast number of product categories and image variance. To improve the accuracy of predicting categories the architecture uses a weighted mix of search based and model based results. 

For the model-based part, they have deployed the GoogLeNet V1 (https://arxiv.org/abs/1409.4842) network. For the search-based part, they make use of the discriminative capacity of output features from deep networks. A binary search engine retrieves the top 30 results from a reference set, and the contribution 𝑦𝑖 of each neighbor 𝑥𝑖 among those 30 is then weighted to predict the label 𝑦 of the query 𝑥.

Weight function
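The category prediction step can be sketched as a weighted nearest-neighbor vote blended with the model's own prediction. The weight function itself is only shown as an image in the original, so the exponential decay `exp(-scale * distance)` used below is an assumption, as are the helper names, the `model_weight` mixing factor, and the toy distances and labels.

```python
import math
from collections import defaultdict


def knn_category_scores(neighbors, scale=1.0):
    """Turn (distance_to_query, category) neighbors into normalized scores.

    Assumed weight function: exp(-scale * distance), so closer neighbors
    contribute more to their category.
    """
    if not neighbors:
        return {}
    scores = defaultdict(float)
    for distance, label in neighbors:
        scores[label] += math.exp(-scale * distance)
    total = sum(scores.values())
    return {label: value / total for label, value in scores.items()}


def fuse(model_probs, search_scores, model_weight=0.5):
    """Weighted mix of model-based and search-based category scores."""
    labels = set(model_probs) | set(search_scores)
    return {
        label: model_weight * model_probs.get(label, 0.0)
        + (1.0 - model_weight) * search_scores.get(label, 0.0)
        for label in labels
    }


# Three hypothetical neighbors stand in for the top 30 retrieved results.
neighbors = [(0.2, "shoes"), (0.3, "shoes"), (0.9, "bags")]
model_probs = {"shoes": 0.6, "bags": 0.4}
fused = fuse(model_probs, knn_category_scores(neighbors))
predicted = max(fused, key=fused.get)
```

The search-based vote can correct the classifier when the nearest reference images clearly agree on a category, and the classifier can steady the vote when the neighbors are mixed.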

Joint Detection and Feature Learning

The main challenge in image retrieval is the difference in quality between customer and seller images. Seller images are professionally taken, high quality images that are often cleaned up and shot against a noise-free background. Images taken by customers come with no guarantees about background noise or image quality. A model trained on a product image dataset that does not account for this large variance can often struggle on these new images.

results with alibaba

To address this issue they have proposed a deep CNN model with branches based on deep metric learning to learn detection and feature representations simultaneously.

deep joint model alibaba

A deep joint model is used to avoid huge time and bounding box annotation costs. The model jointly optimizes the detection and feature learning with two branches as shown in the picture above. In each deep joint model the detection mask 𝑀(𝑥, 𝑦) can be represented by a step function for bounding box approximation.

Rectangle coordinates (𝑥𝑙, 𝑥𝑟, 𝑦𝑡, 𝑦𝑏)
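One way to read the step-function mask is as a product of four step functions, one per rectangle edge, so that the mask is 1 inside the box and 0 outside. The sketch below assumes that form and assumes image coordinates where 𝑦 grows downward (so 𝑦𝑡 ≤ 𝑦𝑏); the `step`, `detection_mask`, and `apply_mask` names and the 4×4 toy image are illustrative only.

```python
def step(value, threshold):
    """Step function: 1 when value >= threshold, otherwise 0."""
    return 1 if value >= threshold else 0


def detection_mask(x, y, x_l, x_r, y_t, y_b):
    """Return 1 for pixels inside the rectangle (x_l, x_r, y_t, y_b), else 0."""
    return step(x, x_l) * step(x_r, x) * step(y, y_t) * step(y_b, y)


def apply_mask(image, box):
    """Zero out every pixel of a row-major image that falls outside the box."""
    x_l, x_r, y_t, y_b = box
    return [
        [pixel * detection_mask(x, y, x_l, x_r, y_t, y_b)
         for x, pixel in enumerate(row)]
        for y, row in enumerate(image)
    ]


image = [
    [5, 5, 5, 5],
    [5, 7, 7, 5],
    [5, 7, 7, 5],
    [5, 5, 5, 5],
]
masked = apply_mask(image, box=(1, 2, 1, 2))
```

A hard step like this is not differentiable, so a model that learns the box coordinates by gradient descent would typically need a smooth approximation in its place.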

The overall deep ranking framework is shown below:

deep ranking framework alibaba

Image Indexing and Retrieval 

Given that the application is used by tens of millions of users each day, accurate real-time performance and application stability are incredibly important for the visual search engine.

multishard network

Multi-shards: A single index instance is often too large to store on one machine, both in terms of memory and scalability. The index is therefore spread across multiple machines, with each shard storing only a subset of the total vectors. Each shard finds its own subset of 𝐾 nearest neighbors, and these are combined into the final result.

Multi-replications: The number of queries per second (QPS) is too high for a single index cluster, so they use a multi-replication mechanism. Suppose there are Q queries visiting the system at the same time. They divide these queries into R parts, each part having Q/R queries, and each part sends its requests to a separate index cluster. With this method the number of queries that one index cluster needs to process at a time decreases from Q to Q/R.
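Both mechanisms can be illustrated with a small Python sketch. It is only a toy model under stated assumptions: shards are plain dictionaries, distance is squared Euclidean, queries are split round-robin, and the function names and data are hypothetical rather than Alibaba's implementation.

```python
import heapq


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def search_shard(shard, query, k):
    """Return the k nearest (distance, item_id) pairs held by one shard."""
    scored = ((squared_distance(vector, query), item_id)
              for item_id, vector in shard.items())
    return heapq.nsmallest(k, scored)


def search_all_shards(shards, query, k):
    """Ask every shard for its own top k, then merge into a global top k."""
    partial = []
    for shard in shards:
        partial.extend(search_shard(shard, query, k))
    return [item_id for _, item_id in heapq.nsmallest(k, partial)]


def split_queries(queries, replicas):
    """Divide Q queries round-robin into R parts of about Q/R queries each."""
    return [queries[i::replicas] for i in range(replicas)]


# Two shards, each holding a subset of the vectors.
shards = [
    {"a": [0.0, 0.0], "b": [1.0, 1.0]},
    {"c": [0.1, 0.0], "d": [5.0, 5.0]},
]
nearest = search_all_shards(shards, query=[0.0, 0.0], k=2)
parts = split_queries(list(range(6)), replicas=3)
```

Because every shard returns its own top 𝐾, the merged list is guaranteed to contain the true global top 𝐾, while each replica only has to serve its share of the incoming queries.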

Extensive experiments on the High Recall Set illustrate the promising performance of the modules in Pailitao, Alibaba's visual search product.

high recall setup for visual search

Alibaba Visual Search Statistics

alibaba visual search stats

Width.ai Visual Search Development

The ROI of incorporating visual search into your ecommerce store is easy to see in the success these large companies have had with it. Visual and voice search continue to grow, and modern ecommerce companies are doing everything they can to help online shoppers reach the exact product they have in mind through any channel. As social media apps become a staple in how potential customers find products that interest them, visual search will only make it easier for you to convert them.

Width.ai builds custom computer vision and natural language processing software products for the ecommerce industry. We specialize in building visual search engines for ecommerce companies of any size, and these engines can be easily integrated into existing text based search tools. Contact us today to learn more about our ecommerce solutions.
