PyTorch Image Classification for Product Recognition: SOTA Framework for SKU Recognition

Matt Payne
November 10, 2021

Visually focused automation in business settings has improved markedly in accuracy and speed as computer vision architectures have matured. Industries that rely on manual work can now leverage computer vision through everyday devices such as phones and ceiling cameras, without needing expensive hardware or vendor-locked APIs. We’ve built a number of these systems for various use cases, all centered on how computer vision architectures built on PyTorch can automate business processes.

One such use case is setting up an image recognition system with PyTorch that can recognize product SKUs, gather information from the classified image, and feed that information into an inventory management system, a CMS, or wherever else it is needed. Let’s talk a bit about performing image classification with PyTorch and walk through an example use case.

What is image classification?

image classification vs object detection

Image classification is the process of taking an image as input and outputting a class label that best describes the image. Depending on the labels chosen during training, this can range in granularity from detecting the class of a product (beverage, meat, flour) in an image to detecting specific SKUs (7oz flank steak, Coke Zero 12oz, Coke Glass Bottle 7oz). The label can be as precise as the information we can give the model during training.

Image classification used in single-view, multi-object settings often requires a prior step of image segmentation or object detection. The upstream model finds the individual product instances, each product is cropped out of the full image, and we then perform image classification on the cropped product.

product recognition with image classification and image cropping
How we go from full shelf image in a retail setting to a cropped product image that we can classify based on SKU
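The crop-then-classify flow can be sketched in a few lines. This is a minimal illustration, not our production pipeline: the bounding boxes are hard-coded stand-ins for the output of an upstream detector, and `toy_classifier`, `crop_and_classify`, and the 64x64 input size are hypothetical names and values chosen for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def crop_and_classify(shelf_image, boxes, classifier, size=(64, 64)):
    """Crop each detected box out of the shelf image and classify the crop.

    shelf_image: float tensor of shape (3, H, W)
    boxes: list of (x1, y1, x2, y2) pixel coordinates from an upstream detector
    """
    crops = []
    for x1, y1, x2, y2 in boxes:
        crop = shelf_image[:, y1:y2, x1:x2]
        # Resize every crop to the fixed input size the classifier expects.
        crop = F.interpolate(crop.unsqueeze(0), size=size,
                             mode="bilinear", align_corners=False)
        crops.append(crop)
    batch = torch.cat(crops, dim=0)

    classifier.eval()
    with torch.no_grad():
        logits = classifier(batch)
    # One predicted class index (SKU id) per crop.
    return logits.argmax(dim=1)


# Untrained stand-in classifier, used only to show the data flow.
num_skus = 5
toy_classifier = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, num_skus))

shelf = torch.rand(3, 480, 640)                             # fake full-shelf image
detected_boxes = [(10, 20, 110, 220), (300, 50, 420, 260)]  # fake detector output
sku_ids = crop_and_classify(shelf, detected_boxes, toy_classifier)
```

Batching all crops into a single forward pass keeps the classifier call cheap even when a shelf contains dozens of products.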

Why use PyTorch for image classification?

Some defining features of the PyTorch framework (Source: PyTorch Official)

PyTorch is an open source deep learning framework used for developing and training custom neural networks. It is widely used by researchers and developers due to its flexibility and ease of use. PyTorch also has a number of unique features that make it well suited for image classification tasks. We’ll dive into these themes in the product recognition walkthrough below.

  • PyTorch offers a number of pretrained models that can be used for image classification. These models have been trained on large datasets and give strong results on many image classification tasks. The companion torchvision package also provides ready-made dataset classes for the most common image classification datasets.

  • PyTorch offers a number of powerful tools for data augmentation and preprocessing. These tools can be used to automatically generate new training data from existing data, which is especially important for product classification tasks where it is often difficult to obtain a large dataset. Many of the popular scikit-learn preprocessing tools fit easily into PyTorch pipelines, or can be combined with PyTorch models through the skorch wrapper (skorch-dev). The same torchvision package mentioned above offers transforms for images and videos.

  • PyTorch offers a number of tools for visualizing and debugging neural networks. This is very important for product classification tasks where it is often necessary to understand what the model is actually doing in order to debug it or improve its performance.


  • PyTorch models are also easy to export to the ONNX format for use in new environments and for maximizing performance on a range of hardware.
ONNX framework
Exploring an image classifier exported from PyTorch to ONNX in Netron

Benefits of PyTorch over TensorFlow?

There are a few reasons why we prefer PyTorch over TensorFlow for this image classification system. The biggest reason is that we find PyTorch much easier to use and debug than TensorFlow for custom architectures, which speeds up development and reduces the number of iteration cycles required to reach production. Its flexibility allows much more customization in how the model works and lets us fit a model to the given task. This has also led to PyTorch becoming the favored framework for research, with new projects and pre-trained models coming out regularly. We can use these, together with the popular architectures available in PyTorch, to build better AI systems faster.

Graph of published paper percentages that use PyTorch vs TensorFlow (Source)

What popular architectures are included in PyTorch?

Some of the most popular architectures available through PyTorch's torchvision package are AlexNet, VGG, Inception, and ResNet. Each of these is available with weights pre-trained on a large dataset and provides a strong starting point for many image classification tasks. Depending on the individual needs, the models often need to be fine-tuned or customized for the specific application to get the best results.

Since PyTorch has become the framework of choice for much of the research community in recent years, using it lets us apply cutting-edge techniques as they are developed and published by researchers around the world.

How can this be used for product recognition?

Product recognition systems are trained to detect and classify objects in images, just like any other image classification model. The main difference is in the type of data that is used to train the model. Product recognition models are trained on a dataset of product images, which can be gathered from online resources or taken from a company's product catalog. They can then be used to automate sorting and organizing products, building a catalog, taking inventory, or tracking product information. Computer vision product recognition combined with text analysis can also be used to automatically generate targeted marketing content or product recommendations.

width.ai product recognition pipeline for retail
Here’s our other product recognition pipeline that uses text similarity alongside the computer vision tasks

The step-by-step process of going live with a product classification model

ai lifecycle
A typical lifecycle of an AI project (Source)

There are a few steps that need to be followed in order to deploy a product classification model. We’ll walk through each of these steps in detail in the next section. 

  • First, the data needs to be collected and labeled. Often, these labels are already available in the product database, but can also be derived in other ways, such as manual labeling. You’ll need either cropped product images from the environment or white background product images in most use cases. This process can also be sped up through use of OCR, which would read the labels of products, helping with labeling. Barcodes could also be used, though blurriness in hand-taken photos may prove an additional challenge.
cropped product example

  • Second, the data needs to be preprocessed and augmented. This is especially crucial when the products may appear under different lighting, in different arrangements, with small variations, or partly hidden by other objects (occlusion). This step can often be done using the tools provided by PyTorch, combined with the torch.utils.data.Dataset and torch.utils.data.DataLoader classes during training.

  • Third, the model needs to be trained. This can be done using a fully custom model or leveraging a pretrained model. 

  • Fourth, the model needs to be tested and deployed. The model will be deployed in the cloud as a part of a REST API pipeline that allows you to interact with it through a mobile app, website, or software application. 

  • Finally, the model needs to be monitored and maintained. This includes regularly retraining the model on new data and monitoring the performance of the model in production.
Weights & Biases dashboard
Model monitoring in Weights & Biases
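To make the data-loading part of these steps concrete, here is a minimal sketch of a custom `Dataset` wrapped in a `DataLoader`. The class name `ProductImageDataset`, the in-memory random tensors, and the flip transform are assumptions made for illustration; a real dataset would read labeled product images from disk or from the product database.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class ProductImageDataset(Dataset):
    """Hypothetical dataset holding product images and integer SKU labels."""

    def __init__(self, images, labels, transform=None):
        self.images = images
        self.labels = labels
        self.transform = transform

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        image = self.images[idx]
        if self.transform is not None:
            # Preprocessing / augmentation is applied per sample, on the fly.
            image = self.transform(image)
        return image, self.labels[idx]


# Fake data: 8 RGB images of 32x32 pixels spread over 4 product classes.
images = torch.rand(8, 3, 32, 32)
labels = torch.randint(0, 4, (8,))


def flip_horizontally(img):
    # Stand-in transform: mirror the image along its width axis.
    return torch.flip(img, dims=[-1])


dataset = ProductImageDataset(images, labels, transform=flip_horizontally)
loader = DataLoader(dataset, batch_size=4, shuffle=True)

batch_images, batch_labels = next(iter(loader))
```

Keeping the transform as a constructor argument means the same dataset class can serve training (with augmentation) and evaluation (without it).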

Building a product recognition pipeline with PyTorch image classification

We will go over these steps in a use case example for food retail shelves, with the goal of automating parts of the inventory management process. We want to be able to take a photo of a product shelf and see which products appear on it, so that inventory taking can be automated. We’ll focus on a two-model architecture that finds local features in the full-view image and then classifies the exact product SKUs.

example of a retail shelf
Example image of a retail shelf for product recognition 

Our use case has the additional constraint that we would like to use a single white-background image per product for training. These images are generally available for most products, which saves us the step of creating training images by cropping each product out of full shelf images.

A variety of image products as they may appear on a potential grocery retailer's website

The issues here normally come from the fact that these images are much cleaner than what you will have for each product on the full shelf. Product images are centered, front-facing, and much clearer than what you are left with after cropping. In our experience, adding preprocessing to the training workflow is the most reliable way to replicate the high level of noise you will see in the production setting.

The effects of photographing these shelves at an angle are magnified for products further from the camera, which become more difficult to read and look much less like their catalog images.

This constraint influences our training strategy and choice of AI model. We settle on a network and training approach with one-shot learning capabilities, meaning the model does not need many images of the same product from many angles in order to recognize it. That leads us to a pipeline consisting of a feature extractor pretrained on ImageNet and a TransformNet architecture. The two neural networks perform separate tasks and feed into each other to learn to recognize products from a single image in many contexts. We can construct the networks using PyTorch's torch.nn module, which lets us define large neural networks in a few lines of code.

PyTorch CNN
Convolutional Neural Network Architecture in PyTorch for Image Classification
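To show how compactly `torch.nn` expresses this kind of model, here is a small sketch of recognizing a product from a single reference image by comparing embeddings. It is not the TransformNet pipeline described above: `TinyEmbedder` is a hypothetical, untrained stand-in for the pretrained feature extractor, and the catalog and query images are random tensors, so the predicted match is meaningless. Only the structure is the point.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyEmbedder(nn.Module):
    """Hypothetical small CNN that maps an image to a unit-length embedding."""

    def __init__(self, embedding_dim=32):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # global average pool to (N, 32, 1, 1)
        )
        self.head = nn.Linear(32, embedding_dim)

    def forward(self, x):
        x = self.features(x).flatten(1)
        # L2-normalize so a dot product equals cosine similarity.
        return F.normalize(self.head(x), dim=1)


embedder = TinyEmbedder().eval()

catalog_images = torch.rand(5, 3, 64, 64)  # one reference image per SKU
query_crop = torch.rand(1, 3, 64, 64)      # a crop taken from the shelf photo

with torch.no_grad():
    catalog_embeddings = embedder(catalog_images)
    query_embedding = embedder(query_crop)

# Cosine similarity between the query and every catalog product.
similarity = query_embedding @ catalog_embeddings.T
predicted_sku = similarity.argmax(dim=1)
```

A design like this means adding a new product only requires embedding its one catalog image, rather than retraining a classification head.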

First: The Data

We need relevant data both for training the model and evaluating its performance. For our use case example, our relevant datasets would be a product database, such as the one used for the client's online store, or internal CMS with images of the products. 

product examples from our database
Each of these are a predicted class from our database

In our case, the target data of stocked shelves with products will need to be collected and labeled. How much data is needed depends largely on how much variety there is in the data (store decoration, different shelving types, how much the lighting in the store differs) and how well existing datasets and pre-trained models can be leveraged. Good planning and use of existing labeled datasets can save thousands of dollars in data labeling work. For the supermarket product recognition model, a public research database exists that we can use to build a proof-of-concept to see if our idea is feasible.

Using a training dataset with a high variation of product views is key to operating in a production environment. Pictures of the retail shelves will never be perfectly straight on, and the front side of products won’t always be easy to see. If you don’t include variations of the visible product, your training accuracy will be much higher than your test accuracy, and model optimizations alone will not greatly improve the results.

bounding box recognition on the retail shelf

Labeling the data is often the most time-intensive part of any project; smart planning can help reduce time and cost

Second: Preprocessing & Data Science

The goal of the application also shapes the approach in pre-processing. Common techniques are changing the colors, size, angle and cropping of images to extract as much useful data from the training data as possible. PyTorch offers a variety of built in methods for augmenting the training image data such as flipping, offset, color shift and others through its torchvision.transforms module. 

data augmentation for increased accuracy
Data augmentation adds noise to the training images. It greatly narrows the gap between training accuracy and test accuracy.

For our retail product recognition tool, overly aggressive data augmentation would be counterproductive: products can look very similar to each other, and small changes in color might actually make results worse. We settle on simple cropping and normalization. Normalization rescales pixel values using the mean and standard deviation of the training data, which generally helps neural networks train more stably.

Third: Model Training & Parameters

It is often possible to save time and resources by taking a pre-trained model and performing so-called "fine-tuning", which adapts an existing model to a new, specific purpose. Just as important is the choice of an appropriate "objective function" to calculate the model loss, which determines how the model evaluates its performance during training.

Since our model will both classify and localize the product in the image, we require two loss functions, a localization loss measure that gives the location error and a prediction loss that measures classification error.

We use nn.HingeEmbeddingLoss and nn.SmoothL1Loss from PyTorch's built-in loss functions for classification and localization respectively. ReLU is the most commonly used activation function; it adds the non-linearity the network needs.

loss functions in PyTorch
SmoothL1 Loss Function
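The two losses can be combined into a single training objective as a weighted sum. The sketch below uses tiny hand-picked tensors so the arithmetic is easy to follow; the weighting factor `loc_weight` and the way distances and match labels are produced are assumptions for illustration, not details of our trained model. With its default settings, `nn.SmoothL1Loss` is quadratic (0.5 * x^2) for errors below 1 and linear (|x| - 0.5) above, and `nn.HingeEmbeddingLoss` expects targets of 1 (match) or -1 (non-match).

```python
import torch
import torch.nn as nn

localization_loss_fn = nn.SmoothL1Loss()                    # box regression
classification_loss_fn = nn.HingeEmbeddingLoss(margin=1.0)  # match / non-match

# Localization: errors of 0.5 and 2.0 against a target of zero.
#   0.5 -> 0.5 * 0.5**2 = 0.125   (quadratic region)
#   2.0 -> 2.0 - 0.5    = 1.5     (linear region)
predicted_boxes = torch.tensor([[0.5, 2.0]])
target_boxes = torch.tensor([[0.0, 0.0]])
loc_loss = localization_loss_fn(predicted_boxes, target_boxes)  # mean = 0.8125

# Classification: distances between a crop and two catalog products.
#   label  1 (same product):      loss = distance          = 0.2
#   label -1 (different product): loss = max(0, 1.0 - 0.3) = 0.7
distances = torch.tensor([0.2, 0.3])
match_labels = torch.tensor([1.0, -1.0])
cls_loss = classification_loss_fn(distances, match_labels)      # mean = 0.45

loc_weight = 1.0  # hypothetical balance between the two terms
total_loss = cls_loss + loc_weight * loc_loss
```

Tuning `loc_weight` trades off how strongly the optimizer prioritizes accurate boxes against correct product matches.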

After training we see the model is very good at recognizing arbitrary amounts of the products on the shelf, giving us not just the product presence, but also its location with satisfying accuracy, which we quantify through exhaustive experiments.

Our workflow for product recognition
Visualizing the results of our model

Fourth: Deployment

Once we have a model that produces satisfying results, it needs to be made robustly accessible to wherever it will be eventually used, be that in an app on a mobile device or the background of a large scale server. 

Workflow architecture used for product classification
Workflow architecture

Deployment and cloud infrastructure can be more difficult for image inputs than for other machine learning domains. Input images are often large, which can cause slowdowns in batch scenarios, so building an asynchronous architecture to take in and process these images is valuable here. Some managed deployment services don’t allow inputs larger than 5MB, which can be a serious problem for our images. The common tradeoff is to reduce the image quality but, as we’ve seen above, this can create issues in high-noise retail shelf environments.
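One way to picture the asynchronous intake is a queue of uploaded images served by a small pool of workers. The sketch below uses only Python's standard `asyncio` library. `classify_image` is a hypothetical stub for the real model call, the job ids are made up, and `MAX_UPLOAD_BYTES` simply mirrors the 5MB figure mentioned above (actual limits vary by service).

```python
import asyncio

MAX_UPLOAD_BYTES = 5 * 1024 * 1024  # assumed size cap; varies by service


async def classify_image(image_bytes):
    """Hypothetical stand-in for the real (slow) model inference call."""
    await asyncio.sleep(0)  # yield control, as a real I/O-bound call would
    return {"size": len(image_bytes), "sku": "unknown"}


async def worker(queue, results):
    # Each worker pulls jobs until it is cancelled.
    while True:
        job_id, image_bytes = await queue.get()
        try:
            if len(image_bytes) > MAX_UPLOAD_BYTES:
                results[job_id] = {"error": "image too large"}
            else:
                results[job_id] = await classify_image(image_bytes)
        finally:
            queue.task_done()


async def process_uploads(uploads, num_workers=2):
    queue = asyncio.Queue()
    results = {}
    workers = [asyncio.create_task(worker(queue, results))
               for _ in range(num_workers)]
    for job_id, image_bytes in uploads.items():
        queue.put_nowait((job_id, image_bytes))
    await queue.join()  # wait until every queued job has been handled
    for task in workers:
        task.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return results


uploads = {
    "shelf-1": b"\x00" * 1024,                    # small image, accepted
    "shelf-2": b"\x00" * (MAX_UPLOAD_BYTES + 1),  # over the cap, rejected
}
results = asyncio.run(process_uploads(uploads))
```

In a real deployment the queue would usually be an external service and oversized images would be resized or tiled rather than rejected, but the separation between accepting an upload and processing it is the same.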

Finally: Testing and Maintenance

The testing and maintenance workflow for computer vision based systems needs to be much more rigorous than for other, similar machine learning systems, for a few reasons:

- Image inputs generally have more background noise and fewer relevant features than text-based problems

- Even without a change of domain, production data drifts away from what was seen in the test environment more quickly in computer vision problems. Slight changes to angle, brightness, display sizes, and distance to products all affect the results.

- Pre-trained models with massive, domain-agnostic training (like LLMs for text) are not as readily available in computer vision.

A robust, iterative training and optimization workflow that is planned out from start to end allows any needed changes to be made seamlessly, and lets us improve the system as our understanding of how real users produce images grows. Regular testing, user experience gathering, and good model versioning allow the models to be continually improved. The models can be automatically tested each time a new release is pushed out or changes are made to our training dataset.
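An automated check of this kind can be as simple as measuring accuracy on a held-out set and comparing it with the model currently in production. The sketch below is one possible shape for such a gate; the function names, the tolerance value, and the random hold-out data are assumptions made for illustration.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset


def evaluate_accuracy(model, loader):
    """Return the fraction of correctly classified samples in the loader."""
    model.eval()
    correct = 0
    total = 0
    with torch.no_grad():
        for images, labels in loader:
            predictions = model(images).argmax(dim=1)
            correct += (predictions == labels).sum().item()
            total += labels.numel()
    return correct / total


def passes_release_gate(candidate_accuracy, production_accuracy, tolerance=0.01):
    """Allow a release only if accuracy has not dropped by more than tolerance."""
    return candidate_accuracy >= production_accuracy - tolerance


# Fake hold-out set and an untrained candidate model, just to run the check.
holdout = TensorDataset(torch.rand(16, 3, 8, 8), torch.randint(0, 3, (16,)))
holdout_loader = DataLoader(holdout, batch_size=8)
candidate = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 3))

candidate_accuracy = evaluate_accuracy(candidate, holdout_loader)
```

Running the same evaluation whenever the training dataset changes gives an early signal if newly added data has hurt performance on the existing products.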

We often deploy training pipelines that plug directly into product databases to allow for quick, automated fine-tuning of our models. This data comes from real product use, which greatly improves how well our training data covers the variance seen in production. Reducing the friction required to deploy new versions of models is one of the best ways to reduce total development time and extend the lifecycle of AI products.

big ai logo
Keeping up with advances in AI research is crucial to ensure the system can improve over the long term (Source)

Interested in building an image classification system?

Width.ai builds custom NLP & computer vision software (just like this product classification model!) for businesses to leverage internally or as a part of their production. Schedule a call today and let’s talk about how we can help you deploy custom image classification models, or any other computer vision system. 
