In recent years, nutrition and healthcare have gained much prominence in people's lives. A large number of people around the world are suffering from the long-term health effects of COVID, and doctors have been suggesting lifestyle and dietary approaches to alleviate those effects.
Elsewhere, greater awareness about public health has motivated people to change their food habits. But many people are concerned about the long-term nutritional effects of such changes on their health, wondering if they should add any supplements or more nutritious food to their diet.
Nutrition-focused artificial intelligence models can help answer such questions and support people's desire for healthier lives with less chronic disease. In this article, you will learn about the current and future applications of deep learning, machine learning algorithms, and AI in nutrition.
Food identification is usually the first step in most human nutrition applications. If you run a restaurant, a backend food vendor, or a food-related SaaS business, you need it to automatically report nutrition-related data for the foods on display. If you're a customer, you can use your smartphone to snap a photo of a dish and have it identified before finding out its calories.
In this section, we explain some state-of-the-art food identification techniques based on the research paper, MyFood: A Food Segmentation and Classification System to Aid Nutritional Monitoring. By understanding these techniques, you can adapt them for your particular use case, cuisine, or ingredients.
Most identification problems tend to be classification tasks that report a single label for an entire photo. But food photos in the real world are too complex for a single label. A photo may contain multiple foods, plates, cutlery, tables, and other scene objects. Domain-specific problems like high inter-class similarity — for example, between different types of coffee or cheeseburger variations (cheeseburger vs vegan burger vs bison burger) — and general problems like occlusion are common. A good system should be capable of overcoming them, isolating just the food items accurately, and identifying them.
A more difficult task is to identify all the ingredients in the food. For example, it may not be enough to identify the main food in a photo as a pizza but more specifically as a pizza containing salami, pepperoni, mozzarella, and tomato sauce. Every pixel in the photo may belong to a different ingredient, vegetable, or meat.
That level of fine-grained detail requires labeling every pixel with a food or ingredient name. The computer vision term for classifying every pixel that way is “segmentation.” Food identification is a problem where segmentation is the most reliable and informative approach. In the next section, we’ll explore some neural network architectures for segmentation and see how we can improve dietary assessment.
Many deep neural architectures for segmentation have been proposed over the years. The basic problem they all have to solve is to scale up the convolutional filter maps back to the input image’s dimensions so that every pixel in the image can get a label. They differ in their upsampling approaches, layers, training performance, and other aspects. Let’s survey five popular architectures.
A fully convolutional network (FCN) consists of only convolutional layers. Unlike a typical classification network, it does not have a fully connected layer to combine convolutional filter maps into a 1-dimensional vector before classification. Instead, even the final layers consist of a special type of convolutional filter that scales the filter maps back up to the input image’s dimensions and generates a label for every pixel in the process.
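As a sketch of this idea (not the exact FCN from the paper), a toy fully convolutional network in PyTorch can downsample with strided convolutions and then use transposed convolutions to scale the filter maps back to the input size, yielding a class score for every pixel. The layer sizes here are illustrative:

```python
import torch
import torch.nn as nn

class TinyFCN(nn.Module):
    """Minimal fully convolutional network: no fully connected layers,
    so the output keeps spatial dimensions and labels every pixel."""
    def __init__(self, num_classes=5):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),   # H/2 x W/2
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),  # H/4 x W/4
        )
        # Transposed convolutions scale the filter maps back up to the
        # input resolution, producing one class score per pixel.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 2, stride=2), nn.ReLU(),    # H/2 x W/2
            nn.ConvTranspose2d(16, num_classes, 2, stride=2),      # H x W
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

x = torch.randn(1, 3, 64, 64)           # one RGB image
logits = TinyFCN(num_classes=5)(x)      # shape: (1, 5, 64, 64)
pred = logits.argmax(dim=1)             # per-pixel class labels: (1, 64, 64)
```

A real food segmentation model would use a much deeper pre-trained backbone, but the structural point is the same: every layer is convolutional, so the spatial map survives end to end.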
SegNet follows an encoder-decoder architecture that is common to many segmentation networks. The encoder section consists of convolutional and pooling layers to successively downsample the input image. The decoder section consists of convolutional layers to upsample it back to the image’s dimensions. Where it differs from other networks is in using the idea of pooling indices to reverse the effects of pooling operations, making it lighter on processing and memory than other networks.
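PyTorch exposes this mechanism directly: `MaxPool2d` can return the indices of the max values it kept, and `MaxUnpool2d` uses those indices to put values back in their original positions. A minimal sketch:

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(2, stride=2)

x = torch.randn(1, 8, 32, 32)
pooled, indices = pool(x)           # (1, 8, 16, 16) plus the argmax locations
restored = unpool(pooled, indices)  # (1, 8, 32, 32): max values back in place

# The decoder only needs to store the pooling indices, not full encoder
# feature maps, which is what makes SegNet light on memory.
```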
ENet borrows the idea of residual connections from ResNet to improve the flow of features across its layers. ENet's specialty lies in being extremely lightweight: it's light enough to run on smartphones, making it an excellent model for mobile apps.
DeepLabv3 is a slightly older architecture that uses atrous (dilated) convolutions to enlarge the receptive field without downsampling the feature maps, reducing the amount of upsampling needed at the end.
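The effect of an atrous convolution is easy to demonstrate: in PyTorch it's just the `dilation` argument of an ordinary convolution. The sizes below are illustrative:

```python
import torch
import torch.nn as nn

# A dilated (atrous) convolution spreads out its kernel taps. With
# dilation=2, a 3x3 kernel covers a 5x5 neighborhood but still has only
# 9 weights, enlarging the receptive field without extra downsampling.
conv = nn.Conv2d(3, 16, kernel_size=3, dilation=2, padding=2)

x = torch.randn(1, 3, 64, 64)
y = conv(x)   # padding == dilation keeps the spatial size: (1, 16, 64, 64)
```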
Mask R-CNN consists of two networks — a Faster R-CNN object detection network followed by an FCN.
The paper compared these five architectures on five segmentation metrics.
They found that the two FCN-based models reported the best values, significantly outperforming SegNet, ENet, and DeepLabv3. However, remember that some architectures — like ENet — deliberately sacrifice accuracy to efficiently run on resource-constrained devices like smartphones.
The image above shows the segmentation results of the five models on example food items. Notice how all of them were able to isolate the food from containers and other non-food objects.
However, food identification is usually just the first step. In the next section, we examine how it’s used to get important nutritional information.
Many people try to be very conscious of their calorie and nutrient intake for various reasons — dietary, fitness, weight loss, health care, physical activity, or just general well-being. People monitor their intake values based on the daily values and serving sizes printed on food labels. Ideally, their intake levels should conform to the metrics defined by medical bodies, like the dietary reference intake, the recommended dietary allowance (RDA), or the adequate intake.
However, every person has different ideas of what a portion of food should be. Nutrition research has shown that most people are not good at estimating nutritional information accurately from food photos. A smart system that can accurately estimate the calories and nutrition in any quantity of food can be very helpful for health and fitness. It can act as a nutritionist, recommending personalized meal plans.
In this section, you’ll learn about a technique to do this by combining computer vision and natural language processing. It’s based on the paper Multi-Task Learning for Calorie Prediction on a Novel Large-Scale Recipe Dataset Enriched with Nutritional Information.
A nutrition database contains authentic calorie, macronutrient, micronutrient, and other types of nutritional information about thousands of ingredients and food items in a structured format. These nutritional values must be measured for standardized, not ad hoc, quantities of food.
The research team prepared this database from three sources.
The text information in these databases, like the names of ingredients and food items, is converted to vectorized embeddings using Google's universal sentence encoder so that it can be used for semantic searching later on. Semantic searching means matching text based on its meaning rather than keywords.
The next step is to prepare a recipe database where each recipe is accompanied by real-world photos of the food items along with detailed ingredient lists and quantities. The research team prepared this by processing information from recipe websites. But public datasets like Recipe1M+, which are popular in food-related AI research, can also be useful here.
A typical recipe will list the ingredients and quantities it uses. Quantities are usually expressed in units like grams, ounces, milliliters, or pieces. The goal in this step is to accurately estimate the nutritional values for each ingredient, given its quantity in the recipe and the nutritional values in standard quantities mentioned in the nutrition database of step 1.
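The scaling itself is simple arithmetic. A minimal sketch, assuming nutritional values stored per 100 g of each ingredient (the field names and sample values here are illustrative, not the paper's actual schema):

```python
# Nutrition databases typically report values per standard quantity
# (assumed here to be 100 g). A recipe ingredient's values are scaled
# from that reference. All field names and numbers are illustrative.
NUTRITION_DB = {
    "mozzarella": {"per_grams": 100, "kcal": 280, "protein_g": 28, "fat_g": 17},
    "tomato sauce": {"per_grams": 100, "kcal": 29, "protein_g": 1.6, "fat_g": 0.2},
}

def scale_nutrition(ingredient: str, grams: float) -> dict:
    """Scale an ingredient's standard-quantity values to a recipe quantity."""
    entry = NUTRITION_DB[ingredient]
    factor = grams / entry["per_grams"]
    return {k: round(v * factor, 2)
            for k, v in entry.items() if k != "per_grams"}

print(scale_nutrition("mozzarella", 150))
# {'kcal': 420.0, 'protein_g': 42.0, 'fat_g': 25.5}
```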
First, ingredient names are converted to vectorized embeddings using the same universal sentence encoder. This enables semantically matching them to the ingredient names in the nutrition database. Semantic matching is necessary because different people may use slightly different names, spellings, or phrases for the same item. For example, a recipe may use the word “eggplant” while the database uses the word “aubergine.”
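The matching step then reduces to nearest-neighbor search by cosine similarity in embedding space. The sketch below uses small hand-made vectors as hypothetical stand-ins for real sentence-encoder embeddings, chosen so that "eggplant" lands near "aubergine":

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy stand-ins for sentence-encoder embeddings. In the real pipeline,
# "eggplant" and "aubergine" land close together in embedding space
# even though the strings share no keywords.
db_embeddings = {
    "aubergine": [0.9, 0.1, 0.0],
    "mozzarella": [0.0, 0.2, 0.95],
}

def best_match(query_vec, db):
    """Return the database entry whose embedding is most similar."""
    return max(db, key=lambda name: cosine(query_vec, db[name]))

eggplant_vec = [0.88, 0.15, 0.05]  # hypothetical embedding of "eggplant"
print(best_match(eggplant_vec, db_embeddings))  # -> aubergine
```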
Quantities mentioned in the recipes are extracted using NLP-based information extraction techniques. If an ingredient's quantity cannot be reliably extracted, the ingredient is excluded from the process to keep the dataset accurate.
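A minimal rule-based extractor for the common "<quantity> <unit> <ingredient>" pattern might look like the following. Real pipelines use fuller NLP techniques; the regex and unit list here are illustrative:

```python
import re

# Matches lines like "250 g mozzarella" or "2 cups flour" and returns
# None for anything that can't be parsed reliably, mirroring the
# exclusion rule described above. Units covered are illustrative.
QTY_RE = re.compile(
    r"^\s*(?P<amount>\d+(?:\.\d+)?)\s*"
    r"(?P<unit>g|grams?|oz|ounces?|ml|pieces?|cups?)\s+"
    r"(?P<name>.+)$",
    re.IGNORECASE,
)

def parse_ingredient(line):
    m = QTY_RE.match(line)
    if not m:
        return None  # quantity not reliably extractable: exclude
    return float(m.group("amount")), m.group("unit").lower(), m.group("name").strip()

print(parse_ingredient("250 g mozzarella"))  # (250.0, 'g', 'mozzarella')
print(parse_ingredient("a pinch of salt"))   # None
```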
At this point, given a recipe name, the system can report its list of ingredients, their quantities in the recipe, and the nutritional information for standard sizes of that ingredient. But so far it’s just a text retrieval system that uses natural language processing.
In this step, the system is enhanced with image recognition capabilities. The goal is that when it sees a real-world photo of a food portion, it’s able to visually match it to one of the items in the recipe database.
The research team trained and evaluated two different deep neural networks.
Both networks were pre-trained on the ImageNet database and then fine-tuned on the real-world food photos from the recipe database. Both are image classification networks — given a photo, they output a recipe name based on how similar it is to a photo in the recipe database.
But the end goal here is not food identification. Instead, it is to estimate the nutritional information for the food portion shown in the photo: its portion size, its calories, the weights of protein, carbohydrates, and fat it contains, and its list of ingredients.
Predicting the number of calories and other nutrient values is a regression task. Listing the ingredients in the food item is a multi-label classification task. Overall, this is a multi-task learning problem.
One approach to multi-task learning is to train independent networks for each task. But that approach fails to leverage information from the other tasks. So the research team used an end-to-end training approach where the same network executes all the tasks.
In this case, the food identification neural network's final classification layer is replaced with a regression layer and a multi-label classification layer. The regression layer consists of five neurons to estimate the portion size, calories, and weights of proteins, carbohydrates, and fats in the food portion. The classification layer consists of 100 sigmoid neurons to output ingredient labels.
The total loss function for the full network consists of L1 losses for the five regression neurons and binary cross-entropy losses for the 100 ingredient neurons.
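Putting the two heads and the combined loss together, a hedged PyTorch sketch might look like the following. The backbone feature size of 512 is an assumption, and the per-ingredient sigmoid behavior comes from using binary cross-entropy with logits:

```python
import torch
import torch.nn as nn

class MultiTaskHead(nn.Module):
    """Shared image embedding feeding two heads: 5 regression outputs
    (portion size, calories, protein, carbs, fat) and 100 ingredient
    scores. The 512-dim feature size is an assumption, not the paper's."""
    def __init__(self, feat_dim=512, num_ingredients=100):
        super().__init__()
        self.regression = nn.Linear(feat_dim, 5)
        self.ingredients = nn.Linear(feat_dim, num_ingredients)

    def forward(self, features):
        return self.regression(features), self.ingredients(features)

head = MultiTaskHead()
features = torch.randn(8, 512)            # batch of backbone embeddings
reg_pred, ing_logits = head(features)

reg_target = torch.rand(8, 5)             # dummy targets for the sketch
ing_target = torch.randint(0, 2, (8, 100)).float()

# Total loss: L1 on the five regression outputs plus binary cross-entropy
# (a sigmoid per ingredient) on the 100 ingredient outputs.
loss = (nn.functional.l1_loss(reg_pred, reg_target)
        + nn.functional.binary_cross_entropy_with_logits(ing_logits, ing_target))
loss.backward()
```

In practice the two heads sit on top of a pre-trained convolutional backbone, and the whole network is fine-tuned end to end so that gradients from both tasks shape the shared features.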
For inference, given a photo, the network outputs estimates of the portion size, calories, and weights of proteins, carbohydrates, and fats in the portion, along with its predicted ingredient labels.
The estimation of calories and nutrition using regression works like this. During the end-to-end training phase, the food photos, portion sizes reported in recipes, calories per unit mass, and nutrition per unit mass from the nutrition database form the inputs to the model.
The convolutional layers of the network produce filter maps that embed visual features about the food items, like colors, textures, color and texture transitions, size of food relative to other items in the scene like plates, and so on. For each image, all the filter maps are linearized to a single vector embedding that becomes the input to the regression layer.
The regression layer learns to associate the visual features embedded in these vectors with the food's per-unit-mass calorie and nutrition values. If the training images include a wide range of portion sizes and food items, the model generalizes better. The model essentially learns to predict calorie and nutrition values by eyeballing the different features of the food in a photo.
The image shows some sample outputs produced by the model along with nutritional estimates.
The six steps in this approach are easily customizable for your use cases. Below are some potential enhancement ideas made possible by advances in deep learning:
Greater awareness of nutrition science, along with progress in genetics and artificial intelligence, has enabled advances like the identification of food ingredients that act like disease-fighting drugs. It's now possible to automate personalized nutrition plans and dietary assessments based on a patient's genetic profile and medical history. Smartphones can provide instant personalized information on nutrition and health based on photos or medical sensor readings.
At Width.ai, we have years of technical and domain expertise working on smart health systems that use the latest deep learning and machine learning algorithms to extract insights from photos, medical reports, and other complex real-world data. We have built highly efficient food recognition systems with 98.57% accuracy.
Get in touch to see how we can build custom artificial intelligence models for the nutrition and human health domain.
Robin Ruede, Verena Heusser, Lukas Frank, Alina Roitberg, Monica Haurilet, and Rainer Stiefelhagen. "Multi-Task Learning for Calorie Prediction on a Novel Large-Scale Recipe Dataset Enriched with Nutritional Information." arXiv:2011.01082 [cs.CV]. https://arxiv.org/abs/2011.01082