AI in Nutrition: How Technology Is Transforming What We Eat

Karthik Shiraly
September 7, 2021
Learn about modern applications of AI in nutrition, from food identification using deep neural network segmentation to estimating calories and nutrition. 

In recent years, nutrition and healthcare have gained much prominence in people’s lives. A large number of people around the world are suffering the long-term health outcomes of COVID, and doctors have been suggesting lifestyle and dietary approaches to alleviate those effects.

Elsewhere, greater awareness about public health has motivated people to change their food habits. But many people are concerned about the long-term nutritional effects of such changes on their health, wondering if they should add any supplements or more nutritious food to their diet.

Nutrition-focused artificial intelligence models can help answer such questions and support people’s desire for healthier diets and less chronic disease. In this article, you will learn about current and future applications of deep learning, machine learning algorithms, and AI in nutrition.

Food Identification Using Deep Neural Network Segmentation

Food identification is usually the first step in most human nutrition applications. If you run a restaurant, a back-end food vendor, or a food-related SaaS business, you need it to automatically report the nutrition-related data of foods on display. If you’re a customer, you can use your smartphone to take a photo of a dish and have it identified before finding out its calories.

In this section, we explain some state-of-the-art food identification techniques based on the research paper, MyFood: A Food Segmentation and Classification System to Aid Nutritional Monitoring. By understanding these techniques, you can adapt them for your particular use case, cuisine, or ingredients.

Segmentation for Food Identification With Artificial Intelligence

Most identification problems tend to be classification tasks that report a single label for an entire photo. But food photos in the real world are too complex for a single label. A photo may contain multiple foods, plates, cutlery, tables, and other scene objects. Domain-specific problems like high inter-class similarity — for example, between different types of coffee or cheeseburger variations (cheeseburger vs vegan burger vs bison burger) — and general problems like occlusion are common. A good system should be capable of overcoming them, isolating just the food items accurately, and identifying them.

A more difficult task is to identify all the ingredients in the food. For example, it may not be enough to identify the main food in a photo as a pizza but more specifically as a pizza containing salami, pepperoni, mozzarella, and tomato sauce. Every pixel in the photo may belong to a different ingredient, vegetable, or meat.

That level of fine-grained detail requires labeling every pixel with a food or ingredient name. The computer vision term for classifying every pixel that way is “segmentation.” Food identification is a problem where segmentation is the most reliable and informative approach. In the next section, we’ll explore some neural network architectures for segmentation and see how we can improve dietary assessment.
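To make per-pixel labeling concrete, here is a minimal sketch in plain Python. It assumes a model has already produced one score per class for every pixel; the class names and the score values are made up for illustration and are not taken from the paper.

```python
# Hypothetical class list; a real food model would have many more classes.
CLASSES = ["background", "pizza crust", "mozzarella", "salami"]


def segment(scores):
    """Turn per-pixel class scores into a per-pixel label map.

    scores[row][col] is a list holding one score per entry in CLASSES.
    Each pixel receives the label of its highest-scoring class.
    """
    return [
        [CLASSES[max(range(len(pixel)), key=pixel.__getitem__)] for pixel in row]
        for row in scores
    ]


# A 2x2 "image" with made-up scores for the four classes.
toy_scores = [
    [[0.90, 0.05, 0.03, 0.02], [0.10, 0.70, 0.10, 0.10]],
    [[0.10, 0.20, 0.60, 0.10], [0.05, 0.05, 0.10, 0.80]],
]
label_map = segment(toy_scores)
```

A classification model would return a single label for the whole photo. Here every pixel gets its own label, which is what lets a system separate ingredients from each other and from the plate.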

Segmentation Deep Neural Architectures

Many deep neural architectures for segmentation have been proposed over the years. The basic problem they all have to solve is to scale up the convolutional filter maps back to the input image’s dimensions so that every pixel in the image can get a label. They differ in their upsampling approaches, layers, training performance, and other aspects. Let’s survey five popular architectures.

1. Fully Convolutional Network

A fully convolutional network (FCN) consists of only convolutional layers. Unlike a typical classification network, it does not have a fully connected layer to combine convolutional filter maps into a 1-dimensional vector before classification. Instead, even the final layers consist of a special type of convolutional filter that scales the filter maps back up to the input image’s dimensions and generates a label for every pixel in the process.
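One common way to implement such an upscaling filter is a transposed convolution. The sketch below is a one-dimensional toy version in plain Python, assuming a fixed kernel and a stride of 2; in a real network the kernel weights are learned and the operation runs over two-dimensional filter maps.

```python
def transposed_conv1d(signal, kernel, stride=2):
    """Upsample a 1-D signal by spreading each input value through the kernel.

    Each input element is multiplied by every kernel weight and added into
    the output at positions that are `stride` apart per input element.
    """
    out_len = (len(signal) - 1) * stride + len(kernel)
    out = [0.0] * out_len
    for i, value in enumerate(signal):
        for k, weight in enumerate(kernel):
            out[i * stride + k] += value * weight
    return out


# Three coarse values become six fine-grained values.
upsampled = transposed_conv1d([1.0, 2.0, 3.0], [1.0, 1.0], stride=2)
```

With this particular kernel the result is simply each value repeated twice, `[1.0, 1.0, 2.0, 2.0, 3.0, 3.0]`. A learned kernel can produce smoother or sharper upsampling.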

2. SegNet

SegNet follows an encoder-decoder architecture that is common to many segmentation networks. The encoder section consists of convolutional and pooling layers to successively downsample the input image. The decoder section consists of convolutional layers to upsample it back to the image’s dimensions. Where it differs from other networks is in using the idea of pooling indices to reverse the effects of pooling operations, making it lighter on processing and memory than other networks.
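The pooling-indices idea can be sketched in a few lines. This is a simplified one-dimensional illustration in plain Python, not SegNet's actual implementation: the encoder remembers where each maximum came from, and the decoder puts values back at those positions instead of learning how to upsample.

```python
def max_pool_with_indices(values, size=2):
    """Max-pool non-overlapping windows and remember where each max was."""
    pooled, indices = [], []
    for start in range(0, len(values) - size + 1, size):
        window = values[start:start + size]
        offset = max(range(size), key=window.__getitem__)
        pooled.append(window[offset])
        indices.append(start + offset)
    return pooled, indices


def max_unpool(pooled, indices, length):
    """Place pooled values back at their remembered positions; zeros elsewhere."""
    out = [0.0] * length
    for value, index in zip(pooled, indices):
        out[index] = value
    return out


features = [0.1, 0.9, 0.4, 0.3, 0.7, 0.2]
pooled, idx = max_pool_with_indices(features)
restored = max_unpool(pooled, idx, len(features))
```

Only the integer indices need to be stored between encoder and decoder, which is where the memory saving comes from. The zeros left by unpooling are filled in by the decoder's convolutional layers.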

3. ENet

ENet builds on ResNet-style bottleneck blocks, whose residual connections improve the flow of features through the network. ENet’s specialty lies in being extremely lightweight: it’s light enough to run on smartphones, making it an excellent model for mobile apps.

4. DeepLabv3

DeepLabv3 uses atrous (dilated) convolutions, which widen each filter’s field of view without shrinking the filter maps, so less resolution has to be recovered when producing per-pixel labels.
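An atrous convolution spaces the kernel taps apart so that each output sees a wider stretch of the input with the same number of weights. The following is a minimal one-dimensional sketch in plain Python with a made-up signal and kernel; DeepLabv3 itself applies the idea to two-dimensional filter maps with learned weights.

```python
def atrous_conv1d(signal, kernel, rate=1):
    """1-D convolution whose kernel taps are `rate` positions apart.

    rate=1 is an ordinary convolution; larger rates widen the field of view
    without adding weights. No padding is applied.
    """
    span = (len(kernel) - 1) * rate
    return [
        sum(w * signal[i + k * rate] for k, w in enumerate(kernel))
        for i in range(len(signal) - span)
    ]


signal = [1, 2, 3, 4, 5, 6, 7]
dense = atrous_conv1d(signal, [1, 1, 1], rate=1)    # each output covers 3 inputs
dilated = atrous_conv1d(signal, [1, 1, 1], rate=2)  # each output covers 5 inputs
```

Both calls use the same three weights, but with `rate=2` each output summarizes a window that is five samples wide.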

5. Mask R-CNN

Mask R-CNN consists of two networks — a Faster R-CNN object detection network followed by an FCN.


Comparison of food segmentation networks (Source: Freitas et al.)

The paper compared these five architectures on five metrics:

  • Intersection over union
  • Positive prediction value
  • Sensitivity
  • Specificity
  • Balanced accuracy

The authors found that the two FCN-based models, FCN and Mask R-CNN, reported the best values, significantly outperforming SegNet, ENet, and DeepLabv3. However, remember that some architectures, like ENet, deliberately sacrifice accuracy to run efficiently on resource-constrained devices like smartphones.
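For reference, all five metrics can be computed from the counts of true and false positives and negatives. The sketch below uses the standard definitions on a flattened binary mask (1 for food, 0 for background); the masks are made up, and the paper may aggregate results per class or per image differently.

```python
def segmentation_metrics(predicted, truth):
    """Compute the five metrics for binary masks given as flat lists of 0/1.

    Assumes both masks contain at least one positive and one negative pixel,
    so none of the denominators is zero.
    """
    tp = sum(1 for p, t in zip(predicted, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(predicted, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(predicted, truth) if p == 0 and t == 1)
    tn = sum(1 for p, t in zip(predicted, truth) if p == 0 and t == 0)

    sensitivity = tp / (tp + fn)   # share of real food pixels that were found
    specificity = tn / (tn + fp)   # share of background pixels left alone
    return {
        "iou": tp / (tp + fp + fn),   # intersection over union
        "ppv": tp / (tp + fp),        # positive prediction value (precision)
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


predicted_mask = [1, 1, 1, 0, 0, 0, 0, 0]
true_mask      = [1, 1, 0, 1, 0, 0, 0, 0]
metrics = segmentation_metrics(predicted_mask, true_mask)
```

Balanced accuracy is useful for food photos because background pixels often far outnumber food pixels, and plain accuracy would reward a model that labels everything as background.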

Food Segmentation Results

Food segmentation results (Source: Freitas et al.)

The image above shows the segmentation results of the five models on example food items. Notice how all of them were able to isolate the food from containers and other non-food objects. 

However, food identification is usually just the first step. In the next section, we examine how it’s used to get important nutritional information.

Estimate Calories and Nutrition in a Portion of Food

Many people try to be very conscious of their calorie and nutrient intake for various reasons — dietary, fitness, weight loss, health care, physical activity, or just general well-being. People monitor their intake values based on the daily values and serving sizes printed on food labels. Ideally, their intake levels should conform to the metrics defined by medical bodies, like the dietary reference intake, the recommended dietary allowance (RDA), or the adequate intake.

However, every person has different ideas of what a portion of food should be. Nutrition research has shown that most people are not good at estimating nutritional information accurately from food photos. A smart system that can accurately estimate the calories and nutrition in any quantity of food can be very helpful for health and fitness. It can act as a nutritionist, recommending personalized meal plans.

In this section, you’ll learn about a technique to do this by combining computer vision and natural language processing. It’s based on the paper Multi-Task Learning for Calorie Prediction on a Novel Large-Scale Recipe Dataset Enriched with Nutritional Information.

Step 1: Process a Nutrition Database Using Natural Language Processing

A nutrition database contains authoritative calorie, macronutrient, micronutrient, and other types of nutritional information about thousands of ingredients and food items in a structured format. These nutritional values must be reported for standardized quantities of food, such as per 100 grams, rather than ad hoc amounts.

The research team prepared this database from three sources:

  • The U.S. Department of Agriculture’s food composition databases
  • Data from nutrition websites and food manufacturer websites
  • Data from food labels

The text information in these databases, like the names of ingredients and food items, is converted to vectorized embeddings using Google’s Universal Sentence Encoder so that it can be used for semantic searching later on. Semantic searching means matching text based on its meaning rather than keywords.

Step 2: Prepare a Recipe Database For Artificial Intelligence

The next step is to prepare a recipe database where each recipe is accompanied by real-world photos of the food items along with detailed ingredient lists and quantities. The research team prepared this by processing information on recipe websites. But public datasets like Recipe1M+ can also be useful here and are popular in artificial intelligence.

Step 3: Estimate Nutrition in Recipes Using Natural Language Processing

A typical recipe will list the ingredients and quantities it uses. Quantities are usually expressed in units like grams, ounces, milliliters, or pieces. The goal in this step is to accurately estimate the nutritional values for each ingredient, given its quantity in the recipe and the nutritional values in standard quantities mentioned in the nutrition database of step 1.

First, ingredient names are converted to vectorized embeddings using the same universal sentence encoder. This enables semantically matching them to the ingredient names in the nutrition database. Semantic matching is necessary because different people may use slightly different names, spellings, or phrases for the same item. For example, a recipe may use the word “eggplant” while the database uses the word “aubergine.”
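Semantic matching usually comes down to comparing embedding vectors, most often with cosine similarity. The sketch below assumes the embeddings have already been computed; the three-dimensional vectors are invented for illustration, whereas real sentence embeddings have hundreds of dimensions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up embeddings for names in a hypothetical nutrition database.
DATABASE = {
    "aubergine": [0.90, 0.10, 0.20],
    "courgette": [0.20, 0.90, 0.10],
    "egg":       [0.10, 0.20, 0.90],
}


def best_match(query_vector, database):
    """Return the database name whose embedding is closest to the query."""
    return max(database, key=lambda name: cosine(query_vector, database[name]))


eggplant_vector = [0.85, 0.15, 0.25]  # hypothetical embedding of "eggplant"
match = best_match(eggplant_vector, DATABASE)
```

A keyword search would find nothing in common between “eggplant” and “aubergine,” but an encoder that maps them to nearby vectors lets the lookup succeed.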

GPT-3 based ingredient extraction: go from unstructured recipes to extracted, structured ingredients with artificial intelligence

Quantities mentioned in the recipes are extracted using NLP-based information extraction techniques. If an ingredient’s quantity cannot be reliably extracted, it’s excluded from the process to keep the dataset accurate.
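As a rough illustration of quantity extraction, here is a small rule-based sketch in plain Python. The unit table and the parsing rules are assumptions made for this example, not the research team’s method. Lines that cannot be parsed return `None`, mirroring the idea of excluding unreliable quantities.

```python
import re

# Hypothetical unit table mapping spelling variants to a canonical unit.
UNITS = {
    "g": "g", "gram": "g", "grams": "g",
    "oz": "oz", "ounce": "oz", "ounces": "oz",
    "ml": "ml", "milliliter": "ml", "milliliters": "ml",
}

# A leading number (integer or decimal) followed by the rest of the line.
LINE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*(.*)$")


def parse_ingredient(line):
    """Return (amount, unit, name), or None if no quantity can be extracted."""
    found = LINE.match(line)
    if not found:
        return None
    amount = float(found.group(1))
    words = found.group(2).split()
    if not words:
        return None
    unit = UNITS.get(words[0].lower().rstrip("."))
    if unit is None:
        # No recognized unit: treat the number as a count of pieces.
        return amount, "piece", " ".join(words)
    name = " ".join(words[1:])
    if name.lower().startswith("of "):
        name = name[3:]
    return (amount, unit, name) if name else None


parsed = [parse_ingredient(text) for text in
          ["200 g flour", "250 ml of milk", "2 eggs", "a pinch of salt"]]
```

Rules like these are brittle for fractions, ranges, or free-form phrasing, which is one reason learned information extraction models are attractive for this step.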

At this point, given a recipe name, the system can report its list of ingredients, their quantities in the recipe, and the nutritional information for standard sizes of that ingredient. But so far it’s just a text retrieval system that uses natural language processing.
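Once an ingredient is matched and its quantity is known, scaling the database values to the recipe is simple arithmetic. The sketch below assumes the database stores values per 100 g and that quantities are already in grams; the nutrient numbers are round illustrative values, not real database entries.

```python
# Illustrative per-100 g values, not taken from any real nutrition database.
NUTRITION_PER_100G = {
    "flour":  {"kcal": 360.0, "protein_g": 10.0, "carbs_g": 76.0, "fat_g": 1.0},
    "butter": {"kcal": 720.0, "protein_g": 1.0,  "carbs_g": 0.0,  "fat_g": 80.0},
}


def ingredient_nutrition(name, grams):
    """Scale the per-100 g values of one ingredient to the given mass."""
    factor = grams / 100.0
    return {key: value * factor for key, value in NUTRITION_PER_100G[name].items()}


def recipe_nutrition(ingredients):
    """Sum the scaled values over a list of (name, grams) pairs."""
    totals = {"kcal": 0.0, "protein_g": 0.0, "carbs_g": 0.0, "fat_g": 0.0}
    for name, grams in ingredients:
        for key, value in ingredient_nutrition(name, grams).items():
            totals[key] += value
    return totals


totals = recipe_nutrition([("flour", 200.0), ("butter", 50.0)])
```

Quantities given in ounces, milliliters, or pieces would first need converting to grams, which requires unit factors and, for volumes and pieces, per-ingredient densities or typical weights.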

Step 4: Train a Food Identification Model For Image Recognition

In this step, the system is enhanced with image recognition capabilities. The goal is that when it sees a real-world photo of a food portion, it’s able to visually match it to one of the items in the recipe database.

The research team trained and evaluated two different deep neural networks:

  • ResNet
  • DenseNet

Both networks were pre-trained on the ImageNet database and then fine-tuned on the real-world food photos from the recipe database. Both are image classification networks — given a photo, they output a recipe name based on how similar it is to a photo in the recipe database.

Step 5: Train a Nutrition Estimation Model

Nutrition estimation model (Source: Ruede et al.)

But the end goal here is not food identification. Instead, it is to estimate the following nutritional information for the food portion shown in the photo:

  • Estimated portion size in units of mass
  • Number of calories
  • Grams of proteins, carbohydrates, and fats
  • Ingredients of the food item

Predicting the number of calories and other nutrient values is a regression task. Listing the ingredients in the food item is a multi-label classification task. Overall, this is a multi-task learning problem.

One approach to multi-task learning is to train independent networks for each task. But that approach fails to leverage information from the other tasks. So the research team used an end-to-end training approach where the same network executes all the tasks.

In this case, the food identification neural network’s final classification layer is replaced with a regression layer and a softmax layer. The regression layer consists of five neurons to estimate portion size, calories, weights of proteins, carbohydrates, and fats in the food portion. The softmax layer consists of 100 softmax neurons to output ingredient labels.

The total loss function for the full network consists of L1 losses for the five regression neurons and binary cross-entropy losses for the 100 ingredient neurons.
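A combined loss of this kind can be sketched in plain Python as follows. Several details here are assumptions made for illustration: the two terms are averaged and added with a single weight, and each ingredient neuron gets its own sigmoid, which is the usual pairing with binary cross-entropy when several ingredients can be present at once. The paper’s exact activation, weighting, and target scaling may differ.

```python
import math


def l1_loss(predicted, target):
    """Mean absolute error over the regression outputs."""
    return sum(abs(p - t) for p, t in zip(predicted, target)) / len(target)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def binary_cross_entropy(probabilities, labels, eps=1e-12):
    """Mean binary cross-entropy over independent ingredient labels (0 or 1)."""
    total = 0.0
    for p, y in zip(probabilities, labels):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


def multitask_loss(reg_pred, reg_target, ingredient_logits, ingredient_labels,
                   classification_weight=1.0):
    """L1 regression loss plus weighted multi-label classification loss."""
    probabilities = [sigmoid(z) for z in ingredient_logits]
    return (l1_loss(reg_pred, reg_target)
            + classification_weight
            * binary_cross_entropy(probabilities, ingredient_labels))


# Made-up values: portion size, kcal, protein, carbohydrates, fat.
reg_pred = [100.0, 250.0, 10.0, 30.0, 8.0]
reg_target = [110.0, 240.0, 10.0, 30.0, 8.0]
# Four ingredient neurons instead of 100 to keep the example short.
loss = multitask_loss(reg_pred, reg_target, [0.0, 0.0, 0.0, 0.0], [1, 0, 0, 1])
```

Because both terms flow back through the same convolutional layers, errors on the ingredient labels also shape the features used for calorie regression, which is the point of training the tasks together.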

Step 6: Estimate Nutrition From Food Photos

At inference time, given a photo, the model outputs the following:

  • Food item name
  • Portion size
  • Number of calories
  • Grams of proteins
  • Grams of carbohydrates
  • Grams of fat
  • Probabilities for each of the 100 ingredient labels

How Calorie and Nutrition Estimation Using Regression Works

The estimation of calories and nutrition using regression works like this. During the end-to-end training phase, the food photos are the inputs to the model, while the portion sizes reported in recipes and the calories and nutrition per unit mass from the nutrition database provide the target values it learns to predict.

The convolutional layers of the network produce filter maps that embed visual features about the food items, like colors, textures, color and texture transitions, size of food relative to other items in the scene like plates, and so on. For each image, all the filter maps are linearized to a single vector embedding that becomes the input to the regression layer.

The regression layer learns to associate the visual features embedded in these vectors with the food’s per-unit-mass calorie and nutrition values. If the training images include a wide range of portion sizes and food items, the ability of the model to generalize becomes better. The model essentially predicts accurate calorie and nutrition values by eyeballing the different features of the food in a photo.
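The path from filter maps to numeric estimates can be sketched with toy numbers. This example assumes global average pooling, which ResNet-style networks use to collapse each filter map to one value, followed by a single linear layer; the filter maps, weights, and biases are invented, whereas a trained model learns them from data.

```python
def global_average_pool(filter_maps):
    """Collapse each 2-D filter map to one number, giving one vector per image."""
    return [
        sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0]))
        for fmap in filter_maps
    ]


def linear(vector, weights, biases):
    """Fully connected layer: one weighted sum plus bias per output neuron."""
    return [
        sum(w * x for w, x in zip(row, vector)) + b
        for row, b in zip(weights, biases)
    ]


# Two made-up 2x2 filter maps standing in for hundreds of real ones.
filter_maps = [
    [[1.0, 3.0], [5.0, 7.0]],
    [[0.0, 2.0], [2.0, 4.0]],
]
embedding = global_average_pool(filter_maps)

# Hypothetical weights for two regression neurons: kcal and grams of protein.
weights = [[10.0, 5.0], [0.5, 0.25]]
biases = [50.0, 1.0]
kcal, protein_g = linear(embedding, weights, biases)
```

Training adjusts the weights so that visual cues such as texture, color, and relative size push the outputs toward the values recorded for each recipe.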


Nutrition estimation results (Source: Ruede et al.)

The image shows some sample outputs produced by the model along with nutritional estimates.

Potential Enhancements

The six steps in this approach are easily customizable for your use cases. Below are some potential enhancement ideas made possible by advances in deep learning:

  • The embeddings for ingredient names can be generated by more capable transformer models like GPT-3.
  • Information extraction models can be used to identify ingredient quantities and units in recipes instead of rule-based matching.
  • More capable, recent models like vision transformers can do the food identification instead of models based on ResNet or DenseNet.
  • If transformer models are not feasible, the accuracy of existing ResNet models can be enhanced using domain-specific neural modules designed for better food recognition, like wide-slice and vertical slice convolution layers.
  • Food identification based on photos in a recipe database is an example of closed-set recognition. Generally, it may not accurately recognize images outside that database. In contrast, an open-set model trained on large real-world image datasets can recognize a much larger set of food items.

An Exciting Time for Artificial Intelligence in Nutrition

Using SHAP to understand how food recognition models go from a low understanding of the food item present to a high level with heat maps that show areas the model considers “important”

Greater awareness of nutrition science, together with progress in genetics and artificial intelligence, has enabled advances like the identification of food ingredients that act like disease-fighting drugs. It’s now possible to automate personalized nutrition plans and dietary assessments that are based on the patient’s genetic profile and medical history, with the aim of improving health outcomes. Smartphones can provide instant personalized information on nutrition and health based on photos or medical sensor readings.

We have years of technical and domain expertise working on smart health systems that use the latest deep learning and machine learning algorithms to extract insights from photos, medical reports, and other complex real-world data. We have built highly efficient food recognition systems with 98.57% accuracy.

Get in touch to see how we can build custom artificial intelligence models for the nutrition and human health domain. 


  • Charles N. C. Freitas, Filipe R. Cordeiro, Valmir Macario. “MyFood: A Food Segmentation and Classification System to Aid Nutritional Monitoring” arXiv:2012.03087 [cs.CV]
  • Robin Ruede, Verena Heusser, Lukas Frank, Alina Roitberg, Monica Haurilet, Rainer Stiefelhagen. “Multi-Task Learning for Calorie Prediction on a Novel Large-Scale Recipe Dataset Enriched with Nutritional Information” arXiv:2011.01082 [cs.CV]