A Deep Guide to Text-Guided Open-Vocabulary Segmentation

Matt Payne
May 28, 2023

Promptable segmentation (Source: Zou et al.)

Large language models (LLMs) like GPT-4 are increasingly being used to automate complex business workflows.

Complex computer vision tasks are becoming increasingly accessible to laypersons, thanks to text-guided models that combine image processing with large language models. In this article, we explore text-guided open-vocabulary segmentation, a necessary first step to creating an AI agent capable of visual segmentation.

What Is Text-Guided Open-Vocabulary Segmentation?

In an earlier article, we explained a list of segmentation concepts like semantic, instance, panoptic, referring, zero-shot, one-shot, and few-shot segmentation. But what is text-guided, open-vocabulary segmentation and where does it fit within that list?

To understand that, you must first observe the biggest trend in AI right now: the advent of LLMs like GPT-4, ChatGPT, and LLaMA. Their ability to understand complex natural language instructions and automatically work out suitable action plans by coordinating multiple AI agents is improving by the day. Natural language is increasingly becoming a sort of universal programming language to instruct AI agents and automate any kind of workflow in any domain.

Text guidance for image processing tasks like segmentation fits into this AI workflow. In the past, you'd need to know how to use Photoshop or hire a developer to write image processing code. But now, a non-coding artist or video editor can just talk to an image processing agent, give it some editing instructions using layperson's language, and it'll carry out the segmentation or other tasks. This is called text-guided segmentation.

Additionally, these editing instructions need not be limited to the word and concept annotations that the segmentation model is trained on. By leveraging the impressive linguistic knowledge of LLMs, the model can understand instructions it hasn't seen during training and map them to the visual concepts it finds in new, unseen images. This is called open-vocabulary segmentation; when the prompts are text, it is essentially another name for zero-shot semantic segmentation.

Below, we explore two state-of-the-art pre-trained vision-language models for text-guided segmentation.

Segment Anything Model for Text-Guided Segmentation

SAM architecture (Source: Kirillov et al.)

Inspired by LLMs that can handle any language task with suitable prompts, the segment anything model (SAM) by Kirillov et al. from Meta is a general-purpose approach for promptable segmentation that accepts multimodal prompts, including text prompts, and returns meaningful pixel-level segmentation masks for images.

Further, we can use prompt optimization frameworks to iteratively generate optimized hard prompts that improve the accuracy of the segmentation masks.

SAM's Capabilities

SAM promptable segmentation (Source: Kirillov et al.)

SAM's versatility in prompting equips it to support a wide variety of segmentation tasks. It supports these tasks categorized by the nature of results:

  • Semantic segmentation: All objects matching the prompts are segmented.
  • Instance segmentation: Specific objects of the same type can be selected by combining SAM with an object detection model that produces bounding box prompts.
  • Panoptic segmentation: It combines the outputs of both semantic and instance segmentation.

For each of the above tasks, SAM is capable of these subtasks categorized by the nature of inputs:

  • Referring or text-guided segmentation using text prompts
  • Geometric hints for segmentation, like points, boxes, and partial masks
  • One-shot segmentation using images as prompts
  • Zero-shot segmentation by supplying prompts not seen during training

In addition, SAM's prompt encoder and mask decoder are lightweight: given a precomputed image embedding, they return masks in tens of milliseconds, even on a CPU. It's also ambiguity-aware, meaning that if a prompt is ambiguous, SAM returns multiple candidate masks with confidence scores.

Together, these capabilities make it suitable for any real-time segmentation task with user interaction on any device.


The image below shows zero-shot, text-guided, open-vocabulary segmentation using SAM:

Text-guided and composite segmentation using SAM (Source: Kirillov et al.)

Even when given an uncommon text prompt like "beaver tooth grille," SAM is still able to map it conceptually to the image and isolate the object with a segmentation mask. It doesn't do so well on some prompts like "a wiper," but an additional point hint from the user helps it select the object. It's even capable of segmenting according to singular or plural forms, like "a wiper" versus "wipers."

SAM's Approach

SAM advances state-of-the-art image segmentation with three contributions:

  • Promptable segmentation: SAM accepts multiple modalities of prompts and their combinations. Its prompt encoder has built-in support for key points, approximate bounding boxes, partial masks, and text queries. But the overall architecture is versatile enough to accommodate other prompt modalities, like images, videos, or audio, too.
  • A large-scale segmentation dataset: The SA-1B dataset, with one billion masks from 11 million images, has about 400 times more masks than any previous segmentation dataset and powers the ability of SAM to be a versatile foundation model for any segmentation task. SAM's code and pre-trained weights are released under the Apache 2.0 license, which permits commercial use, while the SA-1B dataset itself is released under a separate license intended for research.
  • Pre-trained models: SAM provides powerful models that are pre-trained (on SA-1B) and capable of promptable segmentation out of the box. Like other foundation models, you can use them directly or fine-tune them on your custom datasets for specific segmentation goals like instance segmentation or panoptic segmentation.

In the next section, we explore the architecture and internals of these pre-trained models.

Model Architecture

SAM is a standard encoder-decoder transformer. Its magic lies in its unique mask decoder design and its prompt encoder component that influences the decoder's activations through user-supplied prompts in the form of text, points, boxes, or masks. We explore the internals of each of these components.

Image Encoder

Given an input image, the image encoder's job is to derive an embedding (tensor representation) that encodes all its visual characteristics like shapes, contours, colors, edges, textures, and more in a multi-dimensional vector.

In principle, you can use any neural network as an image encoder. SAM chooses to reuse the pre-trained, vision transformer-based, masked autoencoder model from the 2021 paper, Masked Autoencoders Are Scalable Vision Learners. It expects a 1024x1024 RGB image, splits it into 16x16 pixel patches (giving a 64x64 grid of patches), and encodes its visual aspects to generate a 256x64x64 embedding tensor per image.
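The relationship between the input size, the patch size, and the spatial size of the embedding is simple arithmetic. The sketch below is only an illustration of that arithmetic, assuming the 16-pixel patches of a ViT-H/16 style backbone; `patch_grid` is a hypothetical helper, not part of SAM's code.

```python
def patch_grid(image_size: int, patch_size: int) -> tuple[int, int]:
    """Return (patches per side, total patch tokens) for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image size must be a multiple of the patch size")
    per_side = image_size // patch_size
    return per_side, per_side * per_side


# Assumed values: a 1024-pixel input and 16-pixel patches.
per_side, tokens = patch_grid(1024, 16)
embedding_shape = (256, per_side, per_side)  # channels x height x width
print(per_side, tokens, embedding_shape)
```

Note that a patch size that does not divide 1024 evenly is rejected, which is a quick sanity check on any patch size quoted for a 64x64 embedding grid.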

SAM defaults to the ViT-Huge (ViT-H) model, which has 636 million parameters and generates 1,280-dimensional embeddings. However, as this graph from their ablation study shows, the ViT-Large (ViT-L) model, which is about half the size of ViT-H at 308 million parameters and produces 1,024-dimensional embeddings, performs almost as well. Both show considerable improvements over the small, 91-million-parameter, 768-dimensional ViT-Base (ViT-B).

SAM image encoder models (Source: Kirillov et al.)

For smartphone or web browser deployments, we recommend the ViT-L model as the encoder.

Prompt Encoder

The prompt encoder's job is to encode the prompts, in whichever modality the user has chosen, to prompt embeddings. This is achieved in different ways depending on the modality as explained in the following sections, but all of them generate several prompt tokens with a 256-dimensional embedding for each token.

How Text Prompts Work

SAM uses OpenAI's popular Contrastive Language-Image Pre-training (CLIP) model as a text encoder to encode free-form text prompts as embeddings. Specifically, it uses the largest publicly available CLIP model, ViT-L/14@336px.

How Point, Box, and Mask Prompts Work

A point prompt is encoded as a sum of its positional encoding on the image with a pre-learned embedding that represents a point as being in the mask foreground or background.

Similarly, a box prompt is encoded as a pair of positional encodings, one for its top-left corner and another for its bottom-right corner.
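To make the "positional encoding plus learned embedding" idea concrete, here is a minimal sketch. It is not SAM's implementation: the sinusoidal encoding is a stand-in for whatever positional encoding the model uses, the embeddings are random placeholders for learned parameters, the corner-type embeddings for boxes are an assumption, and the 8-dimensional size is chosen only for readability.

```python
import math
import random

DIM = 8  # illustrative only; the prompt tokens described above are 256-dimensional


def positional_encoding(x: float, y: float, dim: int = DIM) -> list[float]:
    """Sinusoidal encoding of a point whose coordinates are normalized to [0, 1]."""
    enc = []
    for i in range(dim // 4):
        freq = 2.0 ** i * math.pi
        enc += [math.sin(freq * x), math.cos(freq * x),
                math.sin(freq * y), math.cos(freq * y)]
    return enc


rng = random.Random(0)
# Random stand-ins for learned embeddings; in a trained model these are parameters.
TYPE_EMBEDDINGS = {
    name: [rng.uniform(-1, 1) for _ in range(DIM)]
    for name in ("foreground", "background", "box_top_left", "box_bottom_right")
}


def encode_point(x: float, y: float, kind: str) -> list[float]:
    """Token = positional encoding of the location + embedding of the prompt type."""
    pos = positional_encoding(x, y)
    return [p + t for p, t in zip(pos, TYPE_EMBEDDINGS[kind])]


def encode_box(x0: float, y0: float, x1: float, y1: float) -> list[list[float]]:
    """A box becomes two tokens, one per corner."""
    return [encode_point(x0, y0, "box_top_left"),
            encode_point(x1, y1, "box_bottom_right")]


foreground_click = encode_point(0.25, 0.75, "foreground")
box_tokens = encode_box(0.1, 0.2, 0.6, 0.9)
```

The same location clicked as foreground or background yields different tokens, which is how the decoder can tell "include this" from "exclude this."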

Mask prompts are encoded differently, however, because they're dense. A mask prompt is passed through a convolutional block to generate an embedding.

Mask Decoder

SAM mask decoder architecture (Source: Kirillov et al.)

All the actual segmentation happens in the mask decoder. It consists of the following components:

  • Two two-way attention blocks to determine self-attention and cross-attention between the image-level embedding and the prompt tokens
  • A segmentation mask generating multi-layer perceptron (MLP)
  • Transposed convolution layers to upscale the image embedding toward the image size
  • A second MLP to estimate the intersection-over-union (IoU) of each mask, which serves as its confidence score

The inputs to the mask decoder are:

  • The image embedding from the encoder
  • The prompt tokens from the prompt encoder
  • Its output tokens from the previous iteration

We explore the decoder's components next.

Two-Way Attention Blocks

The mask decoder has two stages of two-way attention blocks. Their job is to semantically understand and associate the objects in the image and the concepts in the prompts. For this, given a set of image and prompt tokens, they derive mask tokens and mask embeddings and update all the mask and image embeddings based on cross-attention.

The inputs are processed by the four layers in each block.

The first layer is a multi-head attention layer that determines self-attention weights for the sparse inputs, between the prompt tokens and between the previous output tokens. The higher a pair of tokens scores, the more semantically associated they are compared to other pairs.

The second layer is a multi-head attention layer that determines cross-attention between the sparse prompt tokens and the dense input image tokens. Some of the semantic associations between the concepts in the prompts and the objects in the image are determined here.

The third layer is a multi-layer perceptron (MLP) that updates each token using the self-attention and cross-attention results computed so far.

The fourth layer is a multi-head attention layer that determines cross-attention in the opposite direction, from image embedding to the prompt tokens. This is why they're called two-way attention blocks.
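The four steps can be sketched in a few lines of plain Python. This is a deliberately simplified illustration, not SAM's decoder: it uses single-head attention with no learned projections, no layer normalization, and a placeholder ReLU in place of the MLP, and all function names are our own.

```python
import math


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys_values):
    """Single-head scaled dot-product attention without learned projections."""
    scale = math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys_values]
        weights = softmax(scores)
        out.append([sum(w * kv[d] for w, kv in zip(weights, keys_values))
                    for d in range(len(q))])
    return out


def add(xs, ys):
    """Residual connection: element-wise sum of two token lists."""
    return [[a + b for a, b in zip(x, y)] for x, y in zip(xs, ys)]


def mlp(tokens):
    """Placeholder point-wise MLP: a ReLU applied to every token."""
    return [[max(0.0, v) for v in t] for t in tokens]


def two_way_block(tokens, image):
    tokens = add(tokens, attend(tokens, tokens))  # 1. self-attention on sparse tokens
    tokens = add(tokens, attend(tokens, image))   # 2. tokens attend to the image
    tokens = add(tokens, mlp(tokens))             # 3. point-wise MLP on each token
    image = add(image, attend(image, tokens))     # 4. image attends to the tokens
    return tokens, image


prompt_tokens = [[1.0, 0.0], [0.0, 1.0]]
image_tokens = [[0.5, 0.5], [1.0, 0.0], [0.0, 1.0], [0.2, 0.8]]
new_tokens, new_image = two_way_block(prompt_tokens, image_tokens)
```

Both the token list and the image embedding come out updated and with unchanged shapes, which is what lets the blocks be stacked.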

Segmentation Mask MLP and Upscaling

The updated image embedding must be upscaled toward the input image size. This is done by two transposed convolution layers.

A three-layer MLP maps each mask token to a classifier vector, which is compared with every location of the upscaled embedding to produce per-pixel mask probabilities. The resulting masks, resized to the 1024x1024 input resolution, isolate the objects corresponding to the user prompt.
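One way to picture this step is as a dot product between a per-mask vector and the embedding at every pixel, followed by a sigmoid and a threshold. The sketch below assumes exactly that and uses made-up values; `predict_mask` is a hypothetical helper and the real model's details may differ.

```python
import math


def predict_mask(mask_vector, pixel_embeddings, threshold=0.5):
    """Dot each pixel embedding with the mask vector, then apply a sigmoid."""
    probs = []
    for row in pixel_embeddings:
        probs.append([
            1.0 / (1.0 + math.exp(-sum(m * p for m, p in zip(mask_vector, pixel))))
            for pixel in row
        ])
    mask = [[prob > threshold for prob in row] for row in probs]
    return probs, mask


# Toy 2x2 "upscaled embedding" with 2 channels per pixel (made-up values).
pixels = [[[2.0, 0.0], [-2.0, 0.0]],
          [[0.0, 3.0], [0.0, -3.0]]]
probs, mask = predict_mask([1.0, 1.0], pixels)
print(mask)
```

Pixels whose embedding aligns with the mask vector get a probability above 0.5 and end up inside the mask.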

Scoring MLP

A smaller regression MLP ranks candidate masks by IoU scores calculated from the mask tokens.
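For reference, IoU itself is straightforward to compute when a ground-truth mask is available, and ranking candidates by a predicted score is a simple sort. The sketch below shows both on toy data; the function names and score values are illustrative assumptions.

```python
def iou(mask_a, mask_b):
    """Intersection-over-union of two binary masks given as nested lists."""
    inter = union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            inter += bool(a and b)
            union += bool(a or b)
    return inter / union if union else 0.0


def rank_candidates(candidates):
    """Sort (name, predicted_iou) pairs from most to least confident."""
    return sorted(candidates, key=lambda c: c[1], reverse=True)


mask_a = [[1, 1], [0, 0]]
mask_b = [[1, 0], [1, 0]]
overlap = iou(mask_a, mask_b)  # 1 shared pixel out of 3 covered pixels

# Made-up predicted scores for three candidate masks of an ambiguous prompt.
ranked = rank_candidates([("part", 0.71), ("whole", 0.93), ("subpart", 0.55)])
best_name = ranked[0][0]
```

At inference time there is no ground truth, so the model's predicted IoU stands in for the true value when choosing among candidates.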

Training and Fine-Tuning

As SAM is pre-trained on the massive SA-1B dataset, it can already segment most real-world images.

To improve the IoU metrics on your specific data, you can freeze some of the mask decoder's layers and fine-tune the weights of the rest using your custom images.

SAM for Instance Segmentation

For instance segmentation, the recommended approach is a pipeline that combines an object detector with SAM. The object detector acts as a prompt generator by detecting object instances and generating bounding boxes. These boxes are passed to SAM's prompt encoder, and SAM generates a segmentation mask for each detected object.
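The shape of such a pipeline can be sketched independently of any particular detector or segmenter. In the sketch below, `detect` and `segment_box` are hypothetical callables that you would back with real models; the fake versions exist only so the example runs without model weights.

```python
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) with exclusive upper bounds
Mask = List[List[bool]]
Image = List[List[int]]


def instance_segmentation(
    image: Image,
    detect: Callable[[Image], List[Box]],
    segment_box: Callable[[Image, Box], Mask],
) -> List[Tuple[Box, Mask]]:
    """Run the detector, then prompt the segmenter once per detected box."""
    return [(box, segment_box(image, box)) for box in detect(image)]


# Hypothetical stand-ins for a real detector and a real box-prompted segmenter.
def fake_detector(image: Image) -> List[Box]:
    return [(0, 0, 2, 2), (2, 2, 4, 4)]


def fake_segmenter(image: Image, box: Box) -> Mask:
    x0, y0, x1, y1 = box
    return [[x0 <= x < x1 and y0 <= y < y1 and image[y][x] > 0
             for x in range(len(image[0]))] for y in range(len(image))]


image = [[1, 1, 0, 0],
         [1, 0, 0, 0],
         [0, 0, 1, 1],
         [0, 0, 1, 1]]
results = instance_segmentation(image, fake_detector, fake_segmenter)
```

Each detected box yields its own mask, so two objects of the same class stay separate instances.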


SAM is quite versatile. However, as the paper itself shows, it gets some prompts wrong. This points to some weaknesses in its semantic capabilities.

In the next section, we explore an enhanced model that is designed to overcome some of SAM's weaknesses.

SEEM: An Improved SAM for Open-Vocabulary Segmentation

Proposed by Zou et al. just days after SAM was released, the "segment everything everywhere all at once" model (SEEM) is also a promptable segmenter but promises to be even more versatile than SAM. In this section, we explore its capabilities and internals.


SEEM prompt capabilities (Source: Zou et al.)

As the illustration above shows, SEEM's prompt support is much more versatile and allows for much richer user interaction for segmentation tasks. These strengths include:

  • Accepting open-vocabulary text prompts
  • Supporting geometry prompts like points, boxes, polygons, scribbles, and masks
  • Using the referred regions of another image
  • Supporting audio prompts
  • Accepting composite prompts that combine text, visual, and other modalities
  • Supporting memory prompts that remember previous prompts like a chatbot and enable incremental segmentation


The images below show SEEM's open-vocabulary abilities on different kinds of images like photos, drawings, and cartoons:

These examples show the model's understanding of the sizes of various objects relative to one another.

SEEM text-guided open-vocabulary segmentation examples (Source: Zou et al.)


SEEM approach (Source: Zou et al.)

The key to SEEM's prompt versatility is that it first embeds all prompts in a joint image-text embedding space, much like CLIP. This is in contrast to SAM, which uses a different embedding space for its text, geometry, and mask prompts.

By embedding all visual prompts and text prompts in the same joint visual representation space, SEEM can support composite prompts that enable users to express their intent unambiguously by combining text and visual prompts.

Another benefit of using a joint embedding space is that semantically-related text embeddings can be derived for visual prompts, making it highly semantically aware at all times. This is useful for incremental segmentation using memory prompts where a mask from a previous request is easily turned into a semantic query to provide the boundaries for a new request.
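The practical consequence of a joint embedding space is that a visual prompt and a text label can be compared directly, for example with cosine similarity. The sketch below uses made-up 3-dimensional vectors and hypothetical helper names to show the idea; a real model would produce the embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def nearest_concept(visual_embedding, text_embeddings):
    """Return the text label whose embedding is most similar to the visual prompt."""
    return max(text_embeddings,
               key=lambda label: cosine(visual_embedding, text_embeddings[label]))


# Made-up embeddings; in practice these come from the model's encoders.
text_embeddings = {"dog": [0.9, 0.1, 0.0], "car": [0.0, 0.2, 0.9]}
scribble_embedding = [0.8, 0.3, 0.1]
label = nearest_concept(scribble_embedding, text_embeddings)
print(label)
```

Because both kinds of prompt live in the same space, the same comparison works whether the query came from a scribble, a mask, or a phrase.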

SEEM's higher levels of semantic awareness and better interactivity vis-a-vis SAM are depicted below:

SEEM vs. SAM (Source: Zou et al.)

Model Architecture

SEEM adapts the X-Decoder architecture shown below:

SEEM architecture based on X-Decoder (Source: Zou et al.)

For the vision backbone, SEEM uses a dual attention vision transformer (DaViT). For the language encoder, it uses either the unified contrastive learning (UniCL) model or Florence.

For embedding visual prompts like boxes or scribbles, it uses the visual backbone to extract features from the vicinity of the prompt and embeds them in the joint text-image embedding space.


Unlike SAM, which is trained on the massive SA-1B dataset, SEEM is trained on a modest volume of data.


SEEM is compared with other modern segmentation architectures below. The (T) and (B) refer to tiny and base models, depending on which visual and language backbones are used. We can see that its metrics compare well against other models:

SEEM model benchmarks (Source: Zou et al.)

Though it's trained only on the COCO dataset, SEEM holds its own. It's possible that, if trained on the SA-1B dataset, it would outperform everything shown here, including SAM.

SAM Shortfalls and How to Solve Them

Infamous sheep image for SAM shortfalls

We’ve seen that both of these models can struggle with images where the objects we want to segment sit against different or overlapping backgrounds. The model's reliance on understanding the entire scene relative to its trained vocabulary makes images like this difficult to segment. It was pretty surprising to see that images as visually simple as this one can cause problems for a model with this much training, and I made a few jokes about it on LinkedIn.

can only recognize one of the animals with a generalized prompt

As we can see, this sheep causes issues. Here’s another prompt that's even more granular and still isn’t right.

another prompt that is more granular

These are pretty simple image examples that we would think the model could understand. While I can’t think of many use cases where you’d have something like this, it’s a pretty good demonstration of the issues. Here’s another one where the results from SAM are baffling.

multi-animal instance

How to Overcome: Grounded Segment Anything

This pipeline, outlined by IDEA-Research, combines Grounding DINO and Segment Anything to detect and segment objects from text inputs. The goal is to combine the strengths of both tools to solve more complex problems, including segmentation.

Grounding DINO is an innovative deep learning framework designed to tackle object detection and referring expression comprehension (REC) tasks. It takes advantage of a dual-encoder-single-decoder architecture to effectively identify and extract relevant information from input images and text. In this section, we provide a brief overview of Grounding DINO's main components and how they come together to achieve state-of-the-art performance in both object detection and REC tasks.

Key Components of Grounding DINO

1. Feature Extraction and Enhancer: Grounding DINO uses an image backbone (e.g., Swin Transformer) to extract multi-scale image features and a text backbone (e.g., BERT) to obtain text features. Once the vanilla image and text features are extracted, they are fed into a feature enhancer module for cross-modality feature fusion. This module helps align features of different modalities, enhancing their representation.

2. Language-Guided Query Selection: To effectively leverage the input text to guide object detection, Grounding DINO employs a language-guided query selection module. This module selects the image features that are most relevant to the input text and uses them as decoder queries, initializing them for further processing.

3. Cross-Modality Decoder: The cross-modality decoder combines image and text modality features, allowing the model to better align information from both sources. Each cross-modality query goes through a series of self-attention, image cross-attention, text cross-attention, and feed-forward neural network layers in the decoder. This structure ensures that text information is effectively injected into the queries for improved modality alignment.

4. Sub-Sentence Level Text Feature: Grounding DINO introduces a "sub-sentence" level representation that eliminates unwanted word interactions while maintaining fine-grained understanding. Attention masks are used to block attention among unrelated category names, preventing unnecessary dependencies while retaining per-word features.

5. Loss Function: The model uses L1 loss and GIoU loss for bounding box regression, and contrastive loss with focal loss for classification. These losses are used in conjunction with bipartite matching and auxiliary loss in a similar fashion to existing DETR-like models.
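As an illustration of the box regression part of the loss, the sketch below computes generalized IoU (GIoU) for two boxes and combines it with an L1 term. The weights of 5.0 and 2.0 are illustrative assumptions, not values taken from the Grounding DINO paper, and the boxes are assumed to have non-zero area.

```python
def giou(a, b):
    """Generalized IoU of two boxes in (x0, y0, x1, y1) format."""
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = area_a + area_b - inter
    # Area of the smallest box that encloses both inputs.
    hull = (max(a[2], b[2]) - min(a[0], b[0])) * (max(a[3], b[3]) - min(a[1], b[1]))
    return inter / union - (hull - union) / hull


def box_loss(pred, target, l1_weight=5.0, giou_weight=2.0):
    """Weighted sum of the L1 distance and (1 - GIoU); the weights are assumptions."""
    l1 = sum(abs(p - t) for p, t in zip(pred, target))
    return l1_weight * l1 + giou_weight * (1.0 - giou(pred, target))


same = giou((0.0, 0.0, 2.0, 2.0), (0.0, 0.0, 2.0, 2.0))
apart = giou((0.0, 0.0, 1.0, 1.0), (2.0, 0.0, 3.0, 1.0))
perfect = box_loss((0.0, 0.0, 2.0, 2.0), (0.0, 0.0, 2.0, 2.0))
```

Unlike plain IoU, GIoU goes negative for boxes that do not overlap, so the loss still provides a useful signal when a prediction misses its target entirely.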

You can see it gets all of our animals!

Grounding DINO and Segment Anything results on all animals

Business Uses of Text-Guided Open-Vocabulary Segmentation

The following examples demonstrate uses of text-guided segmentation in different businesses.

1. Retail

In retail settings, inventory tasks like cycle counting and quarterly counting are time-consuming but necessary chores. With text-guided segmentation, these tedious tasks can become automated. A store employee equipped with a smartphone can point the camera at a shelf and speak or type the products or produce they're interested in counting.

In the illustration below, the system is asked to segment out "oranges":

Segment "oranges" on a retail shelf, with the segmented oranges marked in green (Source: Original photo by gemma on Unsplash)
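Once a mask is available, a rough count can be derived from it. The sketch below counts 4-connected regions in a binary mask; it is a simplification that assumes the segmenter returns a single semantic mask, and items that touch each other will merge into one region, so per-instance masks would give a more reliable count. `count_regions` is a hypothetical helper.

```python
def count_regions(mask):
    """Count 4-connected regions of truthy pixels in a binary mask."""
    height, width = len(mask), len(mask[0])
    seen = [[False] * width for _ in range(height)]
    count = 0
    for y in range(height):
        for x in range(width):
            if not mask[y][x] or seen[y][x]:
                continue
            count += 1  # found an unvisited region; flood-fill it
            stack = [(y, x)]
            seen[y][x] = True
            while stack:
                cy, cx = stack.pop()
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        stack.append((ny, nx))
    return count


# Toy mask with three separate blobs (made-up data).
shelf_mask = [[1, 1, 0, 0],
              [0, 0, 0, 1],
              [1, 0, 0, 1]]
item_count = count_regions(shelf_mask)
print(item_count)
```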

2. Bookstores and Libraries

Similarly, other time-consuming inventory tasks like cataloging in bookstores or libraries can be streamlined using text-guided segmentation. In the example below, we segment out all the "blue books" on a bookshelf:

Book shelf (Photo by Paul Melki on Unsplash)

Segment "blue books" on a shelf (Original photo by Paul Melki on Unsplash)

Text-Guided Image and Video Processing in Your Enterprise

Large language models, natural language prompts, and AI agents that can understand them and automate workflows are the trends of the future in the workplace. With them, you can automate your business workflows like retail inventory management, invoice matching, text extraction from scanned and handwritten documents, document understanding, information extraction from documents, signature verification, and more.

Contact us for guidance on automating complex image and video processing workflows using large language models like GPT-4 and ChatGPT to streamline your business operations.


  • Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, Ross Girshick (2023). "Segment Anything." arXiv:2304.02643 [cs.CV]. https://arxiv.org/abs/2304.02643v1
  • Xueyan Zou, Jianwei Yang, Hao Zhang, Feng Li, Linjie Li, Jianfeng Gao, Yong Jae Lee (2023). "Segment Everything Everywhere All at Once." arXiv:2304.06718 [cs.CV]. https://arxiv.org/abs/2304.06718v1
  • Xueyan Zou, Zi-Yi Dou, Jianwei Yang, Zhe Gan, Linjie Li, Chunyuan Li, Xiyang Dai, Harkirat Behl, Jianfeng Wang, Lu Yuan, Nanyun Peng, Lijuan Wang, Yong Jae Lee, Jianfeng Gao (2022). "Generalized Decoding for Pixel, Image, and Language." arXiv:2212.11270 [cs.CV]. https://arxiv.org/abs/2212.11270