What is CLIPSeg & how we build optimization pipelines for text-guided zero-shot segmentation

Matt Payne
May 25, 2023

CLIPSeg segmentation on retail shelves
Text-guided CLIPSeg segmentation (Original photo by Franki Chamaki on Unsplash)

In recent months, large language models (LLMs) like GPT-4, LLaMA, GPT-J, and others have been increasingly adopted across industries. While initially limited to language tasks, they have since been integrated with software agents that interface with other systems to carry out tasks.

As a result, LLMs now act like artificial brains that take natural language instructions from users, break them into steps, and instruct agents, again in natural language, to carry out tasks in the real world.

In this ecosystem, agents that can accept natural language instructions to carry out different tasks are crucial. One such task is image segmentation. As an example of its use, a retail employee can instruct an LLM to "Count all the shampoo bottles on this shelf." The LLM in turn instructs a segmentation agent to look for "shampoo bottles" in a camera feed and isolate them.

In this article, we look at a segmentation model called CLIPSeg that is built for such text-guided tasks and integration with LLMs as an agent. Specifically, we'll understand how CLIPSeg uses OpenAI's contrastive language image pretraining model, or CLIP, for text-guided zero-shot and one-shot segmentation.

What Is CLIPSeg?

CLIPSeg, proposed by Lüddecke and Ecker, is a language-image model for semantic segmentation that segments images using the text prompts or prototype images you provide.

Under the hood, CLIPSeg leverages OpenAI's powerful CLIP image-text model that enables seamless use of text-prompting for computer vision tasks.

CLIP enables CLIPSeg to perform the following image segmentation tasks:

  • Referring segmentation: Segment an image based on text prompts that the model has seen during training.
  • Zero-shot segmentation: Segment an image based on novel open-vocabulary prompts that the model has not seen during training.
  • One-shot segmentation: Segment an image by showing it a prototype image of an object of interest.

Its text and visual prompting capabilities enable CLIPSeg to work as an agent with large language models like GPT-4 to facilitate many business use cases.

CLIPSeg Architecture

clipseg architecture
CLIPSeg architecture (Source: Lüddecke and Ecker)

CLIPSeg has a transformer-based, encoder-decoder architecture. Its encoder is a pre-trained CLIP vision-language model based on the vision transformer (ViT-B/16) network. It generates an embedding and attention activations for a target image.

For the decoder, CLIPSeg stacks three standard transformer blocks that combine the target image embedding, its activations, and the conditioning prompt to output a binary segmentation mask.

We explore all these architectural aspects in more detail in the next sections.

Encoder-Decoder Connections

encode-decoder skip connections
Encoder-decoder skip connections (Source: Lüddecke and Ecker)

A single embedding, generated by the encoder for a target image, is insufficient to identify precise segmentation masks. A lot of useful information required by the decoder is not available in the embedding. However, it is available in the attention activations of the encoder's transformer blocks.

So CLIPSeg opts for a U-Net-like architecture where its three decoder blocks are connected to three of the vision transformer encoder's blocks. Their attention information enables the decoder blocks to more accurately infer both nearby and distant spatial and semantic relationships in the image.

Conditioning on Prompts With Feature-Wise Linear Modulation

In this section, we explore CLIPSeg's technique to make text or image prompts influence the decoder's segmentation results.

Text or image prompts are additional multimodal inputs that the model must combine with a target image in some way. Influencing a model's results in this fashion, with additional input, is called "conditioning." For example, a popular conditioning technique for natural language processing (NLP) combines token position embeddings with input text embeddings. But that approach just influences the first layer.

Conditioning techniques that influence not just the first but every layer of the model give better results. CLIPSeg uses one such technique called feature-wise linear modulation (FiLM), proposed by Perez et al.

How FiLM Works

Prompts influence the decoder's segment activations through feature-wise modulation (Source: Perez et al.)

FiLM works as follows:

  • It obtains the feature maps from a selected transformer block in the CLIP encoder. We'll call this feature matrix F(i,c), indicating that it's the cth feature map for the ith input. In Perez et al.'s notation, i indexes the input example rather than a position inside it, so every image patch of the same input shares the same conditioning weights.
  • It applies an affine transform — 𝛄(i,c) ○ F(i,c) + 𝛃(i,c) — to the layer's feature map to generate a new conditioned feature map.
  • 𝛄 and 𝛃 are feature-wise weights produced by two arbitrary functions applied on the conditioning prompt. They are different for each feature map — hence the term "feature-wise."
  • In practice, 𝛄 and 𝛃 are learned by two linear layers of the CLIPSeg model during training. During inference, these layers generate the conditioning weights from a given prompt.
  • Since they're combined using an affine transform, the new feature maps will retain the spatial relationships of the original feature maps.
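The affine transform above can be sketched in a few lines of NumPy. This is only an illustration under assumptions: the sizes, the random weights `W_gamma` and `W_beta`, and the helper name `film` are hypothetical stand-ins for the two learned linear layers, and the scale and shift are assumed to depend only on the prompt and the feature index.

```python
import numpy as np

rng = np.random.default_rng(0)
num_tokens, num_features, prompt_dim = 4, 8, 16  # illustrative sizes only

# features[t, c]: activation of feature c for image patch (token) t.
features = rng.normal(size=(num_tokens, num_features))

# Stand-in for a CLIP text or image prompt embedding.
prompt = rng.normal(size=(prompt_dim,))

# Two linear layers map the prompt to one scale and one shift per feature.
# They are random here; in a trained model they would be learned.
W_gamma = rng.normal(size=(prompt_dim, num_features))
W_beta = rng.normal(size=(prompt_dim, num_features))


def film(feature_map, prompt_embedding):
    gamma = prompt_embedding @ W_gamma  # shape: (num_features,)
    beta = prompt_embedding @ W_beta    # shape: (num_features,)
    # Broadcasting applies the same per-feature scale and shift to every token.
    return gamma * feature_map + beta


conditioned = film(features, prompt)
```

Because the same scale and shift are applied to every patch, the relative layout of the patches is untouched, which is the spatial property the last bullet describes.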

Having understood FiLM, we can now explore how CLIPSeg works with different prompts.

How Zero & Few Shot Segmentation With Text Prompts & Visual Prompts Works

text prompting clipseg
Text prompting (Source: Lüddecke and Ecker)

Zero-shot text-guided segmentation is when the text prompt you supply isn't part of the CLIP dataset or training. It's a novel prompt that CLIPSeg is seeing for the first time. Despite that, it infers visual concepts that are semantically related to that prompt, identifies them in the query image, and segments them with the following steps.

In one-shot segmentation, you supply three inputs: the target image, a prototype image of an object you're interested in, and a mask to isolate that object in the prototype image. The latter two comprise the visual prompt. One-shot segmentation looks for objects in the target image that resemble the prototype object and segments them using the following steps.

1. Get a CLIP Embedding for the Text or Text & Image Prompt

In zero-shot environments, the supplied text prompt is run through CLIP's text encoder to obtain an embedding vector. The few-shot case is slightly different: a visual prompt consisting of the example object image and its object mask is run through CLIP's visual encoder. Note that this is not a simple image processing operation of isolating the object using the object mask and then obtaining an embedding for the isolated object. Instead, the object mask must condition the multi-head attention weights themselves so that attention is restricted to the image patches in the unmasked areas while ignoring the masked areas.
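One way to picture a mask that conditions attention weights is the single-head sketch below. It is a simplification under assumptions: the real encoder uses multi-head attention over many layers, and the function name `masked_attention` and the `keep` array are hypothetical.

```python
import numpy as np


def masked_attention(q, k, v, keep):
    """Single-head attention where only patches with keep[j] == True can be attended to."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                      # (tokens, tokens)
    scores = np.where(keep[None, :], scores, -np.inf)  # block ignored patches
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)                           # exp(-inf) becomes 0
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


rng = np.random.default_rng(0)
tokens, dim = 6, 4
q, k, v = (rng.normal(size=(tokens, dim)) for _ in range(3))
# Hypothetical object mask: True for patches that belong to the object.
keep = np.array([False, True, True, False, False, True])

out, weights = masked_attention(q, k, v, keep)
```

Every row of `weights` still sums to one, but all of the attention mass lands on the patches marked `True`.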

CLIP's joint text-image embedding space is learned through contrastive training on text-image pairs. In that space, related text-image pairs, along with their semantically nearby texts and images, cluster close to one another.

So the embedding for the text prompt will be near to its semantically related image embeddings. Essentially, CLIPSeg converts the text prompt to its equivalent visual concept in the form of an embedding.

2. Run the Target Image Through the CLIP Encoder Blocks

The target image to segment is run through CLIP's visual encoder network, which is a ViT-B/16 vision transformer by default. However, its goal is not to obtain an embedding for the image.

3. Extract Attention Activations From Selected Encoder Blocks

Instead, it's interested in the attention activations of the third, sixth, and ninth encoder blocks to act as sources of skip connections to the decoder blocks. To keep the decoder model light, just these three blocks are selected, but more can be added for improved accuracy.
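As a rough sketch of collecting activations from selected blocks, the toy encoder below records its hidden state after blocks 3, 6, and 9. The blocks are random residual maps rather than real transformer blocks, and names such as `run_encoder` and `extract_layers` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
num_blocks, tokens, dim = 12, 5, 8
extract_layers = (3, 6, 9)  # 1-based block numbers, as in the text above

# Stand-in encoder blocks: a random linear map with a residual connection.
block_weights = [rng.normal(scale=0.1, size=(dim, dim)) for _ in range(num_blocks)]


def run_encoder(x):
    activations = {}
    for number, weight in enumerate(block_weights, start=1):
        x = x + np.tanh(x @ weight)
        if number in extract_layers:
            activations[number] = x.copy()
        if number == max(extract_layers):
            break  # this sketch stops early because the final embedding is unused
    return activations


activations = run_encoder(rng.normal(size=(tokens, dim)))
```

Adding more block numbers to `extract_layers` is the code-level equivalent of giving the decoder more skip connections.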

4. Condition the Activations With the Prompt Using FiLM

Next, CLIPSeg conditions the last selected encoder block's activations on the prompt embedding (from the first step) using FiLM. The prompt embedding is supplied to the two FiLM linear layers to obtain 𝛄 and 𝛃 matrices. They are combined with the encoder block's activations through an affine transform.

5. Pass FiLM's Conditioned Results Through the Decoder Stack

decoder blocks in clipseg
Decoder blocks (Source: Lüddecke and Ecker)

By default, the CLIPSeg decoder has three transformer blocks.

  • First decoder block: Uses the conditioned features from the previous step and combines them with the activations from its skip connection to the ninth encoder block
  • Second decoder block: Combines the first block's activations with the activations from its skip connection to the sixth encoder block
  • Third decoder block: Combines the second block's activations with the activations from its skip connection to the third encoder block

These three transformer blocks output the image patch tokens that constitute the segmentation mask for the given prompt.
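The wiring of the decoder stack can be sketched as follows, using the third, sixth, and ninth encoder blocks named in step 3. Several details are assumptions made for illustration: the decoder blocks are simple residual maps instead of transformer blocks, each skip connection is projected to the decoder width with a linear map, and activations are combined by addition.

```python
import numpy as np

rng = np.random.default_rng(0)
tokens, enc_dim, dec_dim = 5, 8, 4

# Hypothetical encoder activations keyed by block number (random stand-ins).
enc = {n: rng.normal(size=(tokens, enc_dim)) for n in (3, 6, 9)}

# One projection per skip connection maps encoder width to decoder width.
proj = {n: rng.normal(scale=0.1, size=(enc_dim, dec_dim)) for n in (3, 6, 9)}

# Stand-in decoder blocks: residual maps instead of real transformer blocks.
blocks = [rng.normal(scale=0.1, size=(dec_dim, dec_dim)) for _ in range(3)]


def decoder_block(x, weight):
    return x + np.tanh(x @ weight)


def decode(enc, gamma, beta):
    # First block: FiLM-conditioned features from encoder block 9.
    x = gamma * (enc[9] @ proj[9]) + beta
    x = decoder_block(x, blocks[0])
    # Second and third blocks: add the skip connection, then transform.
    for n, weight in zip((6, 3), blocks[1:]):
        x = decoder_block(x + enc[n] @ proj[n], weight)
    return x


mask_tokens = decode(enc, gamma=np.ones(dec_dim), beta=np.zeros(dec_dim))
```

The output keeps one row per image patch, which is what the final upsampling step consumes.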

6. Generate a Segmentation Mask From the Mask Tokens

The final step converts these mask tokens into a binary segmentation mask. One or more transposed convolution layers, also called deconvolution layers, upsample the decoder's mask tokens to a mask that's the same size as the target image.
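To make the upsampling concrete, here is a minimal NumPy sketch of a transposed convolution whose stride equals its kernel size, so each mask token paints one P x P block of pixels. The shapes, the random kernel, and the 0.5 threshold are assumptions for illustration, not the model's actual configuration.

```python
import numpy as np


def transposed_conv_upsample(tokens, kernel):
    """Transposed convolution with stride equal to the kernel size.

    tokens: (H, W, C) grid of mask tokens, one per image patch.
    kernel: (C, P, P) weights; each token produces one P x P block of logits.
    """
    H, W, _ = tokens.shape
    _, P, _ = kernel.shape
    # blocks[h, w] is the P x P patch of logits produced by token (h, w).
    blocks = np.einsum("hwc,cpq->hwpq", tokens, kernel)
    # Interleave the blocks into a single (H * P, W * P) image.
    return blocks.transpose(0, 2, 1, 3).reshape(H * P, W * P)


rng = np.random.default_rng(0)
tokens = rng.normal(size=(3, 3, 4))    # 3 x 3 patches, 4 features each
kernel = rng.normal(size=(4, 16, 16))  # 16 x 16 pixels per patch

logits = transposed_conv_upsample(tokens, kernel)
probabilities = 1.0 / (1.0 + np.exp(-logits))  # sigmoid
mask = probabilities > 0.5                     # binary segmentation mask
```

With non-overlapping blocks like these, the output resolution is simply the patch grid size multiplied by the patch size.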

Improving Zero-Shot Segmentation Results With Our Pipeline

Let's start by looking at a few product shelf images with simple prompts. These are the first prompts that come to mind when using a model like this to extract specific products from shelves. The original research paper outlines a few ways to optimize image prompts in a few-shot setting, but zero-shot is a much more difficult task and fits more use cases.

For the prompt "orange bottles," notice that CLIPSeg segmented mostly just the orange bottles or bottles with orange wrappers on them:

segment orange bottles
Segment "orange bottles" (Original photo by Franki Chamaki on Unsplash)

These are pretty good results for a zero-shot model. But they lack the accuracy needed at a production level and miss a few key bottles that we would expect the model to segment. The heatmap shows nearly the same confidence for the yellowish bottles right above the deep-red, high-confidence bottles as for the larger Sunkist bottles in the top right. It also grabs a few red bottles as part of a larger segmentation chunk that we definitely don't want. If we're going to use a threshold score plus a size score to extract these in a pipeline, we'd like to reduce the size and confidence of the green regions over the yellowish bottles and increase both for the clearly orange bottles. Either change would increase accuracy over a given dataset.

One thing to note: we'll assume that the range of what we consider orange is wide, from light to dark orange. If we wanted to focus on one end of that range, we would make that adjustment in our “high level” prompt.

Zoomed in on the well recognized Sunkist bottles with a bit of noise

Optimization Pipeline - Extraction

The first thing we do is find the largest “chunks” of recognition in our image and evaluate how likely they are to be products. We do this with a custom model we trained to evaluate confidence scores and segmentation size, but a much simpler approach works pretty well too, because the later steps clean up any wrongly classified segmentation chunks. The approach is very similar to other tasks like price tag recognition, where we care about the surrounding area and any products touching the defined area. We crop these smaller areas out of the image with padding around each area.
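A simple version of this extraction step might look like the sketch below. It is not the custom model described above: it just thresholds the heatmap, groups neighbouring pixels into chunks, drops chunks that are too small, and returns padded crop boxes. The function name `find_chunks` and the `threshold`, `min_area`, and `padding` values are hypothetical.

```python
from collections import deque

import numpy as np


def find_chunks(heatmap, threshold=0.5, min_area=4, padding=1):
    """Return padded bounding boxes of connected high-confidence regions."""
    height, width = heatmap.shape
    hot = heatmap >= threshold
    seen = np.zeros_like(hot, dtype=bool)
    chunks = []
    for r in range(height):
        for c in range(width):
            if not hot[r, c] or seen[r, c]:
                continue
            # Breadth-first flood fill over 4-connected neighbours.
            queue, pixels = deque([(r, c)]), []
            seen[r, c] = True
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < height and 0 <= nx < width
                    if inside and hot[ny, nx] and not seen[ny, nx]:
                        seen[ny, nx] = True
                        queue.append((ny, nx))
            if len(pixels) < min_area:
                continue  # too small to be a product
            ys, xs = zip(*pixels)
            chunks.append({
                # (top, left, bottom, right), clipped to the image bounds.
                "box": (max(min(ys) - padding, 0), max(min(xs) - padding, 0),
                        min(max(ys) + 1 + padding, height),
                        min(max(xs) + 1 + padding, width)),
                "area": len(pixels),
                "mean_confidence": float(np.mean([heatmap[y, x] for y, x in pixels])),
            })
    return sorted(chunks, key=lambda chunk: chunk["area"], reverse=True)


heatmap = np.zeros((8, 10))
heatmap[1:4, 1:5] = 0.9  # large confident chunk (12 pixels)
heatmap[6, 8] = 0.8      # single-pixel noise

chunks = find_chunks(heatmap)
top, left, bottom, right = chunks[0]["box"]
crop = heatmap[top:bottom, left:right]
```

In a real pipeline the same box would be used to crop the original photo, and the crop would then be sent back through the segmentation model with the same prompt.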

cropped orange bottles

Now when we run the same simple prompt we see a much tighter segmentation with very high confidence values around the correct products. This works well here already with a pretty simple chunk but as we’ll see below we want to keep the optimizations coming!

simple prompt for orange bottles

The approach up to this point where we just crop and rerun the same text prompt works well on some chunks, but does not solve all of our issues. For the middle segmentation chunk we have a problem when using the original “orange bottles” prompt.

poorly recognized bottles

Although the confidence score isn’t as high on the red bottles as it is on the orange ones, it’s pretty high compared to some of the ones we saw in the original image and even has a few light red sections.

Text Prompt Optimization

We trained a GPT model to optimize the original high-level prompt into a more granular version that takes advantage of the model's understanding of specific keywords and phrases. The concepts behind how it's trained and evaluated stem from the work done here on hard prompt optimization. I would consider these prompts to still be soft prompts, as they're still very readable, but they do incorporate many of the ideas. You can also set this model up in a few-shot environment to get it off the ground and have it work very well. We provide our high-level prompt and the outputs of an image-to-text model to generate this prompt: 🍊🍾orange bottles🍊🍾. You can see the outputs are much better for this specific chunk. While in theory you could do this manually, the automated pipeline is what we're after.

improved prompt from above

Here is the same simple prompt on the lower-left chunk we saw before. While the recognition looks the same at first glance, if you look closely at the heatmap you'll notice that the red high-confidence area got darker and the part of the heatmap covering the other bottles became lighter, with less red.

better image from above

CLIPSeg also does a great job with a very high-level prompt that asks for little granularity in the output:

recognizing the full shelf of boxes
"Retail boxes" (Original photo by Franki Chamaki on Unsplash)

Optimizing Image Prompts In One-Shot Segmentation

The paper also explores whether the visual prompt for one-shot segmentation — consisting of a prototype image of the object of interest and a mask to isolate that object — can be pre-processed in some way to improve the probability of detecting that object in the target image.

Image prompt pre-processing effects (Source: Lüddecke and Ecker)

After trying out various combinations, the researchers recommend the following pre-processing steps on the prototype image:

  • Reduce the background intensity to 10%.
  • Blur the background around the object of interest, using a Gaussian filter.
  • Crop the image down to focus on the object of interest.
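The three steps above can be sketched with NumPy alone. This is an illustration under assumptions: it works on a greyscale image, the order of operations and the `sigma` and `margin` values are choices made here, and the helper names are hypothetical.

```python
import numpy as np


def gaussian_kernel(sigma, radius):
    x = np.arange(-radius, radius + 1)
    k = np.exp(-(x ** 2) / (2 * sigma ** 2))
    return k / k.sum()


def blur(image, sigma=2.0):
    """Separable Gaussian blur of a 2-D greyscale image with edge padding."""
    radius = int(3 * sigma)
    k = gaussian_kernel(sigma, radius)
    padded = np.pad(image, radius, mode="edge")
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="valid"), 0, rows)


def prepare_visual_prompt(image, mask, bg_intensity=0.1, sigma=2.0, margin=2):
    """image: (H, W) floats in [0, 1]; mask: (H, W) bool, True on the object."""
    background = blur(image, sigma) * bg_intensity  # blurred and darkened to 10%
    combined = np.where(mask, image, background)    # the object stays sharp
    ys, xs = np.nonzero(mask)
    top = max(ys.min() - margin, 0)
    bottom = min(ys.max() + 1 + margin, image.shape[0])
    left = max(xs.min() - margin, 0)
    right = min(xs.max() + 1 + margin, image.shape[1])
    return combined[top:bottom, left:right]         # crop around the object


image = np.full((20, 20), 0.8)
mask = np.zeros((20, 20), dtype=bool)
mask[5:10, 6:12] = True

prompt_image = prepare_visual_prompt(image, mask)
```

For colour images, the same blur and darkening would be applied to each channel separately.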

The segmentation probabilities with and without pre-processing are depicted below:

Image prompt pre-processing example (Source: Lüddecke and Ecker)

One-Shot Segmentation For Our Images

In this one-shot segmentation demo, we want to find all cereal boxes with a certain color or of a certain brand on this retail shelf:

Full shelf of cereal.
A shelf with cereal boxes (Photo by Franki Chamaki on Unsplash)

Instead of just optimizing our text prompts and building new pipelines, we can supply an image example of our specific box we’re interested in to create granularity.

Corn Chex
A prototype image as prompt (Original photo by Franki Chamaki on Unsplash)

CLIPSeg generates the following segmentation mask to match the prototype image:

Selected products (Original photo by Franki Chamaki on Unsplash)

This is a great way to add granularity to our pipeline for specific SKUs or objects without requiring difficult prompt optimization.

CLIPSeg Training

The goal of CLIPSeg training is to train its decoder stack consisting of three transformer blocks, the FiLM layers, and the transposed convolution layers. It does not retrain or fine-tune the CLIP encoder stack.

The model can be trained on different datasets, like PhraseCut or COCO. Training minimizes a simple binary cross-entropy loss between the predicted mask and the ground truth.
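For reference, a per-pixel binary cross-entropy loss can be written as below. This is a generic NumPy sketch that assumes the decoder outputs raw logits; it is not taken from the CLIPSeg training code.

```python
import numpy as np


def binary_cross_entropy(logits, target):
    """Mean per-pixel BCE between mask logits and a 0/1 ground-truth mask."""
    # Numerically stable form of -[y * log(p) + (1 - y) * log(1 - p)]
    # with p = sigmoid(logits).
    per_pixel = (np.maximum(logits, 0) - logits * target
                 + np.log1p(np.exp(-np.abs(logits))))
    return float(np.mean(per_pixel))


target = np.array([[1.0, 0.0], [1.0, 0.0]])
good_logits = np.array([[4.0, -4.0], [4.0, -4.0]])  # agrees with the target
bad_logits = -good_logits                           # disagrees everywhere

good_loss = binary_cross_entropy(good_logits, target)
bad_loss = binary_cross_entropy(bad_logits, target)
```

The loss is small when the predicted mask agrees with the ground truth and grows as the two diverge, which is what gradient descent exploits when updating the decoder.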

Comparisons With Other Models

Let’s take a look at how CLIPSeg compares to other SOTA segmentation models.

Research Comparison of Zero-Shot Segmentation

On zero-shot segmentation, CLIPSeg achieves high mean intersection-over-union (mIoU) scores on unseen classes (which is the very objective of zero-shot segmentation):
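As a reminder of what the metric measures, here is a small sketch of IoU and its mean over classes. It is simplified: benchmark implementations usually accumulate intersections and unions over a whole dataset, and treating two empty masks as a perfect match is a convention chosen here.

```python
import numpy as np


def iou(pred, truth):
    pred, truth = pred.astype(bool), truth.astype(bool)
    union = np.logical_or(pred, truth).sum()
    if union == 0:
        return 1.0  # both masks empty: treated as a perfect match (a convention)
    return float(np.logical_and(pred, truth).sum() / union)


def mean_iou(pairs_by_class):
    """pairs_by_class: {class_name: (predicted_mask, ground_truth_mask)}."""
    scores = [iou(pred, truth) for pred, truth in pairs_by_class.values()]
    return sum(scores) / len(scores)


truth_a = np.zeros((4, 4), dtype=int)
truth_a[0:2, 0:2] = 1  # 4 ground-truth pixels
pred_a = np.zeros((4, 4), dtype=int)
pred_a[0:2, 0:4] = 1   # 8 predicted pixels, 4 of them correct -> IoU 0.5

truth_b = np.eye(4, dtype=int)
pred_b = truth_b.copy()  # perfect prediction -> IoU 1.0

score = mean_iou({"a": (pred_a, truth_a), "b": (pred_b, truth_b)})
```

Here the two classes score 0.5 and 1.0, so the mean is 0.75.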

Zero-shot comparison (Source: Lüddecke and Ecker)

It outperforms all the other networks on unseen classes by a huge margin. This shows the power of the CLIP model to find semantically related prompts.

But on the seen classes, its mIoU benchmarks aren't as impressive relative to other models. This is because the other state-of-the-art models are trained on thousands of images spanning just a dozen classes from the Pascal dataset, while CLIPSeg, though trained on a much larger number of classes, has to learn from far fewer images per class. Plus, CLIPSeg's segmentation masks are not as precise compared to the other models below that are optimized on large image datasets with few classes.

Image Comparison of Zero-Shot Segmentation

Grounded Segment Anything model

The Grounded Segment-Anything model also does a great job of extracting just the orange bottles in a zero-shot environment. This model combines the work behind the Segment-Anything model and Grounding DINO.

The same can’t be said for the Segment-Anything with CLIP model, which does not perform as well for this use case as our above work.

Segment-Anything with CLIP
Prompt: “Orange Boxes”

The idea behind this pipeline is actually very similar to what we outlined with our custom pipeline using GPT-4. This pipeline works like this:

  • Get all object proposals generated by SAM (Segment Anything Model).
  • Crop the object regions by bounding boxes.
  • Get cropped images' features and a query feature from CLIP.
  • Calculate the similarity between image features and the query feature.
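The last step of that list, scoring cropped proposals against the query, can be sketched as a cosine-similarity ranking. The feature vectors below are made-up three-dimensional stand-ins; in the real pipeline they would come from CLIP's image and text encoders, and `rank_crops` is a hypothetical helper name.

```python
import numpy as np


def rank_crops(crop_features, query_feature):
    """Return crop indices sorted by cosine similarity to the query, plus the scores."""
    crops = crop_features / np.linalg.norm(crop_features, axis=1, keepdims=True)
    query = query_feature / np.linalg.norm(query_feature)
    scores = crops @ query
    return np.argsort(-scores), scores


# Hypothetical features for three cropped proposals and one text query.
crop_features = np.array([[1.0, 0.0, 0.0],
                          [0.6, 0.8, 0.0],
                          [0.0, 0.0, 2.0]])
query_feature = np.array([0.0, 1.0, 0.0])

order, scores = rank_crops(crop_features, query_feature)
best_crop = int(order[0])
```

A pipeline would then keep the proposals whose score clears some threshold and merge their masks into the final result.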

Research Comparison of One-Shot Segmentation

On one-shot segmentation, CLIPSeg's metrics are on par with other segmentation models.

One-shot comparison (Source: Lüddecke and Ecker)

It's interesting to see that CLIP-based visual backbones generally score lower than architectures with the older ResNet50 and ResNet101 backbones.

Enhancements and Fine-Tuning

CLIPSeg's simplicity enables you to enhance and fine-tune it to your specific business needs. Any approach that can enhance or fine-tune CLIP will also benefit CLIPSeg.

CLIP Fine-Tuning

For example, you may want to fine-tune the model to your specific retail inventory so that during store cycle counting, your employees can segment products by brand and model, or even the stock keeping unit number (SKU). For that, you can fine-tune CLIP using a custom loss function to optimize for SKU similarity.

Architectural Optimizations

The architectural choices themselves provide opportunities to improve CLIPSeg's accuracy, including:

  • Using the larger CLIP encoder models, like the ViT-L large models
  • Including many more decoder blocks and their corresponding encoder activations
  • Using a more complex deep learning network for FiLM instead of two linear layers

Business Use Cases

CLIPSeg's simplicity and its ability to run on smartphones open up the following applications in different industries.

1. Retail

In retail environments, employees can use CLIPSeg for time-consuming tasks like cycle counting of shelves and detecting empty shelves or gaps in shelves. Employees just need to type or speak product names, and the model will locate and segment matching products on the shelf.

Similarly, customers in supermarkets can enter, or speak, their desired product names in kiosks or on their smartphones. A CLIPSeg model then looks for matching products in the shop's camera feeds, identifies the exact shelves where they're located, and helps the customer get there.

2. Medical Diagnosis

Segmentation methods are extensively used in medical image diagnosis because, in addition to locations, the shapes and extents of objects are also crucial to diagnosis.

When using a text-guided model like CLIPSeg, medical technicians and professionals can just type, or speak, their objects of interest in a medical image like an X-ray or a CT scan or MRI that shows soft tissues. A CLIPSeg model that's fine-tuned on medical datasets can then automatically segment those objects in the images.

3. Remote Sensing

Drone and satellite imagery is another use case where contours and extents are as important as locations. The ability to quickly scan through massive images (which can be as large as 10,000x10,000 pixels) for objects of interest helps reduce time, effort, and costs.

4. Image and Video Editing

Image and video editing are other applications where segmentation is extensively used to isolate objects. Text-guided segmentation promises to improve the productivity of such time-consuming tasks.

All of these applications build on the encoder-decoder architecture and prompt conditioning described earlier.

Leverage CLIPSeg for Zero-Shot and One-Shot Segmentation in Your Business

In this article, you got an in-depth understanding of how CLIPSeg works and a glimpse of its business applications. You can use it along with large language models in your business, too, to accelerate your business workflows and improve productivity in a variety of tasks like product identification, document understanding, image editing, and more.

Contact us to get started with integrating large language models for NLP with computer vision tasks like segmentation and object detection for your business workflows.


References

  • Timo Lüddecke, Alexander S. Ecker (2021). "Image Segmentation Using Text and Image Prompts." arXiv:2112.10003 [cs.CV]. https://arxiv.org/abs/2112.10003
  • Ethan Perez, Florian Strub, Harm de Vries, Vincent Dumoulin, Aaron Courville (2017). "FiLM: Visual Reasoning with a General Conditioning Layer." arXiv:1709.07871 [cs.CV]. https://arxiv.org/abs/1709.07871