Unlocking the Power of Generative Models: Learning Hard Prompts through Optimization for Image Generation and Language Tasks

Matt Payne
August 21, 2023

In the era of powerful generative models, controlling their behavior through text-based prompts has become a crucial aspect of harnessing their potential. The art of guiding these models, known as prompt engineering, is the key to unlocking their capabilities in various applications, such as image generation and language tasks. Prompt engineering techniques can be broadly classified into two categories: hard prompts and soft prompts. While hard prompts consist of hand-crafted, interpretable text tokens, soft prompts are continuous feature vectors that can be optimized through gradient-based methods. Despite the difficulty in engineering hard prompts, they offer several advantages over their soft counterparts, such as portability, flexibility, and simplicity.

In this blog post, we delve deeper into the world of prompt engineering by exploring the use of efficient gradient methods to optimize and learn discrete text for hard prompts. Our primary focus is on applications where these methods can be employed for prompt engineering, enabling the discovery of hard prompts through optimization. By combining the ease and automation of soft prompts with the portability and flexibility of hard prompts, we review a new technique that can learn hard prompts with competitive performance. 

The proposed method in the original research paper builds on existing gradient reprojection schemes for optimizing text, and adapts lessons learned from the large-scale discrete optimization literature for quantized networks.

Overview of related prompt optimization frameworks

Prompt engineering in language models has gained significant attention in recent years. The technique of using text-based instructions to guide pre-trained language models has demonstrated its effectiveness in various applications, such as task adaptation and complex instruction following. However, finding suitable hard prompts for specific tasks remains an open challenge.

AutoPrompt framework

Existing discrete prompt optimization frameworks, such as AutoPrompt, have laid the foundation for optimizing hard prompts in transformer language models. Additionally, other approaches like gradient-free phrase editing, embedding optimization based on Langevin dynamics, and reinforcement learning have also been developed. These techniques, when combined with continuous soft-prompt optimization and hard vocabulary constraints, can lead to the discovery of task-specific, interpretable tokens.

In the realm of image captioning, models have been trained on image-text pairs to generate natural language descriptions of images. However, these captions often lack accuracy and specificity when dealing with new or unseen objects. To address this issue, researchers have utilized soft prompts to optimize text-guided diffusion models, enabling the generation of similar visual concepts present in the original image. Although this method is effective, the prompts are neither interpretable nor portable.

Taking inspiration from the binary networks community and its success in developing discrete optimizers for training neural networks with quantized weights, the researchers adapted those lessons to refine and simplify discrete optimizers for language. By building on existing gradient reprojection schemes, they developed a technique that learns hard prompts through continuous optimization.

Introducing Discrete Prompt Optimization for Generative Models

Example from original paper (source). These are two examples of hard prompt discovery. Input images are on the left and a text prompt is discovered to create the images on the right. Shades of gray represent token boundaries. 

In this article, we walk through a novel methodology for learning hard prompts by employing efficient gradient-based discrete optimization. The proposed method, which is called PEZ, combines the advantages of continuous soft-prompt optimization with the hard vocabulary constraints found in traditional hard prompt engineering techniques. The goal is to create an effective and easy-to-use approach for learning hard text prompts that can be automatically generated and optimized for various text-to-image and text-to-text applications.

The methodology requires several inputs: a frozen model (θ), a sequence of learnable embeddings P = [e_1, ..., e_M] with each e_i ∈ R^d, where M is the number of "tokens" worth of vectors to optimize and d is the dimension of the embeddings, and an objective function (L). The discreteness of the token space is realized using a projection function (Proj_E), which projects each individual embedding vector e_i in the prompt onto its nearest neighbor in the embedding matrix E ∈ R^{|V| × d}, where |V| is the model's vocabulary size.
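As a rough illustration of the projection step, the sketch below snaps each soft vector to its nearest vocabulary embedding. The toy vocabulary, the 2-dimensional embeddings, and the use of squared Euclidean distance are all assumptions made for this example; a real model would use its own embedding matrix and may measure closeness differently.

```python
import math

def nearest_token(vec, embedding_matrix):
    """Return the index of the row of embedding_matrix closest to vec."""
    best_idx, best_dist = 0, math.inf
    for idx, row in enumerate(embedding_matrix):
        dist = sum((a - b) ** 2 for a, b in zip(vec, row))  # squared Euclidean
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx

def project(prompt, embedding_matrix):
    """Proj_E: map every soft embedding to its nearest vocabulary embedding."""
    ids = [nearest_token(e, embedding_matrix) for e in prompt]
    return ids, [list(embedding_matrix[i]) for i in ids]

# Hypothetical toy vocabulary: |V| = 4 tokens, d = 2 dimensions.
VOCAB = ["cat", "dog", "sun", "sea"]
E = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]

# A soft prompt with M = 2 learnable vectors.
P = [[0.9, 0.2], [-0.1, -0.7]]
token_ids, P_proj = project(P, E)
hard_prompt = [VOCAB[i] for i in token_ids]
print(hard_prompt)  # ['cat', 'sea']
```

The projected vectors are always rows of E, which is what keeps the prompt readable as actual tokens.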

Also defined is a broadcast function (B), which repeats the current prompt embeddings b times along the batch dimension. To learn a hard prompt, we minimize the risk R(P'), where P' = Proj_E(P) is the projected prompt, by measuring its performance on the task data (inputs X and targets Y drawn from the dataset D):

R(P') = E_D(L(θ(B(P', X)), Y)).

The proposed PEZ algorithm maintains continuous iterates, corresponding to a soft prompt. During each forward pass, we first project the current embeddings P onto their nearest neighbors, giving the discrete prompt P' = Proj_E(P), and calculate the gradient of the loss at P'. That gradient is then used to update the continuous (soft) iterate P.
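The loop described above can be sketched on a toy problem. This is a minimal illustration, not the paper's implementation: the quadratic objective, the 2-dimensional vocabulary, the learning rate, and the choice to keep the best projected prompt seen so far are all assumptions made for the example.

```python
def project(prompt, E):
    """Snap each soft vector to its nearest row of E (squared Euclidean distance)."""
    ids = []
    for e in prompt:
        dists = [sum((a - b) ** 2 for a, b in zip(e, row)) for row in E]
        ids.append(dists.index(min(dists)))
    return ids

def loss_and_grad(P_proj, targets):
    """Toy objective: squared distance to target vectors, plus its gradient w.r.t. P_proj."""
    loss = sum((a - t) ** 2 for p, tv in zip(P_proj, targets) for a, t in zip(p, tv))
    grad = [[2 * (a - t) for a, t in zip(p, tv)] for p, tv in zip(P_proj, targets)]
    return loss, grad

def pez(P, E, targets, lr=0.1, steps=20):
    best_loss, best_ids = float("inf"), None
    for _ in range(steps):
        ids = project(P, E)                          # hard prompt P' for the forward pass
        P_proj = [E[i] for i in ids]
        loss, grad = loss_and_grad(P_proj, targets)  # gradient is evaluated at P'
        if loss < best_loss:                         # remember the best hard prompt so far
            best_loss, best_ids = loss, ids
        # The update is applied to the continuous iterate P, not to P'.
        P = [[a - lr * g for a, g in zip(p, gr)] for p, gr in zip(P, grad)]
    return best_ids, best_loss

# Hypothetical toy setup: 4 tokens in 2 dimensions, prompt length M = 2.
VOCAB = ["cat", "dog", "sun", "sea"]
E = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
targets = [[0.1, 0.9], [-0.8, 0.1]]
best_ids, best_loss = pez([[1.0, 0.0], [1.0, 0.0]], E, targets)
best_prompt = [VOCAB[i] for i in best_ids]
print(best_prompt)
```

On this toy objective the soft iterate hovers around token boundaries once it gets close, so the projected prompt can flip between tokens from step to step. That is why the sketch returns the best projected prompt rather than the last one.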

By employing PEZ, we can optimize hard text prompts for both text-to-image and text-to-text applications. In the text-to-image setting, the method creates hard prompts for diffusion models, enabling API users to generate, discover, and mix and match image concepts without prior knowledge of how to prompt the model. In the text-to-text setting, the paper demonstrates that hard prompts can be automatically discovered and are effective in tuning language models for classification tasks.

The remainder of this article covers the implementation of the PEZ algorithm, the experimental setup, and the empirical evaluation of the approach in various applications, along with potential future research directions in discrete prompt optimization for generative models. Let’s take a deep look at how we can write better prompts!

Learning Hard Prompts and Prompt Inversion with the CLIP Model

CLIP model architecture

Let's dig deep into the key topic of this article: learning hard prompts and applying them using CLIP, a multimodal vision-language model. The goal is to develop a method that combines the advantages of existing discrete optimization methods with those of soft prompt optimization. By doing so, we aim to create an efficient, simple-to-use algorithm that optimizes hard prompts for specific tasks.

Learning Hard Prompts for CLIP

To learn hard prompts, we first define an objective function and maintain a sequence of learnable embeddings. This sequence consists of a fixed number of token vectors that we want to optimize. During the optimization process, a projection function maps the continuous embeddings to their nearest neighbors within the embedding matrix. This ensures that the prompts remain discrete and interpretable.

The PEZ optimization method combines the advantages of baseline discrete optimization methods with the power of soft prompt optimization. The key idea is to maintain a continuous iterate (soft prompt) during the optimization process while updating it with the gradient of the discrete vectors (hard prompt). This allows us to optimize hard prompts efficiently while leveraging the power of gradient-based methods.

Prompt Inversion with CLIP


The proposed learning method is well suited for multimodal vision-language models like CLIP. With these models, we can use PEZ to discover captions that describe one or more target images. Once discovered, these captions can be used as prompts for image generation applications.

Since CLIP comes with its own image encoder, its image-text similarity can serve as the loss function that drives the optimization process. We optimize hard prompts based on the cosine similarity between the prompt's text features and the target image's features from the image encoder, without the need to calculate gradients on the full diffusion model. This allows us to generate image captions that are optimized specifically for the task at hand.
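A minimal sketch of such a similarity-based objective follows. The feature vectors are made-up stand-ins for what CLIP's text and image encoders would return, and defining the loss as one minus the mean cosine similarity over the target images is an assumption of this example.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def inversion_loss(text_features, image_features_list):
    """One minus the mean cosine similarity between the prompt and each target image."""
    sims = [cosine_similarity(text_features, f) for f in image_features_list]
    return 1.0 - sum(sims) / len(sims)

# Hypothetical 2-dimensional features; real CLIP features have hundreds of dimensions.
target_images = [[3.0, 4.0]]
loss_aligned = inversion_loss([0.6, 0.8], target_images)      # same direction as the image
loss_orthogonal = inversion_loss([-4.0, 3.0], target_images)  # unrelated direction
print(loss_aligned, loss_orthogonal)
```

Because only the text and image encoders are involved, each evaluation is far cheaper than a pass through a diffusion model.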

The experiments use datasets such as LAION, MS COCO, Celeb-A, and Lexica.art, which consist of diverse images from various sources. The quality of learned prompts is measured through the semantic similarity between the original images and those generated using the prompts. The paper's experiments show that the prompts effectively capture the semantic features of the target images, and the generated images are highly similar to the originals.

COCO Dataset

The learned prompts are human-readable, containing a mix of real words and gibberish. However, the valid words included in the prompts provide a significant amount of information about the image. Interestingly, the optimization process may also include emojis, which seem to convey useful information while keeping the prompt concise. This demonstrates the power and flexibility of the optimization method in generating efficient hard prompts.

The method is also compared to other prompt engineering techniques, such as the CLIP Interrogator, which uses a large curated prompt dataset and a pre-trained captioning model. Results show that the method performs competitively, despite using fewer tokens and not relying on hand-crafted components.

Exploring the Impact of Prompt Length on Transferability: A Technical and In-Depth Analysis

A crucial aspect of prompt engineering is determining the optimal number of tokens for a prompt. The choice of prompt length significantly impacts the performance, generalizability, and transferability of the learned prompts. In this section, we present a more technical and in-depth analysis of how prompt length affects these factors.

The experiments involve analyzing the performance of prompts with varying lengths when generating images with diffusion models such as Stable Diffusion-v2. Researchers measured the quality of the prompts by calculating the semantic similarity between the original images and those generated using the prompts, as assessed by a larger reference CLIP model (OpenCLIP-ViT/G) not used during optimization.

PEZ Optimization outperforms many popular image generation setups with far fewer tokens

The results show that longer prompts do not necessarily produce better performance in image generation tasks. In fact, long prompts tend to overfit to the specific task they are optimized for and demonstrate reduced transferability to other tasks or models. This overfitting phenomenon can be attributed to longer prompts capturing more intricate details of the target image, which may not generalize well to other images or contexts.

Upon analyzing the performance of prompts of different lengths, researchers empirically find that a length of 16 tokens strikes a balance between expressiveness and generalizability. For example, when comparing the performance of the PEZ method to the CLIP Interrogator with varying token lengths, we observe that reducing the token length for the CLIP Interrogator leads to a sharp drop in performance. In contrast, the PEZ method maintains competitive performance with shorter prompts, showcasing its robustness while using fewer tokens.

It is essential to note that even though models like Stable Diffusion and CLIP share the same text encoder, soft prompts do not transfer well compared to hard prompts. This finding reinforces the value of optimizing hard prompts to achieve both interpretability and transferability.

To summarize, understanding the impact of prompt length on performance and transferability is crucial for effective prompt engineering. By selecting an appropriate prompt length, you can enhance the generalizability and portability of learned hard prompts, enabling more versatile and efficient generation tasks across different models and domains. 

Exploring Style Transfer and Prompt Concatenation: A Comprehensive and Technical Examination

In this section, we delve deeper into the technical aspects of style transfer and prompt concatenation using our learned hard prompts. Both of these applications showcase the versatility and flexibility of our optimization method in generating efficient hard prompts for various image generation tasks.

Style Transfer

The PEZ method can be easily adapted for style transfer, a process that involves extracting shared style characteristics from a set of examples and applying the style to new objects or scenes. To achieve this, the setup mirrors the one investigated with soft prompts in Gal et al. (2022), but uses learned hard prompts instead.

Given several examples sharing the same style, you can optimize a hard prompt that captures the common style elements. Then use this prompt to apply the style to new objects or scenes. The results demonstrate that the method effectively embeds the shared style elements in the prompt and applies them to novel concepts, thus enabling successful style transfer.

Learned hard prompts for style transfer. Source

These examples show how the method can learn a hard prompt that captures the essence of a particular style and transfer it to entirely new scenes or objects while preserving the original style's characteristics.

Example prompts from the research paper to generate the above images. 

Prompt Concatenation

Prompt concatenation is another powerful application of learned hard prompts, where you combine the prompts for two unrelated images to create a new hybrid image. This process highlights the composability and flexibility of our learned hard prompts in generating intricate scenes.

Concatenated learned prompts. 

To perform prompt concatenation, we first generate prompts for two unrelated images using our optimization method. Next, we fuse the images by concatenating their prompts, creating a new prompt that combines the semantic features of both images. This new prompt is then used to generate a mixed image that incorporates elements from both original images.

These examples illustrate how the PEZ method can merge different concepts, such as painted horses on a beach and a realistic sunset in a forest, by concatenating their optimized hard prompts. The resulting mixed images demonstrate the ability of our method to create complex and diverse scenes by simply combining prompts.

In conclusion, style transfer and prompt concatenation serve as compelling examples of the many applications that can benefit from the PEZ optimization method for learning hard prompts. By optimizing discrete text and leveraging the power of gradient-based methods, you can create efficient hard prompts that enable versatile and flexible image generation tasks across various domains.

Unveiling the Intricacies of Prompt Distillation: A Comprehensive Exploration

Prompt distillation is an important application of the optimization method, focused on reducing the length of prompts while preserving their capability. In this section, we provide a more technical, in-depth analysis of the prompt distillation process and discuss its relevance, along with real examples from the research paper.

Distillation is particularly useful in situations where the text encoder of the diffusion model has a limited maximum input length, such as the CLIP model, which has a maximum input length of 77 tokens. Additionally, long prompts may contain redundant and unimportant information, especially when hand-crafted. Therefore, the goal is to distill the essence of the longer prompts, preserving only the essential information in a shorter, more efficient prompt.

To achieve prompt distillation, PEZ optimizes a shorter prompt to match the features of the longer prompt as produced by the text encoder. Given a target prompt's embedding P_target and a shorter learnable prompt embedding P, the loss function becomes:

L = 1 - Sim(f(P_target), f(P))

Here, f is the text encoder and Sim denotes the similarity between the encoded prompts f(P_target) and f(P). By minimizing this loss function, you can learn a distilled prompt that captures the essential features of the longer prompt while using fewer tokens.

In the research paper, the authors present examples of images generated by the original prompts and the distilled prompts with four different distillation ratios: 0.7, 0.5, 0.3, and 0.1. These ratios represent the relationship between the length of the distilled prompt and the length of the target prompt. For instance, a distillation ratio of 0.1 means that the distilled prompt is only 10% the length of the original prompt.
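To make the ratios concrete, the arithmetic can be sketched as follows. The 40-token target length is a hypothetical example, and rounding to the nearest whole token (with a minimum of one) is an assumption of this sketch rather than a detail taken from the paper.

```python
def distilled_length(target_length, ratio):
    """Token budget for the distilled prompt: ratio times the target length, at least 1."""
    return max(1, round(target_length * ratio))

target_length = 40  # hypothetical length of the original prompt, in tokens
ratios = (0.7, 0.5, 0.3, 0.1)
lengths = [distilled_length(target_length, r) for r in ratios]
for r, n in zip(ratios, lengths):
    print(f"ratio {r}: {n} tokens")
```

For this hypothetical 40-token prompt, a ratio of 0.1 leaves a budget of just 4 tokens.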


The results show that even with only 3 or 4 tokens, the distilled hard prompts can still generate images that are very similar in concept to those produced by the original, longer prompts. This demonstrates the success of the prompt distillation process in creating shorter, more efficient prompts while maintaining their effectiveness in guiding image generation tasks.

To summarize optimized prompt inversion with CLIP: learning hard prompts through optimization provides a powerful and flexible approach to prompt engineering. The technique, which combines gradient-based optimization with discrete token selection, unlocks new possibilities for image generation, style transfer, prompt concatenation, and prompt distillation. This is a significant step forward in the difficult domain of text-guided image generation. Prompt optimization techniques are already popular in NLP, where tools like OpenAI Evals and log probabilities make it a bit easier to correlate outputs with specific features of the inputs. The strides shown here are starting to bridge that gap.

Learning Hard Prompts and Discrete Prompt Tuning with Language Models

In this deep dive, we will explore the application of learning hard prompts in the context of language models. We will focus on how the PEZ optimization method can be adapted for text-to-text tasks, enabling the discovery of effective prompts for language classification tasks.

Mastering the Art of Crafting Effective Prompts for Language Models with PEZ

When working with language models, the goal is to discover a discrete sequence of tokens (hard prompt) that will guide the language model to predict the outcome of a classification task. One important aspect of text is its fluency, which can improve both the readability and performance of a prompt.

To optimize hard prompts for language models, researchers define an objective function that consists of a weighted combination of task loss and fluency loss. By doing so, we can learn prompts that are not only effective in solving the task but also maintain a certain level of fluency for better interpretability.

The Method: Adapting PEZ for Language Models

Adapting the PEZ method for language models involves a few key steps. First, we choose a template and verbalizer for the task. The template is a sentence structure with placeholders for the input text and prompt, while the verbalizer maps each class label to a word in the vocabulary, so that the model's logits for those words can be read as class predictions. This helps in aligning the optimization process with the specific language classification task.
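A small sketch of how a template and verbalizer might fit together is shown below. The template wording, the label words, and the logit values are all hypothetical, and the verbalizer is treated here as a mapping from class labels to vocabulary words.

```python
# Hypothetical template with placeholders for the learned prompt and the input text.
TEMPLATE = "{prompt} {text} It was"

# Hypothetical verbalizer: each class label is tied to one vocabulary word.
VERBALIZER = {"positive": "good", "negative": "bad"}

def build_input(prompt, text):
    """Fill the template with the hard prompt and the example to classify."""
    return TEMPLATE.format(prompt=prompt, text=text)

def classify(next_token_logits, verbalizer):
    """Pick the class whose label word receives the highest next-token logit."""
    return max(verbalizer, key=lambda label: next_token_logits[verbalizer[label]])

model_input = build_input("review sentiment:", "A warm, funny film.")
# Made-up logits standing in for a language model's next-token scores.
toy_logits = {"good": 2.1, "bad": 0.3, "the": 5.0}
prediction = classify(toy_logits, VERBALIZER)
print(model_input)
print(prediction)
```

Only the logits of the label words matter for the prediction, so a high score for an unrelated word such as "the" is ignored.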

Next, we initialize the learnable embeddings, which are updated during the optimization process using the gradient of the discrete vectors (hard prompt). Similar to the image generation scenario, we use a projection function to map continuous embeddings to their nearest neighbors in the embedding matrix, ensuring that our prompts remain discrete and interpretable.

Finally, we optimize the hard prompts based on a weighted combination of task loss and fluency loss. This allows us to find prompts that are effective in solving the classification task while preserving fluency and interpretability.
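The weighted objective can be sketched as below. The convex weighting with a single coefficient `lam`, the use of cross-entropy as the task loss, and the mean negative log-likelihood as the fluency loss are assumptions of this sketch; the probabilities are made up for illustration.

```python
import math

def task_loss(class_probs, label):
    """Cross-entropy of the predicted class distribution against the true label index."""
    return -math.log(class_probs[label])

def fluency_loss(token_log_probs):
    """Mean negative log-likelihood of the prompt tokens under a language model."""
    return -sum(token_log_probs) / len(token_log_probs)

def combined_loss(task, fluency, lam):
    """Weighted combination: lam = 0 ignores fluency, lam = 1 ignores the task."""
    return (1.0 - lam) * task + lam * fluency

# Made-up numbers standing in for model outputs.
task = task_loss([0.25, 0.75], label=1)
fluency = fluency_loss([math.log(0.5), math.log(0.25)])
total = combined_loss(task, fluency, lam=0.3)
print(round(task, 4), round(fluency, 4), round(total, 4))
```

Raising `lam` trades task accuracy for prompts that read more like natural text.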

Discrete Prompt Tuning with Language Models: Results and Applications

Prompt Transferability: Technical Insights and In-Depth Analysis

The ability to transfer prompts across different language models is a key advantage of the PEZ method. In the experiments, prompts were generated using GPT-2 Large for 5,000 steps. The top five prompts with the highest average validation accuracy for each technique were selected and tested on larger models. The models used for transferability testing included GPT-2 XL, T5-LM-XL, OPT-2.7B, and OPT-6.7B.
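The selection step can be sketched as a simple ranking. The prompt names and accuracy values below are invented placeholders, and averaging over several evaluation runs is an assumption about what "average validation accuracy" means here.

```python
def top_k_prompts(val_accuracies, k=5):
    """Rank prompts by mean validation accuracy and return the k best."""
    means = {prompt: sum(accs) / len(accs) for prompt, accs in val_accuracies.items()}
    return sorted(means, key=means.get, reverse=True)[:k]

# Invented placeholder results: prompt name -> validation accuracies across runs.
results = {
    "prompt_a": [0.80, 0.84],
    "prompt_b": [0.90, 0.70],
    "prompt_c": [0.60, 0.62],
    "prompt_d": [0.88, 0.86],
}
selected = top_k_prompts(results, k=2)
print(selected)  # ['prompt_d', 'prompt_a']
```

The selected prompts would then be evaluated unchanged on the larger models, since hard prompts are plain token sequences.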

Gradient-based approaches dominate.

The research findings indicate that simply scaling a model, without additional training, does not guarantee that performance will scale accordingly. However, all gradient-based methods, including PEZ, were able to transfer prompts effectively compared to evaluating just the template. In particular, prompts trained with the fluency constraint (PEZ with fluency) transferred better than other methods.

For example, on the AGNEWS dataset, the PEZ method with fluency achieved a 14% increase in performance over the template baseline when transferred to the OPT-6.7B model. Furthermore, the AGNEWS prompts were able to transfer from GPT-2 Large to GPT-2 XL, showcasing the reliability of the method's transferability across different models.

It is worth noting that the transferability of prompts is not solely dependent on the optimization method, but also on the specific task and dataset. Nevertheless, PEZ with fluency consistently demonstrates an improved ability to transfer prompts across various language models, making it a valuable tool for prompt engineering in diverse natural language processing tasks.

Few-Shot Learning Reaps the Benefits

The PEZ method can also be applied in few-shot settings, where we have limited examples from each class to train the prompt. By optimizing the prompts using only a few samples, we can achieve high validation accuracy compared to other methods. The efficiency of the gradient-based approach enables fast exploration and discovery of novel prompts, making it an attractive option for prompt engineering in low-resource scenarios.

Qualitative Analysis

Results in the sentiment analysis domain compared to other methods. PEZ outperforms other methods on AGNEWS. Prompts are not standard English: “negative vibeThis immatureollywood MandarinollywoodThis energetic screenplay”. This tells us the optimization can find words that help reach the goal output, but it does not show much creativity. Source

Upon examining the top prompts generated by the method, we find that many of them are coherent and relevant to the classification task. For example, in the news classification task, some prompts include news sources like "BBC" or consider the text as coming from a blog, such as "Brian blog" or "Blog Revolution analyze." This demonstrates the potential of the PEZ method to discover interesting and interpretable prompts that can be used in various language tasks.

In summary, learning hard prompts through optimization offers a powerful approach for prompt engineering in the context of language models. By adapting the PEZ method for text-to-text tasks, we can discover effective and fluent prompts for language classification tasks that transfer well across different models and perform well in few-shot scenarios. This opens up new possibilities for harnessing the power of generative models in a wide range of natural language processing applications. 

Is This Worth Using? Key Benefits and Reasons to Adopt the PEZ Method

In conclusion, the PEZ method for learning hard prompts through optimization offers a powerful and versatile approach to prompt engineering in the context of both image generation and language models. Based on the results and insights from the research paper, we believe that this method is worth using for a variety of reasons:

1. Improved performance: The PEZ method consistently delivers competitive or superior performance in various tasks, such as image generation, sentiment analysis, and news classification. By combining the best traits of hard and soft prompt techniques, the method efficiently optimizes prompts for given tasks.

2. Transferability: One of the key advantages of the PEZ method is its ability to transfer prompts across different models. This is particularly useful when scaling up a model without additional training, as it allows the hard prompts to reliably boost performance.

3. Few-shot learning: The efficiency of the gradient-based approach enables fast exploration and discovery of novel prompts in low-resource scenarios. This makes the PEZ method a valuable tool for prompt engineering when only a few examples from each class are available.

4. Interpretability and fluency: By incorporating fluency constraints into the optimization process, this method generates prompts that are not only effective in solving the task but also maintain a certain level of fluency for better interpretability.

5. Flexibility and composability: The PEZ method can easily be adapted for various applications, such as style transfer, prompt concatenation, and prompt distillation. This highlights the versatility and adaptability of our learned hard prompts in different scenarios.

Considering these key benefits, we believe that the PEZ method for learning hard prompts is worth using in a wide range of generative model applications. By continuing to refine and improve these methods, we can further harness the potential of generative models in various image generation, natural language processing, and multimodal tasks.

Want to hire expert prompt engineers?

Width.ai builds custom GPT tools for some of the largest companies in the world. We’ve written 1000s of prompts and leverage awesome optimization tools (like the ones you see) to build production level systems with SOTA accuracy. Let’s schedule a time to talk about the prompt based products you want to build!