DALL-E Mini: Open-Source AI Image Generation for Your Business
Can an open-source image generator hold its own against OpenAI's DALL-E 2? We put DALL-E Mini through its paces and show how to plug it into your content and social media workflows.
The fusion of large language models with image and video generation gives businesses the ability to use the visual medium to promote themselves more effectively. Sometimes these are just images created to go with your marketing content. In other cases, they may be crucial in demonstrating your services. For example, an interior design business can let customers speak or type what they have in mind and conjure up an entire matching 3D model of the interiors that they can visualize using virtual reality glasses.
Such use cases are made possible thanks to image generation AI models like DALL-E Mini. In this article, we explore its capabilities.
DALL-E Mini is an open-source AI image generation pipeline that produces photos or illustrations based on your text descriptions. It provides pre-trained text-to-image generation models that you can host on your servers and use for commercial or personal purposes.
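If you want to self-host it, the flow below is a minimal sketch based on the dalle-mini project's inference notebook: load the BART model and the VQGAN, tokenize the prompt, generate image tokens, and decode them into pixels. The package names, checkpoint references, and arguments reflect the repository as of its 2022 releases and may have changed since, so treat this as a starting point rather than the definitive API.

```python
# pip install dalle-mini (plus the vqgan-jax package from the project's repo)
import jax
import jax.numpy as jnp
import numpy as np
from PIL import Image
from dalle_mini import DalleBart, DalleBartProcessor
from vqgan_jax.modeling_flax_vqgan import VQModel

DALLE_MODEL = "dalle-mini/dalle-mini/mini-1:v0"        # the smaller "mini" checkpoint
VQGAN_REPO = "dalle-mini/vqgan_imagenet_f16_16384"     # VQGAN used to decode image tokens

# Load the BART seq2seq model (text -> image tokens) and the VQGAN (image tokens -> pixels)
model, params = DalleBart.from_pretrained(DALLE_MODEL, dtype=jnp.float16, _do_init=False)
vqgan, vqgan_params = VQModel.from_pretrained(VQGAN_REPO, _do_init=False)
processor = DalleBartProcessor.from_pretrained(DALLE_MODEL)

prompt = "a realistic photo of a car on a freeway with mountains in the background"
tokenized = processor([prompt])

# Generate a sequence of VQGAN codebook indices conditioned on the prompt
generated = model.generate(
    **tokenized,
    prng_key=jax.random.PRNGKey(0),
    params=params,
    condition_scale=10.0,   # "super conditioning" strength
)
image_tokens = generated.sequences[..., 1:]   # drop the beginning-of-sequence token

# Decode the image tokens into pixels and save the result
pixels = vqgan.decode_code(image_tokens, params=vqgan_params)
pixels = np.asarray(pixels.clip(0.0, 1.0).reshape((-1, 256, 256, 3)) * 255, dtype=np.uint8)
Image.fromarray(pixels[0]).save("car_freeway.png")
```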
DALL-E Mini is also available as a paid online service called Craiyon that's run by its creators. It uses a more recent and heavier version of DALL-E Mini called DALL-E Mega. There's also a free-to-use version by Craiyon hosted on Hugging Face.
How does DALL-E Mini compare with OpenAI's popular DALL-E 2 service, and which one should your business use? We tried out their image generation on some prompts relevant to our industry.
This prompt asked for a realistic image of a car on a freeway with mountains in the background. It's a good gauge of model ability because of the multi-step complexity it demands: a goal object (a car) in a set environment (a freeway) with a sense of viewpoint and boundaries (mountains in the background).
For this prompt, self-hosted DALL-E Mini generated this image:
It's OK as an illustration but the weird wheels and road make it unrealistic.
Craiyon, the hosted version of DALL-E Mini, generated these more realistic-looking images, though a couple of them look like illustrations:
DALL-E 2 generated the illustrations below for the same prompt. It appears that the word "realistic" in the prompt has no effect on it, and it probably prefers the word "photorealistic." Sensitivity to prompts is a well-known aspect of the service and has given rise to a cottage industry of prompt engineering and prompt marketplaces. This is a great example of DALL-E Mini outperforming DALL-E 2 with a structured prompt.
For this prompt, which involved a drone and a futuristic city skyline, DALL-E Mini generated a skyline but seems to have interpreted the prompt as a view from a drone, so it doesn't show a drone. Whether you'd want the drone itself in the image is up to you, and the models offered different interpretations. Even if you're fine with either reading of the angle, the Craiyon-hosted model creates a much more complex image.
Craiyon generated these awesome-looking, evocative images that capture the futuristic intent of the prompt:
Base model DALL-E 2 disappointed with these cartoonish images:
It's honestly surprising that, with basic prompts, DALL-E 2 produces images that don't match DALL-E Mini's quality. You'd expect prompt optimization to be needed for the open-source model, not OpenAI's implementation.
For this prompt about the use of AI in law, DALL-E Mini generated this rather irrelevant-looking drawing:
Craiyon's images are visually better, but don't depict the lawyer concept at all:
DALL-E 2 generated these illustrations that capture the lawyer vibe a bit better:
In all three experiments, self-hosted DALL-E Mini did OK while Craiyon's DALL-E Mini generated better-looking and more evocative images compared to DALL-E 2. Keeping in mind that all the prompts were simple, didn't require any time to craft, and longer prompts don't help, we can say that the image-quality-to-prompt-simplicity ratio of Craiyon's DALL-E Mini is far higher.
If you want to provide a quick and satisfactory image generation service for casual users who don't have the time to craft prompts, Craiyon's DALL-E Mini is the better option. For more creative power users who have the time to craft long prompts, DALL-E 2 may be better.
Some other aspects to think about:
The crucial component in the DALL-E Mini pipeline is a sequence-to-sequence (seq2seq) decoder network based on the bidirectional and auto-regressive transformer model (BART).
While a regular BART network takes in text tokens and generates text tokens, DALL-E Mini's BART seq2seq decoder takes in the text of your prompt and generates semantically relevant image tokens based on them. Then a second network, a vector-quantized generative adversarial network (VQGAN), converts these image tokens into an image.
Overall, this pipeline consists of four separate components:
- A BART text encoder that converts the prompt into text embeddings
- A BART seq2seq decoder that predicts image tokens conditioned on those embeddings
- A VQGAN with an image encoder (used during training) and an image decoder (used during inference) that maps between images and image tokens
- An optional CLIP model that ranks the generated candidates against the prompt
How they work during training and inference is explained in the next two sections.
DALL-E Mini used about 15 million caption-image pairs to train the crucial BART seq2seq decoder. For some datasets, the text caption for an image included both titles and descriptions.
For each pair, a VQGAN encoder generates image tokens from the image. It does this using a codebook that maps a set of image features to a token. Conceptually, this is like the visual-bag-of-words approach of traditional computer vision. The image features are obtained using a convolutional feature extractor.
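To make the codebook idea concrete, here is a toy sketch of the quantization step. The codebook and patch features are random stand-ins; in the real VQGAN, both are learned.

```python
import torch

# Toy stand-ins: the real VQGAN learns the codebook and extracts features with a conv net
codebook = torch.randn(16384, 256)          # 16,384 codebook entries, each a 256-dim vector
patch_features = torch.randn(16 * 16, 256)  # features for a 16x16 grid of image patches

# Each patch feature is mapped to the index of its nearest codebook entry
distances = torch.cdist(patch_features, codebook)   # (256, 16384) pairwise distances
image_tokens = distances.argmin(dim=-1)             # 256 discrete image tokens
print(image_tokens.shape)                           # torch.Size([256])
```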
Simultaneously, the BART encoder generates text embeddings for the caption, which is either a title or a description or their combination.
Next, the BART seq2seq decoder combines the image tokens and the embeddings and generates another sequence of image tokens based on both of them.
The generated sequence of image tokens is compared with the VQGAN's sequence using cross-entropy loss. Eventually, the seq2seq decoder learns to predict image tokens conditioned on text captions.
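The toy sketch below illustrates this training objective: the decoder receives the ground-truth image tokens shifted right (teacher forcing), predicts logits over the codebook, and is penalized with cross-entropy wherever its predictions disagree with the VQGAN's tokens. The tiny decoder and the random text embeddings are stand-ins, not the real BART networks.

```python
import torch
import torch.nn.functional as F

CODEBOOK_SIZE = 16384   # VQGAN codebook entries
IMG_TOKENS = 256        # image tokens per image (a 16x16 grid)
D_MODEL = 1024
BOS = CODEBOOK_SIZE     # extra token id marking the start of the image sequence

class ToyImageTokenDecoder(torch.nn.Module):
    """Toy stand-in for the BART seq2seq decoder: embeds image tokens,
    cross-attends to the text embeddings, and predicts codebook logits."""
    def __init__(self):
        super().__init__()
        self.token_embedding = torch.nn.Embedding(CODEBOOK_SIZE + 1, D_MODEL)
        self.cross_attention = torch.nn.MultiheadAttention(D_MODEL, num_heads=8, batch_first=True)
        self.to_logits = torch.nn.Linear(D_MODEL, CODEBOOK_SIZE)

    def forward(self, image_tokens, text_embeddings):
        x = self.token_embedding(image_tokens)
        x, _ = self.cross_attention(x, text_embeddings, text_embeddings)
        return self.to_logits(x)   # (batch, sequence, CODEBOOK_SIZE)

decoder = ToyImageTokenDecoder()
text_embeddings = torch.randn(2, 64, D_MODEL)                     # from the BART text encoder
target_tokens = torch.randint(0, CODEBOOK_SIZE, (2, IMG_TOKENS))  # from the VQGAN image encoder

# Teacher forcing: prepend BOS and drop the last target token to form the decoder input
decoder_input = torch.cat([torch.full((2, 1), BOS), target_tokens[:, :-1]], dim=1)
logits = decoder(decoder_input, text_embeddings)

# Cross-entropy between predicted and ground-truth image tokens
loss = F.cross_entropy(logits.reshape(-1, CODEBOOK_SIZE), target_tokens.reshape(-1))
loss.backward()
print(float(loss))
```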
For inference, the pipeline receives just a text prompt and must generate images for it.
The BART text encoder again generates embeddings for the prompt. Because there's no input image, the VQGAN image encoder is not used during inference.
The seq2seq BART decoder takes in the text embeddings. Since it's trained to expect both text and image sequences, a dummy image token sequence consisting of just one token that marks the beginning of a sequence is supplied to get the prediction going.
The trained decoder, which is autoregressive, then predicts the next image token in the sequence, with each prediction based on the text embeddings as well as all previously generated image tokens. Its output is a sequence of image tokens.
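Conceptually, this step is the autoregressive loop sketched below. The decoder here is a random stub so the sketch runs without model weights, and the real pipeline uses smarter sampling (top-k, temperature, and super conditioning) rather than plain sampling from raw logits.

```python
import torch

CODEBOOK_SIZE = 16384   # VQGAN codebook size
IMG_TOKENS = 256        # image tokens per generated image
BOS = CODEBOOK_SIZE     # "beginning of image sequence" marker

def decoder_next_token_logits(image_tokens, text_embeddings):
    """Stand-in for the trained BART seq2seq decoder. The real model returns
    logits over the codebook for the next token, conditioned on the text
    embeddings and all previously generated image tokens."""
    return torch.randn(image_tokens.shape[0], CODEBOOK_SIZE)

text_embeddings = torch.randn(1, 64, 1024)   # from the BART text encoder
tokens = torch.full((1, 1), BOS)             # dummy sequence: just the BOS marker

for _ in range(IMG_TOKENS):
    logits = decoder_next_token_logits(tokens, text_embeddings)
    next_token = torch.multinomial(torch.softmax(logits, dim=-1), num_samples=1)  # sample one token
    tokens = torch.cat([tokens, next_token], dim=1)

image_tokens = tokens[:, 1:]   # 256 codebook indices, ready for the VQGAN decoder
print(image_tokens.shape)      # torch.Size([1, 256])
```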
Next, the predicted sequence of image tokens is converted to a set of probable images by the VQGAN decoder. Optionally, a pre-trained contrastive language-image pretraining model (CLIP) can be added to the pipeline to select the generated image that's closest to the given prompt.
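For that optional CLIP step, one straightforward approach is to score every candidate image against the prompt and keep the best one. Here's a sketch using the Hugging Face transformers CLIP implementation; the random-noise candidates are placeholders for the VQGAN decoder's outputs, and the official pipeline uses its own CLIP checkpoint rather than the base one shown here.

```python
import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Placeholder candidates; in the real pipeline these come from the VQGAN decoder
candidates = [
    Image.fromarray(np.random.randint(0, 255, (256, 256, 3), dtype=np.uint8))
    for _ in range(4)
]
prompt = "a realistic photo of a car on a freeway with mountains in the background"

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

inputs = processor(text=[prompt], images=candidates, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = clip(**inputs)

scores = outputs.logits_per_image[:, 0]   # similarity of each candidate to the prompt
best = candidates[int(scores.argmax())]   # keep the highest-scoring image
print(scores)
```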
In this experiment, we tested DALL-E Mini's image generation by ratcheting up the prompt complexity from simple and specific to very abstract.
Photos that show people in identifiable ways expose you to litigation risk from those people, even if the photographer has granted you a full license. This is a common problem with stock imagery. Obviously, getting permission from the individuals in photos is impractical. But the risks are real, especially if you use a photo alongside content they don't approve of. Image generation is one way to avoid such permission and licensing issues.
Our first experiment is a simple prompt to show a person: "A girl playing golf, camera on the ground behind her."
This is what Craiyon's DALL-E Mini generates:
While it's clearly a golf action shot from behind, we can see that DALL-E Mini struggles with hand placement relative to the golf club. This is a common problem with image generation models: hands and fingers are hard to generate interacting with objects in a way that looks real.
For medium difficulty, we ask it to generate something specific from a specialized domain like health care where getting images is sometimes difficult.
We use the following prompt: "A chest X-ray, lungs colored green, realistic."
The images generated are impressive. They're probably not anatomically correct but they can pass as X-ray images, especially coming from a base model. Fine-tuning this model on a dataset of these images would greatly improve these results.
The last prompt is an intentionally abstract one: "Document understanding using computer vision and natural language processing."
We were hoping for images that evoked ideas of AI and automation. Here are some of the selected results that do exactly that and resemble stock illustrations:
The images also look coherent despite the highly abstract prompt.
For your content marketing goals, you often want to generate relevant photos and illustrations for all your blog posts and white papers. Typical business problems here are finding relevant imagery, licensing uncertainties, and the cost of stock imagery. All of these can be solved with a pipeline that examines your content and generates relevant photos or illustrations.
In this section, we experiment with an end-to-end automated pipeline that analyzes your content using natural language processing, generates an optimized prompt, and passes it to DALL-E Mini for image generation.
The automated pipeline consists of multiple components (a simplified code sketch of the flow follows the list):
- A content fetcher that pulls in the blog post or white paper and strips it down to plain text
- A topic extraction component (we used GPT-4) that identifies the key topics of the content
- An image prompt generation and optimization component that turns those topics into prompts suited to image generation
- The DALL-E Mini image generator itself
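Below is a simplified sketch of how these components can be wired together. The article URL, the prompt template, and the commented-out generate_images call are hypothetical stand-ins for your own content source, prompt optimizer, and DALL-E Mini deployment; the GPT-4 call uses the openai v1 Python client.

```python
import requests
from bs4 import BeautifulSoup
from openai import OpenAI   # openai>=1.0 client

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def fetch_article_text(url: str) -> str:
    """Content fetcher: pull the article and strip it down to plain text."""
    html = requests.get(url, timeout=30).text
    return BeautifulSoup(html, "html.parser").get_text(separator=" ", strip=True)

def extract_key_topics(text: str) -> str:
    """Topic extraction: ask GPT-4 for the article's key topics as short sentences."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": "List the 3 key topics of this article as short sentences:\n\n" + text[:6000],
        }],
    )
    return response.choices[0].message.content

def build_image_prompt(topics: str) -> str:
    """Prompt generation: a deliberately simple, hypothetical stand-in for a prompt optimizer."""
    return f"Stock-style illustration of: {topics}. Clean, modern, no text."

article = fetch_article_text("https://example.com/using-gpt-3-for-topic-extraction")  # placeholder URL
prompt = build_image_prompt(extract_key_topics(article))
# images = generate_images(prompt)   # hand the optimized prompt to your DALL-E Mini deployment
print(prompt)
```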
For a test run, we ran this pipeline on one of our articles, Using GPT-3 for Topic Extraction.
GPT-4 identified the following key topics in a sentence format:
We can take these key topics and run them through our image prompt generation model to turn them into language better suited for prompting. Craiyon's DALL-E Mini came up with these images for this article:
Source: Craiyon
Next, we supplied the prompt and DALL-E Mini's initial image to the prompt optimization framework. The prompt optimization steps are shown below:
From there, we were able to generate some more blog cover images. We finally see an image related to ideas like asset management and document processing. Putting more emphasis on negative words during prompt optimization, or fine-tuning the model, would get rid of the poorly related images.
The final use case we explore is image creation for social media posts. From surveys and studies by social media companies, we know that visual content garners more eyes and responses. LinkedIn posts with images attract twice the comments and engagement compared to those without, and tweets with images or videos see three times as much engagement. Carousel posts (where you swipe and see up to 10 images in one post) get more interaction than both Reels and single-photo posts on Instagram feeds.
If you want to meet your KPIs for social media, you know that you need visual content. And it needs to come from somewhere.
We explore how DALL-E Mini image generation can be integrated into social media workflows.
The image generation pipeline for social media is similar to the one for blog posts and other long-form content shown above. But instead of a content fetcher, the first component is typically a social media publisher that takes a post and distributes it to multiple social media platforms simultaneously through their APIs.
Another difference is that social media posts typically contain more slang, cultural references, and other informal language than long-form content. Pretrained image generation models may stumble here because their language models were not trained on such language.
A final difference is that social media images typically need better moderation to avoid bans and prevent any reputational damage from images that may be misinterpreted.
All these differences translate into a need for creating a custom image generation model that's better tuned to social media's norms. In the next section, we explore how a self-hosted image generator like DALL-E Mini can be fine-tuned.
For this example, we used an article that reads more like an informational blog post and less like the case study post above. We can see the model getting closer to capturing exact ideas from the blog post, such as "loan processing." Fine-tuning this model would further improve the quality and precision.
Every component in the DALL-E Mini pipeline can be fine-tuned to suit your needs. Fine-tuning involves these general steps:
- Collect caption-image pairs that represent the language and imagery the pipeline currently handles poorly
- Split them into training and test sets
- Continue training the relevant component on the training set, starting from the released weights
- Evaluate the fine-tuned pipeline on the held-out test set and iterate
In the case of fine-tuning for social media, you can start by identifying caption-image pairs where the captions come from social media and the pipeline is not generating good images. Some reasons may include:
- Slang, cultural references, and other informal language that the pretrained language model has not seen
- Visual styles common on social media that are underrepresented in the original training images
- Generated images that would fail your moderation or brand-safety checks
Gather a couple of hundred such problematic caption-image pairs and split them into training and test sets (a small data-preparation sketch follows this list). Then consider the following fine-tuning possibilities:
- Fine-tune the BART seq2seq decoder on the new caption-image pairs so it learns to map informal language to relevant image tokens
- Fine-tune the VQGAN on imagery from your domain so its codebook better captures that visual style
- Fine-tune or swap the CLIP ranking model so the most suitable candidate image gets selected
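Whichever of these you choose, the data preparation step looks the same. Here's a small sketch using the Hugging Face datasets library; the captions and image paths are made-up examples of the informal social media language discussed above.

```python
from datasets import Dataset

# Hypothetical caption-image pairs collected from posts the pipeline handled poorly
pairs = {
    "caption": [
        "that friday feeling when the sprint finally ships",
        "POV: the model finally gets your team's slang",
    ],
    "image_path": ["images/post_001.jpg", "images/post_002.jpg"],
}

dataset = Dataset.from_dict(pairs)
splits = dataset.train_test_split(test_size=0.2, seed=42)   # hold out a test set
train_set, test_set = splits["train"], splits["test"]
print(len(train_set), len(test_set))
```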
DALL-E Mini fine-tuning typically requires a few hours to days of TPU time or its GPU equivalent.
In this article, you saw DALL-E Mini's capabilities as well as those of its alternatives. The ability of AI to generate any kind of photo or illustration makes it far more productive, and potentially less expensive, than relying on stock imagery, especially if you need highly domain-specific imagery that is rare.
If you want to systematically improve your marketing campaigns and content quality with customized images, contact us.