DALL-E Mini Guide for Generating Photos and Illustrations for Your Content

Matt Payne
June 13, 2023

AI art generated by Craiyon's DALL-E Mini (Source: Craiyon)

The fusion of large language models with image and video generation has given businesses new ways to use visual media for promotion. Sometimes, these are just images created to accompany your marketing content. In other cases, they may be crucial in demonstrating your business services. For example, an interior design business can let its customers speak or type what they have in mind and conjure up an entire matching 3D model of the interiors that they can view using virtual reality glasses.

Such use cases are made possible thanks to image generation AI models like DALL-E Mini. In this article, we explore its capabilities.

What Is DALL-E Mini?

DALL-E Mini is an open-source AI image generation pipeline that produces photos or illustrations based on your text descriptions. It provides pre-trained text-to-image generation models that you can host on your servers and use for commercial or personal purposes.

DALL-E Mini is also available as a paid online service called Craiyon that's run by its designers. It uses a more recent and heavier version of DALL-E Mini called DALL·E Mega.

Is DALL-E Mini Available to the Public?

Yes! You can use a free-to-use version by Craiyon hosted on Hugging Face.

DALL-E Mini vs. OpenAI DALL-E 2

How does DALL-E Mini compare with OpenAI's popular service and which one should your business use? We tried out their image generation on some prompts that are relevant to our industry.

Prompt 1: A Realistic Car on a Freeway With Mountains in the Background

This prompt is a good gauge of model ability because it asks for several things at once: a target object (car), a set environment (freeway), and a sense of viewpoint and boundaries (mountains in the background).

For this prompt, self-hosted DALL-E Mini generated this image:

car generation with Dalle-e mini

It's OK as an illustration but the weird wheels and road make it unrealistic.

Craiyon, the hosted version of DALL-E Mini, generated these more realistic-looking images, though a couple of them look like illustrations:

photo-realistic ai art generations with Dalle-e mini
Source: Craiyon

DALL-E 2 generated the illustrations below for the same prompt. It appears that the word "realistic" in the prompt has no effect on it, and it probably prefers the word "photorealistic." Sensitivity to prompts is a well-known aspect of the service and has given rise to a cottage industry of prompt engineering and prompt marketplaces. This is a great example of DALL-E Mini outperforming DALL-E 2 on a simple, structured prompt.

cartoon ai art generations with dalle-e mini

Prompt 2: AI Drone Over New York, Illustration

For this prompt, self-hosted DALL-E Mini generated a skyline but seems to have interpreted the prompt as a view from a drone, so it doesn't show a drone. Whether the image should include a drone is up to you, but the two models interpreted the prompt differently. Even if you're fine with both interpretations, the Craiyon-hosted model creates a much more complex image.

drone shot example

Craiyon generated these awesome-looking, evocative images that capture the futuristic intent of the prompt:

images that include drones that are generated with the above prompt
Source: Craiyon

Base model DALL-E 2 disappointed with these cartoonish images:

cartoonish dalle base model images that do not perform well

It's honestly surprising that DALL-E 2 produces lower-quality images than Mini with base prompts. You'd expect prompt optimization to be needed for the open-source model, not OpenAI's implementation.

Prompt 3: Lawyer Robot, Pencil Sketch

For this prompt about the use of AI in law, DALL-E Mini generated this rather irrelevant-looking drawing:

pencil sketch lawyer robot

Craiyon's images are visually better, but don't depict the lawyer concept at all:

dalle-2 mini much better lawyer robot generations
Source: Craiyon

DALL-E 2 generated these illustrations that capture the lawyer vibe a bit better:

dall-e 2 generated images for lawyer robot


In all three experiments, self-hosted DALL-E Mini did OK while Craiyon's DALL-E Mini generated better-looking and more evocative images compared to DALL-E 2. Keeping in mind that all the prompts were simple, didn't require any time to craft, and longer prompts don't help, we can say that the image-quality-to-prompt-simplicity ratio of Craiyon's DALL-E Mini is far higher.

If you want to provide a quick and satisfactory image generation service for users who don't have the time to craft prompts, Craiyon's DALL-E Mini is the better option. For more creative power users who have the time to craft long prompts, DALL-E 2 may be better.

Other Aspects of Comparison

Some other aspects to think about:

  • Self-hosting vs. managed service: The DALL-E Mini models are completely free and licensed for any commercial use. You can host DALL-E Mini on your servers and pay for just the infrastructure. In contrast, DALL-E 2 is only available as a managed online service from OpenAI. DALL-E Mini is also available as a paid service via Craiyon, with various performance, deployment, and pricing tiers.
  • Custom data and fine-tuning: An important motivation for self-hosting DALL-E Mini is to be able to train or fine-tune custom models for your proprietary datasets and prompts, for example, in a different language, cultural environment, or domain-specific context. Craiyon's top-tier plan also supports custom models. OpenAI does not support any fine-tuning whatsoever for DALL-E 2. This is a key idea that came out of this research and comparison for us. Fine-tuning is the best way to direct and "steer" a model towards your specific use case and goal state output images. Through fine-tuning, you can teach the model exactly what you mean by "lawyer robot" or to include the drone based on the prompt. Because DALL-E Mini is already on par with or even better than DALL-E 2 in its base version and can also be fine-tuned, it's the most promising of the options we evaluated.
  • Model size: DALL-E Mini weighs in at about 400 million parameters while DALL-E 2 reportedly has around 3.5 billion parameters and the first DALL·E model had 12 billion. This shows the efficacy of a lighter model to provide considerable value, but it also results in functional shortcomings.
  • Infrastructure requirements: In practice, you'll need at least one 12 GB or higher GPU to run DALL-E Mini, or alternatively, use a powerful multi-CPU server and rely on the JAX framework's distributed computing capabilities. One or two CPUs are just not enough — generating just one image can take 5-10 minutes. In contrast, DALL-E 2 can be run from any device since it's just a network request.
  • Pricing: Self-hosted DALL-E Mini mainly involves infrastructure pricing. Craiyon has monthly fixed pricing with unlimited images. OpenAI DALL-E 2 pricing is pay-as-you-go and resolution-based. For example, $5/month buys unlimited images on Craiyon and around 250 images (1024x1024) on DALL-E 2.
  • Price savings on stock imagery: The cost of DALL-E Mini or alternative image generation may be compensated by the expenses you save on stock imagery. So, the total cost of operations must be considered here.
  • Image resolution: The DALL-E Mini pre-trained model can only produce low-resolution 256x256 images by default. Higher resolutions require changing the model configuration and retraining it from scratch, which is not a trivial undertaking. Another solution is to add a super-resolution model to upscale the generated images. Both DALL-E 2 and Craiyon support higher resolutions up to 1024x1024 pixels.
  • Complex prompts: DALL-E 2 is responsive to long, complex prompts. In contrast, DALL-E Mini's designers state that it does not respond significantly differently to long prompts. If you have very particular image requirements, you may be better off with DALL-E 2. Once again, fine-tuning changes this for us and does allow our own implementation of DALL-E to learn these longer prompts.
  • Content moderation: Another reason for using DALL-E Mini is if OpenAI's content policies don't suit your business.
  • Bad at faces: DALL-E Mini-generated faces are far from ideal. You may need to add a face super-resolution model like SPGAN in the pipeline to improve them.
  • Older architecture: DALL-E Mini combines transformers and GANs. However, state-of-the-art image generation has shifted away from GANs in favor of diffusion because of the latter's ability to generate high-resolution, high-fidelity images using low resources. DALL-E 2 uses diffusion, but the top tier consists of open-source diffusion models like Stable Diffusion, along with customization techniques like LoRA.
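The pricing bullet above implies a break-even volume between flat and pay-as-you-go pricing. The sketch below works it out using only the two figures quoted in this article ($5/month for unlimited images, and roughly 250 images at 1024x1024 for the same $5). The function and constant names are illustrative, current vendor prices may differ, and self-hosting infrastructure costs are deliberately left out.

```python
# Break-even estimate built only from the figures quoted in this article.
# All names are illustrative and not part of any vendor's API.

FLAT_PLAN_CENTS_PER_MONTH = 500    # $5/month flat plan with unlimited images
PAYG_IMAGES_FOR_SAME_SPEND = 250   # approx. 1024x1024 images that $5 buys pay-as-you-go
PAYG_CENTS_PER_IMAGE = FLAT_PLAN_CENTS_PER_MONTH // PAYG_IMAGES_FOR_SAME_SPEND  # implied 2 cents


def payg_cost_cents(images_per_month: int) -> int:
    """Pay-as-you-go cost grows linearly with the number of images."""
    return images_per_month * PAYG_CENTS_PER_IMAGE


def cheaper_option(images_per_month: int) -> str:
    """Return which pricing model costs less at a given monthly volume."""
    payg = payg_cost_cents(images_per_month)
    if payg < FLAT_PLAN_CENTS_PER_MONTH:
        return "pay-as-you-go"
    if payg > FLAT_PLAN_CENTS_PER_MONTH:
        return "flat plan"
    return "tie"


for volume in (50, 250, 1000):
    print(volume, payg_cost_cents(volume) / 100, cheaper_option(volume))
```

Under these assumptions, pay-as-you-go only wins below about 250 images a month, so the decision mostly comes down to expected volume.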

DALL-E Mini Architecture and Working

The crucial component in the DALL-E Mini pipeline is a sequence-to-sequence (seq2seq) decoder network based on the bidirectional and auto-regressive transformer model (BART).

While a regular BART network takes in text tokens and generates text tokens, DALL-E Mini's BART seq2seq decoder takes in the text of your prompt and generates semantically relevant image tokens based on them. Then a second network, a vector-quantized generative adversarial network (VQGAN), converts these image tokens into an image.

Overall, this pipeline consists of four separate components:

  1. Image-to-token encoder: A VQGAN encodes an image as a set of semantically rich image tokens that can be supplied to transformer networks. The VQGAN encoder is only active during training.
  2. Text encoder: A BART encoder converts text descriptions (during training) and text prompts (during inference) to text embedding tokens.
  3. Seq2seq BART decoder: It generates a sequence of image tokens based on the text embedding tokens. It does so in an autoregressive mode, meaning that each generated image token depends on the previously generated image tokens to ensure visually consistent and meaningful images.
  4. Tokens-to-image decoder: A VQGAN decoder converts the BART decoder's sequence of image tokens into image patches that constitute the final image.

How they work during training and inference is explained in the next two sections.

DALL-E Mini Training

DALL-E Mini used about 15 million caption-image pairs to train the crucial BART seq2seq decoder. For some datasets, the text caption for an image included both titles and descriptions.

VQGAN image encoding and image token decoding (Source: Esser et al.)

For each pair, a VQGAN encoder generates image tokens from the image. It does this using a codebook that maps a set of image features to a token. Conceptually, this is like the visual-bag-of-words approach of traditional computer vision. The image features are obtained using a convolutional feature extractor.
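The codebook lookup described above can be illustrated with a toy nearest-neighbor quantizer. This is a minimal sketch, not the VQGAN implementation: the 2-D codebook and feature vectors are made-up values, whereas a real VQGAN learns its codebook and uses high-dimensional convolutional features.

```python
# Toy vector quantization: map each feature vector to the index of its
# nearest codebook entry. Codebook and features are made-up 2-D values.

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(features, codebook):
    """Return one token id per feature vector (index of the nearest codebook vector)."""
    tokens = []
    for feature in features:
        distances = [squared_distance(feature, entry) for entry in codebook]
        tokens.append(distances.index(min(distances)))
    return tokens


codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
features = [(0.1, 0.2), (0.9, 0.1), (0.2, 0.8), (0.7, 0.9)]
image_tokens = quantize(features, codebook)
print(image_tokens)  # [0, 1, 2, 3]
```

Each token id is just an index into the codebook, which is what lets a transformer treat an image as a sequence of discrete symbols.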

Simultaneously, the BART encoder generates text embeddings for the caption, which is either a title or a description or their combination.

Next, the BART seq2seq decoder combines the image tokens and the embeddings and generates another sequence of image tokens based on both of them.

DALL-E Mini training pipeline (Source: Dayma et al.)

The generated sequence of image tokens is compared with the VQGAN's sequence using cross-entropy loss. Eventually, the seq2seq decoder learns to predict image tokens conditioned on text captions.
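To make the loss concrete, here is a small sketch of token-level cross-entropy between predicted scores and the VQGAN's target tokens. The four-token vocabulary and the logits are invented for illustration, and averaging over positions is an assumption about how the per-token losses are reduced.

```python
import math

# Toy token-level cross-entropy. Vocabulary size and logits are made up.

def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_cross_entropy(logits_per_position, target_tokens):
    """Mean negative log-probability assigned to the target token at each position."""
    losses = []
    for logits, target in zip(logits_per_position, target_tokens):
        probs = softmax(logits)
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)


target_tokens = [2, 0, 3]  # tokens produced by the VQGAN encoder
confident_logits = [[0, 0, 5, 0], [5, 0, 0, 0], [0, 0, 0, 5]]  # favors the targets
uniform_logits = [[0, 0, 0, 0]] * 3                            # no preference at all

confident_loss = token_cross_entropy(confident_logits, target_tokens)
uniform_loss = token_cross_entropy(uniform_logits, target_tokens)
print(confident_loss < uniform_loss)
```

A decoder that puts most of its probability on the VQGAN's tokens gets a low loss, while one that guesses uniformly pays the log of the vocabulary size at every position.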


DALL-E Mini Inference

DALL-E Mini inference pipeline (Source: Dayma et al.)

For inference, the pipeline receives just a text prompt and must generate images for it.

The BART text encoder again generates embeddings for the prompt. Because there's no input image, the VQGAN image encoder is not used during inference.

The seq2seq BART decoder takes in the text embeddings. Since it's trained to expect both text and image sequences, a dummy image token sequence consisting of just one token that marks the beginning of a sequence is supplied to get the prediction going.

The trained decoder, which is autoregressive, then predicts the next image tokens in the sequence, each based on some of the text embeddings as well as the previously generated image tokens. Its output is a sequence of image tokens.
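The decoding loop can be sketched as follows. This is a toy under stated assumptions: `toy_next_token_scores` is a made-up stand-in for the trained BART decoder, the token ids and vocabulary size are hypothetical, and it always picks the highest-scoring token, whereas real text-to-image decoders typically sample from the predicted distribution.

```python
# Toy greedy autoregressive decoding. The scoring function is a made-up
# stand-in for the trained decoder, so the tokens carry no visual meaning.

BOS_TOKEN = 0    # hypothetical id of the beginning-of-sequence marker
VOCAB_SIZE = 8   # hypothetical number of image tokens in the codebook


def toy_next_token_scores(text_embedding, generated):
    """Score every candidate token given the prompt and all tokens generated so far."""
    context = sum(text_embedding) + sum(generated) + len(generated)
    return [-((token - context) % VOCAB_SIZE) for token in range(VOCAB_SIZE)]


def generate_image_tokens(text_embedding, num_tokens):
    generated = [BOS_TOKEN]  # dummy one-token sequence that gets prediction going
    for _ in range(num_tokens):
        scores = toy_next_token_scores(text_embedding, generated)
        best = max(range(VOCAB_SIZE), key=lambda token: scores[token])
        generated.append(best)  # the new token becomes context for the next step
    return generated[1:]  # drop the start marker


image_tokens = generate_image_tokens(text_embedding=(3, 1, 4), num_tokens=4)
print(image_tokens)
```

The point of the sketch is the feedback loop: every predicted token is appended to the sequence and conditions the next prediction.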

Next, the predicted sequence of image tokens is converted to a set of probable images by the VQGAN decoder. Optionally, a pre-trained contrastive language-image pretraining model (CLIP) can be added to the pipeline to select the generated image that's closest to the given prompt.
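The optional CLIP step amounts to ranking candidates by how similar their embeddings are to the prompt's embedding. The sketch below assumes the embeddings have already been computed by some CLIP model; the 3-D vectors are invented for illustration, and cosine similarity is used as the matching score.

```python
import math

# Re-rank candidate images by similarity to the prompt. The embeddings are
# made-up 3-D vectors standing in for the output of a CLIP model.

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_candidates(text_embedding, image_embeddings):
    """Return candidate indices sorted from best to worst match."""
    scores = [cosine_similarity(text_embedding, e) for e in image_embeddings]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


prompt_embedding = (1.0, 0.0, 1.0)
candidate_embeddings = [
    (0.0, 1.0, 0.0),   # unrelated to the prompt
    (1.0, 0.0, 0.9),   # close match
    (0.5, 0.5, 0.5),   # partial match
]
ranking = rank_candidates(prompt_embedding, candidate_embeddings)
print(ranking)  # [1, 2, 0]
```

Generating several candidates and keeping the top-ranked one trades extra compute for a better chance that the final image matches the prompt.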

Basic Image Generation With DALL-E Mini

In this experiment, we tested DALL-E Mini's image generation by ratcheting up the prompt complexity from simple and specific to very abstract.

Simple Prompt

Photos with people shown in identifiable ways are at risk of litigation from them even if the photographer has granted you a full license. This is a common problem with stock imagery. Obviously, getting permission from individuals in photos is impractical. But the risks are real, especially if you use that photo with content they don't approve of. Image generation is one option to avoid such permission and licensing issues.  

Our first experiment is a simple prompt to show a person: "A girl playing golf, camera on the ground behind her."

This is what Craiyon's DALL-E Mini generates:

A girl playing golf, camera on the ground behind her.
Source: Craiyon

While it's clearly a golf action shot from behind, we can see that DALL-E Mini struggles with hand placement relative to the golf club. This is a common problem with image generation models: hands and fingers are hard to render interacting with objects in a way that looks real.

Medium Difficulty Prompt

For medium difficulty, we ask it to generate something specific from a specialized domain like health care where getting images is sometimes difficult.

We use the following prompt: "A chest X-ray, lungs colored green, realistic."

A chest X-ray, lungs colored green, realistic: Dalle prompt
Source: Craiyon

The images generated are impressive. They're probably not anatomically correct but they can pass as X-ray images, especially coming from a base model. Fine-tuning this model on a dataset of these images would greatly improve these results.

Very Abstract Prompt

The last prompt is an intentionally abstract one: "Document understanding using computer vision and natural language processing."

We were hoping for images that evoked ideas of AI and automation. Here are some of the selected results that do exactly that and resemble stock illustrations:

abstract dalle prompt for image generation

The images also look coherent despite the highly abstract prompt.

DALL-E Mini to Generate Images for Blog Posts

For your content marketing goals, you'll often want to generate relevant photos and illustrations for all your blog posts and white papers. Some typical business problems here are finding relevant imagery, licensing uncertainties, and prices of stock imagery. All these problems can be solved if you use a pipeline that examines the content and generates relevant photos or illustrations.

In this section, we experiment with an end-to-end automated pipeline that analyzes your content using natural language processing, generates an optimized prompt, and passes it to DALL-E Mini for image generation.

Width.ai Pipeline

Automated image generation pipeline. Replace "Run Stable Diffusion" with "Run DALL-E Mini". They're the same pipeline!

The automated pipeline consists of multiple components:

  1. Chunking Algorithm: The chunking algorithm breaks our blog post down into bite-sized pieces with stronger contextual understanding. We use a mix of long-document chunking strategies to reach a final outline of how we want to split this specific blog post. Key topic extraction (which is what we do in the next step) is one of the use cases where maintaining an understanding of the context of the other chunks greatly reduces the chance that the model produces very generalized topics or focuses on the wrong topics, so the chunking strategy we use should focus on minimizing these cases.
  2. GPT-4 topic extraction: Here we leverage our key topic extraction model to extract the key idea of the blog post chunk. This is not an "extractive" model that pulls out specific sentences or paragraphs; instead, it tries to abstract the section into a single topic with references. Included in this module is an NLP-focused prompt optimization framework to improve our prompt at runtime.
  3. Combinator Model: This fine-tuned model focuses on combining all of our chunk key topic outputs. The goal of condensing these topics is to remove duplicates or semantically overlapping topics, and to reduce our text input to a prompt size suitable for running through our model. While DALL-E Mini does not have a prompt token limit, it's best to keep prompts within DALL-E or Stable Diffusion limits, since adjectives closer to the beginning of the prompt have the highest priority. This is a similar idea to the early prompt biases seen in models like GPT-3. While this step does not give us our final prompt, it gives us a middle ground between the multiple paragraphs of key topics from all the chunks and a version we can work with much more easily.
  4. Image Prompt Generation Model: We set up a module to go from a simple topic-focused sentence that summarizes what our blog post talks about to a domain-focused hard prompt. We optimize our text prompts for accuracy based on the principles outlined in articles we've written, such as learning hard prompts and keyword-focused optimization. We use a prompt optimization framework here as well to steer the model a bit more towards examples of optimized prompts from our other blogs. Another key output we can train for at this step is which negative words to use to further guide our model towards our goal state output.
Prompt optimization example with CLIP
  5. DALL-E Mini image generation: We combine the six key topics into a prompt and supply it to DALL-E Mini for initial image generation.
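The data flow through the steps above can be sketched with simple stand-ins. This is not the pipeline itself: paragraph-based chunking replaces the mix of chunking strategies, exact case-insensitive deduplication replaces the fine-tuned combinator model (which also handles semantic overlap), and the topic extraction model is omitted entirely. All function names and parameters are hypothetical.

```python
# Simplified stand-ins for the chunking, combining, and prompt-building steps.
# Topic extraction itself (the GPT-4 step) is not modeled here.

def chunk_paragraphs(text, max_words=120):
    """Group consecutive paragraphs into chunks of at most max_words words."""
    chunks, current, count = [], [], 0
    for paragraph in (p.strip() for p in text.split("\n\n")):
        if not paragraph:
            continue
        words = len(paragraph.split())
        if current and count + words > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(paragraph)
        count += words
    if current:
        chunks.append("\n\n".join(current))
    return chunks


def combine_topics(topics_per_chunk):
    """Flatten per-chunk topics and drop case-insensitive duplicates, keeping order."""
    seen, combined = set(), []
    for topics in topics_per_chunk:
        for topic in topics:
            key = topic.strip().lower()
            if key and key not in seen:
                seen.add(key)
                combined.append(topic.strip())
    return combined


def build_image_prompt(topics, style="illustration", max_topics=6):
    """Put the most important topics first, since early words carry the most weight."""
    return ", ".join(topics[:max_topics] + [style])


chunks = chunk_paragraphs("a b c\n\nd e\n\nf g h i", max_words=5)
topics = combine_topics([["NLP", "Topic Extraction"], ["nlp ", "Asset Management"]])
print(chunks)
print(build_image_prompt(topics))
```

Keeping each stage as a plain function makes it easy to swap a stand-in for a model call later without changing the rest of the flow.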


For a test run, we ran this pipeline on one of our articles, Using GPT-3 for Topic Extraction.

GPT-4 identified the following key topics in a sentence format:

  • GPT-3 & AI
  • Asset Management
  • NLP
  • Interview Transcripts
  • Topic Extraction Models
  • Financial Interviews

We can take these key topics and process them through our image prompt generation model to produce more prompt-focused language. Craiyon DALL-E Mini came up with these images for this article.

example image generations for the blog post

Source: Craiyon

Next, we supplied the prompt and DALL-E Mini's initial image to the prompt optimization framework. The prompt optimization steps are shown below:

prompt optimization framework

From there we were able to generate some more blog cover images. We finally see an image related to ideas around asset management and document processing. More focus during prompt optimization on negative words or fine-tuning this model would get rid of these poorly related images.

hard prompt optimization for dalle

Image Creation for Social Media Posts With DALL-E Mini

The final use case we explore is image creation for social media posts. From surveys and studies by social media companies, we know that visual content garners more eyes and responses. LinkedIn posts with images attract twice the comments and engagement compared to those without, and tweets with images or videos see three times as much engagement. Carousel posts (where you swipe and see up to 10 images in one post) get more interaction than both Reels and single-photo posts on Instagram feeds.

If you want to meet your KPIs for social media, you know that you need visual content. And it needs to come from somewhere.

We explore how DALL-E Mini image generation can be integrated into social media workflows.

Social Media Pipeline

The image generation pipeline for social media is similar to the one for blog posts and other long-form content shown above. But instead of a content fetcher, the first component is typically a social media publisher component that is given a post and distributes it to multiple social media platforms simultaneously through their APIs.

Another difference is that social media posts may typically contain more slang, cultural references, and other informal language compared to content posts. Pretrained image generation models may stumble here because their language models were not trained on such language.

A final difference is that social media images typically need better moderation to avoid bans and prevent any reputational damage from images that may be misinterpreted.

All these differences translate into a need for creating a custom image generation model that's better tuned to social media's norms. In the next section, we explore how a self-hosted image generator like DALL-E Mini can be fine-tuned.

For this example, I used this article that reads more like an informational blog post and less like the above case study post. We can see the model is getting closer to understanding exact ideas from the blog post such as "loan processing". Fine-tuning this model would further improve quality and precision.

social media post generation with dalle

Fine-Tuning DALL-E Mini Image Generation

Every component in the DALL-E Mini pipeline can be fine-tuned to suit your needs. Fine-tuning involves these general steps:

  • Analyze the existing weaknesses in the stock models, and identify which ones need to be improved.
  • Collect data that can make up for those weaknesses.
  • Freeze the weights of the initial layers of the target model and unfreeze the remaining top layers.
  • Train the identified model(s) on the newly collected data. This will modify the weights in the unfrozen layers to better respond to the new data.
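The freeze-and-train steps above can be illustrated with a toy update rule. This is a minimal sketch, assuming plain gradient descent on a dictionary of per-layer weights; the layer names, weights, and gradients are invented, and a real fine-tuning run would use the training framework's own mechanism for freezing parameters.

```python
# Toy fine-tuning step: only layers listed as trainable are updated.
# Layer names, weights, and gradients are made up for illustration.

def fine_tune_step(params, grads, trainable, learning_rate=0.5):
    """Apply one gradient descent step to unfrozen layers and copy frozen ones unchanged."""
    updated = {}
    for name, weights in params.items():
        if name in trainable:
            updated[name] = [w - learning_rate * g for w, g in zip(weights, grads[name])]
        else:
            updated[name] = list(weights)  # frozen: gradients are ignored
    return updated


params = {
    "layer_0": [1.0, 2.0],    # initial layer, frozen
    "layer_1": [3.0, 4.0],    # initial layer, frozen
    "layer_2": [1.0, -1.0],   # top layer, unfrozen
}
grads = {
    "layer_0": [9.0, 9.0],
    "layer_1": [9.0, 9.0],
    "layer_2": [2.0, -4.0],
}
new_params = fine_tune_step(params, grads, trainable={"layer_2"})
print(new_params)
```

Freezing the early layers keeps the general features learned during pre-training intact while the top layers adapt to the new data.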

In the case of fine-tuning for social media, you can start by identifying caption-image pairs where the captions come from social media and the pipeline is not generating good images. Some reasons may include:

  • Use of languages other than English.
  • Use of local cultural references that don't exist in the pre-trained models.
  • Use of slang, abbreviations, emojis, and similar.
  • Use of domain-specific specialized terms that aren't included in the pre-trained models.
  • Use of abstract or domain-specific images that the pre-trained models have never seen.

Gather a couple of hundred such problematic caption-image pairs and split them into training and test sets. Then consider the following fine-tuning possibilities:

  • VQGAN: Run a round of encoding followed by decoding operations on the test images using the existing VQGAN model. If its decoded images are too abstract, irrelevant, or different from the input images, it requires fine-tuning. Fine-tune the VQGAN encoder and decoder on the training images to improve their ability to encode and decode such images.
  • BART text encoder: It's difficult to verify if the BART encoder is messing up the embeddings. The safe strategy is to assume it is, especially if there are lots of language differences in the posts, and retrain it anyway.
  • Seq2seq BART decoder: This model always requires fine-tuning because it's the one most likely to fail on novel data. Fine-tune its final layers on the text embeddings and image tokens supplied by the fine-tuned VQGAN and BART encoder.
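The data split and the VQGAN round-trip check can be sketched as below. Several things here are assumptions: images are represented as flat lists of pixel values, `round_trip` is a hypothetical callable standing in for VQGAN encoding followed by decoding, mean squared error is used as a rough proxy for "too different from the input," and the threshold is arbitrary.

```python
import random

# Split caption-image pairs and check reconstruction quality of a round trip.
# round_trip is a hypothetical stand-in for VQGAN encode followed by decode.

def split_pairs(pairs, test_fraction=0.2, seed=0):
    """Shuffle reproducibly and return (train, test) lists."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def mean_squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def needs_fine_tuning(images, round_trip, threshold):
    """Flag the model if the average reconstruction error exceeds the threshold."""
    errors = [mean_squared_error(image, round_trip(image)) for image in images]
    return sum(errors) / len(errors) > threshold


pairs = [(f"caption {i}", [float(i), float(i + 1)]) for i in range(10)]
train_pairs, test_pairs = split_pairs(pairs)
test_images = [image for _, image in test_pairs]

perfect = lambda image: list(image)            # reconstructs the input exactly
lossy = lambda image: [0.0 for _ in image]     # loses all information

print(len(train_pairs), len(test_pairs))
print(needs_fine_tuning(test_images, perfect, threshold=0.5))
print(needs_fine_tuning(test_images, lossy, threshold=0.5))
```

In practice, a visual inspection of the reconstructions is still worthwhile, since a pixel-level error does not always track how relevant or recognizable a decoded image looks.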

DALL-E Mini fine-tuning typically requires a few hours to days of TPU time or its GPU equivalent.

DALL-E Mini Image Generation for Your Business Needs

In this article, you saw DALL-E Mini's capabilities as well as those of its alternatives. The ability of AI to generate any kind of photo or illustration makes it far more productive, and potentially less expensive, than relying on stock imagery, especially if you need highly domain-specific imagery that is rare.

If you want to systematically improve your marketing campaigns and content quality with customized images, contact us.