A Deep Dive Into The Mask-Aware Transformer for Large-Hole Image Inpainting | How Should I Use MAT?

February 7, 2023

*Enhance the quality of noisy or poor quality product data (Original Photo by* *Franki Chamaki* on *Unsplash*)

Images and videos are indispensable to expanding your brand reach. But finding the most suitable image or video remains a hassle for businesses. Thankfully, in the past few years, image generation models like DALL-E 2, Stable Diffusion, and Midjourney have started producing images that can rival human photography and art in creativity and realism. Better still, several such models are available to you as open-source code and pre-trained weights.

In this article, you'll get to know one such exciting and extremely useful generative AI use case in depth — the mask-aware transformer for large-hole image inpainting.

What Is Image Inpainting?

Image inpainting refers to generating a portion of an existing image in a way that is visually realistic and semantically consistent with the rest of the image. You may need it to seamlessly remove something from an image or add something to it. Or you may use it to repair a damaged old photo.

Business Uses for Inpainting

What's inpainting useful for? Apart from image editing in general, it may prove useful for the following possibilities.

1. Generate New Product Images for Product Recognition Models

You may be an AI retail software company looking for new ways to generate retail shelf product data. Or you may be a retail business experimenting with your product placement or planogram compliance. Either way, you can use guided inpainting to experiment with different aspects like positioning, colors, shapes, or sizes of your products.

*Imagining different product colors, shapes, sizes, and placement in a retail setting using inpainting (Photo by* *Franki Chamaki* on *Unsplash*)


For example, in the image above, text-guided inpainting is used to imagine how a novel packaging or new product may appear on a shelf.

2. Product Photo Generation

Inpainting allows you to generate a full background around a picture of your product so that the product sits naturally in the scene. We can build simple pipelines that take a product photo, do the required preprocessing, and then generate a background that actually places the product in the photo! Through high-level prompting, we can make the angles, other objects, and the surface match the product.

Product photo generation with inpainting

The product blends into the background naturally. The image around the product adjusts its angles and viewpoint to align with the product.

3. Enhance Product Image Quality

You can use inpainting for background modification or super-resolution to enhance low-resolution images.

4. Customize Stock Images for Your Content and Websites

If you aren't comfortable with image editors like Photoshop, you can use inpainting to modify an existing stock image or product image to suit your content or website design. You can also automate the generation of these stock images for specific blogs. You can even leverage a key topic extraction model to automatically generate prompts specific to your blog content.

5. Synthesize Non-Existent Human Faces to Avoid Legal Problems

Many businesses wrongly assume that stock photos with permissive licenses are risk-free. In reality, if a photo shows a person's face in an identifiable way, those permissive usage rights don't automatically extend to the use of their face. They can sue you if you don't get their explicit permission, especially if you use the photo alongside content that is controversial or socially damaging to the person.

Inpainting offers a solution by being able to generate non-existent human faces. The mask-aware transformer explained in this article even provides a pre-trained model to demonstrate this (but only for non-commercial purposes).

6. Photo Restoration

Old, worn out, or damaged photos are common in some organizations like ancestry services, libraries, law enforcement, and government archives. Inpainting can be used to restore such photos.

7. Drone, Aerial, and Satellite Imagery

Due to clouds, communication problems, birds, or other obstacles, missing areas are a common problem in any industry that uses drone, aerial, or satellite imagery. Inpainting is a solution that's commonly used and quite effective since there are likely to be other images of the same areas that can be used to fill in the gaps.

Problems of Large Hole Inpainting With Traditional Architectures

For small holes or minor damage, traditional image processing and inpainting techniques work fine. They just spread colors, textures, or contours from the neighborhood and that's often enough.

*Failure of large-hole inpainting when using traditional inpainting algorithm (Photo by* *Franki Chamaki* on *Unsplash*)

But large-hole inpainting is an entirely different beast despite the similar name. It requires semantic understanding of images and realistic image generation, not just the flow of colors and textures. The image above demonstrates the failure of large-hole inpainting using a traditional algorithm. That's why most of the recent advances have used deep learning techniques for inpainting. The next section introduces you to some of these approaches.
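To see why neighborhood-spreading breaks down on large holes, here's a minimal sketch of that idea in plain Python. It's an illustrative toy (the function name `diffuse_fill` and the averaging rule are our own assumptions, not a specific published algorithm): every missing pixel is repeatedly replaced by the average of its four neighbors.

```python
def diffuse_fill(image, hole, iterations=200):
    """Fill hole pixels by repeatedly averaging their 4-connected neighbors.

    image: list of rows of floats (grayscale).
    hole:  same shape, True where the pixel is missing.
    """
    height, width = len(image), len(image[0])
    # Hole pixels start at 0.0; known pixels are never modified.
    current = [[0.0 if hole[y][x] else image[y][x] for x in range(width)]
               for y in range(height)]
    for _ in range(iterations):
        updated = [row[:] for row in current]
        for y in range(height):
            for x in range(width):
                if not hole[y][x]:
                    continue
                neighbors = [current[ny][nx]
                             for ny, nx in ((y - 1, x), (y + 1, x),
                                            (y, x - 1), (y, x + 1))
                             if 0 <= ny < height and 0 <= nx < width]
                updated[y][x] = sum(neighbors) / len(neighbors)
        current = updated
    return current
```

A fill like this can only produce smooth blends of the surrounding values. That's acceptable for a scratch a few pixels wide, but it can never invent a missing window, face, or product, which is exactly what large-hole inpainting demands.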

Deep Learning Approaches for Inpainting

Most deep learning approaches for inpainting fall into one of these architectural categories:

Generative adversarial networks (GAN): A GAN consists of two networks — a generator and a discriminator — that are trained against each other like adversaries. The generator tries to produce realistic inpainting that the discriminator can't detect as generated. The discriminator attempts to detect whether the image contains synthetic inpainting. Their competition to get the better of each other results in a generator that can produce highly realistic inpainting. This is a very powerful architecture for more difficult use cases such as videos or intense, high quality images.
Diffusion models: These models use a concept called diffusion to progress toward the target image step-by-step. This is the most common method of inpainting and is seen in tools like DreamBooth, Stable Diffusion's DreamStudio, and OpenAI's DALL-E. Most simple inpainting use cases can be solved with fine-tuned versions of these architectures.
Language-image diffusion models: These models do the inpainting guided by text prompts from the user. For example, Stable Diffusion's models accept text prompts to guide the inpainting in a selected region.
Transformer networks: A transformer network consists of an encoder that generates latent representations for the target area and a decoder that generates pixels from those latent embeddings. Transformers may be used standalone but are often part of GANs, diffusion models, and language-image models too.
Convolutional neural networks (CNN): A CNN specializes in any computer vision task where local area features are critical, and that includes inpainting. CNNs may be used standalone but are often part of GANs and diffusion models too. U-net is a particularly popular model due to its ability to generate high-resolution images.
Autoencoders: Some diffusion models use these networks to obtain the latent representations.
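The adversarial training described for GANs above can be made concrete with the classic GAN objective. The sketch below is a textbook formulation, not necessarily the exact loss MAT uses. The function names are ours, and `d_real` and `d_fake` stand for the discriminator's estimated probability that a real image or a generated image, respectively, is real.

```python
import math

def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy: push D(real) toward 1 and D(fake) toward 0."""
    return -(math.log(d_real) + math.log(1.0 - d_fake))

def generator_loss(d_fake):
    """Non-saturating generator loss: push D(fake) toward 1."""
    return -math.log(d_fake)

# A discriminator that separates real from fake well has a low loss ...
print(discriminator_loss(d_real=0.9, d_fake=0.1))
# ... while a generator whose output is detected as fake has a high loss.
print(generator_loss(d_fake=0.1))
```

The two losses pull in opposite directions on `d_fake`, which is the competition the list describes.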

In the next section, you'll learn about the mask-aware transformer, a recent approach that performs well at inpainting.

Mask-Aware Transformer

The mask-aware transformer (MAT) is a GAN with some innovative ideas to overcome the problems of large-hole inpainting. As of January 2023, it's the best-performing model on several inpainting benchmarks like Places2 and CelebA-HQ.

We'll dive deep into how it works.

Key Intuition Behind This Approach

A region on the inner border of a hole is strongly influenced by its local neighborhood outside the hole. The neighbors' colors, textures, contours, and other visual aspects must flow smoothly into the region. Each such border region, in turn, influences its neighbors inside the hole.

This hierarchy of influence is analogous to text generation where the sequence of previous words influences the next word, the sequence of previous sentences influences the next sentence, and so on.

Additionally, long-range and global context must also influence the objects and textures painted in a hole. If the hole is in a human face, its filling must match the parts of a typical human face at that position.

Paying attention to both local and long-range context for visual and semantic consistency is something transformer models excel at. So MAT uses a transformer stack in its generator and makes it mask-aware.

High-Level Architecture

Like any GAN, MAT has a generator and a discriminator network. The discriminator is only active during training to force the generator towards ever-greater accuracy.

MAT's generator network consists of:

A convolutional block for token generation
A transformer stack for using local and global context
A U-net convolutional stack for refining and upsampling generated images
A styling module to influence the generation

The discriminator network is a standard one similar to StyleGAN2. It's a convolutional network with two stages of skip-connected layers — one for the real images and one for the generated images. It evaluates the perceptual loss between real images and those synthesized by the generator network.

Since the generator network is the more interesting part of MAT, we'll study it in detail next.

Token Generation Using Convolutional Layers

The first stage of the generator generates the inputs for the transformer stack. These inputs are called "tokens," a term from language processing where transformers are popular. The tokens here are just visual feature embeddings generated on the input images by a stack of six convolutional layers. For a 512x512 image, they generate 4096 tokens of 180 dimensions.
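The token count follows from simple arithmetic. The sketch below assumes the convolutional layers reduce each side of the image by a factor of 8. That factor is our inference from the 4096 figure (64 x 64 = 4096), and `token_grid` is a hypothetical helper.

```python
def token_grid(image_size, downsample_factor=8):
    """Return (tokens_per_side, total_tokens) for a square image."""
    if image_size % downsample_factor != 0:
        raise ValueError("image size must be divisible by the downsampling factor")
    side = image_size // downsample_factor
    return side, side * side

side, total = token_grid(512)
print(side, total)   # 64 4096
print(total * 180)   # 737280 feature values across all tokens
```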

Long-Range Knowledge and Mask Awareness Using Contextual Attention Transformer

MAT's transformer stack is designed to allow both local and long-range context to influence the inpainting and do so efficiently.

A standard transformer block has a multi-head self-attention layer and a feed-forward layer with layer normalization at their inputs. The self-attention layer decides which parts of the input to focus on. All this works fine on a normal image.

But on an image with large holes (due to a mask or damage), many tokens may be from blank patches with no useful information. The influence of these "invalid tokens" gets amplified by the standard self-attention mechanism, the use of skip connections, and layer normalizations. So the training becomes unstable and the model's generative capability goes down.

Plus, since attention's complexity is quadratic (i.e., given N inputs, it must do NxN dot products), some optimizations are needed anyway for high-resolution inpainting.
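A quick calculation shows how much the quadratic cost matters at this token count. It also shows how much is saved by one common optimization, restricting attention to local windows. The window side of 8 tokens is a hypothetical value chosen for illustration, and both function names are ours.

```python
def full_attention_pairs(num_tokens):
    """Every token attends to every token: N x N dot products."""
    return num_tokens * num_tokens

def windowed_attention_pairs(num_tokens, window_side):
    """Tokens attend only within their own window_side x window_side window."""
    tokens_per_window = window_side * window_side
    if num_tokens % tokens_per_window != 0:
        raise ValueError("token count must split evenly into windows")
    num_windows = num_tokens // tokens_per_window
    return num_windows * tokens_per_window * tokens_per_window

print(full_attention_pairs(4096))         # 16777216
print(windowed_attention_pairs(4096, 8))  # 262144
```

Under these assumptions, windowing cuts the dot products per attention layer by a factor of 64. The trade-off is that a single layer can no longer see beyond its window.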

To solve all these problems, MAT replaces standard self-attention with contextual attention and normal transformer blocks with adjusted transformer blocks.

Multi-Head Contextual Attention (MCA)

The MCA is a special multi-head attention layer. Using the shifted windows idea from the Swin transformer, it can extract useful information even when the number of usable tokens is low due to large holes.

It also maintains a dynamic mask to evaluate masked areas and retain only the useful tokens. It works like this: The input mask is the initial mask. After each attention round, it shifts the attention window by a few tokens and starts the next round. If a window has at least one usable token, the entire window is treated as usable and included in the dynamic mask. But if it has only invalid tokens, the window is discarded from the dynamic mask. In the end, you're left with a dynamically determined mask that only contains useful information.
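The mask-updating rule can be sketched on a small token grid. This is a simplified illustration of the rule as described above, not the authors' implementation: `update_mask` is a hypothetical helper, and the cyclic wrap-around at the borders is a simplification we chose to keep the code short.

```python
def update_mask(mask, window, shift=0):
    """One round of window-based mask updating on a square token grid.

    mask: list of rows of 0/1 (1 = valid token). The grid is split into
    window x window blocks, offset cyclically by `shift` tokens. Any block
    holding at least one valid token becomes fully valid.
    """
    size = len(mask)
    if size % window != 0:
        raise ValueError("grid size must be divisible by the window size")
    updated = [row[:] for row in mask]
    for top in range(0, size, window):
        for left in range(0, size, window):
            cells = [((top + dy + shift) % size, (left + dx + shift) % size)
                     for dy in range(window) for dx in range(window)]
            if any(mask[y][x] for y, x in cells):
                for y, x in cells:
                    updated[y][x] = 1
    return updated

# An 8x8 grid where only the top-left token is valid.
mask = [[0] * 8 for _ in range(8)]
mask[0][0] = 1
round_1 = update_mask(mask, window=2)              # plain windows
round_2 = update_mask(round_1, window=2, shift=1)  # shifted windows
print(sum(map(sum, round_1)), sum(map(sum, round_2)))  # 4 16
```

Each round lets valid information spread by at most one window, so the valid region grows gradually from the hole's border inward.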

Adjusted Transformer Block

These MCA layers are used by the so-called adjusted transformer blocks that are optimized for large-hole images. Unlike a standard block whose use of skip connections and layer norms amplifies the invalid (useless) token problem, an adjusted block is optimized for sparse tokens. It doesn't use skip connections or layer norms. Instead, it concatenates the inputs and outputs of each attention layer and sends them to a fully connected layer.
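Here's a minimal sketch of that concatenate-then-project step for a single token, in plain Python. The names `linear` and `fuse` and the tiny dimensions are ours, and the weights would be learned in a real model.

```python
def linear(vector, weights, bias):
    """Fully connected layer: `weights` has one row per output feature."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, bias)]

def fuse(block_input, attention_output, weights, bias):
    """Concatenate a token's input and attention output, then project back."""
    concatenated = block_input + attention_output  # list concat: length 2 * dim
    return linear(concatenated, weights, bias)

# With these particular weights, the layer reproduces a plain skip connection
# (input + attention output). Learned weights are free to do something else.
weights = [[1.0, 0.0, 1.0, 0.0],
           [0.0, 1.0, 0.0, 1.0]]
print(fuse([1.0, 2.0], [0.5, -1.0], weights, [0.0, 0.0]))  # [1.5, 1.0]
```

The example shows that a skip connection is just one special case of this layer. Because the weights are learned, the block can choose how much of an uninformative input to pass through instead of always adding it in full.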

MAT's transformer stack consists of five such adjusted transformer blocks that progressively downsample the number of tokens and upsample them back to 4096 tokens, decorating them with attention information about which tokens to focus on.

The stack is followed by one convolutional decoder layer with a single global skip connection to the stack's input. Its job is to decode the transformer's latent representations back to visual feature embeddings but now with attention. They're the final outputs of the transformer stack and the inputs to the next U-net stage.

High Resolution Final Image Using U-Net

U-net architecture — *U-net (Source:* *Ronneberger et al.*)

A U-net is a convolutional neural network (CNN) with a special capability: Though it applies convolutions and downsampling like any other CNN, it finally produces a full image with the same resolution as the original. It does this by progressively upsampling feature maps from a previous layer using information from the corresponding downsampling layer.
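A one-dimensional toy makes the shape of this computation easy to follow. It's only a sketch of the data flow: a real U-net uses learned convolutions and concatenates the skip features, whereas this toy (with hypothetical helper names) averages them.

```python
def downsample(signal):
    """Halve the resolution by averaging adjacent pairs."""
    return [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]

def upsample(signal):
    """Double the resolution by repeating each value."""
    return [value for value in signal for _ in range(2)]

def toy_unet(signal, depth):
    """Downsample `depth` times, then upsample back using skip connections.

    The signal length must be divisible by 2 ** depth.
    """
    skips = []
    current = signal
    for _ in range(depth):
        skips.append(current)      # remember the features at this resolution
        current = downsample(current)
    for _ in range(depth):
        skip = skips.pop()         # matching resolution from the way down
        upsampled = upsample(current)
        # Merge coarse context with the fine detail saved earlier.
        current = [(a + b) / 2 for a, b in zip(upsampled, skip)]
    return current

output = toy_unet([float(i) for i in range(32)], depth=5)
print(len(output))  # 32
```

The output has the same length as the input even though the middle of the network works on a single value. The skip connections are what let fine detail survive that bottleneck.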

The U-net in MAT's generator takes the embeddings from the transformer stack and combines them with styling inputs. Five convolutional layers then progressively downsample the feature embeddings, halving them in each layer. Then five more layers progressively upsample the embeddings back to a full-resolution color image.

Styling Images

Style variations with the Mask-Aware Transformer — *Style variations (Source:* *Li et al*.)

For some use cases, you may want multiple inpainted versions showing a variety of plausible objects, all of them visually and semantically consistent with the rest of the image. For example, for a generated garden photo, you may want multiple photos that show flowers of different colors to choose from. Or if your photo has people's faces, you may want to examine many faces and select the one best suited to your brand or content.

MAT does not give you tight control over what gets inpainted. Instead, it learns what kind of random noise will result in specific visual attributes in the final image. During training, supply the noise values alongside the images with the specific visual features you want. The noise-to-style mapping is learned in the transformer stack and then used by the U-net during generation.

When you want it to generate multiple variants, just supply the appropriate noise values. The noise values are abstract vectors but you can always map them to specific image features in your user interface. For example, if you're inpainting faces, you can map specific noise vectors to facial aspects like hair color, skin color, and so on.
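One simple way to expose this in an application is to store seeds rather than raw vectors, so the same named variant always reproduces the same noise. The sketch below is an assumption about how you might wire this up in your own code, not part of MAT. The preset names, the seeds, and the 512-dimensional size are all hypothetical.

```python
import random

def sample_style_noise(seed, dim=512):
    """Reproducibly sample a Gaussian noise vector for one style variant."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]

# Hypothetical mapping from user-facing variant names to seeds.
STYLE_PRESETS = {"variant_a": 11, "variant_b": 42, "variant_c": 1234}

def noise_for_preset(name, dim=512):
    """Look up a named variant and return its noise vector."""
    return sample_style_noise(STYLE_PRESETS[name], dim)

noise = noise_for_preset("variant_b")
print(len(noise))  # 512
```

You would pass each vector to the generator along with the image and mask, then label the presets in your interface once you've seen which visual attributes they produce.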

Training

MAT training is self-supervised and adversarial. You're free to use unlabeled image datasets. But if labels are available, you can tell MAT to optionally use them for loss calculations.

When should you train a new model? Pretrained models are only available for faces and places. The places dataset covers a wide variety of indoor and outdoor settings and MAT will work fine on any photo with similar settings. But if you want it to generate specific types of vehicles or store products or dresses or similar, you'll have to train a custom model on your own images.

Evaluation and Metrics

Metric comparison for Mask-Aware Transformer vs other architectures commonly used — *Metrics comparison (Source:* *Li et al*.)

Evaluating inpainting is complicated because many plausible objects can be painted and seem realistic. So evaluation is a mix of perceptual metrics and qualitative assessments.

MAT uses three perceptual metrics:

Fréchet inception distance (FID): The FID compares the activations of deep layers of a pre-trained Inception network between real and generated images. A lower FID score indicates that the generated images are more similar to real images, while a higher FID score means that the generated images are less similar to real images.
Paired and unpaired inception discriminative scores (P-IDS/U-IDS): These metrics measure the perceptual fidelity of inpainted images compared to real images via linear separability in a feature space.

MAT scores well on all three metrics compared to other inpainting GAN and CNN models.
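For reference, the FID between real and generated feature distributions with means μ_r, μ_g and covariances Σ_r, Σ_g is ||μ_r - μ_g||² + Tr(Σ_r + Σ_g - 2(Σ_r Σ_g)^(1/2)). The sketch below computes it for the special case of diagonal covariances, where the matrix square root reduces to element-wise square roots. That simplification is ours, made to keep the example dependency-free. Real evaluations use full covariance matrices of Inception features.

```python
import math

def fid_diagonal(mean_real, var_real, mean_fake, var_fake):
    """Frechet distance between two Gaussians with diagonal covariances."""
    mean_term = sum((a - b) ** 2 for a, b in zip(mean_real, mean_fake))
    cov_term = sum(vr + vf - 2.0 * math.sqrt(vr * vf)
                   for vr, vf in zip(var_real, var_fake))
    return mean_term + cov_term

# Identical statistics give a distance of zero.
print(fid_diagonal([0.0, 0.0], [1.0, 1.0], [0.0, 0.0], [1.0, 1.0]))  # 0.0
# Shifting one mean by 1 adds 1 to the distance.
print(fid_diagonal([0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [1.0, 1.0]))  # 1.0
```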

Benefits of Mask-Aware Transformer

Compared to some of the other GAN models for inpainting, MAT comes with some great benefits.

1. Visually realistic generation: MAT's inpainting produces more realistic images compared to other recent models like CoModGAN, LaMa, or MADF.

Qualitative comparison with MAT vs other models — *Qualitative comparison with other models (Source:* *Li et al*.)

2. High-resolution, high-quality inpainting: Most inpainting approaches compromise either the resolution or the quality. But thanks to its unique innovations, MAT achieves both high resolution and high quality. It defaults to 512x512 pixel images but can scale up to any higher resolution as long as it's a multiple of 512.
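If your images don't already meet the multiple-of-512 constraint, one option is to pad or resize them up to the next valid size before inpainting and crop the result afterward. The helper below only computes that target size. `padded_size` is a hypothetical name, and how you fill the padding is up to your pipeline.

```python
def padded_size(width, height, multiple=512):
    """Round each dimension up to the nearest multiple of `multiple`."""
    def round_up(value):
        return ((value + multiple - 1) // multiple) * multiple
    return round_up(width), round_up(height)

print(padded_size(600, 512))    # (1024, 512)
print(padded_size(1920, 1080))  # (2048, 1536)
```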

3. Fast execution: MAT's generator is fast. It's capable of inpainting high-resolution images in near real-time. And because you can run MAT on your own infrastructure rather than depend on a managed service, you can scale throughput further by adding cloud compute.

4. Large hole inpainting: Of course the key benefit of this architecture is the ability to inpaint accurately over a large mask area.

Mask-Aware Transformer vs. DALL-E 2

DALL-E 2 is OpenAI's latest text-driven image generation model. It's among the most commonly used tools for inpainting-related tasks because it's a simple-to-use managed service that lets you get going quickly. Let's see how it does on inpainting.

Demo of DALL-E 2 Inpainting

The demo below shows DALL-E 2 inpainting on the building, guided by the text prompt "fill with surrounding windows." It generates five variants ranging from normal to rather creative. Testing with simpler text prompts helps us evaluate across a wider range of data variance that users could provide. Products that use inpainting or outpainting often use simpler text prompts for this same reason.

Demo of using DALL-E for inpainting: *DALL-E 2 demo (Original Photo by* *Niklas Stumpf* on *Unsplash*)

How Does DALL-E 2 Work?

*DALL-E 2 unCLIP architecture (Source:* *Ramesh et al*.)

DALL-E 2 isn't a GAN like MAT. Instead, it builds on two prior OpenAI models.

The first is contrastive language-image pretraining (CLIP). It's a transformer-based language-image model that learns to associate text semantics and visual concepts in images through joint contrastive training.

DALL-E 2 has access to those latent representations for text and images that CLIP learns. All it needs now is a way to reuse them to generate new images. Generating images from latent embeddings is nothing new — MAT and other GANs do it too. But DALL-E 2's novelty is that it doesn't use a GAN at all! Instead, it uses an even better approach called diffusion.

Diffusion models synthesize images from noise by progressively removing the noise. DALL-E 2's diffusion model is additionally conditioned on CLIP and text embeddings to generate its images using a decoder model called unCLIP.
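The core arithmetic of noising and denoising can be sketched in a few lines. This follows the standard formulation used by denoising diffusion models in general. It's not DALL-E 2's actual code, and the function names are ours. In a real model, a neural network predicts the noise, whereas here we cheat and reuse the true noise to show that the inversion is exact.

```python
import math
import random

def add_noise(clean, noise, alpha_bar):
    """Forward step: mix the clean signal with Gaussian noise.

    alpha_bar in (0, 1] is the share of signal variance kept at this step.
    """
    keep, spread = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [keep * c + spread * n for c, n in zip(clean, noise)]

def remove_noise(noisy, predicted_noise, alpha_bar):
    """Invert add_noise, given an estimate of the noise that was added."""
    keep, spread = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [(x - spread * n) / keep for x, n in zip(noisy, predicted_noise)]

rng = random.Random(0)
clean = [0.2, -0.5, 0.9]
noise = [rng.gauss(0.0, 1.0) for _ in clean]
noisy = add_noise(clean, noise, alpha_bar=0.3)
recovered = remove_noise(noisy, noise, alpha_bar=0.3)
print(recovered)  # matches `clean` up to floating-point rounding
```

The better the network's noise estimate, the closer each denoising step gets to the clean image. Conditioning on text or CLIP embeddings simply gives the network more information for that estimate.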

MAT vs. DALL-E 2

Let's see some differences between MAT and DALL-E 2, both functional and technical.

1. Use MAT for Realism and DALL-E 2 for Creativity

MAT's inpainting automatically leans towards realism if you've trained it only on real-world images. It usually doesn't generate anything too creative or unexpected, and that's a good thing for most business use cases.

In contrast, DALL-E 2 often feels a bit overenthusiastic in creativity. Even plain simple prompts like "fill the holes with the table" can result in unexpected new objects being added. You'll need multiple cycles of prompting and erasing to make it do what you want. This can be very difficult to account for in SaaS product workflows that want to leverage this functionality. Each time a user has to think of a new prompt to generate the output they think is correct, they become less likely to keep trying to generate.

2. MAT Is a Pure Vision Model While DALL-E 2 Is a Language-Image Model

MAT is a pure vision model. Its only inputs are images. In contrast, DALL-E 2 is a text-guided image generator and text is a key part of its inputs.

3. MAT Is a GAN While DALL-E 2 Is a Diffusion Model

As the DALL-E 2 paper points out, GANs have limited variability because adversarial learning can't model complex distributions easily. They're also prone to training instability and have more parameters.

In contrast, diffusion models provide wider distribution coverage and easy scalability. They are also comparatively lightweight.

Mask-Aware Transformer vs. Stable Diffusion

Stable diffusion (SD) is an open-source implementation of image generation using a diffusion model. It enables you to generate images in the same way as commercial cloud services like DALL-E 2 or Midjourney. SD is also capable of inpainting on existing images. It's maintained by a consortium of academic and industry groups like CompVis at the Ludwig Maximilian University of Munich, Stability AI, and Runway among others.

Essentially, stable diffusion consists of an open-source framework, pre-trained models, self-managed application programming interfaces (APIs), cloud-managed paid APIs, and end-user graphics applications like DreamStudio.

Demo of Stable Diffusion Inpainting

Here's a quick demonstration of inpainting using pre-trained stable diffusion models. We used the DreamStudio application but you can also automate all this using SD's APIs. You'll notice that this is not considered large-hole inpainting as the area is not very large.

Great demo of using Stable Diffusion to add products to a retail shelf — *Generate imaginary "pink coca cola bottles shaped like milk cartons" on a store shelf (Photo by* *Franki Chamaki* on *Unsplash*)

Introduction to Stable Diffusion's Approach to Inpainting

*Stable diffusion architecture (Source:* *Rombach et al*.)

Stable diffusion is also a diffusion-based image synthesis model. It's similar to DALL-E 2 in approach but its architectural choices are different. The biggest difference is that stable diffusion prefers to do its diffusion-based denoising in a more compact latent space rather than the computationally intensive pixel space, which makes it a latent diffusion model (LDM).

SD's LDM consists of a latent diffusion autoencoder that learns to apply denoising directly to the embeddings. Essentially, SD uses the data from incremental noising to learn its reverse process. In doing so, it learns to identify both the noise itself and the probable noising step, given a noised embedding. Through repeated step-by-step denoising, it separates all the noise from the underlying image embedding and produces a "clean" embedding.

It implements this denoising using a U-net convolutional network that also has cross-attention layers in it. The use of attention blocks enables the denoising process to be conditioned to additional modalities, such as text embeddings.

How can you use text prompts to guide this denoising? The LDM described above is an unconditional diffusion model because it can denoise only the latent embeddings from images. However, the U-net's cross-attention mechanism provides a hook to inject other modalities and turn it into a conditional diffusion model.
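To make the cross-attention hook less abstract, here's a stripped-down sketch in plain Python. Queries come from the image latents, while keys and values come from the conditioning input (text token embeddings, for example). This is the generic scaled dot-product attention formula with the learned projection matrices omitted for brevity. It's not Stable Diffusion's actual code.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(image_queries, text_keys, text_values):
    """Each image-latent query attends over the text tokens."""
    scale = math.sqrt(len(text_keys[0]))
    outputs = []
    for query in image_queries:
        scores = [sum(q * k for q, k in zip(query, key)) / scale
                  for key in text_keys]
        weights = softmax(scores)
        # Blend the text values according to how relevant each token is.
        outputs.append([sum(w * value[i] for w, value in zip(weights, text_values))
                        for i in range(len(text_values[0]))])
    return outputs

# Two image positions attending over two text tokens.
queries = [[4.0, 0.0], [0.0, 4.0]]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
print(cross_attention(queries, keys, values))
```

Each image position ends up carrying a mix of conditioning information weighted by relevance. Swapping the text embeddings for embeddings of another modality requires no change to this mechanism.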

For text-guided image generation, it uses a text-to-image conditional LDM. A language-image transformer encoder, similar to OpenAI's CLIP, is first trained contrastively on the LAION-400M image-text pairs dataset. The encoder's text and image embeddings are injected into the denoising U-net via its cross-attention blocks. The denoising is now conditioned on the text and image embeddings and can be guided through text prompts. This is what is shown in the above demo.

MAT vs. Stable Diffusion

Many of the things said about MAT versus DALL-E 2 apply to stable diffusion too. But there are some unique differences and similarities too.

1. Self-Managed Deployment

A major similarity — one that may be of interest to your data security and budget — is that stable diffusion is open-source, just like MAT, and can be optionally deployed on your own servers for data security and privacy reasons.

But you also have the option of using Stability AI's commercial cloud APIs instead.

2. Latent Diffusion Is Much More Efficient Than Regular Diffusion

Denoising in latent space is orders of magnitude more computationally efficient than regular diffusion that denoises in the image space. You need fewer GPUs to run SD. In fact, you can even run it on your personal Mac or iPhone!
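A rough size comparison hints at where the savings come from. The numbers below assume a 512x512 RGB image and a latent that is 8 times smaller per side with 4 channels. Those latent dimensions are an assumption based on the publicly released Stable Diffusion v1 configuration, not something stated in this article.

```python
def element_count(height, width, channels):
    """Number of values the denoiser has to process per step."""
    return height * width * channels

pixel_elements = element_count(512, 512, 3)   # RGB image
latent_elements = element_count(64, 64, 4)    # assumed latent shape
print(pixel_elements, latent_elements)        # 786432 16384
print(pixel_elements // latent_elements)      # 48
```

Processing 48 times fewer values per denoising step doesn't translate directly into a 48-fold speedup, but it shows why each step is far cheaper in latent space.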

3. Stable Diffusion Allows Any Modality

The use of cross-attention in the denoising U-net enables you to use any modality to influence the image generation. Text is the most obvious one. But nothing stops you from using other modalities like:

Videos for synthesizing new frames of a different type. For example, you can generate real-life frames from a cartoon video or vice versa.
Audio or music for audio-image synthesis that may be of interest to the music industry or music events.
Speech signals for speech-image synthesis that may be useful to the hearing-impaired.
Medical sensors for physiological signal-to-image synthesis that may help doctors visualize the working of the body in new ways and perhaps spot problems or improve diagnoses.
Any IoT sensor signal-to-image synthesis for innovative domain-specific visualization. For example, sensors on VR devices like the Oculus Rift or NextMind can be used to generate and show novel worlds.

Start Using Mask-Aware Transformers, Image Inpainting, and More

The generative AI revolution has produced an incredible list of automated capabilities that your business can use by leveraging GPT-3, Stable Diffusion, MAT, and many others. Automated creation of customized photos and videos, thought to be too complex and creative for machines just three years ago, is now available at your fingertips. Contact us to talk about how we can build you generative AI products such as these!

References

Wenbo Li, Zhe Lin, Kun Zhou, Lu Qi, Yi Wang, Jiaya Jia (2022). “MAT: Mask-Aware Transformer for Large Hole Image Inpainting”. arXiv:2203.15270 [cs.CV]. https://arxiv.org/abs/2203.15270
Olaf Ronneberger, Philipp Fischer, Thomas Brox (2015). “U-Net: Convolutional Networks for Biomedical Image Segmentation”. arXiv:1505.04597 [cs.CV]. https://arxiv.org/abs/1505.04597
Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, Mark Chen (2022). “Hierarchical Text-Conditional Image Generation with CLIP Latents”. arXiv:2204.06125 [cs.CV]. https://arxiv.org/abs/2204.06125
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, Björn Ommer (2021). “High-Resolution Image Synthesis with Latent Diffusion Models”. arXiv:2112.10752 [cs.CV]. https://arxiv.org/abs/2112.10752

