InstructPix2Pix – Edit Images With Prompts

In this article, we discuss what InstructPix2Pix is, how it works, how it is trained, and what kind of image editing it can do.

With the recent boom in Stable Diffusion, generating images from text-based prompts has become quite common. The Image2Image technique follows this approach and allows us to edit images through masking and inpainting. With rapid advancements in this technology, we no longer need masks to edit images. InstructPix2Pix takes this one step further by letting users provide natural language instructions that guide the model's editing process.

Figure 1 InstructPix2Pix examples

In this article, we will explore the InstructPix2Pix model, how it works, and the kind of results it can generate.

After reading this post, you will have the skills needed to use InstructPix2Pix effectively in your own creative projects. Knowing both the abilities and the shortcomings of InstructPix2Pix, you can take your image editing skills to the next level.

What is InstructPix2Pix?

InstructPix2Pix is a diffusion model, a variant of Stable Diffusion to be precise. But instead of being an image generation (text-to-image) model, it is an image editing diffusion model.

Simply put, if we give an input image and prompt instructions to InstructPix2Pix, the model will follow the instructions to edit the image.

It was introduced by Tim Brooks, Aleksander Holynski, and Alexei A. Efros in the paper titled InstructPix2Pix: Learning to Follow Image Editing Instructions.

Figure 2 An example of making a person wear a hat using InstructPix2Pix

The above figure shows a simple example of how InstructPix2Pix works. We make the person wear a hat using a very simple instruction.

The question that arises from this is whether a model like InstructPix2Pix is actually beneficial. The answer is a resounding YES! Quoting the authors:

“Since it performs edits in the forward pass and does not require per-example fine-tuning or inversion, our model edits images quickly, in a matter of seconds. We show compelling editing results for a diverse collection of input images and written instructions.”

The above statement shows that once fine-tuned, the model can be used for inference on a wide variety of images.

The core working of InstructPix2Pix remains similar to the original Stable Diffusion model. It still has the following components (a short code sketch after this list shows how they appear in practice):

  • A large transformer-based text encoder.
  • An autoencoder network whose encoder maps images to the latent space and whose decoder decodes the final output of the UNet and upsamples it back to pixel space.
  • A UNet model that works in the latent space to predict the noise.
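
If you want to see these components concretely, here is a minimal sketch. It assumes the Hugging Face diffusers library and the publicly released timbrooks/instruct-pix2pix checkpoint; the paper itself does not prescribe either of these.

import torch
from diffusers import StableDiffusionInstructPix2PixPipeline

# Load the released InstructPix2Pix checkpoint (used here purely for illustration).
pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
)

# The three building blocks listed above are exposed as pipeline attributes.
print(type(pipe.text_encoder).__name__)  # transformer-based text encoder (CLIP)
print(type(pipe.vae).__name__)           # autoencoder: encoder to latents, decoder back to pixels
print(type(pipe.unet).__name__)          # UNet that predicts noise in the latent space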

Most of the changes happen in the training phase of the model. In fact, building InstructPix2Pix involves training two different models. The technical details of the model and the training procedure are covered in the next section.

How Does InstructPix2Pix Work?

In this section, we will focus on how InstructPix2Pix works. This includes two things:

  • Training the InstructPix2Pix model.
  • Inference using the model.

The entire training procedure involves training two separate models. InstructPix2Pix is a multi-modal model that deals with both text and images, and it tries to edit an image based on instructions. This is very difficult to achieve using a single model.

Hence, the authors combine the power of the GPT-3 language model and Stable Diffusion. GPT-3 is used to generate text prompts (edit instructions), and a variation of the Stable Diffusion model is trained on these instructions and the generated images.

Let’s take a closer look at the training procedure.

Fine-Tuning GPT-3

The authors fine-tune the GPT-3 model to create a multi-modal training dataset. We can break down the entire process into the following points:

  • We provide input captions to the GPT-3 model, and it is fine-tuned on them.
  • After fine-tuning, it learns to generate an edit instruction and the final caption from a new input caption.

For example, our input caption is:

“A cat sitting on a couch”. 

The edit instruction can be: 

“cat wearing a hat”

In this case, the final caption can be:

“A cat wearing a hat sitting on a couch”.

Note: We only input the first caption to the GPT-3 model. It learns to generate both the edit instruction and the final edited caption.

Interestingly, the GPT-3 model was fine-tuned on a relatively small dataset. The authors collected just 700 captions from the LAION-Aesthetics dataset. The corresponding edit instructions and output captions were then written manually to train the model.

Figure 3 Generating caption pairs using GPT-3 for InstructPix2Pix training

Figure 3 shows the dataset generation process in the first row. As discussed, these edits were written manually.

The second row shows how the trained GPT-3 model generated the final dataset. 

After training, we can feed a caption to the GPT-3 model, and it generates an edit instruction and the final caption. For training InstructPix2Pix, more than 450,000 caption pairs were generated.
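
To make the data format concrete, a single generated training triple might look like the following. This is a hypothetical example built around the caption used earlier, not an actual entry from the released dataset.

# One hypothetical (caption, instruction, edited caption) training triple.
example_triple = {
    "input_caption": "A cat sitting on a couch",
    "edit_instruction": "make the cat wear a hat",
    "edited_caption": "A cat wearing a hat sitting on a couch",
}

# Repeating this generation step over hundreds of thousands of captions
# produces the text half of the InstructPix2Pix training data.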

Fine-Tuning Text-to-Image Stable Diffusion Model

After training the GPT-3 model and generating the caption pairs, the next step is fine-tuning a pre-trained Stable Diffusion model.

In the dataset, we have the initial and the final caption for each example. To turn these caption pairs into training data, two images are generated from each pair using Stable Diffusion, and the editing model is then trained on these before-and-after image pairs along with the edit instructions.

However, there is an issue. Even if we fix the seed and other parameters, Stable Diffusion models tend to generate very different images from similar prompts. So, the image generated from the edited caption will often not look like an edited version of the first image; rather, it will be an entirely different image.

To overcome this, the authors use a technique called Prompt-to-Prompt. This technique allows text-to-image models to generate similar images with similar prompts.

Figure 4 Stable Diffusion image generation example with and without Prompt-to-Prompt

The above figure shows an example of Prompt-to-Prompt in action. Both images were generated using Stable Diffusion. The one on the left does not use Prompt-to-Prompt, while the one on the right uses the technique.
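
The full Prompt-to-Prompt method injects the cross-attention maps from the first generation into the second one, which is beyond a short snippet. The sketch below only shows the simpler half of the idea: sampling both captions of a pair from the same starting noise so that the two generations stay comparable. It assumes the Hugging Face diffusers library and the runwayml/stable-diffusion-v1-5 checkpoint.

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Use the same seed, and therefore the same initial latent noise, for both captions.
# Real Prompt-to-Prompt additionally shares cross-attention maps between the two
# generations, which keeps the image layout far more consistent than a seed alone.
before = pipe("A cat sitting on a couch",
              generator=torch.Generator("cuda").manual_seed(42)).images[0]
after = pipe("A cat wearing a hat sitting on a couch",
             generator=torch.Generator("cuda").manual_seed(42)).images[0]

before.save("before.png")
after.save("after.png")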

The following figure from the paper depicts the training dataset generation and the inference phase of InstructPix2Pix quite well.

Figure 5 InstructPix2Pix training data generation, training, and inference

As we can see, the inference process is quite simple. We just need an input image and an edit instruction. The InstructPix2Pix model follows the edit instructions quite well when generating the final image.

Classifier-Free Guidance for InstructPix2Pix

The final image that InstructPix2Pix generates is a combination of the text prompt and the input image. During inference, we need a way to tell the model where to focus more – the image or text.

This is achieved through classifier-free guidance over the two conditional inputs. Let's call the corresponding guidance scales S_T and S_I for the text and the input image, respectively.

During inference, guiding the model more through S_T will push the final image closer to the prompt. Similarly, guiding the model more through S_I will keep the final image closer to the input image.

Figure 6 Example of Classifier-Free Guidance in AI Image Generation

During the training phase, implementing classifier-free guidance helps the model learn both conditional and unconditional denoising. We can call the image conditioning C_I and the text conditioning C_T. For unconditional training, both are set to a null value.
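
Putting the pieces together, the paper combines the two guidance scales into a single modified noise prediction. Paraphrasing its classifier-free guidance formulation, with z_t the noisy latent, e the noise predictor, and ∅ the null conditioning:

e_hat(z_t, C_I, C_T) = e(z_t, ∅, ∅)
                     + S_I * ( e(z_t, C_I, ∅) − e(z_t, ∅, ∅) )
                     + S_T * ( e(z_t, C_I, C_T) − e(z_t, C_I, ∅) )

Increasing S_T amplifies the shift introduced by the text instruction, while increasing S_I amplifies the pull toward the input image, which is exactly the tradeoff discussed next.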

When carrying out inference, choosing between S_T (more weight toward the prompt) and S_I (more weight toward the initial image) is a tradeoff. We need to choose these settings according to the result we want.
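
As a concrete illustration of this tradeoff, here is a minimal inference sketch. It again assumes the diffusers library and the timbrooks/instruct-pix2pix checkpoint; in that pipeline, guidance_scale plays the role of the text weight S_T and image_guidance_scale the role of the image weight S_I.

import torch
from diffusers import StableDiffusionInstructPix2PixPipeline
from diffusers.utils import load_image

pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")

# Any RGB photo works here; the URL is just a placeholder for your own image.
image = load_image("https://example.com/input.jpg").resize((512, 512))

edited = pipe(
    "change the season from summer to winter",
    image=image,
    num_inference_steps=20,
    guidance_scale=7.5,        # higher: follow the text instruction more closely
    image_guidance_scale=1.5,  # higher: stay closer to the input image
).images[0]

edited.save("edited.png")

Raising image_guidance_scale toward 2.0 or above tends to preserve the original photo at the cost of weaker edits, while lowering it lets the instruction dominate.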

InstructPix2Pix Results

The results published in the paper are pretty interesting. We can use InstructPix2Pix to recreate images with various artistic mediums or even different environmental styles.

Figure 7 Generating images using InstructPix2Pix in different artistic mediums

When compared to other Stable Diffusion image editing methods, InstructPix2Pix surpasses almost all of them.

Figure 8 InstructPix2Pix compared with other Stable Diffusion based image generation methods

This is not all. InstructPix2Pix can perform other challenging edits as well. Some of them are:

  • Changing clothing style.
  • Changing weather.
  • Replacing background.
  • Changing seasons.

Here are some of our own image edits using InstructPix2Pix.

Figure 9 Changing the weather from sunny to cloudy using InstructPix2Pix

In the above figure, we change the weather from sunny to cloudy. The model was able to do it using just the instruction “change weather from sunny to cloudy”.

Figure 10 Changing the season from summer to winter using InstructPix2Pix

In this case, we tell the InstructPix2Pix model to change the season from summer to winter, which it does with ease.

Figure 11 Changing artistic material style using InstructPix2Pix. Here, we change the material of a photo to marble and wood

The above results are very interesting: we tell the InstructPix2Pix model to change the material type. In the first instance, we change the material to wood, and in the second one, to stone.

Applications Using InstructPix2Pix

We can build some fairly nice applications using InstructPix2Pix. A few such applications are:

  • Virtual makeup
  • Virtual try-on
  • And virtual hairstylist

InstructPix2Pix is really good at swapping clothing styles, makeup, spectacles, and even hairstyles. The use of InstructPix2Pix as a virtual hair stylist is an interesting one because it is very hard to swap hairstyles in images while keeping the other facial features the same. However, provided with the right image, InstructPix2Pix can carry it out easily.

When used in the right way, applications like virtual makeup and virtual try-on will become easier to build and more accessible to the public.

Here are some examples of the same.

Figure 12 Changing hair color using InstructPix2Pix
Figure 13 Virtual try-on using InstructPix2Pix

Want to know how to build such applications using InstructPix2Pix and other Stable Diffusion models? Our new course Mastering AI Art Generation will teach you that and much more. Become a master in AI art generation by joining the new course by OpenCV.

Where to Use InstructPix2Pix?

After going through the post, you may have the urge to try out InstructPix2Pix on your own. Fortunately, there are a couple of ways in which you can easily access InstructPix2Pix.


Automatic1111

If you have a GPU on your local system and are not afraid of tinkering with Python code, then Automatic1111 is one of the best ways to try out InstructPix2Pix. As of now, InstructPix2Pix is part of the web application, and you just need to download the model.

Using InstructPix2Pix in the AUTOMATIC1111 WebUI.
Figure 14 Using InstructPix2Pix in the AUTOMATIC1111 WebUI

In case you do not have a GPU, you can try out InstructPix2Pix with Automatic1111, even on Google Colab.

Official Gradio App

The authors of the paper also provide an official Hugging Face Space that uses the Gradio interface.

You can access it directly through Hugging Face Spaces.

The best part is that you needn't set up anything locally to use it.

Conclusion

In this article, we covered the InstructPix2Pix model, which can edit images with just text prompts. Along with its training criteria, we also explored the possible applications we can build using it.

Stable Diffusion models have opened a world of possibilities in the creative space. Now, fields like digital art and techno-art are not limited to a select few. Anybody with the urge to learn this new technology can create compelling art with just a few lines of text. Whether editing images or creating new art, almost everyone has access to the tools built on diffusion models.

InstructPix2Pix is a groundbreaking image editing tool that allows users to edit images using natural language prompts. With InstructPix2Pix, users can easily add or remove objects, change colors, and manipulate images in ways that were previously difficult or impossible. This tool has the potential to revolutionize the way people edit images, making it more accessible and intuitive than ever before.

I invite you to try it yourself with their demo and learn more about their approach by reading the paper or with the code they made publicly available. Let us know in the comments how you used the power of diffusion models and your creativity to create something new.


