Max's AI Playground
On this blog, I delve into practical AI tools and applications. I like bleeding-edge technology and APIs, and I explore interesting use cases and new tools.
I love Laravel (TALL Stack) ❤️ and all things JS.
May 8, 2024

Single Image Transformations: Exploring Instruct Pix2Pix in Stable Diffusion

What exactly is Pix2Pix? InstructPix2Pix (Pix2Pix for short in this post) is a Stable Diffusion model that edits images based solely on textual instructions. Timothy Brooks, one of the model’s creators, sums it up in the title of the accompanying paper: “InstructPix2Pix: Learning to Follow Image Editing Instructions”. The simplicity of Pix2Pix opens up a realm of possibilities for anyone interested in AI-driven image editing.

The Idea: Transform any image with minimal effort.

The Goal: To manipulate images quickly and without any prior editing skills (e.g. in Photoshop).

The Use Case: This exploration started with an aim to automate the generation of YouTube thumbnails by changing facial expressions via simple commands.

The Annoyance: Traditional photo editing requires time and effort I’d rather not spend, and manual edits in Photoshop or inpainting in Stable Diffusion often lead to frustrating cycles of trial and error. 😬

Table of Contents

  1. Set Up InstructPix2Pix
  2. Choose a Base Image
  3. Select an Effective Prompt
  4. Examples of Transformations
  5. Comparisons with Embeddings and epiCPhotoGasm

Step 1: Setting Up the Instruct Pix2Pix Model (MacBook M1)

Download and Select the Model

Ensure the web UI is operational by following the official instructions.

Side note: All of the following steps were taken on a MacBook M1.

First, download the ckpt or safetensors model from the Hugging Face repository and place it in the models/Stable-diffusion directory (models\Stable-diffusion on Windows). Refresh the checkpoint list and select the instruct-pix2pix-00-22000 model from the dropdown menu.
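If you prefer to script this step, the copy can be sketched in a few lines of Python. This is a minimal sketch and not part of the official setup: the function name install_checkpoint and the webui_root argument are made up for illustration, and it assumes the model file has already been downloaded manually.

```python
import shutil
from pathlib import Path

# The two checkpoint formats mentioned above
ALLOWED_SUFFIXES = {".ckpt", ".safetensors"}


def install_checkpoint(downloaded_file, webui_root):
    """Copy a downloaded checkpoint into <webui_root>/models/Stable-diffusion.

    Hypothetical helper: webui_root is wherever your web UI checkout lives.
    """
    source = Path(downloaded_file)
    if source.suffix not in ALLOWED_SUFFIXES:
        raise ValueError(f"Unexpected checkpoint format: {source.suffix}")

    # pathlib joins with the right separator on macOS, Linux and Windows
    target_dir = Path(webui_root) / "models" / "Stable-diffusion"
    target_dir.mkdir(parents=True, exist_ok=True)

    target = target_dir / source.name
    shutil.copy2(source, target)
    return target
```

After copying, the model still has to be selected from the checkpoint dropdown in the web UI.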

Additional note:
I ran into the error message “Cannot recognize the ControlModel” when using ControlNet. It ultimately did not affect the outcomes, but if you get this error too, I recommend leaving ControlNet disabled and loading the model directly as a Stable Diffusion checkpoint.

Step 2: Choose Your Base Image

A younger and prettier version of me 😳

Step 3: Crafting Effective Prompts

To maximize the model’s effectiveness, articulate your desired changes as if you were instructing Photoshop. This model excels when directives are precise, whether it’s altering the lighting, adjusting colors, or removing and replacing elements.

Here are some successfully tested prompt examples (credit to Andrew from stable-diffusion-art.com):

Important: Set the Denoising strength to 1 to ensure the model functions properly.
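For anyone who wants to automate this (the thumbnail use case, for example), the same settings can be sent to the web UI over HTTP instead of clicking through the interface. The sketch below assumes the AUTOMATIC1111 web UI is running locally with its API enabled (the --api launch flag) on the default port. The payload field names follow that API as I understand it, so treat them as assumptions and check them against the /docs page of your own install. The helper names are made up for illustration.

```python
import base64
import json
import urllib.request


def build_img2img_payload(image_bytes, prompt, negative_prompt="",
                          steps=24, cfg_scale=7.5, image_cfg_scale=1.5, seed=-1):
    """Build an img2img request body for an InstructPix2Pix edit."""
    return {
        # The API expects the base image as a base64 string
        "init_images": [base64.b64encode(image_bytes).decode("ascii")],
        "prompt": prompt,
        "negative_prompt": negative_prompt,
        "steps": steps,
        "sampler_name": "DPM++ 2M",
        "cfg_scale": cfg_scale,              # weight of the text instruction
        "image_cfg_scale": image_cfg_scale,  # weight of the input image
        "denoising_strength": 1.0,           # must stay at 1 for this model
        "seed": seed,                        # -1 asks for a random seed
        "width": 512,
        "height": 512,
    }


def send_img2img(payload, base_url="http://127.0.0.1:7860"):
    """POST the payload and return the list of base64-encoded result images."""
    request = urllib.request.Request(
        f"{base_url}/sdapi/v1/img2img",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["images"]
```

Building the payload is kept separate from sending it, so the settings can be inspected or logged before a (slow) generation is started.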

Step 4: Examples of Transformations

Let’s explore a range of outcomes from successful transformations to… learning experiences. 🤓

showing base image before pix2pix conversion showing pix2pix stable diffusion result of prompt make it look like a golden statue
prompt: make it look like a golden statue
Negative prompt: Disfigured, cartoon, blurry, nude Steps: 24, Sampler: DPM++ 2M, Schedule type: Karras, CFG scale: 8, Image CFG scale: 1.75, Seed: 3207543649, Size: 512x512, Model hash: ffd280ddcf, Model: instruct-pix2pix-00-22000, Denoising strength: 1, Version: v1.9.3
My opinion: This is one of the prompts from the author's website and it works quite well here, too.
showing base image before pix2pix conversion showing pix2pix stable diffusion result of prompt make the hair and eyebrows blond
prompt: make the hair and eyebrows blond
Steps: 24, Sampler: DPM++ 2M, Schedule type: Karras, CFG scale: 10, Image CFG scale: 1.75, Seed: 1977613539, Size: 512x512, Model hash: ffd280ddcf, Model: instruct-pix2pix-00-22000, Denoising strength: 1, Version: v1.9.3
My opinion: Funny, actually.
showing base image before pix2pix conversion showing pix2pix stable diffusion result of prompt close their eyes
prompt: close their eyes
Negative prompt: Disfigured, cartoon, blurry, nude Steps: 24, Sampler: DPM++ 2M, Schedule type: Karras, CFG scale: 8, Image CFG scale: 1.6, Seed: 1405665529, Size: 512x512, Model hash: ffd280ddcf, Model: instruct-pix2pix-00-22000, Denoising strength: 1, Version: v1.9.3
My opinion: This prompt worked very well and it's pretty realistic, too.
showing base image before pix2pix conversion showing pix2pix stable diffusion result of prompt make him look furious
prompt: make him look furious
Negative prompt: bad eyes, low quality Steps: 40, Sampler: Euler, Schedule type: Automatic, CFG scale: 7.5, Image CFG scale: 1.6, Seed: 1580611629, Size: 512x512, Model hash: ffd280ddcf, Model: instruct-pix2pix-00-22000, Denoising strength: 1, Version: v1.9.3
My opinion: Excellent result, one of the best transformations.
showing base image before pix2pix conversion showing pix2pix stable diffusion result of prompt make him angry
prompt: make him angry
Steps: 38, Sampler: DPM++ 2M, Schedule type: Karras, CFG scale: 8, Image CFG scale: 1.55, Seed: 2086482578, Size: 512x512, Model hash: ffd280ddcf, Model: instruct-pix2pix-00-22000, Denoising strength: 1, Version: v1.9.3
My opinion: If I were you, I'd leg it! 🔥 Quite a good result, but the mouth requires some inpainting.
showing base image before pix2pix conversion showing pix2pix stable diffusion result of prompt make him terrifying
prompt: make him terrifying
Steps: 50, Sampler: DPM++ 2M, Schedule type: Karras, CFG scale: 9, Image CFG scale: 1.4, Seed: 4091701604, Size: 512x512, Model hash: ffd280ddcf, Model: instruct-pix2pix-00-22000, Denoising strength: 1, Version: v1.9.3
My opinion: This is just a... fail!
showing base image before pix2pix conversion showing pix2pix stable diffusion result of prompt make him look more mature
prompt: make him look more mature
Negative prompt: bad teeth, bad quality, medium quality, blurry Steps: 42, Sampler: DPM++ 2M, Schedule type: Karras, CFG scale: 7.5, Image CFG scale: 1.75, Seed: 353877794, Size: 512x512, Model hash: ffd280ddcf, Model: instruct-pix2pix-00-22000, Denoising strength: 1, Version: v1.9.3
My opinion: Accepted! This is impressive and I could imagine myself looking like this one day, perhaps... 🤔
showing base image before pix2pix conversion showing pix2pix stable diffusion result of prompt apply face paint
prompt: apply face paint
Steps: 50, Sampler: DPM++ 2M, Schedule type: Karras, CFG scale: 7.5, Image CFG scale: 1.5, Seed: 2442579842, Size: 512x512, Model hash: ffd280ddcf, Model: instruct-pix2pix-00-22000, Denoising strength: 1, Version: v1.9.3
My opinion: I'm undecided on this one. Might have to give it a try some time. 😅
showing base image before pix2pix conversion showing pix2pix stable diffusion result of prompt pixelate the background
prompt: pixelate the background
Negative prompt: Disfigured, cartoon, blurry, nude Steps: 24, Sampler: DPM++ 2M, Schedule type: Karras, CFG scale: 8, Image CFG scale: 1.25, Seed: 3683370930, Size: 512x512, Model hash: ffd280ddcf, Model: instruct-pix2pix-00-22000, Denoising strength: 1, Version: v1.9.3
My opinion: I love the pattern on this.
showing base image before pix2pix conversion showing pix2pix stable diffusion result of prompt what would it look like if it were snowing?
prompt: what would it look like if it were snowing?
Steps: 24, Sampler: DPM++ 2M, Schedule type: Karras, CFG scale: 7.5, Image CFG scale: 1.5, Seed: 538119155, Size: 512x512, Model hash: ffd280ddcf, Model: instruct-pix2pix-00-22000, Denoising strength: 1, Version: v1.9.3
My opinion: I really like this one. This would have taken ages in Photoshop, at least for me...
showing base image before pix2pix conversion showing pix2pix stable diffusion result of prompt make them wear a suit
prompt: make them wear a suit
Negative prompt: Disfigured, cartoon, blurry, nude Steps: 30, Sampler: DPM++ 2M, Schedule type: Karras, CFG scale: 8, Image CFG scale: 1.6, Seed: 3460861800, Size: 512x512, Model hash: ffd280ddcf, Model: instruct-pix2pix-00-22000, Denoising strength: 1, Version: v1.9.3
My opinion: Suit up! 🕴️
showing base image before pix2pix conversion showing pix2pix stable diffusion result of prompt add sunglasses
prompt: add sunglasses
Negative prompt: bad teeth, bad quality, medium quality, blurry Steps: 24, Sampler: DPM++ 2M, Schedule type: Karras, CFG scale: 10, Image CFG scale: 1.75, Seed: 3201085422, Size: 512x512, Model hash: ffd280ddcf, Model: instruct-pix2pix-00-22000, Denoising strength: 1, Version: v1.9.3
My opinion: Wow, talk about extravagant!
showing base image before pix2pix conversion showing pix2pix stable diffusion result of prompt make the person 10 years younger
prompt: make the person 10 years younger
Negative prompt: Disfigured, cartoon, blurry, nude Steps: 40, Sampler: DPM++ 2M, Schedule type: Karras, CFG scale: 8, Image CFG scale: 1.5, Seed: 3919771410, Size: 512x512, Model hash: ffd280ddcf, Model: instruct-pix2pix-00-22000, Denoising strength: 1, Version: v1.9.3
My opinion: Doesn't really look much younger, but rather more southern (Italian, that is, since Italy is south of Germany).
showing base image before pix2pix conversion showing pix2pix stable diffusion result of prompt make them look like a celebrity
prompt: make them look like a celebrity
Negative prompt: Disfigured, cartoon, blurry, nude Steps: 36, Sampler: DPM++ 2M, Schedule type: Karras, CFG scale: 8.5, Image CFG scale: 1.1, Seed: 1197305648, Size: 512x512, Model hash: ffd280ddcf, Model: instruct-pix2pix-00-22000, Denoising strength: 1, Version: v1.9.3
My opinion: Seems like I might need a nose job and a hairstyle update in the middle...
showing base image before pix2pix conversion showing pix2pix stable diffusion result of prompt make them look like a celebrity
prompt: make them look like a celebrity
Negative prompt: Disfigured, cartoon, blurry, nude Steps: 36, Sampler: DPM++ 2M, Schedule type: Karras, CFG scale: 8.5, Image CFG scale: 1.1, Seed: 1197305647, Size: 512x512, Model hash: ffd280ddcf, Model: instruct-pix2pix-00-22000, Denoising strength: 1, Version: v1.9.3
My opinion: A sexy version of me.
showing base image before pix2pix conversion showing pix2pix stable diffusion result of prompt make the hair gray
prompt: make the hair gray
Steps: 42, Sampler: DPM++ 2M, Schedule type: Karras, CFG scale: 7.5, Image CFG scale: 1.5, Seed: 2180229889, Size: 512x512, Model hash: ffd280ddcf, Model: instruct-pix2pix-00-22000, Denoising strength: 1, Version: v1.9.3
My opinion: One more version of sexy me since it's so much fun!
showing base image before pix2pix conversion showing pix2pix stable diffusion result of prompt make his sweater a leather jacket
prompt: make his sweater a leather jacket
Steps: 24, Sampler: DPM++ 2M, Schedule type: Karras, CFG scale: 7.5, Image CFG scale: 1.5, Seed: 291088540, Size: 512x512, Model hash: ffd280ddcf, Model: instruct-pix2pix-00-22000, Denoising strength: 1, Version: v1.9.3
My opinion: Impressive, but it looks almost like there's also a leather bag on the top of the left shoulder. I suppose this is due to the source image being a bit wrinkly in that area.
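
Each example above comes with a one-line settings dump. To reuse those values in a script, the line can be split into a dictionary. The following is a rough sketch written for the line format shown in this post: the function name parse_settings is made up, and the regular expression is my own and has only been checked against lines shaped like the ones above.

```python
import re

# key: value pairs separated by commas; quoted values may contain commas/colons
_PARAM_RE = re.compile(r'\s*([\w ]+):\s*("(?:\\.|[^\\"])+"|[^,]*)(?:,|$)')


def parse_settings(line):
    """Turn one settings line from this post into a dict of strings."""
    body = line.strip()
    negative = ""

    if body.startswith("Negative prompt:"):
        # The negative prompt itself contains commas, so cut it off first.
        # It runs until the "Steps:" key begins.
        head, sep, tail = body.partition(" Steps:")
        if sep:
            negative = head[len("Negative prompt:"):].strip()
            body = "Steps:" + tail

    settings = {
        key.strip(): value.strip().strip('"')
        for key, value in _PARAM_RE.findall(body)
    }
    if negative:
        settings["Negative prompt"] = negative
    return settings
```

All values come back as strings, so numeric settings such as Steps or Image CFG scale still need an int() or float() conversion before use.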

Observations

Pix2Pix shines in style changes and object replacement but struggles with complex facial expressions and scenery adjustments. The key to making it work often lies in adjusting the CFG Scale and the Image CFG Scale to suit the prompt. It fails on viewpoint changes, and it sometimes fails to isolate the specified object.

If an image doesn’t change at all, either lower the Image CFG Scale (try 1.25 or even lower if 1.5 does nothing for you) or increase the CFG Scale of the actual prompt (try 8-9 if 7.5 was your baseline).
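
This trial-and-error loop can be written down as a short list of settings to try in order. The sketch below is only an illustration of the rule of thumb above. The function name, the step sizes and the lower bound of 1.0 for the Image CFG Scale are my own assumptions, not values from the model's documentation.

```python
def adjustment_ladder(cfg=7.5, image_cfg=1.5, rounds=3,
                      cfg_step=0.5, image_cfg_step=0.25, min_image_cfg=1.0):
    """Return (cfg, image_cfg) pairs to try when the image does not change.

    Each round first loosens the tie to the input image (lower Image CFG),
    then strengthens the text instruction (higher CFG).
    """
    candidates = []
    for i in range(1, rounds + 1):
        lowered = max(min_image_cfg, round(image_cfg - i * image_cfg_step, 2))
        candidates.append((cfg, lowered))
        candidates.append((round(cfg + i * cfg_step, 2), image_cfg))

    # Drop repeats (the lower bound can produce them) but keep the order
    return list(dict.fromkeys(candidates))


for cfg_scale, image_cfg_scale in adjustment_ladder():
    print(f"CFG scale: {cfg_scale}, Image CFG scale: {image_cfg_scale}")
```

With the defaults, the list starts at an Image CFG Scale of 1.25 and ends at a CFG Scale of 9, which matches the ranges suggested above.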

Evoking strong emotions from a neutral facial expression is challenging with Pix2Pix. In such cases, I recommend using a custom embedding or checkpoint, as illustrated below.

Takeaways

Challenges: Changing facial expressions is as tough as traditional image editing. Subtle changes are more reliably achieved than complete overhauls.

Strengths: Ideal for straightforward tasks like color changes or background swaps.

Limitations: Struggles with dramatic emotional expressions due to issues with detailing in areas like the mouth and eyes.

Remember, the effectiveness of Pix2Pix can vary dramatically based on the specificity of your prompts and the settings you choose. What works in Photoshop can often be replicated here, albeit with some practice and patience.

Taking It Further

Changing Facial Expressions

To alter facial expressions more effectively, a different method is needed, and custom embeddings look more promising.

I explored several embeddings over at civitai, such as Nervous512, Grin512 and Sad512.

Here are some specific emotions tailored for different expressions:

After downloading these embeddings, simply place them in the embeddings folder and use them as follows:

a <embedding_name> man, e.g. a happy512 man or an angry512 man
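
When trying several embeddings in a row, the prompt can be assembled by a small helper. This is a minimal sketch with a made-up function name. The optional emphasis argument wraps the embedding name in parentheses, the web UI syntax that also appears in prompts such as a portrait of a (sad512) man. In the AUTOMATIC1111 web UI each pair of parentheses raises attention by a factor of 1.1.

```python
def embedding_prompt(embedding, subject="man", emphasis=0):
    """Build a prompt like 'a happy512 man' or 'a (sad512) man'.

    emphasis is the number of parenthesis pairs placed around the embedding.
    """
    token = "(" * emphasis + embedding + ")" * emphasis
    # Pick the article from the first letter of the embedding name
    article = "an" if embedding[0].lower() in "aeiou" else "a"
    return f"{article} {token} {subject}"


for name in ("happy512", "angry512", "sad512"):
    print(embedding_prompt(name))
```

The embedding name in the prompt has to match the file name in the embeddings folder, without the file extension.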

Step 5: Examples Using Embeddings

showing base image before pix2pix conversion showing pix2pix stable diffusion result of prompt a happy512 man
prompt: a happy512 man
Steps: 32, Sampler: DPM++ 2M, Schedule type: Karras, CFG scale: 7.5, Image CFG scale: 1.5, Seed: 481831985, Size: 512x512, Model hash: ffd280ddcf, Model: instruct-pix2pix-00-22000, Denoising strength: 1, TI hashes: "happy512: 4fa643103a06", Version: v1.9.3
My opinion: Achieving a realistic smile is challenging. The embedding has a hard time getting the teeth right. Additional inpainting is required.
showing base image before pix2pix conversion showing pix2pix stable diffusion result of prompt a portrait of smile512 man
prompt: a portrait of smile512 man
Negative prompt: low quality, deformed Steps: 30, Sampler: DPM++ 2M, Schedule type: Karras, CFG scale: 7.5, Image CFG scale: 1.9, Seed: 1944199414, Size: 512x512, Model hash: ffd280ddcf, Model: instruct-pix2pix-00-22000, Denoising strength: 1, Version: v1.9.3
My opinion: Teeth are slightly better, but not great either. We'll have to make the extra effort and use inpainting here as well.

Using inpainting might yield similar results, but it requires more effort to manually create masks for areas like the eyes and cheeks. In contrast, using an embedding automates this process, considering the entire face without the need for detailed manual adjustments.

In Comparison

Let’s compare pix2pix with the smile512 embedding:

showing result of pix2pix using prompt make him smile
prompt: make him smile
Negative prompt: low quality, deformed Steps: 30, Sampler: DPM++ 2M, Schedule type: Karras, CFG scale: 7.5, Image CFG scale: 1.9, Seed: 1944199414, Size: 512x512, Model hash: ffd280ddcf, Model: instruct-pix2pix-00-22000, Denoising strength: 1, Version: v1.9.3
showing result of embedding using prompt a smile512 man
prompt: a smile512 man
Negative prompt: low quality, deformed Steps: 33, Sampler: DPM++ 2M, Schedule type: Karras, CFG scale: 7.5, Image CFG scale: 1.8, Seed: 2661397260, Size: 512x512, Model hash: ffd280ddcf, Model: instruct-pix2pix-00-22000, Denoising strength: 1, TI hashes: "smile512: 2ad4e0cac932", Version: v1.9.3
My opinion: Tough to say which one is the winner here... Both aren't exactly great. It's probably a tie.

Now let’s compare pix2pix with the sad512 embedding:

showing result of pix2pix using prompt make him ((sad))
prompt: make him ((sad))
Negative prompt: low quality, deformed Steps: 24, Sampler: DPM++ 2M, Schedule type: Karras, CFG scale: 8, Image CFG scale: 1.5, Seed: 3712488430, Size: 512x512, Model hash: ffd280ddcf, Model: instruct-pix2pix-00-22000, Denoising strength: 1, Version: v1.9.3
showing result of embedding using prompt a portrait of a (sad512) man
prompt: a portrait of a (sad512) man
Negative prompt: low quality, deformed Steps: 32, Sampler: DPM++ 2M, Schedule type: Karras, CFG scale: 8, Image CFG scale: 1.35, Seed: 1736296393, Size: 512x512, Model hash: ffd280ddcf, Model: instruct-pix2pix-00-22000, Denoising strength: 1, TI hashes: "sad512: d27225db52e6", Version: v1.9.3
My opinion: The embedding rendered a deeper sadness but introduced some unwanted color shifts. I'd say the first one (pix2pix) is the winner here.

Alternate Approach Using ADetailer and epiCPhotoGasm

A popular checkpoint on civitai, epiCPhotoGasm, offers an alternate method for facial manipulation. After downloading, place it in your models/Stable-diffusion folder. ADetailer, using the face_yolov8n.pt detection model, restricts modifications to the face when img2img is enabled, which is ideal for precise adjustments.

Interestingly, epiCPhotoGasm operates nearly twice as fast as the Pix2Pix model on my setup, showing promising results:

Comparative Outcomes Using ADetailer and epiCPhotoGasm

showing result of pix2pix using prompt make him ((sad))
prompt: make him ((sad))
Negative prompt: low quality, deformed Steps: 24, Sampler: DPM++ 2M, Schedule type: Karras, CFG scale: 8, Image CFG scale: 1.5, Seed: 3712488430, Size: 512x512, Model hash: ffd280ddcf, Model: instruct-pix2pix-00-22000, Denoising strength: 1, Version: v1.9.3
showing result of epiCPhotoGasm with ADetailer using prompt a portrait of a ((sad)) man
prompt: a portrait of a ((sad)) man
Steps: 1, Sampler: Euler, Schedule type: Automatic, CFG scale: 7.5, Seed: 1373581547, Size: 128x128, Model hash: e44c7b30c6, Model: epicphotogasm_ultimateFidelity, Denoising strength: 1, ADetailer model: face_yolov8n.pt, ADetailer confidence: 0.3, ADetailer dilate erode: 4, ADetailer mask blur: 4, ADetailer denoising strength: 0.45, ADetailer inpaint only masked: True, ADetailer inpaint padding: 96, ADetailer use separate CFG scale: True, ADetailer CFG scale: 9.0, ADetailer use separate CLIP skip: True, ADetailer CLIP skip: 2, ADetailer version: 24.4.2, Version: v1.9.3
My opinion: Pix2Pix offers a reliable baseline, while epiCPhotoGasm allows for more nuanced expressions with some trade-offs.
showing result of pix2pix using prompt make him look (shocked)
prompt: make him look (shocked)
Steps: 30, Sampler: DPM++ 2M, Schedule type: Karras, CFG scale: 8.5, Image CFG scale: 1.45, Seed: 3780110914, Size: 512x512, Model hash: ffd280ddcf, Model: instruct-pix2pix-00-22000, Denoising strength: 1, Version: v1.9.3
showing result of epiCPhotoGasm with ADetailer using prompt a portrait of a ((shocked)) man
prompt: a portrait of a ((shocked)) man
Steps: 1, Sampler: Euler, Schedule type: Automatic, CFG scale: 10, Seed: 626819233, Size: 128x128, Model hash: e44c7b30c6, Model: epicphotogasm_ultimateFidelity, Denoising strength: 1, ADetailer model: face_yolov8n.pt, ADetailer confidence: 0.3, ADetailer dilate erode: 4, ADetailer mask blur: 4, ADetailer denoising strength: 0.44, ADetailer inpaint only masked: True, ADetailer inpaint padding: 84, ADetailer use separate CFG scale: True, ADetailer CFG scale: 8.5, ADetailer use separate CLIP skip: True, ADetailer CLIP skip: 2, ADetailer version: 24.4.2, Version: v1.9.3
My opinion: This one was tough for pix2pix, as seems to be the case with strong facial expressions such as shock. epiCPhotoGasm is the clear winner here.

The Inpaint denoising strength is crucial here. The default setting of 0.4 generally works well, but slight adjustments can greatly influence the outcome, sometimes at the expense of character recognizability.

Experiment with the “Inpaint only masked padding, pixels” setting to potentially achieve a broader range of facial expressions. Increasing this setting to about 100 has proven effective in some of my tests.
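
For scripted runs, the ADetailer values reported in the settings lines above can be collected in one place. The sketch below is an assumption-heavy illustration: the ad_* key names reflect how I understand ADetailer's API arguments to be named, and the function name is made up, so verify both against the extension's own documentation before relying on them.

```python
def adetailer_face_settings(padding, denoising_strength=0.4,
                            cfg_scale=None, clip_skip=None):
    """Collect ADetailer settings for a face-only inpainting pass.

    Key names are assumed, not confirmed against the extension.
    """
    if not 0.0 <= denoising_strength <= 1.0:
        raise ValueError("denoising_strength must be between 0 and 1")

    settings = {
        "ad_model": "face_yolov8n.pt",
        "ad_confidence": 0.3,
        "ad_dilate_erode": 4,
        "ad_mask_blur": 4,
        # Small changes here can cost character recognizability
        "ad_denoising_strength": denoising_strength,
        "ad_inpaint_only_masked": True,
        "ad_inpaint_only_masked_padding": padding,
    }
    # Separate CFG scale and CLIP skip are only set when requested
    if cfg_scale is not None:
        settings["ad_use_cfg_scale"] = True
        settings["ad_cfg_scale"] = cfg_scale
    if clip_skip is not None:
        settings["ad_use_clip_skip"] = True
        settings["ad_clip_skip"] = clip_skip
    return settings


# Values taken from the ((sad)) comparison above
sad_settings = adetailer_face_settings(
    padding=96, denoising_strength=0.45, cfg_scale=9.0, clip_skip=2
)
```

Keeping the values in a function makes it easy to vary only the denoising strength or the padding between runs while everything else stays fixed.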

Here are the complete settings I’ve used in ADetailer for most of my comparisons: ADetailer Settings

The epiCPhotoGasm checkpoint also works very well for changing age and ethnicity. Here are some examples:

showing base image before conversion showing epiCPhotoGasm result of prompt a portrait of an old man
prompt: a portrait of an old man
Steps: 1, Sampler: Euler, Schedule type: Automatic, CFG scale: 7.5, Seed: 2440307707, Size: 128x128, Model hash: e44c7b30c6, Model: epicphotogasm_ultimateFidelity, Denoising strength: 1, ADetailer model: person_yolov8n-seg.pt, ADetailer confidence: 0.3, ADetailer dilate erode: 4, ADetailer mask blur: 4, ADetailer denoising strength: 0.35, ADetailer inpaint only masked: True, ADetailer inpaint padding: 64, ADetailer use separate CFG scale: True, ADetailer CFG scale: 8.0, ADetailer use separate CLIP skip: True, ADetailer CLIP skip: 2, ADetailer version: 24.4.2, Version: v1.9.3
My opinion: Very impressive!
showing base image before conversion showing epiCPhotoGasm result of prompt a portrait of a young man
prompt: a portrait of a young man
Steps: 1, Sampler: Euler, Schedule type: Automatic, CFG scale: 7.5, Seed: 1667935321, Size: 128x128, Model hash: e44c7b30c6, Model: epicphotogasm_ultimateFidelity, Denoising strength: 1, ADetailer model: person_yolov8n-seg.pt, ADetailer confidence: 0.3, ADetailer dilate erode: 4, ADetailer mask blur: 4, ADetailer denoising strength: 0.36, ADetailer inpaint only masked: True, ADetailer inpaint padding: 64, ADetailer use separate CFG scale: True, ADetailer CFG scale: 8.5, ADetailer use separate CLIP skip: True, ADetailer CLIP skip: 2, ADetailer version: 24.4.2, Version: v1.9.3
My opinion: Besides minor issues with the right eye, a very good result.
showing base image before conversion showing epiCPhotoGasm result of prompt a portrait of a japanese man
prompt: a portrait of a japanese man
Steps: 1, Sampler: Euler, Schedule type: Automatic, CFG scale: 7.5, Seed: 1776054093, Size: 128x128, Model hash: e44c7b30c6, Model: epicphotogasm_ultimateFidelity, Denoising strength: 1, ADetailer model: person_yolov8n-seg.pt, ADetailer confidence: 0.3, ADetailer dilate erode: 4, ADetailer mask blur: 4, ADetailer denoising strength: 0.36, ADetailer inpaint only masked: True, ADetailer inpaint padding: 64, ADetailer use separate CFG scale: True, ADetailer CFG scale: 8.5, ADetailer use separate CLIP skip: True, ADetailer CLIP skip: 2, ADetailer version: 24.4.2, Version: v1.9.3
My opinion: Interesting for sure. 🇯🇵

Conclusion

While epiCPhotoGasm has outperformed the other methods in speed and ease of use for my specific needs, such as altering facial expressions from neutral to more expressive states, it is not without its flaws. The results, while quick, may not always be reliable enough for applications like YouTube thumbnail generation, where an accurate expression and a flawless result are crucial.

Unfortunately, the current solutions require significant tweaking to meet my needs fully. In my opinion, the currently available methods aren’t quite there yet unless you go the extra mile and put in a disproportionate amount of effort.

Check below for an interesting upcoming project that I’ll be testing out once the code has been released.

Tools Used In This Post

Further Reading and Resources

There are two interesting projects that have just popped up (early May 2024):

Both of these look promising and may be able to more easily alter facial expressions. Stay tuned as I’ll be test driving new methods soon.

Alright, that's a wrap. If you want, subscribe for more interesting AI projects and demos such as this. Enter your email below and get notified when I publish new articles.

    See you next time,
    Max