Part A: The Power of Diffusion Models

Precomputed Text Embeddings

num_inference_steps = 20

img
an oil painting of a snowy mountain village
img
a man wearing a hat
img
a rocket ship

All three images match their prompts, though some details are off; for example, the man is cross-eyed.

img
num_inference_steps = 10
img
num_inference_steps = 20
img
num_inference_steps = 100

I think the middle rocket looks best, but the rightmost one could be considered the most "detailed", with more colors and shading.

I used 180 as my seed.

Forward Process

img
Campanile
img
Noisy Campanile at t=250
img
Noisy Campanile at t=500
img
Noisy Campanile at t=750

Classical Denoising

img
Noisy Campanile at t=250
img
Noisy Campanile at t=500
img
Noisy Campanile at t=750
img
Gaussian Blur Denoising at t=250
img
Gaussian Blur Denoising at t=500
img
Gaussian Blur Denoising at t=750

One-Step Denoising

img
Noisy Campanile at t=250
img
Noisy Campanile at t=500
img
Noisy Campanile at t=750
img
One-Step Denoised Campanile at t=250
img
One-Step Denoised Campanile at t=500
img
One-Step Denoised Campanile at t=750

Iterative Denoising

img
Noisy Campanile at t=90
img
Noisy Campanile at t=240
img
Noisy Campanile at t=390
img
Noisy Campanile at t=540
img
Noisy Campanile at t=690
img
Original
img
Iteratively Denoised
img
One-Step Denoised
img
Gaussian Blurred

Diffusion Model Sampling

img
Sample 1
img
Sample 2
img
Sample 3
img
Sample 4
img
Sample 5

Classifier-Free Guidance

With classifier-free guidance, the samples were a lot better than the unguided samples in the previous part.

img
Sample 1
img
Sample 2
img
Sample 3
img
Sample 4
img
Sample 5

Image-to-Image Translation

With prompt "a high quality photo"

img
i_start=1
img
i_start=3
img
i_start=5
img
i_start=7
img
i_start=10
img
i_start=20
img
Campanile

Trying this with my own images!

img
i_start=1
img
i_start=3
img
i_start=5
img
i_start=7
img
i_start=10
img
i_start=20
img
Bins
img
i_start=1
img
i_start=3
img
i_start=5
img
i_start=7
img
i_start=10
img
i_start=20
img
Guitar

I then tried this with images from the web and hand-drawn images.

img
i_start=1
img
i_start=3
img
i_start=5
img
i_start=7
img
i_start=10
img
i_start=20
img
Teddy Bear
img
i_start=1
img
i_start=3
img
i_start=5
img
i_start=7
img
i_start=10
img
i_start=20
img
Spongebob
img
i_start=1
img
i_start=3
img
i_start=5
img
i_start=7
img
i_start=10
img
i_start=20
img
Flower
img
i_start=1
img
i_start=3
img
i_start=5
img
i_start=7
img
i_start=10
img
i_start=20
img
Duck

For the hand-drawn images, the i_start=20 results came extremely close to the originals! They even look a little better because the model adds proper shading, which can be seen in the duck.

Inpainting

By applying masks to images and only denoising within those masks (keeping the rest of the image as the original), we can create inpainted images.
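A minimal sketch of the masking step described above, under stated assumptions: after each denoising step, everything outside the mask is overwritten with the original image, noised to the current level. The function names, the `alpha_bar_t` parameter, the choice to re-noise the original before blending, and the toy arrays are all hypothetical; NumPy arrays stand in for the model's tensors.

```python
import numpy as np

def add_noise(x0, alpha_bar_t, rng):
    """Forward process: noise a clean image to the level given by alpha_bar_t."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps

def inpaint_blend(x_t, x_orig, mask, alpha_bar_t, rng):
    """Keep x_t where mask == 1; elsewhere force the (re-noised) original."""
    x_orig_t = add_noise(x_orig, alpha_bar_t, rng)
    return mask * x_t + (1.0 - mask) * x_orig_t

rng = np.random.default_rng(180)
x_orig = rng.random((3, 8, 8))         # stand-in for the original image (C, H, W)
x_t = rng.standard_normal((3, 8, 8))   # stand-in for the current denoising state
mask = np.zeros((1, 8, 8))
mask[:, 2:6, 2:6] = 1.0                # region where new content is generated

# alpha_bar_t = 1.0 means no noise is added, so the region outside the mask
# equals the original exactly; at earlier steps it would be a noisy copy.
blended = inpaint_blend(x_t, x_orig, mask, alpha_bar_t=1.0, rng=rng)
```

In a full sampling loop, this blend would run once per step, so only the masked region is ever free to change.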

img
Original
img
Mask
img
Hole to Fill
img
Inpainted Image
img
Original
img
Mask
img
Hole to Fill
img
Inpainted Image
img
Original
img
Mask
img
Hole to Fill
img
Inpainted Image

Text-Conditional Image to Image

This was essentially the same as the earlier image-to-image translation, but with new prompts instead of "a high quality photo".

"a rocket ship"

img
i_start=1
img
i_start=3
img
i_start=5
img
i_start=7
img
i_start=10
img
i_start=20
img
Campanile
img
i_start=1
img
i_start=3
img
i_start=5
img
i_start=7
img
i_start=10
img
i_start=20
img
Spongebob

"a photo of a dog"

img
i_start=1
img
i_start=3
img
i_start=5
img
i_start=7
img
i_start=10
img
i_start=20
img
Spongebob

Visual Anagrams

We can create visual anagrams by iteratively denoising both an image and its flipped version, each with different prompts. We average the two noise estimates at each step.
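The averaging step can be sketched roughly as follows. This assumes the flip is vertical (upside down) and that the estimate for the flipped image is flipped back before averaging, so both estimates line up with the upright image. The two `toy_eps_*` functions are hypothetical stand-ins for the model conditioned on each prompt, not real model calls.

```python
import numpy as np

def flip_vertical(x):
    """Flip a (C, H, W) array upside down."""
    return np.flip(x, axis=-2)

def anagram_noise(x_t, eps_upright, eps_flipped):
    """Average the upright estimate with the flipped-back estimate of the flipped image."""
    eps_1 = eps_upright(x_t)
    eps_2 = flip_vertical(eps_flipped(flip_vertical(x_t)))
    return 0.5 * (eps_1 + eps_2)

# Hypothetical stand-ins for the noise predictor under each prompt.
def toy_eps_upright(x):
    return 0.1 * x

def toy_eps_flipped(x):
    # Depends on the row index, so the result changes if a flip is skipped.
    rows = np.arange(x.shape[-2], dtype=float).reshape(1, -1, 1)
    return x + rows

rng = np.random.default_rng(180)
x_t = rng.standard_normal((3, 4, 4))
eps = anagram_noise(x_t, toy_eps_upright, toy_eps_flipped)
```

The combined estimate then takes the place of the single noise estimate in the usual denoising step.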

Prompts: "an oil painting of people around a campfire", "an oil painting of an old man"

img
img

Prompts: "an oil painting of a snowy mountain village", "a photo of the amalfi cost"

img
img

Prompts: "a guitar", "a wine bottle"

img
img

Hybrid Images

This used a technique similar to the previous part, but combined the low frequencies of one noise estimate with the high frequencies of another instead of using an image and its flipped version. We can see the low-frequency image when the image is far away (or smaller), and the high-frequency image when the image is up close (or bigger).
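One way to sketch the frequency split is shown below, assuming a Gaussian blur as the low-pass filter and "input minus blur" as the high-pass filter. The kernel size, sigma, function names, and zero padding at the borders are illustrative assumptions, not the settings used for the images here.

```python
import numpy as np

def gaussian_kernel(size, sigma):
    """Normalized 1-D Gaussian kernel."""
    ax = np.arange(size) - (size - 1) / 2.0
    k = np.exp(-0.5 * (ax / sigma) ** 2)
    return k / k.sum()

def low_pass(x, size=5, sigma=2.0):
    """Separable Gaussian blur over the last two axes (zero padding at the borders)."""
    k = gaussian_kernel(size, sigma)

    def blur(v):
        return np.convolve(v, k, mode="same")

    x = np.apply_along_axis(blur, -1, x)
    return np.apply_along_axis(blur, -2, x)

def hybrid_noise(eps_low_prompt, eps_high_prompt):
    """Low frequencies from one estimate plus high frequencies from the other."""
    high = eps_high_prompt - low_pass(eps_high_prompt)
    return low_pass(eps_low_prompt) + high

rng = np.random.default_rng(180)
eps_a = rng.standard_normal((3, 16, 16))  # stand-in: estimate for the far-away prompt
eps_b = rng.standard_normal((3, 16, 16))  # stand-in: estimate for the up-close prompt
eps = hybrid_noise(eps_a, eps_b)
```

A useful sanity check on this construction: passing the same estimate for both prompts returns it unchanged, since the low-pass and high-pass parts sum back to the input.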

Prompts: "a lithograph of a skull", "a lithograph of waterfalls"

img1
img1

Prompts: "a photo of a dog", "an oil painting of an old man"

img1
img1

I thought this one was particularly interesting because the model interpreted "oil painting" as a picture of the painting itself, rather than rendering the whole picture as an oil painting. The legs of the easel enable the dog to have legs!

Prompts: "a lithograph of a skull", "an oil painting of a snowy mountain village"

img1
img1
Click here to go to part B