CS180-Proj5-Diffusion

Part A: Using a diffusion model

Theory of diffusion

Variational Auto Encoder (VAE)

For many modalities, we can think of the observed data as being generated from an associated unseen latent variable, which we denote by the random variable \(z\). Mathematically, let \(p(x)\) denote the probability that the model generates the real image \(x\). Since the decoder generates an image from \(z\), \(p(x)\) is the marginal of the joint distribution \(p(x, z)\), i.e. \(p(x)=\int p(\tilde{x}=x\mid z)\,p(z)\,dz\). Integrating over all possible \(z\) is intractable, so an alternative is to use the identity \(p(x) = \frac{p(x,z)}{p(z\mid x)}\). We still cannot write down \(p(z\mid x)\) in closed form, so we approximate it with a deep neural network: the encoder, denoted \(q_{\phi}(z\mid x)\), where \(\phi\) are the encoder's parameters.

To measure how close these two distributions are, we use the KL divergence:

\[ D_{KL}(q_{\phi}(z|x)||p(z|x)) \]

Unfortunately, we cannot access the true posterior \(p(z\mid x)\), but we have

\[ \log p(\boldsymbol{x}) = \mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{z}\mid\boldsymbol{x})}\left[\log\frac{p(\boldsymbol{x}, \boldsymbol{z})}{q_{\boldsymbol{\phi}}(\boldsymbol{z}\mid\boldsymbol{x})}\right] + \mathcal{D}_{\text{KL}}(q_{\boldsymbol{\phi}}(\boldsymbol{z}\mid\boldsymbol{x}) \mid\mid p(\boldsymbol{z}\mid\boldsymbol{x})) \]

where \(\mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{z}\mid\boldsymbol{x})}\left[\log\frac{p(\boldsymbol{x}, \boldsymbol{z})}{q_{\boldsymbol{\phi}}(\boldsymbol{z}\mid\boldsymbol{x})}\right]\) is known as the ELBO (Evidence Lower Bound). Notice that \(p(x)\) depends on the decoder's parameters \(\theta\) but not on \(\phi\), so with \(\theta\) fixed we have:

\[ \log p(\boldsymbol{x}) = \text{const} = \text{ELBO}(\phi) + \mathcal{D}_{\text{KL}}(q_{\boldsymbol{\phi}}(\boldsymbol{z}\mid\boldsymbol{x}) \mid\mid p(\boldsymbol{z}\mid\boldsymbol{x})) \]

So minimizing the KL divergence is equivalent to maximizing the ELBO.
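
Expanding the ELBO makes the training objective concrete: it splits into a reconstruction term (handled by the decoder \(p_{\theta}(x\mid z)\)) and a prior-matching term (pulling the encoder toward the prior \(p(z)\)):

\[ \text{ELBO}(\phi, \theta) = \mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{z}\mid\boldsymbol{x})}\left[\log p_{\boldsymbol{\theta}}(\boldsymbol{x}\mid\boldsymbol{z})\right] - \mathcal{D}_{\text{KL}}(q_{\boldsymbol{\phi}}(\boldsymbol{z}\mid\boldsymbol{x}) \mid\mid p(\boldsymbol{z})) \]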

Variational Diffusion Models

A Variational Diffusion Model can be viewed as a Markovian hierarchical VAE with \(T\) latents \(\boldsymbol{z}_{1:T}\). Its joint distribution factorizes as:

\[ p(\boldsymbol{x}, \boldsymbol{z}_{1:T}) = p(\boldsymbol{z}_T)p_{\boldsymbol{\theta}}(\boldsymbol{x}\mid\boldsymbol{z}_1)\prod_{t=2}^{T}p_{\boldsymbol{\theta}}(\boldsymbol{z}_{t-1}\mid\boldsymbol{z}_{t}) \]

and its posterior is:

\[ q_{\boldsymbol{\phi}}(\boldsymbol{z}_{1:T}\mid\boldsymbol{x}) = q_{\boldsymbol{\phi}}(\boldsymbol{z}_1\mid\boldsymbol{x})\prod_{t=2}^{T}q_{\boldsymbol{\phi}}(\boldsymbol{z}_{t}\mid\boldsymbol{z}_{t-1}) \]

There are 3 key restrictions:

  1. The latent dimension is exactly equal to the data dimension
  2. The structure of the latent encoder at each timestep is not learned; it is pre-defined as a linear Gaussian model. In other words, it is a Gaussian distribution centered around the output of the previous timestep
  3. The Gaussian parameters of the latent encoders vary over time in such a way that the distribution of the latent at final timestep \(T\) is a standard Gaussian

For example, take a real image \(x\). We assume that by repeatedly adding Gaussian noise it eventually becomes pure Gaussian noise of the same size at step \(T\). Following the usual convention, we write \(x_t\) for the latent \(z_t\) and set \(x_0 = x\).

This means every transition \(q_{\phi}(x_t\mid x_{t-1})\) is a Gaussian, which we define as:

\[ q_{\phi}(x_t\mid x_{t-1}) = \mathcal{N}\left(x_t \mid \sqrt{\alpha_t}\, x_{t-1},\ (1-\alpha_t)\mathbf{I}\right) \]

where \(\alpha_t\) is a hyperparameter.
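
Chaining these single-step Gaussians gives a closed-form marginal, with \(\bar{\alpha}_t = \prod_{s=1}^{t}\alpha_s\); this is exactly the formula the forward-process code in Part 1.1 implements:

\[ q(x_t \mid x_0) = \mathcal{N}\left(x_t \mid \sqrt{\bar{\alpha}_t}\, x_0,\ (1-\bar{\alpha}_t)\mathbf{I}\right), \qquad x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon, \quad \epsilon \sim \mathcal{N}(\boldsymbol{0}, \mathbf{I}) \]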

Deliverables (Part A)

Seed: 42

Part 0: Setup

  • My prompts:

    • a photo of the Great Wall
    • a photo of beautiful woman
    • a photo of a car
    • a photo of a cat
    • an oil painting of people around a campfire
    • an oil painting of an old lady
    • a portrait of Steve Jobs
    • a football
    • a movie star
    • a rainbow flag
    • a soft bed
  • Selected prompts:

    • a photo of the Great Wall
    • an oil painting of people around a campfire
    • a portrait of Steve Jobs
  • Results:

    • num_inference_steps = 20

    • num_inference_steps = 200

Part 1: Sampling Loops

Part 1.1: Implementing the Forward Process

  • Implement the noisy_im = forward(im, t) function:
def forward(im, t):
    """
    Args:
        im : torch tensor of size (1, 3, 64, 64) representing the clean image
        t : integer timestep

    Returns:
        im_noisy : torch tensor of size (1, 3, 64, 64) representing the noisy image at timestep t
    """
    with torch.no_grad():
        # alphas_cumprod is a torch tensor of size (T,) defined in the main script
        alpha_t = alphas_cumprod[t].to(im.device)

        epsilon = torch.randn_like(im)

        # x_t = sqrt(alpha_t) * x_0 + sqrt(1 - alpha_t) * epsilon
        im_noisy = torch.sqrt(alpha_t) * im + torch.sqrt(1 - alpha_t) * epsilon
    return im_noisy
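
A minimal usage sketch (an illustration, not the exact notebook code), assuming `im` is the test image already resized to 64×64 and scaled to the range the model expects:

# Hypothetical usage: visualize the forward process at a few timesteps.
for t in [250, 500, 750]:
    noisy_im = forward(im, t)
    print(t, noisy_im.shape)   # torch.Size([1, 3, 64, 64])
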
  • Results:
    • Original image:
    • Noisy image at t=250, 500 and 750:

Part 1.2: Classical Denoising
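
A minimal sketch of Gaussian blur filtering, the classical baseline used here on the noisy images from Part 1.1 (the kernel size and sigma below are assumed values that would be tuned per noise level):

import torchvision.transforms.functional as TF

def gaussian_denoise(noisy_im, kernel_size=5, sigma=2.0):
    # Classical baseline: low-pass filter the noisy image with a Gaussian blur.
    return TF.gaussian_blur(noisy_im, kernel_size=kernel_size, sigma=sigma)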

  • Results:

Part 1.3: One-Step Denoising

We use the prompt 'a high quality photo' and the pretrained diffusion model to denoise the noisy images from timesteps t = 250, 500, and 750 in a single step.
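
A minimal one-step denoising sketch, assuming `stage_1`, `alphas_cumprod`, and `prompt_embeds` (for 'a high quality photo') are set up as in the rest of Part A; it estimates the noise with the UNet and inverts the forward-process equation for \(x_0\):

def one_step_denoise(noisy_im, t, prompt_embeds):
    with torch.no_grad():
        # The DeepFloyd UNet returns noise and variance stacked along channels.
        model_output = stage_1.unet(
            noisy_im, t,
            encoder_hidden_states=prompt_embeds,
            return_dict=False
        )[0]
        noise_est, _ = torch.split(model_output, noisy_im.shape[1], dim=1)

        a_bar = alphas_cumprod[t].to(noisy_im.device)
        # Invert x_t = sqrt(a_bar) * x_0 + sqrt(1 - a_bar) * eps for x_0.
        x0_est = (noisy_im - torch.sqrt(1 - a_bar) * noise_est) / torch.sqrt(a_bar)
    return x0_est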

  • Results:

Part 1.4: Iterative Denoising

  • Results with i_start = 10 and stride = 30:
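
One plausible construction of the strided schedule (an assumption; the exact schedule is not shown in this write-up): start at t = 990 and step down by the stride to 0, so i_start = 10 corresponds to starting the denoising at t = 690.

# Hypothetical strided timestep schedule with stride 30.
timesteps = list(range(990, -1, -30))   # [990, 960, ..., 30, 0]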

Part 1.5: Diffusion Model Sampling

This part simply generates samples from pure noise with the prompt "a high quality photo", using the iterative denoising loop.
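
A minimal sampling sketch, assuming the iterative denoising loop from Part 1.4 is available as iterative_denoise (not shown above), `timesteps` is the strided schedule, and `prompt_embeds` encodes "a high quality photo":

# Hypothetical sampling call: start from pure Gaussian noise and denoise
# from the very beginning of the schedule (i_start = 0).
noise = torch.randn(1, 3, 64, 64).to("cuda")   # device assumed
sample = iterative_denoise(noise, 0, prompt_embeds, timesteps)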

Part 1.6: Classifier-Free Guidance (CFG)

  • Implement the iterative_denoise_cfg function:
def iterative_denoise_cfg(im_noisy, i_start, prompt_embeds, uncond_prompt_embeds, timesteps, scale=7, display=True):
    image = im_noisy
    progress_imgs = {}

    with torch.no_grad():
        for i in range(i_start, len(timesteps) - 1):
            t = timesteps[i]
            prev_t = timesteps[i + 1]

            # Get `alpha_cumprod`, `alpha_cumprod_prev`, `alpha`, `beta`
            alpha_cumprod_t = alphas_cumprod[t].to(image.device)
            alpha_cumprod_prev = alphas_cumprod[prev_t].to(image.device)

            alpha_t = alpha_cumprod_t / alpha_cumprod_prev
            beta_t = 1 - alpha_t

            # Get cond noise estimate
            model_output = stage_1.unet(
                image,
                t,
                encoder_hidden_states=prompt_embeds,
                return_dict=False
            )[0]

            # Get uncond noise estimate
            uncond_model_output = stage_1.unet(
                image,
                t,
                encoder_hidden_states=uncond_prompt_embeds,
                return_dict=False
            )[0]

            # Split estimate into noise and variance estimate
            noise_est, predicted_variance = torch.split(model_output, image.shape[1], dim=1)
            uncond_noise_est, _ = torch.split(uncond_model_output, image.shape[1], dim=1)

            # Compute the CFG noise estimate based on equation 4
            cfg_noise_est = uncond_noise_est + scale * (noise_est - uncond_noise_est)

            # Get `pred_prev_image`, the next less noisy image.
            pred_x0 = (image - torch.sqrt(1 - alpha_cumprod_t) * cfg_noise_est) / torch.sqrt(alpha_cumprod_t)

            coeff_x0 = (torch.sqrt(alpha_cumprod_prev) * beta_t) / (1 - alpha_cumprod_t)
            coeff_xt = (torch.sqrt(alpha_t) * (1 - alpha_cumprod_prev)) / (1 - alpha_cumprod_t)

            pred_prev_image_mean = coeff_x0 * pred_x0 + coeff_xt * image

            if prev_t > 0:
                pred_prev_image = add_variance(predicted_variance, t, pred_prev_image_mean)
            else:
                pred_prev_image = pred_prev_image_mean

            image = pred_prev_image

            if display and i % 5 == 0:
                # Just store the first image in the batch for progress display
                progress_imgs[f'Step {i}'] = image[0].permute(1, 2, 0).cpu().detach() / 2.0 + 0.5

    if display:
        print("Denoising Progress (CFG):")
        media.show_images(progress_imgs)

    clean = image.cpu().detach().permute(0, 2, 3, 1).numpy() / 2.0 + 0.5

    return clean
  • Results:

Part 1.7.1: Image-to-Image Translation

  1. For the web image:

    • Original image:
    • SDEdit results for noise levels [1, 3, 5, 7, 10, 20]:
  2. For hand-drawn image 1:

    • Original image:
    • SDEdit results for noise levels [1, 3, 5, 7, 10, 20]:
  3. For hand-drawn image 2:

    • Original image:
    • SDEdit results for noise levels [1, 3, 5, 7, 10, 20]:
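
A sketch of the SDEdit procedure behind these results (an illustration under assumptions): the noise levels index into the strided `timesteps` schedule, the image is noised to that timestep with forward(), and then denoised back with iterative_denoise_cfg from Part 1.6.

# Hypothetical SDEdit loop over the noise levels used above.
for i_start in [1, 3, 5, 7, 10, 20]:
    noisy = forward(im, timesteps[i_start])
    edited = iterative_denoise_cfg(noisy, i_start, prompt_embeds,
                                   uncond_prompt_embeds, timesteps, display=False)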

Part 1.7.2 & 1.7.3: Inpainting

  • Original image and mask:

  • Replace the region with "a rocket ship" prompt:

  • Original image and mask:

  • Replace the region with "an oil painting of an old man" prompt:

  • Original image and mask:

  • Replace the region with "a photo of a dog" prompt:
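
One common way to implement the inpainting constraint (a sketch of an assumed helper, not necessarily the exact code used here): after every denoising step, the pixels outside the mask are forced back to the original image, re-noised to the current timestep.

def inpaint_step(pred_prev_image, original_im, mask, prev_t):
    """Hypothetical helper, called on pred_prev_image inside the denoising loop.

    mask == 1 marks the region to regenerate; everywhere else the original
    image (noised to timestep prev_t with forward()) is pasted back, so only
    the masked region changes.
    """
    kept = forward(original_im, prev_t) if prev_t > 0 else original_im
    return mask * pred_prev_image + (1 - mask) * kept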

Part 1.8: Visual Anagrams
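
The visual-anagram trick, sketched below (names such as `embeds_1` / `embeds_2` are assumptions): denoise the upright image with the first prompt, denoise the flipped image with the second prompt, flip that estimate back, and average the two noise estimates.

def anagram_noise_estimate(image, t, embeds_1, embeds_2):
    with torch.no_grad():
        out_1 = stage_1.unet(image, t, encoder_hidden_states=embeds_1,
                             return_dict=False)[0]
        eps_1, _ = torch.split(out_1, image.shape[1], dim=1)

        # Flip upside down along the height axis (a 180-degree rotation also works).
        flipped = torch.flip(image, dims=[2])
        out_2 = stage_1.unet(flipped, t, encoder_hidden_states=embeds_2,
                             return_dict=False)[0]
        eps_2, _ = torch.split(out_2, image.shape[1], dim=1)
        eps_2 = torch.flip(eps_2, dims=[2])   # flip the estimate back

    return 0.5 * (eps_1 + eps_2)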

  • Results:

Part 1.9: Hybrid Images

The skull lithograph is well suited to the low-frequency component, since the skull shape is easy to recognize even with little detail.
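
A sketch of the hybrid-image noise estimate (names and the blur parameters are assumptions): low frequencies of the estimate for one prompt plus high frequencies of the estimate for the other.

import torchvision.transforms.functional as TF

def hybrid_noise_estimate(image, t, embeds_low, embeds_high, kernel_size=33, sigma=2.0):
    with torch.no_grad():
        eps_low, _ = torch.split(
            stage_1.unet(image, t, encoder_hidden_states=embeds_low,
                         return_dict=False)[0], image.shape[1], dim=1)
        eps_high, _ = torch.split(
            stage_1.unet(image, t, encoder_hidden_states=embeds_high,
                         return_dict=False)[0], image.shape[1], dim=1)

    low = TF.gaussian_blur(eps_low, kernel_size=kernel_size, sigma=sigma)
    high = eps_high - TF.gaussian_blur(eps_high, kernel_size=kernel_size, sigma=sigma)
    return low + high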

  • Results:

Deliverables (Part B)

Part 1.1 & 1.2 Using the UNet to Train a Denoiser

A visualization of the noising process using \(\sigma=[0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0]\)
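
The noising process in Part B is simple additive Gaussian noise, \(z = x + \sigma\epsilon\) with \(\epsilon \sim \mathcal{N}(0, I)\); a minimal sketch:

import torch

def add_noise(x, sigma):
    # z = x + sigma * eps, eps ~ N(0, I); x is a batch of MNIST digits (assumed in [0, 1]).
    return x + sigma * torch.randn_like(x)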

Part 1.2.1 Training (Unconditional, \(\sigma=0.5\))

After 1 epoch:

After 5 epochs:

Part 1.2.2 Out-of-Distribution Testing

Test on \(\sigma = [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0]\):

Part 1.2.3 Denoising Pure Noise

This time we start from pure noise and use the trained denoiser to iteratively denoise it.

The final denoised image looks like the average of all digits: the model was trained to denoise noisy versions of many different digits, so when it starts from pure noise it cannot tell which digit to generate and instead produces an averaged one.

Part 2.1 & 2.2: Adding Time Conditioning to UNet & Training (Time-Conditioned)
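
A sketch of one time-conditioned training step (the names, the normalization of t, and T = 300 are assumptions): sample a timestep, noise the clean image with the DDPM forward process, and regress the UNet's output onto the injected noise.

import torch
import torch.nn.functional as F

def train_step(unet, x0, alphas_cumprod, optimizer, T=300):
    # Sample a random timestep per image and build x_t from the clean batch x0.
    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(x0)
    x_t = torch.sqrt(a_bar) * x0 + torch.sqrt(1 - a_bar) * eps

    # The time-conditioned UNet predicts the noise; t is normalized to [0, 1].
    loss = F.mse_loss(unet(x_t, t.float() / T), eps)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()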

Loss curve during training:

Part 2.3: Sampling from the UNet
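
A sketch of the time-conditioned sampling loop (an assumed implementation consistent with the training step above; `alphas`, `betas`, and `alphas_cumprod` are the usual DDPM schedule tensors):

import torch

@torch.no_grad()
def sample_time_conditioned(unet, alphas, betas, alphas_cumprod, T=300,
                            shape=(16, 1, 28, 28), device="cuda"):
    x = torch.randn(shape, device=device)
    for t in range(T - 1, -1, -1):
        t_norm = torch.full((shape[0],), t / T, device=device)
        eps_hat = unet(x, t_norm)
        # DDPM posterior mean, then add noise except at the final step.
        mean = (x - (1 - alphas[t]) / torch.sqrt(1 - alphas_cumprod[t]) * eps_hat) \
               / torch.sqrt(alphas[t])
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * z
    return x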

Part 2.4 & 2.5: Adding Class-Conditioning to UNet & Training (Class-Conditioned)
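
Class conditioning can be sketched as a one-hot vector fed to the UNet, with the condition randomly dropped (set to the zero vector) about 10% of the time during training so that classifier-free guidance is possible at sampling time; the helper below is a hypothetical illustration.

import torch
import torch.nn.functional as F

def make_class_condition(labels, num_classes=10, p_uncond=0.1):
    # One-hot encode the digit labels; with probability p_uncond the vector
    # is zeroed out, which acts as the "null" (unconditional) class.
    c = F.one_hot(labels, num_classes).float()
    keep = (torch.rand(labels.shape[0], device=labels.device) > p_uncond).float()
    return c * keep.unsqueeze(1)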

Part 2.6: Class-Conditioned Sampling from the UNet

  • epoch 1 animation:

  • epoch 5 animation:

  • epoch 10 animation:

If we remove the scheduler and just use a constant learning rate, the loss curve looks like this:

To get a similar result, I use a smaller learning rate of 3e-3 and train for 15 epochs:

  • epoch 1 animation:

  • epoch 5 animation:

  • epoch 10 animation:


CS180-Proj5-Diffusion
https://lceoliu.github.io/2025/11/24/CS180-Proj5-Diffusion/
Author: Chang Leo
Published: November 24, 2025