CS180-Proj5-Diffusion

Part A: Using a diffusion model

Theory of diffusion

Variational Auto Encoder (VAE)

For many modalities, we can think of the observed data as being generated from an associated unseen latent variable, which we denote by the random variable \(z\). Mathematically, let \(p(x)\) denote the probability that the model generates the real image \(x\). Since the decoder generates an image from \(z\), \(p(x)\) is the marginal of the joint distribution \(p(x, z)\), i.e. \(p(x)=\int p(\tilde{x}=x\mid z)\,p(z)\,dz\). Integrating over all possible \(z\) is intractable, so an alternative is to use the identity \(p(x) = \frac{p(x,z)}{p(z\mid x)}\). We still cannot write down \(p(z\mid x)\) in closed form, so we approximate it with a deep neural network: the encoder, denoted \(q_{\phi}(z\mid x)\), where \(\phi\) are the encoder's parameters.

To measure how close these two distributions are, we use the KL divergence:

\[ D_{KL}(q_{\phi}(z|x)||p(z|x)) \]

Unfortunately, we cannot access the true posterior \(p(z\mid x)\), but we have

\[ \log p(\boldsymbol{x}) = \mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{z}\mid\boldsymbol{x})}\left[\log\frac{p(\boldsymbol{x}, \boldsymbol{z})}{q_{\boldsymbol{\phi}}(\boldsymbol{z}\mid\boldsymbol{x})}\right] + \mathcal{D}_{\text{KL}}(q_{\boldsymbol{\phi}}(\boldsymbol{z}\mid\boldsymbol{x}) \mid\mid p(\boldsymbol{z}\mid\boldsymbol{x})) \]

where \(\mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{z}\mid\boldsymbol{x})}\left[\log\frac{p(\boldsymbol{x}, \boldsymbol{z})}{q_{\boldsymbol{\phi}}(\boldsymbol{z}\mid\boldsymbol{x})}\right]\) is known as the ELBO (Evidence Lower Bound). Notice that \(p(x)\) depends on the decoder's parameters \(\theta\) but not on \(\phi\), so with \(\theta\) fixed we have:

\[ \log p(\boldsymbol{x}) = \text{const} = \text{ELBO}(\phi) + \mathcal{D}_{\text{KL}}(q_{\boldsymbol{\phi}}(\boldsymbol{z}\mid\boldsymbol{x}) \mid\mid p(\boldsymbol{z}\mid\boldsymbol{x})) \]

So minimizing the KL divergence is equivalent to maximizing the ELBO.
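
Expanding the ELBO makes the training objective concrete: it splits into a reconstruction term (handled by the decoder \(p_{\theta}(x\mid z)\)) and a prior-matching term (pulling the encoder toward the prior \(p(z)\)):

\[ \text{ELBO}(\phi, \theta) = \mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{z}\mid\boldsymbol{x})}\left[\log p_{\boldsymbol{\theta}}(\boldsymbol{x}\mid\boldsymbol{z})\right] - \mathcal{D}_{\text{KL}}(q_{\boldsymbol{\phi}}(\boldsymbol{z}\mid\boldsymbol{x}) \mid\mid p(\boldsymbol{z})) \]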

Variational Diffusion Models

A Variational Diffusion Model can be viewed as a Markovian hierarchical VAE with \(T\) latents \(\boldsymbol{z}_{1:T}\). Its joint distribution factorizes as:

\[ p(\boldsymbol{x}, \boldsymbol{z}_{1:T}) = p(\boldsymbol{z}_T)p_{\boldsymbol{\theta}}(\boldsymbol{x}\mid\boldsymbol{z}_1)\prod_{t=2}^{T}p_{\boldsymbol{\theta}}(\boldsymbol{z}_{t-1}\mid\boldsymbol{z}_{t}) \]

and its posterior is:

\[ q_{\boldsymbol{\phi}}(\boldsymbol{z}_{1:T}\mid\boldsymbol{x}) = q_{\boldsymbol{\phi}}(\boldsymbol{z}_1\mid\boldsymbol{x})\prod_{t=2}^{T}q_{\boldsymbol{\phi}}(\boldsymbol{z}_{t}\mid\boldsymbol{z}_{t-1}) \]

There are 3 key restrictions:

  1. The latent dimension is exactly equal to the data dimension
  2. The structure of the latent encoder at each timestep is not learned; it is pre-defined as a linear Gaussian model. In other words, it is a Gaussian distribution centered around the output of the previous timestep
  3. The Gaussian parameters of the latent encoders vary over time in such a way that the distribution of the latent at final timestep \(T\) is a standard Gaussian

For example, take a real image \(x\). We assume that by repeatedly adding Gaussian noise it eventually becomes pure Gaussian noise of the same size at step \(T\). Following the usual convention, we write \(x_t\) for the latent \(z_t\) and set \(x_0 = x\).

This means every transition \(q_{\phi}(x_t\mid x_{t-1})\) is a Gaussian, which we define as:

\[ q_{\phi}(x_t\mid x_{t-1}) = \mathcal{N}\left(x_t \mid \sqrt{\alpha_t}\, x_{t-1},\ (1-\alpha_t)\mathbf{I}\right) \]

where \(\alpha_t\) is a hyperparameter.
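
Chaining these single-step Gaussians gives a closed-form marginal, with \(\bar{\alpha}_t = \prod_{s=1}^{t}\alpha_s\); this is exactly the formula the forward-process code in Part 1.1 implements:

\[ q(x_t \mid x_0) = \mathcal{N}\left(x_t \mid \sqrt{\bar{\alpha}_t}\, x_0,\ (1-\bar{\alpha}_t)\mathbf{I}\right), \qquad x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon, \quad \epsilon \sim \mathcal{N}(\boldsymbol{0}, \mathbf{I}) \]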

Deliverables (Part A)

Seed: 42

Part 0: Setup

  • My prompts:

    • a photo of the Great Wall
    • a photo of beautiful woman
    • a photo of a car
    • a photo of a cat
    • an oil painting of people around a campfire
    • an oil painting of an old lady
    • a portrait of Steve Jobs
    • a football
    • a movie star
    • a rainbow flag
    • a soft bed
  • Selected prompts:

    • a photo of the Great Wall
    • an oil painting of people around a campfire
    • a portrait of Steve Jobs
  • Results:

    • num_inference_steps = 20

    • num_inference_steps = 200

Part 1: Sampling Loops

Part 1.1: Implementing the Forward Process

  • Implement the noisy_im = forward(im, t) function:
def forward(im, t):
    """
    Args:
        im : torch tensor of size (1, 3, 64, 64) representing the clean image
        t : integer timestep

    Returns:
        im_noisy : torch tensor of size (1, 3, 64, 64) representing the noisy image at timestep t
    """
    with torch.no_grad():
        # alphas_cumprod is a torch tensor of size (T,) defined in the main script
        alpha_t = alphas_cumprod[t].to(im.device)

        epsilon = torch.randn_like(im)

        # x_t = sqrt(alpha_t) * x_0 + sqrt(1 - alpha_t) * epsilon
        im_noisy = torch.sqrt(alpha_t) * im + torch.sqrt(1 - alpha_t) * epsilon
    return im_noisy
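
A minimal usage sketch (an illustration, not the exact notebook code), assuming `im` is the test image already resized to 64×64 and scaled to the range the model expects:

# Hypothetical usage: visualize the forward process at a few timesteps.
for t in [250, 500, 750]:
    noisy_im = forward(im, t)
    print(t, noisy_im.shape)   # torch.Size([1, 3, 64, 64])
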
  • Results:
    • Original image:
    • Noisy image at t=250, 500 and 750:

Part 1.2: Classical Denoising
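
A minimal sketch of Gaussian blur filtering, the classical baseline used here on the noisy images from Part 1.1 (the kernel size and sigma below are assumed values that would be tuned per noise level):

import torchvision.transforms.functional as TF

def gaussian_denoise(noisy_im, kernel_size=5, sigma=2.0):
    # Classical baseline: low-pass filter the noisy image with a Gaussian blur.
    return TF.gaussian_blur(noisy_im, kernel_size=kernel_size, sigma=sigma)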

  • Results:

Part 1.3: One-Step Denoising

We use the prompt 'a high quality photo' and the pretrained diffusion model to denoise the noisy images from timesteps t = 250, 500, and 750 in a single step.
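
A minimal one-step denoising sketch, assuming `stage_1`, `alphas_cumprod`, and `prompt_embeds` (for 'a high quality photo') are set up as in the rest of Part A; it estimates the noise with the UNet and inverts the forward-process equation for \(x_0\):

def one_step_denoise(noisy_im, t, prompt_embeds):
    with torch.no_grad():
        # The DeepFloyd UNet returns noise and variance stacked along channels.
        model_output = stage_1.unet(
            noisy_im, t,
            encoder_hidden_states=prompt_embeds,
            return_dict=False
        )[0]
        noise_est, _ = torch.split(model_output, noisy_im.shape[1], dim=1)

        a_bar = alphas_cumprod[t].to(noisy_im.device)
        # Invert x_t = sqrt(a_bar) * x_0 + sqrt(1 - a_bar) * eps for x_0.
        x0_est = (noisy_im - torch.sqrt(1 - a_bar) * noise_est) / torch.sqrt(a_bar)
    return x0_est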

  • Results:

Part 1.4: Iterative Denoising

  • Results with i_start = 10 and stride = 30:
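
One plausible construction of the strided schedule (an assumption; the exact schedule is not shown in this write-up): start at t = 990 and step down by the stride to 0, so i_start = 10 corresponds to starting the denoising at t = 690.

# Hypothetical strided timestep schedule with stride 30.
timesteps = list(range(990, -1, -30))   # [990, 960, ..., 30, 0]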

Part 1.5: Diffusion Model Sampling

This part simply generates samples from pure noise with the prompt "a high quality photo", using the iterative denoising loop.
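
A minimal sampling sketch, assuming the iterative denoising loop from Part 1.4 is available as iterative_denoise (not shown above), `timesteps` is the strided schedule, and `prompt_embeds` encodes "a high quality photo":

# Hypothetical sampling call: start from pure Gaussian noise and denoise
# from the very beginning of the schedule (i_start = 0).
noise = torch.randn(1, 3, 64, 64).to("cuda")   # device assumed
sample = iterative_denoise(noise, 0, prompt_embeds, timesteps)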

Part 1.6: Classifier-Free Guidance (CFG)

  • Implement the iterative_denoise_cfg function:
def iterative_denoise_cfg(im_noisy, i_start, prompt_embeds, uncond_prompt_embeds, timesteps, scale=7, display=True):
    image = im_noisy
    progress_imgs = {}

    with torch.no_grad():
        for i in range(i_start, len(timesteps) - 1):
            t = timesteps[i]
            prev_t = timesteps[i + 1]

            # Get `alpha_cumprod`, `alpha_cumprod_prev`, `alpha`, `beta`
            alpha_cumprod_t = alphas_cumprod[t].to(image.device)
            alpha_cumprod_prev = alphas_cumprod[prev_t].to(image.device)

            alpha_t = alpha_cumprod_t / alpha_cumprod_prev
            beta_t = 1 - alpha_t

            # Get cond noise estimate
            model_output = stage_1.unet(
                image,
                t,
                encoder_hidden_states=prompt_embeds,
                return_dict=False
            )[0]

            # Get uncond noise estimate
            uncond_model_output = stage_1.unet(
                image,
                t,
                encoder_hidden_states=uncond_prompt_embeds,
                return_dict=False
            )[0]

            # Split estimate into noise and variance estimate
            noise_est, predicted_variance = torch.split(model_output, image.shape[1], dim=1)
            uncond_noise_est, _ = torch.split(uncond_model_output, image.shape[1], dim=1)

            # Compute the CFG noise estimate based on equation 4
            cfg_noise_est = uncond_noise_est + scale * (noise_est - uncond_noise_est)

            # Get `pred_prev_image`, the next less noisy image.
            pred_x0 = (image - torch.sqrt(1 - alpha_cumprod_t) * cfg_noise_est) / torch.sqrt(alpha_cumprod_t)

            coeff_x0 = (torch.sqrt(alpha_cumprod_prev) * beta_t) / (1 - alpha_cumprod_t)
            coeff_xt = (torch.sqrt(alpha_t) * (1 - alpha_cumprod_prev)) / (1 - alpha_cumprod_t)

            pred_prev_image_mean = coeff_x0 * pred_x0 + coeff_xt * image

            if prev_t > 0:
                pred_prev_image = add_variance(predicted_variance, t, pred_prev_image_mean)
            else:
                pred_prev_image = pred_prev_image_mean

            image = pred_prev_image

            if display and i % 5 == 0:
                # Just store the first image in the batch for progress display
                progress_imgs[f'Step {i}'] = image[0].permute(1, 2, 0).cpu().detach() / 2.0 + 0.5

    if display:
        print("Denoising Progress (CFG):")
        media.show_images(progress_imgs)

    clean = image.cpu().detach().permute(0, 2, 3, 1).numpy() / 2.0 + 0.5

    return clean
  • Results:

Part 1.7.1: Image-to-Image Translation

  1. For the web image:

    • Original image:
    • SDEdit results for noise levels [1, 3, 5, 7, 10, 20]:
  2. For hand-drawn image 1:

    • Original image:
    • SDEdit results for noise levels [1, 3, 5, 7, 10, 20]:
  3. For hand-drawn image 2:

    • Original image:
    • SDEdit results for noise levels [1, 3, 5, 7, 10, 20]:
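
A sketch of the SDEdit procedure behind these results (an illustration under assumptions): the noise levels index into the strided `timesteps` schedule, the image is noised to that timestep with forward(), and then denoised back with iterative_denoise_cfg from Part 1.6.

# Hypothetical SDEdit loop over the noise levels used above.
for i_start in [1, 3, 5, 7, 10, 20]:
    noisy = forward(im, timesteps[i_start])
    edited = iterative_denoise_cfg(noisy, i_start, prompt_embeds,
                                   uncond_prompt_embeds, timesteps, display=False)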

Part 1.7.2 & 1.7.3: Inpainting

  • Original image and mask:

  • Replace the region with "a rocket ship" prompt:

  • Original image and mask:

  • Replace the region with "an oil painting of an old man" prompt:

  • Original image and mask:

  • Replace the region with "a photo of a dog" prompt:
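
One common way to implement the inpainting constraint (a sketch of an assumed helper, not necessarily the exact code used here): after every denoising step, the pixels outside the mask are forced back to the original image, re-noised to the current timestep.

def inpaint_step(pred_prev_image, original_im, mask, prev_t):
    """Hypothetical helper, called on pred_prev_image inside the denoising loop.

    mask == 1 marks the region to regenerate; everywhere else the original
    image (noised to timestep prev_t with forward()) is pasted back, so only
    the masked region changes.
    """
    kept = forward(original_im, prev_t) if prev_t > 0 else original_im
    return mask * pred_prev_image + (1 - mask) * kept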

Part 1.8: Visual Anagrams
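
The visual-anagram trick, sketched below (names such as `embeds_1` / `embeds_2` are assumptions): denoise the upright image with the first prompt, denoise the flipped image with the second prompt, flip that estimate back, and average the two noise estimates.

def anagram_noise_estimate(image, t, embeds_1, embeds_2):
    with torch.no_grad():
        out_1 = stage_1.unet(image, t, encoder_hidden_states=embeds_1,
                             return_dict=False)[0]
        eps_1, _ = torch.split(out_1, image.shape[1], dim=1)

        # Flip upside down along the height axis (a 180-degree rotation also works).
        flipped = torch.flip(image, dims=[2])
        out_2 = stage_1.unet(flipped, t, encoder_hidden_states=embeds_2,
                             return_dict=False)[0]
        eps_2, _ = torch.split(out_2, image.shape[1], dim=1)
        eps_2 = torch.flip(eps_2, dims=[2])   # flip the estimate back

    return 0.5 * (eps_1 + eps_2)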

  • Results:

Part 1.9: Hybrid Images

The skull lithograph is well suited to the low-frequency component, since the skull shape is easy to recognize even with little detail.
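
A sketch of the hybrid-image noise estimate (names and the blur parameters are assumptions): low frequencies of the estimate for one prompt plus high frequencies of the estimate for the other.

import torchvision.transforms.functional as TF

def hybrid_noise_estimate(image, t, embeds_low, embeds_high, kernel_size=33, sigma=2.0):
    with torch.no_grad():
        eps_low, _ = torch.split(
            stage_1.unet(image, t, encoder_hidden_states=embeds_low,
                         return_dict=False)[0], image.shape[1], dim=1)
        eps_high, _ = torch.split(
            stage_1.unet(image, t, encoder_hidden_states=embeds_high,
                         return_dict=False)[0], image.shape[1], dim=1)

    low = TF.gaussian_blur(eps_low, kernel_size=kernel_size, sigma=sigma)
    high = eps_high - TF.gaussian_blur(eps_high, kernel_size=kernel_size, sigma=sigma)
    return low + high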

  • Results:

Deliverables (Part B)

Part 1.1 & 1.2 Using the UNet to Train a Denoiser

A visualization of the noising process using \(\sigma=[0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0]\)
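
The noising process in Part B is simple additive Gaussian noise, \(z = x + \sigma\epsilon\) with \(\epsilon \sim \mathcal{N}(0, I)\); a minimal sketch:

import torch

def add_noise(x, sigma):
    # z = x + sigma * eps, eps ~ N(0, I); x is a batch of MNIST digits (assumed in [0, 1]).
    return x + sigma * torch.randn_like(x)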

Part 1.2.1 Training (Unconditional, \(\sigma=0.5\))

After 1 epoch:

After 5 epochs:

Part 1.2.2 Out-of-Distribution Testing

Test on \(\sigma = [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0]\):

Part 1.2.3 Denoising Pure Noise

This time we start from pure noise and use the trained denoiser to iteratively denoise it.

The final denoised image looks like the average of all digits: the model was trained to denoise noisy versions of many different digits, so when it starts from pure noise it cannot tell which digit to generate and instead produces an averaged one.

Part 2.1 & 2.2: Adding Time Conditioning to UNet & Training (Time-Conditioned)
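
A sketch of one time-conditioned training step (the names, the normalization of t, and T = 300 are assumptions): sample a timestep, noise the clean image with the DDPM forward process, and regress the UNet's output onto the injected noise.

import torch
import torch.nn.functional as F

def train_step(unet, x0, alphas_cumprod, optimizer, T=300):
    # Sample a random timestep per image and build x_t from the clean batch x0.
    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(x0)
    x_t = torch.sqrt(a_bar) * x0 + torch.sqrt(1 - a_bar) * eps

    # The time-conditioned UNet predicts the noise; t is normalized to [0, 1].
    loss = F.mse_loss(unet(x_t, t.float() / T), eps)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()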

Loss curve during training:

Part 2.3: Sampling from the UNet
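
A sketch of the time-conditioned sampling loop (an assumed implementation consistent with the training step above; `alphas`, `betas`, and `alphas_cumprod` are the usual DDPM schedule tensors):

import torch

@torch.no_grad()
def sample_time_conditioned(unet, alphas, betas, alphas_cumprod, T=300,
                            shape=(16, 1, 28, 28), device="cuda"):
    x = torch.randn(shape, device=device)
    for t in range(T - 1, -1, -1):
        t_norm = torch.full((shape[0],), t / T, device=device)
        eps_hat = unet(x, t_norm)
        # DDPM posterior mean, then add noise except at the final step.
        mean = (x - (1 - alphas[t]) / torch.sqrt(1 - alphas_cumprod[t]) * eps_hat) \
               / torch.sqrt(alphas[t])
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * z
    return x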

Part 2.4 & 2.5: Adding Class-Conditioning to UNet & Training (Class-Conditioned)
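
Class conditioning can be sketched as a one-hot vector fed to the UNet, with the condition randomly dropped (set to the zero vector) about 10% of the time during training so that classifier-free guidance is possible at sampling time; the helper below is a hypothetical illustration.

import torch
import torch.nn.functional as F

def make_class_condition(labels, num_classes=10, p_uncond=0.1):
    # One-hot encode the digit labels; with probability p_uncond the vector
    # is zeroed out, which acts as the "null" (unconditional) class.
    c = F.one_hot(labels, num_classes).float()
    keep = (torch.rand(labels.shape[0], device=labels.device) > p_uncond).float()
    return c * keep.unsqueeze(1)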

Part 2.6: Class-Conditioned Sampling from the UNet

  • epoch 1 animation:

  • epoch 5 animation:

  • epoch 10 animation:

If we remove the scheduler and just use a constant learning rate, the loss curve looks like this:

To get a similar result, I use a smaller learning rate of 3e-3 and train for 15 epochs:

  • epoch 1 animation:

  • epoch 5 animation:

  • epoch 10 animation:


CS180-Proj5-Diffusion
https://lceoliu.github.io/2025/11/24/CS180-Proj5-Diffusion/
Author: Chang Leo
Published: November 24, 2025