Introduction
In the previous post, we reviewed the paper "Deep Unsupervised Learning using Non-equilibrium Thermodynamics," which introduced the general concept of diffusion models and how they work. Though the paper proposed an innovative idea by borrowing concepts from thermodynamics, GANs were still the more prevalent models used for image generation.
2014: The Emergence of GANs
• June 2014: Generative Adversarial Networks (GANs) are introduced by Ian Goodfellow. The method involves two neural networks, a generator and a discriminator, which are trained simultaneously.
2015-2018: GAN Diversification
• Rapid advancement and diversification of GANs, including Deep Convolutional GANs (DCGANs), Conditional GANs (cGANs), and CycleGANs
2019-2020: StyleGAN
• February 2019: StyleGAN, released by NVIDIA, can generate highly realistic and customizable images, especially human faces
However, this trend completely changed from 2020 onwards.
2020-Present: Diffusion Models Enter the Scene
The paper I will review here is one of the main papers that brought this shift of attention toward diffusion models.
<Denoising Diffusion Probabilistic Models (NeurIPS 2020)>
Recap
Here is a short recap of what was covered in the previous post. The general idea of a diffusion model is a forward diffusion process that adds noise, and a reverse de-noising process that we learn in order to recover the original image.
Forward Trajectory (basically performing T steps of adding noise)
q(x_t | x_{t-1}) corresponds to either Gaussian diffusion into a Gaussian distribution with identity covariance, or binomial diffusion into an independent binomial distribution.
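For the Gaussian case, the one DDPM builds on, each forward step can be written in the paper's notation with a variance schedule β_t as:

$$
q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t \mathbf{I}\big)
$$

Since Gaussians compose, the noised sample at any step t is also available in closed form: q(x_t | x_0) = N(x_t; √(ᾱ_t) x_0, (1 − ᾱ_t) I), where α_t = 1 − β_t and ᾱ_t = ∏_{s=1}^{t} α_s.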
Reverse Trajectory (reverse process of de-noising)
That 2015 paper was left in the archives for a while as GAN models gained attention, until a few researchers brought it back. The present work was mainly inspired by "Deep Unsupervised Learning using Non-equilibrium Thermodynamics" and "Generative Modeling by Estimating Gradients of the Data Distribution".
Background
As we reviewed in the previous post, by optimizing the log likelihood and applying KL divergence, we obtained a tractable bound. For recap:
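In the DDPM paper's notation (its Eq. 5), the bound decomposes into per-step KL terms:

$$
\mathbb{E}_q\Big[\underbrace{D_{KL}\big(q(x_T \mid x_0)\,\|\,p(x_T)\big)}_{L_T} + \sum_{t>1}\underbrace{D_{KL}\big(q(x_{t-1} \mid x_t, x_0)\,\|\,p_\theta(x_{t-1} \mid x_t)\big)}_{L_{t-1}} \underbrace{-\log p_\theta(x_0 \mid x_1)}_{L_0}\Big]
$$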
Objective
So now the objective for the reverse step is as below:
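Writing the reverse step's KL term with a fixed variance σ_t², it reduces to (Eq. 8 of the paper):

$$
L_{t-1} = \mathbb{E}_q\left[\frac{1}{2\sigma_t^2}\,\big\|\tilde{\mu}_t(x_t, x_0) - \mu_\theta(x_t, t)\big\|^2\right] + C
$$

where μ̃_t is the mean of the true posterior q(x_{t-1} | x_t, x_0) and μ_θ is the learned mean.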
After some algebraic steps, we obtain the loss term approximation. Essentially, we are simply computing the L2 loss between the mean of the predicted/learned distribution and the mean of the true given distribution. This intuitively makes sense.
The gist of this paper lies in the red-boxed area.
The researchers asked: instead of trying to predict the entire distribution, what if we just try to predict the epsilon, that small perturbation?
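Reparametrizing the mean in terms of ε yields the simplified objective the paper actually trains with (Eq. 14):

$$
L_{\text{simple}}(\theta) = \mathbb{E}_{t,\,x_0,\,\epsilon}\left[\big\|\epsilon - \epsilon_\theta\big(\sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\ t\big)\big\|^2\right]
$$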
Algorithm and Training
So we focus on the epsilon, as mentioned.
Algorithm 1: Training
2: sample an image x_0 from the data
3: sample a time step t
4: sample Gaussian noise ε
5: create the noised image (take the original image x_0, scale it down, and add the scaled Gaussian noise)
Note: the noise scale 1 − ᾱ_t grows with t; the closer it gets to 1, the noisier the sample (higher time step → further in the diffusion process → more noise added).
So we are training the neural network to learn the ε_θ that, once subtracted, recovers the original image.
The ε_θ is essentially a learned direction toward likelier images. Subtracting that epsilon makes x more likely (it acts like the gradient of the log density with respect to x), prioritizing that portion for de-noising.
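Putting Algorithm 1 together, here is a minimal PyTorch sketch. The `model` interface ε_θ(x_t, t) and the optimizer handling are assumptions for illustration; the linear β schedule with T = 1000 follows the paper.

```python
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)           # linear variance schedule (as in the paper)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)  # \bar{alpha}_t, cumulative product

def training_step(model, optimizer, x0):
    """One iteration of Algorithm 1. `model` is any network predicting eps_theta(x_t, t)."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,))                        # 3: sample a time step
    eps = torch.randn_like(x0)                           # 4: sample Gaussian noise
    a_bar = alpha_bars[t].view(b, 1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps   # 5: noised image built from x0
    loss = F.mse_loss(model(x_t, t), eps)                # L2 between true and predicted noise
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```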
Algorithm 2: Sampling
• Take the current x_t sample
• Calculate the direction of the step to denoise it
• Make the correction by subtracting (denoise)
• If not at the final step, add the Gaussian noise σ_t z → stay within the distribution
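Concretely, these steps amount to the update below, with z ~ N(0, I) for t > 1 and z = 0 at the final step:

$$
x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)\right) + \sigma_t z
$$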
Inspired by "Generative Modeling by Estimating Gradients of the Data Distribution", step 4 uses a Langevin-style sampling update.
Generative Modeling by Estimating Gradients of the Data Distribution:
• Annealed Langevin dynamics for sampling from an energy-based model via the score function (∇_x log p(x))
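For context, one Langevin step nudges a sample along the score and re-injects noise:

$$
x_{i+1} = x_i + \frac{\epsilon}{2}\,\nabla_x \log p(x_i) + \sqrt{\epsilon}\, z_i, \qquad z_i \sim \mathcal{N}(0, \mathbf{I})
$$

which mirrors the subtract-then-add-noise structure of Algorithm 2.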
Loss:
As noted, the objective simply aims to minimize the difference between the true noise ε added to the data and the noise ε_θ predicted by the model.
Architecture
Backbone of PixelCNN++: a U-Net based on a Wide ResNet, with the time step t fed to the network through a sinusoidal position embedding.
Results
IS (Inception Score): how confidently each generated sample is classified as belonging to a class (sample quality)
FID (Fréchet Inception Distance): how well the generated distribution matches the real one in feature space, covering diversity across classes (lower is better)
Relationship to Auto-regressive Model
• Instead of adding Gaussian noise, blank out one pixel at each forward step
• Not exactly the same as adding Gaussian noise, but a valid diffusion process
• The reverse process is basically filling that pixel back in
• This makes it a kind of auto-regressive model
Taking a peek at implementations
Now that we have the main idea, let's find out how this is actually implemented in code!
The author of the video also provides a Google Colab notebook for practice below:
<Forward Process>
The forward process is fairly simple, as we are just adding Gaussian noise at each step.
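A minimal sketch of the forward process in PyTorch, using the closed form q(x_t | x_0) to jump straight to step t. The schedule constants here are illustrative assumptions, not necessarily the notebook's exact values.

```python
import torch

T = 300
betas = torch.linspace(1e-4, 0.02, T)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)

def forward_diffusion_sample(x0, t):
    """Noise x0 directly to step t: x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps."""
    noise = torch.randn_like(x0)
    a_bar = alpha_bars[t].view(-1, 1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise
    return x_t, noise   # the noise is returned too, since it is the regression target
```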
<Loss and Update>
Loss:
As noted, the objective simply aims to minimize the difference between the true noise ε added to the data and the noise ε_θ predicted by the model.
Notice how we are simply computing the loss between the predicted noise and the actual noise.
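A sketch of the loss computation, reusing the `forward_diffusion_sample` helper above. The paper's simplified objective is an L2 loss; some implementations use L1 instead.

```python
import torch.nn.functional as F

def get_loss(model, x0, t):
    """Noise the image to step t, predict that noise, and compare with the truth."""
    x_t, noise = forward_diffusion_sample(x0, t)
    noise_pred = model(x_t, t)            # the U-Net outputs eps_theta(x_t, t)
    return F.mse_loss(noise_pred, noise)
```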
<Reverse Process>
Notice how the output of the network is the noise in the image. This aligns with the explanation we saw in the paper: essentially, we want to predict ε_θ.
Note that since the model has to know the diffusion time step t, we somehow have to incorporate that information into an embedding. We use sinusoidal positional embeddings, just like the Transformer model (unlike an RNN, the network does not process its input sequentially, so the time step must be embedded explicitly).
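A sketch of such a sinusoidal time-step embedding, applying the Transformer's positional-encoding formula to t:

```python
import math
import torch
import torch.nn as nn

class SinusoidalTimeEmbedding(nn.Module):
    """Embed the scalar diffusion step t with fixed sin/cos frequencies."""
    def __init__(self, dim):
        super().__init__()
        self.dim = dim

    def forward(self, t):                  # t: (batch,) integer time steps
        half = self.dim // 2
        freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
        args = t.float().unsqueeze(1) * freqs.unsqueeze(0)   # (batch, half)
        return torch.cat([torch.sin(args), torch.cos(args)], dim=1)
```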
<Sampling / Inference>
As previously explained:
Algorithm 2: Sampling
• Take the current x_t sample
• Calculate the direction of the step to denoise it
• Make the correction by subtracting (denoise)
• If not at the final step, add the Gaussian noise σ_t z → stay within the distribution
Observe how we are essentially just performing:
current_image_at_t − noise_prediction (with the noise prediction computed by model(x, t)),
while at time step 0 we return the noiseless image, and otherwise add the noise back in to stay on the forward-process distribution.
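A sketch of one reverse step of Algorithm 2, reusing the `betas` and `alpha_bars` tensors from the forward-process sketch; σ_t² = β_t is one of the two variance choices discussed in the paper.

```python
@torch.no_grad()
def sample_timestep(model, x_t, t):
    """One reverse step: subtract the scaled predicted noise, then re-add sigma_t * z if t > 0."""
    beta = betas[t]
    a_bar = alpha_bars[t]
    t_batch = torch.full((x_t.shape[0],), t)     # broadcast the integer step over the batch
    eps = model(x_t, t_batch)                    # predicted noise eps_theta(x_t, t)
    mean = (x_t - beta / (1.0 - a_bar).sqrt() * eps) / (1.0 - beta).sqrt()
    if t == 0:
        return mean                              # final step: return the noiseless image
    noise = torch.randn_like(x_t)
    return mean + beta.sqrt() * noise            # add sigma_t * z to stay in-distribution
```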
Conclusion:
GANs were originally still the prevalent models used for image generation. However, after these breakthroughs and improvements, diffusion models started outperforming GAN models in terms of sample quality and diversity. OpenAI's DALL·E clearly demonstrated this. Now that we have reviewed the basics of diffusion models over these two posts, I hope everyone has a much better idea of how images are generated by AI. It is not magic, but beautiful math and statistics.
Reviewed by Metown Corp. (Sangbin Jeon, PhD Researcher)