Introduction
In the previous post, we reviewed the paper "Deep Unsupervised Learning using Non-equilibrium Thermodynamics," which introduced the general concept of diffusion models and how they work. Though the paper proposed an innovative idea by borrowing concepts from thermodynamics, GANs were still the more prevalent models used for image generation.
2014: The Emergence of GANs
• June 2014: Generative Adversarial Networks (GANs) are introduced by Ian Goodfellow. The method involves two neural networks, a generator and a discriminator, which are trained simultaneously.
2015-2018: GAN Diversification
• Rapid advancement and diversification of GANs, including Deep Convolutional GANs (DCGANs), Conditional GANs (cGANs), and CycleGANs
2019-2020: StyleGAN
• February 2019: StyleGAN, released by NVIDIA, can generate highly realistic and customizable images, especially human faces
However, this trend completely changed from 2020 onwards.
2020-Present: Diffusion Models Enter the Scene
The paper I will review here is one of the main papers that brought this shift of attention toward diffusion models.
<Denoising Diffusion Probabilistic Models (NeurIPS 2020)>
Recap
Here is a short recap of what was covered in the previous post. The general idea of a diffusion model is a forward diffusion process that adds noise, and a reverse de-noising process that we learn in order to recover the original image.
Forward Trajectory (basically performing T steps of adding noise)
q(x_t | x_{t-1}) corresponds to either Gaussian diffusion into a Gaussian distribution with identity covariance, or binomial diffusion into an independent binomial distribution.
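For the Gaussian case, the one DDPM builds on, each forward step can be written in the paper's notation with a variance schedule β_t as:

$$
q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t \mathbf{I}\big)
$$

Since Gaussians compose, the noised sample at any step t is also available in closed form: q(x_t | x_0) = N(x_t; √(ᾱ_t) x_0, (1 − ᾱ_t) I), where α_t = 1 − β_t and ᾱ_t = ∏_{s=1}^{t} α_s.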
Reverse Trajectory (reverse process of de-noising)
That 2015 paper was left in the archives for a while as GAN models gained attention, until a few researchers brought it back. The present work was mainly inspired by "Deep Unsupervised Learning using Non-equilibrium Thermodynamics" and "Generative Modeling by Estimating Gradients of the Data Distribution".
Background
As we reviewed in the previous post, by optimizing the log likelihood and applying KL divergence, we obtained a tractable bound. For recap:
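In the DDPM paper's notation (its Eq. 5), the bound decomposes into per-step KL terms:

$$
\mathbb{E}_q\Big[\underbrace{D_{KL}\big(q(x_T \mid x_0)\,\|\,p(x_T)\big)}_{L_T} + \sum_{t>1}\underbrace{D_{KL}\big(q(x_{t-1} \mid x_t, x_0)\,\|\,p_\theta(x_{t-1} \mid x_t)\big)}_{L_{t-1}} \underbrace{-\log p_\theta(x_0 \mid x_1)}_{L_0}\Big]
$$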
Objective
So now the objective for the reverse step is as below:
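Writing the reverse step's KL term with a fixed variance σ_t², it reduces to (Eq. 8 of the paper):

$$
L_{t-1} = \mathbb{E}_q\left[\frac{1}{2\sigma_t^2}\,\big\|\tilde{\mu}_t(x_t, x_0) - \mu_\theta(x_t, t)\big\|^2\right] + C
$$

where μ̃_t is the mean of the true posterior q(x_{t-1} | x_t, x_0) and μ_θ is the learned mean.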
After some algebraic steps, we obtain the loss term approximation. Essentially, we are simply computing the L2 loss between the mean of the predicted/learned distribution and the mean of the true given distribution. This intuitively makes sense.
The gist of this paper lies in the red-boxed area.
The researchers asked: instead of trying to predict the entire distribution, what if we just try to predict the epsilon, that small perturbation?
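Reparametrizing the mean in terms of ε yields the simplified objective the paper actually trains with (Eq. 14):

$$
L_{\text{simple}}(\theta) = \mathbb{E}_{t,\,x_0,\,\epsilon}\left[\big\|\epsilon - \epsilon_\theta\big(\sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\ t\big)\big\|^2\right]
$$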
Algorithm and Training
So we focus on the epsilon, as mentioned.
Algorithm 1: Training
2: sample an image x_0 from the data
3: sample a time step t
4: sample Gaussian noise ε
5: create the noised image (take the original image x_0, scale it down, and add the scaled Gaussian noise)
Note: the noise scale 1 − ᾱ_t grows with t; the closer it gets to 1, the noisier the sample (higher time step → further in the diffusion process → more noise added).
So we are training the neural network to learn the ε_θ that, once subtracted, recovers the original image.
The ε_θ is essentially a learned direction toward likelier images. Subtracting that epsilon makes x more likely (it acts like the gradient of the log density with respect to x), prioritizing that portion for de-noising.
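Putting Algorithm 1 together, here is a minimal PyTorch sketch. The `model` interface ε_θ(x_t, t) and the optimizer handling are assumptions for illustration; the linear β schedule with T = 1000 follows the paper.

```python
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)           # linear variance schedule (as in the paper)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)  # \bar{alpha}_t, cumulative product

def training_step(model, optimizer, x0):
    """One iteration of Algorithm 1. `model` is any network predicting eps_theta(x_t, t)."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,))                        # 3: sample a time step
    eps = torch.randn_like(x0)                           # 4: sample Gaussian noise
    a_bar = alpha_bars[t].view(b, 1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps   # 5: noised image built from x0
    loss = F.mse_loss(model(x_t, t), eps)                # L2 between true and predicted noise
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```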
Algorithm 2: Sampling
• Take the current x_t sample
• Calculate the direction of the step to denoise it
• Make the correction by subtracting (denoise)
• If not at the final step, add the Gaussian noise σ_t z → stay within the distribution
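Concretely, these steps amount to the update below, with z ~ N(0, I) for t > 1 and z = 0 at the final step:

$$
x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)\right) + \sigma_t z
$$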
Inspired by "Generative Modeling by Estimating Gradients of the Data Distribution", step 4 uses a Langevin-style sampling update.
Generative Modeling by Estimating Gradients of the Data Distribution:
• Annealed Langevin dynamics for sampling from an energy-based model via the score function (∇_x log p(x))
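For context, one Langevin step nudges a sample along the score and re-injects noise:

$$
x_{i+1} = x_i + \frac{\epsilon}{2}\,\nabla_x \log p(x_i) + \sqrt{\epsilon}\, z_i, \qquad z_i \sim \mathcal{N}(0, \mathbf{I})
$$

which mirrors the subtract-then-add-noise structure of Algorithm 2.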
Loss:
As noted, the objective simply aims to minimize the difference between the true noise ε added to the data and the noise ε_θ predicted by the model.
Architecture
Backbone of PixelCNN++: a U-Net based on a Wide ResNet, with the time step t fed to the network through a sinusoidal position embedding.
Results
IS (Inception Score): how confidently each generated sample is classified as belonging to a class (sample quality)
FID (Fréchet Inception Distance): how well the generated distribution matches the real one in feature space, covering diversity across classes (lower is better)
Relationship to Auto-regressive Model
• Instead of adding Gaussian noise, blank out one pixel at each forward step
• Not exactly the same as adding Gaussian noise, but a valid diffusion process
• The reverse process is basically filling that pixel back in
• This makes it a kind of auto-regressive model
Taking a peek at implementations
Now that we have the main idea, let's find out how this is actually implemented in code!
The author of the video also provides a Google Colab notebook for practice below:
<Forward Process>
The forward process is fairly simple, as we are just adding Gaussian noise at each step.
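A minimal sketch of the forward process in PyTorch, using the closed form q(x_t | x_0) to jump straight to step t. The schedule constants here are illustrative assumptions, not necessarily the notebook's exact values.

```python
import torch

T = 300
betas = torch.linspace(1e-4, 0.02, T)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)

def forward_diffusion_sample(x0, t):
    """Noise x0 directly to step t: x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps."""
    noise = torch.randn_like(x0)
    a_bar = alpha_bars[t].view(-1, 1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise
    return x_t, noise   # the noise is returned too, since it is the regression target
```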
<Loss and Update>
Loss:
As noted, the objective simply aims to minimize the difference between the true noise ε added to the data and the noise ε_θ predicted by the model.
Notice how we are simply computing the loss between the predicted noise and the actual noise.
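A sketch of the loss computation, reusing the `forward_diffusion_sample` helper above. The paper's simplified objective is an L2 loss; some implementations use L1 instead.

```python
import torch.nn.functional as F

def get_loss(model, x0, t):
    """Noise the image to step t, predict that noise, and compare with the truth."""
    x_t, noise = forward_diffusion_sample(x0, t)
    noise_pred = model(x_t, t)            # the U-Net outputs eps_theta(x_t, t)
    return F.mse_loss(noise_pred, noise)
```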
<Reverse Process>
Notice how the output of the network is the noise in the image. This aligns with the explanation we saw in the paper: essentially, we want to predict ε_θ.
Note that since the model has to know the diffusion time step t, we somehow have to incorporate that information into an embedding. We use sinusoidal positional embeddings, just like the Transformer model (unlike an RNN, the network does not process its input sequentially, so the time step must be embedded explicitly).
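A sketch of such a sinusoidal time-step embedding, applying the Transformer's positional-encoding formula to t:

```python
import math
import torch
import torch.nn as nn

class SinusoidalTimeEmbedding(nn.Module):
    """Embed the scalar diffusion step t with fixed sin/cos frequencies."""
    def __init__(self, dim):
        super().__init__()
        self.dim = dim

    def forward(self, t):                  # t: (batch,) integer time steps
        half = self.dim // 2
        freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
        args = t.float().unsqueeze(1) * freqs.unsqueeze(0)   # (batch, half)
        return torch.cat([torch.sin(args), torch.cos(args)], dim=1)
```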
<Sampling / Inference>
As previously explained:
Algorithm 2: Sampling
• Take the current x_t sample
• Calculate the direction of the step to denoise it
• Make the correction by subtracting (denoise)
• If not at the final step, add the Gaussian noise σ_t z → stay within the distribution
Observe how we are essentially just performing:
current_image_at_t − noise_prediction (with the noise prediction computed by model(x, t)),
while at time step 0 we return the noiseless image, and otherwise add the noise back in to stay on the forward-process distribution.
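A sketch of one reverse step of Algorithm 2, reusing the `betas` and `alpha_bars` tensors from the forward-process sketch; σ_t² = β_t is one of the two variance choices discussed in the paper.

```python
@torch.no_grad()
def sample_timestep(model, x_t, t):
    """One reverse step: subtract the scaled predicted noise, then re-add sigma_t * z if t > 0."""
    beta = betas[t]
    a_bar = alpha_bars[t]
    t_batch = torch.full((x_t.shape[0],), t)     # broadcast the integer step over the batch
    eps = model(x_t, t_batch)                    # predicted noise eps_theta(x_t, t)
    mean = (x_t - beta / (1.0 - a_bar).sqrt() * eps) / (1.0 - beta).sqrt()
    if t == 0:
        return mean                              # final step: return the noiseless image
    noise = torch.randn_like(x_t)
    return mean + beta.sqrt() * noise            # add sigma_t * z to stay in-distribution
```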
Conclusion:
GANs were originally still the prevalent models used for image generation. However, after these breakthroughs and improvements, diffusion models started outperforming GAN models in terms of sample quality and diversity. OpenAI's DALL·E clearly demonstrated this. Now that we have reviewed the basics of diffusion models over these two posts, I hope everyone has a much better idea of how images are generated by AI. It is not magic, but beautiful math and statistics.
Reviewed by Metown Corp. (Sangbin Jeon, PhD Researcher)