Deep Unsupervised Learning using Nonequilibrium Thermodynamics (ICML 2015)

Background
Recently, with the rapid development of artificial intelligence, it has started touching upon areas we believed would take much longer to reach: art and image generation. I am pretty sure most of us have seen images online where AI was used to modify the background of a scenery photo, like the one below. This was very cool, but at least for me, it was not to an extent where it seemed groundbreaking.
Image Source: CycleGAN (Medium)
However, we are now seeing an astonishing level of progress. DALL-E 3 was announced by OpenAI in September 2023; users can simply type in a prompt to generate realistic images or artwork. It no longer just enables simple edits on an existing image, but generates a whole new image from a prompt.
What are these technologies and how do they work? Since many different developments are involved, I won't be able to touch upon everything. In this blog post, I aim to provide a general overview of the timeline of progress in image editing and generation, and of the diffusion model.
Timeline

2014: The Emergence of GANs

June 2014: Generative Adversarial Networks (GANs) are introduced by Ian Goodfellow. This method involves two neural networks, a generator and a discriminator, which are trained simultaneously.

2015-2018: GAN Diversification

Rapid advancements and diversification of GANs, including Deep Convolutional GANs (DCGANs), Conditional GANs (cGANs), and CycleGANs.

2019-2020: StyleGAN

February 2019: StyleGAN is released by NVIDIA; it can generate highly realistic and customizable images, especially human faces.
For quite a long time, progress in image generation with AI was mainly focused on GAN-based models. I do not want to get into too much detail, but as stated in the timeline, “this method involves two neural networks, a generator and a discriminator, which are trained simultaneously.”
Image Source: Introduction to GAN
Essentially, it is as if we are repeatedly trying to generate perfect-looking fake money while constantly being checked by a discriminator.
However, GANs had limitations:
1. They do not naturally support conditional generation, such as the prompt-based inputs that DALL-E allows.
2. Training a GAN is often challenging, as the model may fail to converge.

2020-Present: Diffusion Models Enter the Scene

2020 Onwards: Diffusion models began to gain more attention.
1. Diffusion models generate outcomes in a much more straightforward and intuitive way.
2. They offer more flexible, conditional generation, as DALL-E 3 does by incorporating CLIP-style (text + image) embeddings.
I am also planning to touch upon this in a future post, but diffusion is now even being applied in the field of BioML, where it is used to predict protein structures.
Paper
Now, the paper that will be discussed today is “Deep Unsupervised Learning using Nonequilibrium Thermodynamics”, the paper that provides the basis for all these emerging diffusion models. As we can see from the title, the idea of this paper was inspired by thermodynamics. This is really the beauty of science and technology: an idea or concept borrowed from one field creates a new ripple of change in a seemingly unrelated field.
Now, let’s dig into the paper!
The main idea of diffusion is to apply an iterative noise-adding process to the data until it becomes pure random noise (typically Gaussian). Then, if we learn to track the reverse of this process, we can trace it back and analyze the distribution of the data.
As the title of the paper suggests, the idea, “inspired by non-equilibrium statistical physics, is to systematically and slowly destroy structure in a data distribution through an iterative forward diffusion process”. In physics, a “quasi-static process” is one in which the system changes so slowly that it stays close to equilibrium at every step. The diffusion model mimics this idea by adding an incremental, analytically evaluable amount of noise at each forward step, which is what makes the reverse de-noising process possible.
“we explicitly define the probabilistic model as the endpoint of the Markov chain. Since each step in the diffusion chain has an analytically evaluable probability, the full chain can also be analytically evaluated.” - page 2

Forward Trajectory (basically performing T steps of adding noise)

$q(x^{(t)} \mid x^{(t-1)})$ corresponds to either Gaussian diffusion into a Gaussian distribution with identity covariance, or binomial diffusion into an independent binomial distribution.
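To make the forward trajectory concrete, here is a minimal NumPy sketch of Gaussian forward diffusion, assuming the parameterization $q(x^{(t)} \mid x^{(t-1)}) = \mathcal{N}\!\left(x^{(t)};\, x^{(t-1)}\sqrt{1-\beta_t},\, \beta_t I\right)$ used for the Gaussian case in the paper. The function names and the fixed noise schedule below are my own illustration, not the paper's code.

```python
import numpy as np

def forward_step(x_prev, beta_t, rng):
    """One Gaussian forward diffusion step:
    x_t ~ N(x_{t-1} * sqrt(1 - beta_t), beta_t * I)."""
    noise = rng.standard_normal(x_prev.shape)
    return np.sqrt(1.0 - beta_t) * x_prev + np.sqrt(beta_t) * noise

def forward_trajectory(x0, betas, rng):
    """Run T forward steps and return the whole trajectory [x_0, ..., x_T]."""
    xs = [x0]
    for beta_t in betas:
        xs.append(forward_step(xs[-1], beta_t, rng))
    return xs

# Example: diffuse a toy 1-D "data point" for T = 1000 steps.
rng = np.random.default_rng(0)
x0 = rng.standard_normal(64)            # stand-in for a data sample
betas = np.linspace(1e-4, 0.02, 1000)   # illustrative fixed schedule (the paper learns its betas)
trajectory = forward_trajectory(x0, betas, rng)
print(trajectory[-1].std())             # close to 1: nearly pure Gaussian noise
```

After enough steps the sample is indistinguishable from the identity-covariance Gaussian, which is exactly the "complete noise" endpoint described above.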

Reverse Trajectory (reverse process of de-noising)

If we take a closer look (Table App. 1 of the paper):
The conditional probability for time step t−1 is a Gaussian (denoted $\mathcal{N}$) whose mean and covariance are learned functions of the state at time step t. This lets us compute the probability distribution at the previous time step, and iterate the reverse process step by step.
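As a sketch, a single reverse (de-noising) step just samples from a Gaussian whose mean and covariance come from learned functions of $x^{(t)}$ and $t$. The `mean_fn` and `cov_fn` callables below are placeholders for whatever network implements those functions; everything here is my own illustrative code, not the paper's implementation.

```python
import numpy as np

def reverse_step(x_t, t, mean_fn, cov_fn, rng):
    """One reverse diffusion step:
    x_{t-1} ~ N(mean_fn(x_t, t), cov_fn(x_t, t)), with a diagonal covariance."""
    mu = mean_fn(x_t, t)     # learned mean for the step t -> t-1
    var = cov_fn(x_t, t)     # learned (diagonal) variance for the step t -> t-1
    return mu + np.sqrt(var) * rng.standard_normal(x_t.shape)

def sample(shape, T, mean_fn, cov_fn, rng):
    """Start from pure Gaussian noise x_T and denoise back to x_0."""
    x = rng.standard_normal(shape)
    for t in range(T, 0, -1):
        x = reverse_step(x, t, mean_fn, cov_fn, rng)
    return x

# Toy usage with dummy "learned" functions (shrinking mean, small variance).
rng = np.random.default_rng(0)
x0_hat = sample((64,), T=1000,
                mean_fn=lambda x, t: 0.99 * x,
                cov_fn=lambda x, t: 1e-4 * np.ones_like(x),
                rng=rng)
```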

Model Probability

The probability the generative model assigns to the data is obtained by integrating over all possible reverse trajectories. This integral is not tractable. Instead, we compute the relative probability of the forward and reverse trajectories, averaged over the forward process, an idea inspired by annealed importance sampling.
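For reference, as I read them from the paper, the two forms are as follows; the second follows from the first by multiplying and dividing by the forward trajectory $q(x^{(1:T)} \mid x^{(0)})$:

$$p\!\left(x^{(0)}\right) = \int dx^{(1:T)}\; p\!\left(x^{(0:T)}\right)$$

$$p\!\left(x^{(0)}\right) = \int dx^{(1:T)}\; q\!\left(x^{(1:T)} \mid x^{(0)}\right)\, p\!\left(x^{(T)}\right) \prod_{t=1}^{T} \frac{p\!\left(x^{(t-1)} \mid x^{(t)}\right)}{q\!\left(x^{(t)} \mid x^{(t-1)}\right)}$$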

Training

In the end, most machine learning problems come down to formulating a proper optimization problem.
Equation (10)
Here, $q(x^{(0)})$ is the true distribution of the initial data $x^{(0)}$, and $p(x^{(0)})$ is the model's estimated probability density of the initial data; $dx^{(0)}$ denotes the infinitesimal element in the data space of $x^{(0)}$ used for integration.
We want to perform traditional maximum log-likelihood training, maximizing the expected log of the model probability $p(x^{(0)})$ under the true distribution $q(x^{(0)})$. Because the data is continuous, this is written as an integral over $dx^{(0)}$.
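Written out, equation (10) above is, as I reconstruct it from this description:

$$L = \int dx^{(0)}\; q\!\left(x^{(0)}\right) \log p\!\left(x^{(0)}\right)$$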
Equation (11)
Dissecting the expansion, we have:
Forward step
$x^{(1:T)}$ denotes an entire path of intermediate states up to time step T;
$q(x^{(1:T)} \mid x^{(0)})$ gives the weight, or likelihood, of that path.
By integrating over all of these possible paths, we aggregate every trajectory up to time step T, which is exactly the forward process.
Reverse step
The second portion of the equation resembles the forward step described above, except that we now have an additional ratio term measuring how well each reverse step matches the corresponding forward step.
In summary, the expansion integrates over all possible forward paths to reach $p(x^{(T)})$ and then reverses them back, obtaining $p(x^{(0)})$, the model's estimate of the initial data distribution.
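Putting the pieces together, equation (11) as I reconstruct it from the paper reads:

$$L = \int dx^{(0)}\; q\!\left(x^{(0)}\right) \log \left[ \int dx^{(1:T)}\; q\!\left(x^{(1:T)} \mid x^{(0)}\right)\, p\!\left(x^{(T)}\right) \prod_{t=1}^{T} \frac{p\!\left(x^{(t-1)} \mid x^{(t)}\right)}{q\!\left(x^{(t)} \mid x^{(t-1)}\right)} \right]$$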
Equation (12)
Equation (11) is still quite complicated, so we apply Jensen's inequality for further simplification.
Recall Jensen's inequality applied to expected values:
Jensen's Inequality
$f(\mathbb{E}[X]) \le \mathbb{E}[f(X)]$ for a convex function $f$.
Since the logarithm is concave, the inequality is reversed when applied to equation (11):
$f(\mathbb{E}[X]) \ge \mathbb{E}[f(X)]$
which gives us a lower bound on L.
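The resulting lower bound, equation (12), as I reconstruct it from the paper, moves the log inside the expectation over forward trajectories:

$$L \ge K = \int dx^{(0:T)}\; q\!\left(x^{(0:T)}\right) \log \left[ p\!\left(x^{(T)}\right) \prod_{t=1}^{T} \frac{p\!\left(x^{(t-1)} \mid x^{(t)}\right)}{q\!\left(x^{(t)} \mid x^{(t-1)}\right)} \right]$$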
Equation (13) and (14)
Recall that the KL divergence, which measures the difference between two probability distributions over the same variable, is given (in terms of expected values) by $D_{KL}(q \,\|\, p) = \mathbb{E}_{q}\!\left[\log \frac{q(x)}{p(x)}\right]$.
Now that we have the lower bound from equation (12), we rewrite it in terms of KL divergences to obtain the expression K in (13).
Training proceeds in the direction of maximizing the model log likelihood.
Once again, according to Jensen's inequality, L has a lower bound.
As per the appendix, K can be expressed as a combination of KL divergence and entropies:
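In my reading, that decomposition takes the following form, with one KL term per reverse step plus boundary entropy terms:

$$K = -\sum_{t=2}^{T} \int dx^{(0)}\, dx^{(t)}\; q\!\left(x^{(0)}, x^{(t)}\right)\, D_{KL}\!\left( q\!\left(x^{(t-1)} \mid x^{(t)}, x^{(0)}\right) \,\Big\|\, p\!\left(x^{(t-1)} \mid x^{(t)}\right) \right) + H_q\!\left(X^{(T)} \mid X^{(0)}\right) - H_q\!\left(X^{(1)} \mid X^{(0)}\right) - H_p\!\left(X^{(T)}\right)$$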
Since the entropies and KL divergences are computable, K is computable. Equality holds when the forward and reverse processes are identical, so if $\beta_t$ is sufficiently small, L can be considered almost equal to K.
Learning to find the reverse Markov transition is equivalent to maximizing this lower bound.
The choice of $\beta_t$ is crucial to the model's performance. For Gaussian diffusion, $\beta_2, \dots, \beta_T$ were determined through gradient ascent on K, and $\beta_1$ was kept fixed to prevent overfitting.
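The paper learns this schedule by gradient ascent on K. Purely as an illustration of what a schedule looks like, here is a small sketch using a fixed linear schedule of the kind later used in Ho et al. (2020), which is listed in the references below; it is not the paper's learned schedule. The helper names are my own.

```python
import numpy as np

def linear_beta_schedule(T, beta_min=1e-4, beta_max=0.02):
    """A fixed, illustrative noise schedule beta_1, ..., beta_T."""
    return np.linspace(beta_min, beta_max, T)

def signal_retention(betas):
    """Cumulative product of (1 - beta_t): the fraction of the original
    signal variance remaining after each forward step."""
    return np.cumprod(1.0 - betas)

betas = linear_beta_schedule(T=1000)
retained = signal_retention(betas)
print(retained[0], retained[-1])   # ~1.0 after one step, near 0 at t=T (almost pure noise)
```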
Modified Marginal Distributions
To compute a posterior for inference (for example, for denoising or inpainting), the model distribution has to be multiplied by a second distribution and re-normalized. For most methods this is costly, so the author instead works with modified marginal distributions.
In the diffusion framework, multiplying by this second distribution acts as only a small perturbation to each step, which keeps the computation simple.
Each intermediate distribution is multiplied by the corresponding function and normalized; the tilde denotes the modified trajectory that accounts for this perturbation.
Thus, under the modified marginal distributions, the reverse diffusion process we want changes from the original kernel $p(x^{(t)} \mid x^{(t+1)})$ to a perturbed kernel $\tilde{p}(x^{(t)} \mid x^{(t+1)})$. In order to satisfy equation (20), the perturbed kernel is taken proportional to the original kernel multiplied by $r(x^{(t)})$, and is then normalized.
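Concretely, as I reconstruct equation (22) from the paper, the perturbed reverse kernel is:

$$\tilde{p}\!\left(x^{(t)} \mid x^{(t+1)}\right) = \frac{1}{\tilde{Z}_t\!\left(x^{(t+1)}\right)}\; p\!\left(x^{(t)} \mid x^{(t+1)}\right)\, r\!\left(x^{(t)}\right)$$

where $\tilde{Z}_t(x^{(t+1)})$ is the normalizing constant.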
Applying r(x)
As we can see from equation (22), if $r(x^{(t)})$ is sufficiently smooth, it can be treated as a small perturbation to the reverse diffusion kernel.
The author notes that this $r(x^{(t)})$ was set to be constant, as the “second choice r(x(t)) makes no contribution to the starting distribution for the reverse trajectory.”
Entropy
Since the forward process is known, upper and lower bounds on the entropy of each reverse step can be computed.
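As I read the paper, these bounds are expressed entirely in terms of forward-process entropies:

$$H_q\!\left(X^{(t)} \mid X^{(t-1)}\right) + H_q\!\left(X^{(t-1)} \mid X^{(0)}\right) - H_q\!\left(X^{(t)} \mid X^{(0)}\right) \;\le\; H_q\!\left(X^{(t-1)} \mid X^{(t)}\right) \;\le\; H_q\!\left(X^{(t)} \mid X^{(t-1)}\right)$$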
Experiment/Results
(a) The original bark image
(b) The same image with a 100×100 pixel region replaced by isotropic Gaussian noise
(c) The image recovered (inpainted) by the trained diffusion probabilistic model
We see that the corrupted region is filled in with samples from the trained model, recovering a texture nearly identical to the original, as shown below.
The model was tested on other common image datasets as well.
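As a small sketch of the corruption step described above (the region coordinates and the stand-in image below are placeholders of my own, not the paper's setup):

```python
import numpy as np

def corrupt_with_gaussian_patch(image, top, left, size=100, rng=None):
    """Replace a size x size region of an image array with isotropic
    Gaussian noise, as in the bark inpainting experiment described above."""
    if rng is None:
        rng = np.random.default_rng()
    corrupted = image.copy()
    corrupted[top:top + size, left:left + size] = rng.standard_normal((size, size))
    return corrupted

# Toy usage on a random stand-in "image".
rng = np.random.default_rng(0)
bark = rng.random((256, 256))        # placeholder for the bark texture
noisy = corrupt_with_gaussian_patch(bark, top=78, left=78, rng=rng)
```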
Conclusion
Most existing density estimation techniques must sacrifice modeling power in order to stay tractable and efficient, and sampling or evaluation is often extremely expensive.
The GANs introduced earlier in this post also suffered from expensive and unstable training.
The core of the algorithm consists of estimating the reversal of a Markov diffusion chain that maps data to a noise distribution.
As noted earlier, diffusion models are not limited to generating images; they are currently being used in the BioML field to predict protein structures, for example. I am starting to see that the boundaries between different domains are getting thinner and thinner. It is really beautiful to see how progress in one area can indirectly or unexpectedly lead to the development of a different field. I believe this is the beauty of science: when a researcher places a tool in the idea bank, thousands of other researchers are able to make use of it. I am also working hard to become one of them.
Written by: Terry (Taehan) Kim
Reviewed by Metown Corp. (Sangbin Jeon, PhD Researcher)
References
Some references for further reading and understanding:
Ho et al. (2020). Denoising diffusion probabilistic models. Advances in neural information processing systems, 33, 6840-6851.
Luo, C. (2022). Understanding diffusion models: A unified perspective. arXiv preprint arXiv:2208.11970.
Meng et al. (2023). SIGGRAPH 2023 Course on Diffusion Models. SIGGRAPH ’23 Courses, August 06-10, 2023. ACM, New York, NY, USA, 113 pages.