Background
With the rapid development of artificial intelligence, AI has started touching areas we believed would take much longer to reach: art and image generation. Most of us have probably seen images online where AI was used to modify the background of a scene, like the example below. This was very cool, but at least for me, it did not yet feel groundbreaking.
Image Source: cycleGAN Medium
However, we have recently seen an astonishing level of progress. In 2023, OpenAI launched DALL-E 3, where users can simply type a prompt to generate realistic images or artwork. It not only enables simple edits to an existing image, but also generates entirely new images from prompts.
Source: DALL-E 3 Medium
What are these technologies and how do they work? Since many different developments are involved, I won't be able to touch upon everything. In this blog post, I aim to provide a general overview of the timeline and progress of image editing and generation, and then of the diffusion model.
Timeline
2014: The Emergence of GANs
• June 2014: Generative Adversarial Networks (GANs) are introduced by Ian Goodfellow. This method involves two neural networks, a generator and a discriminator, which are trained simultaneously.
2015-2018: GAN Diversification
• Rapid advancements and diversification of GANs, including Deep Convolutional GANs (DCGANs), Conditional GANs (cGANs), and CycleGANs.
2019-2020: StyleGAN
• February 2019: StyleGAN is released by NVIDIA; it can generate highly realistic and customizable images, especially human faces.
For quite a long time, progress in AI image generation was mainly focused on GAN-based models. I do not want to get into too much detail, but as stated in the timeline, "this method involves two neural networks, a generator and a discriminator, which are trained simultaneously."
Image Source: Introduction to GAN
Essentially, it is as if we are repeatedly trying to produce perfect-looking counterfeit money while constantly being checked by a discriminator.
However, GANs had two main limitations:
1. They do not naturally support flexible, prompt-based conditional generation of the kind DALL-E allows.
2. Training a GAN is often challenging, as the model might fail to converge.
2020-Present: Diffusion Models Enter the Scene
• 2020 onwards: Diffusion models began to gain more attention.
1. Diffusion models generate outcomes in a much more straightforward and intuitive way.
2. They offer more flexible conditional generation, as DALL-E 3 does by incorporating CLIP-style (text + image) embeddings.
I am also planning to touch upon this in a future post, but diffusion is now even being applied in the field of BioML, where it is used to predict protein structures.
Paper
Now, the paper that will be discussed today is "Deep Unsupervised Learning using Nonequilibrium Thermodynamics", the paper that provides the basis for all these emerging diffusion models. As we can see from the title, the idea of this paper was inspired by thermodynamics. This is really the beauty of science and technology: an idea borrowed from one field creates a new ripple of change in a seemingly unrelated field.
Now, let’s dig into the paper!
The main idea of diffusion is to apply an iterative noise-adding process until the data becomes pure random noise (often Gaussian). Then, if we learn to track the reverse process, we can trace back and model the underlying data distribution.
Source: Original Paper
As the title of the paper suggests, the idea, "inspired by non-equilibrium statistical physics, is to systematically and slowly destroy structure in a data distribution through an iterative forward diffusion process". In physics, a "quasi-static process" is one in which the system changes while staying close to equilibrium at each step. The diffusion model mimics this idea by adding an incremental, analytically evaluable amount of noise at each forward step, which makes the reverse de-noising process possible.
“we explicitly define the probabilistic model as the endpoint of the Markov chain. Since each step in the diffusion chain has an analytically evaluable probability, the full chain can also be analytically evaluated.” - page 2
Forward Trajectory (basically performing T steps of adding noise)
q(x(t) | x(t−1)) corresponds to either Gaussian diffusion into a Gaussian distribution with identity-covariance, or binomial diffusion into an independent binomial distribution.
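In the paper's notation (roughly reproduced here; see Section 2 and Table App. 1 of the paper), the forward trajectory applies the same Markov kernel repeatedly,

q(x^{(1:T)} \mid x^{(0)}) = \prod_{t=1}^{T} q(x^{(t)} \mid x^{(t-1)}),

and in the Gaussian case each step is

q(x^{(t)} \mid x^{(t-1)}) = \mathcal{N}\!\left(x^{(t)};\; x^{(t-1)}\sqrt{1-\beta_t},\; \beta_t I\right),

where \beta_t is the small diffusion rate at step t. As a quick illustration, here is a minimal NumPy sketch of the forward noising loop; the linear schedule, the value of T, and the toy data point are arbitrary choices for this example, not taken from the paper:

```python
import numpy as np

def forward_diffusion(x0, T=1000, beta_min=1e-4, beta_max=0.02, seed=0):
    """Apply T steps of the Gaussian forward kernel
    q(x_t | x_{t-1}) = N(x_t; sqrt(1 - beta_t) * x_{t-1}, beta_t * I)."""
    rng = np.random.default_rng(seed)
    betas = np.linspace(beta_min, beta_max, T)  # linear noise schedule (arbitrary choice)
    x = np.asarray(x0, dtype=float)
    trajectory = [x]
    for beta in betas:
        noise = rng.standard_normal(x.shape)
        x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * noise
        trajectory.append(x)
    return trajectory  # trajectory[-1] is approximately an isotropic Gaussian sample

# Toy usage: a 3-dimensional "data point" is gradually destroyed into noise.
traj = forward_diffusion([2.0, -1.0, 0.5])
print(traj[0], traj[-1])
```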
Reverse Trajectory (reverse process of de-noising)
If we take a closer look:
From Table App. 1
We see how the conditional probability for time step t-1 is computed from a Gaussian (denoted N) whose mean and covariance are functions of the state at time step t. This then gives us the probability distribution at the corresponding time step for the next iteration.
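Putting this together, the reverse (generative) trajectory in the paper's notation is, roughly,

p(x^{(0:T)}) = p(x^{(T)}) \prod_{t=1}^{T} p(x^{(t-1)} \mid x^{(t)}),

with p(x^{(T)}) a simple noise distribution and, per Table App. 1, a learned Gaussian kernel at each step,

p(x^{(t-1)} \mid x^{(t)}) = \mathcal{N}\!\left(x^{(t-1)};\; f_\mu(x^{(t)}, t),\; f_\Sigma(x^{(t)}, t)\right),

where f_\mu and f_\Sigma are the mean and covariance functions learned during training.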
Model Probability
The probability of the generative model for the data distribution is:
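(reconstructed from the paper, up to notation)

p(x^{(0)}) = \int dx^{(1:T)}\; p(x^{(0:T)})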
This integral is not tractable. Instead, inspired by Annealed Importance Sampling, we compute the relative probability of the forward and reverse trajectories, averaged over the forward process.
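Averaging over the forward trajectory, this becomes (again reconstructed from the paper)

p(x^{(0)}) = \int dx^{(1:T)}\; q(x^{(1:T)} \mid x^{(0)})\; p(x^{(T)}) \prod_{t=1}^{T} \frac{p(x^{(t-1)} \mid x^{(t)})}{q(x^{(t)} \mid x^{(t-1)})}.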
Training
In the end, most machine learning problems come down to formulating a proper optimization problem.
Equation (10)
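As stated in the paper (up to notation),

L = \int dx^{(0)}\; q(x^{(0)}) \log p(x^{(0)}).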
Here, q(x(0)) is the true distribution of the initial data x(0), and p(x(0)) is the model's estimated probability density of the initial data. dx(0) denotes the infinitesimal element of the data space of the original data x(0), over which we integrate.
We want to perform traditional maximum log-likelihood training, maximizing the model log likelihood log p(x(0)) under the true distribution q(x(0)). In the continuous case, this becomes an integral over the dx(0) elements.
Equation (11)
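Substituting the averaged form of p(x^{(0)}) from the previous section gives, roughly,

L = \int dx^{(0)}\; q(x^{(0)}) \log \left[ \int dx^{(1:T)}\; q(x^{(1:T)} \mid x^{(0)})\; p(x^{(T)}) \prod_{t=1}^{T} \frac{p(x^{(t-1)} \mid x^{(t)})}{q(x^{(t)} \mid x^{(t-1)})} \right].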
Dissecting this expansion, we get:
Forward Step
x(1:T) denotes a possible path of intermediate states up to time step T,
q(x(1:T) | x(0)) gives the weight, or likelihood, of that possible path under the forward process.
By integrating over all these possible paths, we aggregate their contributions, which essentially gives the equation for the forward process.
Reverse
The second portion of the equation resembles the forward step provided earlier.
Given this, we now have an additional ratio term, p(x(t-1) | x(t)) / q(x(t) | x(t-1)), which indicates how well each reverse step matches the corresponding forward step.
In summary, this expansion integrates over all possible forward paths to reach the final noisy state x(T), and then reverses back through the chain to obtain p(x(0)), the model's estimate of the initial data distribution.
Equation (12)
Equation (11) is still quite complicated, so we apply Jensen’s inequality for further simplification.
Now recall the application of Jensen's inequality to expected values:
Jensen's Inequality
f(E[X]) ≤ E[f(X)] for a convex function f.
Since the logarithm is in fact concave, the inequality is reversed, and applying it to equation (11) we use
f(E[X]) ≥ E[f(X)],
where we achieve
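the lower bound K, roughly as stated in the paper:

L \geq K = \int dx^{(0:T)}\; q(x^{(0:T)}) \log \left[ p(x^{(T)}) \prod_{t=1}^{T} \frac{p(x^{(t-1)} \mid x^{(t)})}{q(x^{(t)} \mid x^{(t-1)})} \right].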
Equation (13) and (14)
Recall that the KL divergence, which measures the difference between two probability distributions over the same variable, is given by (written as an expected value):
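D_{KL}(q \,\|\, p) = \mathbb{E}_{x \sim q}\!\left[ \log \frac{q(x)}{p(x)} \right] = \int q(x) \log \frac{q(x)}{p(x)}\, dx.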
Now that we have obtained the lower bound in equation (12), we rewrite it in terms of KL divergences to obtain the expression for K in equation (13).
Training proceeds in the direction of maximizing the model log likelihood.
Once again, according to Jensen's inequality, L has a lower bound.
As per the appendix, K can be expressed as a combination of KL divergence and entropies:
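Roughly, as in equation (14) of the paper,

K = -\sum_{t=2}^{T} \int dx^{(0)}\, dx^{(t)}\; q(x^{(0)}, x^{(t)})\; D_{KL}\!\left( q(x^{(t-1)} \mid x^{(t)}, x^{(0)}) \,\big\|\, p(x^{(t-1)} \mid x^{(t)}) \right) + H_q(X^{(T)} \mid X^{(0)}) - H_q(X^{(1)} \mid X^{(0)}) - H_p(X^{(T)}).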
Since the entropies and KL divergences can be computed analytically, K is computable. The equality holds when the forward and reverse distributions are the same; thus, if the step size β_t is sufficiently small, L can be considered almost equal to K.
Learning to find the reverse Markov transition is equivalent to maximizing this lower bound.
The choice of the noise schedule β_t is crucial to the model's performance. For Gaussian diffusion, the schedule β_(2...T) was determined through gradient ascent on K, while β_1 was kept fixed to a small constant to prevent overfitting.
Modified Marginal Distributions
Computing a posterior (the conditional distribution that results from combining a prior with new evidence) normally requires multiplying the model distribution by a second distribution and re-normalizing, which is costly. The authors therefore propose the use of modified marginal distributions.
In a diffusion model, the second distribution acts as a small perturbation to each step, which keeps this procedure simple.
Each intermediate distribution is multiplied by a corresponding function r(x(t)) and normalized; the tilde denotes the modified trajectory that accounts for this perturbation.
Thus, with the modified marginal distributions, the equilibrium condition that the reverse diffusion kernel must satisfy changes from the original condition on p(x(t)) to the corresponding condition on the perturbed distribution p̃(x(t)). In order to satisfy the perturbed condition (equation (20)), the paper writes down a kernel proportional to the original reverse kernel weighted by r, and then normalizes it to obtain the modified reverse kernel of equation (22), as reconstructed below.
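For reference, a rough reconstruction of the corresponding equations from Section 2.5 of the paper (the exact numbering and notation may differ slightly): the modified marginals are

\tilde{p}(x^{(t)}) = \frac{1}{\tilde{Z}_t}\, p(x^{(t)})\, r(x^{(t)}),

the original and perturbed equilibrium conditions for the reverse kernel are

p(x^{(t)}) = \int dx^{(t+1)}\; p(x^{(t)} \mid x^{(t+1)})\; p(x^{(t+1)}) \quad\text{and}\quad \tilde{p}(x^{(t)}) = \int dx^{(t+1)}\; \tilde{p}(x^{(t)} \mid x^{(t+1)})\; \tilde{p}(x^{(t+1)}),

one kernel satisfying the perturbed condition is

\tilde{p}(x^{(t)} \mid x^{(t+1)}) = p(x^{(t)} \mid x^{(t+1)})\, \frac{\tilde{Z}_{t+1}\, r(x^{(t)})}{\tilde{Z}_t\, r(x^{(t+1)})},

and, since this need not be normalized, the normalized form used in the paper is

\tilde{p}(x^{(t)} \mid x^{(t+1)}) = \frac{1}{\tilde{Z}_t(x^{(t+1)})}\, p(x^{(t)} \mid x^{(t+1)})\, r(x^{(t)}).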
Applying r(x)
As we can see from the above equation (22), if r(x(t)) is sufficiently smooth, it can be treated as a small perturbation to the reverse diffusion kernel.
The authors note that this r(x(t)) was set to be constant, since the "second choice r(x(t)) makes no contribution to the starting distribution for the reverse trajectory."
Entropy
Since the forward process is known, we can derive upper and lower bounds on the conditional entropy of each reverse step, as below:
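(Reconstructed roughly from Section 2.6 of the paper:)

H_q\!\left(X^{(t)} \mid X^{(t-1)}\right) + H_q\!\left(X^{(t-1)} \mid X^{(0)}\right) - H_q\!\left(X^{(t)} \mid X^{(0)}\right) \;\leq\; H_q\!\left(X^{(t-1)} \mid X^{(t)}\right) \;\leq\; H_q\!\left(X^{(t)} \mid X^{(t-1)}\right),

so the entropy of each reverse step can be bounded using only quantities from the known forward process.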
Experiment/Results
(a) Original bark image
(b) The same image with 100x100 pixel region replaced with isotropic Gaussian Noise
(c) The recovered image with diffusion probabilistic model trained
We see that the noised region is filled in with samples from the trained distribution, recovering an image nearly identical to the original, as shown below.
The model was tested on other common image datasets as well.
Conclusion
• Most existing density estimation techniques must sacrifice modeling power in order to stay tractable and efficient, and sampling or evaluation are often extremely expensive. The GANs introduced earlier in this post also suffered from this kind of expensive, unstable modeling.
• The core of the algorithm consists of estimating the reversal of a Markov diffusion chain which maps data to a noise distribution.
As noted earlier, diffusion models are not limited to generating images. They are currently being used in the BioML field to predict protein structures, for example. I am starting to see that the boundaries between different domains are becoming thinner and thinner. It is really beautiful to see how progress in one area can indirectly or unexpectedly lead to developments in a different field. I believe this is the beauty of science: when a researcher places a tool in the idea bank, thousands of other researchers are able to make use of it. I am working hard to become one of them.
Reviewed by Metown Corp. (Sangbin Jeon, PhD Researcher)
References
Some references for further reading and understanding:
• Ho et al. (2020). Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33, 6840-6851.
• Luo, C. (2022). Understanding diffusion models: A unified perspective. arXiv preprint arXiv:2208.11970.
• Meng et al. (2023). SIGGRAPH 2023 Course on Diffusion Models. SIGGRAPH '23 Courses, August 06-10, 2023, ACM, New York, NY, USA, 113 pages.