An Overview of Diffusion Models


Diffusion models have been shown to be a highly promising approach to image generation. They frame generation as a pair of complementary processes: the forward process, which transforms the complex data distribution into a known prior distribution (typically a standard normal distribution) by gradually injecting noise, and the reverse process, which transforms the prior distribution back into the data distribution by gradually removing the noise.


The unified forward process — from a discrete perspective

In most cases, the forward process, also known as the noise schedule, does not contain any learnable parameters; you only need to define it "manually" in advance. We assume that the data distribution is $p_{data}$, while the prior distribution is $p_{init}$. For any time step $t$, the noised image $x_t$ can be obtained by adding noise $\varepsilon$ ($\varepsilon \sim p_{init}$) to a real image $x_0$ ($x_0 \sim p_{data}$). We can formalize this as the following formula:

\[x_t=s(t)*x_0+\sigma(t)*\varepsilon\label{eq:1}\]

Where $s(t)$ represents the signal coefficient, and $\sigma(t)$ represents the noise coefficient. The two mainstream types of noise schedules are Variance Preserving (VP) and Variance Exploding (VE).

VP: At any time step $t$, the forward (noising) process gradually decays the original signal while injecting matching noise so that the total variance stays constant (usually 1), which means that $Var(x_t)=s(t)^2+\sigma(t)^2=1$. We can rewrite VP-type noise schedules in the DDPM format:

\[x_t=\sqrt {\bar {\alpha_t}}*x_0+ \sqrt {1-\bar {\alpha_t}} *\varepsilon\label{eq:2}\]

Where $s(t) = \sqrt {\bar {\alpha_t}}$ and $\sigma(t) = \sqrt {1-\bar {\alpha_t}}$. Two commonly used VP-type noise schedules are the linear schedule (DDPM [1] and DDIM [2]) and the cosine schedule (iDDPM [3]).
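For concreteness, here is a minimal sketch of the two VP schedules in numpy, using the commonly quoted default constants from the DDPM and iDDPM papers (exact values vary between implementations):

```python
import numpy as np

T = 1000  # number of discrete timesteps

# Linear schedule (DDPM): beta_t increases linearly from 1e-4 to 0.02.
betas_linear = np.linspace(1e-4, 0.02, T)
alpha_bar_linear = np.cumprod(1.0 - betas_linear)

# Cosine schedule (iDDPM): alpha_bar_t follows a squared-cosine curve.
offset = 0.008  # small offset that keeps the schedule well behaved near t = 0
steps = np.arange(T + 1) / T
f = np.cos((steps + offset) / (1 + offset) * np.pi / 2) ** 2
alpha_bar_cosine = f[1:] / f[0]

# In both cases s(t) = sqrt(alpha_bar_t) decays towards 0 while
# sigma(t) = sqrt(1 - alpha_bar_t) grows towards 1 (variance preserving).
print(alpha_bar_linear[[0, T // 2, T - 1]])
print(alpha_bar_cosine[[0, T // 2, T - 1]])
```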

VE: The forward process keeps the original signal intact and continuously adds noise, so the variance grows unbounded over time. We can rewrite VE-type noise schedules in the NCSN [4] format:

\[x_t= x_0+ \sigma(t) *\varepsilon\label{eq:3}\]
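Both cases are instances of equation \ref{eq:1} with different choices of $s(t)$ and $\sigma(t)$. A minimal sketch with toy data and purely illustrative coefficient values:

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 3, 32, 32))   # a toy batch standing in for real images
eps = rng.standard_normal(x0.shape)        # noise drawn from the prior p_init = N(0, I)

def forward_vp(x0, eps, alpha_bar_t):
    """VP noising: x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    return np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps

def forward_ve(x0, eps, sigma_t):
    """VE noising: x_t = x0 + sigma_t * eps (signal kept intact, variance explodes)."""
    return x0 + sigma_t * eps

xt_vp = forward_vp(x0, eps, alpha_bar_t=0.5)   # total variance stays close to 1
xt_ve = forward_ve(x0, eps, sigma_t=10.0)      # variance grows with sigma_t
print(xt_vp.var(), xt_ve.var())
```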

From the discrete perspective to the continuous perspective

Diffusion models originally operate in discrete time, with a fixed number of timesteps $t=1,2,\dots,T$; as $T \to \infty$, time becomes a continuous variable $t \in [0,1]$. From this point of view, Song et al. [5] present a stochastic differential equation (SDE) framework that unifies the existing discrete diffusion models.

Briefly speaking, the discrete noise schedule defined in equation \ref{eq:1} can be expressed in continuous form as

\[dx_t=f(t)x_tdt + g(t)dw_t\label{eq:4}\]

The question is: given $s(t)$ and $\sigma(t)$, how can we derive the expressions for $f(t)$ and $g(t)$, and vice versa?

By observing Equation $\ref{eq:1}$ , it is not difficult to find that $x_t$ follows a Gaussian distribution with mean $m(x_t)=s(t)*x_0$ and variance $var(x_t)=\sigma(t)^2$.

\[x_t \sim \mathcal N (s(t)*x_0, \sigma(t)^2*I)\label{eq:5}\]

Now let’s consider the SDE in continuous form (equation \ref{eq:4}). What are its mean and variance?

First, integrate both sides of equation \ref{eq:4} from 0 to $t$ and simplify the result.

\[\begin{align} & x_t = x_0 + \int_0^tf(s)x_sds + \int_0^tg(s)dw_s\label{eq:6} \\ \Longrightarrow \ \ \ \ & \mathbb E(x_t|x_0) = \mathbb E(x_0|x_0) + \mathbb E (\int_0^tf(s)x_sds |x_0) + \mathbb E(\int_0^tg(s)dw_s|x_0)\label{eq:7} \\ \Longrightarrow \ \ \ \ & \mathbb E(x_t|x_0) = \mathbb E(x_0|x_0) + \int_0^tf(s)*\mathbb E (x_s|x_0)*ds + \int_0^tg(s)*\mathbb E(dw_s|x_0)\label{eq:8} \end{align}\]

Here $\mathbb E(x_t|x_0)$ is the mean we need to compute, denoted $m(x_t)$. According to the properties of the Wiener process, $\mathbb E(dw_s|x_0) = 0$. Take the derivative of both sides of the above equation with respect to $t$ and simplify:

\[m'(x_t) = f(t)*m(x_t)\label{eq:9}\]

Equation \ref{eq:9} is a simple linear ordinary differential equation; solving it directly with the initial value $m(x_0)=x_0$ yields $m(x_t)=e^{\int_0^tf(s)ds}*x_0$. Next, we need to derive the variance of the SDE. According to Itô's lemma [6],

\[\begin{align} & dx_t^2 = 2x_tdx_t + g^2(t)dt\label{eq:10} \\ \Longrightarrow \ \ \ \ & dx_t^2=2x_t*(f(t)x_tdt+g(t)dw_t)+g^2(t)dt\label{eq:11} \\ \Longrightarrow \ \ \ \ & dx_t^2=(2f(t)x_t^2+g^2(t))dt+2x_tg(t)dw_t\label{eq:12} \end{align}\]

Integrate both sides from 0 to $t$ and simplify the result.

\[\begin{align} & x_t^2 = x_0^2 + \int_0^t(2f(s)x_s^2+g^2(s))ds + \int_0^t 2x_sg(s)dw_s\label{eq:13} \\ \Longrightarrow \ \ \ \ & \mathbb E(x_t^2|x_0) = \mathbb E(x_0^2|x_0) + \mathbb E (\int_0^t(2f(s)x_s^2+g^2(s))ds |x_0) + \mathbb E(\int_0^t 2x_sg(s)dw_s|x_0)\label{eq:14} \\ \Longrightarrow \ \ \ \ & \mathbb E(x_t^2|x_0) = \mathbb E(x_0^2|x_0) + \int_0^t(2f(s)\mathbb E (x_s^2|x_0)+g^2(s))ds + \int_0^t 2g(s)\mathbb E(x_sdw_s|x_0)\label{eq:15} \\ \Longrightarrow \ \ \ \ & \mathbb E(x_t^2|x_0) = \mathbb E(x_0^2|x_0) + \int_0^t(2f(s)\mathbb E (x_s^2|x_0)+g^2(s))ds\label{eq:16} \end{align}\]

The stochastic-integral term vanishes because the increment $dw_s$ has zero mean and is independent of $x_s$. Set $m_2(x_t)= \mathbb E(x_t^2|x_0)$, take the derivative of both sides of the above equation with respect to $t$, and simplify:

\[m_2'(x_t) = 2f(t)m_2(x_t)+g^2(t)\label{eq:17}\]

Recall the variance formula $var(x_t)=m_2(x_t)-m(x_t)^2$. Taking derivatives on both sides gives $var'(x_t)=m_2'(x_t)-2m(x_t)m'(x_t)$; substituting $m_2'(x_t)$ and $m'(x_t)$ into this equation and simplifying, we obtain:

\[\begin{align} & var'(x_t)=m_2'(x_t)-2m(x_t)m'(x_t)\label{eq:18} \\ \Longrightarrow \ \ \ \ & var'(x_t)=2*f(t)*m_2(x_t)+g^2(t)-2f(t)*m^2(x_t)\label{eq:19} \\ \Longrightarrow \ \ \ \ & var'(x_t)=2*f(t)*(m_2(x_t) - m^2(x_t))+g^2(t)\label{eq:20} \\ \Longrightarrow \ \ \ \ & var'(x_t)=2*f(t)*var(x_t)+g^2(t)\label{eq:21} \end{align}\]

Equation \ref{eq:21} is a first-order linear ODE. Define $\phi(t)=e^{\int_0^tf(s)ds}$; multiplying both sides by the integrating factor $\phi^{-2}(t)=e^{-2\int_0^tf(s)ds}$ and integrating (with the initial value $var(x_0)=0$) gives the solution.

\[var(x_t)=\phi^2(t)*\int_0^t \frac{g^2(s)}{\phi^2(s)}ds\label{eq:22}\]

Now we have obtained the mean and variance in the form of a continuous SDE, i.e., $x_t$ satisfies:

\[x_t \sim \mathcal N (\phi(t)*x_0, \phi^2(t)*\int_0^t \frac{g^2(s)}{\phi^2(s)}ds*I),\ \ \ \ \phi(t)=e^{\int_0^tf(s)ds}\label{eq:23}\]
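As a quick numerical sanity check of equation \ref{eq:23}, one can simulate the forward SDE with Euler-Maruyama and compare the empirical mean and variance of $x_t$ against the closed-form expressions. The sketch below uses an arbitrary toy choice $f(t) = -\tfrac{1}{2}$, $g(t) = 1$ purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy coefficients, chosen only for illustration (an OU-type process).
f = lambda t: -0.5
g = lambda t: 1.0

x0, T, n_steps, n_paths = 2.0, 1.0, 1000, 100_000
dt = T / n_steps

# Euler-Maruyama simulation of dx = f(t) x dt + g(t) dw.
x = np.full(n_paths, x0)
for i in range(n_steps):
    t = i * dt
    x += f(t) * x * dt + g(t) * np.sqrt(dt) * rng.standard_normal(n_paths)

# Closed-form mean and variance from equation (23):
# phi(t) = exp(int_0^t f) = exp(-t/2),  var = phi^2 * int_0^t g^2/phi^2 = 1 - exp(-t).
phi_T = np.exp(-0.5 * T)
print("empirical mean/var:  ", x.mean(), x.var())
print("closed-form mean/var:", phi_T * x0, 1.0 - np.exp(-T))
```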

Comparing equation \ref{eq:5} with equation \ref{eq:23}: given $s(t)$ and $\sigma(t)$, we can derive the corresponding SDE expression, which satisfies:

\[f(t)=\frac{s'(t)}{s(t)}, \ \ \ \ \ \ g^2(t)=2*(\sigma(t)\sigma'(t)-\frac{s'(t)}{s(t)}*\sigma^2(t))\label{eq:24}\]

Similarly, given an SDE expression, we can also derive the corresponding discrete forward process, which satisfies:

\[s(t)=\phi(t)=e^{\int_0^tf(s)ds}, \ \ \ \ \ \ \sigma^2(t)=\phi^2(t)*\int_0^t \frac{g^2(s)}{\phi^2(s)}ds\label{eq:25}\]

Following these conclusions, we can write the corresponding SDE expressions for DDPM and NCSN:

| | DDPM | NCSN |
| --- | --- | --- |
| Discrete forward process | $x_t=\sqrt {\bar {\alpha_t}}*x_0+ \sqrt {1-\bar {\alpha_t}} *\varepsilon$ | $x_t= x_0+\sigma_t*\varepsilon$ |
| $s(t),\ \ \sigma(t)$ | $s(t)=\sqrt {\bar {\alpha_t}}, \ \ \sigma(t)=\sqrt {1-\bar {\alpha_t}}$ | $s(t)=1, \ \ \sigma(t)=\sigma_t$ |
| $f(t),\ \ g(t)$ | $f(t)=-\frac{1}{2}\beta_t,\ \ g(t)=\sqrt {\beta_t}$ | $f(t)=0,\ \ g(t)=\sqrt {\frac{d\sigma^2_t}{dt}}$ |
| SDE | $dx_t=-\frac{1}{2}\beta_tx_tdt+\sqrt {\beta_t}dw_t$ | $dx_t=\sqrt {\frac{d\sigma^2_t}{dt}}dw_t$ |
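The table entries can be checked mechanically with equation \ref{eq:24}. For the VP case, writing $\bar\alpha(t)=e^{-\int_0^t\beta(u)du}$ so that $s(t)=\sqrt{\bar\alpha(t)}$ and $\sigma(t)=\sqrt{1-\bar\alpha(t)}$, a short symbolic sketch (using sympy) recovers $f(t)=-\tfrac12\beta(t)$ and $g^2(t)=\beta(t)$:

```python
import sympy as sp

t, u = sp.symbols('t u', positive=True)
beta = sp.Function('beta')

# Continuous-time VP schedule: alpha_bar(t) = exp(-int_0^t beta(u) du),
# so s(t) = sqrt(alpha_bar(t)) and sigma(t) = sqrt(1 - alpha_bar(t)).
alpha_bar = sp.exp(-sp.Integral(beta(u), (u, 0, t)))
s = sp.sqrt(alpha_bar)
sigma = sp.sqrt(1 - alpha_bar)

# Equation (24): f = s'/s and g^2 = 2*(sigma*sigma' - (s'/s)*sigma^2).
f = sp.simplify(sp.diff(s, t) / s)
g2 = sp.simplify(2 * (sigma * sp.diff(sigma, t) - f * sigma**2))

print(f)   # -beta(t)/2   -> matches the DDPM column of the table
print(g2)  # beta(t)      -> so g(t) = sqrt(beta(t))
```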

Reverse Process

The reverse process is the opposite of the forward process: it starts by sampling an initial point $x_T$ from the prior distribution $p_{init}$ ($x_T \sim p_{init}$) and generates samples $x_0 \sim p_{data}$ by continuously denoising.

According to Anderson [7], the reverse of a diffusion process is also a diffusion process, running backwards in time and given by the reverse-time SDE:

\[dx_t=\big( f(t)x_t-g^2(t)\nabla_{x_t}\log p(x_t) \big)dt + g(t)d\bar w_t\label{eq:26}\]

Here, $\bar w_t$, like $w_t$, is a standard Wiener process, except that time runs backwards from $T$ to 0 (or from 1 to 0 from the continuous perspective). $\nabla_{x_t}\log p(x_t)$ is known as the score: the gradient (with respect to $x_t$) of the log-probability density at time step $t$.

If we know the score at every time step $t$, we can generate new samples by solving the SDE in equation \ref{eq:26}. Song et al. [5] further point out that equation \ref{eq:26} has an equivalent ODE expression, also known as the probability flow ODE (PF-ODE):

\[dx_t=\big( f(t)x_t-\frac{1}{2}g^2(t)\nabla_{x_t}\log p(x_t) \big)dt\label{eq:27}\]

Be careful: the equivalence does not mean that the result obtained by solving equation \ref{eq:26} is the same as that obtained by solving equation \ref{eq:27}. Instead, it means that, at any time step $t$, the solution of equation \ref{eq:26} (denoted $x_t^1$) has the same marginal probability density as the solution of equation \ref{eq:27} (denoted $x_t^2$), i.e., $x_t^1 \sim p(x_t)$ and $x_t^2 \sim p(x_t)$.

**Advantage:** Thanks to the probability-flow ODE formulation, one can generate samples by integrating a deterministic ODE instead of simulating the full stochastic differential equation. Because the ODE trajectory is fully determined by its initial value, it’s easier to debug and often requires fewer function evaluations than the corresponding SDE sampler, so sampling via the ODE can be substantially faster. By contrast, SDE sampling injects fresh noise at each step, which naturally induces greater randomness (and thus diversity) in the generated outputs.
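To make the two samplers concrete, here is a minimal sketch of an Euler-Maruyama solver for the reverse-time SDE (equation \ref{eq:26}) and a plain Euler solver for the PF-ODE (equation \ref{eq:27}); `score_fn`, `f`, and `g` are hypothetical placeholders for a trained score network and the chosen schedule coefficients:

```python
import numpy as np

def sample_reverse_sde(score_fn, f, g, x_T, n_steps=1000, rng=None):
    """Euler-Maruyama integration of dx = [f(t)x - g(t)^2 * score] dt + g(t) dw_bar,
    stepping time backwards from t = 1 to t = 0 (fresh noise every step)."""
    rng = rng if rng is not None else np.random.default_rng()
    x, dt = x_T.copy(), 1.0 / n_steps
    for i in range(n_steps, 0, -1):
        t = i / n_steps
        drift = f(t) * x - g(t) ** 2 * score_fn(t, x)
        x = x - drift * dt + g(t) * np.sqrt(dt) * rng.standard_normal(x.shape)
    return x

def sample_pf_ode(score_fn, f, g, x_T, n_steps=100):
    """Euler integration of the probability-flow ODE
    dx = [f(t)x - 0.5 * g(t)^2 * score] dt, from t = 1 to t = 0 (deterministic)."""
    x, dt = x_T.copy(), 1.0 / n_steps
    for i in range(n_steps, 0, -1):
        t = i / n_steps
        drift = f(t) * x - 0.5 * g(t) ** 2 * score_fn(t, x)
        x = x - drift * dt
    return x

# Usage sketch: x_T ~ p_init, e.g. a standard normal batch for a VP schedule.
# x_0 = sample_pf_ode(score_fn, f, g, np.random.standard_normal((16, 3, 32, 32)))
```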

Training Target and SDE Solving

In the previous sections, we gained a preliminary understanding of the forward and reverse processes of diffusion models. In order to generate samples, the core task is to obtain the score $\nabla_{x_t}\log p(x_t)$ at any time $t$. Learning the score function at every time $t$ is called score matching (SM): a neural network $s_{\theta}(t, x_t)$ is trained to approximate the true score by minimizing

\[\mathbb E_{t \sim U(0,1)} \big|\big|s_{\theta}(t, x_t) - \nabla_{x_t}\log p(x_t) \big|\big|^2\]

However, solving this objective directly is intractable, because the true marginal score $\nabla_{x_t}\log p(x_t)$ is unknown. In practice, one instead uses denoising score matching, which regresses $s_{\theta}(t, x_t)$ onto the tractable conditional score $\nabla_{x_t}\log p(x_t|x_0) = -\varepsilon/\sigma(t)$, as sketched below.
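A minimal PyTorch-style sketch of one denoising score matching objective under the generic schedule of equation \ref{eq:1}; `score_net`, `s_fn`, and `sigma_fn` are hypothetical placeholders for the score network and the chosen $s(t)$ and $\sigma(t)$, and the $\sigma^2(t)$ weighting is just one common choice:

```python
import torch

def dsm_loss(score_net, s_fn, sigma_fn, x0):
    """Denoising score matching: regress the network onto the conditional score
    grad_x log p(x_t | x_0) = -eps / sigma(t), which is tractable,
    unlike the marginal score in the plain SM objective."""
    b = x0.shape[0]
    t = torch.rand(b, device=x0.device)                       # t ~ U(0, 1)
    s_t = s_fn(t).view(b, *([1] * (x0.dim() - 1)))
    sigma_t = sigma_fn(t).view(b, *([1] * (x0.dim() - 1)))
    eps = torch.randn_like(x0)
    x_t = s_t * x0 + sigma_t * eps                            # equation (1)
    target = -eps / sigma_t
    # sigma^2 weighting keeps the loss scale comparable across t.
    return (sigma_t ** 2 * (score_net(t, x_t) - target) ** 2).mean()
```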

Conclusion

In this post, we reviewed the forward and reverse processes of diffusion models from a unified perspective: discrete VP and VE noise schedules correspond to continuous SDEs, and sampling amounts to solving the reverse-time SDE or the probability flow ODE once the score function is known.

References


  1. Ho J, Jain A, Abbeel P. Denoising diffusion probabilistic models[J]. Advances in neural information processing systems, 2020, 33: 6840-6851. 

  2. Song J, Meng C, Ermon S. Denoising diffusion implicit models[J]. arXiv preprint arXiv:2010.02502, 2020. 

  3. Nichol A Q, Dhariwal P. Improved denoising diffusion probabilistic models[C]//International conference on machine learning. PMLR, 2021: 8162-8171. 

  4. Song Y, Ermon S. Generative modeling by estimating gradients of the data distribution[J]. Advances in neural information processing systems, 2019, 32. 

  5. Song Y, Sohl-Dickstein J, Kingma D P, et al. Score-based generative modeling through stochastic differential equations[J]. arXiv preprint arXiv:2011.13456, 2020.

  6. Itô K. On a formula concerning stochastic differentials[J]. Nagoya Mathematical Journal, 1951, 3: 55-65. 

  7. Anderson B D O. Reverse-time diffusion equation models[J]. Stochastic Processes and their Applications, 1982, 12(3): 313-326.