An Overview of Diffusion Models
Diffusion models have proven to be a highly promising approach to image generation. They treat image generation as a pair of complementary processes: the forward process, which transforms a complex data distribution into a known prior distribution (typically a standard normal distribution) by gradually injecting noise; and the reverse process, which transforms the prior distribution back into the complex data distribution by gradually removing the noise.
The unified forward process — from a discrete perspective
In most cases, the forward process, also known as the noise schedule, does not contain any learnable parameters; you only need to define it “manually” in advance. We denote the data distribution by $p_{data}$ and the prior distribution by $p_{init}$. For any time step $t$, the noised image $x_t$ can be obtained by adding noise $\varepsilon$ ($\varepsilon \sim p_{init}$) to a real image $x_0$ ($x_0 \sim p_{data}$). We can formalize this as the following formula:
\[x_t=s(t)*x_0+\sigma(t)*\varepsilon\label{eq:discrete-forward}\]where $s(t)$ is the signal coefficient and $\sigma(t)$ is the noise coefficient. The two mainstream types of noise schedules are Variance Preserving (VP) and Variance Exploding (VE).
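As a minimal sketch of this formula (the function and argument names are illustrative, not taken from any particular codebase), a single forward step is just a weighted sum of the clean image and freshly sampled Gaussian noise:

```python
import torch

def forward_noise(x0: torch.Tensor, s_t: float, sigma_t: float) -> torch.Tensor:
    """Sample x_t = s(t) * x0 + sigma(t) * eps with eps ~ N(0, I)."""
    eps = torch.randn_like(x0)
    return s_t * x0 + sigma_t * eps
```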
VP: At any time step $t$, the forward (noising) process gradually decays the original signal while injecting matching noise so that the total variance stays constant (usually 1), which means that $Var(x_t)=s(t)^2+\sigma(t)^2=1$ (assuming the data itself has unit variance). We can rewrite the VP-type noise schedule in the DDPM format:
\[x_t=\sqrt {\bar {\alpha}_t}*x_0+ \sqrt {1-\bar {\alpha}_t} *\varepsilon\]where $s(t) = \sqrt {\bar {\alpha}_t}$ and $\sigma(t) = \sqrt {1-\bar {\alpha}_t}$. Two commonly used VP-type noise schedules are the linear schedule (DDPM [1], DDIM [2]) and the cosine schedule (iDDPM [3]).
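The sketch below shows how $\bar{\alpha}_t$ is typically constructed for these two schedules; the hyperparameters ($T=1000$, $\beta \in [10^{-4}, 0.02]$, $s=0.008$) are the commonly cited defaults, and practical implementations may differ in details such as clipping:

```python
import torch

T = 1000

# Linear schedule (DDPM): beta_t increases linearly from 1e-4 to 0.02.
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar_linear = torch.cumprod(1.0 - betas, dim=0)

# Cosine schedule (iDDPM): alpha_bar_t = f(t)/f(0) with f(t) = cos((t/T + s)/(1 + s) * pi/2)^2.
s = 0.008
ts = torch.arange(T + 1) / T
f = torch.cos((ts + s) / (1 + s) * torch.pi / 2) ** 2
alpha_bar_cosine = (f / f[0])[1:]

def vp_forward(x0: torch.Tensor, alpha_bar: torch.Tensor, step: int) -> torch.Tensor:
    """x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = torch.randn_like(x0)
    return alpha_bar[step].sqrt() * x0 + (1 - alpha_bar[step]).sqrt() * eps
```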
VE: The forward process keeps the original signal intact and continuously adds noise, so the variance eventually grows unbounded over time. We can rewrite the VE-type noise schedule in the NCSN [4] format:
\[x_t= x_0+ \sigma(t) *\varepsilon\]
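In code, a VE-style schedule boils down to a decreasing (often geometric) sequence of noise levels; the particular values below are only illustrative:

```python
import math
import torch

# A geometric sequence of noise levels from sigma_max down to sigma_min (illustrative values).
sigma_max, sigma_min, L = 50.0, 0.01, 10
sigmas = torch.exp(torch.linspace(math.log(sigma_max), math.log(sigma_min), L))

def ve_forward(x0: torch.Tensor, sigma: float) -> torch.Tensor:
    """x_t = x0 + sigma * eps: the signal is kept intact, only the noise level changes."""
    eps = torch.randn_like(x0)
    return x0 + sigma * eps
```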
From discrete perspective to continuous perspective
Diffusion models originally operate in discrete time with a fixed number of timesteps $t=1,2,\dots,T$; as $T \to \infty$, time becomes a continuous variable $t \in [0,1]$. From this point of view, Song et al. [5] present a stochastic differential equation (SDE) that unifies the existing discrete diffusion models.
Briefly speaking, the discrete noise schedule defined in Equation $\ref{eq:discrete-forward}$ can be generalized to the following SDE:
\[dx_t=f(t)x_tdt + g(t)dw_t\label{eq:sde-forward}\]The question is: given $s(t)$ and $\sigma(t)$, how can we derive the expressions for $f(t)$ and $g(t)$, and vice versa?
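Before answering that, it helps to see what this SDE looks like operationally: it can be simulated with an Euler–Maruyama discretization. The sketch below uses arbitrary toy coefficients for $f$ and $g$ (placeholders, not any specific published schedule):

```python
import torch

def simulate_forward_sde(x0, f, g, n_steps=1000, t_max=1.0):
    """Euler-Maruyama discretization of dx_t = f(t) x_t dt + g(t) dw_t."""
    dt = t_max / n_steps
    x = x0.clone()
    for i in range(n_steps):
        t = i * dt
        dw = torch.randn_like(x) * dt ** 0.5  # Wiener increment, Var(dw) = dt
        x = x + f(t) * x * dt + g(t) * dw
    return x

# Toy usage with arbitrary coefficients.
x0 = torch.randn(3, 32, 32)
xT = simulate_forward_sde(x0, f=lambda t: -0.5 * t, g=lambda t: t ** 0.5)
```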
By observing Equation $\ref{eq:discrete-forward}$, it is not difficult to find that, conditioned on $x_0$, $x_t$ follows a Gaussian distribution with mean $m(x_t)=s(t)*x_0$ and variance $var(x_t)=\sigma(t)^2$.
\[x_t \sim \mathcal N (s(t)*x_0, \sigma(t)^2*I)\]Now let’s consider the SDE in continuous form (Equation \ref{eq:sde-forward}). What are its mean and variance?
First, integrate both sides of equation \ref{eq:sde-forward} from 0 to $t$ and simplify the result.
\[\begin{align*} & x_t = x_0 + \int_0^tf(s)x_sds + \int_0^tg(s)dw_s \\ \Longrightarrow \ \ \ \ & \mathbb E(x_t|x_0) = \mathbb E(x_0|x_0) + \mathbb E (\int_0^tf(s)x_sds |x_0) + \mathbb E(\int_0^tg(s)dw_s|x_0) \\ \Longrightarrow \ \ \ \ & \mathbb E(x_t|x_0) = \mathbb E(x_0|x_0) + \int_0^tf(s)*\mathbb E (x_s|x_0)*ds + \int_0^tg(s)*\mathbb E(dw_s|x_0) \end{align*}\]where $\mathbb E(x_t|x_0)$ is the mean we need to compute, denoted as $m(x_t)$. According to the properties of the Wiener process, $\mathbb E(dw_s|x_0) = 0$. Taking the derivative of both sides of the above equation with respect to $t$ and simplifying, we get
\[m'(x_t) = f(t)*m(x_t)\label{eq:mean_sde}\]Equation \ref{eq:mean_sde} is a simple linear ordinary differential equation; solving it directly yields $m(x_t)=e^{\int_0^tf(s)ds}*x_0$ (the initial value is $m(x_0)=x_0$).
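As a quick numerical sanity check of this closed-form mean (a sketch with an arbitrary illustrative drift $f(t)=-2t$, chosen only so that $e^{\int_0^t f(s)ds}=e^{-t^2}$ is easy to write down):

```python
import math

f = lambda t: -2.0 * t            # illustrative drift, so exp(∫_0^t f) = exp(-t^2)
x0, T, n = 3.0, 1.0, 100_000
dt = T / n

m = x0
for i in range(n):
    m += f(i * dt) * m * dt       # Euler step for m'(t) = f(t) * m(t)

print(m, math.exp(-T ** 2) * x0)  # both should be close to the closed-form mean
```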
Next, we need to derive the variance of the SDE. According to Itô’s lemma [6],
\[\begin{align*} & dx_t^2 = 2x_tdx_t + g^2(t)dt \\ \Longrightarrow \ \ \ \ & dx_t^2=2x_t*(f(t)x_tdt+g(t)dw_t)+g^2(t)dt \\ \Longrightarrow \ \ \ \ & dx_t^2=(2f(t)x_t^2+g^2(t))dt+2x_tg(t)dw_t \end{align*}\]Integrate both sides from 0 to $t$ and simplify the result.
\[\begin{align*} & x_t^2 = x_0^2 + \int_0^t(2f(s)x_s^2+g^2(s))ds + \int_0^t2x_sg(s)dw_s \\ \Longrightarrow \ \ \ \ & \mathbb E(x_t^2|x_0) = \mathbb E(x_0^2|x_0) + \mathbb E (\int_0^t(2f(s)x_s^2+g^2(s))ds |x_0) + \mathbb E(\int_0^t2x_sg(s)dw_s|x_0) \\ \Longrightarrow \ \ \ \ & \mathbb E(x_t^2|x_0) = \mathbb E(x_0^2|x_0) + \int_0^t(2*f(s)*\mathbb E (x_s^2|x_0)+g^2(s))ds + \int_0^t2g(s)*\mathbb E(x_sdw_s|x_0) \\ \Longrightarrow \ \ \ \ & \mathbb E(x_t^2|x_0) = \mathbb E(x_0^2|x_0) + \int_0^t(2*f(s)*\mathbb E (x_s^2|x_0)+g^2(s))ds \end{align*}\]where the stochastic integral vanishes because the Wiener increment $dw_s$ is independent of $x_s$, so $\mathbb E(x_sdw_s|x_0)=0$. Set $m_2(x_t)= \mathbb E(x_t^2|x_0)$, take the derivative of both sides of the above equation with respect to $t$, and simplify:
\[m_2'(x_t) = 2f(t)m_2(x_t)+g^2(t)\]
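This second-moment ODE can be sanity-checked numerically by comparing it against a Monte-Carlo simulation of the forward SDE; the coefficients $f(t)=-t$ and $g(t)=1$ below are arbitrary illustrative choices:

```python
import torch

f = lambda t: -t                          # illustrative drift coefficient
g = lambda t: 1.0                         # illustrative diffusion coefficient
x0, T, n_steps, n_paths = 2.0, 1.0, 1000, 200_000
dt = T / n_steps

x = torch.full((n_paths,), x0)            # SDE paths, all started at x0
m2 = x0 ** 2                              # ODE state, m2(0) = x0^2
for i in range(n_steps):
    t = i * dt
    dw = torch.randn(n_paths) * dt ** 0.5
    x = x + f(t) * x * dt + g(t) * dw             # Euler-Maruyama step for the SDE
    m2 = m2 + (2 * f(t) * m2 + g(t) ** 2) * dt    # Euler step for m2' = 2 f m2 + g^2

print((x ** 2).mean().item(), m2)         # Monte-Carlo estimate vs. the ODE solution
```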
Reverse Process
PF-ODE
References
1. Ho J, Jain A, Abbeel P. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 2020, 33: 6840-6851.
2. Song J, Meng C, Ermon S. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.
3. Nichol A Q, Dhariwal P. Improved denoising diffusion probabilistic models. International Conference on Machine Learning. PMLR, 2021: 8162-8171.
4. Song Y, Ermon S. Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems, 2019, 32.
5. Song Y, Sohl-Dickstein J, Kingma D P, et al. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020.
6. Itô K. On a formula concerning stochastic differentials. Nagoya Mathematical Journal, 1951, 3: 55-65.