A Panoramic View of Diffusion Model Sampling: From Classic Theory to Frontier Research
📚 Table of Contents
- 1. Introduction
- 2. A Tale of Two Paradigms: The Dual Origins of Diffusion Sampling
- 3. The First Leap in Efficiency: Breaking the Markovian Chain with DDIM
- 4. A Confluence of Paths: The Unifying Framework of SDEs and ODEs
- 5. Solver Engineering: Tailor-Made Algorithms for the Probability Flow ODE
- References
This article systematically reviews the development of sampling techniques in diffusion models. Starting from the two parallel technical routes of score-based models and DDPM, we explain how they achieve theoretical unification through the SDE/ODE framework. On this basis, we delve into the efficient samplers designed for solving the probability flow ODE (PF-ODE), tracing the evolution from the limitations of classical numerical methods to dedicated solvers such as DPM-Solver. The article then shifts its perspective to innovations in the sampling paradigm itself, covering cutting-edge techniques such as consistency models and sampling distillation aimed at single-step/few-step generation. Finally, we discuss practical strategies such as hybrid sampling and guidance, conduct a comprehensive comparison of existing methods, and look ahead to future research directions such as learnable samplers and hardware-aware optimization.
1. Introduction
In the landscape of modern artificial intelligence, diffusion models have emerged as the undisputed titans of generative tasks. From the breathtaking imagery of Stable Diffusion and Midjourney to the burgeoning field of AI-driven video and audio synthesis, these models have redefined the boundaries of digital creation. At the heart of their power lies a deceptively simple two-act structure: a training phase, where the model learns to systematically denoise corrupted data, and a generation phase, where this learned ability is inverted to create new data from pure noise.
1.1. The Art of Generation: Sampling’s Core Role in Diffusion Models
This article is dedicated to the second act: the process of sampling. If the trained model represents a repository of abstract knowledge about the data distribution—the “what”—then the sampling algorithm is the procedural engine that determines the “how.” It is the chisel that methodically carves a masterpiece from a block of statistical marble. The sampler iteratively refines a random noise tensor, navigating a high-dimensional trajectory through a learned probability landscape, until it converges upon a coherent, high-fidelity artifact.
Therefore, sampling is not merely a passive, final step. It is an active, algorithmic process that profoundly influences the final output’s quality, diversity, and, most critically, the efficiency with which it is produced. Understanding sampling is understanding the very mechanism of creation in diffusion models.
1.2. The Eternal Trade-off: Why Sampling Methods Evolve So Rapidly
The history of diffusion model sampling is a story of a relentless quest for an ideal balance within a challenging trilemma: Speed vs. Quality vs. Diversity.
Speed (Computational Cost): The pioneering DDPM (Denoising Diffusion Probabilistic Models) demonstrated remarkable generation quality but required on the order of a thousand sequential network function evaluations (NFEs) to produce a single sample. This computational burden was a significant barrier to practical, real-time applications and iterative creative workflows. Consequently, the primary driver for a vast body of research has been the aggressive reduction of these sampling steps.
Quality (Fidelity): A faster sampler is useless if it compromises the model’s generative prowess. The goal is to reduce steps while preserving, or even enhancing, the fidelity of the output. Many methods grapple with issues like error accumulation, which can lead to blurry or artifact-laden results, especially at very low step counts. High-quality sampling means faithfully following the path dictated by the learned model.
Diversity & Stability: Sampling can be either stochastic (introducing randomness at each step) or deterministic (following a fixed path for a given initial noise). Stochastic samplers can generate a wider variety of outputs from the same starting point, while deterministic ones offer reproducibility. The choice between them is application-dependent, and the stability of the numerical methods used, especially for high-order solvers, is a critical area of research.
This perpetual negotiation between speed, quality, and diversity has fueled a Cambrian explosion of innovative sampling algorithms, each attempting to push the Pareto frontier of what is possible.
1.3. A Guide to This Article: A Journey Through Time and Theory
To navigate this complex and rapidly evolving landscape, this article adopts a structured, historical narrative. We will trace the evolution of sampling methodologies from their foundational concepts to the state-of-the-art, ensuring each new development is understood as a logical answer to the limitations of its predecessors. Our journey will proceed as follows:
- We begin by exploring the dual origins of diffusion-based generation: the continuous-time, score-based models (NCSN) with Langevin dynamics, and the discrete-time, denoising-focused models (DDPM).
- Next, we witness the first leap in efficiency with non-Markovian samplers like DDIM, which broke free from the step-by-step constraint of DDPM.
- We then arrive at a pivotal moment of theoretical unification, where the elegant framework of Stochastic Differential Equations (SDEs) and their corresponding Probability Flow ODEs (PF-ODEs) reveals that all these earlier methods are merely different discretizations of the same underlying continuous process.
- Armed with this unified ODE perspective, we will delve into solver engineering: the craft of designing specialized, high-order numerical solvers like DPM-Solver and UniPC that are tailor-made for the unique structure of diffusion ODEs.
- We then elevate our perspective to framework design with EDM, understanding how optimizing the very definition of the diffusion process itself can lead to superior sampling performance.
- Finally, we will survey the paradigm-shifting innovations aiming for the ultimate goal of single-step or few-step generation, including Consistency Models, Knowledge Distillation, and Flow Matching.
By charting this course, this article aims to provide a comprehensive and deeply interconnected understanding of diffusion model sampling, equipping the reader not only with knowledge of individual techniques but also with a robust mental model of the field’s past, present, and future.
2. A Tale of Two Paradigms: The Dual Origins of Diffusion Sampling
Before the unified theories of today, the world of diffusion-based generation was dominated by two distinct, yet parallel, schools of thought. One approached the problem from the continuous perspective of probability density gradients, while the other built a discrete, step-by-step bridge between data and noise. Understanding these two origins—Score-Based Models and Denoising Diffusion Probabilistic Models (DDPMs)—is essential to appreciating the elegant synthesis that followed.
2.1 The Continuous-State Perspective: Score-Based Generative Models
The first paradigm was rooted in a fundamental question: if we had a function that could always point us toward regions of higher data probability, could we generate data by simply “climbing the probability hill”?
2.1.1 The Core Idea: The Wisdom of Score Matching
At the heart of this approach lies a powerful mathematical object: the score function, defined as the gradient of the log-probability density of the data, $\nabla_x \log p_{\text{data}}(x)$. Intuitively, the score at any point $x$ is a vector pointing in the direction of steepest ascent in data density. To estimate the score at arbitrary inputs, we train a neural network $s_{\theta}(x)$ (a score model) to approximate it:
\[s_{\theta}(x) \approx \nabla_x \log p_{\text{data}}(x)\label{eq:1}\]The training objective minimizes the Fisher divergence between the true score field and the model's prediction:
\[\mathcal{L}(\theta) = \mathbb{E}_{x \sim p_{\text{data}}}\left[ \big\|\,s_{\theta}(x) - \nabla_x \log p_{\text{data}}(x)\,\big\|_2^2 \right]\label{eq:2}\]The challenge, however, is that we do not know the true data distribution $p_{\text{data}}(x)$. This is where Score Matching comes in 1. Hyvärinen showed that via integration by parts (under suitable boundary conditions), the objective (Equation \ref{eq:2}) can be rewritten in a form involving only the model itself:
\[\begin{align} \mathcal{L}_{\text{SM}}(\theta) & = \mathbb{E}_{x \sim p_{\text{data}}} \left[ \frac{1}{2} \| s_\theta(x) \|^2 + \nabla_x \cdot s_\theta(x) \right] \\[10pt] & \approx \frac{1}{N}\sum_{i=1}^{N} \left[ \frac{1}{2} \| s_\theta(x_i) \|^2 + \nabla_x \cdot s_\theta(x_i) \right] \end{align}\]where $\nabla_x \cdot s_\theta(x) = \text{trace}(\nabla_x s_\theta(x))$ is the divergence of the score field. However, SM does not scale to high-dimensional data, because the divergence term requires the trace of the Jacobian of the score model, which in general costs one backward pass per data dimension.
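To make this cost concrete, here is a minimal sketch of the exact score-matching objective; the `score_model` interface, the `(B, D)` batch layout, and the assumption that the network treats samples independently are all illustrative, not from the original paper:

```python
# A sketch of the exact score matching loss, assuming a hypothetical score
# network `score_model` mapping a batch of shape (B, D) to scores of the same
# shape. Computing the divergence exactly needs one backward pass per dimension.
import torch

def exact_sm_loss(score_model, x):
    x = x.requires_grad_(True)
    s = score_model(x)                          # s_theta(x), shape (B, D)
    norm_term = 0.5 * (s ** 2).sum(dim=1)       # (1/2) ||s_theta(x)||^2
    div = torch.zeros(x.shape[0], device=x.device)
    for i in range(x.shape[1]):                 # D backward passes: the bottleneck
        grad_i = torch.autograd.grad(s[:, i].sum(), x, create_graph=True)[0]
        div = div + grad_i[:, i]                # accumulate trace(nabla_x s_theta(x))
    return (norm_term + div).mean()
```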
To address this, Vincent introduced denoising score matching (DSM) 2: by adding Gaussian noise to the real data $x$, giving $\tilde{x} = x + \sigma\epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$, the score model instead predicts the score field of the noised data:
\[s_{\theta}(\tilde{x}, \sigma) \approx \nabla_{\tilde{x}} \log p_{\sigma}(\tilde{x})\label{eq:5}\]where $p_{\sigma}$ is the data distribution convolved with Gaussian noise of scale $\sigma$. It is easy to verify that matching $\nabla_{\tilde x} \log p_{\sigma}(\tilde x)$ is equivalent to matching the conditional score $\nabla_{\tilde{x}} \log p_{\sigma}(\tilde{x} \mid x)$:
\[\begin{align} & \mathbb{E}_{\tilde{x} \sim p_{\sigma}(\tilde{x})}\left[\big\|\,s_{\theta}(\tilde{x}, \sigma) - \nabla_{\tilde x} \log p_{\sigma}(\tilde x)\,\big\|_2^2\right]\label{eq:6} \\[10pt] = \quad & \mathbb{E}_{x \sim p_{\text{data}}(x)}\mathbb{E}_{\tilde{x} \sim p_{\sigma}(\tilde{x} \mid x)}\left[\big\|\,s_{\theta}(\tilde{x}, \sigma) - \nabla_{\tilde{x}} \log p_{\sigma}(\tilde{x} \mid x)\,\big\|_2^2\right] + \text{const}\label{eq:7} \end{align}\]When $\sigma$ is small enough, $p_{\text{data}}(x) \approx p_\sigma(x)$. Optimizing Formula \ref{eq:7} requires neither the true data distribution nor the expensive Jacobian trace, since the conditional score has a simple closed form:
\[\begin{align} \nabla_{\tilde{x}} \log p_{\sigma}(\tilde{x} \mid x) & = \nabla_{\tilde{x}} \log \frac{1}{\sqrt{2\pi}\sigma} {\exp}\left( -\frac{(\tilde{x}-x)^2}{2\sigma^2}\right) \\[10pt] & = -\frac{(\tilde{x}-x)}{\sigma^2} \end{align}\]
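Putting these pieces together, a minimal DSM training-loss sketch; the noise-conditional interface `score_model(x_tilde, sigma)` is an assumed naming for illustration:

```python
# A sketch of the DSM objective built from the closed-form conditional score
# above. Only clean samples and freshly drawn noise are needed; no access to
# the true data density is required.
import torch

def dsm_loss(score_model, x, sigma):
    noise = torch.randn_like(x)
    x_tilde = x + sigma * noise                 # perturb clean data at scale sigma
    target = -(x_tilde - x) / sigma ** 2        # conditional score, equals -noise / sigma
    pred = score_model(x_tilde, sigma)
    return ((pred - target) ** 2).sum(dim=1).mean()
```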
2.1.2 NCSN and Annealed Langevin Dynamics
The first major practical realization of this idea was the Noise-Conditional Score Network (NCSN) 3. However, the authors found that samples generated by solving \ref{eq:7} with a single small $\sigma$ are of poor quality. The core issue is that real-world data often lies on a complex, low-dimensional manifold: in high-density regions the score is well defined, but in the vast empty spaces between data points it is ill-defined and difficult to estimate. These low-density regions are underrepresented in training, so the network generalizes poorly there. Their solution was ingenious: perturb the data with Gaussian noise of varying magnitudes $\sigma_1 > \sigma_2 > \dots > \sigma_K > 0$.
- Large $\sigma_i$: the distribution is heavily smoothed, covering the global space with small, stable scores in low-density areas (easier to learn).
- Small $\sigma_i$: focuses on local refinement, closer to $p_{\text{data}}$.
The loss function can now be rewritten as:
\[\mathcal{L}(\theta) = \mathbb{E}_{\sigma \sim \mathcal{D}}\,\lambda(\sigma)\,\mathbb{E}_{x \sim p_{\text{data}}(x)}\mathbb{E}_{\tilde{x} \sim p_{\sigma}(\tilde{x} \mid x)}\left[\big\|\,s_{\theta}(\tilde{x}, \sigma) - \nabla_{\tilde{x}} \log p_{\sigma}(\tilde{x} \mid x)\,\big\|_2^2\right]\label{eq:8}\]where $\lambda(\sigma)$ is a weight balancing gradient magnitudes across noise scales and $\mathcal{D}$ is a distribution over $\sigma$. To generate a sample, NCSN employs Annealed Langevin Dynamics (ALD) 3. Langevin dynamics 4 5 treats sampling as stochastic gradient ascent on the log-density, with injected Gaussian noise that guarantees convergence to the target distribution. ALD is an iterative sampling procedure (a code sketch follows the steps below):
- Start with a sample drawn from pure noise, corresponding to the largest noise level $\sigma_1$.
- Iteratively update the sample using the Langevin dynamics equation:
\[x_{i+1} \leftarrow x_i + \alpha s_{\theta}(x_i, \sigma_i) + \sqrt{2\alpha}z_i\]where $\alpha$ is a step size and $z_i$ is fresh Gaussian noise. This update consists of a “climb” along the score-gradient and a small injection of noise to encourage exploration.
- Gradually decrease (or "anneal") the noise level $\sigma_i$ from large to small. This process is analogous to starting with a blurry, high-level structure and progressively refining it with finer details until a clean sample emerges.
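Assembled into code, a minimal ALD sketch; the `score_model` interface and the step-size rule $\alpha_i = \epsilon\,\sigma_i^2/\sigma_K^2$ (the heuristic used in the NCSN paper) are assumptions for illustration:

```python
# A sketch of annealed Langevin dynamics, assuming a trained noise-conditional
# score network `score_model(x, sigma)` and a decreasing noise schedule
# sigmas = [sigma_1, ..., sigma_K].
import torch

@torch.no_grad()
def annealed_langevin_sample(score_model, shape, sigmas, n_steps=100, eps=2e-5):
    x = torch.randn(shape) * sigmas[0]               # initialize at the highest noise level
    for sigma in sigmas:                             # anneal sigma from large to small
        alpha = eps * (sigma / sigmas[-1]) ** 2      # smaller steps at smaller noise levels
        for _ in range(n_steps):
            z = torch.randn_like(x)                  # fresh exploration noise
            x = x + alpha * score_model(x, sigma) + (2 * alpha) ** 0.5 * z
    return x
```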
2.1.3 Historical Limitations and Enduring Legacy
The NCSN approach was groundbreaking, but it had its limitations. The Langevin sampling process was inherently stochastic, owing to the injected noise $z_i$, and slow, requiring many small steps to ensure stability and quality. The annealing schedule itself was often heuristic.
However, its legacy is profound. NCSN established the score function as a fundamental quantity for generative modeling and introduced the critical technique of conditioning on noise levels. It provided the continuous-space intuition that would become indispensable for later theoretical breakthroughs.
2.2 The Discrete-Time Perspective: Denoising Diffusion Probabilistic Models
Running parallel to the score-based work, a second paradigm emerged, built upon the more structured and mathematically elegant framework of Markov chains.
2.2.1 The Core Idea: An Elegant Markov Chain
DDPM proposed a fixed, two-part process.
Forward (Diffusion) Process: Start with a clean data sample $x_0$. Gradually add a small amount of Gaussian noise at each discrete timestep $t$ over a large number of steps $T$ (typically 1000). This defines a Markov chain $x_0, x_1, \dots, x_T$ where $x_T$ is almost indistinguishable from pure Gaussian noise. This forward process, $q(x_t \mid x_{t-1})$, is fixed and requires no learning.
\[q(x_t \mid x_{t-1}) = \mathcal{N}(x_t;\,\sqrt{\alpha_t}x_{t-1}, (1-\alpha_t)I)\]where $\beta_t = 1 - \alpha_t$ is the noise schedule. Writing $\bar{\alpha}_t = \prod_{s=1}^{t}\alpha_s$, the chain admits the closed form $q(x_t \mid x_0) = \mathcal{N}(x_t;\,\sqrt{\bar{\alpha}_t}x_0,\,(1-\bar{\alpha}_t)I)$, i.e. $x_t = \sqrt{\bar{\alpha}_t}x_0 + \sqrt{1-\bar{\alpha}_t}\epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$.
Reverse (Denoising) Process: Given a noisy sample $x_t$, how do we predict the slightly less noisy $x_{t-1}$? The goal is to learn a reverse model $p_{\theta}(x_{t-1} \mid x_t)$ that inverts this chain.
2.2.2 Forward Diffusion and Reverse Denoising
As derived in a previous post, DDPM maximizes the log-likelihood of $p_{\text{data}}(x_0)$ through a variational lower bound, and maximizing this bound reduces to minimizing the KL divergence between $p_{\theta}(x_{t-1} \mid x_t)$ and the true posterior $q(x_{t-1} \mid x_t, x_0)$.
\[\mathcal{L}_\text{ELBO} = \mathbb{E}_q \left[ \log p_\theta(x_0 \mid x_1) - \log \frac{q(x_{T} \mid x_0)}{p_\theta(x_{T})} - \sum_{t=2}^T \log \frac{q(x_{t-1} \mid x_t, x_0)}{p_\theta(x_{t-1} \mid x_t)} \right]\]The true posterior $q(x_{t-1} \mid x_t, x_0)$ can be computed using Bayes’ rule and the Markov property of the forward chain:
\[q(x_{t-1} \mid x_t, x_0) = \frac{q(x_t \mid x_{t-1}) q(x_{t-1} \mid x_0)}{q(x_t \mid x_0)}\]Since all three terms on the right-hand side are Gaussian, the posterior $q(x_{t-1} \mid x_t, x_0)$ is also Gaussian:
\[q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\left( x_{t-1}; \tilde{\mu}_t(x_t, x_0), \tilde{\beta}_t I \right)\]with mean and variance:
\[\tilde{\mu}_t(x_t, x_0) = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \epsilon \right) \quad , \quad \tilde{\beta}_t = \frac{(1 - \bar{\alpha}_{t-1}) \beta_t}{1 - \bar{\alpha}_t}\label{eq:14}\]Since our goal is for $p_{\theta}(x_{t-1} \mid x_t)$ to approximate this posterior, the reverse denoising step must also be Gaussian, with matching mean and variance. The noise $\epsilon$ added in the forward process is unknown at sampling time, so the training objective of DDPM is to predict it. Writing the network's prediction as $\epsilon_{\theta}(x_t, t)$ and substituting it for $\epsilon$:
\[p_{\theta}(x_{t-1} \mid x_t) = \mathcal{N}\left( x_{t-1}; {\mu}_{\theta}(x_t, t), {\beta}_{\theta} I \right)\]where:
\[{\mu}_{\theta}(x_t, t) = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \epsilon_\theta(x_t, t) \right) \ \ , \ \ {\beta}_{\theta}=\tilde{\beta}_t = \frac{(1 - \bar{\alpha}_{t-1}) \beta_t}{1 - \bar{\alpha}_t}\]Sampling is the direct inverse of the forward process. One starts with pure noise $x_T \sim \mathcal{N}(0, I)$ and iteratively applies the learned reverse step for $t = T, T-1, \dots, 1$, using the noise prediction $\epsilon_{\theta}(x_t, t)$ at each step to denoise the sample until a clean $x_0$ is obtained.
\[x_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \epsilon_\theta(x_t, t) \right) + \sqrt{\frac{(1 - \bar{\alpha}_{t-1}) \beta_t}{1 - \bar{\alpha}_t}} \epsilon\]where $\epsilon \sim \mathcal{N}(0, I)$ is fresh noise injected at each step (set to zero at the final step).
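For concreteness, a minimal sketch of this ancestral loop, assuming a trained noise-prediction network `eps_model(x, t)` and precomputed schedule tensors; note that it requires $T$ strictly sequential network evaluations:

```python
# A sketch of the DDPM ancestral sampling loop above. `alphas`, `alphas_bar`
# (cumulative products of alphas), and `betas` are assumed 1-D tensors of
# length T; index t here corresponds to timestep t+1 in the text.
import torch

@torch.no_grad()
def ddpm_sample(eps_model, shape, alphas, alphas_bar, betas):
    T = len(betas)
    x = torch.randn(shape)                                   # x_T ~ N(0, I)
    for t in reversed(range(T)):                             # t = T-1, ..., 0 (0-indexed)
        eps = eps_model(x, t)                                # one full network forward pass
        mean = (x - betas[t] / (1 - alphas_bar[t]).sqrt() * eps) / alphas[t].sqrt()
        if t > 0:
            var = (1 - alphas_bar[t - 1]) / (1 - alphas_bar[t]) * betas[t]
            x = mean + var.sqrt() * torch.randn_like(x)      # inject fresh noise
        else:
            x = mean                                         # no noise at the final step
    return x
```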
2.2.3 Theoretical Elegance and Practical Bottleneck
DDPMs offered a strong theoretical foundation, stable training, and state-of-the-art sample quality. However, this came at a steep price: the curse of a thousand steps. The model's theoretical underpinnings relied on the Markovian assumption and small step sizes, forcing the sampling process to be painstakingly slow and computationally expensive.
2.3 The Initial Convergence: Tying the Two Worlds Together
For a time, these two paradigms—one estimating a continuous gradient field, the other reversing a discrete noise schedule—seemed distinct. Yet, a profound and simple connection lay just beneath the surface. It was shown that the two seemingly different objectives were, in fact, two sides of the same coin.
The score function $s(x_t, t)$ at a given noise level is a simple rescaling of the optimal noise prediction $\epsilon(x_t, t)$:
\[s_{\theta}(x_t, t) = \nabla_{x_t} \log p(x_t) = -\frac{\epsilon_{\theta}(x_t, t)} { \sigma_t}\]where $\sigma_t$ is the standard deviation of the noise at time $t$ (in DDPM notation, $\sigma_t = \sqrt{1 - \bar{\alpha}_t}$).
This equivalence is beautiful. Predicting the noise is mathematically equivalent to estimating the score. The score-based view provides the physical intuition of climbing a probability landscape, while the DDPM view provides a stable training objective and a concrete, discrete-time mechanism.
With this link established, the two parallel streams began to merge into a single, powerful river. This convergence set the stage for the next major leap: breaking free from the rigid, one-step-at-a-time sampling of DDPM, and ultimately, the development of a unified SDE/ODE theory that could explain and improve upon both.
3. The First Leap in Efficiency: Breaking the Markovian Chain with DDIM
The stunning quality of DDPMs came at a steep price: the rigid, step-by-step Markovian chain. This constraint, while theoretically elegant, was a practical nightmare, demanding a thousand or more sequential model evaluations for a single image. The field desperately needed a way to accelerate this process without a catastrophic loss in quality. The answer came in the form of Denoising Diffusion Implicit Models (DDIM) 6, a clever reformulation that fundamentally altered the generation process.
3.1 From Stochastic Denoising to Deterministic Prediction
The core limitation of the DDPM sampler lies in its definition of the reverse process, which models $p_{\theta}(x_{t-1} \mid x_t)$. This conditional probability is what creates the strict, one-step-at-a-time dependency.
The DDIM paper posed a brilliant question: what if we don’t try to predict $x_{t-1}$ directly? What if, instead, we use our noise-prediction network $\epsilon_{\theta}(x_t, t)$ to make a direct guess at the final, clean image $x_0$? This is surprisingly straightforward. Given $x_t$ and the predicted noise $\epsilon_{\theta}(x_t, t)$, we can rearrange the forward process formula to solve for an estimated $x_0$, which we’ll call $\hat{x}_0$:
\[\hat{x}_0 = \frac{(x_t - \sqrt{1 - \bar{\alpha}_t} \epsilon_{\theta}(x_t, t))}{\sqrt{\bar{\alpha}_t}}\]This single equation is the conceptual heart of DDIM. By first estimating the final destination $x_0$, we are no longer bound by the previous step. We have a “map” that points from our current noisy location $x_t$ directly to the origin of the trajectory. This frees us from the Markovian assumption and opens the door to a much more flexible generation process.
3.2 The Freedom to Jump: Non-Markovian Skip-Step Sampling
With the estimate $\hat{x}_0$ in hand, DDIM constructs the next sample $x_{t-1}$ in a completely different way, as a weighted combination of three components:
- The "Clean Image" Component: the estimated final image $\hat{x}_0$, pointing towards the data manifold.
- The "Noise Direction" Component: the direction pointing from the clean image back to the current noisy sample $x_t$, represented by $\epsilon_{\theta}(x_t, t)$.
- A Controllable Noise Component: an optional injection of fresh random noise.
The DDIM update equation is as follows:
\[x_{t-1} = \sqrt{\bar{\alpha}_{t-1}} \hat{x}_0 + \sqrt{1-\bar{\alpha}_{t-1}-\sigma_t^2}\,\epsilon_{\theta}(x_t, t) + \sigma_t z_t\label{eq:20}\]Here, $z_t$ is fresh random noise, and $\sigma_t$ is a new hyperparameter that controls the amount of stochasticity: $\sigma_t = \eta \sqrt{\tilde{\beta}_t}$, where $\tilde{\beta}_t$ is the DDPM posterior variance defined in Equation \ref{eq:14}.
This formulation \ref{eq:20} is non-Markovian because the calculation of $x_{t-1}$ explicitly depends on predicted $x_0$, which is an estimate of the trajectory’s origin, not just the immediately preceding state $x_t$. Because this process is no longer Markovian, the “previous” step doesn’t have to be $t-1$. We can choose an arbitrary subsequence of timesteps from the original $1, \dots, T$, for example, $1000, 980, 960, …, 20, 0$. Instead of taking 1000 small steps, we can now take 50 large jumps. At each jump from $t$ to a much earlier $s < t$, the model predicts $x_0$ and then uses the DDIM update rule to deterministically interpolate the correct point $x_s$ on the trajectory. This “skip-step” capability was a game-changer. It allowed users to drastically reduce the number of function evaluations (NFEs) from 1000 down to 50, 20, or even fewer, providing a massive speedup with only a minor degradation in image quality.
Formula \ref{eq:20} also reveals an important property: for any $t$, the marginal distribution of $x_{t-1}$ obtained through DDIM sampling matches the marginal distribution of DDPM.
\[q(x_{t-1} \mid x_0 ) = \mathcal{N}(x_{t-1}; \sqrt{\bar{\alpha}_{t-1}} {x_0}, (1-\bar{\alpha}_{t-1})I)\]This is why DDIM can take large jumps and still produce high-quality images using a model trained for the one-step DDPM process.
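A minimal sketch of skip-step sampling under Equation \ref{eq:20}, reusing the assumed `eps_model` and `alphas_bar` from the DDPM sketch above; the stochasticity dial $\eta$, discussed in the next subsection, is exposed as a parameter:

```python
# A sketch of DDIM skip-step sampling. `timesteps` is a decreasing subsequence
# of the original schedule indices, e.g. 50 indices out of 1000 ending at 0.
import torch

@torch.no_grad()
def ddim_sample(eps_model, shape, alphas_bar, timesteps, eta=0.0):
    x = torch.randn(shape)
    for t, s in zip(timesteps[:-1], timesteps[1:]):          # jump t -> s with s < t
        eps = eps_model(x, t)
        ab_t, ab_s = alphas_bar[t], alphas_bar[s]
        x0_hat = (x - (1 - ab_t).sqrt() * eps) / ab_t.sqrt() # predict the clean image
        # sigma_t(eta) from the DDIM paper; eta = 0 gives the deterministic sampler
        sigma = eta * ((1 - ab_s) / (1 - ab_t)).sqrt() * (1 - ab_t / ab_s).sqrt()
        x = (ab_s.sqrt() * x0_hat                            # "clean image" component
             + (1 - ab_s - sigma ** 2).sqrt() * eps          # "noise direction" component
             + sigma * torch.randn_like(x))                  # controllable noise component
    return x
```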
3.3 The $\eta$ Parameter: A Dial for Randomness
DDIM introduced another powerful feature: a parameter denoted as $\eta$ that explicitly controls the stochasticity of the sampling process.
- $\eta = 0$ (Deterministic DDIM): The random noise term in the update rule is eliminated. For a given initial noise $x_T$, the sampling process will always produce the exact same final image $x_0$. This is the "implicit" in DDIM's name, as it defines a deterministic generative process. This property is incredibly useful for tasks that require reproducibility or image manipulation (like SDEdit), as the generation path is fixed.
- $\eta = 1$ (Stochastic DDIM): DDIM adds a specific amount of stochastic noise at each step. It was shown that this choice makes the marginal distributions $p(x_t)$ match those of the original DDPM sampler. It behaves much like DDPM, offering greater sample diversity at the cost of reproducibility.
- $0 < \eta < 1$ (Hybrid Sampling): Values between 0 and 1 provide a smooth interpolation between a purely deterministic path and a fully stochastic one, giving practitioners a convenient "dial" to trade off between diversity and consistency.
3.4 Bridging to a Deeper Theory: The Lingering Question
DDIM was a massive practical leap forward. It provided the speed that was desperately needed and introduced the fascinating concept of a deterministic generative path. However, its derivation, while mathematically sound, was rooted in the discrete-time formulation of DDPM. The deterministic path ($\eta=0$) worked exceptionally well, but it raised a profound theoretical question:
What is this deterministic path? If a continuous process underlies generation, can we describe this path more fundamentally?
The success of deterministic DDIM strongly hinted that the stochasticity of Langevin Dynamics and DDPM sampling might not be strictly necessary. It suggested the existence of a more fundamental, deterministic flow from noise to data. Explaining the origin and nature of this flow was the next great challenge.
This question sets the stage for our next chapter, where we will discover the grand, unifying framework of Stochastic and Ordinary Differential Equations (SDEs/ODEs), which not only provides the definitive answer but also reveals that Score-Based Models and DDPMs were two sides of the same mathematical coin all along.
4. A Confluence of Paths: The Unifying Framework of SDEs and ODEs
The insight behind DDIM—that sampling can follow a deterministic path derived from a learned vector field—opens the door to a more general formulation: diffusion models can be expressed as continuous-time stochastic or deterministic processes. In this chapter, we formalize that insight by introducing the Stochastic Differential Equation (SDE) and the equivalent Probability Flow Ordinary Differential Equation (ODE) perspectives. This unified view provides the theoretical bedrock for nearly all subsequent developments in diffusion sampling, including adaptive solvers, distillation, and consistency models.
4.1 The Dimensional Lift: From Discrete Steps to Continuous Time
The core insight is to re-imagine the diffusion process not as $T$ discrete steps, but as a continuous process evolving over a time interval, say $t \in [0, T]$. In this view, $x_0$ is the clean data at $t=0$, and $x_T$ is pure noise at $t=T$. The forward process is no longer a chain but a trajectory governed by a Stochastic Differential Equation (SDE) that continuously injects noise.
\[dx_t=f(x_t, t)dt+g(t)dw_t\]where $f(x_t, t)$ is the drift term, describing the deterministic part of the SDE's evolution; $g(t)$ is the diffusion coefficient, scaling the magnitude of the injected noise; and $dw_t$ is a standard Wiener process increment, $dw_t=\sqrt{dt}\,\epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$.
This continuous formulation is incredibly powerful because it allows us to leverage the mature and rigorous mathematics of stochastic calculus. More importantly, it was proven by Song et al. in their seminal work that there exists a corresponding reverse-time SDE that can transform pure noise at $t=T$ back into data at $t=0$. This reverse SDE is the true, continuous-time “master equation” for generation.
\[dx_t = \left[ f(x_t, t) - g(t)^2 s_{\theta}(x_t, t) \right] dt + g(t) d{\bar w}_t\]where $s_{\theta}(x_t, t)$ is the score function learned by the neural network, and $d{\bar w}_t$ is an increment of a reverse-time Wiener process, an infinitesimal random "kick".
This single equation serves as the grand unifier for the stochastic samplers from Section 2. Both Annealed Langevin Dynamics (from NCSN) and the DDPM sampling procedure can be shown to be different numerical discretization schemes for this exact same SDE. They were, all along, just two different ways to approximate the solution to one underlying continuous process.
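To ground the discretization claim, here is a minimal Euler-Maruyama sampler for the reverse SDE; the variance-preserving parameterization $f(x_t, t) = -\tfrac{1}{2}\beta(t)x_t$, $g(t) = \sqrt{\beta(t)}$ and the `score_model(x, t)` interface are assumptions for illustration, not the only possible choices:

```python
# A sketch of Euler-Maruyama integration of the reverse SDE under the assumed
# VP parameterization. `beta` is a callable noise schedule beta(t) -> float.
import torch

@torch.no_grad()
def reverse_sde_sample(score_model, shape, beta, T=1.0, n_steps=1000):
    dt = T / n_steps
    x = torch.randn(shape)                                   # x_T ~ N(0, I)
    for i in range(n_steps, 0, -1):
        t = i * dt
        drift = -0.5 * beta(t) * x - beta(t) * score_model(x, t)   # f - g^2 * score
        x = x - drift * dt                                   # step backwards in time
        if i > 1:                                            # no noise on the last step
            x = x + (beta(t) * dt) ** 0.5 * torch.randn_like(x)
    return x
```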
4.2 The Probability Flow ODE: Discovering the Deterministic Path
The reverse SDE provides a complete picture for stochastic generation, but what about the deterministic path discovered by DDIM? This is where the second part of the unified theory comes into play.
For any given SDE that describes the evolution of individual random trajectories, there exists a corresponding Ordinary Differential Equation (ODE) that describes the evolution of the probability density of those trajectories. This is known as the Probability Flow (PF) ODE. While the SDE traces a jagged, random path, the PF-ODE traces a smooth, deterministic “flow” of probability from the noise distribution to the data distribution.
The PF-ODE corresponding to the reverse SDE is:
\[dx_t = \left[ f(x_t, t) - \frac{g(t)^2}{2} s_{\theta}(x_t, t) \right] dt\]Notice the two critical differences from the SDE:
- The diffusion term is gone. There is no more stochastic noise injection. The process is entirely deterministic.
- The score term $s_{\theta}$ is scaled by $\frac{1}{2}$. This factor arises directly from the mathematical conversion (via the Fokker-Planck equation) from the SDE to its deterministic counterpart.
This ODE is the definitive answer to the puzzle of DDIM. The deterministic path ($\eta = 0$) that DDIM so effectively approximates is, in fact, a trajectory of this very Probability Flow ODE. The DDIM update rule is a specific (and quite effective) numerical solver for this equation.
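Under the same assumed parameterization, a fixed-step Euler solver for the PF-ODE differs from the stochastic sampler above in exactly the two ways just listed:

```python
# A sketch of an explicit Euler solver for the probability flow ODE under the
# assumed VP parameterization: the score term carries an extra factor of 1/2,
# and no noise is injected, so the trajectory is fully deterministic.
import torch

@torch.no_grad()
def pf_ode_sample(score_model, shape, beta, T=1.0, n_steps=50):
    dt = T / n_steps
    x = torch.randn(shape)
    for i in range(n_steps, 0, -1):
        t = i * dt
        drift = -0.5 * beta(t) * x - 0.5 * beta(t) * score_model(x, t)  # f - (g^2 / 2) * score
        x = x - drift * dt                                   # deterministic Euler step
    return x
```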
4.3 Theoretical Completion: A New Paradigm for Sampling
The SDE/ODE framework marks a moment of theoretical completion and a paradigm shift. We now have a single, coherent view that elegantly encompasses all previous methods:
The generation process of any diffusion model can be described by a continuous-time differential equation. Stochastic samplers like DDPM and Langevin Dynamics are discretizations of the reverse SDE. Deterministic samplers like DDIM are discretizations of the corresponding Probability Flow ODE.
This realization is transformative. The problem of sampling has been fundamentally reframed. We are no longer limited to thinking about reversing a Markov chain or simulating particle physics. Instead, the task is now clear:
To generate a sample, we must solve a specific, known Ordinary Differential Equation, starting from a random point $x_T$ and integrating backwards in time to $t=0$.
This new perspective immediately begs the next question. The world of numerical analysis has spent over a century developing sophisticated methods for solving ODEs—from simple Euler methods to high-order Runge-Kutta schemes. Can we simply apply these off-the-shelf, classical solvers to the PF-ODE and achieve even greater speed and accuracy?
As we will see in the next chapter, the answer is surprisingly complex. The unique properties of the diffusion PF-ODE present significant challenges to standard numerical methods, necessitating the development of a new class of specialized, “diffusion-aware” solvers.
5. Solver Engineering: Tailor-Made Algorithms for the Probability Flow ODE
The unification under the Probability Flow (PF) ODE framework, as established in Section 4, reframed sampling as a numerical integration problem:
\[d\mathbf{x} = \left[ \mathbf{f}(t)\mathbf{x} - \frac{g^2(t)}{2} \nabla_{\mathbf{x}_t} \log p_t(\mathbf{x}_t) \right] dt\]The immediate temptation is to apply classical numerical methods to solve this equation and thereby complete the sampling. However, their failure in this context is not merely empirical; it is rooted in a fundamental mismatch with the properties of this specific ODE.
5.1 The “Misfit”: Why Classical ODE Solvers Are Ill-Suited
Let’s formalize the challenges. Consider a generic ODE $dx/dt = F(x, t)$. A standard $k$-th order Runge-Kutta solver approximates the solution with a local truncation error of $O(h^{k+1})$, where $h$ is the step size. This guarantee holds if the higher-order derivatives of $F$ are well-behaved. For the diffusion PF-ODE, this assumption breaks down.
The theoretical elegance of the PF-ODE framework, $dx/dt = F(x, t)$, naturally suggests applying the vast arsenal of classical numerical methods. However, a deeper analysis reveals a fundamental mismatch between the assumptions of these solvers and the realities of diffusion sampling. This mismatch occurs across multiple dimensions: the objective, the problem’s numerical properties, and the very nature of the “derivative” function itself.
5.1.1 Mismatch in Objective: Path Fidelity vs. Endpoint Fidelity
Classical ODE solvers are designed for path fidelity. Their primary goal is to ensure that the approximate solution $x_n$ remains close to the true trajectory $x(t_n)$ at every intermediate step. The error is rigorously controlled throughout the integration.
Diffusion sampling, however, has a fundamentally different objective: endpoint fidelity. We do not care whether the intermediate sample $x_t$ at step $t=500$ is a perfect representation of the true marginal distribution $p_{500}(x)$. What we ultimately care about is that the final sample $x_0$ is a high-quality, plausible element of the target data distribution $p_0(x)$ (as measured by FID/sFID/CLIP scores or perceptual realism). This philosophical difference means we can tolerate large local truncation errors, provided they do not catastrophically destabilize the process and still guide us toward the correct final manifold. This invalidates the core optimization target of classical methods and opens the door for aggressive, large-step "jumping" methods that would be considered unacceptable in a traditional scientific computing context.
5.1.2 Mismatch in Cost: The Price of a Function Evaluation
In a typical textbook ODE problem, the derivative function $F(x, t)$ is an analytical expression whose evaluation is computationally trivial. For diffusion models, $F(x, t)$ is defined by the score function $s_{\theta}(x, t)$ or the noise prediction $\epsilon_{\theta}(x, t)$, which requires a full forward pass through a massive neural network (e.g., a U-Net). This makes each function evaluation extremely expensive.
This high cost immediately penalizes classical high-order explicit methods. For instance, a 4th-order Runge-Kutta (RK4) solver requires 4 NFEs per step. To generate an image in 10 steps, an RK4 solver would consume $10 \times 4 = 40$ NFEs. A simpler 1st-order Euler solver could take 40 steps with the same budget, often yielding a better result because it can follow the trajectory’s high curvature more closely with smaller, more frequent updates. The goal is not just high order, but the best quality for a minimal total NFE budget.
5.1.3 Numerical Stiffness and the Curse of Dimensionality
The diffusion PF-ODE is a notoriously stiff problem. The Jacobian of the vector field has eigenvalues that span many orders of magnitude. This means some components of the solution change very rapidly while others change slowly. For explicit solvers, the step size $h$ is constrained by the largest eigenvalue ($h < C/\lvert\lambda_{\max}\rvert$) to maintain stability. For stiff problems, this forces $h$ to be prohibitively small.
The standard textbook solution for stiff ODEs is to use implicit methods (e.g., Backward Euler, BDF solvers). These methods have much better stability but require solving a system of equations at each step, often involving the Jacobian. Here, we encounter the Curse of Dimensionality. For an image of size $256 \times 256 \times 3$, the dimension $D$ of our state vector $x$ is $196{,}608$. The Jacobian is a $D \times D$ matrix, which is far too large to store, let alone invert. Therefore, the standard solution for stiffness is computationally infeasible, necessitating the invention of novel methods that can handle stiffness without being implicit.
5.1.4 The Imperfect Oracle: Fitting a Noisy, Learned Function
A critical, often overlooked, assumption of classical solvers is that the function $F(x, t)$ is a precise, ground-truth oracle. In our case, $F(x, t)$ depends on $\epsilon_{\theta}(x, t)$, which is a neural network’s approximation of the true, unknown score function. It contains both approximation error (from the network’s finite capacity) and potentially training error.
Applying a very high-order solver to this imperfect, potentially “noisy” vector field is akin to fitting a high-degree polynomial to a small number of noisy data points. The solver might start to “overfit” to the meaningless wiggles and inaccuracies in the network’s output, rather than capturing the true underlying dynamics. This suggests there is a ceiling on the useful order of a solver; beyond a certain point, increasing the order may only serve to amplify the model’s intrinsic error rather than reducing the numerical integration error. This favors solvers that are robust to inaccuracies in the derivative estimate.
References
Aapo Hyvärinen. "Estimation of Non-Normalized Statistical Models by Score Matching". JMLR, 2005.
Pascal Vincent. "A Connection Between Score Matching and Denoising Autoencoders". Neural Computation, 23(7): 1661-1674, 2011.
Yang Song and Stefano Ermon. "Generative Modeling by Estimating Gradients of the Data Distribution". NeurIPS, 2019.
Giorgio Parisi. "Correlation Functions and Computer Simulations". Nuclear Physics B, 180(3): 378-384, 1981.
Ulf Grenander and Michael I. Miller. "Representations of Knowledge in Complex Systems". Journal of the Royal Statistical Society: Series B (Methodological), 56(4): 549-581, 1994.
Jiaming Song, Chenlin Meng, and Stefano Ermon. "Denoising Diffusion Implicit Models". arXiv preprint arXiv:2010.02502, 2020.