High-Order PF-ODE Solver in Diffusion Models
Diffusion sampling via the Probability Flow Ordinary Differential Equation (PF-ODE) looks like a standard ODE-solving problem, but off-the-shelf ODE numerical methods rarely achieve the optimal balance between speed and quality. This article, built on core numerical-analysis concepts (single-step vs. multi-step, explicit vs. implicit, and the roles of LTE and GTE), reviews traditional ODE numerical methods and explains why they are not well suited to diffusion model sampling. It then turns to the semi-linear structure of the PF-ODE and introduces diffusion-specific samplers, including pseudo-numerical samplers (PLMS, PNDM) and high-order/semi-analytical methods (DEIS, DPM-Solver/++, UniPC).
1. Introduction
Diffusion models involve a forward diffusion process that gradually adds Gaussian noise to data samples until they become pure noise, and a reverse process that learns to denoise, turning noise back into new samples.
Mathematically, the forward process can be modeled as a stochastic differential equation (SDE):
\[dx = f(x, t) \, dt + g(t) \, dw\]where $x$ is the state variable (e.g., an image), $t$ is the time step, $f(x, t)$ is the drift term, $g(t)$ is the diffusion coefficient, and $dw$ represents a Wiener process. The reverse process, used for sampling, is another SDE that approximates the time-reversal of the forward SDE, incorporating a score function $\nabla_x \log p_t(x)$ (the gradient of the log-probability density), which is estimated by a neural network trained via score matching.
A crucial insight is that this reverse SDE has an equivalent deterministic representation through the probability flow ordinary differential equation (PF ODE):
\[\frac{dx}{dt} = f(x, t) - \frac{1}{2} g(t)^2 \nabla_x \log p_t(x)\]This equivalence stems from the fact that the PF ODE is derived to match the Fokker-Planck equation (which describes the evolution of probability densities) of the original SDE, ensuring that trajectories generated by solving the ODE backward (from $t = T$ at pure noise to $t = 0$ at clean data) produce the same marginal probability distributions as the stochastic SDE paths, but without injecting additional randomness. Thus, sampling reduces to numerically integrating this ODE, making the process deterministic and potentially more efficient, as it avoids the variance introduced by stochastic sampling while preserving the generative quality.
In this post, we first review the numerical methods for solving ODEs. Then, we analyze why we do not directly use ODE numerical solvers for sampling in diffusion models. Finally, we explore how to construct efficient samplers based on the properties of the PF-ODE.
2. Numerical ODE Solver
Numerical methods convert a continuous‐time initial-value problem into a sequence of discrete algebraic updates that march the solution forward in small time steps.
Numerical ODE solvers work by discretizing the continuous time domain into a sequence of time points: $t_0, t_1, …, t_{n-1}, t_n$, where the interval between any two adjacent time steps is $h$, i.e., $t_i=t_{i-1}+h$. Given an initial-value problem:
\[\frac{dx}{dt}=f(t,x),\ \ \ \ x(t_0)=x_0\]the Fundamental Theorem of Calculus rewrites the update over one step $h$:
\[x_{t_{i+1}}=x_{t_i}+\int_{t_i}^{t_{i+1}}f(t,x)dt\]Because the exact integrand $f(t,x)$ is unknown (it involves the unknown path $x$), numerical schemes replace that integral with a tractable quadrature formula built from sample slopes. The essential difference between different numerical methods lies in the different strategies they use to approximate this integral.
2.1 One-step vs Multi-step
This research dimension answers the question: “How much historical information is needed to compute $x_{t_{n+1}}$?”
2.1.1 One-step Methods
A one-step method uses only the information from the single previous point $(t_n, x_{t_n})$ to compute $x_{t_{n+1}}$; it does not use earlier points such as $x_{t_{n-1}}$, $x_{t_{n-2}}$, etc. That is to say, $x_{t_{n+1}}$ is determined entirely by $x_{t_n}$ and some estimates of the slope within the interval $[t_n, t_{n+1}]$, which we formalize as the general form $x_{t_{n+1}}=\Phi(t_n, x_{t_n}, h)$
Some classic numerical one-step methods are listed as follows:
| Methods | Order | NFE | Sampling Points | Update (explicit form) |
|---|---|---|---|---|
| Euler | 1 | 1 | $t_n$ | \(x_{t_{n+1}} = x_{t_n} + h\,f(t_n, x_{t_n})\) |
| Heun (RK2) | 2 | 2 | \(t_n \\ t_n+h\) | \(k_1=f(t_n, x_{t_n}) \\ k_2=f(t_n+h, x_{t_n}+h\,k_1) \\ x_{t_{n+1}}=x_{t_n}+\frac{h}{2}(k_1+k_2)\) |
| RK3 | 3 | 3 | \(t_n \\ t_n+\frac{h}{2} \\ t_n+h\) | \(k_1=f(t_n, x_{t_n}) \\ k_2=f(t_n+\frac{h}{2}, x_{t_n}+\frac{h}{2}k_1) \\ k_3=f(t_n+h, x_{t_n}-hk_1+2hk_2) \\ x_{t_{n+1}}=x_{t_n}+\frac{h}{6}(k_1+4k_2+k_3)\) |
| RK4 | 4 | 4 | \(t_n \\ (t_n+\frac{h}{2})(2\times) \\ t_n+h\) | \(k_1=f(t_n, x_{t_n}) \\ k_2=f(t_n+\frac{h}{2}, x_{t_n}+\frac{h}{2}k_1) \\ k_3=f(t_n+\frac{h}{2}, x_{t_n}+\frac{h}{2}k_2) \\ k_4=f(t_n+h, x_{t_n}+hk_3) \\ x_{t_{n+1}}=x_{t_n}+\frac{h}{6}(k_1+2k_2+2k_3+k_4)\) |
Geometrically, the integral on the right equals the signed area enclosed by the curve $f(t,x)$, the $t$-axis, and the vertical lines $t=t_i$ and $t=t_{i+1}$. Higher-order numerical methods guarantee a better asymptotic error bound when all other factors (step size, stability, constants, arithmetic) are favourable.

However, in real problems those factors often dominate, so a lower-order method can outperform a higher-order one.
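To make the one-step family concrete, here is a minimal NumPy sketch of Euler, Heun, and RK4 steps applied to a toy problem; the function names and the test ODE are illustrative only.

```python
import numpy as np

def euler_step(f, t, x, h):
    # Forward Euler: x_{n+1} = x_n + h * f(t_n, x_n)
    return x + h * f(t, x)

def heun_step(f, t, x, h):
    # Heun (RK2): average the slope at t_n and the predicted slope at t_n + h
    k1 = f(t, x)
    k2 = f(t + h, x + h * k1)
    return x + 0.5 * h * (k1 + k2)

def rk4_step(f, t, x, h):
    # Classical RK4: four slope evaluations per step
    k1 = f(t, x)
    k2 = f(t + 0.5 * h, x + 0.5 * h * k1)
    k3 = f(t + 0.5 * h, x + 0.5 * h * k2)
    k4 = f(t + h, x + h * k3)
    return x + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)

# Toy problem: dx/dt = -x with x(0) = 1, exact solution x(t) = exp(-t).
f = lambda t, x: -x
x, t, h = 1.0, 0.0, 0.1
for _ in range(10):          # integrate to t = 1
    x = rk4_step(f, t, x, h)
    t += h
print(x, np.exp(-1.0))       # RK4 result vs. exact value
```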
2.1.2 Multi-step Methods
A multi-step method uses not only the information from the single previous point $(t_n, x_{t_n})$, but also from previous points, such as $(t_{n-1}, x_{t_{n-1}}), (t_{n-2}, x_{t_{n-2}})$, etc. We formalize this as a general form $x_{t_{n+1}}=\Phi(h, f_{n}, f_{n-1}, f_{n-2},\dots)$, where $f_i = f(t_i, x_{t_i})$. By leveraging historical slope evaluations ($f$ values from prior steps), these methods can achieve higher-order accuracy with fewer new function evaluations per step, making them more efficient for problems where $f$ is expensive to compute (like neural network calls in diffusion models). However, they require an initial “bootstrap” phase (often using one-step methods to generate the first few points) and can be less stable for non-smooth or stiff problems due to error propagation from history.
Traditional multi-step methods include the Adams family (Adams-Bashforth for explicit, Adams-Moulton for implicit) and Backward Differentiation Formulas (BDF) for stiff systems. Some classic numerical multi-step methods are listed as follows (focusing on explicit Adams-Bashforth variants for comparison with one-step methods; note that after bootstrapping, they typically require only 1 NFE per step as they reuse past $f$ values):
| Methods | Order | NFE | Sampling Points | Update (explicit form) |
|---|---|---|---|---|
| Adams-Bashforth 1 (AB1, equivalent to Euler) | 1 | 1 | $t_n$ | $x_{t_{n+1}}=x_{t_n}+hf(t_n,x_{t_n})$ |
| Adams-Bashforth 2 (AB2) | 2 | 1 | $t_n, t_{n-1}$ | $x_{t_{n+1}}=x_{t_n}+\frac{h}{2}[3f(t_n,x_{t_n})-f(t_{n-1},x_{t_{n-1}})]$ |
| Adams-Bashforth 3 (AB3) | 3 | 1 | $t_n, t_{n-1}, t_{n-2}$ | $x_{t_{n+1}}=x_{t_n}+\frac{h}{12}[23f(t_n,x_{t_n})-16f(t_{n-1},x_{t_{n-1}})+5f(t_{n-2},x_{t_{n-2}})]$ |
| Adams-Bashforth 4 (AB4) | 4 | 1 | $t_n, t_{n-1}, t_{n-2}, t_{n-3}$ | $x_{t_{n+1}}=x_{t_n}+\frac{h}{24}[55f(t_n,x_{t_n})-59f(t_{n-1},x_{t_{n-1}})+37f(t_{n-2},x_{t_{n-2}})-9f(t_{n-3},x_{t_{n-3}})]$ |
Geometrically, multi-step methods approximate the integral by fitting a polynomial interpolant through multiple past slope points, extrapolating forward. This can provide a more accurate area estimate under the curve $f(t,x)$ compared to one-step methods, especially for smooth functions, but it assumes the history is reliable—errors in early steps can compound over time.
The following figure shows how multi-step methods approximate $\int_{t_n}^{t_{n+1}}f(s, x(s))\,ds$ by integrating the Lagrange interpolant through past slope points. The shaded region is the predicted area that advances the solution.

However, in practice, multi-step methods shine in scenarios with consistent step sizes and non-stiff ODEs, as they reduce computational overhead by recycling evaluations. For diffusion PF-ODEs, their efficiency inspires pseudo-multi-step samplers (e.g., PLMS), but stiffness often necessitates hybrid approaches with correctors or adaptive stepping.
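As a concrete sketch of the bootstrap-then-reuse pattern described above, the following NumPy example runs AB3 after a short Heun warm-up; the function names and the test problem are illustrative assumptions.

```python
import numpy as np

def ab3_solve(f, t0, x0, h, n_steps):
    # Explicit Adams-Bashforth 3: after a short bootstrap, each step reuses the
    # two previous slopes, so only one new f-evaluation is needed per step.
    ts, xs, fs = [t0], [x0], [f(t0, x0)]
    # Bootstrap the first two steps with Heun (any one-step method works).
    for _ in range(2):
        t, x = ts[-1], xs[-1]
        k1 = f(t, x)
        k2 = f(t + h, x + h * k1)
        xs.append(x + 0.5 * h * (k1 + k2))
        ts.append(t + h)
        fs.append(f(ts[-1], xs[-1]))
    # AB3 main loop: x_{n+1} = x_n + h/12 * (23 f_n - 16 f_{n-1} + 5 f_{n-2})
    for _ in range(n_steps - 2):
        x_next = xs[-1] + (h / 12.0) * (23 * fs[-1] - 16 * fs[-2] + 5 * fs[-3])
        ts.append(ts[-1] + h)
        xs.append(x_next)
        fs.append(f(ts[-1], x_next))          # 1 NFE per step after warm-up
    return np.array(ts), np.array(xs)

ts, xs = ab3_solve(lambda t, x: -x, 0.0, 1.0, 0.1, 10)
print(xs[-1], np.exp(-ts[-1]))                # AB3 result vs. exact value
```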
2.2 Implicit vs Explicit
This research dimension aims to answer the question: “Is the formula for $x_{t_{n+1}}$ a direct calculation, or an equation to be solved?” That is to say, does $x_{t_{n+1}}$ appear on both sides of the equation simultaneously?
2.2.1 Explicit
In explicit methods, the formula for $x_{t_{n+1}}$ is an explicit expression in which the unknown $x_{t_{n+1}}$ appears only on the left-hand side. We can directly plug known values into the right-hand side to compute $x_{t_{n+1}}$.
\[x_{t_{n+1}}=x_{t_n}+h\Phi(t_n, x_{t_n}, f_n, f_{n-1}, \dots)\]where $f_n=f(t_n, x_{t_n})$. Examples include forward Euler, explicit Runge–Kutta (Heun/RK3/RK4), and Adams–Bashforth (AB) multi-step schemes.
2.2.2 Implicit
In implicit methods, the formula for $x_{t_{n+1}}$ is an equation in which the unknown $x_{t_{n+1}}$ appears on both sides of the equals sign. Typical examples include the backward Euler method and the Adams–Moulton family.
\[x_{t_{n+1}}=x_{t_n}+h\Phi(t_{n+1}, x_{t_{n+1}}; t_n, x_{t_n}, f_n, f_{n-1}, \dots)\]An implicit method requires solving a (potentially nonlinear) equation at each step to find $x_{t_{n+1}}$, often using iterative techniques like Newton-Raphson. This increases computational cost per step but enhances stability, making them suitable for stiff ODEs common in diffusion models.
Some classic implicit numerical methods are listed as follows, focusing on Runge-Kutta family examples (which can be one-step; multi-step implicit methods like Adams-Moulton are also common):
| Methods | Order | NFE (per iteration) | Sampling Points | Update (implicit form) |
|---|---|---|---|---|
| Backward Euler | 1 | 1 (plus solver iterations) | $t_{n+1}$ | $x_{t_{n+1}} = x_{t_n} + h \cdot f(t_{n+1}, x_{t_{n+1}})$ |
| Trapezoidal (Implicit RK2) | 2 | 1 (plus solver iterations) | $t_n \ t_{n+1}$ | $x_{t_{n+1}} = x_{t_n} + \frac{h}{2} \left[ f(t_n, x_{t_n}) + f(t_{n+1}, x_{t_{n+1}}) \right]$ |
| Implicit Midpoint (Gauss-Legendre RK2) | 2 | 1 (plus solver iterations) | $t_n + \frac{h}{2}$ | $k = f\left(t_n + \frac{h}{2}, x_{t_n} + \frac{h}{2} k\right) \ x_{t_{n+1}} = x_{t_n} + h k$ |
| Radau IIA (Implicit RK3) | 3 | 2 (plus solver iterations) | $t_n + \frac{h}{3} \ t_n + h$ | Solve for $k_1, k_2$: \ $k_1 = f\left(t_n + \frac{h}{3}, x_{t_n} + \frac{5h}{12} k_1 - \frac{h}{12} k_2\right) \ k_2 = f\left(t_n + h, x_{t_n} + \frac{3h}{4} k_1 + \frac{h}{4} k_2\right)$ \ $x_{t_{n+1}} = x_{t_n} + \frac{3h}{4} k_1 + \frac{h}{4} k_2$ |
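The extra per-step cost of implicit methods is the nonlinear solve. Below is a minimal sketch of one backward-Euler step with a few Newton iterations, applied to a stiff toy problem where explicit Euler diverges; the scalar ODE and the `dfdx` Jacobian callable are assumptions made for illustration.

```python
import numpy as np

def backward_euler_step(f, dfdx, t, x, h, iters=5):
    # Solve g(z) = z - x - h * f(t + h, z) = 0 for z = x_{n+1} with Newton's method.
    z = x + h * f(t, x)                     # explicit Euler as the initial guess
    for _ in range(iters):
        g = z - x - h * f(t + h, z)
        gp = 1.0 - h * dfdx(t + h, z)       # g'(z)
        z = z - g / gp
    return z

# Stiff toy problem dx/dt = -50 x with h = 0.1: explicit Euler has growth
# factor |1 - 50 h| = 4 per step and blows up, while backward Euler stays stable.
f = lambda t, x: -50.0 * x
dfdx = lambda t, x: -50.0
x_imp = x_exp = 1.0
t, h = 0.0, 0.1
for _ in range(10):
    x_imp = backward_euler_step(f, dfdx, t, x_imp, h)
    x_exp = x_exp + h * f(t, x_exp)
    t += h
print(x_imp, x_exp, np.exp(-50.0 * t))      # stable vs. divergent vs. exact
```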
2.3 Local Truncation Error and Global Truncation Error
In numerical methods for solving ordinary differential equations (ODEs), errors arise because we approximate the continuous solution with discrete steps. Two key concepts are the Local Truncation Error (LTE) and the Global Truncation Error (GTE), which are the common indicators used to measure errors.
Local Truncation Error (LTE): This is the error introduced in a single step of the numerical method, assuming the solution at the previous step is exact. It measures how well the method approximates the true solution over one step, based on the Taylor series expansion of the exact solution.
Mathematically, if the exact solution at $t_n$ is $x(t_n)$, the LTE at step $n+1$ is:
\[\tau_{n+1} = x(t_{n+1}) - x_{n+1}^{\text{approx}}\]where $x_{n+1}^{\text{approx}}$ is computed using the method with exact input $x(t_n)$.
Global Truncation Error (GTE): This is the total accumulated error at the end of the integration interval (e.g., from $t_0$ to $T$), considering that errors from previous steps propagate forward. It depends on the number of steps $N = (T - t_0)/h$, and for stable methods, GTE is typically $O(h^p)$ if LTE is $O(h^{p+1})$. The relationship is roughly GTE $\approx$ (number of steps) $\times$ LTE, but propagation can amplify or dampen it.
Below, we use the Euler method and the Heun method as examples to demonstrate how to obtain their LTE and GTE.
2.3.1 Example with Euler Method
The forward Euler method is a first-order method:
\[x_{n+1} = x_n + h f(t_n, x_n)\]Deriving LTE
Assume exact input: Start with $x_n = x(t_n)$.
The method approximates: $x_{n+1}^{\text{approx}} = x(t_n) + h f(t_n, x(t_n))$.
From Taylor: $x(t_n + h) = x(t_n) + hf(t_n, x(t_n)) + \frac{h^2}{2} x''(t_n) + O(h^3)$.
Subtract: LTE $\tau_{n+1} = x(t_n + h) - x_{n+1}^{\text{approx}} = \frac{h^2}{2} x''(t_n) + O(h^3)$.
Thus, LTE = $O(h^2)$. (The leading term is quadratic in $h$.)
This shows how to arrive: The Euler update matches the first two Taylor terms (constant + linear), so the error starts from the quadratic term.
Deriving GTE
Let $e_n = x(t_n) - x_n$ be the global error at step $n$.
The error recurrence: $e_{n+1} = e_n + h [f(t_n, x(t_n)) - f(t_n, x_n)] + \tau_{n+1}$.
For Lipschitz-continuous $f$ (with constant $L$), $\|f(t_n, x(t_n)) - f(t_n, x_n)\| \leq L \|e_n\|$, so: \(\|e_{n+1}\| \leq \|e_n\| (1 + h L) + \|\tau_{n+1}\|\)
Since $\tau_{n+1} = O(h^2)$, over $N$ steps ($N \approx 1/h$), the accumulated error is bounded by $\|e_N\| \leq C h$ for some constant $C$ (from summing the geometric series of propagated local errors).
Thus, GTE = $O(h)$. (Errors accumulate linearly with $1/h$ steps, reducing the order by 1.)
Cause: Each local error adds up, and propagation (via $1 + hL$) amplifies like compound interest, but stability ensures it’s bounded by $O(h)$.
2.3.2 Example with Heun Method
The Heun method (a second-order Runge-Kutta) improves on Euler with a predictor-corrector:
\[k_1 = f(t_n, x_n), \quad k_2 = f(t_n + h, x_n + h k_1), \quad x_{n+1} = x_n + \frac{h}{2} (k_1 + k_2)\]Deriving LTE
Assume exact input: $x_n = x(t_n)$.
Expand $k_1 = f(t_n, x(t_n)) = x'(t_n)$.
Predictor: $x_n + h k_1 = x(t_n) + h x'(t_n)$.
$k_2 = f(t_n + h, x(t_n) + h x'(t_n))$. Taylor-expand $f$ in two variables:
\[\begin{align} f(t_n + h, x(t_n) + h x'(t_n)) = & f(t_n, x(t_n)) + h \left( f_t + x'(t_n) f_x \right) \\[10pt] & + \frac{h^2}{2} \left( f_{tt} + 2 x' f_{tx} + (x')^2 f_{xx} \right) + O(h^3) \end{align}\]But since $x'' = \frac{\partial f}{\partial t} + x' \frac{\partial f}{\partial x}$, $k_2 = x' + h x'' + \frac{h^2}{2} x''' + O(h^3)$.
Then, \(\begin{align} x_{n+1}^{\text{approx}} & = x(t_n) + \frac{h}{2} (x' + x' + h x'' + \frac{h^2}{2} x''' + O(h^3)) \\[10pt] & = x(t_n) + h x' + \frac{h^2}{2} x'' + \frac{h^3}{4} x''' + O(h^4) \end{align}\)
True: $x(t_n + h) = x(t_n) + h x' + \frac{h^2}{2} x'' + \frac{h^3}{6} x''' + O(h^4)$.
Subtract: LTE $\tau_{n+1} = \left( \frac{h^3}{6} - \frac{h^3}{4} \right) x''' + O(h^4) = -\frac{h^3}{12} x''' + O(h^4)$. Thus, LTE = $O(h^3)$. (Matches up to the quadratic term; the error comes from the cubic term.)
Deriving GTE
Similar recurrence:
\[\begin{align} e_{n+1} =\ & e_n + \frac{h}{2} [ (f(t_n, x(t_n)) - f(t_n, x_n))] \\[10pt] &+ \frac{h}{2}[(f(t_{n+1}, x(t_{n+1})) - f(t_{n+1}, x_n + h k_1)) ] + \tau_{n+1} \end{align}\]Using the Lipschitz constant $L$, the perturbation terms are $O(e_n)$ and $O(h e_n + e_n)$, leading to $\|e_{n+1}\| \leq (1 + h L + O(h^2)) \|e_n\| + \|\tau_{n+1}\|$.
With $\tau_{n+1} = O(h^3)$, over $N \approx 1/h$ steps, the bound sums to $|e_N| \leq C h^2$ (accumulated as $N \times O(h^3) = O(h^2)$).
Thus, GTE = $O(h^2)$. (Higher local order leads to better global convergence.)
Cause: Fewer accumulations are needed for the same accuracy, and the method’s bounded stability region keeps error propagation under control for sufficiently small $h$.
In summary, LTE is per-step (higher order means smaller), while GTE is overall (reduced by 1 order due to accumulation). For Euler (order 1), halving $h$ halves GTE; for Heun (order 2), it quarters GTE—making higher-order methods more efficient for accuracy.
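The order statements above are easy to verify numerically. The sketch below (a minimal NumPy experiment, with a toy ODE chosen for illustration) integrates $dx/dt=-x$ to $t=1$ with both methods and prints the global error as the step size is halved: the Euler error roughly halves while the Heun error roughly quarters.

```python
import numpy as np

def solve(step, f, x0, T, N):
    # Integrate dx/dt = f(t, x) from t = 0 to t = T with N uniform steps.
    h, t, x = T / N, 0.0, x0
    for _ in range(N):
        x = step(f, t, x, h)
        t += h
    return x

euler = lambda f, t, x, h: x + h * f(t, x)

def heun(f, t, x, h):
    k1 = f(t, x)
    k2 = f(t + h, x + h * k1)
    return x + 0.5 * h * (k1 + k2)

f, exact = (lambda t, x: -x), np.exp(-1.0)      # dx/dt = -x, x(1) = e^{-1}
for N in (10, 20, 40):                          # h is halved each time
    e_euler = abs(solve(euler, f, 1.0, 1.0, N) - exact)
    e_heun = abs(solve(heun, f, 1.0, 1.0, N) - exact)
    print(N, e_euler, e_heun)   # Euler error ~halves (O(h)); Heun error ~quarters (O(h^2))
```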
2.4 The Limitations of Numerical ODE Solvers for Diffusion Sampling
Although the PF ODE formulation enables deterministic sampling, directly applying standard numerical ODE solvers (e.g., explicit Runge-Kutta methods like RK4, Euler methods, or adaptive solvers like Dormand-Prince) is often suboptimal or problematic for diffusion models. These general-purpose solvers do not fully exploit the specific structure of diffusion ODEs, leading to inefficiencies and quality issues, particularly in high-dimensional spaces like image generation. Below, we list the key reasons, drawn from analyses of diffusion sampling challenges:
- Numerical instability and stiffness issues: Diffusion PF-ODEs are semi-linear and can be stiff, especially in high dimensions, causing standard explicit solvers to become unstable with large step sizes. This results in exploding gradients or divergence, requiring tiny steps that increase computational cost and negate speed advantages.
- High discretization errors in few-step regimes: With fewer integration steps (e.g., 10–20 instead of 1000), traditional solvers accumulate significant truncation errors, causing the approximated trajectory to deviate from the true ODE path. This leads to degraded sample quality, such as artifacts or lower fidelity, as the solver fails to accurately track the probability flow.
- Mismatch with model training objectives: Diffusion models are trained to optimize score-matching losses, not to minimize ODE integration errors. Pursuing better ODE solving can paradoxically worsen perceptual quality, as seen in consistency models where tighter approximations to the PF ODE reduce sample fidelity due to inconsistencies between training and inference.
- Inefficiency for guided or conditional sampling: Standard solvers do not inherently handle constraints like classifier guidance or conditional generation efficiently, often requiring additional modifications that increase function evaluations (NFEs) or fail to maintain distribution matching.
- Lack of exploitation of the semi-linear structure: Diffusion ODEs have a specific semi-linear form (linear drift plus nonlinear score term), which general solvers ignore, leading to suboptimal performance. Without tailored approximations, they require more NFEs for convergence, making them slower than specialized methods.
These problems motivate the development of specialized solvers that incorporate higher-order approximations, exploit the ODE’s structure, and provide convergence guarantees for fast, high-quality sampling in low-NFE settings.
3. Pseudo-Numerical Methods for Diffusion Models
Now we apply the ODE numerical method to the diffusion model. PF-ODEs in diffusion have a semi-linear form
\[\frac{dx}{dt}=a(t)\,x + b(t)\,\underbrace{\mathcal{M}(x,t)}_{\epsilon_\theta,\ s_\theta,\ \text{or } v_\theta}.\]The linear drift $a(t)\,x$ is handled analytically via the training schedule (the $\alpha,\bar\alpha$ coefficients), so the step-to-step state update can be written as a closed one-step map (DDIM) that is linear in the model output for the current time. What we improve, then, is only the model output ($\mathcal{M}$) at the current step—using multistep extrapolation/correction—before plugging it into that one-step map.
Based on this observation, PNDM 1 uses the same ideas as the classical ODE methods of Section 2, but applies them to the model output only, while the linear drift is absorbed by the schedule:
\[\frac{dx}{dt}=a(t)\,x + b(t)\,\underbrace{\mathcal{M}(x,t)}_{\epsilon_\theta,\ s_\theta,\ \text{or }v_\theta} \,\Rightarrow\, x_{t-\Delta}=\Psi\!\big(x_t,\ \widehat{\mathcal{M}}_t;\ \text{schedule}\big)\label{eq:17}\]- $\Psi$ is the schedule-aware one-step map (DDIM). It is linear in the model output and already handles the linear term $a(t)x$ analytically (integrating-factor view).
- The only thing we improve is $\widehat{\mathcal{M}}_t$ at the current step—via multistep extrapolation (PLMS) or in-step multi-stage blending (PRK).
Keypoint: treat the linear part exactly, raise the order on the neural part. This is the unifying principle behind the PNDM family.
3.1 PLMS - Pseudo Linear Multistep (Linear multi-step on model outputs)
PLMS 1 applies an Adams–Bashforth $k$-step formula to the model outputs, producing a more accurate $\widehat{\epsilon}_t$ (or $\widehat{s}_t,\widehat{v}_t$) before calling one step of $\Psi$. Taking $\epsilon$-prediction as an example, PLMS improves the model output via multistep extrapolation like any AB$k$ method
\[\hat\epsilon_{t} \;=\; \sum_{j=0}^{k-1} c^{(k)}_j\,\epsilon_{t-j} \quad\text{with}\quad \begin{cases} \text{AB2: } \ \hat\epsilon_t=\tfrac{3}{2}\epsilon_t - \tfrac{1}{2}\epsilon_{t-1}\\[2pt] \text{AB3: } \ \hat\epsilon_t=\tfrac{23}{12}\epsilon_t - \tfrac{16}{12}\epsilon_{t-1}+\tfrac{5}{12}\epsilon_{t-2}\\[2pt] \text{AB4: } \ \hat\epsilon_t=\tfrac{55}{24}\epsilon_t - \tfrac{59}{24}\epsilon_{t-1}+\tfrac{37}{24}\epsilon_{t-2}-\tfrac{9}{24}\epsilon_{t-3} \end{cases}\]After the model output is improved (from $\epsilon_{t}$ to $\hat\epsilon_{t}$), we plug $\hat\epsilon_{t}$ into the one-step map $\Psi$ to get $x_{t-\Delta}$. We apply the improved model output $\hat\epsilon_{t}$ to the DDIM update
\[\hat x_0=\frac{x_t-\sqrt{1-\bar\alpha_t}\,\widehat{\epsilon}_t}{\sqrt{\bar\alpha_t}},\qquad x_{t-\Delta} =\sqrt{\bar\alpha_{t-\Delta}}\ \hat x_0+\sqrt{1-\bar\alpha_{t-\Delta}}\ \widehat{\epsilon}_t .\](For score or $v$ parameterizations, only the linear map between model output and $\hat x_0$ changes.)
Like any $k$-step method, PLMS needs $k-1$ past outputs; initialize with DDIM or a short PRK warm-up. AB$k$ gives an $O(h^k)$ approximation to the IF-scaled slope (the model output after the integrating factor). Since $\Psi$ is linear in that output with bounded coefficients, the order transfers to the state (consistency + zero-stability).
After warm-up, PLMS uses one model evaluation per step (reusing history). It is fast and accurate in the 20–50 step regime; pure extrapolation can overshoot under very large steps or strong guidance, where a corrector or PRK warm-up helps.
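As a concrete illustration of "AB on model outputs + one DDIM map", here is a minimal PyTorch-style sketch of a single PLMS step. The names `eps_model` (the trained noise predictor), `abar` (a mapping from timestep to its $\bar\alpha$ value as a tensor), and the warm-up policy are assumptions for illustration, not a specific library API.

```python
import torch

def ddim_map(x_t, eps_hat, abar_t, abar_prev):
    # Schedule-aware one-step map Psi (deterministic DDIM update):
    # the linear drift is handled exactly through the alpha-bar coefficients.
    x0_hat = (x_t - (1 - abar_t).sqrt() * eps_hat) / abar_t.sqrt()
    return abar_prev.sqrt() * x0_hat + (1 - abar_prev).sqrt() * eps_hat

def plms_step(eps_model, x_t, t, t_prev, abar, eps_history):
    # Adams-Bashforth extrapolation on the *model outputs*, then one DDIM map.
    eps_t = eps_model(x_t, t)                     # 1 NFE per step after warm-up
    hist = [eps_t] + eps_history                  # newest output first
    if len(hist) >= 4:                            # AB4 on model outputs
        e = (55 * hist[0] - 59 * hist[1] + 37 * hist[2] - 9 * hist[3]) / 24
    elif len(hist) == 3:                          # AB3
        e = (23 * hist[0] - 16 * hist[1] + 5 * hist[2]) / 12
    elif len(hist) == 2:                          # AB2
        e = (3 * hist[0] - hist[1]) / 2
    else:                                         # warm-up: plain DDIM
        e = eps_t
    x_prev = ddim_map(x_t, e, abar[t], abar[t_prev])
    return x_prev, hist[:3]                       # keep at most 3 past outputs
```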
3.2 PRK — Pseudo Runge–Kutta (in-step multi-stage on model outputs)
The core idea of PRK is to improve the model output ($\widehat{\epsilon}_t,\,\widehat{s}_t,\,\widehat{v}_t$) using in-step stages (RK2/RK4), where intermediate states are obtained by $\Psi$ (the DDIM update step, Equation \ref{eq:17}).
PRK2 (Heun-like) in equations
\[\begin{aligned} k_1 &= \epsilon_\theta(x_t,\ t),\\[8pt] x_{\text{mid}} &= \Psi\!\big(x_t,\ k_1;\ \tfrac{\Delta}{2}\big),\\[8pt] k_2 &= \epsilon_\theta\!\big(x_{\text{mid}},\ t-\tfrac{\Delta}{2}\big),\\[8pt] \widehat{\epsilon}_t &= \tfrac{1}{2}\,(k_1+k_2),\\[8pt] x_{t-\Delta} &= \Psi\!\big(x_t,\ \widehat{\epsilon}_t;\ \Delta\big). \end{aligned}\]PRK4 (RK4-like) in equations
\[\begin{aligned} k_1 &= \epsilon_\theta(x_t,\ t),\\[8pt] x^{(1)} &= \Psi\!\big(x_t,\ k_1;\ \tfrac{\Delta}{2}\big),\qquad k_2 = \epsilon_\theta\!\big(x^{(1)},\ t-\tfrac{\Delta}{2}\big),\\[8pt] x^{(2)} &= \Psi\!\big(x_t,\ k_2;\ \tfrac{\Delta}{2}\big),\qquad k_3 = \epsilon_\theta\!\big(x^{(2)},\ t-\tfrac{\Delta}{2}\big),\\[8pt] x^{(3)} &= \Psi\!\big(x_t,\ k_3;\ \Delta\big),\qquad k_4 = \epsilon_\theta\!\big(x^{(3)},\ t-\Delta\big),\\[8pt] \widehat{\epsilon}_t &= \tfrac{1}{6}\,\big(k_1+2k_2+2k_3+k_4\big),\\[8pt] x_{t-\Delta} &= \Psi\!\big(x_t,\ \widehat{\epsilon}_t;\ \Delta\big). \end{aligned}\]Compared with PLMS, PRK does not require warm-up. Instead, PRK is often used as a warm-up to seed the PLMS history in the first few steps. The higher the order of the simulated RK scheme, the more NFEs each PRK step requires.
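A matching sketch of a PRK2 step is shown below; it reuses the hypothetical `ddim_map` helper from the PLMS sketch above, and the model/schedule names are again assumptions rather than a specific library API.

```python
def prk2_step(eps_model, x_t, t, t_mid, t_prev, abar):
    # PRK2 (Heun-like) on model outputs; intermediate states come from ddim_map.
    k1 = eps_model(x_t, t)
    x_mid = ddim_map(x_t, k1, abar[t], abar[t_mid])        # half-step with k1
    k2 = eps_model(x_mid, t_mid)                           # slope at the midpoint
    eps_hat = 0.5 * (k1 + k2)                              # blended model output
    return ddim_map(x_t, eps_hat, abar[t], abar[t_prev])   # full step with the blend
```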
3.3 Putting them together: the PNDM scheduler
A practical recipe that many implementations follow:
PRK warm-up (1–4 steps). Use RK2/RK4-style PRK to generate reliable early-step outputs and populate the history buffer.
PLMS main loop. Switch to PLMS (AB3/AB4 on model outputs) for speed: 1 NFE/step after warm-up.
(Optional) pseudo-corrector. If steps are very large or guidance is strong, add an AM-style blend on outputs (PECE): predict with AB → move with $\Psi$ → re-evaluate $\epsilon$ at the new end → correct with an AM-like average on outputs → final $\Psi$ move.
Implementation notes (applies to PLMS & PRK)
- Parameterization. The same designs work for $\epsilon$, score $s$, or $v$. Only the linear map inside $\Psi$ changes.
- Schedules. For non-uniform grids (Karras/EDM/cosine), compute AB/AM weights from the current local grid; conceptually they are integrals of step-dependent Lagrange bases.
- Warm-up. If you skip PRK, a short DDIM or RK2 warm-up suffices to seed AB3/AB4.
- Guidance. Strong classifier-free guidance increases effective Lipschitz constants; prefer PRK warm-up and/or output-AM correction.
- Cost. PRK $>$ PLMS per step; PLMS wins on long runs. Both keep 1 linear $\Psi$ map per step; no implicit solve on the state is needed.
Takeaway. Pseudo-numerical samplers = high-order on model outputs + one schedule-aware state map. Read through section 2 with this in mind and you’ll see PLMS is the multistep (AB) track, PRK the multi-stage (RK) track, and the optional AM-style blend is the familiar predictor–corrector from classical ODEs—transplanted into diffusion with the integrating-factor lens.
4. High-Order and Semi-Analytic Solvers
The pseudo-numerical methods in section 3 (PNDM/PLMS/PRK) work impressively well in the mid-NFE regime, but they start to show structural limitations precisely when we push for very low NFE or strong classifier-free guidance. Their core idea—polynomial extrapolation of model outputs wrapped inside a one-step DDIM-style map—does not exploit the semi-linear nature of the probability-flow ODE. The linear drift is only approximated, so bias grows as steps get larger; stability regions are narrow, yielding oscillations or divergence under aggressive schedules and high guidance scales; multistep history creates warm-up fragility and makes step-grid changes brittle; and error constants tie tightly to the chosen time parameterization, so moving to non-uniform or log-SNR grids can be hit-or-miss. Most importantly, there is no principled route to raise order while preserving stability and schedule awareness—the coefficients are those of polynomial extrapolators, not of the underlying semi-linear flow. These shortcomings motivate section 4: a family of high-order, structure-exploiting ODE solvers (DEIS, DPM-Solver++, UniPC, etc.) that integrate the linear part exactly and apply higher-order, schedule-aware quadrature to the nonlinear/model part, delivering markedly better accuracy–stability trade-offs at 10–30 NFEs and under strong guidance.
4.1 DEIS
4.2 DPM-Solvers
DPM-Solver 2 is a family of high-order, structure-exploiting ODE solvers tailored to the PF-ODE of diffusion models. Instead of handing the entire right-hand side to a black-box ODE solver, it (i) solves the linear part analytically and (ii) changes variables from time $t$ to log-SNR space, which converts the nonlinear part into an exponentially weighted integral of the network output. Approximating that special integral with low-degree polynomials yields 1st/2nd/3rd-order updates that converge in the few-step regime.
4.2.1 Setup: from PF-ODE to an Exponentially Weighted Integral
Recall from our discussion in a previous post the (backward) PF-ODE, written with the forward process $(s,\sigma)$ pair and the score model $s_\theta$:
\[x_t=s(t)x_0+\sigma(t)\epsilon\,\,\Longrightarrow\,\,\frac{dx_t}{dt}=f(t)x_t-\frac{g^2(t)}{2}s_{\theta}(x_t, t)\label{eq:20}\]with
\[f(t)=\frac{s'(t)}{s(t)}\quad \text{and} \quad g^2(t)=2\sigma(t)\sigma'(t)-2 \frac{s'(t)}{s(t)} \sigma^2(t)\]Using variation of constants over one step from $t_i$ to $t_{i-1}$ (backward integration), the exact one-step map decomposes into a linear part plus a nonlinear integral:
\[x_{t_{i-1}} = C_1 x_{t_i} + C_2\]where the linear coefficient is
\[C_1 = \exp\left( \displaystyle \int_{t_i}^{t_{i-1}} f(s) \, \mathrm{d}s \right) = \exp\left( \displaystyle \int_{t_i}^{t_{i-1}} \frac{s'(s)}{s(s)} \, \mathrm{d}s \right) = \frac{s(t_{i-1})}{s(t_i)}\label{eq:23}\]and the nonlinear term is
\[C_2 = - \exp\left( \displaystyle \int_{0}^{t_{i-1}} f(s) \, \mathrm{d}s \right) \displaystyle \int_{t_i}^{t_{i-1}} \frac{g^2(t)}{2} s_{\theta}(x_t, t) \exp\left( -\displaystyle \int_{0}^{t} f(s) \, \mathrm{d}s \right) \, \mathrm{d}t\]To further simplify this nonlinear term, we introduce log-SNR $\lambda(t)$:
\[\lambda(t) = \log \frac{s(t)}{\sigma(t)} \quad \Longrightarrow \quad \frac{\mathrm{d}\lambda(t)}{\mathrm{d}t} = \frac{\mathrm{d}\log s(t)}{\mathrm{d}t} - \frac{\mathrm{d}\log \sigma(t)}{\mathrm{d}t}\]A short calculation shows
\[\begin{align} g^2(t) & =2\sigma(t)\sigma'(t)-2 \frac{s'(t)}{s(t)} \sigma^2(t) \\[10pt] & = 2\sigma^2(t)\frac{\mathrm{d}\log \sigma(t)}{\mathrm{d}t} - 2\sigma^2(t)\frac{\mathrm{d}\log s(t)}{\mathrm{d}t} \\[10pt] & = -2\sigma^2(t)\frac{\mathrm{d}\lambda(t)}{\mathrm{d}t} \end{align}\]Changing variables $t \mapsto \lambda$ then collapses the nonlinear term into an exponentially weighted integral of $\epsilon_\theta$:
\[\begin{align} C_2 & = - \exp\left( \displaystyle \int_{0}^{t_{i-1}} f(s) \, \mathrm{d}s \right) \displaystyle \int_{t_i}^{t_{i-1}} \frac{g^2(t)}{2} s_{\theta}(x_t, t) \exp\left( -\displaystyle \int_{0}^{t} f(s) \, \mathrm{d}s \right) \, \mathrm{d}t \\[10pt] & = \int_{t_i}^{t_{i-1}} \sigma^2(t)\frac{\mathrm{d}\lambda(t)}{\mathrm{d}t} s_{\theta}(x_t, t) \exp\left( \displaystyle \int_{t}^{t_{i-1}} f(s) \, \mathrm{d}s \right) \, \mathrm{d}t \\[10pt] & = \int_{t_i}^{t_{i-1}} \sigma^2(t)\frac{\mathrm{d}\lambda(t)}{\mathrm{d}t} \frac{-\epsilon_{\theta}(x_t, t)}{\sigma(t)} \exp\left( \displaystyle \int_{t}^{t_{i-1}} f(s) \, \mathrm{d}s \right) \, \mathrm{d}t \\[10pt] & = -s(t_{i-1}) \int_{\lambda_{t_i}}^{\lambda_{t_{i-1}}} e^{-\lambda} \epsilon_{\theta}(x_{\lambda}, \lambda) \mathrm{d}\lambda\label{eq:32} \end{align}\]Putting equation \ref{eq:23} and equation \ref{eq:32} together, given an initial value $x_{t_i}$ at time $t=t_i$, the solution $x_{t_{i-1}}$ at time $t=t_{i-1}$ of the diffusion ODE in equation \ref{eq:20} is
\[\require{color} \definecolor{lightred}{rgb}{1, 0.9, 0.9} \fcolorbox{red}{lightred}{ $ \displaystyle x_{t_{i-1}} = \frac{s(t_{i-1})}{s(t_i)} x_{t_i} - s(t_{i-1}) \int_{\lambda_{t_i}}^{\lambda_{t_{i-1}}} e^{-\lambda} \, \epsilon_{\theta}(x_{\lambda}, \lambda) \, \mathrm{d}\lambda $ }\label{eq:33}\]
Keypoint. PF-ODE has a semi-linear structure. Solving the linear drift exactly and approximating only the model-dependent part—under a log-SNR reparameterization—is what enables high order with very few NFEs.
4.2.2 $\varphi$-functions inside: From the integral to high-order updates
Define the log-SNR step size $h$ and a local coordinate $\tau$ on the interval:
\[h := \lambda_{t_{i-1}} - \lambda_{t_i} > 0, \qquad \tau := \lambda - \lambda_{t_i} \in [0,h]\]Then $ \lambda = \lambda_{t_i} + \tau$ and
\[e^{-\lambda} = e^{-(\lambda_{t_i}+\tau)} = e^{-\lambda_{t_{i-1}}}\,e^{\,h-\tau}\]Substitute into the integral in \ref{eq:33} and use $d\lambda=d\tau$:
\[\int_{\lambda_{t_i}}^{\lambda_{t_{i-1}}} e^{-\lambda}\,\epsilon_\theta(x_\lambda,\lambda)\,d\lambda = e^{-\lambda_{t_{i-1}}}\int_{0}^{h} e^{\,h-\tau}\, \epsilon_\theta\big(x_{\lambda_{t_i}+\tau},\lambda_{t_i}+\tau\big)\,d\tau.\]Since $s(t_{i-1})e^{-\lambda_{t_{i-1}}}=\sigma(t_{i-1})$, the nonlinear piece in \ref{eq:33} becomes
\[s(t_{i-1})\!\int e^{-\lambda}\epsilon_\theta(\lambda)\,d\lambda = \sigma(t_{i-1})\!\int_{0}^{h} e^{\,h-\tau}\, \epsilon_\theta\big(\lambda_{t_i}+\tau\big)\,d\tau\label{eq:37}\]Step 1 — Local Taylor model of the network output
Expand the noise-prediction $\epsilon_\theta$ in $\lambda$ around $\lambda_{t_i}$ (derivatives w.r.t. $\lambda$):
\[\epsilon_\theta(\lambda_{t_i}+\tau) = \sum_{m=0}^{p}\frac{\tau^m}{m!}\,\epsilon^{(m)}_i \;+\; \underbrace{\frac{\tau^{p+1}}{(p+1)!}\,\epsilon^{(p+1)}(\lambda_{t_i}+\xi_\tau)}_{\text{Lagrange remainder}}\label{eq:38}\]where $\epsilon_i := \epsilon_\theta(x_{\lambda_{t_i}},\lambda_{t_i})$ and \(\epsilon^{(m)}_i := \partial_\lambda^m \epsilon_\theta(x_{\lambda},\lambda)\,\mid_{\lambda=\lambda_{t_i}}\)
Step 2 — Pull out the exponential weight and normalize the interval
Plug \ref{eq:38} into \ref{eq:37}, integrate term by term, and normalize $[0,h]$ to $[0,1]$ with $\theta=\tau/h$ (so $d\tau=h\,d\theta$):
\[\int_0^h e^{\,h-\tau}\,\frac{\tau^m}{m!}\,d\tau=\; h^{m+1}\int_0^1 e^{\,h(1-\theta)}\,\frac{\theta^m}{m!}\,d\theta\label{eq:39}\]Step 3 — Meet the $\varphi$-functions
The $\varphi$-functions 3 used in exponential integrators are defined by
\[\varphi_k(h) \;:=\; \int_0^1 e^{(1-\theta)h}\,\frac{\theta^{k-1}}{(k-1)!}\,d\theta \;\;=\;\; \sum_{r=0}^{\infty}\frac{h^r}{(r+k)!}, \quad k\ge 1,\]with the first few given in closed form:
\[\varphi_1(h) =\frac{e^h-1}{h} \quad \varphi_2(h) =\frac{e^h-h-1}{h^2} \quad \varphi_3(h)=\frac{e^h-1-h-h^2/2}{h^3},\ \dots\]Comparing with \ref{eq:39} and setting $k=m+1$, we get the closed form
\[\int_0^h e^{\,h-\tau}\,\frac{\tau^m}{m!}\,d\tau = h^{m+1}\,\varphi_{m+1}(h)\label{eq:42}\]Step 4 — Assemble the unified series
Now let’s simplify the integral term in Formula \ref{eq:37}:
\[\begin{align} & \int_{0}^{h} e^{\,h-\tau}\, \epsilon_\theta(\lambda_{t_i}+\tau)\,d\tau = \int_{0}^{h} e^{\,h-\tau}\,\left( \sum_{m=0}^{p}\frac{\tau^m}{m!}\,\epsilon^{(m)}_i \;+\; \underbrace{\frac{\tau^{p+1}}{(p+1)!}\,\epsilon^{(p+1)}_{\eta}}_{\text{Lagrange remainder}} \right)d\tau \\[10pt] = \quad & \left( \sum_{m=0}^{p} \int_{0}^{h} e^{\,h-\tau}\,\frac{\tau^m}{m!}\,\epsilon^{(m)}_id\tau \right) + \left( \underbrace{\int_{0}^{h} e^{\,h-\tau}\,\frac{\tau^{p+1}}{(p+1)!}\,\epsilon^{(p+1)}_{\eta}d\tau}_{\text{Lagrange remainder}} \right) \\[10pt] = \quad & \left(\sum_{m=0}^{p} h^{m+1}\epsilon^{(m)}_i\int_0^1 e^{\,h(1-\theta)}\,\frac{\theta^m}{m!}\,d\theta\, \right) + \left( h^{p+2}\epsilon^{(p+1)}_{\eta}\int_0^1 e^{\,h(1-\theta)}\,\frac{\theta^{p+1}}{(p+1)!}\,d\theta \right) \\[10pt] = \quad & \left( \sum_{m=0}^{p} h^{m+1}\,\varphi_{m+1}(h)\,\epsilon^{(m)}_i \; \right) +\; \left( \underbrace{h^{p+2}\,\varphi_{p+2}(h)\,\widetilde{R}_{p+1}}_{\mathcal{O}(h^{p+2})} \right), \end{align}\]where \(\widetilde{R}_{p+1}\) collects a bounded combination of \(\epsilon^{(p+1)}\) on \(\eta \in [\lambda_{t_i},\lambda_{t_{i-1}}]\). Multiply by $\sigma(t_{i-1})$ and put everything back into \ref{eq:33}:
\[\require{color} \definecolor{lightred}{rgb}{1, 0.9, 0.9} \fcolorbox{red}{lightred}{ $ \displaystyle x_{t_{i-1}} = \frac{s(t_{i-1})}{s(t_i)}\,x_{t_i} \;-\; \sigma(t_{i-1})\Big[ h\,\varphi_1(h)\,\epsilon_i + h^2\,\varphi_2(h)\,\epsilon'_i + {h^3}\,\varphi_3(h)\,\epsilon''_i + \cdots \Big] $ } \label{eq:45}\]Approximating \(\epsilon'_i,\epsilon''_i,\dots\) with a small number of extra evaluations (finite differences in $\lambda$) yields practical 1/2/3-order schemes.
Takeaway. “Linear exact + $\varphi$-weighted polynomial” is the entire story. The $\varphi_k$ compress the exponential weight and the Taylor polynomial into stable, closed-form coefficients, enabling high order in few steps.
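To make the $\varphi_k$ concrete, here is a small Python helper that evaluates them via the closed forms (using the recurrence $\varphi_{k+1}(h)=(\varphi_k(h)-1/k!)/h$, which follows from the definitions above) and falls back to the series for tiny $h$ to avoid cancellation; the tolerance value is an arbitrary illustrative choice.

```python
import math

def phi(k, h, tol=1e-6):
    # phi_k(h) for exponential integrators; series fallback for tiny |h|.
    if abs(h) < tol:
        return sum(h**r / math.factorial(r + k) for r in range(6))
    val = math.expm1(h) / h                          # phi_1(h) = (e^h - 1)/h
    for j in range(1, k):
        val = (val - 1.0 / math.factorial(j)) / h    # phi_{j+1} = (phi_j - 1/j!)/h
    return val

print(phi(2, 0.5), (math.exp(0.5) - 1.5) / 0.25)     # check against the closed form
```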
4.2.3 Single-step DPM-Solver (ε-prediction view)
Below we list the standard single-step (no history) formulas in the $\epsilon$-prediction parameterization. Using $x_0$- or $v$-prediction only changes the linear maps between the model head and $\epsilon$. Please notice that in the following discussion, we use $\tilde x$ to represent an approximation of the true value $x$.
DPM-Solver-1 (Order-1, 1-NFE). Constant approximation of $\epsilon$ over $[\lambda_i,\lambda_{i-1}]$:
\[\begin{align} \tilde x_{t_{i-1}} & = \frac{s(t_{i-1})}{s(t_i)}\,x_{t_i} \;-\; \sigma(t_{i-1})\,h\,\varphi_1(h)\,\epsilon_{i} \\[10pt] & = \frac{s(t_{i-1})}{s(t_i)}\,x_{t_i} \;-\; \sigma(t_{i-1})\,(e^h-1)\,\epsilon_{i}\label{eq:49} \end{align}\]For small $h$ this matches a DDIM-like first-order step; for large $h$, $\varphi_1$ preserves the correct exponential scaling.
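A minimal sketch of one DPM-Solver-1 step in code, assuming schedule callables `s_of`, `sigma_of`, `lam_of` (log-SNR) and a noise predictor `eps_model`; all of these names are illustrative assumptions:

```python
import math

def dpm_solver_1_step(eps_model, x_t, t_i, t_prev, s_of, sigma_of, lam_of):
    # x_{t_{i-1}} = (s(t_{i-1})/s(t_i)) * x_{t_i} - sigma(t_{i-1}) * (e^h - 1) * eps_i
    h = lam_of(t_prev) - lam_of(t_i)     # log-SNR step, h > 0
    eps_i = eps_model(x_t, t_i)
    return (s_of(t_prev) / s_of(t_i)) * x_t - sigma_of(t_prev) * math.expm1(h) * eps_i
```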
DPM-Solver-2 (Order-2, 2-NFEs). One predict-correct pass to estimate $\epsilon'_i$.
Select a mid point $t_s$ between $t_i$ and $t_{i-1}$, with corresponding log-SNR $\lambda_{t_s}$:
\[\lambda_{t_s} = \lambda_{t_i}+rh,\qquad t_s = t_{\lambda}(\lambda_{t_s})\]Predict with the Order-1 DPM-Solver to get a provisional state $\tilde x_{t_s}$ at $t_s$.
\[{\tilde x_{t_s}}=\frac{s(t_{s})}{s(t_i)}x_{t_i} - \sigma(t_{s})\,(e^{rh}-1)\,\epsilon_{\theta}(x_{t_i}, t_i)\]Approximating First-Order derivatives \(\epsilon'_{\theta}(x_{t_i}, t_i)\) using Finite Differences
\[\epsilon'_{\theta}(x_{t_i}, t_i) \approx \frac{\epsilon_{\theta}({\tilde x_{t_s}}, t_s)\,-\,\epsilon_{\theta}(x_{t_i}, t_i)}{\lambda_{t_s} - \lambda_{t_i}} = \frac{\epsilon_{\theta}({\tilde x_{t_s}}, t_s)\,-\,\epsilon_{\theta}(x_{t_i}, t_i)}{rh}\]Plug into the formula \ref{eq:45} and update:
\[\begin{align} \tilde x_{t_{i-1}} & = \frac{s(t_{i-1})}{s(t_i)}\,x_{t_i}\;-\; \sigma(t_{i-1})\,\left( h\,\varphi_1(h)\,\epsilon_{\theta}(x_{t_i}, t_i) + h^2\,\varphi_2(h)\,\epsilon'_{\theta}(x_{t_i}, t_i) \right) \\[10pt] & = \frac{s(t_{i-1})}{s(t_i)}\,x_{t_i}\;-\; \sigma(t_{i-1})\,(e^h-1)\,\left(\epsilon_{i} + \frac{\epsilon_{s}\,-\,\epsilon_{i}}{2r} \right) \end{align}\]In the second line, $h^2\varphi_2(h)\,\epsilon'_{\theta}$ is replaced by $\frac{e^h-1}{2r}\,(\epsilon_{s}-\epsilon_{i})$; the two differ only at third order in $h$, so second-order accuracy is preserved. If we take the intermediate point at the midpoint, that is $r=1/2$, the second-order DPM-Solver simplifies to
\[\tilde x_{t_{i-1}} = \frac{s(t_{i-1})}{s(t_i)}\,x_{t_i}\;-\; \sigma(t_{i-1})\,(e^h-1)\,\epsilon_{s}\]
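The same step in code, following the predict-then-update scheme above (a sketch under the same assumed schedule/model callables, plus `t_of_lam` mapping a log-SNR value back to time):

```python
import math

def dpm_solver_2_step(eps_model, x_t, t_i, t_prev, s_of, sigma_of, lam_of, t_of_lam, r=0.5):
    # Log-SNR step and intermediate time t_s at lambda_i + r*h
    lam_i, lam_prev = lam_of(t_i), lam_of(t_prev)
    h = lam_prev - lam_i
    t_s = t_of_lam(lam_i + r * h)
    eps_i = eps_model(x_t, t_i)
    # Predictor: first-order (DPM-Solver-1) step to the intermediate point
    x_s = (s_of(t_s) / s_of(t_i)) * x_t - sigma_of(t_s) * math.expm1(r * h) * eps_i
    eps_s = eps_model(x_s, t_s)
    # Corrector: second-order update over the full step
    return (s_of(t_prev) / s_of(t_i)) * x_t - sigma_of(t_prev) * math.expm1(h) * (
        eps_i + (eps_s - eps_i) / (2.0 * r))
```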
DPM-Solver-3 (Order-3, 3-NFEs). Two predictions to estimate \(\epsilon'_i,\epsilon''_i\).
Place two internal points inside $[t_i, t_{i-1}]$ by moving a fraction of $h$ in the $\lambda$ domain and mapping back to $t$:
\[s_{1} = t_\lambda\!\big(\lambda_{t_{i}}+r_1 h\big),\qquad s_{2} = t_\lambda\!\big(\lambda_{t_{i}}+r_2 h\big).\]Stage-1 state at $s_1$ (first-order exponential Euler). Propagate from $t_{i}$ to $s_1$ using a 1st-order exponential update driven by $\epsilon_\theta(x_{t_{i}},t_{i})$:
\[\tilde u_{1} = \frac{s(t_{s_{1}})}{s({t_{i}})}\,x_{t_{i}} \;-\; \sigma(t_{s_{1}})\Big(e^{r_1 h}-1\Big)\, \epsilon_\theta(x_{t_{i}},t_{i}).\]Residual $D_1$ at the first stage. Measure how much the model output changes when moving to $u_1$ at $s_1$:
\[\mathbf{D}_1 = \epsilon_\theta(\tilde u_1,s_1) - \epsilon_\theta(\tilde x_{t_{i}},t_{i}).\]This quantity encodes first- and second-derivative information (in $\lambda$) about $\epsilon_\theta$ over a fraction $r_1$ of the step.
Stage-2 state at $s_2$ (second-order with a correction). Now propagate to the second internal point $s_2$ using a 2nd-order formula that reuses $D_1$ as a cheap estimate of derivatives:
\[\begin{align} \tilde u_2 =\;&\frac{s(t_{s_{2}})}{s(t_{i})}\,x_{t_{i}}\;-\;\sigma(t_{s_{2}})\left((e^{r_2 h}-1)\,\epsilon_\theta(x_{t_{i}},t_{i}) + (r_2h)^2\varphi_2(r_2h)\,\epsilon_\theta^{'}(x_{t_{i}},t_{i}) \right)\\[10pt] =\;&\frac{s(t_{s_{2}})}{s(t_{i})}\,x_{t_{i}}\;-\;\sigma(t_{s_{2}})\left((e^{r_2 h}-1)\,\epsilon_\theta(x_{t_{i}},t_{i}) + (r_2h)^2\varphi_2(r_2h)\,\frac{\mathbf{D}_1}{r_1h} \right)\\[10pt] =\;&\frac{s(t_{s_{2}})}{s(t_{i})}\,x_{t_{i}}\;-\;\sigma(t_{s_{2}})\left((e^{r_2 h}-1)\,\epsilon_\theta(x_{t_{i}},t_{i}) + \frac{r_2}{r_1}(\varphi_1(r_2h)-1)\,{\mathbf{D}_1} \right) \end{align}\]Residual $D_2$ at the second stage. Evaluate the model again at $\tilde u_2$ and take a residual w.r.t. the base point:
\[\mathbf{D}_2 = \epsilon_\theta(\tilde u_2,s_2)-\epsilon_\theta(\tilde x_{t_{i}},t_{i}).\label{eq:62}\]Thanks to the design of the correction in the previous line, $\mathbf{D}_2$ now encodes the combined first- and second-derivative information to third-order accuracy over the full step.
Final third-order update to $t_i$. Assemble the one-step third-order solution using the base value and the second residual:
\[\tilde x_{t_{i-1}} = \frac{s(t_{i-1})}{s(t_{i})}\,x_{t_{i}}\;-\;\sigma(t_{i-1})\left( (e^{h}-1)\,\epsilon_\theta(x_{t_{i}},t_{i})\;+\;\frac{1}{r_2}\left(\frac{e^{h}-1}{h}-1\right)\,\mathbf{D}_{2} \right)\]
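A sketch of the full three-stage step, under the same assumed callables and with the default choices $r_1=1/3$, $r_2=2/3$:

```python
import math

def dpm_solver_3_step(eps_model, x_t, t_i, t_prev, s_of, sigma_of, lam_of, t_of_lam,
                      r1=1.0 / 3.0, r2=2.0 / 3.0):
    lam_i = lam_of(t_i)
    h = lam_of(t_prev) - lam_i
    s1, s2 = t_of_lam(lam_i + r1 * h), t_of_lam(lam_i + r2 * h)
    eps_i = eps_model(x_t, t_i)
    # Stage 1: first-order step to s1, then residual D1
    u1 = (s_of(s1) / s_of(t_i)) * x_t - sigma_of(s1) * math.expm1(r1 * h) * eps_i
    D1 = eps_model(u1, s1) - eps_i
    # Stage 2: second-order step to s2 reusing D1, then residual D2
    phi1_r2h = math.expm1(r2 * h) / (r2 * h)
    u2 = (s_of(s2) / s_of(t_i)) * x_t - sigma_of(s2) * (
        math.expm1(r2 * h) * eps_i + (r2 / r1) * (phi1_r2h - 1.0) * D1)
    D2 = eps_model(u2, s2) - eps_i
    # Final third-order update over the full step
    return (s_of(t_prev) / s_of(t_i)) * x_t - sigma_of(t_prev) * (
        math.expm1(h) * eps_i + (1.0 / r2) * (math.expm1(h) / h - 1.0) * D2)
```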
4.2.4 LTE and GTE Analysis
Important Preliminaries: If $\epsilon_\theta(x_s,s)$ is Lipschitz in $x$ with constant $L_x(s)$ (under some vector norm $\|\cdot\|$), then
\[\|\epsilon_\theta(\tilde x_s, s) - \epsilon_\theta(x_s, s)\| \;\le\; L_x(s)\;\|\tilde x_s - x_s\| = \mathcal{O}\,(\|\tilde x_s - x_s\|)\]According to the previous discussion, the error of DPM-Solver originates from the approximation of the nonlinear part.
DPM-Solver-1:
From equation \ref{eq:49}, we can write down the following LTE expression:
\[\mathbf{LTE}_{1} = x_{t_{i-1}} - \tilde x_{t_{i-1}} = \sigma(t_{i-1})\sum_{k=2}^{\infty}\left( h^k\varphi_k(h)\epsilon^{(k-1)}_{\theta}(x_{t_i}, t_i) \right)\]we only need to consider the first term ($k=2$).
\[\begin{align} \mathbf{LTE}_{1} & = x_{t_{i-1}} - \tilde x_{t_{i-1}} \approx \sigma(t_{i-1})h^2\varphi_2(h)\epsilon^{'}_{\theta}(x_{t_i}, t_i) \\[10pt] & = \sigma(t_{i-1})h^2\epsilon^{'}_{\theta}(x_{t_i}, t_i)\frac{e^h-1-h}{h^2}\label{eq:66} \\[10pt] & \approx \sigma(t_{i-1})h^2\epsilon^{'}_{\theta}(x_{t_i}, t_i)\,\times \frac{1}{2}\label{eq:67} \\[10pt] & = \mathcal{O}(h^2) \end{align}\]The reason why formulas \ref{eq:66} to \ref{eq:67} hold is that the exponential function $e^h$ satisfies Taylor expansion \(e^h-1-h=h^2/2+\mathcal{O}(h^3)\)
The Global Truncation Error is then $\mathcal{O}(1/h)\times \mathbf{LTE}_1 = \mathcal{O}(h)$, since the number of steps scales as $1/h$.
DPM-Solver-2:
Similarly, we only consider the first differing term:
\[\begin{align} \mathbf{LTE}_{2} & = x_{t_{i-1}} - \tilde x_{t_{i-1}} \\[10pt] & = \sigma(t_{i-1})h^2\varphi_2(h)\left( \frac{\epsilon_{\theta}(x_{t_i}, t_i)- \epsilon_{\theta}(x_{t_s}, t_s)}{rh} - \frac{\epsilon_{\theta}(x_{t_i}, t_i)- \epsilon_{\theta}(\tilde x_{t_s}, t_s)}{rh} \right) \\[10pt] & = \sigma(t_{i-1})h^2\varphi_2(h)\left( \frac{\epsilon_{\theta}(\tilde x_{t_s}, t_s)- \epsilon_{\theta}(x_{t_s}, t_s)}{rh} \right) \\[10pt] & = \sigma(t_{i-1})h^2\varphi_2(h)\left( \frac{\mathcal{O}(\|\tilde x_{t_s} - x_{t_s}\|)}{rh} \right) \\[10pt] & = \sigma(t_{i-1})h^2\varphi_2(h)\left( \frac{\mathcal{O}(h^2)}{rh} \right) = \mathcal{O}(h^3) \end{align}\]The Global Truncation Error is then $\mathcal{O}(1/h)\times \mathbf{LTE}_2 = \mathcal{O}(h^2)$.
DPM-Solver-3:
Similarly, we only consider the first differing term:
\[\begin{align} \mathbf{LTE}_{3} & = x_{t_{i-1}} - \tilde x_{t_{i-1}} \\[10pt] & = \sigma(t_{i-1})\left(\sum_{k=2}^{3}h^k\varphi_k(h)\epsilon^{(k-1)}_{\theta}(x_{t_i}, t_i) - \;\frac{1}{r_2}\left(\frac{e^{h}-1}{h}-1\right)\,\mathbf{D}_{2}\right) + \mathcal{O}(h^4)\label{eq:76} \end{align}\]Considering equation \ref{eq:62}, $\mathbf{D}_{2}$ can be written via Taylor expansion
\[\begin{align} \mathbf{D}_2 & = \epsilon_\theta(\tilde u_2,s_2)-\epsilon_\theta(\tilde x_{t_{i}},t_{i}) \\[10pt] & = \left(\epsilon_\theta(\tilde u_2,s_2)-\epsilon_\theta(u_2,s_2) \right)+\left( \epsilon_\theta( u_2,s_2)-\epsilon_\theta(\tilde x_{t_{i}},t_{i}) \right) \\[10pt] & \approx \underbrace{\mathcal{O}(\|\tilde u_2 - u_2\|)}_{\text{Lipschitz}} + \left( \epsilon_\theta( u_2,s_2)-\epsilon_\theta(\tilde x_{t_{i}},t_{i}) \right) \\[10pt] & \approx \mathcal{O}(h^3)+\left( \epsilon_\theta( u_2,s_2)-\epsilon_\theta(\tilde x_{t_{i}},t_{i}) \right) \\[10pt] & = \mathcal{O}(h^3)+\left( (r_2h)\epsilon_{i}^{'} + \frac{(r_2h)^2}{2}\epsilon_{i}^{''} + \mathcal{O}(h^3) \right) \\[10pt] & = (r_2h)\epsilon_{i}^{'} + \frac{(r_2h)^2}{2}\epsilon_{i}^{''} + \mathcal{O}(h^3) \end{align}\]Now multiply $\mathbf{D}_2$ by its coefficient and simplify:
\[\begin{align} & \frac{1}{r_2}\left(\frac{e^{h}-1}{h}-1\right)\,\mathbf{D}_{2} = \frac{1}{r_2}\left(\frac{e^{h}-1}{h}-1\right)\,\left( (r_2h)\epsilon_{i}^{'} + \frac{(r_2h)^2}{2}\epsilon_{i}^{''} + \mathcal{O}(h^3) \right) \\[10pt] = & \frac{1}{r_2}\left(\frac{e^{h}-1}{h}-1\right)(r_2h)\epsilon_{i}^{'} + \frac{1}{r_2}\left(\frac{e^{h}-1}{h}-1\right)\frac{(r_2h)^2}{2}\epsilon_{i}^{''} + \mathcal{O}(h^4) \\[10pt] = & h^2\varphi_2(h)\epsilon_{i}^{'} + \frac{r_2h^3}{2}\varphi_2(h)\epsilon_{i}^{''} + \mathcal{O}(h^4) \end{align}\]Now, we substitute the above results back into Equation \ref{eq:76}
\[\begin{align} \mathbf{LTE}_{3} & = \sigma(t_{i-1})\left(\sum_{k=2}^{3}h^k\varphi_k(h)\epsilon^{(k-1)}_{\theta}(x_{t_i}, t_i) - \;\frac{1}{r_2}\left(\frac{e^{h}-1}{h}-1\right)\,\mathbf{D}_{2}\right) \\[10pt] & = \sigma(t_{i-1})\left( h^3\varphi_3(h) - \frac{r_2h^3}{2}\varphi_2(h) \right) \epsilon_{i}^{''} + \mathcal{O}(h^4) \\[10pt] & = \sigma(t_{i-1})\left( \underbrace{\frac{h^3}{6}+\mathcal{O}(h^4)}_{h^3\varphi_3(h)} - \underbrace{\Big(\dfrac{h}{2}+\dfrac{h^2}{6}+\dfrac{h^3}{24}+\mathcal{O}(h^4)\Big)}_{h\varphi_2(h)}\dfrac{r_2h^2}{2} \right) \epsilon_{i}^{''} + \mathcal{O}(h^4) \\[10pt] & = \sigma(t_{i-1})\left( \frac{1}{6} - \frac{r_2}{4}\right)h^3\epsilon_{i}^{''} + \mathcal{O}(h^4) \end{align}\]It is clear that DPM-Solver-3 achieves a 4th-order local truncation error (LTE) only when the second intermediate point parameter $r_2$ is set to $2/3$. This, in turn, results in a 3rd-order global truncation error (GTE).
4.2.5 Relation to DEIS and bridge to DPM-Solver++
- DEIS vs. DPM-Solver. Both exploit the semi-linear PF-ODE. DEIS uses an exponential-integrator discretization with polynomial extrapolation under an integrating factor; DPM-Solver absorbs the exponential weight into $\varphi_k$ coefficients in $\lambda$, yielding broader stability with closed-form weights in few steps.
- Toward DPM-Solver++. DPM-Solver++ systematizes these ideas under data-prediction, adds single-step and multistep variants, and introduces practical techniques (e.g., thresholding) for robust guided sampling. We will detail this in Section 4.3.
Takeaway. DPM-Solver converts PF-ODE’s structure—linear exactness + log-SNR exponential weight—into a concise algorithmic recipe: $\varphi$-functions times low-degree polynomials. This delivers high order at low NFE, and DPM-Solver++ carries the recipe into the strong-guidance regime.
4.3 DPM-Solver++
DPM-Solver++ 4 is the “data-prediction” twin of §4.2’s DPM-Solver: it solves the PF-ODE by integrating the linear drift exactly while discretizing one exponentially weighted integral of the model output, but now in the $x_0$-prediction view rather than the $\epsilon$-prediction view. The switch flips both the weight and the linear prefactor (from an $s$-ratio with $e^{-\lambda}$ to a $\sigma$-ratio with $e^{+\lambda}$), and—crucially—enables dynamic thresholding on $x_0$ that stabilizes large classifier-free guidance (CFG).
Briefly, compared with DPM-Solver, DPM-Solver++ has the following changes:
- Switching to data prediction (predict $x_0$ from $x_t$), which empirically enlarges the stability margin;
- Adding dynamic thresholding (clip the predicted $x_0$ into the data range);
- Providing multistep variants to shrink the effective step size.
4.3.1 Setup in $x_0$-prediction: exact one-step identity
Start from the PF-ODE and rewrite the score via the model’s $x_0$ head,
\[s_\theta(x_t,t)\;\approx\;-\frac{x_t-s(t)\,x_\theta(x_t,t)}{\sigma^2(t)}.\]Plugging this into the PF-ODE gives
\[\frac{dx_t}{dt}=f(t)x_t-\frac{g^2(t)}{2}s_{\theta}\quad \Longrightarrow \quad \frac{dx_t}{dt} = \left(f(t) + \frac{g^2(t)}{2\sigma^2(t)}\right)\,x_t \;-\; \frac{g^2(t)}{2\sigma^2(t)}\,s(t)\,x_\theta\]Repeating the variation-of-constants steps in §4.2 and changing variables from $t$ to log-SNR $\lambda=\log(s/\sigma)$ yields the exact one-step identity in the data-prediction view:
\[\require{color} \definecolor{lightred}{rgb}{1, 0.9, 0.9} \fcolorbox{red}{lightred}{ $ \displaystyle x_{t_{i-1}} = \frac{\sigma(t_{i-1})}{\sigma(t_i)}\,x_{t_i} \;+\;\sigma(t_{i-1})\,\int_{\lambda_{t_i}}^{\lambda_{t_{i-1}}} e^{\lambda}\,x_{\theta}(x_{\lambda},\lambda)\,d\lambda $ }\]Introducing the step $h=\lambda_{t_{i-1}}-\lambda_{t_i}>0$, setting $\hat\varphi_k(h)=\varphi_k(-h)$, and following the derivation in §4.2, we obtain the backbone formula for DPM-Solver++.
\[\require{color} \definecolor{lightred}{rgb}{1, 0.9, 0.9} \fcolorbox{red}{lightred}{ $ \displaystyle x_{t_{i-1}} = \frac{\sigma(t_{i-1})}{\sigma(t_i)}\,x_{t_i} \;+\; s(t_{i-1})\Big[ h\hat\varphi_1(h)\,x_{i} + h^2\,\hat\varphi_2(h)\,x'_{i} + {h^3}\,\hat\varphi_3(h)\,x''_{i} + \cdots\Big] $ }\label{eq:93}\]where $x_i := x_\theta(x_{\lambda_{t_i}},\lambda_{t_i})$ and \(x^{(m)}_i := \partial_\lambda^m x_\theta(x_{\lambda},\lambda)\,\mid_{\lambda=\lambda_{t_i}}\)
4.3.2 Single-step DPM-Solver++ (data-prediction view)
Same as in §4.2, higher orders come from approximating \(x'_{i},\,x''_{i}\) with a few extra evaluations.
1st-order DPM-Solver++(1-NFE). Constant approximation of $x_0$ over $[\lambda_i,\lambda_{i-1}]$:
\[\begin{align} \tilde x_{t_{i-1}} & = \frac{\sigma(t_{i-1})}{\sigma(t_i)}\,x_{t_i}\;+\; s(t_{i-1})h\,\hat\varphi_1(h)\,x_{i} \\[10pt] & = \frac{\sigma(t_{i-1})}{\sigma(t_i)}\,x_{t_i}\; - \; s(t_{i-1})\,\left( e^{-h} - 1\right)\,x_{i} \end{align}\]2nd-order DPM-Solver++ (2-NFE). We use a similar technique as in DPM-Solver-2 to estimate the derivative $x_{i}'$. Specifically, we introduce an additional intermediate time step $t_{s}$ between $t_{i}$ and $t_{i-1}$
\[\lambda_{t_s} = \lambda_{t_i}+rh,\qquad t_s = t_{\lambda}(\lambda_{t_s})\]Predict with the Order-1 DPM-Solver++ to get the estimated state $\tilde x_{t_s}$ at $t_s$
\[{\tilde x_{t_s}}=\frac{\sigma(t_{s})}{\sigma(t_i)}\,x_{t_i} - s(t_{s})\,\left( e^{-rh} - 1\right)\,x_{i}\]Approximating First-Order derivatives \(x'_{\theta}(x_{t_i}, t_i)\) using Finite Differences
\[x'_{\theta}(x_{t_i}, t_i) \approx \frac{x_{\theta}({\tilde x_{t_s}}, t_s)\,-\,x_{\theta}(x_{t_i}, t_i)}{\lambda_{t_s} - \lambda_{t_i}} = \frac{x_{\theta}({\tilde x_{t_s}}, t_s)\,-\,x_{\theta}(x_{t_i}, t_i)}{rh}\]Plug into the formula \ref{eq:93} and update:
\[\begin{align} \tilde x_{t_{i-1}} & = \frac{\sigma(t_{i-1})}{\sigma(t_i)}\,x_{t_i}\; + \; s(t_{i-1})\Big[h\hat\varphi_1(h)\,x_{i}+ h^2\,\hat\varphi_2(h)\,x'_{i}\Big] \\[10pt] & = \frac{\sigma(t_{i-1})}{\sigma(t_i)}\,x_{t_i}\;-\; s(t_{i-1})\,(e^{-h}-1)\,\left(x_{i} + \frac{x_{s}\,-\, x_{i}}{2r} \right)\label{eq:100} \end{align}\]If we take the intermediate point at the midpoint, that is $r=1/2$, the second-order DPM-Solver++ simplifies to
\[\tilde x_{t_{i-1}} = \frac{\sigma(t_{i-1})}{\sigma(t_i)}\,x_{t_i}\;-\; s(t_{i-1})\,(e^{-h}-1)\,x_{s}\]
4.3.3 Multistep DPM-Solver++ (2M)
At each step (from $t_i$ to $t_{i-1}$), the single-step solver proposed in §4.3.2 needs two sequential function evaluations of the neural network $x_{\theta}$. Moreover, the intermediate values are only used once and then discarded. Such a method loses previous information and may be inefficient. By drawing on multi-step numerical ODE methods (§2.1.2), historical information can be saved and reused. For instance, when solving for $x_{t_{i-1}}$ with the 2M method, the information from $x_{t_i}$ and $x_{t_{i+1}}$ is used.
Approximating First-Order derivatives \(x'_{\theta}(x_{t_i}, t_i)\) using Finite Differences
\[x'_{\theta}(x_{t_i}, t_i) \approx \frac{x_{\theta}({\tilde x_{t_i}}, t_i)\,-\,x_{\theta}(x_{t_{i+1}}, t_{i+1})}{\lambda_{t_{i}} - \lambda_{t_{i+1}}} = \frac{x_{\theta}({\tilde x_{t_i}}, t_i)\,-\,x_{\theta}(x_{t_{i+1}}, t_{i+1})}{h_{i+1}}\]Plug into the formula \ref{eq:93} and update:
\[\begin{align} \tilde x_{t_{i-1}} & = \frac{\sigma(t_{i-1})}{\sigma(t_i)}\,x_{t_i}\; + \; s(t_{i-1})\Big[h_i\hat\varphi_1(h_i)\,x_{i}+ h_i^2\,\hat\varphi_2(h_i)\,x'_{i}\Big] \\[10pt] & = \frac{\sigma(t_{i-1})}{\sigma(t_i)}\,x_{t_i}\;-\; s(t_{i-1})\,(e^{-h_i}-1)\,\left(x_{i} + \frac{x_{i}\,-\, x_{i+1}}{2(h_{i+1}/h_i)} \right)\label{eq:104} \end{align}\]Comparing Eq. \ref{eq:100} with Eq. \ref{eq:104}: when $h_{i+1}/h_i=r$, the two are equivalent.
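A minimal sketch of one DPM-Solver++(2M) step, assuming a data-prediction model `x_model`, schedule callables as before, and a small history buffer holding the previous model output and step size (all illustrative assumptions):

```python
import math

def dpm_pp_2m_step(x_model, x_t, t_i, t_prev, prev_x0, prev_h, s_of, sigma_of, lam_of):
    h = lam_of(t_prev) - lam_of(t_i)
    x0_i = x_model(x_t, t_i)                      # 1 NFE per step
    if prev_x0 is None:                           # warm-up: fall back to first order
        x0_hat = x0_i
    else:                                         # reuse the previous model output
        r = prev_h / h
        x0_hat = x0_i + (x0_i - prev_x0) / (2.0 * r)
    x_next = (sigma_of(t_prev) / sigma_of(t_i)) * x_t \
             - s_of(t_prev) * math.expm1(-h) * x0_hat
    return x_next, x0_i, h                        # new state plus updated history
```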
4.3.4 Thresholding (Imagen-style) and when to use it
Many image datasets have bounded dynamic ranges. Dynamic thresholding clips the predicted $x_0$ element-wise to a percentile-based bound before mapping back to $x_t$. In DPM-Solver++ this is straightforward because we are already in the $x_0$ view. It mitigates train–test mismatch and improves stability at high guidance scales. (Note: unsuitable for latent-space models like Stable Diffusion’s VAE latents; it’s for pixel-space models.)
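A minimal PyTorch sketch of Imagen-style dynamic thresholding of a predicted $x_0$ (the percentile value is a typical choice, not a fixed rule):

```python
import torch

def dynamic_threshold(x0_pred, percentile=0.995):
    # Per-sample percentile of |x0|, clipped to at least 1, then clamp and rescale.
    b = x0_pred.shape[0]
    s = torch.quantile(x0_pred.abs().reshape(b, -1), percentile, dim=1)
    s = torch.clamp(s, min=1.0).view(b, *([1] * (x0_pred.dim() - 1)))
    return torch.clamp(x0_pred, -s, s) / s
```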
4.3.5 Practical recipe (what to actually use)
- Guided sampling (classifier or CFG): order = 2 (single-step or multistep).
- Unconditional sampling: order = 3 often gives a quality edge. These settings mirror common library defaults and tips (e.g., Diffusers).
Takeaway. In practice, start with DPM-Solver++-2; enable 2M if steps are very few or guidance is large; use thresholding only for pixel-space models.
4.3.6 Relationship to §4.2 and to UniPC
- Versus §4.2 (ε-view). Same exponential-integrator skeleton, but the weight flips ($e^{-\lambda}\rightarrow e^{+\lambda}$) and the linear factor flips ($s$-ratio $\rightarrow$ $\sigma$-ratio). That (plus thresholding) explains the improved robustness under guidance.
- Bridge to §4.4 (UniPC). UniPC unifies/prunes coefficients from both single-step and multistep families (including ++), often offering a sweet spot in few-NFE regimes. We will pick up from this $x_0$-view to connect UniPC’s predictor–corrector forms.
4.4 UniPC
References
1. Liu L, Ren Y, Lin Z, et al. Pseudo numerical methods for diffusion models on manifolds[J]. arXiv preprint arXiv:2202.09778, 2022.
2. Lu C, Zhou Y, Bao F, et al. DPM-Solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps[J]. Advances in Neural Information Processing Systems, 2022, 35: 5775-5787.
3. Hochbruck M, Ostermann A. Exponential integrators[J]. Acta Numerica, 2010, 19: 209-286.
4. Lu C, Zhou Y, Bao F, et al. DPM-Solver++: Fast solver for guided sampling of diffusion probabilistic models[J]. Machine Intelligence Research, 2025: 1-22.
