Preference Alignment of Diffusion and Flow-Matching Models
TABLE OF CONTENTS
- Part I – Foundations of Reinforcement Learning
- 1. The Mathematical Core: Markov Decision Processes
- 2. Theoretical Solutions: Bellman Equations
- 3. Methodological Taxonomies: Model-Based vs. Model-Free
- 4. Model-based RL: Dynamic Programming
- 5. Transition to Model-Free RL: Learning from Experience
- 6. Large-Scale RL: The Era of Function Approximation
- 6.1 Value-Based Methods: Approximating the "Quality"
- 6.2 Policy-Based Methods: Approximating the "Behavior"
- 6.3 Variance Reduction: Baselines and Advantage Functions
- 6.4 Actor-Critic (AC) Methods: The Hybrid Paradigm
- 6.5 Importance Sampling for Policy Gradients
- 6.6 Stability in Policy Updates: TRPO and PPO
- Part II – Preference Alignment for Diffusion and Flow-Matching
- Part III – Differentiable Reward Optimization
- Part IV – Policy Gradient Style RL fine-tuning
- Part V – Direct Preference Optimization (RL-Free Style)
- Part VI – Black-box Post-hoc Alignment
- Reference
Preference alignment has become an essential post-training problem for modern diffusion and flow-matching models. While pretraining teaches a generative model to reproduce the data distribution, real applications often require a different target: images or videos that are more aesthetically pleasing, more faithful to prompts, safer, more identity-consistent, or more aligned with human preferences. This article develops a unified view of this problem through the lens of reinforcement learning.
We begin with the foundations of RL, including Markov decision processes, Bellman equations, policy gradients, actor-critic methods, importance sampling, TRPO, and PPO. We then reinterpret diffusion and flow-based generation as a trajectory-level decision process: each denoising or flow step is a policy action, the intermediate latent is the state, the sampler defines the transition, and the final generated sample receives a reward. Under this view, preference alignment becomes the problem of shifting a pretrained trajectory policy toward higher-reward samples while controlling distributional drift from the reference model.
Building on this perspective, we organize current diffusion and flow preference alignment methods into seven major routes: inference-time reward guidance, differentiable reward direct optimization, online RL / policy-gradient fine-tuning, DPO-style pairwise preference optimization, KTO-style binary-feedback optimization, offline reward-weighted regression, and black-box post-hoc alignment. The later sections then study representative algorithms in detail, including AlignProp, DRaFT, adjoint-based methods, DDPO, DPOK, Flow-GRPO, DenseGRPO, Diffusion-DPO, D3PO, SPO, DSPO, Diffusion-KTO, and Preference Flow Matching.
Part I – Foundations of Reinforcement Learning
The recent successes in aligning Large Language Models (LLMs) and fine-tuning Diffusion models are not isolated phenomena; they are rooted in the rigorous mathematical framework of Reinforcement Learning (RL). To understand how a model learns from human preferences, we must first revisit the classical RL paradigms that evolved from solving simple grid-worlds to mastering complex, high-dimensional control tasks.
1. The Mathematical Core: Markov Decision Processes
At the heart of any RL problem lies the Markov Decision Process (MDP), a mathematical framework used to model decision-making where outcomes are partly random and partly under the control of a decision-maker. An MDP is typically defined by the tuple \((\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma)\):
- $\mathcal{S}$ (State Space): The set of all possible states the agent can inhabit. In classical RL, this might be coordinates on a map; in LLMs, this represents the sequence of tokens generated so far.
- $\mathcal{A}$ (Action Space): The set of all possible actions. In LLMs, this is the entire vocabulary of the tokenizer.
- $\mathcal{P}(s'\mid s, a)$ (Transition Probability): The dynamics of the environment, representing the probability of moving to state $s'$ given state $s$ and action $a$.
- $\mathcal{R}(s, a, s')$ (Reward Function): A scalar feedback signal $r_t$ received after performing action $a$ in state $s$.
- $\gamma \in [0, 1]$ (Discount Factor): A parameter that determines the importance of future rewards. A $\gamma$ close to 0 makes the agent "myopic," while a $\gamma$ close to 1 makes it "far-sighted."
The fundamental goal of RL is to find a Policy $\pi(a\mid s)$, which is a mapping from states to a probability distribution over actions, that maximizes the Expected Cumulative Discounted Return:
\[J(\pi) = \mathbb{E}_{\tau \sim \pi} [G_0] = \mathbb{E}_{\tau \sim \pi} \left[ \sum_{t=0}^{\infty} \gamma^t r_{t} \right]\]where
\[\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \dots)\]represents a trajectory (also known as an episode) sampled under policy $\pi$ .
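To make the return concrete, here is a minimal sketch (toy rewards and $\gamma$, purely illustrative) that accumulates the discounted return of one sampled trajectory with a backward pass:

```python
def discounted_return(rewards, gamma):
    """Compute G_0 = sum_k gamma^k * r_k for one sampled trajectory."""
    g = 0.0
    # Iterate backwards so each step is simply g = r_t + gamma * g
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# A short 3-step episode with rewards [1, 0, 2] and gamma = 0.9:
# G_0 = 1 + 0.9*0 + 0.81*2 = 2.62
print(discounted_return([1.0, 0.0, 2.0], 0.9))
```

The backward accumulation avoids recomputing powers of $\gamma$ and is the same trick used later for reward-to-go.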
2. Theoretical Solutions: Bellman Equations
To solve an MDP, we define two critical value functions:
State-Value Function $V^\pi(s)$: The expected return starting from state $s$ and following policy $\pi$.
\[V^\pi(s) = \mathbb{E}_\pi [G_t | S_t = s]\]Action-Value Function $Q^\pi(s, a)$: The expected return starting from state $s$, taking action $a$, and thereafter following policy $\pi$.
\[Q^\pi(s, a) = \mathbb{E}_\pi [G_t | S_t = s, A_t = a]\]
The relationship between these is governed by the Bellman Expectation Equation. The Q-function satisfies:
\[\begin{align} Q^\pi(s,a) & =\mathbb E_{s'\sim P(\cdot|s,a)}\Big[r(s,a,s')+\gamma \mathbb E_{a'\sim\pi(\cdot|s')}\big[Q^\pi(s',a')\big]\Big] \\[10pt] & = \sum_{s'}P(s'|s,a)\Big(r(s,a,s')+\gamma \sum_{a'}\pi(a'|s')Q^\pi(s',a')\Big) \end{align}\]Similarly, the V-function satisfies:
\[\begin{align} V^\pi(s) & =\mathbb E_{a\sim \pi(\cdot|s)}\Big[\mathbb E_{s'\sim P(\cdot|s,a)}\big[r(s,a,s')+\gamma V^\pi(s')\big]\Big] \\[10pt] & = \sum_a \pi(a|s)\sum_{s'}P(s'|s,a)\Big(r(s,a,s')+\gamma V^\pi(s')\Big) \end{align}\]For the optimal policy $\pi^*$, we turn to the Bellman Optimality Equation. The Q-function satisfies:
\[\begin{align} Q^*(s, a) & =\mathbb E_{s'\sim P(\cdot|s,a)}\big[r(s,a,s')+\gamma \max_{a'} Q^*(s',a')\big] \\[10pt] & = \sum_{s'}P(s'|s,a)\Big(r(s,a,s')+\gamma \max_{a'} Q^*(s',a')\Big) \end{align}\]The V-function satisfies:
\[\begin{align} V^*(s) & =\max_a \mathbb E_{s'\sim P(\cdot|s,a)}\big[r(s,a,s')+\gamma V^*(s')\big] \\[10pt] & = \max_a \sum_{s'}P(s'|s,a)\Big(r(s,a,s')+\gamma V^*(s')\Big) \end{align}\]
3. Methodological Taxonomies: Model-Based vs. Model-Free
A primary distinction in RL is how the agent perceives the environment's "rules." From this perspective, we can divide reinforcement learning into two major categories: model-based and model-free.
In Model-Based RL, the agent attempts to learn the transition dynamics $\mathcal{P}(s'\mid s, a)$ and the reward function $\mathcal{R}$. Once a model of the world is built, the agent can "plan" by looking ahead (using methods like Value Iteration or Tree Search).
Model-free RL bypasses learning the transition dynamics. It learns directly from experience through trial and error. Because modern generative models (LLMs/Diffusion) have transition dynamics that are either trivial (the next state is simply the previous text + the new token) or incredibly complex (high-dimensional image latent space), Model-Free RL is the dominant paradigm for post-training alignment.
4. Model-based RL: Dynamic Programming
Before we dive into learning from trial and error (Model-Free), we must understand how to solve an MDP when the environmentâs dynamics $\mathcal{P}$ and reward function $\mathcal{R}$ are perfectly known. This field is known as Dynamic Programming (DP).
4.1 The Generalized Policy Iteration
Almost all RL algorithms can be viewed as a variation of Generalized Policy Iteration (GPI). GPI consists of two simultaneous processes:
- Policy Evaluation: Making the value function $V^\pi$ consistent with the current policy $\pi$.
- Policy Improvement: Making the policy $\pi$ greedy with respect to the current value function.
These two processes interact: the evaluation makes the value function a better reflection of the policy, and the improvement makes the policy better based on that value function.
4.2 Policy Iteration
Policy Iteration decomposes the problem into two explicit, alternating steps:
Step 1: Policy Evaluation. Given a policy $\pi$, we compute $V^\pi$ by iteratively applying the Bellman Expectation Equation as an update rule:
\[V_{k+1}(s) = \sum_{a} \pi(a|s) \sum_{s', r} p(s', r | s, a) [r + \gamma V_k(s')]\]This continues until $V$ converges.
Step 2: Policy Improvement. Once we have $V^\pi$, we update the policy by being greedy with respect to the action-value function $Q^\pi(s, a)$:
\[\pi'(s) = \arg\max_{a} Q^\pi(s, a) = \arg\max_{a} \sum_{s', r} p(s', r | s, a) [r + \gamma V^\pi(s')]\]
According to the Policy Improvement Theorem, if $\pi'$ is greedy with respect to $V^\pi$, then $V^{\pi'}(s) \geq V^\pi(s)$ for all $s$. This guarantees convergence to the optimal policy $\pi^*$.
4.3 Value Iteration
Value Iteration collapses the evaluation and improvement into a single step. Instead of waiting for the policy evaluation to converge, it updates the value function directly using the Bellman Optimality Equation:
\[V_{k+1}(s) = \max_{a} \sum_{s', r} p(s', r | s, a) [r + \gamma V_k(s')]\]In this approach, the policy is only implicitly updated by taking the max. Once $V$ converges to \(V^*\), the optimal policy \(\pi^*\) is extracted by being greedy.
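The value iteration backup above can be sketched on a toy two-state, two-action MDP. All transition probabilities and rewards below are hypothetical, chosen only to make the update concrete:

```python
import numpy as np

# Toy MDP: P[s, a, s'] = transition probability, R[s, a] = expected reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.5, 0.5]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

V = np.zeros(2)
for _ in range(500):
    # Bellman optimality backup:
    # V(s) <- max_a [ R(s,a) + gamma * sum_s' P(s,a,s') V(s') ]
    Q = R + gamma * P @ V          # shape (2, 2): Q[s, a]
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-10:
        V = V_new
        break
    V = V_new

pi = Q.argmax(axis=1)              # greedy policy extracted from converged V
print(V, pi)
```

After convergence, `V` satisfies the Bellman optimality equation to numerical precision, and the greedy policy read off from `Q` is optimal for this toy MDP.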
5. Transition to Model-Free RL: Learning from Experience
In real-world scenarios like Large Language Models, we do not have access to the "Transition Matrix" $\mathcal{P}$. We don't know the exact probability of every possible next token in a $50,000$-word vocabulary for every possible context. This necessitates Model-Free RL.
Model-Free RL learns directly from trajectories (sequences of states, actions, and rewards) sampled from the environment. The "Agent" (the LLM) generates a sequence, and the "Environment" (the Reward Model/Human) provides feedback.
Without a model, we cannot compute the expectation $\sum_{s'} p(s', r \mid s, a) [\dots]$. Instead, we estimate it using samples.
5.1 Sampling Targets: Temporal Difference (TD) vs. Monte Carlo (MC)
How does an agent update its estimate of the value function? There are two fundamental approaches to calculating the "target" for learning:
Monte Carlo (MC) Methods. MC methods wait until the end of an entire episode to calculate the total return $G_t$.
- Target: $G_t$ (the actual realized return).
Update:
\[V(S_t) \leftarrow V(S_t) + \alpha [G_t - V(S_t)]\]
- Characteristics: Unbiased but high variance (since $G_t$ depends on many random actions and transitions).
Temporal Difference (TD) Learning. TD methods update estimates based on other learned estimates, a process called bootstrapping.
Target:
\[r_{t+1} + \gamma V(S_{t+1}),\qquad \text{The TD Target}\]Update:
\[V(S_t) \leftarrow V(S_t) + \alpha [r_{t+1} + \gamma V(S_{t+1}) - V(S_t)]\]Characteristics: Biased (due to bootstrapping) but low variance. This is the foundation of Q-learning.
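The two update rules can be sketched side by side, assuming a simple dict-backed value table (the helper names `td0_update` and `mc_update` are ours, not from any library):

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99, done=False):
    """One TD(0) step: move V[s] toward the bootstrapped target r + gamma*V[s']."""
    target = r + (0.0 if done else gamma * V[s_next])
    V[s] += alpha * (target - V[s])
    return V

def mc_update(V, s, G, alpha=0.1):
    """One Monte Carlo step: move V[s] toward the realized return G_t."""
    V[s] += alpha * (G - V[s])
    return V
```

The only structural difference is the target: TD(0) bootstraps from the current estimate `V[s_next]` and can run mid-episode, while MC must wait for the realized return `G`.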
5.2 Sampling Strategies: On-Policy vs. Off-Policy
This distinction is crucial for understanding why algorithms like PPO (On-policy) and DQN (Off-policy) behave differently.
On-Policy Learning: The agent learns about the policy $\pi$ that it is currently using to interact with the environment. Every time the policy is updated, the old data becomes obsolete. Representative algorithms include SARSA and PPO.
Off-Policy Learning: The agent learns about a target policy $\pi$ while following a different behavior policy $\mu$. This allows the agent to reuse old data stored in a "Replay Buffer." Representative algorithms include Q-learning and DQN.
The mathematical trick behind off-policy learning is that, in most cases, it requires importance sampling to correct for the distribution shift:
\[\mathbb{E}_{a \sim \pi} [f(a)] = \mathbb{E}_{a \sim \mu} \left[ \frac{\pi(a|s)}{\mu(a|s)} f(a) \right]\]This correction will return as a central ingredient when we revisit policy optimization at scale in Section 6.5.
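A quick numerical check of this identity, using toy three-action policies with arbitrary payoffs: sampling from $\mu$ and reweighting by $\pi/\mu$ recovers the expectation under $\pi$.

```python
import numpy as np

rng = np.random.default_rng(0)
actions = np.array([0, 1, 2])
f = np.array([1.0, 5.0, -2.0])        # arbitrary per-action payoff (illustrative)
pi = np.array([0.7, 0.2, 0.1])        # target policy
mu = np.array([1/3, 1/3, 1/3])        # behavior policy

exact = float(np.dot(pi, f))          # E_{a~pi}[f(a)] computed in closed form

# Sample from mu, then reweight each sample by the ratio pi(a)/mu(a)
a = rng.choice(actions, size=200_000, p=mu)
is_estimate = float(np.mean((pi[a] / mu[a]) * f[a]))

print(exact, is_estimate)             # the two values should be close
```

Note that the estimator's variance grows as $\pi$ and $\mu$ diverge, which foreshadows the stability issues addressed by TRPO/PPO.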
6. Large-Scale RL: The Era of Function Approximation
Up to Section 5, we assumed tabular value functions (lookup tables). Once $\mathcal{S}$ is high-dimensional (images, continuous control) or combinatorial (token sequences in an LLM), tabular methods break down and we must use function approximation: Value function approximation ($V_\phi(s)$ or $Q_\theta(s,a)$), or Policy approximation ($\pi_\theta(a\mid s)$). This naturally yields three major families of large-scale RL:
- Value-based methods learn a value function and derive a policy (typically via greedy / $\epsilon$-greedy).
- Policy-based methods learn the policy directly by maximizing expected return.
- Actor-Critic methods combine both: an actor $\pi_\theta$ and a critic ($V_\phi$ or $Q_\phi$).
In LLM post-training, the "state" is the partial sequence $s_t=(x, y_{<t})$, the "action" is the next token $a_t=y_t$, and the reward is often provided by a Reward Model at the sequence level. This mapping will become explicit in Part II.
6.1 Value-Based Methods: Approximating the "Quality"
Value-based RL aims to learn a function that evaluates "how good" it is to take an action in a state, typically the optimal action-value function $Q^*(s,a)$. The policy is then induced by choosing actions with maximal value.
The learning target: define a neural network $Q_{\theta}$ with parameters $\theta$ whose goal is to approximate the true optimal action-value function $Q^*(s,a)$:
\[Q_\theta(s,a)\approx Q^*(s,a)\]Induced policy (common choice): once $Q_{\theta}$ is trained, act $\epsilon$-greedily:
\[\pi(a\mid s)= \begin{cases} 1-\epsilon + \frac{\epsilon}{|\mathcal{A}|}, & a=\arg\max_{a'}Q_\theta(s,a') \\[10pt] \frac{\epsilon}{|\mathcal{A}|}, & \text{otherwise} \end{cases}\]
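A minimal sketch of the induced $\epsilon$-greedy behavior (the helper name is ours). Sampling uniformly with probability $\epsilon$ also reaches the greedy action, which is exactly why the greedy branch in the case analysis above carries probability $1-\epsilon+\epsilon/|\mathcal{A}|$:

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """Sample an action: uniform over all actions w.p. eps, greedy otherwise."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))
```

With a small $\epsilon$, the argmax action dominates the empirical action distribution while every action retains nonzero probability.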
Value-based methods are historically central in deep RL because they connect directly to the Bellman optimality equation in Section 2. Recall tabular Q-learning (off-policy TD control):
\[Q(s_t,a_t)\leftarrow Q(s_t,a_t)+\alpha\Big[r_{t+1}+\gamma\max_{a'}Q(s_{t+1},a')-Q(s_t,a_t)\Big].\]DQN replaces the table $Q(\cdot,\cdot)$ with a neural network $Q_\theta(\cdot,\cdot)$ and fits it by minimizing the squared TD error over samples:
\[\delta_t(\theta)=y_t - Q_\theta(s_t,a_t),\]where $y_t$ represents the TD target
\[y_t=r_{t+1}+\gamma\; (1-\text{done})\; \max_{a'}Q_{\theta^-}(s_{t+1},a').\]The (mean-squared) DQN loss is:
\[L(\theta)=\mathbb{E}_{(s,a,r,s')\sim\mathcal{D}} \left[\left(r+\gamma\max_{a'}Q_{\theta^-}(s',a')-Q_\theta(s,a)\right)^2\right].\]Here, $\mathcal{D}$ denotes a replay buffer distribution, and $\theta^-$ are parameters of a target network (explained next).
However, naive "Deep Q-learning" is unstable. If we directly set the target to use the same network $Q_\theta$ on both sides, training often diverges because:
- Moving target problem: the target $y_t$ changes immediately when $\theta$ changes (the network "chases its own tail").
- Correlated samples: sequential transitions $(s_t,a_t,r_{t+1},s_{t+1})$ are strongly correlated, violating the implicit i.i.d. assumption behind SGD.
- Deadly triad (conceptual): function approximation + bootstrapping + off-policy learning can lead to instability.
DQN's core contribution is to introduce two stabilizers: Experience Replay and Target Networks.
Experience Replay. Store transitions in a buffer:
\[\mathcal{D}=\{(s_t,a_t,r_{t+1},s_{t+1})\}.\]Instead of learning from consecutive samples, DQN samples mini-batches uniformly (or, in later variants, non-uniformly) from $\mathcal{D}$. This breaks temporal correlations, improves sample efficiency (old data is reused), and naturally enables off-policy learning.
Target Network. Maintain a separate network $Q_{\theta^-}$ used only to compute TD targets:
\[y_t=r_{t+1}+\gamma\max_{a'}Q_{\theta^-}(s_{t+1},a').\]Update $\theta^-$ slowly (e.g., periodic hard update $\theta^- \leftarrow \theta$ every $C$ steps, or Polyak averaging). This makes the target "quasi-stationary".
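The two stabilizers can be sketched together. Here the "networks" are stand-in linear maps $Q(s,\cdot)=sW$, so this illustrates the data flow only, not a full DQN implementation:

```python
import random
from collections import deque
import numpy as np

# Experience replay buffer holding (s, a, r, s', done) transitions.
buffer = deque(maxlen=10_000)

def sample_batch(batch_size):
    """Uniformly sample a mini-batch, breaking temporal correlation."""
    trans = random.sample(list(buffer), batch_size)
    s, a, r, s_next, done = map(np.array, zip(*trans))
    return s, a, r, s_next, done.astype(float)

def td_targets(batch, W_target, gamma=0.99):
    """y = r + gamma * (1 - done) * max_a' Q_target(s', a'), computed with the
    slowly-updated target parameters W_target, not the online network."""
    s, a, r, s_next, done = batch
    q_next = s_next @ W_target          # shape (batch, |A|)
    return r + gamma * (1.0 - done) * q_next.max(axis=1)
```

In a real DQN, `W_target` would be a periodic copy of the online network's parameters, and the online network would be regressed onto `td_targets(...)` by SGD.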
DQN family: DQN is the "base chassis". Many follow-ups improve one of three aspects: targets, representation, or data usage.
Double DQN: decouple action selection and action evaluation to reduce overestimation bias:
\[y_t=r_{t+1}+\gamma Q_{\theta^-}\Big(s_{t+1},\arg\max_{a'}Q_\theta(s_{t+1},a')\Big).\]Multi-step returns ($n$-step DQN): replace 1-step TD with $n$-step bootstrapping for faster credit assignment:
\[y_t^{(n)}=\sum_{k=0}^{n-1}\gamma^k r_{t+k+1}+\gamma^n\max_{a'}Q_{\theta^-}(s_{t+n},a').\]Dueling DQN: decompose $Q$ into value + advantage:
\[Q(s,a)=V(s)+\Big(A(s,a)-\frac{1}{|\mathcal{A}|}\sum_{a'}A(s,a')\Big),\]improving learning of state value when many actions are similar.
Prioritized Experience Replay: sample transitions with probability proportional to $|\delta|$ to focus on âsurprisingâ updates.
Noisy Nets: add parameterized noise to encourage exploration without manual $\epsilon$ schedules.
Distributional RL (C51 / QR-DQN): learn a distribution over returns $Z(s,a)$ instead of only its expectation.
Rainbow DQN: a practical combination of multiple improvements (Double + Dueling + PER + multi-step + distributional + noisy exploration).
These are excellent to study if our goal is "deep RL engineering fundamentals", but for LLM alignment the center of gravity shifts toward policy gradients and actor-critic.
The key limitation is that DQN requires a maximization over actions:
\[\max_{a'}Q_{\theta^-}(s',a').\]This becomes problematic when:
- $|\mathcal{A}|$ is extremely large (LLM vocabularies are often 50k–200k tokens),
- actions are structured sequences (a "single action" could be a multi-token continuation),
- we need stochastic exploration over a vast discrete space.
This is the main reason modern LLM alignment methods overwhelmingly adopt policy gradient / actor-critic families, where sampling from $\pi_\theta$ is natural.
6.2 Policy-Based Methods: Approximating the "Behavior"
Policy-based RL parameterizes the policy \(\pi_\theta(a\mid s)\) directly, and optimizes $\theta$ to maximize expected return. This avoids the $\max_{a'}$ bottleneck in large action spaces and is therefore conceptually closer to LLM fine-tuning, where the model already defines a distribution over tokens.
For an episodic trajectory sampled from the policy $\pi_\theta$,
\[\tau = \{s_0, a_0, r_0, \dots, s_{T-1}, a_{T-1}, r_{T-1}, s_T\}, \qquad \tau \sim \pi_\theta\]define the standard objective:
\[J(\theta)=\mathbb{E}_{\tau\sim \pi_\theta}\left[R(\tau)\right]=\mathbb{E}_{\tau\sim \pi_\theta}\left[\sum_{t=0}^{T-1}\gamma^t r_{t}\right].\]A trajectory density is:
\[p_\theta(\tau)=\rho(s_0)\prod_{t=0}^{T-1}\pi_\theta(a_t\mid s_t)\,\mathcal{P}(s_{t+1}\mid s_t,a_t),\]where $\rho(s_0)$ is the initial-state distribution. Differentiating $J(\theta)$ with respect to $\theta$ and applying the log-derivative trick yields
\[\begin{align} \nabla_\theta J(\theta) & = \nabla_\theta\int p_\theta(\tau)\,R(\tau)\,d\tau = \int \nabla_\theta p_\theta(\tau)\,R(\tau)\,d\tau \\[10pt] & = \int p_\theta(\tau) \nabla_\theta \log p_\theta(\tau)\,R(\tau)\,d\tau \\[10pt] & = \mathbb{E}_{\tau\sim \pi_\theta}\left[\nabla_\theta \log p_\theta(\tau)\,R(\tau)\right] \end{align}\]Because environment dynamics $\mathcal{P}$ and the initial-state distribution $\rho(s_0)$ do not depend on $\theta$,
\[\log p_\theta(\tau)=\log \rho(s_0)+\sum_{t=0}^{T-1}\log \pi_\theta(a_t\mid s_t)+\sum_{t=0}^{T-1}\log \mathcal{P}(s_{t+1}\mid s_t,a_t),\]so
\[\nabla_\theta\log p_\theta(\tau)=\sum_{t=0}^{T-1}\nabla_\theta\log \pi_\theta(a_t\mid s_t).\]Substitute back:
\[\nabla_\theta J(\theta) = \mathbb{E}_{\tau\sim\pi_\theta}\left[ \sum_{t=0}^{T-1}\nabla_\theta\log \pi_\theta(a_t\mid s_t)\,R(\tau) \right].\]Attaching the full return ($R(\tau)$) to every \(\nabla_\theta \log \pi\) term is unbiased but high variance, because the action at time $t$ cannot affect rewards that happened before time $t$. We can remove this "past-reward noise" without changing the expectation. Define the reward-to-go $G_t$, and the past part $P_t$ (up to $t-1$), then the full return decomposes as:
\[R(\tau) = P_t + G_t.\]where
\[G_t=\sum_{k=t}^{T-1}\gamma^{k}r_k, \qquad P_t=\sum_{k=0}^{t-1}\gamma^k r_k.\]The past return $P_t$ is fixed once we condition on the history and does not depend on $a_t$; hence the inner conditional expectation satisfies
\[\begin{align} \mathbb E_{a_t \sim \pi_\theta(\cdot|s_t)} \left[ \nabla_\theta \log \pi_\theta(a_t|s_t) \right] & = \sum_{a_t} \pi_\theta(a_t|s_t)\nabla_\theta \log \pi_\theta(a_t|s_t) \\[10pt] & = \sum_{a_t} \nabla_\theta \pi_\theta(a_t|s_t) = \nabla_\theta \sum_{a_t} \pi_\theta(a_t|s_t) \\[10pt] & = \nabla_\theta 1 = 0 \end{align}\]Thus the inner expectation is zero, hence the whole past term is zero.
\[\mathbb{E}_{\tau\sim\pi_\theta}\left[ \sum_{t=0}^{T-1}\nabla_\theta\log \pi_\theta(a_t\mid s_t)\,P_t \right] = 0.\]Dropping this vanishing past term replaces the full-trajectory return $R(\tau)$ with the reward-to-go $G_t$, which yields the familiar REINFORCE estimator [^REINFORCE]:
\[\nabla_\theta J(\theta) = \mathbb{E}_{\tau\sim\pi_\theta}\left[ \sum_{t=0}^{T-1}\nabla_\theta\log \pi_\theta(a_t\mid s_t)\,G_t \right]\]We apply the tower property (law of total expectation) with conditioning on $(s_t, a_t)$,
\[\begin{align} \mathbb{E}[\nabla_\theta\log \pi_\theta(a_t\mid s_t)\,G_t] & = \mathbb{E}\Big[\mathbb{E}\big[\nabla_\theta\log \pi_\theta(a_t\mid s_t)\,G_t \mid s_t,a_t\big]\Big] \\[10pt] & = \mathbb{E}\Big[\nabla_\theta\log \pi_\theta(a_t\mid s_t)\,\mathbb{E}\big[G_t \mid s_t,a_t\big]\Big] \end{align}\]By definition of the action-value function,
\[Q^{\pi_\theta}(s_t,a_t)\;\doteq\;\mathbb{E}\!\left[G_t \mid s_t=s,\;a_t=a\right],\]Substituting this equality for every $t$ yields
\[\nabla_\theta J(\theta) = \mathbb{E}_{\tau\sim\pi_\theta}\left[ \sum_{t=0}^{T-1}\nabla_\theta\log \pi_\theta(a_t\mid s_t)\,Q^{\pi_\theta}(s_t,a_t) \right].\]Finally, expressing the result in terms of occupancy measure $d^{\pi_\theta}(s)$ gives the Policy Gradient Theorem:
\[\nabla_\theta J(\theta) = \mathbb{E}_{s\sim d^{\pi_\theta},\,a\sim\pi_\theta(\cdot\mid s)} \left[\nabla_\theta\log \pi_\theta(a\mid s)\,Q^{\pi_\theta}(s,a)\right].\]
6.3 Variance Reduction: Baselines and Advantage Functions
At this point, the Policy Gradient Theorem gives us a clean conceptual target:
\[\nabla_\theta J(\theta) = \mathbb{E}_{s\sim d^{\pi_\theta},\,a\sim\pi_\theta(\cdot\mid s)} \left[\nabla_\theta\log \pi_\theta(a\mid s)\,Q^{\pi_\theta}(s,a)\right].\]However, it is not yet a practical algorithm: the true $Q^{\pi_\theta}(s,a)$ is unknown and must be estimated from samples. The remainder of modern policy optimization can be understood as a sequence of increasingly âengineeredâ estimators and update rules that make this gradient both low-variance and stable at scale (which is exactly what we need for LLM post-training).
A key identity behind almost all practical policy gradient methods is that we may subtract a baseline $b(s)$ without changing the expected gradient:
\[\mathbb{E}_{a\sim\pi_\theta(\cdot\mid s)} \big[\nabla_\theta\log\pi_\theta(a\mid s)\,b(s)\big] = b(s)\,\nabla_\theta \sum_a \pi_\theta(a\mid s) = 0.\]Therefore,
\[\nabla_\theta J(\theta) = \mathbb{E}_{s\sim d^{\pi_\theta},\,a\sim\pi_\theta(\cdot\mid s)} \left[\nabla_\theta\log \pi_\theta(a\mid s)\,\big(Q^{\pi_\theta}(s,a)-b(s)\big)\right].\]The "best" baseline (in the sense of minimizing variance) is closely related to the state-value function $V^{\pi_\theta}(s)$. Choosing $b(s)=V^{\pi_\theta}(s)$ yields the advantage form:
\[A^{\pi_\theta}(s,a)\;\doteq\;Q^{\pi_\theta}(s,a)-V^{\pi_\theta}(s),\]and the gradient becomes
\[\nabla_\theta J(\theta) = \mathbb{E}_{s\sim d^{\pi_\theta},\,a\sim\pi_\theta(\cdot\mid s)} \left[\nabla_\theta\log \pi_\theta(a\mid s)\,A^{\pi_\theta}(s,a)\right].\]Interpretation. REINFORCE pushes up the probability of actions with high return. The advantage form pushes up the probability of actions that are better than the policy's default behavior at that state, which is usually a much lower-variance learning signal.
This is the first major bridge from âpure REINFORCEâ to large-scale deep RL: once $V^{\pi_\theta}(s)$ appears, we can approximate it with a neural network critic.
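A small numerical experiment (toy one-state, two-action policy, illustrative numbers) shows that both estimators agree in expectation while the baseline slashes the variance:

```python
import numpy as np

rng = np.random.default_rng(0)
pi = np.array([0.6, 0.4])             # softmax policy probabilities
q = np.array([10.0, 11.0])            # true Q-values; both actions look "good"
V = float(np.dot(pi, q))              # state value, used as the baseline

def grad_estimates(weighting):
    """Per-sample REINFORCE gradient component for the logit of action 0."""
    a = rng.choice(2, size=100_000, p=pi)
    # d/dz0 log pi(a) for a softmax: (1 - pi0) if a == 0 else -pi0
    score = np.where(a == 0, 1 - pi[0], -pi[0])
    return score * weighting[a]

no_baseline = grad_estimates(q)
with_baseline = grad_estimates(q - V)

print(no_baseline.mean(), with_baseline.mean())  # both estimate the same gradient
print(no_baseline.var(), with_baseline.var())    # baseline version: far less variance
```

Because $\mathbb{E}[\text{score}] = 0$, subtracting the constant $V$ leaves the mean untouched but centers the weights near zero, which is where the variance reduction comes from.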
6.4 Actor-Critic (AC) Methods: The Hybrid Paradigm
Actor-Critic methods instantiate the advantage form with two function approximators:
- Actor: a policy network $\pi_\theta(a\mid s)$
- Critic: a value network $V_\phi(s)$ (or sometimes $Q_\phi(s,a)$)
The central idea is to replace Monte Carlo returns (high variance) with bootstrapped estimates (lower variance) learned by the critic.
6.4.1 From Monte Carlo to TD: A One-Step Advantage Estimator
Define the 1-step TD residual (TD error):
\[\delta_t \;\doteq\; r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t).\]When $V_\phi=V^{\pi_\theta}$, $\delta_t$ is an unbiased estimator of the advantage $A^{\pi_\theta}(s_t,a_t)$; with a learned $V_\phi$, it is a biased but low-variance approximation. Thus, a very common actor update is:
\[\theta \leftarrow \theta + \alpha \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t\mid s_t)\;\hat A_t, \qquad \hat A_t \approx \delta_t.\]Meanwhile, the critic is trained by regression to a bootstrap target:
\[\phi \leftarrow \arg\min_\phi \sum_{t=0}^{T-1} \left( r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t) \right)^2.\]Bias–variance trade-off. Compared to REINFORCE (Monte Carlo), TD bootstrapping typically introduces some bias (because $V_\phi$ is learned) but dramatically reduces variance and enables learning from partial trajectories, which is critical in large-scale settings.
6.4.2 Generalized Advantage Estimation (GAE): A Tunable Bias–Variance Knob
One-step TD may be too biased, while Monte Carlo returns may be too noisy. Generalized Advantage Estimation (GAE) provides a smooth interpolation between them. First keep the same TD residual $\delta_t$:
\[\delta_t = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t).\]GAE defines an exponentially-weighted sum of future TD residuals:
\[\hat A^{\text{GAE}(\gamma,\lambda)}_t \;\doteq\; \sum_{l=0}^{T-1-t} (\gamma\lambda)^l\,\delta_{t+l}, \qquad \lambda\in[0,1].\]
- $\lambda\to 0$: mostly 1-step TD (lower variance, higher bias)
- $\lambda\to 1$: approaches Monte Carlo advantage (lower bias, higher variance)
In practice, $\hat A_t$ is usually computed with a backward recursion:
\[\hat A_t = \delta_t + \gamma\lambda\,\hat A_{t+1}, \qquad \hat A_{T}=0.\]GAE is the standard advantage estimator used in TRPO/PPO-style policy optimization because it yields stable learning signals while remaining on-policy-friendly.
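The backward recursion translates directly into code. In this sketch, `values` holds $V_\phi(s_0),\dots,V_\phi(s_T)$, including the bootstrap value for the final state:

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation via the backward recursion
    A_t = delta_t + gamma*lam*A_{t+1}, with values[T] the bootstrap value."""
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv
```

Setting `lam=0` recovers the 1-step TD residuals, and `lam=1` recovers Monte Carlo advantages, matching the two limits described above.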
6.5 Importance Sampling for Policy Gradients
The policy gradient theorem is inherently on-policy: expectations are taken under $d^{\pi_\theta}$ and $a\sim\pi_\theta(\cdot\mid s)$.
However, the policy parameters $\theta$ are updated over multiple gradient steps, while the data for each iteration is collected by the most recent snapshot of the behavior policy,
\[\pi_{\text{old}} \equiv \pi_{\theta_{\text{old}}},\]The current policy being optimized is $\pi_\theta$. After the first update we generally have \(\pi_\theta \neq \pi_{\text{old}}\), while the samples \((s_t,a_t)\) still come from \(\pi_{\text{old}}\).
This mismatch is corrected using importance sampling. Let trajectories be collected by $\pi_{\text{old}}$, but we want to optimize $\pi_\theta$. For a single time step, define the importance ratio:
\[\rho_t(\theta)\;\doteq\;\frac{\pi_\theta(a_t\mid s_t)}{\pi_{\text{old}}(a_t\mid s_t)}.\]A commonly used surrogate objective is
\[L^{\text{PG}}(\theta) \;\doteq\; \mathbb{E}_{(s_t,a_t)\sim \pi_{\text{old}}} \left[ \rho_t(\theta)\,\hat A_t \right],\]whose gradient yields an importance-weighted policy gradient estimator:
\[\nabla_\theta L^{\text{PG}}(\theta) = \mathbb{E}_{\pi_{\text{old}}} \left[ \rho_t(\theta)\,\nabla_\theta\log\pi_\theta(a_t\mid s_t)\,\hat A_t \right].\]The new problem. If $\pi_\theta$ moves too far away from $\pi_{\text{old}}$, the ratio $\rho_t(\theta)$ can explode (or vanish), producing high-variance updates and unstable learning. This motivates the next generation: trust-region / proximal policy optimization, which explicitly constrains how much the policy may change per update.
6.6 Stability in Policy Updates: TRPO and PPO
In deep RL (and especially in LLM alignment), the dominant failure mode of vanilla policy gradients is not "lack of signal," but policy update instability: a few aggressive gradient steps can shift $\pi_\theta$ so much that old data becomes irrelevant, critics become inaccurate, and performance collapses.
6.6.1 TRPO: Trust Region Policy Optimization
TRPO formalizes âdo not move too farâ as a constrained optimization problem:
\[\max_\theta\; \mathbb{E}_{\pi_{\text{old}}}\big[\rho_t(\theta)\,\hat A_t\big] \quad\text{s.t.}\quad \mathbb{E}_{s\sim d^{\pi_{\text{old}}}} \Big[ D_{\mathrm{KL}}\big(\pi_{\text{old}}(\cdot\mid s)\,\|\,\pi_\theta(\cdot\mid s)\big) \Big] \le \delta.\]
- The objective is the importance-sampled surrogate improvement.
- The constraint limits the average KL divergence between the old and new policies.
In practice, TRPO approximately solves this constraint using a second-order approximation (Fisher information / conjugate gradient) plus a line search. Conceptually, it is "principled," but the implementation is heavy.
6.6.2 PPO: A First-Order, Practical Approximation to TRPO
PPO keeps the same surrogate idea but replaces the hard KL constraint with an objective that discourages large policy updates using clipping.
Define the same ratio $\rho_t(\theta)=\pi_\theta(a_t\mid s_t)/\pi_{\text{old}}(a_t\mid s_t)$, then PPO optimizes:
\[L^{\text{CLIP}}(\theta) \;\doteq\; \mathbb{E}_{\pi_{\text{old}}}\Big[ \min\Big( \rho_t(\theta)\,\hat A_t,\; \text{clip}(\rho_t(\theta), 1-\epsilon, 1+\epsilon)\,\hat A_t \Big) \Big].\]
- If $\rho_t(\theta)$ stays within $[1-\epsilon,\,1+\epsilon]$, PPO behaves like the standard importance-sampled policy gradient.
- If $\rho_t(\theta)$ goes out of range, the clipped term prevents the objective from encouraging further movement in that direction.
This simple mechanism approximates a "trust region" using only first-order optimization, enabling stable training with mini-batches and multiple epochs, which is precisely why PPO became the default choice in large-scale actor-critic systems and RLHF pipelines.
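The clipped surrogate is only a few lines in practice. This sketch operates on per-step arrays of log-probabilities and advantages (the function name is ours):

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """Mean of min(rho * A, clip(rho, 1-eps, 1+eps) * A),
    where rho = exp(logp_new - logp_old) is the importance ratio."""
    rho = np.exp(logp_new - logp_old)
    unclipped = rho * advantages
    clipped = np.clip(rho, 1.0 - eps, 1.0 + eps) * advantages
    return np.mean(np.minimum(unclipped, clipped))
```

The `min` makes the objective a pessimistic bound: large ratios stop earning credit for positive advantages, and small ratios are still penalized for negative ones.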
Part II – Preference Alignment for Diffusion and Flow-Matching
Diffusion / flow-matching generative models are typically trained by (approximate) maximum likelihood or score/flow matching. That objective makes the model reproduce the training data distribution, but "preferred" generations (better aesthetics, stronger prompt faithfulness, style consistency, harmlessness, product-fitness, etc.) often define a different target distribution. Preference alignment, in this article, means modifying a pretrained diffusion/flow model so that its output distribution shifts toward samples that rank higher under human (or human-trained) preferences, while controlling the shift away from the base model.
This is precisely where RL (and RL-style objectives such as KL-regularized policy optimization, reward-weighted likelihood, and DPO-like pairwise objectives) becomes a natural toolbox: it provides distribution shaping under feedback rather than dataset reconstruction.
7. Unifying View: Denoising / Flow as a Policy Over Trajectories
In diffusion and flow-matching models, generation is not a one-shot prediction but a trajectory-producing process. A sample is obtained by evolving an initial noise variable toward a clean image or latent:
\[x_T \rightarrow x_{T-1} \rightarrow \cdots \rightarrow x_1 \rightarrow x_0.\]From the reinforcement learning perspective, this entire sampling process can be interpreted as one episode. The denoising model is the policy, each intermediate noisy latent is a state, each denoising update is an action, and the final generated sample receives a reward.
More formally, for a conditional generation task with condition $c$, such as a text prompt, we define the state at step $t$ as
\[s_t = (x_t, t, c),\]where $x_t$ is the current noisy latent, $t$ is the current noise level or time index, and $c$ is the conditioning signal. The policy is the pretrained diffusion or flow model:
\[\pi_\theta(a_t \mid s_t).\]The precise definition of the action depends on the sampler. In a stochastic DDPM-style sampler, a natural choice is
\[a_t = x_{t-1},\]so the policy is exactly the reverse denoising distribution:
\[\pi_\theta(a_t \mid s_t) = p_\theta(x_{t-1} \mid x_t, c).\]For common parameterizations, the model may instead output noise, clean data, score, or velocity:
\[\epsilon_\theta(x_t,t,c), \qquad \hat{x}_{0,\theta}(x_t,t,c), \qquad s_\theta(x_t,t,c), \qquad v_\theta(x_t,t,c).\]These outputs are not always the action itself, but they determine the action through the sampler update rule:
\[x_{t-1} = \mathrm{SamplerStep}(x_t, t, c; \theta).\]Therefore, the action can be viewed either as the next latent $x_{t-1}$, or as the model prediction that induces this next latent.
The transition dynamics are then simple. If the action is defined as \(a_t=x_{t-1}\), the environment transition is deterministic:
\[s_{t-1} = (x_{t-1}, t-1, c).\]Thus,
\[\mathcal{P}(s_{t-1}\mid s_t,a_t)=1.\]If the action is defined as $\epsilon_\theta$, $v_\theta$, or \(\hat{x}_{0,\theta}\), then the transition is induced by the chosen sampler, such as DDPM, DDIM, Euler, Heun, DPM-Solver, or an ODE solver.
The reward is usually assigned to the final generated sample:
\[R(x_0,c).\]It can measure image quality, prompt alignment, aesthetics, human preference, identity consistency, safety, or task-specific utility. In most diffusion RL formulations, the intermediate rewards are zero:
\[r_t = 0,\quad t>0,\]and only the terminal sample receives a reward:
\[r_0 = R(x_0,c).\]Thus, the trajectory is
\[\tau = (x_T, x_{T-1}, \ldots, x_0),\]and the RL objective becomes
\[J(\theta) = \mathbb{E}_{\tau\sim \pi_\theta} \left[ R(x_0,c) \right].\]For stochastic diffusion samplers, the trajectory likelihood factorizes as
\[p_\theta(\tau\mid c) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1}\mid x_t,c).\]Therefore,
\[\log p_\theta(\tau\mid c) = \log p(x_T) + \sum_{t=1}^{T} \log p_\theta(x_{t-1}\mid x_t,c).\]This is the key reason why policy-gradient methods such as DDPO can assign the terminal reward back to all denoising steps:
\[\nabla_\theta J(\theta) = \mathbb{E}_{\tau\sim\pi_\theta} \left[ R(x_0,c) \sum_{t=1}^{T} \nabla_\theta \log p_\theta(x_{t-1}\mid x_t,c) \right].\]The diffusion model is therefore not merely a generator of $x_0$; it is a trajectory policy over the entire reverse process.
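The factorized trajectory log-likelihood can be made concrete with a toy Gaussian reverse chain. In this sketch, `mean_fn` is a hypothetical stand-in for the denoiser's predicted posterior mean, and the interface is illustrative:

```python
import numpy as np

def gaussian_logpdf(x, mean, std):
    # log N(x; mean, std^2), summed over latent dimensions
    return np.sum(-0.5 * ((x - mean) / std) ** 2
                  - np.log(std) - 0.5 * np.log(2 * np.pi))

def trajectory_logprob(traj, mean_fn, stds):
    """log p_theta(tau | c) = sum_t log p_theta(x_{t-1} | x_t, c).

    traj:    list [x_T, ..., x_0] of latents
    mean_fn: mean_fn(x_t, t) -> predicted mean of x_{t-1} (toy stand-in
             for the denoising model)
    stds:    per-step standard deviations; stds[t] is used for step t -> t-1
    """
    T = len(traj) - 1
    logp = 0.0
    for i, t in enumerate(range(T, 0, -1)):
        x_t, x_tm1 = traj[i], traj[i + 1]
        logp += gaussian_logpdf(x_tm1, mean_fn(x_t, t), stds[t])
    return logp
```

Multiplying the gradient of this sum by the terminal reward gives exactly the REINFORCE-style estimator above.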
For flow-matching or deterministic ODE samplers, the situation is slightly different. The dynamics are often written as
\[\frac{dx_t}{dt} = v_\theta(x_t,t,c).\]Here, the model defines a deterministic velocity field rather than an explicit stochastic transition distribution. The generated sample is still produced by a trajectory, but the policy-gradient interpretation requires either a stochastic relaxation, an equivalent SDE formulation, or a surrogate likelihood over discretized solver steps. This is why PPO-style methods are more direct for stochastic diffusion samplers, while flow/ODE models often motivate pathwise reward gradients, adjoint methods, or preference objectives based on trajectory/score surrogates.
The correspondence can be summarized as follows:
| RL concept | Diffusion / flow generative model |
|---|---|
| Episode | One complete sampling process from $x_T$ to $x_0$ |
| State $s_t$ | Current latent $x_t$, time/noise level $t$, condition $c$ |
| Action $a_t$ | Next latent $x_{t-1}$, or model output such as $\epsilon_\theta$, $v_\theta$, \(\hat{x}_{0,\theta}\) |
| Policy \(\pi_\theta\) | Denoising model or velocity model |
| Transition \(\mathcal{P}\) | Sampler-induced update rule |
| Reward $R$ | Final image preference, quality, alignment, safety, or task score |
| Trajectory $\tau$ | Full reverse path \(x_T\to x_0\) |
This view is the conceptual bridge between classical RL and preference alignment for diffusion models. Once generation is understood as a policy over denoising trajectories, different alignment algorithms mainly differ in how they estimate and optimize the same underlying objective: increasing the expected reward of generated samples while preventing the learned policy from drifting too far away from the pretrained reference model.
8. Technique Routes to Implement
Once we adopt the view of Section 7, namely that denoising / flow can be interpreted as a policy over trajectories, preference alignment methods for diffusion and flow models can be organized into a small number of recurring technical routes. They differ mainly in where the preference signal enters the system, whether the base model parameters are updated, and how the terminal preference over samples is converted into an optimization signal over the generation trajectory.
In practice, the current landscape is best understood as seven routes, ranging from purely inference-time control to full online RL, pairwise preference optimization, binary-feedback objectives, offline reward-weighted regression, and black-box post-hoc alignment. The unifying goal is always the same: to shift the output distribution of a pretrained generator toward samples that are more preferred by humans (or by a learned proxy), while controlling collapse, reward hacking, and excessive drift from the base model.
Route A: Inference-time Reward / Energy Guidance
Core idea: do not update the model parameters $\theta$. Instead, modify the sampling dynamics at inference time so that generation is steered toward regions with higher reward or lower energy.
Canonical form: during sampling, we augment the denoising / flow dynamics with a reward-gradient term,
\[\frac{dx}{dt}= v_\theta(x,t,c) + \lambda(t)\nabla_x s(x,c),\]where $s(x,c)$ is a differentiable score, energy, or utility surrogate, such as CLIP similarity, an aesthetic predictor, or another reward-like signal.
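A minimal sketch of such guided sampling with a plain Euler integrator. Assumptions: the flow-matching convention t: 0 → 1 with noise at t = 0, and `v_theta` / `grad_s` given as callables rather than a real pretrained model:

```python
import numpy as np

def guided_sample(v_theta, grad_s, x_init, lam, n_steps=10):
    """Euler integration of dx/dt = v_theta(x, t) + lam(t) * grad_s(x),
    with t running from 0 (noise) to 1 (sample). Toy sketch only."""
    x = np.asarray(x_init, dtype=float).copy()
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        # Base velocity field plus the reward-gradient guidance term.
        x = x + dt * (v_theta(x, t) + lam(t) * grad_s(x))
    return x
```

With a zero base field and `grad_s` pointing toward a target, the guidance term alone pulls the sample toward the target; in practice the guidance strength `lam` must be tuned against the failure modes listed below.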
What is shared across this route:
- No parameter update: the base model remains unchanged.
- Immediate usability: no training loop is needed; alignment happens purely at test time.
- Depends on gradient quality: it works best when a reliable differentiable guidance signal is available.
- Typical failure modes: if the guidance strength is too large, images may become oversaturated, distorted, or less diverse. Also, the preference improvement is not "stored" in the model unless later distilled.
Representative algorithms / frameworks:
- Classifier Guidance as the classical precursor;
- CLIP-guided diffusion / GLIDE-style guidance 1;
- more general energy-guided or reward-guided sampling frameworks.
Strictly speaking, this route is not parameter fine-tuning; it is better viewed as training-free alignment at inference time.
Route B: Differentiable Reward Direct Optimization
Core idea: if the reward is differentiable, and the generation process is differentiable with respect to model parameters, then we can skip score-function estimators and directly optimize the reward by backpropagating through the sampling trajectory.
Canonical objective: if the final sample is written as \(x_0 = G_\theta(\epsilon, c)\), then
\[\nabla_\theta \mathbb{E}[R(x_0,c)]= \mathbb{E}_{\epsilon} \left[ \nabla_{x_0}R(x_0,c)\cdot \frac{\partial x_0}{\partial \theta} \right].\]The main difficulty is that \(\frac{\partial x_0}{\partial \theta}\) is not local: it depends on the entire denoising or flow trajectory. In discrete samplers, this leads to backpropagation through time (BPTT); in continuous-time formulations, it leads naturally to adjoint-based or stochastic-control-style derivations.
What is shared across this route:
- the gradient is usually lower variance than REINFORCE/PPO-style estimators;
- the method is often memory- and compute-heavy, because sampling itself becomes part of the computation graph;
- the central engineering challenge is how to backpropagate through the trajectory efficiently: full backpropagation, truncated backpropagation, randomized-step backpropagation, or adjoint approximations;
- because the reward is optimized directly, these methods must carefully address over-optimization and reward hacking.
Representative algorithms:
- AlignProp 2, which backpropagates reward gradients through a randomized number of denoising steps;
- DRaFT 3, together with DRaFT-K and DRaFT-LV, which explicitly study full-trajectory vs. truncated gradient propagation;
- Adjoint Matching 4 5, which recasts reward fine-tuning for flow and diffusion models through a stochastic optimal control lens and derives an adjoint-style regression objective.
Route C: Online RL / Policy-Gradient Style Optimization
Core idea: if the reward is non-differentiable, partially observed, human-provided, or otherwise black-box, then the natural route is to treat generation as a trajectory policy and optimize it using policy gradients.
Canonical objective: starting from
\[\nabla_\theta \mathbb{E}_{x_0\sim p_\theta}[R(x_0,c)]= \mathbb{E}_{x_0\sim p_\theta} \left[ R(x_0,c)\,\nabla_\theta \log p_\theta(x_0\mid c) \right],\]diffusion methods use the reverse chain to obtain a tractable decomposition,
\[\nabla_\theta \log p_\theta(x_0\mid c) \approx \sum_{t=1}^{T} \nabla_\theta \log p_\theta(x_{t-1}\mid x_t,c).\]This turns terminal reward optimization into a trajectory-level policy-gradient problem.
What is shared across this route:
- it works with arbitrary reward signals, including human preference and black-box evaluators;
- it usually suffers from high variance, low sample efficiency, and difficult credit assignment across long denoising chains;
- stabilization mechanisms such as advantages, baselines, KL regularization, clipping, and group-relative normalization are central;
- for deterministic flow / ODE generators, one often needs an additional trick to recover RL-style exploration, such as converting an ODE into an equivalent stochastic process.
Representative algorithms:
- DDPO 6, the canonical PPO-style route for diffusion models;
- DPOK 7, which combines policy optimization with KL regularization for text-to-image diffusion;
- Flow-GRPO 8, which extends online RL to flow-matching models by introducing an ODE-to-SDE conversion for exploration;
- more recent GRPO-style extensions such as DenseGRPO 9 and MV-GRPO, which try to address sparse terminal reward and coarse credit assignment.
Route D: Pairwise Preference Optimization (DPO-Style)
Core idea: instead of learning a scalar reward model and then doing RL, directly optimize the generator from pairwise preference data \((x^+, x^-)\) under the same condition $c$.
In classical DPO, the key object is a log-ratio against a reference policy. In diffusion and flow models, the main technical obstacle is that the generator usually does not expose a simple tractable sample likelihood for the final image $x_0$. Therefore, diffusion-specific DPO methods must construct a surrogate notion of log-probability or policy ratio, for example through ELBO-based approximations, reverse-step decompositions, score-based surrogates, inversion-based surrogates, or explicitly step-aware local preference objectives.
Canonical form: define
\[\Delta_\theta(x,c) = \log p_\theta(x\mid c)-\log p_{\text{ref}}(x\mid c),\]and optimize a logistic ranking loss,
\[\max_\theta \; \mathbb{E}_{(x^+,x^-)} \Big[ \log \sigma\big(\beta(\Delta_\theta(x^+,c)-\Delta_\theta(x^-,c))\big) \Big].\]What is shared across this route:
- it is often more stable than online RL, because there is no explicit rollout-critic loop;
- it still needs some mechanism to keep the model close to a reference policy;
- the key methodological question is always: what surrogate of policy likelihood is appropriate for diffusion / flow trajectories?
Representative algorithms:
- Diffusion-DPO 10, which adapts DPO to diffusion via an ELBO-based likelihood surrogate;
- SPO / Step-aware Preference Optimization, which pushes preference supervision to intermediate denoising steps;
- later diffusion-specific DPO variants that improve the surrogate or the data construction pipeline.
This family should be viewed as the diffusion analogue of "RL-free preference optimization," but its hard part is not the logistic loss itself; it is the construction of a faithful trajectory-aware preference surrogate.
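The logistic ranking loss itself is simple once the surrogates are available. In this sketch, `delta_pos` / `delta_neg` are assumed to be precomputed \(\Delta_\theta\) values for the preferred and rejected samples (e.g., from an ELBO- or step-wise surrogate):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dpo_loss(delta_pos, delta_neg, beta=0.1):
    """Pairwise logistic loss on log-ratio surrogates.

    delta_pos / delta_neg: Delta_theta(x^+, c) and Delta_theta(x^-, c),
    i.e. log p_theta - log p_ref for each sample in the pair.
    Returns a loss to minimize (negative of the objective above).
    """
    return -np.log(sigmoid(beta * (delta_pos - delta_neg))).mean()
```

When both samples look equally likely relative to the reference, the loss sits at log 2; it decreases monotonically as the model separates the preferred sample from the rejected one.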
Route E: Binary-Feedback / KTO-Style Optimization
Core idea: in many real systems, preference data does not arrive as carefully curated pairwise comparisons. Instead, one often observes only per-sample binary feedback, such as like/dislike, accept/reject, or positive/negative utility.
KTO-style methods adapt preference optimization to this weaker but cheaper feedback regime. Instead of comparing two outputs directly, they treat each output independently and optimize a utility-shaped objective relative to a reference model.
Generic form:
\[\max_\theta\; \mathbb{E} \Big[ y \log \sigma(\beta\Delta_\theta(x,c)) + (1-y)\log \sigma(-\beta\Delta_\theta(x,c)) \Big],\]where \(y\in\{0,1\}\) is binary feedback and \(\Delta_\theta\) again measures the log-ratio relative to a reference policy.
What is shared across this route:
- data collection is often simpler and cheaper than pairwise preference annotation;
- the signal is usually weaker and more sensitive to calibration, imbalance, and noise;
- as in DPO-style methods, the key technical issue is still how to define a stable diffusion-specific surrogate for the policy ratio.
Representative algorithm:
- Diffusion-KTO 11, which explicitly formulates text-to-image diffusion alignment from independent binary human utility signals rather than pairwise comparisons.
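A minimal sketch of the binary-feedback objective above, assuming the log-ratio surrogate \(\Delta_\theta\) has already been computed per sample:

```python
import numpy as np

def kto_objective(delta, y, beta=0.1):
    """Per-sample binary-feedback objective (to maximize).

    delta: log-ratio surrogate Delta_theta(x, c) per sample
    y:     binary feedback in {0, 1} per sample
    """
    z = beta * np.asarray(delta, dtype=float)
    sig = lambda u: 1.0 / (1.0 + np.exp(-u))
    # y pushes sigma(beta*delta) up; (1-y) pushes sigma(-beta*delta) up.
    return float(np.mean(y * np.log(sig(z)) + (1 - y) * np.log(sig(-z))))
```

Each sample contributes independently, which is what makes this route compatible with cheap like/dislike feedback instead of curated pairs.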
Route F: Offline Reward-Weighted Regression / Likelihood
Core idea: instead of doing online RL, we may first obtain a dataset of generated samples together with scalar rewards or preference-derived scores, and then fine-tune the generator using reward-weighted maximum likelihood or an equivalent weighted regression objective.
Canonical objective:
\[\max_\theta \; \mathbb{E}_{(x,c,R)\sim\mathcal{D}} \big[ w(R)\log p_\theta(x\mid c) \big]- \beta\,\mathrm{KL}(p_\theta\|p_{\text{ref}}),\]where the weight \(w(R)\) is a monotone function of reward, such as \(\exp(R/\alpha)\) or a rank-based transformation.
What is shared across this route:
- it is typically more stable and easier to implement than online policy gradients;
- it avoids the variance of RL, but can be limited by the support of the offline dataset;
- it is sensitive to reward scale, temperature choice, and whether the high-reward subset still covers enough diversity.
Representative algorithm:
- Aligning Text-to-Image Models using Human Feedback, which learns a reward model from human feedback and then fine-tunes the text-to-image model by maximizing reward-weighted likelihood.
Conceptually, this route sits between supervised fine-tuning and RL: the generator is still optimized by a likelihood-style objective, but the data are reweighted by human preference.
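The weighting step can be sketched as follows; the self-normalized exponential form is one common choice, not the exact recipe of any single paper:

```python
import numpy as np

def rwr_weights(rewards, alpha=1.0):
    """Self-normalized exponential weights w(R) proportional to exp(R/alpha)."""
    r = np.asarray(rewards, dtype=float)
    w = np.exp((r - r.max()) / alpha)  # max-shift for numerical stability
    return w / w.sum()
```

The fine-tuning loss is then a weighted negative log-likelihood over the offline dataset, `-(weights * logp_theta).sum()`, optionally plus the KL penalty against the reference model.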
Route G: Black-box Post-hoc Alignment without Updating the Base Model
Core idea: sometimes the base generator is not accessible for gradient-based fine-tuning at all. We may only have input-output access to a powerful but closed model. In that setting, the alignment problem changes: rather than updating the base model, we learn an auxiliary transformation or post-hoc preference-improving map on top of its outputs.
This is the setting addressed by Preference Flow Matching (PFM). Rather than fine-tuning the original model, PFM learns a flow that transforms less-preferred outputs into more-preferred ones using preference data. The base generator can remain untouched, which makes this route especially attractive when working with APIs or proprietary systems.
What is shared across this route:
- it is suitable for black-box or partially inaccessible generators;
- it avoids modifying the base policy parameters;
- it is better described as post-hoc alignment than as standard fine-tuning;
- its optimization target is still preference alignment, but the learned object is an external preference-improving transport / flow, not the original denoiser itself.
Representative algorithm:
- PFM (Preference alignment with flow matching) 12.
This route is important because it widens the alignment toolbox beyond the usual assumption that the underlying diffusion or flow model can be directly optimized.
Part III — Differentiable Reward Optimization
In Part II, we listed several routes for preference alignment. Among them, Differentiable Reward Direct Optimization is the most "direct" analogue of pathwise optimization: if the reward model is differentiable w.r.t. pixels (or latents), we can backpropagate reward gradients through the entire denoising / flow trajectory and update the generator parameters without REINFORCE/PPO-style score-function estimators.
This chapter focuses on one core technical question: How do we compute and implement the gradient
\[\nabla_\theta J(\theta) \;=\; \nabla_\theta \ \mathbb{E}_{x_0\sim p_\theta(\cdot|c)}[R(x_0,c)]\]when \(x_0\) is produced by a multi-step diffusion / ODE sampling trajectory?
We will first derive the full gradient-through-trajectory formula step-by-step, and then show how it becomes standard BPTT (or an adjoint method in continuous time), and finally summarize the key truncation / stabilization strategies used by core algorithms such as ReFL, DRaFT(-K/-LV), AlignProp, and VADER.
9. Discrete-time diffusion sampling as a differentiable computation graph
We start from the most implementation-relevant setting: a discrete-time sampler (e.g., DDIM, DDPM, EDM discretizations). A reverse trajectory (episode) is:
\[\tau \;=\; (x_T, x_{T-1}, \ldots, x_0), \qquad x_T\sim\mathcal{N}(0,I).\]At each denoising step, the model outputs an âactionâ such as \(\epsilon_\theta(x_t,t,c)\) (noise prediction) and the sampler deterministically updates \(x_{t-1}\) from \(x_t\). Concretely, we writes a DDIM-style update via a predicted clean latent \(\hat{x}_0\):
\[\hat{x}_0 = \frac{x_t-\sigma_t\epsilon_\theta(x_t,c,t)}{\alpha_t}, \qquad x_{t-1} = \alpha_{t-1}\hat{x}_0+\sigma_{t-1}\epsilon_\theta(x_t,c,t).\]This is crucial: sampling is a long, differentiable chain (unless we insert stop-gradient), so the final sample is a deterministic function of model parameters and the initial noise:
\[x_0 \;=\; G_\theta(x_T,c).\]We assume a differentiable reward \(R(x_0,c)\) (e.g., CLIP-like, aesthetic, PickScore-type, video reward, etc.). The objective is:
\[J(\theta)\;=\;\mathbb{E}_{c\sim D_c,\;x_T\sim\mathcal{N}(0,I)}\Big[R\big(G_\theta(x_T,c),c\big)\Big].\]Pathwise (reparameterization) gradient. Since randomness enters only through \(x_T\), we can differentiate through \(G_\theta\):
\[\nabla_\theta J(\theta) = \mathbb{E}_{c,x_T}\Big[ \nabla_\theta R(x_0,c) \Big] = \mathbb{E}_{c,x_T}\Big[ \underbrace{\nabla_{x_0}R(x_0,c)}_{\text{reward gradient}} \cdot \underbrace{\frac{\partial x_0}{\partial \theta}}_{\text{trajectory Jacobian}} \Big].\]So the entire problem reduces to computing trajectory Jacobian \(\tfrac{\partial x_0}{\partial\theta}\) when $x_0$ is the endpoint of an unrolled chain.
Let the (possibly solver-dependent) one-step update be:
\[x_{t-1}=F_t(x_t,\theta,c), \quad\text{for}\;\;t=T,\ldots,1.\]Then by repeated substitution,
\[x_0 = F_1(F_2(\cdots F_T(x_T,\theta,c)\cdots)).\]Differentiate \(R(x_0,c)\) w.r.t. \(\theta\) by the chain rule:
\[\nabla_\theta R(x_0,c) = \frac{\partial R}{\partial x_0} \cdot \frac{\partial x_0}{\partial\theta}.\]Now expand \(\frac{\partial x_0}{\partial\theta}\) through the composition. A clean way is to define Jacobians:
\[A_t \;\triangleq\;\frac{\partial x_{t-1}}{\partial x_t}\in\mathbb{R}^{d\times d}, \qquad B_t \;\triangleq\;\frac{\partial x_{t-1}}{\partial\theta}\in\mathbb{R}^{d\times |\theta|}.\]From \(x_{t-1}=F_t(x_t,\theta,c)\), we have:
\[\frac{\partial x_{t-1}}{\partial\theta} = \underbrace{\frac{\partial F_t}{\partial\theta}}_{B_t} + \underbrace{\frac{\partial F_t}{\partial x_t}}_{A_t} \cdot \frac{\partial x_t}{\partial\theta}.\]Unrolling this recursion yields the "sum over time" decomposition:
\[\begin{align} \frac{\partial x_0}{\partial\theta} & = B_1 + A_1B_2 + A_1A_2B_3 +\cdots+ A_1A_2\cdots A_{T-1}B_T \\[10pt] & = \sum_{k=1}^{T} \left(\prod_{j=1}^{k-1}A_j\right) B_k. \end{align}\]Multiplying by \(\frac{\partial R}{\partial x_0}\), we obtain:
\[\nabla_\theta R(x_0,c) = \frac{\partial R}{\partial x_0}\Big( B_1 + A_1B_2 +\cdots+ A_1\cdots A_{T-1}B_T \Big).\]This expression is mathematically complete, but computing products of full Jacobians is not how we implement it. Instead, we use the standard backpropagation-through-time (BPTT) view.
Define an "adjoint" (reverse-mode) vector at each timestep:
\[\lambda_t \;\triangleq\; \frac{\partial R}{\partial x_t}\in\mathbb{R}^d.\]By definition, \(\lambda_0=\frac{\partial R}{\partial x_0}\). For \(t=1,\ldots,T\), because \(x_{t-1}=F_t(x_t,\theta,c)\),
\[\lambda_t = \lambda_{t-1}\cdot \frac{\partial x_{t-1}}{\partial x_t} = \lambda_{t-1}\cdot A_t.\]Now the gradient w.r.t. parameters is:
\[\nabla_\theta R(x_0,c)= \sum_{t=1}^{T} \lambda_{t-1}\cdot\frac{\partial x_{t-1}}{\partial\theta} = \sum_{t=1}^{T} \lambda_{t-1}\cdot B_t.\]This is the exact "penetrate-the-trajectory" formula in a computational form:
- Forward pass: generate sample from \(x_T\) to \(x_0\).
- Backward pass: propagate \(\lambda_0\to \lambda_T\) through step Jacobians, and accumulate \(\sum_t \lambda_{t-1}B_t\).
In practice, modern autograd frameworks do this automatically once the sampling loop is unrolled (or checkpointed), which is precisely the motivation behind DRaFT and AlignProp-style implementations. DRaFT explicitly frames multiple methods (DRaFT / DRaFT-K / DRaFT-LV / ReFL) under a unified "stop-gradient inside the sampling loop" algorithm.
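The forward/backward recipe can be verified on a toy scalar sampler where the Jacobians are known in closed form. Here `step` is a hypothetical stand-in for a DDIM update (linear in $x$, so $A_t = 0.9$ is constant; for a real UNet these Jacobians are state-dependent and handled by autograd):

```python
import numpy as np

def step(x, theta, t):
    # Toy one-step sampler F_t(x, theta): A_t = dF/dx = 0.9, B_t = dF/dtheta = cos(t).
    return 0.9 * x + theta * np.cos(t)

def rollout(x_T, theta, T):
    x = x_T
    for t in range(T, 0, -1):   # forward pass: x_T -> x_0
        x = step(x, theta, t)
    return x

def reward(x0):
    return -(x0 - 1.0) ** 2     # toy differentiable reward

def bptt_grad(x_T, theta, T):
    x0 = rollout(x_T, theta, T)
    lam = -2.0 * (x0 - 1.0)     # lambda_0 = dR/dx_0
    g = 0.0
    for t in range(1, T + 1):   # backward pass over steps t = 1..T
        g += lam * np.cos(t)    # accumulate lambda_{t-1} * B_t
        lam *= 0.9              # lambda_t = lambda_{t-1} * A_t
    return g
```

The manually accumulated gradient matches a finite-difference check, which is exactly what an autograd framework computes when the sampling loop is left in the graph.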
10. Continuous-time view: flow matching / diffusion ODE and the adjoint method
The same idea becomes even cleaner in continuous time. Assume the sampler is an ODE (e.g., probability flow ODE / diffusion ODE / flow matching sampler):
\[\frac{dx(t)}{dt} = v_\theta(x(t),t,c), \qquad x(T)=x_T.\]Let the terminal generated sample be \(x(0)\) and the reward be \(R(x(0),c)\). Then:
\[\nabla_\theta R(x(0),c) = \int_{0}^{T} \underbrace{a(t)^\top}_{\text{adjoint}} \underbrace{\frac{\partial v_\theta(x(t),t,c)}{\partial \theta}}_{\text{parameter sensitivity}} \;dt,\]where the adjoint satisfies the reverse-time ODE:
\[\frac{da(t)}{dt} = \left(\frac{\partial v_\theta(x(t),t,c)}{\partial x}\right)^\top a(t), \qquad a(0)=\nabla_x R(x(0),c).\]This is the continuous analogue of BPTT: instead of multiplying Jacobians across discrete steps, we integrate an adjoint ODE backward in time. In implementations, many works still discretize and rely on autograd/checkpointing, but conceptually the "reward gradients penetrate the trajectory" statement is identical in discrete and continuous time.
11. A canonical training template (full backprop vs. truncated backprop)
At this point, the algorithmic template is almost trivial:
- Sample \(c\sim D_c\), noise \(x_T\sim\mathcal{N}(0,I)\)
- Run the sampler \(x_T\to x_0\)
- Compute reward \(R(x_0,c)\)
- Backpropagate \(\nabla_\theta R(x_0,c)\) through the sampler and update \(\theta\):
\[\theta \;\leftarrow\; \theta + \eta\,\nabla_\theta R(x_0,c) \;=\; \theta + \eta\sum_{t=1}^{T}\lambda_{t-1}\cdot B_t,\]where \(\eta\) is the learning rate and \(\lambda_{t-1}\), \(B_t\) are the adjoint vectors and per-step parameter Jacobians defined above.
However, two practical constraints dominate real systems:
- Memory / compute: full BPTT stores or recomputes activations across $T$ steps.
- Over-optimization (reward hacking / collapse): pushing too hard on a learned reward can collapse diversity.
Therefore, the key engineering question becomes: How many denoising steps should we differentiate through? Full chain (BPTT) or truncated (TBTT), and how to choose the truncation length?
12. Truncation and stabilization strategies
We now organize the main truncation strategies as a sequence of increasingly general "stop-gradient policies" inside the sampling trajectory. A convenient unifying view is precisely DRaFT's Algorithm 1, which shows how ReFL, DRaFT, DRaFT-K, and DRaFT-LV can all be written by choosing a truncation point \(t_{\text{truncate}}\) and inserting stop_grad in the sampling loop.
12.1 ReFL: "early stop + one-step gradient at a late timestep"
Core idea. ReFL observes that reward scores (e.g., ImageReward) become informative only after a sufficiently large number of denoising steps, and proposes to pick a random late timestep \(t\), run denoising without gradients down to \(t\), then do a single differentiable denoising step and treat its output as a proxy of \(x_0\), backpropagating reward only through that last step. Algorithmically, ReFL:
- samples a random \(t\in[T_1,T_2]\),
- performs steps \(T\to t+1\) with no grad,
- performs step \(t\to t-1\) with grad,
- decodes/predicts a proxy image and applies a reward-to-loss map.
This corresponds to inserting a stop-gradient and breaking the loop early in the DRaFT unified template.
Stability note. ReFL motivates randomizing $t$ (instead of always using the final step) for stability: it reports that using only the last denoising step's gradient is unstable and leads to bad results, hence random late-step selection and additional regularization with the pre-training loss.
12.2 DRaFT: full backprop through sampling (BPTT) + memory tricks
Core idea. DRaFT treats sampling as a differentiable computation graph and backpropagates reward gradients through the sampling process into LoRA weights.
Two practical "enablers" are highlighted:
- LoRA: update only low-rank adapters, freezing base weights, reducing memory and making fine-tuning lightweight.
- Gradient checkpointing: store only selected activations (e.g., per-step latents) and recompute the rest during backward, reducing memory at the cost of compute. DRaFT explicitly describes storing only the input latent per denoising step and re-materializing UNet activations.
This makes full BPTT feasible (at least in principle), but DRaFT also finds that truncation can improve both compute and sample efficiency, leading to DRaFT-K.
12.2.1 DRaFT-K: fixed-K truncated backpropagation (TBTT)
Core idea. Differentiate through only the last \(K\) denoising steps. DRaFT states this directly: "DRaFT-K truncates the backward pass, differentiating through only the last K steps of sampling."
In the unified algorithm, this is implemented by inserting:
\[x_{t_{\text{truncate}}}=\texttt{stop\_grad}(x_{t_{\text{truncate}}})\]at \(t_{\text{truncate}}=K\), ensuring gradients only flow through the final segment.
DRaFT reports that truncation:
- reduces compute by reducing the number of backward passes through the UNet,
- and surprisingly improves training efficiency as well.
It also notes that for small $K$ (e.g., $K=1$), memory is small enough that checkpointing may not be needed.
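The stop-gradient truncation can be sketched with the same kind of toy scalar sampler (hypothetical, not DRaFT's actual UNet pipeline); gradients are accumulated only over the last $K$ steps of the backward pass:

```python
import numpy as np

def sample_and_grad(x_T, theta, T, K):
    """Toy sampler x_{t-1} = 0.9*x_t + theta*cos(t) with reward R = x_0^2.

    Returns (x_0, gradient through only the last K steps): the break
    below plays the role of inserting stop_grad at t_truncate = K.
    """
    x = x_T
    for t in range(T, 0, -1):          # forward pass: x_T -> x_0
        x = 0.9 * x + theta * np.cos(t)
    lam, g = 2.0 * x, 0.0              # lambda_0 = dR/dx_0 for R = x_0^2
    for t in range(1, min(K, T) + 1):  # backprop only steps t = 1..K
        g += lam * np.cos(t)           # accumulate lambda_{t-1} * B_t
        lam *= 0.9                     # lambda_t = lambda_{t-1} * A_t
    return x, g
```

With `K = T` this recovers the full BPTT gradient; with `K = 1` it keeps only the final step's contribution, which is the cheapest (and highest-bias) setting.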
12.2.2 DRaFT-LV: low-variance estimator for K=1
Empirically, DRaFT finds that \(K=1\) offers the best reward-vs-compute tradeoff, but it comes with higher variance. DRaFT-LV reduces variance by generating multiple "nearby" training examples via forward diffusion of the generated image and averaging reward gradients. In Algorithm 1, after computing \(g=\nabla_\theta r(x_0,c)\), DRaFT-LV:
- noises the generated $x_0$ to $x_1$ multiple times,
- re-runs the final-step denoising prediction,
- and accumulates gradients across these samples.
This is a clean example of a general design principle: when truncation reduces compute, variance often rises, and we can trade extra cheap augmentations for variance reduction.
12.3 AlignProp: randomized TBTT to mitigate truncation bias and collapse
AlignProp reports a strong failure mode for full BPTT: mode collapse within two epochs, where the model generates the same image regardless of the prompt.
To address this, it uses randomized truncated backpropagation through time, motivated by the idea that fixed truncation length \(K\) biases gradients toward short-range dependencies, and randomizing truncation length can mitigate this bias. AlignProp explicitly states that:
- setting \(K\sim\mathrm{Uniform}(0,50)\) gives the most promising results in human evaluation,
and the randomized TBTT gradient has the form:
\[\nabla_\theta L_{\text{TBTT-align}}= \frac{\partial L_{\text{align}}}{\partial\theta} + \sum_{k=0}^{K} \frac{\partial L_{\text{align}}}{\partial x_k} \frac{\partial x_k}{\partial\theta}.\]
In short:
- DRaFT-K: fixed $K$.
- AlignProp: randomize $K$ to reduce truncation bias and improve stability.
AlignProp also emphasizes memory reduction via LoRA + checkpointing, reporting dramatic memory reductions that enable backpropagation through longer chains.
13. Adjoint Method
Part IV — Online RL / Policy-Gradient Style Optimization
In Route C, we introduced the key motivation for policy-gradient-style alignment: the reward can be non-differentiable, noisy, or even provided by humans, so we cannot backpropagate through the reward model. Instead, we treat the diffusion/flow sampler as a trajectory policy and use score-function estimators (REINFORCE / PPO-style surrogates) to update the generator.
This chapter focuses on a practical question: if diffusion sampling is an "episode" (trajectory) and each denoising step is a "decision", how do we (i) define the policy log-probability, (ii) compute a stable policy-gradient update, and (iii) control distribution shift and reward over-optimization?
We answer these by first summarizing the shared template of this family, and then deep-diving into two representative algorithms: DDPO and DPOK.
14. Shared Template: Diffusion as a Stochastic Policy Over Denoising Trajectories
At a high level, both DDPO and DPOK follow the same abstraction:
A rollout is the reverse trajectory
\[\tau = (x_T, x_{T-1}, \ldots, x_0),\qquad x_T\sim \mathcal{N}(0,I).\]The diffusion model defines a Markov policy over transitions
\[\pi_\theta(x_{t-1}\mid x_t,c) \equiv p_\theta(x_{t-1}\mid x_t,c),\]where in DDPM-style samplers this transition is typically Gaussian with fixed variance schedule.
The reward is terminal-only (most commonly):
\[R(\tau,c) = R(x_0,c),\]which makes credit assignment across many denoising steps non-trivial.
Under this view, the objective is the RL-style preference-alignment template already introduced: maximize expected reward while controlling the deviation from a reference/pretrained model.
Score-function gradient (REINFORCE form). With black-box reward, we use the likelihood-ratio trick:
\[\nabla_\theta \mathbb{E}_{\tau\sim p_\theta(\tau|c)}[R(x_0,c)] = \mathbb{E}_{\tau\sim p_\theta(\tau|c)}\Big[ R(x_0,c)\,\nabla_\theta \log p_\theta(\tau|c) \Big].\]For a Markov chain,
\[\begin{align} \log p_\theta(\tau|c) & = \sum_{t=1}^{T}\log p_\theta(x_{t-1}\mid x_t,c) \\[10pt] \quad\Rightarrow\quad \nabla_\theta \log p_\theta(\tau|c) & = \sum_{t=1}^{T}\nabla_\theta \log p_\theta(x_{t-1}\mid x_t,c). \end{align}\]This gives the canonical estimator:
\[g_{\text{PG}}(\theta) = \mathbb{E}_{\tau}\Big[ \Big(R(x_0,c)-b(c)\Big)\sum_{t=1}^{T}\nabla_\theta \log p_\theta(x_{t-1}\mid x_t,c) \Big],\]where \(b(c)\) is a baseline (often prompt-dependent) to reduce variance.
Why diffusion PG is unusually high-variance. Compared to classic RL with dense rewards, diffusion alignment faces two variance amplifiers:
1) Terminal reward only: the same scalar \(R(x_0,c)\) is credited to all $T$ denoising steps.
2) Long horizon: even with \(T=50\) steps, the gradient accumulates through dozens of transition logprobs.
As a result, practical systems rely heavily on:
- reward normalization / baselines,
- conservative learning rates and gradient clipping,
- trust-region constraints (PPO clipping / KL penalties),
- and carefully designed "training-time sampling" consistent with inference (e.g., CFG).
15. DDPO: PPO-style Policy Optimization for Diffusion Models
DDPO ("Training Diffusion Models with Reinforcement Learning") fine-tunes a pretrained text-to-image diffusion model using a scalar reward such as ImageReward or aesthetic predictors. The key idea is to treat the denoising Markov chain as a stochastic policy and apply policy gradient updates. DDPO provides two closely related variants:
- DDPO\(_{\mathrm{SF}}\): a simple score-function policy gradient estimator (REINFORCE-style).
- DDPO\(_{\mathrm{IS}}\): an importance-sampling variant with a PPO-style clipped surrogate to enable multiple updates per rollout batch.
We present both through a unified lens below.
15.1 DDPO\(_{\mathrm{SF}}\): On-policy Score-Function (REINFORCE-style)
DDPO\(_{\mathrm{SF}}\) uses the direct estimator:
\[g_{\text{SF}}(\theta) = \mathbb{E}_{\tau\sim p_\theta(\cdot|c)}\Big[ \hat{A}(x_0,c)\sum_{t=1}^{T}\nabla_\theta \log p_\theta(x_{t-1}\mid x_t,c) \Big],\]where \(\hat{A}\) is a normalized reward (an "advantage-like" signal). In practice:
Prompt-wise reward normalization is critical: maintain a running mean/std for each prompt $c$, and use
\[\hat{A} = (R-\mu_c)/(\sigma_c+\epsilon).\]This plays the role of a prompt-conditional baseline.
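A minimal sketch of such a running per-prompt normalizer, using Welford's online algorithm; the class name and interface are invented for illustration:

```python
from collections import defaultdict

class PromptNormalizer:
    """Running per-prompt mean/std, turning raw rewards into
    advantage-like signals (a prompt-conditional baseline)."""
    def __init__(self, eps=1e-6):
        self.n = defaultdict(int)
        self.mean = defaultdict(float)
        self.m2 = defaultdict(float)   # Welford's sum of squared deviations
        self.eps = eps

    def advantage(self, prompt, reward):
        # update running statistics (Welford's algorithm) ...
        self.n[prompt] += 1
        d = reward - self.mean[prompt]
        self.mean[prompt] += d / self.n[prompt]
        self.m2[prompt] += d * (reward - self.mean[prompt])
        # ... then normalize: (R - mu_c) / (sigma_c + eps)
        var = self.m2[prompt] / max(self.n[prompt] - 1, 1)
        return (reward - self.mean[prompt]) / (var ** 0.5 + self.eps)
```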
Since the estimator is on-policy, DDPO typically performs one update per rollout batch to avoid drifting too far from the sampling policy.
15.2 DDPO\(_{\mathrm{IS}}\): Importance Sampling + PPO-style Clipping
To improve sample efficiency, DDPO\(_{\mathrm{IS}}\) reuses the same rollout batch for multiple gradient steps. This requires correcting for policy drift with an importance ratio.
Let \(\theta_{\text{old}}\) denote the parameters used to generate the trajectories. Define a per-step ratio:
\[r_t(\theta) = \frac{p_\theta(x_{t-1}\mid x_t,c)} {p_{\theta_{\text{old}}}(x_{t-1}\mid x_t,c)} = \exp\Big( \log p_\theta(x_{t-1}\mid x_t,c)-\log p_{\theta_{\text{old}}}(x_{t-1}\mid x_t,c) \Big).\]A PPO-style clipped surrogate uses:
\[L^{\text{clip}}(\theta) = \mathbb{E}_{\tau\sim p_{\theta_{\text{old}}}} \Big[ \sum_{t=1}^{T} \min\big( r_t(\theta)\hat{A},\; \mathrm{clip}(r_t(\theta),1-\epsilon,1+\epsilon)\hat{A} \big) \Big].\]Diffusion-specific note. DDPO reports that the clip range must be extremely small compared to standard RL tasks, because each denoising transition is a tiny stochastic update and the product of many ratios can become fragile. Empirically, a very small \(\epsilon\) is used.
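The per-step clipped term can be sketched as follows, working in log-probabilities for numerical stability. The tiny default `eps` reflects the diffusion-specific note above, though the exact value is a tuning choice:

```python
import math

def clipped_surrogate(logp_new, logp_old, advantage, eps=1e-4):
    """PPO-style clipped objective for one denoising transition.
    Returns min(r * A, clip(r, 1-eps, 1+eps) * A)."""
    ratio = math.exp(logp_new - logp_old)          # r_t(theta)
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps)
    return min(ratio * advantage, clipped * advantage)
```

For a positive advantage, the gradient vanishes once the ratio exceeds \(1+\epsilon\), which is what limits how far each transition can move per rollout batch.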
16. DPOK: REINFORCE-style Diffusion Policy Optimization with KL Regularization
DPOK ("Reinforcement Learning for Fine-tuning Text-to-Image Diffusion Models") was developed concurrently with DDPO and shares the same "diffusion-as-policy" framing. Its main conceptual emphasis is that preference alignment should be KL-regularized toward the pretrained model to avoid overfitting the reward model.
DPOK constructs an explicit MDP and derives a REINFORCE gradient, then introduces a tractable surrogate for KL regularization in diffusion models.
16.1 MDP Formulation and REINFORCE Gradient
DPOK models DDPM denoising as a $T$-horizon MDP:
- state \(s_t = (z, x_{T-t})\),
- action \(a_t = x_{T-t-1}\),
- deterministic transition to the next state identified by the action,
- reward is zero except at the terminal step where \(r(x_0,z)\) is evaluated.
The resulting objective is to maximize expected reward over prompts, and the gradient takes the classic REINFORCE form:
\[\nabla_\theta \mathbb{E}_{z}\mathbb{E}_{x_0\sim p_\theta(\cdot|z)}[-r(x_0,z)] = \mathbb{E}_{z}\mathbb{E}_{\tau\sim p_\theta(\cdot|z)} \Big[ -r(x_0,z)\sum_{t=1}^{T}\nabla_\theta \log p_\theta(x_{t-1}\mid x_t,z) \Big].\]A notable modeling choice in DPOK is to keep the diffusion covariance fixed and train only the mean \(\mu_\theta\), which provides a natural stochastic policy for exploration.
16.2 KL Regularization via a Tractable Upper Bound
In text-to-image diffusion, the KL between final-image distributions
\[\mathrm{KL}(p_\theta(x_0|z)\,\|\,p_{\text{pre}}(x_0|z))\]is not directly tractable because \(p_\theta(x_0\mid z)\) is a marginal.
DPOKâs key lemma states that, under Markov chains starting from the same \(x_T\sim \mathcal{N}(0,I)\), the final KL is upper bounded by a sum of stepwise transition KLs:
\[\mathrm{KL}\big(p_\theta(x_0|z)\,\|\,p_{\text{pre}}(x_0|z)\big) \;\le\; \sum_{t=1}^{T}\mathbb{E}_{x_t\sim p_\theta(\cdot|z)} \Big[ \mathrm{KL}\big( p_\theta(x_{t-1}\mid x_t,z)\,\|\,p_{\text{pre}}(x_{t-1}\mid x_t,z) \big) \Big].\]This upper bound yields a tractable KL-regularized objective that looks like âRLHF for diffusionâ: reward maximization minus a weighted sum of per-step KLs.
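For DDPM-style transitions with a shared, fixed covariance \(\sigma_t^2 I\), each stepwise KL on the right-hand side reduces to a scaled squared distance between the two predicted means, which is what makes the bound cheap to optimize. A sketch:

```python
def stepwise_kl(mu_theta, mu_pre, sigma_t):
    """KL between two Gaussian reverse transitions with shared
    covariance sigma_t^2 I: KL = ||mu_theta - mu_pre||^2 / (2 sigma_t^2)."""
    sq = sum((a - b) ** 2 for a, b in zip(mu_theta, mu_pre))
    return sq / (2.0 * sigma_t ** 2)
```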
17. Flow-GRPO
18. Step-wise GRPO
Part V – Direct Preference Optimization (RL-Free Style)
This chapter focuses on RL-free preference alignment for diffusion / flow-matching generators: methods that learn from preference data without training an explicit reward model and without running on-policy RL (e.g., PPO). The core idea mirrors DPO for LLMs: rewrite the (unknown) reward in terms of log-probability ratios between the fine-tuned model and a fixed reference model, then optimize a simple logistic loss on preference pairs.
19. A Shared Template for DPO-Style Diffusion Alignment
We assume a static preference dataset
\[\mathcal{D}=\{(c,\,x^{\mathrm{w}}_0,\,x^{\mathrm{l}}_0)\},\qquad x^{\mathrm{w}}_0 \succ x^{\mathrm{l}}_0\mid c,\]where $c$ is the conditioning (prompt / class / control), and $x_0^{\mathrm{w}}$ ("winner") is preferred over $x_0^{\mathrm{l}}$ ("loser"). A reference generator $p_{\mathrm{ref}}(x\mid c)$ (typically the pre-trained checkpoint) is kept fixed to act as a trust region. Many alignment objectives can be written as a KL-regularized RL problem:
\[\max_{p}\;\mathbb{E}_{x\sim p(\cdot\mid c)}[r(c,x)]\; -\; \beta\,D_{\mathrm{KL}}\big(p(\cdot\mid c)\,\|\,p_{\mathrm{ref}}(\cdot\mid c)\big).\]The optimal distribution has the exponential-tilting form
\[p^*(x\mid c)\propto p_{\mathrm{ref}}(x\mid c)\,\exp\Big(\tfrac{1}{\beta}r(c,x)\Big),\]so the (unknown) reward can be rewritten (up to a constant in $c$) as
\[r(c,x)=\beta\,\log\frac{p^*(x\mid c)}{p_{\mathrm{ref}}(x\mid c)} + \mathrm{const}(c).\]Assuming a BradleyâTerry preference model,
\[\Pr(x^{\mathrm{w}}\succ x^{\mathrm{l}}\mid c)\;=\;\sigma\Big(r(c,x^{\mathrm{w}})-r(c,x^{\mathrm{l}})\Big),\]and substituting the reward identity above yields the standard DPO logistic loss:
\[\mathcal{L}_{\mathrm{DPO}}(\theta)=\mathbb{E}_{(c,x_0^{\mathrm{w}},x_0^{\mathrm{l}})\sim\mathcal{D}}\Big[-\log\sigma\big(\beta\,\Delta_{\theta}(c;x_0^{\mathrm{w}},x_0^{\mathrm{l}})\big)\Big],\]where the "log-odds gap" is
\[\Delta_{\theta}(c;x_0^{\mathrm{w}},x_0^{\mathrm{l}}) \;=\;\log \frac{p_{\theta}(x_0^{\mathrm{w}}\mid c)}{p_{\text{ref}}(x_0^{\mathrm{w}}\mid c)} \; -\;\log \frac{p_{\theta}(x_0^{\mathrm{l}}\mid c)}{p_{\text{ref}}(x_0^{\mathrm{l}}\mid c)}.\]So: pairwise preference optimization becomes supervised learning on relative log-likelihood.
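Given (approximate) log-likelihoods of both images under both models, the loss itself is a few lines. This is only a sketch of the template; obtaining those log-likelihoods for diffusion models is exactly the bottleneck discussed next:

```python
import math

def dpo_loss(logp_w, logref_w, logp_l, logref_l, beta=0.1):
    """DPO logistic loss for one preference pair:
    -log sigmoid(beta * [(logp_w - logref_w) - (logp_l - logref_l)])."""
    delta = (logp_w - logref_w) - (logp_l - logref_l)
    return -math.log(1.0 / (1.0 + math.exp(-beta * delta)))
```

At \(\Delta_\theta = 0\) the loss is \(\log 2\); it decreases as the model assigns relatively more likelihood to the winner than the reference does.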
The diffusion-specific bottleneck: "what is \(\log p_{\theta}(x_0\mid c)\)?" For diffusion / flow models, the marginal likelihood $\log p_{\theta}(x_0\mid c)$ is not directly tractable, because sampling runs through a long latent trajectory:
\[p_\theta(x_0\mid c)= \int p(x_T)\prod_{t=T}^{1}p_\theta(x_{t-1}\mid x_t,c)\,dx_{1:T}.\]DPO-style diffusion methods differ mainly in how they define or approximate the required log-probability ratios:
- Path-space likelihoods on $x_{0:T}$ (then upper-bound / sample a random step).
- Stepwise transition likelihoods $\log p_{\theta}(x_{t-1}\mid x_t,c)$ (Gaussian with known variance, mean predicted by the network).
- Score-based ratio surrogates that avoid likelihoods altogether (use $\nabla_x\log p_t(x)$).
20. Diffusion-DPO: Pairwise DPO on Diffusion Likelihoods
Diffusion-DPO10 adapts the DPO objective to diffusion models by making the log-probability ratio tractable at training time.
At the top level, the loss follows DPO:
\[\mathcal{L}_{\mathrm{DPO}}(\theta)=-\mathbb{E}_{(c,x_0^{\mathrm{w}},x_0^{\mathrm{l}})\sim\mathcal{D}} \Big[\log\sigma\big(\beta\log\tfrac{p_{\theta}(x_0^{\mathrm{w}}\mid c)}{p_{\mathrm{ref}}(x_0^{\mathrm{w}}\mid c)}-\beta\log\tfrac{p_{\theta}(x_0^{\mathrm{l}}\mid c)}{p_{\mathrm{ref}}(x_0^{\mathrm{l}}\mid c)}\big)\Big].\]This is the same structure as LLM-DPO, but the challenge is evaluating $\log p_{\theta}(x_0\mid c)$.
Diffusion-DPO formulates the KL-regularized objective on the reverse-path distribution $p_{\theta}(x_{0:T}\mid c)$ and derives a DPO-style loss on trajectory ratios (their Eqs. (11)–(12)). Since both models share the same prior $p(x_T)$, the path log-ratio decomposes exactly into stepwise terms:
\[\log\frac{p_{\theta}(x_{0:T}\mid c)}{p_{\mathrm{ref}}(x_{0:T}\mid c)} = \sum_{t=1}^T\,\log\frac{p_{\theta}(x_{t-1}\mid x_t,c)}{p_{\mathrm{ref}}(x_{t-1}\mid x_t,c)}.\]The key practical step is to estimate this sum with a single random timestep $t$ per update. Because each reverse transition is Gaussian
\[p_{\theta}(x_{t-1}\mid x_t,c)=\mathcal{N}\big(x_{t-1};\mu_{\theta}(x_t,t,c),\sigma_t^2 I\big),\]the log-ratio is computable via the predicted mean $\mu_{\theta}$. This yields a tractable DPO bound.
Even with stepwise likelihoods, naively we would still need samples from the reverse joint. Diffusion-DPO makes training efficient by drawing noisy states using the known forward noising:
\[x_t \sim q(x_t\mid x_0),\]and (optionally) the forward posterior for $x_{t-1}$. With this trick, a training iteration only needs:
1) a preference pair $(x_0^{\mathrm{w}},x_0^{\mathrm{l}})$,
2) one random timestep $t$,
3) one network evaluation to get $\mu_{\theta}(x_t,t,c)$ for each branch.
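Putting the three ingredients together on a 1-D toy, with the Gaussian log-ratio expressed (as in the paper's derivation) as a gap in noise-prediction errors; the functions `eps_theta`, `eps_ref` and the scalar setup are illustrative assumptions, and the derivation's constant factors are folded into `beta`:

```python
import math, random

def diffusion_dpo_step(x0_w, x0_l, t, eps_theta, eps_ref, alpha_bar, beta=0.1):
    """One toy Diffusion-DPO iteration in 1-D: noise both branches with the
    same forward q(x_t | x_0), then compare noise-prediction error gaps
    between the trained model and the reference."""
    a = alpha_bar[t]
    noise = random.gauss(0.0, 1.0)
    gaps = []
    for x0 in (x0_w, x0_l):
        xt = math.sqrt(a) * x0 + math.sqrt(1.0 - a) * noise   # forward q(x_t|x_0)
        gaps.append((eps_theta(xt, t) - noise) ** 2
                    - (eps_ref(xt, t) - noise) ** 2)
    # winner's error gap should shrink relative to the loser's
    delta = -(gaps[0] - gaps[1])
    return -math.log(1.0 / (1.0 + math.exp(-beta * delta)))  # -log sigmoid
```

When the trained and reference models coincide, both gaps vanish and the loss sits at \(\log 2\), its starting point before fine-tuning.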
21. D3PO: Pairwise DPO on Denoising Policies
D3PO13 targets the same end goal (learning from final-image preferences without a reward model) but starts from a different viewpoint: the denoising procedure is a multi-step MDP, and the diffusion model is a policy over next denoised states.
D3PO defines a mapping from diffusion to MDP (their Section 4.1):
- state $s_t = (c, t, x_{T-t})$
- action $a_t = x_{T-1-t}$
- policy $\pi_{\theta}(a_t\mid s_t)\equiv p_{\theta}(x_{T-1-t}\mid x_{T-t},c,t)$
and then applies DPO-style reasoning to the KL-regularized RL objective.
With a state-conditional KL penalty, their optimal policy satisfies
\[\pi^*(a\mid s)=\pi_{\mathrm{ref}}(a\mid s)\exp\Big(\tfrac{1}{\beta}Q^*(s,a)\Big),\]so
\[Q^*(s,a)=\beta\log\frac{\pi^*(a\mid s)}{\pi_{\mathrm{ref}}(a\mid s)}.\]This is the MDP analogue of the DPO reward-log-ratio identity. A direct application of pairwise DPO to a full denoising trajectory would require storing and backpropagating through long segments, which is prohibitively expensive. D3PO therefore first derives a DPO-style loss at a segment start state $s_k$, using only the policy probabilities at that step:
\[\mathcal{L}(\theta)=-\mathbb{E}\Big[\log\sigma\big(\beta\log\tfrac{\pi_{\theta}(a_k^{\mathrm{w}}\mid s_k^{\mathrm{w}})}{\pi_{\mathrm{ref}}(a_k^{\mathrm{w}}\mid s_k^{\mathrm{w}})}-\beta\log\tfrac{\pi_{\theta}(a_k^{\mathrm{l}}\mid s_k^{\mathrm{l}})}{\pi_{\mathrm{ref}}(a_k^{\mathrm{l}}\mid s_k^{\mathrm{l}})}\big)\Big].\]But using only $k=0$ "wastes" most of the trajectory. Their key assumption is: if the final image of a denoising segment is preferred, then every intermediate state-action pair in that segment is also preferable. So they construct $T$ sub-segments
\[\sigma_i = \{(s_i,a_i),(s_{i+1},a_{i+1}),\ldots,(s_{T-1},a_{T-1})\},\qquad 0\le i\le T-1,\]and apply the same preference label to each sub-segment (their Eq. (13)). In practice you can sample a subset of indices $i$ per update to control compute.
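This label propagation can be sketched as follows, where `step_loss(i)` is a hypothetical helper standing in for the per-step DPO-style loss above, evaluated at index \(i\) for both the winner and loser branches:

```python
import random

def d3po_batch_loss(T, step_loss, n_samples=4):
    """Average the per-step preference loss over randomly sampled
    sub-segment start indices; every sampled index inherits the
    trajectory-level preference label."""
    idx = random.sample(range(T), min(n_samples, T))
    return sum(step_loss(i) for i in idx) / len(idx)
```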
References
Nichol A, Dhariwal P, Ramesh A, et al. Glide: Towards photorealistic image generation and editing with text-guided diffusion models[J]. arXiv preprint arXiv:2112.10741, 2021. ↩
Prabhudesai M, Goyal A, Pathak D, et al. Aligning text-to-image diffusion models with reward backpropagation[J]. 2023. ↩
Clark K, Vicol P, Swersky K, et al. Directly Fine-Tuning Diffusion Models on Differentiable Rewards[J]. arXiv preprint arXiv:2309.17400, 2023. ↩
So O, Karrer B, Fan C, et al. Discrete Adjoint Matching[J]. arXiv preprint arXiv:2602.07132, 2026. ↩
Domingo-Enrich C, Drozdzal M, Karrer B, et al. Adjoint matching: Fine-tuning flow and diffusion generative models with memoryless stochastic optimal control[J]. arXiv preprint arXiv:2409.08861, 2024. ↩
Black K, Janner M, Du Y, et al. Training Diffusion Models with Reinforcement Learning[J]. arXiv preprint arXiv:2305.13301, 2023. ↩
Fan Y, Watkins O, Du Y, et al. DPOK: Reinforcement Learning for Fine-tuning Text-to-Image Diffusion Models[J]. arXiv preprint arXiv:2305.16381, 2023. ↩
Liu J, Liu G, Liang J, et al. Flow-grpo: Training flow matching models via online rl[J]. arXiv preprint arXiv:2505.05470, 2025. ↩
Deng H, Yan K, Mao C, et al. Densegrpo: From sparse to dense reward for flow matching model alignment[J]. arXiv preprint arXiv:2601.20218, 2026. ↩
Wallace B, Dang M, Rafailov R, et al. Diffusion Model Alignment Using Direct Preference Optimization[C]. CVPR, 2024. (arXiv:2311.12908) ↩ ↩2
Li S, Kallidromitis K, Gokul A, et al. Aligning diffusion models by optimizing human utility[J]. Advances in Neural Information Processing Systems, 2024, 37: 24897-24925. ↩
Kim M, Lee Y, Kang S, et al. Preference alignment with flow matching[J]. Advances in Neural Information Processing Systems, 2024, 37: 35140-35164. ↩
Yang K, Tao J, Lyu J, et al. Using Human Feedback to Fine-tune Diffusion Models without Any Reward Model[C]. CVPR, 2024. (arXiv:2311.13231) ↩
