Embedding Space Design For Diffusion and Flow Architecture


Embedding design is a core component of modern diffusion, flow matching, and flow-map generative models. Beyond backbone architecture, model performance strongly depends on how noisy states, temporal variables, conditions, and auxiliary metadata are represented and injected into the network. This article presents a unified view of embedding space design through the abstraction \(f_{\theta}(x_t, t, s, c, m)\), and systematically reviews the major embedding families used in modern generative systems, including spatial position, time/noise, class label, text, image, and metadata embeddings. By organizing these techniques under a common representation-centric framework, we aim to clarify their roles in controllability, stability, and generalization across diverse generative architectures.


1. Embedding in Diffusion Generative Models

A convenient unified abstraction for diffusion models, flow matching, and flow-map-based generative models is:

\[f_{\theta}(x_t, t, s, c, m)\]

This notation represents the neural predictor / backbone model used in generation, where

  • $\theta$ represents model parameters;
  • $x_t$ represents the model input state at current time $t$;
  • $s$ represents an optional target time toward which the model predicts or bridges; this is especially important in two-time operator methods (flow maps);
  • $c$ represents an optional conditioning input that guides generation;
  • $m$ represents metadata / auxiliary inputs.

In modern diffusion / flow / flow-map generative models, all inputs \((x_t,t,s,c,m)\) are first transformed into suitable internal representations (embeddings, tokens, or feature maps), and the backbone (UNet or DiT) operates on those representations. We use this unified representation to discuss the design of the embedding space.
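As a schematic only (the module, layer choices, and dimensions here are illustrative, not taken from any particular paper), the unified signature can be read as: each input is embedded into a shared width and combined before the backbone operates on the result:

```python
import torch
import torch.nn as nn

class UnifiedPredictor(nn.Module):
    """Schematic f_theta(x_t, t, s, c, m): every input is embedded first."""
    def __init__(self, dim=32):
        super().__init__()
        self.x_proj = nn.Linear(4, dim)    # patchified state -> tokens
        self.t_emb = nn.Linear(1, dim)     # current time t
        self.s_emb = nn.Linear(1, dim)     # optional target time s (flow maps)
        self.c_emb = nn.Linear(8, dim)     # optional condition c
        self.m_emb = nn.Linear(2, dim)     # optional metadata m
        self.backbone = nn.Linear(dim, 4)  # stand-in for a UNet/DiT

    def forward(self, x_t, t, s=None, c=None, m=None):
        h = self.x_proj(x_t) + self.t_emb(t)           # broadcast over tokens
        for opt, emb in ((s, self.s_emb), (c, self.c_emb), (m, self.m_emb)):
            if opt is not None:
                h = h + emb(opt)
        return self.backbone(h)
```

Real systems use far richer embeddings for each input (the subject of the remaining sections); the point here is only the shared "embed, then combine, then run the backbone" pattern.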


2. Spatial Position Embeddings and Injection

In diffusion and flow matching generative models, the forward process gradually corrupts a data sample $x_0$ into noise, and a neural network $\epsilon_\theta(x_t, t)$ or a velocity predictor $v_\theta(x_t, t)$ learns to reverse this process. When the denoising backbone is a Transformer — as is increasingly dominant since DiT 1 — the model must process the noisy input $x_t$ as a sequence of patch tokens. Unlike convolutional U-Nets 2, which maintain spatial structure through their architecture’s inductive biases (local receptive fields, spatial downsampling/upsampling paths), a vanilla Transformer is permutation-equivariant: without explicit position information, it cannot distinguish a top-left patch from a bottom-right one. The output would be invariant to any reordering of the input tokens, making it impossible to generate spatially coherent images or videos.

Spatial position embeddings resolve this by injecting positional identity into each token. Their design has far-reaching implications:

  1. Spatial coherence: Position embeddings enable the model to learn that nearby patches should be locally consistent and that global structure (e.g., an object in the center, sky at the top) follows spatial conventions.

  2. Multi-resolution and multi-aspect-ratio generation: Modern text-to-image and text-to-video systems (e.g., SD3 3, FLUX 4, HunyuanDiT 5) must generate outputs at arbitrary resolutions and aspect ratios. The position embedding scheme directly determines how well the model generalizes across these settings.

  3. Length extrapolation: At inference time, a model may be asked to generate images at resolutions never seen during training. The position embedding’s extrapolation behavior — whether it degrades gracefully or catastrophically — is a critical factor.

  4. Video generation: For video diffusion models, spatial position embeddings extend to the temporal dimension, requiring 3D position encoding schemes that jointly capture spatial layout and temporal ordering 6 7.

It is worth noting the distinction from timestep embeddings: the diffusion timestep $t$ encodes where in the denoising process the model is operating, typically injected via adaptive normalization (AdaLN) or similar mechanisms. Spatial position embeddings, by contrast, encode where in the spatial (or spatiotemporal) grid each token resides. This chapter focuses exclusively on the latter.


2.1 Representations of Position Embeddings: 1D, 2D, and 3D

The first design choice is how to represent spatial positions — specifically, the coordinate system and dimensionality used to index each token.


2.1.1 1D Position Embeddings

The simplest approach treats the 2D grid of patches as a flattened 1D sequence in raster-scan (row-major) order. For an image patchified into an $H' \times W'$ grid (where $H' = H/p$, $W' = W/p$ for patch size $p$), each patch is assigned an integer index:

\[\mathrm{pos}(h, w) = h \cdot W' + w\]

where

\[h \in \{0, \ldots, H' - 1\}\quad \text{and} \quad w \in \{0, \ldots, W' - 1\}.\]

A position embedding $\mathbf{PE} \in \mathbb{R}^{(H' \cdot W') \times d}$ maps each index to a $d$-dimensional vector.
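A minimal sketch of the raster-scan index (the helper name is ours). Note that the last patch of one row and the first of the next receive adjacent 1D indices despite sitting at opposite ends of the image:

```python
def raster_pos(h, w, w_grid):
    """Row-major (raster-scan) index of patch (h, w) in an H' x W' grid."""
    return h * w_grid + w

# On a 16x16 grid: (0, 15) -> index 15 and (1, 0) -> index 16 are adjacent
# in 1D, yet spatially far apart; vertical neighbors differ by W' = 16.
```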

Usage in practice. DiT 1, SiT 8, U-ViT 9, MDT 10, and PixArt-$\alpha$ 11 all adopt 1D position embeddings, following the standard Vision Transformer (ViT) 12 convention.

Advantages:

  • Simplicity: directly reuses the extensive ViT and NLP Transformer codebase.
  • Compatibility with standard 1D self-attention implementations.

Disadvantages:

  • Breaks 2D locality: Vertically adjacent patches (e.g., $(h, w)$ and $(h + 1, w)$) receive 1D indices $W'$ apart, while the last patch of one row and the first of the next (e.g., $(h, W' - 1)$ and $(h + 1, 0)$) are adjacent in 1D index yet lie at opposite ends of the image. The model must learn to undo this artificial 1D ordering.
  • Aspect ratio sensitivity: A $32 \times 64$ grid and a $64 \times 32$ grid have the same number of tokens but completely different 1D-to-2D mappings, making generalization across aspect ratios harder with fixed learned 1D embeddings.
  • Resolution coupling: The total sequence length $H' \times W'$ changes with resolution, requiring interpolation or other adaptation for learned 1D embeddings.

In practice, deep Transformers can learn to recover 2D structure from 1D indices (as demonstrated by ViT’s strong performance), but this places an additional burden on the model and may require more data or capacity.


2.1.2 2D Position Embeddings

A more natural representation assigns each patch a 2D coordinate $(h, w)$ and constructs the position embedding as a function of both axes independently.

Factorized (axis-decomposed) representation. The most common approach computes separate embeddings along each axis and combines them:

\[\mathbf{PE}(h, w) = [\mathbf{PE}^{(H)}(h);\; \mathbf{PE}^{(W)}(w)] \in \mathbb{R}^d\]

where $\mathbf{PE}^{(H)}(h) \in \mathbb{R}^{d/2}$ encodes the row index and $\mathbf{PE}^{(W)}(w) \in \mathbb{R}^{d/2}$ encodes the column index, and $[\,;\,]$ denotes concatenation. Alternatively, the two components can be summed:

\[\mathbf{PE}(h, w) = \mathbf{PE}^{(H)}(h) + \mathbf{PE}^{(W)}(w)\]

Concatenation is more common in recent work because it avoids interference between the two axes.

Usage in practice. SD3/MMDiT 3 employs 2D sinusoidal position embeddings. NaViT 13 uses factorized learned 2D position embeddings. HDiT 14 uses axial position embeddings (a variant of factorized 2D). FiT 15, FLUX 4, and HunyuanDiT 5 use 2D RoPE, which inherently operates in 2D coordinates.

Advantages:

  • Preserves spatial topology: Patches that are close in 2D space receive similar position encodings.
  • Axis independence: Each axis can be handled independently, making it straightforward to adapt to different aspect ratios — changing the width does not affect height position encodings.
  • Better generalization: When resolution or aspect ratio changes, only the range of indices per axis changes, not the fundamental encoding scheme.

Disadvantages:

  • Slightly more complex implementation than 1D.
  • For learned embeddings, two separate embedding tables are needed, though the total parameter count is actually lower:

    \[H'_{\max}\cdot d/2 + W'_{\max}\cdot d/2 \quad \text{vs.}\quad H'_{\max}\cdot W'_{\max}\cdot d\]

2.1.3 3D Position Embeddings

For video generation, the input $x_t$ consists of a sequence of frames, each patchified into a spatial grid. Each token is indexed by a 3D coordinate $(t, h, w)$ — temporal index, row, and column.

Factorized 3D representation:

\[\mathbf{PE}(t, h, w) = [\mathbf{PE}^{(T)}(t);\; \mathbf{PE}^{(H)}(h);\; \mathbf{PE}^{(W)}(w)] \in \mathbb{R}^d\]

where the embedding dimension is split across the three axes:

\[d = d_t + d_h + d_w\]

Dimension allocation. The partition is not necessarily uniform. Since spatial dimensions typically carry more information than the temporal dimension (images have richer spatial structure than frame-to-frame variation), some models allocate more dimensions to the spatial axes. For example, a model with $d = 3072$ might use $d_t = 768$, $d_h = 1152$, $d_w = 1152$, or simply $d_t = d_h = d_w = 1024$.

Usage in practice. CogVideoX 6 and HunyuanVideo 7 use 3D RoPE with factorized dimension splits. Latte 16 uses separate learned spatial and temporal position embeddings, adding spatial PE in spatial attention blocks and temporal PE in temporal attention blocks.

Temporal position considerations. Video models often treat the temporal axis differently from spatial axes because:

  • The number of frames $(T)$ is typically much smaller than the spatial grid size $(H' \times W')$.
  • Temporal relationships may benefit from different frequency scales.
  • Some models use temporal factorization (alternating spatial-only and temporal-only attention), where spatial and temporal PEs are applied independently 16.

2.1.4 Summary Comparison

| Representation | Models Using It | Multi-Aspect-Ratio | Resolution Generalization | Implementation Complexity |
|---|---|---|---|---|
| 1D (raster-scan) | DiT, SiT, U-ViT, PixArt-$\alpha$ | Requires relearning or interpolation | Poor without interpolation | Lowest |
| 2D (factorized) | SD3, FLUX, FiT, HunyuanDiT, NaViT | Natural per-axis adaptation | Good (each axis independent) | Moderate |
| 3D (factorized) | CogVideoX, HunyuanVideo, Latte | Natural per-axis adaptation | Good | Moderate |

2.2 Position Embedding Algorithms

Given a positional coordinate (whether 1D, 2D, or 3D), the next design choice is the algorithm used to map coordinates to embedding vectors. We survey the main approaches used in diffusion and flow matching models.


2.2.1 Sinusoidal (Fourier) Position Embeddings

Introduced in the original Transformer, sinusoidal position embeddings encode each position $m$ using a set of sinusoidal functions at different frequencies:

\[\mathrm{PE}(m, 2i) = \sin\left(\frac{m}{\tau^{2i/d}}\right), \qquad \mathrm{PE}(m, 2i+1) = \cos\left(\frac{m}{\tau^{2i/d}}\right)\]

where \(i \in \{0, 1, \ldots, d/2 - 1\}\) indexes the dimension pairs and $\tau$ (typically $10000$) is a base frequency hyperparameter.

Geometric frequency spacing. The frequencies $\omega_i = 1/\tau^{2i/d}$ span a wide range: the lowest frequency has a very long wavelength (capturing global position), while the highest frequency oscillates rapidly (capturing fine-grained local position). This multi-scale encoding is analogous to Fourier features and enables the model to distinguish positions at multiple granularities.

2D extension. For 2D images, the standard approach (used in SD3 3, PixArt-$\Sigma$ 17) is to compute independent sinusoidal embeddings for each axis and concatenate:

\[\mathbf{PE}_{2D}(h, w) = [\mathrm{SinEmb}(h, d/2);\; \mathrm{SinEmb}(w, d/2)]\]

where $\mathrm{SinEmb}(m, d')$ is the $d'$-dimensional sinusoidal embedding of integer position $m$.
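A minimal sketch of this construction (function names are ours; we place all sines before all cosines, whereas some implementations interleave the two — the choice only permutes dimensions):

```python
import torch

def sin_emb(pos, dim, tau=10000.0):
    """d'-dimensional sinusoidal embedding of (possibly fractional) positions."""
    pos = torch.as_tensor(pos, dtype=torch.float32).reshape(-1, 1)   # [N, 1]
    i = torch.arange(dim // 2, dtype=torch.float32)                  # [d'/2]
    freqs = 1.0 / tau ** (2 * i / dim)                               # omega_i
    ang = pos * freqs                                                # [N, d'/2]
    return torch.cat([torch.sin(ang), torch.cos(ang)], dim=-1)       # [N, d']

def sin_emb_2d(hs, ws, dim):
    """Factorized 2D PE: PE(h, w) = [SinEmb(h, d/2); SinEmb(w, d/2)]."""
    return torch.cat([sin_emb(hs, dim // 2), sin_emb(ws, dim // 2)], dim=-1)
```

Because the map is a continuous function of position, it can be evaluated at fractional coordinates, which is what makes interpolation-style resolution adaptation possible.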

Properties:

  • Deterministic and parameter-free: No learnable parameters.
  • Theoretically supports arbitrary positions: Since the embedding is a continuous function, it can be evaluated at any (even non-integer) position, which is useful for interpolation.
  • Bounded: Each dimension lies in $[-1, 1]$, and each $(\sin, \cos)$ pair lies on the unit circle, so the embedding norm is constant across positions.
  • Relative position structure: The dot product $\mathbf{PE}(m)^T \mathbf{PE}(n)$ is a function of $m - n$, providing implicit relative position information.

Limitations:

  • Fixed encoding without adaptation to the data distribution.
  • Extrapolation to positions significantly beyond the training range can cause the model to encounter unseen frequency patterns.

2.2.2 Learned (Absolute) Position Embeddings

Instead of using a fixed function, learned position embeddings maintain a lookup table $\mathbf{E} \in \mathbb{R}^{N_{\max} \times d}$ where $N_{\max}$ is the maximum number of positions, and each row is learned via backpropagation.

Usage in practice. DiT 1, U-ViT 9, PixArt-$\alpha$ 11, and MDT 10 all use learned absolute position embeddings, following ViT 12.

Properties:

  • Data-adaptive: Can learn position patterns specific to the training distribution.
  • Simple implementation: Just an nn.Embedding lookup.

Limitations:

  • Fixed vocabulary: Cannot represent positions beyond $N_{\max}$ without interpolation or retraining.
  • No structural prior: The relationship between nearby positions must be learned entirely from data — there is no built-in notion that position 5 is “close to” position 6.
  • Resolution changes require interpolation: Moving to a different resolution requires resizing the embedding table, typically via bilinear or bicubic interpolation of 2D-reshaped embeddings.

2D factorized learned embeddings. NaViT 13 uses separate learned embedding tables for height and width:

\[\mathbf{E}^{(H)} \in \mathbb{R}^{H'_{\max} \times d/2} \qquad \text{and} \qquad \mathbf{E}^{(W)} \in \mathbb{R}^{W'_{\max} \times d/2}\]

This is much more parameter-efficient \((O(H'_{\max} + W'_{\max})\ \text{vs.}\ O(H'_{\max}\cdot W'_{\max}))\) and naturally supports varying aspect ratios.
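A sketch of such a factorized table (hypothetical class, not NaViT's actual code); the parameter count grows with $H'_{\max} + W'_{\max}$ rather than their product:

```python
import torch
import torch.nn as nn

class FactorizedLearnedPE(nn.Module):
    """NaViT-style factorized learned PE: separate H and W tables, concatenated."""
    def __init__(self, h_max, w_max, dim):
        super().__init__()
        self.row = nn.Embedding(h_max, dim // 2)   # E^(H)
        self.col = nn.Embedding(w_max, dim // 2)   # E^(W)

    def forward(self, h_idx, w_idx):
        # h_idx, w_idx: integer tensors of per-token row/column coordinates
        return torch.cat([self.row(h_idx), self.col(w_idx)], dim=-1)
```

Changing the aspect ratio only changes which rows of each table are indexed, so no table resizing is needed as long as the grid stays within $(H'_{\max}, W'_{\max})$.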


2.2.3 Rotary Position Embeddings (RoPE)

Rotary Position Embedding (RoPE) [^rope] has become the dominant position encoding in recent large-scale diffusion Transformers. Unlike additive approaches, RoPE encodes position by rotating query and key vectors in attention, making the dot-product attention inherently dependent on relative positions.

1D RoPE. For a vector $\mathbf{x} \in \mathbb{R}^d$ at position $m$, RoPE applies a block-diagonal rotation:

\[f(\mathbf{x}, m) = \mathbf{R}_m \mathbf{x}\]

where $\mathbf{R}_m$ is constructed from $d/2$ rotation matrices applied to consecutive pairs of dimensions:

\[\small \begin{pmatrix} \cos(m\theta_1) & -\sin(m\theta_1) & 0 & 0 & \cdots & 0 & 0 \\ \sin(m\theta_1) & \cos(m\theta_1) & 0 & 0 & \cdots & 0 & 0 \\ 0 & 0 & \cos(m\theta_2) & -\sin(m\theta_2) & \cdots & 0 & 0 \\ 0 & 0 & \sin(m\theta_2) & \cos(m\theta_2) & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & 0 & \cdots & \cos(m\theta_{d/2}) & -\sin(m\theta_{d/2}) \\ 0 & 0 & 0 & 0 & \cdots & \sin(m\theta_{d/2}) & \cos(m\theta_{d/2}) \end{pmatrix}\]

with frequencies

\[\theta_i = \tau^{-2(i-1)/d}\]

typically $\tau = 10000$.

Key property (relative position). When computing attention between query at position $m$ and key at position $n$:

\[\langle f(\mathbf{q}, m), f(\mathbf{k}, n)\rangle = \mathbf{q}^T \mathbf{R}_m^T \mathbf{R}_n \mathbf{k} = \mathbf{q}^T \mathbf{R}_{n-m} \mathbf{k}\]

Since $\mathbf{R}_m$ is an orthogonal matrix, the product $\mathbf{R}_m^T \mathbf{R}_n = \mathbf{R}_{n-m}$ depends only on the relative displacement. This means the attention logit between two tokens is a function of their relative position, not absolute positions.
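A minimal 1D RoPE sketch (the helper name is ours) implementing the pairwise rotation above; the relative-position identity can be checked numerically by shifting both positions by the same offset:

```python
import torch

def rope_rotate(x, m, tau=10000.0):
    """Rotate consecutive dimension pairs (2i, 2i+1) of x by angles m * theta_i."""
    d = x.shape[-1]
    i = torch.arange(d // 2, dtype=torch.float32)
    theta = tau ** (-2 * i / d)                 # theta_i
    ang = m * theta
    cos, sin = torch.cos(ang), torch.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]         # the (2i, 2i+1) pairs
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```

Because each block is a rotation, norms are preserved, and $\langle f(\mathbf{q}, m), f(\mathbf{k}, n)\rangle$ is unchanged when both $m$ and $n$ are shifted by the same amount.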

2D RoPE. For images, the head dimension $d_h$ is split into two halves, one for each spatial axis:

\[f_{2D}(\mathbf{x}, h, w) = \left[ \mathbf{R}^{(H)}_h \mathbf{x}_{[1:d_h/2]}; \; \mathbf{R}^{(W)}_w \mathbf{x}_{[d_h/2+1:d_h]} \right]\]

Each half receives rotations based on the corresponding axis position. The frequencies can be shared or independent across axes.
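A sketch of the axis split (helper names are ours; whether frequencies are shared or independent across axes varies by model — here they are shared):

```python
import torch

def _rotate(x, pos, tau=10000.0):
    """1D RoPE on consecutive dimension pairs at (possibly fractional) pos."""
    d = x.shape[-1]
    theta = tau ** (-2 * torch.arange(d // 2, dtype=torch.float32) / d)
    ang = pos * theta
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * torch.cos(ang) - x2 * torch.sin(ang)
    out[..., 1::2] = x1 * torch.sin(ang) + x2 * torch.cos(ang)
    return out

def rope_2d(x, h, w):
    """2D RoPE: first half of the head dim rotated by row h, second by column w."""
    d2 = x.shape[-1] // 2
    return torch.cat([_rotate(x[..., :d2], h), _rotate(x[..., d2:], w)], dim=-1)
```

The relative-position property then holds per axis: attention logits depend only on $(h_i - h_j, w_i - w_j)$.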

3D RoPE for video. The head dimension is split into three parts:

\[f_{3D}(\mathbf{x}, t, h, w) = \left[ \mathbf{R}^{(T)}_t \mathbf{x}^{(T)}; \; \mathbf{R}^{(H)}_h \mathbf{x}^{(H)}; \; \mathbf{R}^{(W)}_w \mathbf{x}^{(W)} \right]\]

This is used by CogVideoX 6 and HunyuanVideo 7.

Usage in diffusion models. RoPE has been adopted by FLUX 4, HunyuanDiT 5, FiT 15, Lumina-T2X 18, CogVideoX 6, and HunyuanVideo 7. Its rise in diffusion models mirrors its dominance in large language models (LLaMA, etc.).

Advantages of RoPE for diffusion models:

  1. Persistent position information. RoPE is applied at every attention layer, not just once at the input. This ensures position information is refreshed throughout the network depth, unlike additive PE which may fade through residual connections and normalization.

  2. Relative position encoding. The attention pattern naturally reflects spatial proximity, which aligns well with the local coherence needed in image generation.

  3. Resolution flexibility. RoPE is a function of position indices, not a fixed-size lookup table. Any sequence length can be accommodated without resizing or interpolation of parameters.

  4. Parameter-free. No additional learned parameters (though some variants learn the frequencies).

Limitations:

  • Applied only within attention (Q/K rotations); the FFN layers do not receive explicit position information.
  • Extrapolation to much longer sequences than training can degrade quality, though less severely than absolute PE.

2.2.4 Attention Bias Methods

An alternative family of methods encodes position as a bias added to attention logits, rather than modifying the token embeddings or Q/K vectors.

ALiBi (Attention with Linear Biases) 19 adds a position-dependent penalty to attention scores:

\[\mathrm{Attention}_{ij} = \frac{\mathbf{q}_i^T \mathbf{k}_j}{\sqrt{d_k}} - \lambda \cdot |i - j|\]

where $\lambda$ is a head-specific slope. This linearly penalizes attention to distant tokens, encouraging local attention patterns. Different attention heads use different slopes, providing multi-scale locality.
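A sketch of the bias construction (the helper is ours; the symmetric $|i - j|$ form shown here suits bidirectional vision attention, whereas the original ALiBi is causal, and the geometric slope schedule follows the paper's recipe for power-of-two head counts):

```python
import torch

def alibi_bias(n, num_heads):
    """ALiBi bias matrix: -lambda_h * |i - j| per head, shape [heads, n, n]."""
    # Geometrically spaced slopes: lambda_h = 2^(-8h / num_heads), h = 1..H.
    slopes = torch.tensor(
        [2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)]
    )
    idx = torch.arange(n)
    dist = (idx[None, :] - idx[:, None]).abs().float()   # |i - j|
    return -slopes[:, None, None] * dist                  # added to q k^T / sqrt(d)
```

Heads with large slopes attend almost only locally; heads with small slopes retain near-global attention, giving the multi-scale locality described above.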

Relative position biases. Models like Swin Transformer use learned relative position bias tables $B_{h_i-h_j, w_i-w_j}$ added to attention logits. This can be extended to diffusion models, though it is less common in practice compared to RoPE.

Usage in diffusion models. Attention bias methods are less prevalent in diffusion Transformers compared to RoPE or sinusoidal embeddings. However, some architectures incorporate attention biases as auxiliary position signals, particularly when combining local and global attention patterns.


2.2.5 Convolutional Position Encoding (CPE)

Convolutional position encoding provides implicit position information through depth-wise convolutions applied to the token sequence after reshaping it to a 2D grid:

\[\mathbf{h}' = \mathbf{h} + \mathrm{DWConv}_{k \times k}(\mathrm{Reshape}_{H' \times W'}(\mathbf{h}))\]

Since convolutions are spatially local and weight-sharing, they can implicitly encode position through boundary effects and local neighborhood structure.
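A minimal sketch of this residual depth-wise convolution (the module name is ours):

```python
import torch
import torch.nn as nn

class CPE(nn.Module):
    """Convolutional position encoding: h' = h + DWConv_kxk(reshape(h))."""
    def __init__(self, dim, k=3):
        super().__init__()
        # groups=dim makes the convolution depth-wise (one filter per channel)
        self.dwconv = nn.Conv2d(dim, dim, k, padding=k // 2, groups=dim)

    def forward(self, tokens, hw):
        H, W = hw
        B, N, D = tokens.shape
        x = tokens.transpose(1, 2).reshape(B, D, H, W)   # tokens -> feature map
        return tokens + self.dwconv(x).flatten(2).transpose(1, 2)
```

The same module handles any $(H', W')$ grid, which is exactly the resolution-agnostic property noted below.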

Properties:

  • Resolution-agnostic: Convolutional kernels work at any spatial resolution without modification.
  • Learnable: The convolution weights are trained end-to-end.
  • Complementary: CPE can be used alongside explicit position embeddings to provide additional position information in the FFN pathway.

Some DiT variants and hybrid architectures incorporate CPE as a supplement to RoPE or sinusoidal PE, particularly to provide position information in parts of the network (e.g., feed-forward layers) where RoPE does not operate.


2.2.6 Algorithm Comparison

| Algorithm | Type | Parameters | Resolution Flexibility | Position Type | Applied At |
|---|---|---|---|---|---|
| Sinusoidal | Fixed function | 0 | Moderate (functional) | Absolute | Input only |
| Learned | Lookup table | $N_{\max} \times d$ | Poor (needs interpolation) | Absolute | Input only |
| RoPE | Rotation | 0 (or learned $\theta$) | High (functional) | Relative | Every attention layer |
| ALiBi | Bias | 0 (or $H$ slopes) | High | Relative | Every attention layer |
| CPE | Convolution | $k^2 \cdot d$ | High (conv kernel) | Implicit | Selectable layers |

2.3 Injection Mechanisms: How Position Embeddings Enter the Model

Beyond the choice of PE algorithm, the injection mechanism — how and where position information is introduced into the Transformer — significantly affects model behavior.


2.3.1 Additive Injection at Input

The most straightforward approach adds position embeddings to the patch token embeddings once, before the first Transformer block:

\[\mathbf{z}_i^{(0)} = \mathrm{PatchEmbed}(x_t^{(i)}) + \mathbf{PE}_i\]

where $\mathrm{PatchEmbed}(\cdot)$ is the linear projection (or convolution) that maps each image patch to a $d$-dimensional token, and $\mathbf{PE}_i$ is the position embedding for the $i$-th token (from sinusoidal or learned encoding).

Used by: DiT 1, U-ViT 9, PixArt-$\alpha$ 11, MDT 10, SD3 3 (sinusoidal 2D).

Characteristics:

  • Position information is injected once and must survive through all $L$ Transformer layers via residual connections.
  • In deep models, the position signal can become attenuated as it is mixed with content features through layer normalization and non-linearities.
  • All components of the Transformer (attention Q/K/V, FFN) can access position information.
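A minimal sketch of this input-level injection with a learned table (hypothetical module; shapes are illustrative):

```python
import torch
import torch.nn as nn

class PatchEmbedWithPE(nn.Module):
    """Additive injection: tokens = PatchEmbed(x_t) + PE, applied once at input."""
    def __init__(self, in_ch, patch, dim, n_tokens):
        super().__init__()
        # Strided conv = linear projection of non-overlapping patches
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)
        self.pos = nn.Parameter(torch.zeros(1, n_tokens, dim))  # learned PE

    def forward(self, x):
        tokens = self.proj(x).flatten(2).transpose(1, 2)   # [B, N, D]
        return tokens + self.pos
```

After this single addition, position must survive the full stack of residual blocks unchanged — the depth-decay issue noted above.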

2.3.2 Rotary Injection within Attention (RoPE)

RoPE takes a fundamentally different approach: position is injected inside every attention layer by rotating the query and key vectors before computing dot-product attention:

\[\mathbf{q}_i^{(l)} = \mathbf{R}_{m_i} \cdot W_Q^{(l)} \mathbf{h}_i^{(l)}, \qquad \mathbf{k}_j^{(l)} = \mathbf{R}_{m_j} \cdot W_K^{(l)} \mathbf{h}_j^{(l)}\] \[\mathrm{Attn}^{(l)}(\mathbf{q}, \mathbf{k}, \mathbf{v})= \mathrm{softmax}\left(\frac{(\mathbf{q}^{(l)})^T \mathbf{k}^{(l)}}{\sqrt{d_k}}\right)\mathbf{v}^{(l)}\]

Note that the value vectors $\mathbf{v}$ are not rotated. This means position information influences which tokens attend to each other (via Q/K) but not what information is aggregated (via V).

Used by: FLUX 4, HunyuanDiT 5, FiT 15, Lumina-T2X 18, CogVideoX 6, HunyuanVideo 7.

Characteristics:

  • Position information is refreshed at every layer, preventing degradation with depth.
  • FFN layers do not receive explicit position information (they operate on position-unaware representations from the attention output).
  • The value vectors are position-free, meaning the content aggregated through attention is independent of absolute position — only the attention routing is position-aware.

2.3.3 Attention Bias Injection

Position information is injected as an additive bias to the attention logit matrix:

\[\mathrm{Attn}(\mathbf{q}, \mathbf{k}, \mathbf{v}) = \mathrm{softmax}\left(\frac{\mathbf{q}\mathbf{k}^T}{\sqrt{d_k}} + \mathbf{B}\right)\mathbf{v}\]

where $\mathbf{B} \in \mathbb{R}^{N \times N}$ encodes pairwise position relationships (e.g., relative distance).

Used by: ALiBi 19, relative position bias in Swin Transformer and its diffusion adaptations.

Characteristics:

  • Directly modulates attention patterns — nearby tokens can be given higher (or lower) attention weights.
  • Applied at every layer (like RoPE).
  • The bias can encode complex spatial relationships (e.g., different biases for horizontal vs. vertical neighbors).
  • Requires computing or storing the full $N \times N$ bias matrix, which can be memory-intensive for high-resolution images.

2.3.4 Hybrid Injection

Many state-of-the-art models combine multiple injection mechanisms:

  • RoPE + CPE: Use RoPE in attention for position-aware routing, and CPE (depth-wise convolution) in or between FFN layers to provide position information to the feed-forward pathway. This addresses RoPE’s limitation of not providing position info to FFN layers.

  • Additive PE + RoPE: Some models add sinusoidal PE at the input (to provide global position information to all components) and additionally apply RoPE in attention layers (to provide layer-wise relative position). However, this combination is less common, as RoPE alone often suffices.

  • Additive PE + Attention Bias: Combining input-level additive PE with per-layer attention biases.

The trend in 2024–2025 has been toward RoPE as the primary (and often sole) position encoding, sometimes supplemented by CPE for multi-resolution robustness.


2.3.5 Summary: Input-Level vs. Layer-Level Injection

| Mechanism | Where Applied | Frequency | Position Info in FFN? | Depth Decay? |
|---|---|---|---|---|
| Additive (input) | Before layer 1 | Once | Yes (via residual) | Yes |
| RoPE (Q/K rotation) | Every attention layer | Per layer | No (only in attention) | No |
| Attention bias | Every attention layer | Per layer | No | No |
| CPE (convolution) | Selected layers | Configurable | Yes | Configurable |


3. Position Embeddings for Multi-Resolution and Multi-Aspect-Ratio Generation

Real-world applications demand generating images and videos at diverse resolutions (e.g., 512 × 512, 768 × 1024, 1920 × 1080) and aspect ratios (1:1, 4:3, 16:9, 9:16). This section surveys how position embeddings adapt to these requirements — one of the most practically consequential design decisions.


3.1 The Core Challenge

Consider a model trained on 256 × 256 images with patch size \(p = 16\), yielding a 16 × 16 patch grid (256 tokens). At inference, we may want to generate 512 × 768 images, yielding a 32 × 48 grid (1536 tokens). Several problems arise:

  1. Position range: The position indices now extend to (31, 47) — far beyond the training range of (15, 15).
  2. Sequence length: The attention mechanism must handle 6× more tokens.
  3. Aspect ratio: The spatial proportions have changed from 1:1 to 2:3.

The position embedding must either (a) extrapolate gracefully or (b) be fundamentally resolution-invariant.


3.2 Interpolation of Absolute Position Embeddings

When using learned absolute PE — and, in some pipelines, absolute spatial grids that are later frequency-embedded — a common solution is spatial interpolation of the position embedding grid. In the broader ViT family, changing image resolution while keeping patch size fixed changes the token grid, so one must resize the position grid accordingly; this is exactly the reason interpolation became standard practice in vision Transformers.

Bicubic interpolation (ViT-style). If the model was trained with a 2D grid of position embeddings

\[\mathbf{PE} \in \mathbb{R}^{H'_{\text{train}} \times W'_{\text{train}} \times d},\]

we can resize it to

\[\mathbf{PE}' \in \mathbb{R}^{H'_{\text{new}} \times W'_{\text{new}} \times d}\]

by interpolation along the spatial dimensions. A convenient formalization is

\[\mathbf{PE}'(u,v,:) = \sum_{i=0}^{H'_{\text{train}}-1} \sum_{j=0}^{W'_{\text{train}}-1} \kappa\!\left(\frac{u}{s_h}-i\right) \kappa\!\left(\frac{v}{s_w}-j\right) \mathbf{PE}(i,j,:),\]

where

\[s_h=\frac{H'_{\text{new}}}{H'_{\text{train}}}, \qquad s_w=\frac{W'_{\text{new}}}{W'_{\text{train}}},\]

and \(\kappa(\cdot)\) is the interpolation kernel (in practice often bilinear or bicubic). When the original PE is stored as a flattened 1D table, one first reshapes it to a 2D grid, interpolates, and then flattens it back. This is the standard absolute-PE adaptation pattern behind early DiT-style resolution transfer.

If the original table is stored as

\[\mathbf{E}_{\text{flat}} \in \mathbb{R}^{(H'_{\text{train}}W'_{\text{train}})\times d},\]

then

\[\mathbf{PE}(i,j,:) = \mathbf{E}_{\text{flat}}[\,iW'_{\text{train}}+j\,,\,:\,],\]

and after interpolation the resized table is reflattened as

\[\mathbf{E}'_{\text{flat}}[\,uW'_{\text{new}}+v\,,\,:\,] = \mathbf{PE}'(u,v,:).\]

This approach is used by early DiT-based systems when adapting to new resolutions:

  • Train DiT at 256 × 256 (learned 1D PE for 16 × 16 = 256 tokens).
  • To generate at 512 × 512, reshape the 1D PE to 16 × 16 × \(d\), interpolate to 32 × 32 × \(d\), then flatten back to 1D.

Limitations:

  • Interpolation introduces artifacts, especially for large resolution changes.
  • Fine-tuning is often required after interpolation to recover quality.
  • The model has no awareness that interpolation occurred — the semantics of position embeddings may shift.

These limitations are precisely why later models increasingly moved away from learned absolute lookup tables toward functional encodings such as sinusoidal PE or RoPE.

# PyTorch-like pseudocode: resize a learned absolute PE table
import torch
import torch.nn.functional as F

def resize_abs_pos_embed(pos_embed, old_hw, new_hw):
    """
    pos_embed: [1, H_old*W_old, D]
    old_hw: (H_old, W_old)
    new_hw: (H_new, W_new)
    return: [1, H_new*W_new, D]
    """
    H_old, W_old = old_hw
    H_new, W_new = new_hw
    B, N, D = pos_embed.shape
    assert B == 1
    assert N == H_old * W_old

    # [1, N, D] -> [1, D, H_old, W_old]
    pe_2d = pos_embed.reshape(1, H_old, W_old, D).permute(0, 3, 1, 2)

    # common implementation: bicubic interpolation
    pe_2d = F.interpolate(
        pe_2d,
        size=(H_new, W_new),
        mode="bicubic",
        align_corners=False,
    )

    # [1, D, H_new, W_new] -> [1, H_new*W_new, D]
    pe_new = pe_2d.permute(0, 2, 3, 1).reshape(1, H_new * W_new, D)
    return pe_new

3.3 Frequency Scaling for Resolution Extrapolation

When the inference resolution significantly exceeds the training resolution, RoPE’s position indices enter a range where the model has had little or no training signal. This is the exact analog of long-context extrapolation in RoPE-based LLMs. As a result, several techniques from the LLM literature — especially Position Interpolation (PI) and NTK-aware scaled RoPE — have been adapted or cited by diffusion papers such as FiT and Lumina-T2X / Next-DiT. Importantly, PI is a peer-reviewed method, while NTK-aware scaling is best viewed as a widely used community heuristic that has nevertheless been explicitly cited and adopted in later generative-model work.

Position interpolation. Instead of using raw position indices, scale them back into the training range. In 1D, with training length \(L_{\text{train}}\) and test length \(L_{\text{new}}\),

\[m' = m \cdot \frac{L_{\text{train}}}{L_{\text{new}}}.\]

A more endpoint-preserving form sometimes used in vision is

\[m' = m \cdot \frac{L_{\text{train}}-1}{L_{\text{new}}-1}.\]

For 2D images, this becomes

\[h' = h \cdot \frac{H'_{\text{train}}-1}{H'_{\text{new}}-1}, \qquad w' = w \cdot \frac{W'_{\text{train}}-1}{W'_{\text{new}}-1}.\]

Then RoPE is evaluated at \(h',w'\) rather than \(h,w\). The effect is to keep all effective positions inside the range seen during training, at the cost of compressing the frequency spectrum. This is exactly the central idea of PI in LLMs, now reinterpreted axis-wise for images.

# PyTorch-like pseudocode: coordinate remapping / RoPE scaling for extrapolation
import torch

def position_interpolation_coords(H_new, W_new, H_train, W_train, device="cpu"):
    ys = torch.arange(H_new, device=device, dtype=torch.float32)
    xs = torch.arange(W_new, device=device, dtype=torch.float32)

    if H_new > 1:
        ys = ys * (H_train - 1) / (H_new - 1)
    if W_new > 1:
        xs = xs * (W_train - 1) / (W_new - 1)

    grid_y, grid_x = torch.meshgrid(ys, xs, indexing="ij")
    return grid_y.reshape(-1), grid_x.reshape(-1)

NTK-aware scaling. Rather than compressing positions uniformly, another common heuristic modifies the rotary base \(\tau\). A frequently used form is

\[\tau' = \tau \cdot \alpha^{d/(d-2)}, \qquad \alpha = \frac{L_{\text{new}}}{L_{\text{train}}}.\]

Equivalently, one can think of this as modifying the inverse frequencies rather than the coordinates:

\[\omega_i' = \tau'^{-2i/d}.\]

This leaves token coordinates unchanged but stretches the usable frequency range. Lumina-T2X explicitly describes scaling the rotary base so that the lowest-frequency term behaves similarly to PI while higher-frequency terms transition more gradually, enabling tuning-free resolution extrapolation.

# PyTorch-like pseudocode: NTK-aware rotary base scaling

def ntk_scaled_base(base, L_train, L_new, dim):
    """
    Common NTK-aware scaling heuristic.
    """
    alpha = float(L_new) / float(L_train)
    return base * (alpha ** (dim / (dim - 2)))

Normalized coordinates. Another alternative is to normalize coordinates to a fixed range regardless of resolution:

\[\hat h = \frac{h}{H'-1}, \qquad \hat w = \frac{w}{W'-1}.\]

One can then either feed \(\hat h,\hat w\) directly into a sinusoidal/Fourier map or convert them back to an equivalent rescaled position for RoPE. This ensures that the center is always near \((0.5,0.5)\) across resolutions. However, it sacrifices absolute scale information: position encoding alone no longer tells the model whether the current image is low- or high-resolution. This trade-off is consistent with NaViT’s discussion of fractional coordinates: they improve generalization to unseen sizes but can obscure aspect-ratio or scale information if not handled carefully.

# PyTorch-like pseudocode: coordinate remapping / RoPE scaling for extrapolation
import torch

def normalized_coords(H, W, device="cpu"):
    ys = torch.arange(H, device=device, dtype=torch.float32)
    xs = torch.arange(W, device=device, dtype=torch.float32)
    if H > 1:
        ys = ys / (H - 1)
    if W > 1:
        xs = xs / (W - 1)
    grid_y, grid_x = torch.meshgrid(ys, xs, indexing="ij")
    return grid_y.reshape(-1), grid_x.reshape(-1)

Resolution-dependent frequencies. Some modern generative systems go one step further and make the RoPE frequency base depend on the target resolution. Lumina-T2X / Next-DiT is the clearest example in the diffusion literature: it explicitly studies direct extrapolation versus NTK-aware scaled RoPE for higher-resolution inference.
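As a hypothetical sketch (not Lumina-T2X's exact rule), one way to make the base resolution-dependent is to apply the NTK-style scaling per spatial axis, choosing a separate base for height and width from the target grid. The function name and the choice to clamp the scale at 1 are assumptions for illustration.

```python
# Hypothetical sketch: per-axis rotary base chosen from the target resolution,
# using the NTK-style heuristic tau' = tau * alpha^(d/(d-2)) on each axis.
def resolution_dependent_bases(base, H_train, W_train, H_new, W_new, dim_h, dim_w):
    def scale(L_train, L_new, d):
        alpha = max(1.0, float(L_new) / float(L_train))  # never shrink below training base
        return base * (alpha ** (d / (d - 2)))

    return scale(H_train, H_new, dim_h), scale(W_train, W_new, dim_w)
```

Axes that stay within the training range keep the original base, while stretched axes get a larger one, so only the extrapolated dimension pays the frequency-compression cost.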


3.4 Packing Strategies: NaViT

NaViT (Patch n’ Pack) 13 introduces a fundamentally different approach to multi-resolution training: packing multiple images of different sizes into a single sequence. The key idea is not to force all images to a common resolution, but to keep each image in its own native patch grid and concatenate multiple token sequences into one packed batch, with masks ensuring isolation between examples. The NaViT paper explicitly introduces self-attention masks and revisits positional embeddings through factorized and optionally fractional coordinate embeddings for arbitrary resolutions and aspect ratios.

Mechanism:

  1. Each image is patchified independently, producing token sequences of different lengths.
  2. Multiple images’ tokens are concatenated into one sequence.
  3. Factorized 2D position embeddings are assigned per image: each image’s tokens receive position indices starting from \((0,0)\) up to their respective \(\left(H'_i-1,W'_i-1\right)\).
  4. Attention masking ensures each image’s tokens only attend to tokens from the same image.

Formally, suppose batch element \(b\) contains \(M_b\) images, and the \(j\)-th image contributes \(N_{b,j}\) tokens. The packed sequence length is

\[N_b = \sum_{j=1}^{M_b} N_{b,j}.\]

Let token \(u\) in the packed sequence have an image identity \(g(u) \in \{1,\dots,M_b\}\). Then the block-diagonal attention mask is

\[\mathbf{M}_{uv}= \begin{cases} 0, & g(u)=g(v),\\[10pt] -\infty, & g(u)\neq g(v). \end{cases}\]

Thus attention remains intra-image, even though storage and batching are inter-image. For factorized positional embeddings, if token \(u\) belongs to image \(j\) and corresponds to grid location \(\left(h_u,w_u\right)\) inside that image, then

\[\mathbf{PE}_u = \phi_h^{(j)}(h_u) + \phi_w^{(j)}(w_u)\]

or, in a concatenative form,

\[\mathbf{PE}_u = [\phi_h^{(j)}(h_u);\phi_w^{(j)}(w_u)].\]

NaViT also studies fractional coordinates, i.e.

\[\hat h_u=\frac{h_u}{H'_j-1}, \qquad \hat w_u=\frac{w_u}{W'_j-1},\]

to improve extrapolation to unseen resolutions, while noting that such normalization can blur explicit aspect-ratio information if used incautiously.

Advantages:

  • No padding waste: GPU utilization is high because the sequence is densely packed.
  • True multi-resolution training: Each image retains its native resolution without resizing.
  • Natural aspect ratio handling: Each image’s position indices reflect its actual shape.

Adaptation for diffusion models. NaViT was originally proposed for discriminative ViTs, but the same principle is relevant for diffusion backbones. FiT does not literally pack multiple unrelated images into one attention-connected sequence; instead, it uses one image per sequence plus padding and masking. Conceptually, however, both approaches exploit the same idea: let token length vary, and recover batching efficiency with masks rather than forcing a single spatial shape upfront.

# PyTorch-like pseudocode: NaViT-style packing
import torch

def pack_examples(token_list, coord_list):
    """
    token_list: list of [N_i, D]
    coord_list: list of tuples (h_i: [N_i], w_i: [N_i])
    returns:
        tokens: [N_total, D]
        coords_h: [N_total]
        coords_w: [N_total]
        group_ids: [N_total]
        attn_mask: [N_total, N_total]
    """
    tokens = []
    coords_h = []
    coords_w = []
    group_ids = []

    for gid, (tok, (h, w)) in enumerate(zip(token_list, coord_list)):
        tokens.append(tok)
        coords_h.append(h)
        coords_w.append(w)
        group_ids.append(torch.full((tok.shape[0],), gid, dtype=torch.long, device=tok.device))

    tokens = torch.cat(tokens, dim=0)
    coords_h = torch.cat(coords_h, dim=0)
    coords_w = torch.cat(coords_w, dim=0)
    group_ids = torch.cat(group_ids, dim=0)

    same_group = group_ids[:, None] == group_ids[None, :]
    attn_mask = torch.where(
        same_group,
        torch.zeros_like(same_group, dtype=torch.float32),
        torch.full_like(same_group, float("-inf"), dtype=torch.float32),
    )
    return tokens, coords_h, coords_w, group_ids, attn_mask

3.5 Progressive Resolution Training

Several large-scale models use progressive training — starting at low resolution and gradually increasing. This idea is especially natural in high-resolution diffusion because both optimization stability and compute cost improve when the early stage uses shorter token sequences. The strategy is explicit in systems such as PixArt-Σ (weak-to-strong / coarse-to-fine scaling to 4K) and appears again in curriculum-style video training such as HunyuanVideo.

Let stages be indexed by \(s=1,\dots,S\) with target patch grids

\[(H'_1,W'_1),\quad (H'_2,W'_2),\quad \dots,\quad (H'_S,W'_S),\]

satisfying

\[H'_1W'_1 < H'_2W'_2 < \cdots < H'_S W'_S.\]

At each stage, the model parameters are warm-started from the previous stage:

\[\theta^{(s)}_0 \leftarrow \theta^{(s-1)}_{\text{final}}.\]

The PE adaptation rule depends on the PE family:

  • Learned absolute PE:

    \[\mathbf{PE}^{(s)} = \mathcal{I}_{(H'_s,W'_s)}\!\left(\mathbf{PE}^{(s-1)}\right),\]

    where \(\mathcal{I}\) is interpolation.

  • Sinusoidal PE:

    \[\mathbf{PE}^{(s)}(h,w)=\Phi_{\text{sin}}^{(s)}(h,w),\]

    recomputed analytically for the new grid.

  • RoPE: no table update is needed; only the admissible coordinate range changes:

    \[(h,w)\in[0,H'_s-1]\times[0,W'_s-1].\]

The practical advantage is that the early stage learns coarse global semantics on short sequences, while later stages specialize to finer detail at larger spatial grids. HunyuanVideo describes an analogous curriculum in the video case: low-resolution short video → low-resolution long video → high-resolution long video.

When the resolution increases:

  • Learned PE: The position embedding table is resized via interpolation, then fine-tuned at the new resolution.
  • Sinusoidal PE: Recomputed for the new grid size — no interpolation needed, though the model may need adaptation to the new position range.
  • RoPE: Position indices extend naturally. The model sees new (larger) position indices but the rotation operation itself is unchanged. Fine-tuning at the new resolution helps the model adapt.

PixArt-\(\Sigma\) follows a coarse-to-fine scaling strategy toward 4K generation; HunyuanVideo similarly uses progressive curriculum learning across duration and resolution.

# PyTorch-like pseudocode: stage-wise progressive resolution training
def progressive_resolution_training(
    model,
    stage_configs,   # list of dicts: [{"H":16,"W":16,"steps":...}, ...]
    pe_type="rope",  # "learned", "sinusoidal", "rope"
):
    pos_embed = None

    for stage_id, cfg in enumerate(stage_configs):
        H, W = cfg["H"], cfg["W"]

        if pe_type == "learned":
            if pos_embed is None:
                pos_embed = init_learned_pos_embed(H, W, model.dim)
            else:
                old_H, old_W = pos_embed_hw(pos_embed)
                pos_embed = resize_abs_pos_embed(pos_embed, (old_H, old_W), (H, W))
            model.set_pos_embed(pos_embed)

        elif pe_type == "sinusoidal":
            pos_embed = sinusoidal_2d(H, W, model.dim, device=model.device)
            model.set_pos_embed(pos_embed[None])

        elif pe_type == "rope":
            # no table to resize; just record current coordinate range
            model.set_rope_grid_shape(H, W)

        train_for_steps(model, cfg["steps"], data_loader=cfg["loader"])
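The stage-wise pseudocode above leaves `resize_abs_pos_embed` undefined. A minimal sketch, assuming the PE table is stored flat as `[H*W, D]` and resized with bicubic interpolation (a common choice for learned absolute PEs, though the exact mode is a design decision):

```python
import torch
import torch.nn.functional as F

def resize_abs_pos_embed(pos_embed, old_hw, new_hw):
    """
    Resize a learned absolute PE table [old_H*old_W, D] to [new_H*new_W, D]
    by bicubic interpolation over the 2D grid.
    """
    old_H, old_W = old_hw
    new_H, new_W = new_hw
    D = pos_embed.shape[-1]
    # [N, D] -> [1, D, old_H, old_W] so F.interpolate sees a spatial grid
    grid = pos_embed.reshape(old_H, old_W, D).permute(2, 0, 1).unsqueeze(0)
    grid = F.interpolate(grid, size=(new_H, new_W), mode="bicubic", align_corners=False)
    # back to [new_H*new_W, D]
    return grid.squeeze(0).permute(1, 2, 0).reshape(new_H * new_W, D)
```

After resizing, the table is fine-tuned at the new resolution, as described for learned PEs above.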

3.6 Handling Multi-Resolution for Video Generation

Video generation introduces an additional temporal dimension, compounding the multi-resolution challenge:

  • Spatial resolution: Different video resolutions (480p, 720p, 1080p).
  • Temporal length: Different frame counts.
  • Frame rate: Different temporal sampling rates.

Modern video diffusion Transformers therefore need a positional strategy that scales not only with \(H',W'\), but also with temporal length \(T'\). The most common answer in current large video DiTs is 3D RoPE. CogVideoX states that it uses 3D RoPE to model the positional relationship of videos of different shapes, and HunyuanVideo states explicitly that it uses RoPE in each Transformer block and extends it to three dimensions to support multi-resolution, multi-aspect-ratio, and varying-duration generation.

For a video latent grid of size

\[T' \times H' \times W',\]

each token has coordinates

\[(t,h,w).\]

A factorized 3D RoPE splits the head dimension into three segments:

\[\mathbf{q} = [\mathbf{q}^{(T)};\mathbf{q}^{(H)};\mathbf{q}^{(W)}], \qquad \mathbf{k} = [\mathbf{k}^{(T)};\mathbf{k}^{(H)};\mathbf{k}^{(W)}].\]

Then

\[\widetilde{\mathbf{q}}(t,h,w)= \left[ \mathbf{R}^{(T)}_t \mathbf{q}^{(T)}; \mathbf{R}^{(H)}_h \mathbf{q}^{(H)}; \mathbf{R}^{(W)}_w \mathbf{q}^{(W)} \right],\]

and similarly for \(\widetilde{\mathbf{k}}(t,h,w)\).

The attention score between two spatiotemporal tokens at \((t_1,h_1,w_1)\) and \((t_2,h_2,w_2)\) is then

\[\widetilde{\mathbf{q}}(t_1,h_1,w_1)^\top \widetilde{\mathbf{k}}(t_2,h_2,w_2),\]

which depends on relative displacement along time, height, and width.

HunyuanVideo describes the implementation explicitly: it computes the rotary frequency matrix separately for time, height, and width, partitions the feature channels into three segments, applies the corresponding coordinate frequencies, and concatenates the results. This is exactly the canonical 3D factorized RoPE design.

A training system must also cope with heterogeneous durations and aspect ratios. HunyuanVideo therefore bucketizes videos by duration and aspect ratio, assigns each bucket an appropriate batch size, and uses a progressive curriculum from low-resolution short video to high-resolution long video. This is the video analogue of multi-resolution image bucketing plus stage-wise scaling.

If one uses bucket IDs

\[b = (b_T, b_{AR}),\]

then each bucket has its own target shape

\[(T'_b, H'_b, W'_b),\]

and per-sample coordinates are always generated inside that bucket’s coordinate range:

\[t\in\{0,\ldots,T'_b-1\},\quad h\in\{0,\ldots,H'_b-1\},\quad w\in\{0,\ldots,W'_b-1\}.\]

This keeps the batching system efficient without changing the underlying 3D RoPE definition.

# PyTorch-like pseudocode: 3D RoPE for video tokens
import torch

def make_3d_coords(T, H, W, device="cpu"):
    ts = torch.arange(T, device=device)
    ys = torch.arange(H, device=device)
    xs = torch.arange(W, device=device)
    grid_t, grid_y, grid_x = torch.meshgrid(ts, ys, xs, indexing="ij")
    return grid_t.reshape(-1), grid_y.reshape(-1), grid_x.reshape(-1)

def apply_3d_rope(q, k, coords_t, coords_h, coords_w, base=10000.0):
    """
    q, k: [B, N, Heads, D_head]
    coords_t/h/w: [N]
    """
    D = q.shape[-1]
    # simple 1/3 split, rounded down to even sizes so each segment can be
    # rotated in (sin, cos) channel pairs; real models may use non-uniform splits
    D_t = (D // 3) // 2 * 2
    D_h = (D // 3) // 2 * 2
    D_w = D - D_t - D_h

    q_t, q_h, q_w = q[..., :D_t], q[..., D_t:D_t+D_h], q[..., D_t+D_h:]
    k_t, k_h, k_w = k[..., :D_t], k[..., D_t:D_t+D_h], k[..., D_t+D_h:]

    inv_t = build_rope_freqs(D_t, base=base, device=q.device)
    inv_h = build_rope_freqs(D_h, base=base, device=q.device)
    inv_w = build_rope_freqs(D_w, base=base, device=q.device)

    q_t = apply_1d_rope(q_t, coords_t, inv_t)
    k_t = apply_1d_rope(k_t, coords_t, inv_t)
    q_h = apply_1d_rope(q_h, coords_h, inv_h)
    k_h = apply_1d_rope(k_h, coords_h, inv_h)
    q_w = apply_1d_rope(q_w, coords_w, inv_w)
    k_w = apply_1d_rope(k_w, coords_w, inv_w)

    q = torch.cat([q_t, q_h, q_w], dim=-1)
    k = torch.cat([k_t, k_h, k_w], dim=-1)
    return q, k

def assign_video_bucket(T, H, W, duration_bins, aspect_bins):
    # toy bucketization logic
    ar = W / H
    bT = min(range(len(duration_bins)), key=lambda i: abs(duration_bins[i] - T))
    bAR = min(range(len(aspect_bins)), key=lambda i: abs(aspect_bins[i] - ar))
    return bT, bAR
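The pseudocode above calls `build_rope_freqs` and `apply_1d_rope` without defining them. One minimal sketch of these 1D building blocks, assuming the `[B, N, Heads, D_head]` layout used above and even per-axis dimensions:

```python
import torch

def build_rope_freqs(dim, base=10000.0, device="cpu"):
    # inverse frequencies for dim // 2 rotation pairs (dim must be even)
    idx = torch.arange(0, dim, 2, device=device, dtype=torch.float32)
    return 1.0 / (base ** (idx / dim))  # [dim // 2]

def apply_1d_rope(x, coords, inv_freq):
    """
    x: [B, N, Heads, D] with D even; coords: [N]; inv_freq: [D // 2].
    Rotates each channel pair (x[2i], x[2i+1]) by angle coords * inv_freq[i].
    """
    angles = coords.float()[:, None] * inv_freq[None, :]  # [N, D//2]
    cos = angles.cos()[:, None, :]  # [N, 1, D//2], broadcasts over heads
    sin = angles.sin()[:, None, :]
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```

Since every pair is a pure rotation, the per-token norm is preserved, and the dot product of two rotated tokens depends only on their coordinate difference along that axis.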

3.7 Summary of Multi-Resolution Strategies

| Strategy | PE Type | Resolution Change | Aspect Ratio Change | Extrapolation | Models |
|---|---|---|---|---|---|
| Interpolation | Learned / absolute grid | Resize + fine-tune | Limited | Poor | Early ViT/DiT-style absolute PE adaptation |
| Direct computation | Functional sinusoidal / frequency-embedded grid | Recompute | Natural | Moderate | Functional absolute-PE pipelines; SDXL-style metadata conditioning can complement this |
| Native variable-length | RoPE | Extend indices | Natural | Good | FiT, HunyuanDiT, FLUX-style 2D RoPE systems |
| Frequency scaling | RoPE | Scale coordinates or \(\tau\) | Natural | Good | FiT-style PI, Lumina-T2X / Next-DiT NTK-aware scaling |
| Normalized coords | Any functional coordinate map | Map to \([0,1]\) | Natural | Excellent | Fractional-coordinate variants, some custom models |
| Packing | Factorized 2D | Per-image | Per-image | Good | NaViT-style packed training |
| Progressive training | Any | Stage-wise adapt | Stage-wise | N/A | PixArt-Σ, HunyuanVideo curricula |

Overall, the field has moved from resizing a learned absolute table toward functional, coordinate-driven schemes. In image generation, this means direct sinusoidal evaluation, 2D RoPE, and RoPE extrapolation tricks; in video generation, this means 3D RoPE plus duration/aspect-ratio bucketing and progressive curricula. The common design principle is simple: the more position encoding is expressed as a function of coordinates rather than a fixed table of seen positions, the easier it is to support arbitrary resolution and aspect ratio.



4. Time and Noise Embeddings in Diffusion and Flow Matching Models

In diffusion and flow matching generative models, the denoising (or velocity) network is trained to reverse a progressive noising process. The forward process is generally written as:

\[\mathbf{x}_t = \alpha_t \mathbf{x}_0 + \sigma_t \epsilon, \qquad \epsilon \sim \mathcal{N}(0,\mathbf{I})\]

where $\alpha_t$ and $\sigma_t$ define a noise schedule that continuously transforms clean data \(\mathbf{x}_0\) into noise. The network \(f_\theta(\mathbf{x}_t, t)\) must undo this corruption — but the optimal operation changes dramatically across the process. At high noise levels the input is nearly isotropic Gaussian and the network must hallucinate global structure from almost nothing; at low noise levels most structure is already present and the network must only add subtle high-frequency details. Without knowing the current noise level, the network faces a fundamentally ill-posed problem: the same input \(\mathbf{x}_t\) can arise from vastly different pairs \((\mathbf{x}_0, \epsilon)\) depending on the ratio of $\alpha_t$ to $\sigma_t$.

The time or noise embedding is the mechanism that resolves this ambiguity. It converts the scalar conditioning variable (timestep, noise level, or an equivalent quantity) into a high-dimensional vector and injects it into the network so that every layer can adapt its behavior to the current denoising stage. Its design involves three interlinked decisions:

  1. What to embed — the choice of scalar parameterization ($t$, $\sigma$, $\log \sigma$, SNR, $\log \mathrm{SNR}$, etc.)
  2. How to embed — the mapping from scalar to high-dimensional vector (sinusoidal encoding, Fourier features, learned embeddings, etc.)
  3. How to inject — the mechanism that fuses the embedding into the network’s feature representations (addition, adaptive normalization, token concatenation, etc.)

These choices interact with the noise schedule, the loss weighting, the network architecture, and the preconditioning strategy. This chapter provides a systematic treatment of each.


4.1 Parameterizations of the Conditioning Variable

All standard parameterizations encode the same underlying information — the position along the noising trajectory — and are related by monotonic transformations. Nevertheless, they differ in numerical range, uniformity, and mathematical naturality, which affects both optimization dynamics and embedding quality.


4.1.1 Discrete Timestep

The original DDPM 20 defines a discrete Markov chain with $T = 1000$ steps. Each step applies:

\[q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \mathcal{N}\!\left(\mathbf{x}_t; \sqrt{1-\beta_t}\,\mathbf{x}_{t-1}, \beta_t \mathbf{I}\right)\]

The integer timestep $t$ is directly fed into a sinusoidal positional encoding. This is simple and natural for the discrete formulation, but it tightly couples the conditioning to a specific number of diffusion steps $T$ and a specific schedule $\{\beta_t\}$. Any change to the schedule requires reinterpreting what each integer means.

Improved DDPM [^iddpm] and Latent Diffusion Models (LDM) 2 retain this convention. LDM and Stable Diffusion apply the diffusion process in a learned latent space but still use integer timestep embeddings in the UNet backbone.


4.1.2 Continuous Time

Score SDE 21 generalizes discrete diffusion to continuous-time stochastic differential equations, with $t$ ranging over $[0,T]$ (commonly rescaled to $[0,1]$). Flow matching 22 23 also defines the generative process over continuous $t \in [0,1]$, with the interpolant:

\[\mathbf{x}_t = (1 - t)\mathbf{x}_0 + t \epsilon \qquad (\text{or the reverse convention})\]

Continuous time decouples the conditioning variable from any specific discretization, making it compatible with adaptive ODE/SDE solvers and arbitrary step counts at inference. This is the standard parameterization in modern flow matching models such as Stable Diffusion 3 3, SiT 8, and FLUX.

Caveat: The relationship between $t$ and the actual noise level depends on the schedule. A linear schedule in $t$ does not produce uniform coverage of noise levels. SD3 3 addresses this with a timestep shifting strategy: they reparameterize $t$ via

\[t' = \frac{t \cdot s}{1 + (s - 1)\cdot t}\]

with a resolution-dependent shift factor $s$, ensuring that higher-resolution models spend more time at higher noise levels where the denoising task is harder.
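The shift reparameterization is a one-liner; a minimal sketch (endpoint-preserving by construction, since $t'=0$ at $t=0$ and $t'=1$ at $t=1$):

```python
def shift_timestep(t, s):
    """
    SD3-style timestep shifting: t' = s*t / (1 + (s - 1)*t).
    With s > 1, interior timesteps are pushed toward t' > t, i.e. toward
    the high-noise end under the convention that t = 1 is pure noise.
    """
    return (s * t) / (1.0 + (s - 1.0) * t)
```

For example, with $s = 3$ the midpoint $t = 0.5$ maps to $t' = 0.75$, concentrating training and sampling steps at higher noise levels.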


4.1.3 Noise Level

EDM 24 argues that the network should be parameterized directly in terms of the noise standard deviation $\sigma$ rather than an abstract timestep. In EDM’s formulation:

\[\mathbf{x} = \mathbf{x}_0 + \sigma \epsilon\]

(with a suitable input scaling). The noise level $\sigma$ has a clear physical meaning: it is the standard deviation of the additive Gaussian noise. During training, EDM samples $\ln \sigma \sim \mathcal{N}(P_{\text{mean}}, P_{\text{std}}^2)$ (with $P_{\text{mean}} = -1.2$, $P_{\text{std}} = 1.2$ for ImageNet), producing a log-normal distribution over $\sigma$. This allows the model to see both very small and very large noise levels.

Advantage: Schedule-independent. The network can be trained with one noise distribution and sampled with another, because its conditioning is the physically meaningful $\sigma$ rather than an opaque $t$.

Disadvantage: The raw $\sigma$ spans a wide dynamic range (e.g., $\sigma \in [0.002, 80]$ in EDM), making it unsuitable for direct use as a network input.


4.1.4 Log Noise Level

To compress the dynamic range, EDM 24 conditions the network on a transformed noise level:

\[c_{\text{noise}} = \frac{1}{4}\ln \sigma\]

The factor $1/4$ maps the typical training range of $\ln \sigma$ (roughly $[-6, 4]$) to approximately $[-1.5, 1.0]$, which is a numerically well-behaved input for positional encodings. EDM2 25 retains this convention.

$\log \sigma$ is a natural choice because the perceptually important range of noise levels is roughly log-uniform: doubling $\sigma$ has roughly the same qualitative effect whether $\sigma$ goes from $0.1$ to $0.2$ or from $10$ to $20$. Conditioning on $\log \sigma$ gives each “octave” of noise equal representation in the embedding space.
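The two pieces described above — the log-normal training distribution over $\sigma$ and the $\frac{1}{4}\ln\sigma$ conditioning transform — are short enough to sketch directly:

```python
import torch

def sample_edm_sigma(batch, P_mean=-1.2, P_std=1.2, generator=None):
    # EDM training-time noise levels: ln(sigma) ~ N(P_mean, P_std^2)
    return torch.exp(P_mean + P_std * torch.randn(batch, generator=generator))

def edm_c_noise(sigma):
    # EDM conditioning input: c_noise = (1/4) * ln(sigma)
    return 0.25 * torch.log(sigma)
```

With the default $P_{\text{mean}} = -1.2$, $P_{\text{std}} = 1.2$, the conditioning input $c_{\text{noise}}$ is centered near $-0.3$ with standard deviation $0.3$ — a comfortable range for a sinusoidal encoding.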


4.1.5 Signal-to-Noise Ratio (SNR) and Log SNR

The signal-to-noise ratio is defined as:

\[\mathrm{SNR}(t) = \frac{\alpha_t^2}{\sigma_t^2}\]

It monotonically decreases from $+\infty$ (clean data) to $0$ (pure noise) along the forward process. VDM [^vdm] shows that the variational lower bound of diffusion models can be written entirely in terms of the SNR schedule:

\[\mathcal{L}_{\mathrm{VLB}} = \frac{1}{2}\,\mathbb{E}_{\epsilon,t} \left[ -\frac{d\,\mathrm{SNR}(t)}{dt} \left\| \mathbf{x}_0 - \hat{\mathbf{x}}_\theta(\mathbf{x}_t, t) \right\|^2 \right] + C\]

This makes SNR mathematically privileged: it is the quantity that directly governs the loss, independent of the particular schedule parameterization.

The log signal-to-noise ratio is:

\[\lambda_t = \log \mathrm{SNR}(t) = \log \frac{\alpha_t^2}{\sigma_t^2} = 2 \log \alpha_t - 2 \log \sigma_t\]

$\lambda_t$ ranges from $+\infty$ to $-\infty$. Simple Diffusion 26 and several analytical works use $\lambda_t$ as the primary variable because the diffusion loss per unit of $\lambda$ is approximately constant across a wide range, suggesting that equal spacing in log SNR corresponds to “equal perceptual importance” of different noise levels. The $v$-prediction parameterization introduced by Salimans and Ho 27 is also closely tied to the SNR: the optimal $v$-prediction target naturally balances the $\epsilon$-prediction and $\mathbf{x}_0$-prediction objectives as a function of $\mathrm{SNR}(t)$.


4.1.6 Mathematical Relationships and Conversions

All parameterizations are connected by monotonic transformations. Given a VP (variance-preserving) schedule with

\[\alpha_t^2 + \sigma_t^2 = 1\]
FromToConversion
$t$ (continuous)$\sigma_t$Via the schedule: $\sigma_t = \sigma(t)$
$\sigma$SNR$\mathrm{SNR} = \alpha^2 / \sigma^2$. For VE: $\mathrm{SNR} = 1/\sigma^2$. For VP: $\mathrm{SNR} = (1-\sigma^2)/\sigma^2$
$\sigma$$\log \sigma$$\log \sigma = \ln \sigma$
SNR$\log \mathrm{SNR}$$\lambda = \ln \mathrm{SNR}$
$\log \sigma$$\lambda$For VE $(\alpha = 1)$: $\lambda = -2 \log \sigma$

Because the transformations are monotonic, the choice of parameterization carries no intrinsic information-theoretic difference — any injective function of $t$ is equally informative. The practical differences lie in:

  • Numerical conditioning: $\log \sigma$ and $\log \mathrm{SNR}$ compress the dynamic range, making the embedding input more uniform and better conditioned for neural network processing.
  • Schedule independence: $\sigma$, $\log \sigma$, SNR, and $\log \mathrm{SNR}$ are intrinsic to the noise level and independent of how the schedule maps time to noise. This allows schedule changes without retraining.
  • Loss-theoretic naturality: SNR and $\log \mathrm{SNR}$ appear directly in the diffusion loss, making them natural choices for analyses of loss weighting and schedule design.
  • Simplicity: Discrete or continuous $t$ is the simplest and most intuitive parameterization, requiring no schedule-dependent preprocessing.
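The conversions in the table above reduce to a few lines; a minimal sketch covering the VP and VE cases:

```python
import math

def snr_from_sigma(sigma, schedule="VP"):
    # VP: alpha^2 = 1 - sigma^2;  VE: alpha = 1
    if schedule == "VP":
        return (1.0 - sigma**2) / sigma**2
    return 1.0 / sigma**2

def log_snr(sigma, schedule="VP"):
    # lambda = ln SNR; for VE this equals -2 * ln(sigma)
    return math.log(snr_from_sigma(sigma, schedule))
```

Because each map is monotonic in $\sigma$, any of these quantities can serve as the embedding input; only the numerical range fed to the encoder changes.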

4.2 Embedding Algorithms

Once the scalar conditioning variable $s$ (whether $t$, $\sigma$, $\log \sigma$, etc.) is chosen, it must be lifted into a high-dimensional vector that the network can process. Neural networks are notoriously poor at extracting useful features from raw scalar inputs [^fourierfeatures]; positional encodings address this by projecting the scalar onto a rich set of basis functions before any learned processing.


4.2.1 Sinusoidal Positional Embeddings

The most widely used embedding is borrowed directly from the Transformer architecture [^transformer]:

\[\mathrm{PE}(s, 2k) = \sin\!\left(\frac{s}{\tau^{2k/d}}\right), \qquad \mathrm{PE}(s, 2k+1) = \cos\!\left(\frac{s}{\tau^{2k/d}}\right)\]

for $k = 0,1,\ldots,d/2 - 1$, where $d$ is the embedding dimension and $\tau = 10{,}000$ is the standard base frequency. This produces a $d$-dimensional vector from the scalar $s$.

Properties:

  • Deterministic and fixed — no learnable parameters in the encoding itself.
  • Multi-scale: The geometric progression of frequencies $\omega_k = 1/\tau^{2k/d}$ spans a wide range, from $\omega_0 = 1$ (captures fine, rapid variation in $s$) down to $\omega_{d/2-1} \approx 1/\tau$ (captures coarse, slow variation). This allows the network to distinguish both closely and widely spaced values of $s$.
  • Smoothness: The encoding varies smoothly with $s$, ensuring that nearby timesteps produce similar embeddings.

Usage: DDPM 20, ADM 28, LDM 2, SDXL 29, Imagen 27, DiT 1, SD3 3, PixArt-$\alpha$ 11, and the vast majority of published diffusion models use sinusoidal positional embeddings. The typical embedding dimension is $d = 256$ or $d = 128$.

Note on input range: The original Transformer uses integer positions $s \in {0,1,2,\ldots}$. When the input is a continuous value in $[0,1]$ or a log-noise-level, the effective frequency range may need adjustment. Some implementations multiply $s$ by a scaling factor (e.g., $s \leftarrow 1000 \cdot t$) to match the frequency range designed for integer inputs. EDM 24 handles this by using $c_{\text{noise}} = \frac{1}{4}\ln \sigma$ to place the input in a suitable numerical range.
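A common implementation of this encoding (it concatenates the sin and cos halves rather than interleaving them, which is equivalent up to a channel permutation):

```python
import math
import torch

def timestep_embedding(s, dim, tau=10000.0):
    """
    Map scalar conditioning values s: [B] to sinusoidal embeddings [B, dim].
    Frequencies follow the geometric progression 1 / tau^(2k/d).
    """
    half = dim // 2
    freqs = torch.exp(-math.log(tau) * torch.arange(half, dtype=torch.float32) / half)
    args = s.float()[:, None] * freqs[None, :]  # [B, dim//2]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)
```

Continuous inputs in $[0,1]$ are typically pre-scaled (e.g., $s \leftarrow 1000 \cdot t$) before this function, as noted above.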


4.2.2 Random Fourier Features

Score SDE 21 introduces Gaussian random Fourier features for the noise-conditional architectures (NCSN++):

\[\gamma(s) = \left[ \sin(2\pi w_1 s), \cos(2\pi w_1 s), \ldots, \sin(2\pi w_{d/2} s), \cos(2\pi w_{d/2} s) \right]\]

where $w_i \sim \mathcal{N}(0, s_{\text{scale}}^2)$ are sampled once at initialization and then fixed throughout training. The scale hyperparameter $s_{\text{scale}}$ controls the frequency bandwidth.

Properties:

  • Theoretically motivated by kernel approximation: random Fourier features approximate a shift-invariant kernel (the Gaussian RBF kernel), enabling the network to learn smooth functions of the input scalar [^fourierfeatures].
  • Stochastic initialization means different random seeds produce different embeddings, introducing a source of non-determinism in architecture design. In practice this has negligible effect on final performance.
  • The frequencies are not geometrically spaced (unlike sinusoidal embeddings) but drawn from a Gaussian, providing dense coverage around zero frequency and sparser coverage at high frequencies.

Usage: NCSN++ in Score SDE 21. Some follow-up works adopt this approach, but sinusoidal embeddings have become more dominant in later models.
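A minimal module in this style: the frequencies are drawn once from a Gaussian and stored as a buffer, so they are saved with the model but never trained.

```python
import math
import torch
import torch.nn as nn

class GaussianFourierFeatures(nn.Module):
    """
    Random Fourier features: w ~ N(0, scale^2), sampled once and then fixed.
    """
    def __init__(self, dim, scale=16.0):
        super().__init__()
        # buffer, not parameter: persists in checkpoints but receives no gradient
        self.register_buffer("w", torch.randn(dim // 2) * scale)

    def forward(self, s):
        args = 2 * math.pi * s.float()[:, None] * self.w[None, :]
        return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)
```

The `scale` argument plays the role of $s_{\text{scale}}$ above, controlling the frequency bandwidth.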


4.2.3 Learned Fourier Features

A natural extension is to make the frequencies learnable:

\[\gamma(s) = \left[ \sin(2\pi w_1 s + b_1), \cos(2\pi w_1 s + b_1), \ldots \right]\]

where $w_i$ and $b_i$ are trainable parameters. This allows the network to adapt the frequency allocation to the task.

Tancik et al. [^fourierfeatures] show that both random and learned Fourier features dramatically outperform raw coordinate inputs for learning high-frequency functions. The learned variant can in principle allocate more frequencies to noise-level ranges where the denoising task changes most rapidly.

In practice, the difference between random and learned Fourier features is often small when followed by a sufficiently expressive MLP, and sinusoidal embeddings (which use deterministic geometric frequencies) work comparably well.


4.2.4 Learned Lookup Embeddings

For discrete timesteps, one can use a learnable embedding table $\mathbf{E} \in \mathbb{R}^{T \times d}$:

\[\mathbf{e}_t = \mathbf{E}[t]\]

This is analogous to token embeddings in language models. It is maximally flexible — each timestep gets an independent vector — but does not generalize to unseen timesteps and scales poorly to large $T$ or continuous time. For this reason, it is rarely used as the sole embedding in modern diffusion models, though some works combine it with sinusoidal features.


4.2.5 MLP Post-Processing

Regardless of the positional encoding method, nearly all models follow it with a learned MLP to produce the final conditioning vector:

\[\mathbf{e} = \mathrm{MLP}(\mathrm{PE}(s)) = \mathbf{W}_2 \cdot \phi(\mathbf{W}_1 \cdot \mathrm{PE}(s) + \mathbf{b}_1) + \mathbf{b}_2\]

where $\phi$ is a nonlinear activation, typically SiLU (Swish) 20 28. The MLP maps the fixed (or random) positional encoding to a learned, task-adaptive representation in the model’s hidden dimension $D$.

Standard configuration (used in DDPM, ADM, DiT, SD3, etc.):

  1. Sinusoidal PE: $s \mapsto \mathbb{R}^d$ (e.g., $d = 256$)
  2. Linear: $\mathbb{R}^d \rightarrow \mathbb{R}^D$ (e.g., $D = 1024$ or $1152$)
  3. SiLU activation
  4. Linear: $\mathbb{R}^D \rightarrow \mathbb{R}^D$

This two-layer MLP is sometimes called the timestep embedding MLP or time MLP. Its output $\mathbf{e} \in \mathbb{R}^D$ serves as the global conditioning vector that is injected into every block of the network.
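The standard configuration above can be packaged as a small module (sinusoidal PE followed by the two-layer SiLU MLP):

```python
import math
import torch
import torch.nn as nn

class TimestepEmbedder(nn.Module):
    """
    Sinusoidal PE (dim d) -> Linear -> SiLU -> Linear (hidden dim D).
    """
    def __init__(self, d=256, D=1152, tau=10000.0):
        super().__init__()
        self.d, self.tau = d, tau
        self.mlp = nn.Sequential(nn.Linear(d, D), nn.SiLU(), nn.Linear(D, D))

    def forward(self, s):
        half = self.d // 2
        freqs = torch.exp(-math.log(self.tau) * torch.arange(half, device=s.device) / half)
        args = s.float()[:, None] * freqs[None, :]
        pe = torch.cat([torch.sin(args), torch.cos(args)], dim=-1)
        return self.mlp(pe)
```

The output is the global conditioning vector $\mathbf{e} \in \mathbb{R}^D$ injected into every block.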


4.2.6 Combining Time Embeddings with Other Conditioning

In conditional generation models, the time embedding is often combined with other conditioning signals before injection:

  • DiT 1: Time embedding and label embedding are combined to form global control signals.

    $\mathbf{e} = \mathrm{MLP}(\mathrm{PE}(t) + \mathbf{e}_{\text{class}})$

    where $\mathbf{e}_{\text{class}}$ is a learned class embedding.

  • SD3 / MMDiT 3: Time embedding and pool text prompt embedding are combined to form global control signals.

    $\mathbf{e} = \mathrm{MLP}([\mathrm{PE}(t); \mathbf{e}_{\text{pool}}])$

    where $\mathbf{e}_{\text{pool}}$ is the pooled text embedding and $[\;;\;]$ denotes concatenation.

  • SDXL 29: Concatenates sinusoidal embeddings of $t$, original image size \((h_{\text{orig}}, w_{\text{orig}})\), and crop coordinates \((c_{\text{top}}, c_{\text{left}})\) before MLP processing.

This pattern makes the time MLP a general-purpose global conditioning encoder rather than a time-only module.


4.3 Injection Methods

The conditioning vector $\mathbf{e}$ must be fused into the network’s feature representations. The injection method determines how strongly and in what way the time information modulates computation. Different architectures (UNet vs. Transformer) favor different injection mechanisms.


4.3.1 Additive Injection

The simplest approach, used in DDPM 20:

\[\mathbf{h}' = \mathbf{h} + \mathrm{Linear}(\mathrm{SiLU}(\mathbf{e}))\]

where $\mathbf{h} \in \mathbb{R}^{C \times H \times W}$ is the feature map in a residual block and the linear projection maps $\mathbf{e} \in \mathbb{R}^D$ to $\mathbb{R}^C$, broadcast across spatial dimensions. This is applied between the two convolutional layers of each ResNet block.

Properties:

  • Simple to implement.
  • Acts as a spatially-uniform bias shift, which limits expressiveness — it cannot perform channel-wise rescaling or spatially-varying modulation.
  • Effective enough for the original DDPM but superseded by more expressive methods.
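A minimal sketch of this injection as it appears inside a ResNet block, with the projection broadcast across the spatial dimensions:

```python
import torch
import torch.nn as nn

class AdditiveTimeInjection(nn.Module):
    # h' = h + Linear(SiLU(e)), broadcast over H and W
    def __init__(self, emb_dim, channels):
        super().__init__()
        self.proj = nn.Sequential(nn.SiLU(), nn.Linear(emb_dim, channels))

    def forward(self, h, e):
        # h: [B, C, H, W], e: [B, D]
        return h + self.proj(e)[:, :, None, None]
```

Because the added term is constant over space and enters only additively, it can shift channels but cannot rescale them — the limitation AdaGN addresses.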

4.3.2 Adaptive Group Normalization (AdaGN)

ADM (Guided Diffusion) 28 replaces additive injection with Adaptive Group Normalization:

\[\mathrm{AdaGN}(\mathbf{h}, \mathbf{e}) = \mathbf{y}_s \odot \mathrm{GroupNorm}(\mathbf{h}) + \mathbf{y}_b\]

where $[\mathbf{y}_s, \mathbf{y}_b] = \mathrm{Linear}(\mathrm{SiLU}(\mathbf{e}))$ produces per-channel scale and bias vectors from the time embedding. This is an instance of FiLM (Feature-wise Linear Modulation) [^film] applied after GroupNorm.

Properties:

  • Multiplicative modulation $(\mathbf{y}_s)$ allows the network to amplify or suppress entire feature channels as a function of noise level, providing significantly more expressiveness than additive injection.
  • GroupNorm first normalizes each feature group to zero mean and unit variance, then the learned affine transform re-scales based on $\mathbf{e}$. This decouples the feature statistics from the content and lets the conditioning fully control the output distribution.
  • Became the de facto standard for UNet-based diffusion models. Used in ADM 28, Imagen 27, LDM 2, SDXL 29, and many others.
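A sketch of AdaGN (hypothetical module; the group count and dimensions are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaGN(nn.Module):
    """ADM-style Adaptive GroupNorm: the conditioning vector produces
    per-channel scale and bias applied after a parameter-free GroupNorm."""
    def __init__(self, emb_dim: int, channels: int, groups: int = 32):
        super().__init__()
        self.norm = nn.GroupNorm(groups, channels, affine=False)
        self.proj = nn.Linear(emb_dim, 2 * channels)

    def forward(self, h: torch.Tensor, e: torch.Tensor) -> torch.Tensor:
        # h: [B, C, H, W], e: [B, D]
        ys, yb = self.proj(F.silu(e)).chunk(2, dim=-1)  # [B, C] each
        return ys[:, :, None, None] * self.norm(h) + yb[:, :, None, None]
```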

4.3.3 Adaptive Layer Normalization (AdaLN) and AdaLN-Zero

With the shift from UNet to Transformer architectures, Adaptive Layer Normalization (AdaLN) replaced AdaGN as the primary injection mechanism. DiT 1 introduces AdaLN-Zero, the most influential variant:

For each Transformer block, the conditioning vector $\mathbf{e}$ is projected to six per-dimension parameter vectors:

\[[\gamma_1, \beta_1, \alpha_1, \gamma_2, \beta_2, \alpha_2] = \mathrm{Linear}(\mathrm{SiLU}(\mathbf{e}))\]

The block computation becomes:

\[\mathbf{x}' = \mathbf{x} + \alpha_1 \odot \mathrm{Attention}\!\left( (1+\gamma_1)\odot \mathrm{LN}(\mathbf{x}) + \beta_1 \right)\] \[\mathbf{x}'' = \mathbf{x}' + \alpha_2 \odot \mathrm{FFN}\!\left( (1+\gamma_2)\odot \mathrm{LN}(\mathbf{x}') + \beta_2 \right)\]

Key innovation — zero initialization: The linear layer producing $[\gamma, \beta, \alpha]$ is initialized to output all zeros. When $\gamma = 0$ and $\beta = 0$, the AdaLN reduces to standard LayerNorm. When $\alpha = 0$, the residual branch contributes nothing, making the entire block act as the identity function at initialization. This creates a smooth optimization landscape at the start of training and was shown by Peebles and Xie 1 to significantly outperform other conditioning strategies (including cross-attention and simple AdaLN without zero initialization).

Usage: AdaLN-Zero has become the standard for diffusion transformers. It is used (with minor variations) in DiT 1, SD3 / MMDiT 3, PixArt-$\alpha$ 11, SiT 8, and FLUX.

Variant — Adaptive RMSNorm: Some recent architectures (e.g., FLUX) replace LayerNorm with RMSNorm in the AdaLN formulation, dropping the mean-centering step for slight computational savings while retaining the adaptive scale-and-shift mechanism.
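The block computation and zero initialization can be sketched as follows (a simplified single-block module in the spirit of DiT; real implementations add multi-head configuration, dropout, and fused QKV projections):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaLNZeroBlock(nn.Module):
    """Minimal DiT-style AdaLN-Zero transformer block. The modulation
    projection is zero-initialized, so the block is the identity at init."""
    def __init__(self, dim: int, heads: int = 4, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.ffn = nn.Sequential(nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
                                 nn.Linear(mlp_ratio * dim, dim))
        self.mod = nn.Linear(dim, 6 * dim)
        nn.init.zeros_(self.mod.weight)  # zero init -> gamma = beta = alpha = 0
        nn.init.zeros_(self.mod.bias)

    def forward(self, x: torch.Tensor, e: torch.Tensor) -> torch.Tensor:
        # x: [B, N, D] tokens, e: [B, D] conditioning vector
        g1, b1, a1, g2, b2, a2 = self.mod(F.silu(e)).chunk(6, dim=-1)
        h = (1 + g1[:, None]) * self.norm1(x) + b1[:, None]
        x = x + a1[:, None] * self.attn(h, h, h, need_weights=False)[0]
        h = (1 + g2[:, None]) * self.norm2(x) + b2[:, None]
        return x + a2[:, None] * self.ffn(h)
```

At initialization, every residual branch is gated to zero, so the block passes its input through unchanged, which is exactly the identity behavior described above.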


4.3.4 Token-Based Injection

U-ViT 9 proposes an elegant alternative: treat the timestep embedding as an extra token in the input sequence:

\[\mathbf{Z} = [\mathbf{e}_t; \mathbf{z}_1, \mathbf{z}_2, \ldots, \mathbf{z}_N]\]

where $\mathbf{e}_t$ is the time embedding vector (same dimension as patch tokens $\mathbf{z}_i$) and the concatenated sequence is processed by standard Transformer blocks without any architectural modification. The time information is propagated through self-attention, allowing every patch token to attend to the time token and vice versa.

Properties:

  • Architecturally minimal: No AdaLN, no extra projections per block — just one additional token. This makes the implementation simpler and the architecture more uniform.
  • Flexible conditioning: By treating all conditioning as tokens (time, class, text), U-ViT uses a single unified mechanism.
  • Potential weakness: The conditioning is “soft” — the network must learn to extract the time information through attention weights, which may be less direct than AdaLN’s hard-coded modulation. DiT 1 found that AdaLN-Zero outperformed token-based conditioning (“in-context conditioning”) on ImageNet class-conditional generation.
  • U-ViT 9 adds a skip connection from the input time token to the output, analogous to UNet skip connections, to ensure the time signal is not diluted.
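The token-injection step itself is a one-line operation, sketched here (assuming the time embedding has already been projected to the patch-token dimension):

```python
import torch

def prepend_time_token(patch_tokens: torch.Tensor, time_emb: torch.Tensor) -> torch.Tensor:
    """U-ViT-style token injection: the time embedding becomes an extra token
    prepended to the patch sequence; no per-block modulation is needed."""
    # patch_tokens: [B, N, D], time_emb: [B, D]
    return torch.cat([time_emb[:, None, :], patch_tokens], dim=1)  # [B, N+1, D]
```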

4.3.5 Modulated Convolutions

Inspired by StyleGAN2 [^stylegan2], some diffusion models use weight modulation in convolutional layers:

\[w'_{ijk} = s_i \cdot w_{ijk}, \qquad s_i = f(\mathbf{e})_i\]

where $w_{ijk}$ are the base convolution weights, $s_i$ is a per-input-channel scale factor derived from the conditioning, and weight demodulation ensures unit variance of outputs. This approach modulates the convolution kernel itself rather than the features, providing a different inductive bias.

EDM 24 uses magnitude-preserving network operations combined with conditioning-dependent modulation. While not identical to StyleGAN2’s modulated convolutions, the principle is related: the conditioning signal controls the effective weights of the network.
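A sketch of StyleGAN2-style modulation with demodulation, written in the grouped-convolution form so each batch element gets its own modulated kernel (this follows the StyleGAN2 formulation, not any specific diffusion codebase):

```python
import torch
import torch.nn.functional as F

def modulated_conv2d(x: torch.Tensor, weight: torch.Tensor,
                     scale: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """x: [B, Cin, H, W]; weight: [Cout, Cin, k, k]; scale s = f(e): [B, Cin]."""
    B, Cin, H, W = x.shape
    Cout = weight.shape[0]
    # Modulate: scale each input channel of the kernel per batch element.
    w = weight[None] * scale[:, None, :, None, None]          # [B, Cout, Cin, k, k]
    # Demodulate: rescale so output activations have unit variance.
    demod = torch.rsqrt((w ** 2).sum(dim=(2, 3, 4)) + eps)    # [B, Cout]
    w = w * demod[:, :, None, None, None]
    # Grouped conv trick: fold the batch into the channel/group dimensions.
    w = w.reshape(B * Cout, Cin, *weight.shape[2:])
    out = F.conv2d(x.reshape(1, B * Cin, H, W), w,
                   padding=weight.shape[-1] // 2, groups=B)
    return out.reshape(B, Cout, H, W)
```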


4.3.6 Cross-Attention Conditioning

Cross-attention is the standard mechanism for injecting spatially-varying, sequence-valued conditioning such as text embeddings 2. For time conditioning specifically, it is less common because the timestep is a single global scalar, not a sequence. However, some architectures use cross-attention for time conditioning:

\[\mathrm{CrossAttn}(\mathbf{h}, \mathbf{e}) = \mathrm{softmax}\!\left( \frac{Q(\mathbf{h})K(\mathbf{e})^\top}{\sqrt{d}} \right)V(\mathbf{e})\]

DiT 1 explored cross-attention conditioning for class labels and found it underperformed AdaLN-Zero, likely because the single-vector conditioning does not benefit from the sequence-to-sequence nature of cross-attention.
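For completeness, a minimal sketch of this pattern (hypothetical `CrossAttnCondition` module; a global time or class embedding would enter as a length-1 conditioning sequence):

```python
import torch
import torch.nn as nn

class CrossAttnCondition(nn.Module):
    """Cross-attention conditioning: feature tokens query the conditioning
    sequence, and the result is added back through a residual connection."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, h: torch.Tensor, e: torch.Tensor) -> torch.Tensor:
        # h: [B, N, D] feature tokens; e: [B, M, D] conditioning tokens
        out, _ = self.attn(h, e, e, need_weights=False)
        return h + out
```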


4.4 Additional Considerations


4.4.1 Preconditioning and Its Relationship to Time Conditioning

EDM 24 introduces preconditioning: wrapping the raw network $F_\theta$ with noise-dependent input/output scaling:

\[D_\theta(\mathbf{x}; \sigma) = c_{\text{skip}}(\sigma)\,\mathbf{x} + c_{\text{out}}(\sigma)\, F_\theta\!\left(c_{\text{in}}(\sigma)\,\mathbf{x}; c_{\text{noise}}(\sigma)\right)\]

where $c_{\text{skip}}$, $c_{\text{out}}$, and $c_{\text{in}}$ are analytically derived scalar functions of $\sigma$ that normalize the magnitudes of the network’s input and target. This separates two roles of the noise level:

  1. Magnitude control $(c_{\text{skip}}, c_{\text{out}}, c_{\text{in}})$: Handled by the preconditioning wrapper, outside the network.
  2. Behavior control $(c_{\text{noise}})$: Handled by the embedding, inside the network.

By offloading the magnitude normalization to the wrapper, the internal embedding $c_{\text{noise}}$ only needs to communicate what kind of denoising to perform, not how much to scale the output. EDM2 25 extends this with magnitude-preserving layers throughout the architecture, further reducing the burden on the time embedding.
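The wrapper can be sketched directly from the equation above, using the analytic scalings reported in the EDM paper (with $\sigma_{\text{data}} = 0.5$ as in their image experiments; the inner network $F_\theta$ is left abstract here):

```python
import torch

def edm_precondition(F_theta, x: torch.Tensor, sigma: torch.Tensor,
                     sigma_data: float = 0.5) -> torch.Tensor:
    """EDM preconditioning wrapper: magnitude handling lives outside the
    network; F_theta only sees a normalized input and a noise embedding."""
    c_skip = sigma_data ** 2 / (sigma ** 2 + sigma_data ** 2)
    c_out = sigma * sigma_data / (sigma ** 2 + sigma_data ** 2).sqrt()
    c_in = 1.0 / (sigma ** 2 + sigma_data ** 2).sqrt()
    c_noise = sigma.log() / 4.0  # the embedding input seen by the network
    return c_skip * x + c_out * F_theta(c_in * x, c_noise)
```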


4.4.2 Timestep Sampling and Its Interaction with Embeddings

The distribution from which $t$ (or $\sigma$) is sampled during training affects which noise levels the embedding must handle most frequently:

  • Uniform sampling (DDPM 20): All timesteps equally likely.

    \[t \sim \mathrm{Uniform}\{1,\ldots,T\}\]
  • Log-normal sampling (EDM 24): Concentrates on “medium” noise levels.

    \[\ln \sigma \sim \mathcal{N}(P_{\text{mean}}, P_{\text{std}}^2)\]
  • Logit-normal sampling (SD3 3): Biases toward intermediate timesteps where the loss is highest, improving training efficiency for flow matching.

    \[t = \mathrm{sigmoid}(z), \qquad z \sim \mathcal{N}(\mu, s^2)\]
  • Importance sampling (Simple Diffusion 26): Samples $t$ proportionally to the loss magnitude.

The embedding quality matters most in the regions where $t$ is sampled most frequently. A log-uniform sampling strategy pairs naturally with a $\log \sigma$ parameterization, ensuring that the embedding’s “resolution” matches the training distribution.
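The EDM and SD3 samplers above are one-liners, sketched here with the default hyperparameters reported in the respective papers ($P_{\text{mean}} = -1.2$, $P_{\text{std}} = 1.2$ for EDM):

```python
import torch

def sample_sigma_lognormal(batch: int, p_mean: float = -1.2, p_std: float = 1.2) -> torch.Tensor:
    """EDM-style noise-level sampling: ln(sigma) ~ N(P_mean, P_std^2)."""
    return (p_mean + p_std * torch.randn(batch)).exp()

def sample_t_logit_normal(batch: int, mu: float = 0.0, s: float = 1.0) -> torch.Tensor:
    """SD3-style logit-normal timestep sampling: t = sigmoid(z), z ~ N(mu, s^2);
    concentrates samples at intermediate t."""
    return torch.sigmoid(mu + s * torch.randn(batch))
```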


4.4.3 Handling of Schedule Endpoints

The behavior at the boundaries $(t = 0 \text{ and } t = 1,\text{ or equivalently } \sigma = 0 \text{ and } \sigma = \infty)$ requires care:

  • Zero terminal SNR 30: Lin et al. show that many commonly used noise schedules (e.g., the original DDPM linear schedule) do not reach $\mathrm{SNR} = 0$ at $t = T$, meaning the model never sees pure noise. This causes artifacts at inference when sampling starts from $\mathcal{N}(0,\mathbf{I})$. Enforcing $\mathrm{SNR}(T)=0$ and adapting the training accordingly is important. The time embedding must be able to represent this boundary: $\log \mathrm{SNR} = -\infty$ and $\log \sigma = +\infty$ are problematic, motivating the use of bounded parameterizations or careful clamping.
  • Clean data boundary: At $t = 0$ $(\sigma = 0)$, the embedding should represent “no noise.” For flow matching models that predict the velocity at $t = 0$, the network output at this boundary defines the final denoising step. Consistency Models 31 place particular emphasis on the boundary condition at the clean-data end.

4.4.4 Embedding Dimension and Computational Cost

The time embedding is a tiny fraction of total computation. The sinusoidal encoding and two-layer MLP are applied once per forward pass (not per token or per pixel). The per-block projection (e.g., from $\mathbf{e} \in \mathbb{R}^D$ to $6D$ parameters in AdaLN-Zero) adds a modest overhead per block. In practice, the time conditioning pathway accounts for $< 1\%$ of total FLOPs.

Typical dimensions:

  • Positional encoding: $d = 256$
  • MLP hidden / output: $D =$ model width (e.g., 768, 1024, 1152)
  • Per-block AdaLN projection: $D \rightarrow 6D$ (DiT), $D \rightarrow 2C$ (ADM, per ResNet block)



5. Class Label Embeddings

Class labels are the simplest but still very important form of conditioning in diffusion models: they provide semantic control, improve sample fidelity (especially on class-conditional benchmarks like ImageNet), and are a key building block for more complex conditioning (attributes, tags, multi-label concepts). In the DiT paper, class-conditional training is central to the evaluation setup, and the paper explicitly notes that the conditioning mechanism itself materially affects quality (not just compute or parameter count).

Historically, class conditioning also became tightly connected to guidance methods. ADM-style class-conditional diffusion and later classifier(-free) guidance showed that conditioning can significantly improve sample quality and controllability. The ADM abstract explicitly highlights that sample quality can be strongly improved by conditioning and guidance, and DiT later confirms the same trend for transformer backbones.


5.1 Label Embedding and Injection

The first step in class-conditional diffusion modeling is to convert a discrete class index \(y \in \{0, 1, \ldots, K-1\}\) into a continuous vector representation suitable for neural network processing. This section surveys the principal embedding algorithms and injection mechanisms employed in state-of-the-art diffusion models.


5.1.1 Learnable Embedding Table + MLP Projection Head

The most widely adopted approach is a learnable embedding lookup table, mathematically equivalent to a linear projection of a one-hot encoded vector:

\[\mathbf{e}_y = \mathbf{W}_{\text{emb}}[y] \in \mathbb{R}^{D}\]

where \(\mathbf{W}_{\text{emb}} \in \mathbb{R}^{K \times D}\) is a learnable weight matrix, $K$ is the number of classes, and $D$ is the embedding dimensionality. Each row of \(\mathbf{W}_{\text{emb}}\) stores a $D$-dimensional embedding vector for the corresponding class.

class ClassEmbedder(nn.Module):
    def __init__(self, num_classes: int, embed_dim: int):
        super().__init__()
        self.embedding = nn.Embedding(num_classes, embed_dim)

    def forward(self, y: torch.LongTensor) -> torch.Tensor:
        # y: [B] integer class labels
        return self.embedding(y)  # [B, D]

In practice, the raw embedding vector $\mathbf{e}_y$ is rarely used directly. Instead, it is passed through a multi-layer perceptron (MLP) projection head to increase representational capacity and align the class embedding with the model’s internal feature space:

\[\mathbf{c}_y = \text{MLP}(\mathbf{e}_y) = \mathbf{W}_2 \cdot \sigma(\mathbf{W}_1 \cdot \mathbf{e}_y + \mathbf{b}_1) + \mathbf{b}_2\]

where $\sigma(\cdot)$ denotes a nonlinear activation function (commonly SiLU/Swish or GELU). The full class embedding pipeline is thus:

\[y \;\xrightarrow{\text{lookup}}\; \mathbf{e}_y \;\xrightarrow{\text{MLP}}\; \mathbf{c}_y \in \mathbb{R}^{D}\]
class LabelEmbedder(nn.Module):
    def __init__(self, num_classes: int, hidden_dim: int):
        super().__init__()
        self.embedding_table = nn.Embedding(num_classes, hidden_dim)
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )

    def forward(self, y: torch.LongTensor) -> torch.Tensor:
        emb = self.embedding_table(y)   # [B, D]
        return self.mlp(emb)            # [B, D]

The MLP head serves several purposes: (1) it introduces nonlinearity, enabling the model to learn complex mappings from label space to conditioning space; (2) it provides additional learnable parameters for refining the embedding; and (3) in many architectures, this MLP is shared with the timestep embedding pipeline (see Section 4), meaning the class embedding must be mapped into the same representation space as the timestep signal.


5.1.2 One-Hot Encoding with Linear Projection

An alternative, mathematically equivalent formulation explicitly constructs a one-hot vector and applies a linear transformation:

\[\mathbf{e}_y = \mathbf{W} \cdot \text{one\_hot}(y) + \mathbf{b}\]

While this produces identical results to nn.Embedding when no bias term is used, the one-hot formulation is sometimes preferred in settings where the label space is small or where the embedding is part of a larger input concatenation. Early diffusion models, including DDPM (Ho et al., 2020), used variants of this approach. In modern practice, nn.Embedding is universally preferred for its computational efficiency, as it avoids materializing the sparse one-hot vector.
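The equivalence is easy to verify directly (a small self-contained check using a shared weight matrix):

```python
import torch
import torch.nn.functional as F

# An embedding lookup equals a one-hot linear projection (no bias):
# both select a row of the same weight matrix.
K, D = 10, 16
W = torch.randn(K, D)
emb = torch.nn.Embedding.from_pretrained(W)

y = torch.tensor([0, 3, 9])
via_lookup = emb(y)                        # [3, D] row selection
via_onehot = F.one_hot(y, K).float() @ W   # [3, D] dense matmul
assert torch.allclose(via_lookup, via_onehot)
```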


5.1.3 Pretrained Encoder Embeddings

An alternative paradigm bypasses learned embedding tables entirely by leveraging pretrained encoders to derive class representations:

CLIP-based class embeddings. The class label is first converted to its textual name (e.g., $y = 207 \rightarrow$ "golden retriever"), and then encoded using a frozen pretrained text encoder such as CLIP (Radford et al., 2021):

\[\mathbf{c}_y = \text{CLIP}_{\text{text}}(\texttt{class\_name}(y))\]

This approach is used in models that unify class-conditional and text-conditional generation under a single architecture. For example, PixArt-$\alpha$ (Chen et al., 2024) encodes ImageNet class names through a T5 text encoder when performing class-conditional experiments. The advantages include access to rich pretrained semantic knowledge and the potential for zero-shot generalization to unseen class names. The disadvantages include inference overhead and the introduction of a dependency on an external model whose representation space may not be optimally aligned with the diffusion model’s internal features.

Learned prototype embeddings. Some approaches initialize class embeddings from pretrained feature extractors (e.g., class centroids from a pretrained classifier’s penultimate layer) and optionally fine-tune them. This provides a semantically informed initialization that can accelerate training convergence.


5.2 Design of Null Class Embeddings

There are several strategies for constructing the null class embedding \(\mathbf{c}_\varnothing\):

  • Dedicated Learnable Null Token. The most common approach allocates an additional entry in the embedding table to serve as the null class:

    \[\mathbf{c}_\varnothing = \mathbf{W}_{\text{emb}}[K] \in \mathbb{R}^D\]

    where index $K$ (beyond the valid class range $\{0, \ldots, K-1\}$) is reserved for the unconditional case. The null embedding is learned jointly with all class embeddings.

    class LabelEmbedder(nn.Module):
        def __init__(self, num_classes: int, hidden_dim: int, dropout_prob: float):
            super().__init__()
            use_cfg = dropout_prob > 0
            # +1 for the null class token if using CFG
            self.embedding_table = nn.Embedding(
                num_classes + int(use_cfg), hidden_dim
            )
            self.num_classes = num_classes
            self.dropout_prob = dropout_prob
    
        def token_drop(self, labels: torch.LongTensor, force_drop: bool = False
        ) -> torch.LongTensor:
            if force_drop:
                # Replace all labels with the null class index
                return torch.full_like(labels, self.num_classes)
            elif self.training and self.dropout_prob > 0:
                # Randomly replace labels with null class index
                drop_mask = torch.rand_like(labels, dtype=torch.float) < self.dropout_prob
                labels = torch.where(drop_mask, self.num_classes, labels)
            return labels
    
        def forward(self, labels: torch.LongTensor, force_drop: bool = False
        ) -> torch.Tensor:
            labels = self.token_drop(labels, force_drop)
            embeddings = self.embedding_table(labels)   # [B, D]
            return embeddings
    

    This is the approach used in DiT, SiT, and most Transformer-based diffusion models. The null embedding is learned to represent the “average” or “uninformative” conditioning state, allowing the model to learn a meaningful unconditional mode.

  • Fixed Zero Vector. A simpler alternative sets the null embedding to a fixed zero vector:

    \[\mathbf{c}_\varnothing = \mathbf{0} \in \mathbb{R}^D\]

    When the class embedding is injected via addition to the timestep embedding (\(\mathbf{c} = \mathbf{c}_t + \mathbf{c}_y\)), using a zero vector for \(\mathbf{c}_\varnothing\) effectively reduces the conditioning signal to the timestep alone. This is parameter-free and conceptually clean, but it constrains the model’s ability to learn an optimal unconditional representation, as the null state is fixed rather than learned.

  • Standalone Learnable Null Token. Some implementations define the null embedding as a standalone learnable parameter rather than an entry in the embedding table:

    self.null_class_emb = nn.Parameter(torch.randn(1, hidden_dim) * 0.02)
    

    This is functionally similar to the dedicated table entry approach but separates the null embedding from the class embedding table, which can be useful when the embedding table is frozen (e.g., when using pretrained embeddings).


5.2.1 Training with Label Dropout

To enable CFG, the model is trained with random label dropout: during each training step, the class label is replaced with the null token with probability \(p_{\text{uncond}}\):

\[y_{\text{train}} = \begin{cases} y & \text{with probability } 1 - p_{\text{uncond}} \\[10pt] \varnothing & \text{with probability } p_{\text{uncond}} \end{cases}\]

Typical values of \(p_{\text{uncond}}\) range from 0.1 to 0.2. DiT uses \(p_{\text{uncond}} = 0.1\); ADM uses \(p_{\text{uncond}} = 0.2\). This dropout ratio represents a trade-off: higher dropout rates improve the quality of the unconditional model (and thus the effectiveness of CFG at inference time) but reduce the fraction of training steps that learn the conditional distribution.


5.2.2 Inference: CFG with Null Embeddings

At inference time, each denoising step requires two forward passes (or a single batched forward pass with doubled batch size):

def cfg_denoise(model, x_t, t, y, guidance_scale):
    # Conditional prediction
    eps_cond = model(x_t, t, y, force_drop=False)
    # Unconditional prediction (using null embedding)
    eps_uncond = model(x_t, t, y, force_drop=True)
    # Guided prediction
    eps_guided = eps_uncond + guidance_scale * (eps_cond - eps_uncond)
    return eps_guided

The guidance scale $w$ is a critical hyperparameter that controls the trade-off between sample quality (fidelity to the class) and sample diversity. For ImageNet generation, typical values are $w \in [1.5, 4.0]$, with DiT-XL/2 achieving its best FID at $w = 1.50$.


5.3 Multi-Label and Attribute Embedding Design

While standard class-conditional generation assumes a single categorical label per sample, many real-world scenarios involve multi-label classification (multiple applicable classes per sample) or attribute-based conditioning (structured metadata such as color, shape, or style). This section discusses embedding designs for these more complex label spaces.


5.3.1 Multi-Label Embedding via Summation

When a sample is associated with multiple labels \(\{y_1, y_2, \ldots, y_M\}\) (where $M$ varies per sample), a natural approach is to embed each label independently and aggregate via summation or averaging:

\[\mathbf{c}_{\text{multi}} = \frac{1}{M} \sum_{m=1}^{M} \mathbf{e}_{y_m}\]

This approach preserves permutation invariance over the label set and is analogous to bag-of-words representations in NLP.

class MultiLabelEmbedder(nn.Module):
    def __init__(self, num_classes: int, embed_dim: int):
        super().__init__()
        self.embedding = nn.Embedding(num_classes, embed_dim)

    def forward(self, labels: torch.LongTensor, mask: torch.BoolTensor
    ) -> torch.Tensor:
        # labels: [B, max_labels] padded label indices
        # mask:   [B, max_labels] True for valid labels
        emb = self.embedding(labels)                    # [B, max_labels, D]
        emb = emb * mask.unsqueeze(-1).float()          # zero out padding
        count = mask.sum(dim=1, keepdim=True).clamp(min=1)
        return emb.sum(dim=1) / count                   # [B, D] mean pooling

Limitations. Simple summation/averaging treats all labels equally and cannot capture inter-label relationships or compositional semantics (e.g., “red” + “car” should differ from the sum of “red” and “car” independently). For such cases, more sophisticated aggregation is needed.


5.3.2 Multi-Label Embedding via Attention Pooling

A more expressive alternative uses a learnable query that attends over the set of label embeddings:

\[\mathbf{c}_{\text{multi}} = \text{Attention}(\mathbf{q}, \mathbf{E}, \mathbf{E})\]

where \(\mathbf{q} \in \mathbb{R}^{1 \times D}\) is a learnable query token and \(\mathbf{E} = [\mathbf{e}_{y_1}; \ldots; \mathbf{e}_{y_M}] \in \mathbb{R}^{M \times D}\) is the matrix of individual label embeddings.

class AttentionPoolMultiLabel(nn.Module):
    def __init__(self, embed_dim: int, num_heads: int = 4):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, embed_dim))
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

    def forward(self, label_embs: torch.Tensor, mask: torch.BoolTensor
    ) -> torch.Tensor:
        # label_embs: [B, M, D]
        # mask: [B, M]
        query = self.query.expand(label_embs.size(0), -1, -1)  # [B, 1, D]
        key_padding_mask = ~mask  # True = ignore
        out, _ = self.attn(query, label_embs, label_embs,
                           key_padding_mask=key_padding_mask)
        return out.squeeze(1)  # [B, D]

This mechanism allows the model to learn which labels are most informative and how they interact, producing a richer composite condition vector.


5.3.3 CFG for Multi-Label and Attribute Settings

Extending classifier-free guidance to multi-label settings requires careful design of the null conditioning state. Options include:

  • Full dropout: Replace the entire label set or attribute vector with a null token, equivalent to standard single-label CFG.
  • Per-label dropout: Independently drop each label with probability $p_{\text{uncond}}$, enabling fine-grained attribute-level guidance at inference time.
  • Compositional guidance: Compute guidance independently for each attribute and combine:
\[\hat{\boldsymbol{\epsilon}} = \boldsymbol{\epsilon}_\varnothing + \sum_{a=1}^{A} w_a \cdot \bigl[\boldsymbol{\epsilon}_{v_a} - \boldsymbol{\epsilon}_\varnothing\bigr]\]

where $w_a$ controls the guidance strength for each attribute independently. This enables fine-grained compositional control (e.g., increasing “smiling” while decreasing “eyeglasses”).
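The compositional combination above can be sketched as a small helper (hypothetical function; `eps_attrs` holds the per-attribute conditional predictions):

```python
import torch

def compositional_guidance(eps_null: torch.Tensor, eps_attrs, weights) -> torch.Tensor:
    """Combine per-attribute guidance directions, each with its own strength
    w_a, around the unconditional prediction eps_null."""
    out = eps_null.clone()
    for eps_a, w_a in zip(eps_attrs, weights):
        out = out + w_a * (eps_a - eps_null)
    return out
```

With a single attribute and weight $w$, this reduces exactly to standard CFG.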


6. Text Conditioning Embeddings

Text conditioning is the primary interface between human intent and the generative process in modern diffusion and flow matching models. The quality, semantic fidelity, compositionality, and controllability of generated outputs are all critically dependent on how textual descriptions are encoded and injected into the denoising network. A poorly conditioned model cannot distinguish “a cat sitting on a dog” from “a dog sitting on a cat,” no matter how powerful its backbone architecture might be.

Historically, early diffusion models 20 21 were unconditional or class-conditional. The landmark works of GLIDE 32, DALL·E 2 33, Imagen 27, and Latent Diffusion Models (LDM) 2 established text-to-image generation as a core paradigm by demonstrating that frozen, pretrained text encoders could provide powerful conditioning signals. Since then, the choice of text encoder, the granularity of the extracted embeddings, the strategy for combining multiple encoders, and the mechanism for injecting these embeddings into the network have all become key design axes that separate state-of-the-art systems from mediocre ones.

A pivotal insight from Imagen 27 was that scaling the text encoder yields larger quality gains than scaling the image generation backbone itself. This finding has been repeatedly validated and has driven the field toward ever-larger, more capable text encoders — from CLIP ViT-L (~125M parameters) to T5-XXL (~4.7B) and, most recently, to multi-billion-parameter LLMs.

This chapter provides a systematic treatment of text conditioning embeddings across major models from 2022 to 2025, covering encoder choices, embedding granularity, multi-encoder strategies, practical considerations around sequence length and masking, negative prompting, prompt weighting, and conditioning injection mechanisms.


6.1 Text Encoder Choices

The text encoder is the front-end of any text-conditioned generative model. It maps a raw text string to a set of dense vector representations. The choice of encoder determines the semantic richness, vocabulary, multilingual capacity, and maximum prompt length available to the downstream diffusion or flow matching model. Three families dominate the literature.


6.1.1 CLIP-Family Encoders

CLIP (Contrastive Language–Image Pretraining) 34 trains a text encoder and an image encoder jointly on hundreds of millions of image–text pairs using a contrastive objective. The text encoder is a causal Transformer that produces both token-level hidden states and a pooled embedding (the hidden state at the [EOS] token, projected into the shared embedding space).

CLIP’s key strength for generative models is that its representation space is visually grounded: text embeddings are trained to be close to image embeddings of corresponding images. This makes them naturally suited for conditioning image generation. However, CLIP’s contrastive training objective focuses on global image–text alignment, which can limit its capacity for fine-grained compositional understanding (e.g., spatial relationships, counting, attribute binding).

Several CLIP variants have been used in practice:

| Variant | Params (text) | Embed Dim | Used In |
|---|---|---|---|
| CLIP ViT-L/14 (OpenAI) | ~125M | 768 | SD 1.x, SD 3, FLUX.1 |
| CLIP ViT-H/14 (OpenCLIP) | ~354M | 1024 | SD 2.x |
| CLIP ViT-bigG/14 (OpenCLIP) | ~695M | 1280 | SDXL, SD 3 |

Stable Diffusion 1.x 2 uses the OpenAI CLIP ViT-L/14 text encoder, extracting the last hidden state as a sequence of 77 token embeddings with dimension 768 and feeding them as keys and values into cross-attention layers of the U-Net.

Stable Diffusion 2.x switched to OpenCLIP ViT-H/14, using the penultimate layer hidden states (a design choice discussed in Section 2.3). This change was motivated by the desire for an open-weight encoder with strong performance.

SDXL 29 was the first major model to use two CLIP encoders simultaneously (OpenCLIP ViT-bigG/14 and OpenAI CLIP ViT-L/14), concatenating their token-level outputs along the channel dimension.

CLIP text encoders share a common limitation: a fixed context window of 77 tokens (including the [BOS] and [EOS] tokens), dictated by their absolute positional embeddings. This severely constrains the length of prompts that can be processed without workarounds.


6.1.2 T5-Family Encoders

T5 (Text-to-Text Transfer Transformer) 35 is an encoder–decoder language model pretrained on massive text corpora with a span-corruption objective. For conditioning generative models, only the encoder portion is used. Unlike CLIP, T5 is trained purely on text without any visual grounding, but it possesses far richer linguistic understanding — it excels at capturing compositional semantics, long-range dependencies, and complex attribute–object relationships.

Imagen 27 was the first major work to advocate for T5 as a text encoder for diffusion models. The authors systematically compared CLIP and T5 variants and found that T5-XXL (4.7B parameters) significantly outperformed all CLIP variants, especially on compositionally complex prompts (e.g., prompts from DrawBench). This was a watershed result: it showed that language understanding ability, not visual grounding, was the bottleneck for text-conditioned generation.

Key T5 variants used in generative models:

| Variant | Params | Hidden Dim | Max Tokens | Used In |
|---|---|---|---|---|
| T5-XL | 3B | 2048 | 512 | eDiff-I 36 |
| T5-XXL | 4.7B | 4096 | 512 | Imagen 27, PixArt-α 37, SD 3 3, FLUX.1 4 |

PixArt-α 37 demonstrated that a T5-XXL–conditioned DiT could achieve competitive quality with far less training compute than Imagen or DALL·E 2, partly because the rich T5 embeddings reduce the burden on the generative backbone to learn language understanding. PixArt-Σ 38 further extended the T5 context window to 300 tokens to support longer, more detailed prompts.

T5 encoders produce token-level embeddings but do not have a natural pooled embedding in the same sense as CLIP (no contrastive [EOS] projection). When a global embedding is needed (e.g., for adaLN conditioning), a separate mechanism must be used — either mean pooling, a learned projection, or reliance on a companion CLIP encoder for the pooled vector.
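Masked mean pooling, one of the fallback mechanisms mentioned above, can be sketched as follows (padding positions must be excluded, since T5 outputs at padded positions are not meaningful):

```python
import torch

def masked_mean_pool(hidden: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Mean-pool token embeddings over valid (non-padding) positions — one
    simple way to get a global vector from an encoder with no native pooled
    output. hidden: [B, L, D]; mask: [B, L] with 1 = valid token."""
    m = mask.unsqueeze(-1).float()
    return (hidden * m).sum(dim=1) / m.sum(dim=1).clamp(min=1e-6)
```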

A practical consideration: T5-XXL’s 4.7B parameters impose a significant VRAM cost (~10 GB in float16). Several models (e.g., SD 3, FLUX.1 distilled variants) explore strategies to optionally drop the T5 encoder at inference time, relying solely on CLIP when T5 is unavailable.


6.1.3 LLM-Based Text Encoders

Starting in 2024, a clear trend emerged: using the hidden states of a general-purpose large language model as the text encoder. The rationale is straightforward — modern LLMs possess world knowledge, reasoning ability, and compositional language understanding that far exceeds both CLIP and T5.

| Model | Text Encoder | Params | Key Detail |
|---|---|---|---|
| Hunyuan-DiT 39 | mT5-XXL + bilingual CLIP | 4.7B + CLIP | Multilingual (Chinese/English) |
| Kolors 40 | ChatGLM-6B | 6B | Chinese-optimized LLM |
| Lumina-T2X 18 | Gemma-2B | 2B | Decoder-only LLM encoder |
| Sana 41 | Gemma-2-2B | 2B | Decoder-only, small & efficient |
| DALL·E 3 42 | T5-XXL (+ CLIP reranking) | 4.7B | Improved via recaptioning |

A key design question when using decoder-only LLMs as text encoders is how to extract useful embeddings from an architecture designed for next-token prediction rather than bidirectional encoding:

  • Last hidden states from all tokens: Treat the LLM like T5 — take the hidden states at every token position from a chosen layer and use them as the token-level embedding sequence. However, because decoder-only LLMs use causal attention, early tokens do not attend to later tokens, which means the representation at position i only encodes information from tokens 0 through i. This is fundamentally different from the bidirectional attention in T5.

  • Bidirectional attention injection: Some works (e.g., Sana 41) modify the LLM’s attention mask to be bidirectional during encoding, allowing all positions to attend to all others. This breaks the causal assumption but produces richer token-level representations for conditioning.

  • Last-token pooling: For decoder-only models, the last token’s hidden state is the most informationally rich (it has attended to all preceding tokens), making it a natural candidate for a pooled embedding.
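A minimal numpy sketch of last-token pooling under an attention mask (shapes are hypothetical; in practice the hidden states come from a chosen layer of the LLM):

```python
import numpy as np

def last_token_pool(hidden_states, attention_mask):
    """For each sequence in the batch, return the hidden state of its last
    non-padding token (the position that has attended to the whole prompt)."""
    # hidden_states: (batch, seq_len, dim); attention_mask: (batch, seq_len) of 0/1
    last_idx = attention_mask.sum(axis=1) - 1  # index of the last real token
    return hidden_states[np.arange(hidden_states.shape[0]), last_idx]

# Toy example: batch of 2, seq_len 4, dim 3; the second sequence has 1 pad position.
h = np.arange(2 * 4 * 3, dtype=np.float32).reshape(2, 4, 3)
mask = np.array([[1, 1, 1, 1], [1, 1, 1, 0]])
pooled = last_token_pool(h, mask)  # picks h[0, 3] and h[1, 2]
```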

Lumina-T2X 18 uses Gemma-2B and demonstrated that even a 2B-parameter decoder-only LLM can serve as an effective text encoder when combined with appropriate flag tokens and attention modifications. Sana 41 further validated Gemma-2-2B as an efficient text encoder, showing that it can replace the much larger T5-XXL while maintaining generation quality and significantly reducing memory footprint.


6.1.4 Summary and Trade-offs

| Criterion | CLIP | T5-XXL | LLM (e.g., Gemma, LLaMA) |
|---|---|---|---|
| Visual grounding | ✅ Strong (contrastive) | ❌ None | ❌ None (unless CLIP-augmented) |
| Compositional semantics | ⚠️ Limited | ✅ Strong | ✅ Very strong |
| Pooled global embedding | ✅ Native | ❌ Not native | ⚠️ Last-token or mean-pool |
| Max sequence length | 77 tokens | 512 tokens | 2048+ tokens |
| Multilingual | ⚠️ Weak | ✅ (mT5) | ✅ (multilingual LLMs) |
| Parameter cost | Low (~125M–695M) | High (~4.7B) | High (~2B–8B+) |

The field has converged on a consensus: no single encoder is optimal alone. CLIP provides visually aligned global semantics and a clean pooled embedding; T5/LLMs provide rich compositional token-level features. The best current systems (SD 3 3, FLUX.1 4) combine both families, as discussed in Section 6.3.


6.2 Granularity of Text Conditioning Embeddings

A text encoder produces representations at multiple levels of granularity. How these representations are extracted and used has significant implications for the model’s ability to faithfully render prompt details.


6.2.1 Token-Level (Sequence) Embeddings

Token-level embeddings are the hidden states at each token position from a chosen layer of the text encoder. For a prompt tokenized into L tokens, the text encoder produces a matrix \(\mathbf{C} \in \mathbb{R}^{L \times d}\), where $d$ is the hidden dimension.

These embeddings carry local, per-token semantic information and are the primary vehicle for fine-grained conditioning. In U-Net–based architectures (SD 1.x/2.x, SDXL), they serve as keys and values in cross-attention 2:

\[\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V\]

where $Q$ comes from the image features and \(K = W_K \mathbf{C}\), \(V = W_V \mathbf{C}\) come from the text token embeddings. This mechanism allows each spatial location in the image to attend selectively to relevant words in the prompt.
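A single-head, unbatched numpy sketch of this cross-attention computation (dimensions are hypothetical; real models use multi-head attention with learned output projections):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(img_feats, text_emb, W_Q, W_K, W_V):
    # img_feats: (N, d_img) spatial features; text_emb: (L, d_txt) token embeddings
    Q = img_feats @ W_Q                           # queries from the image
    K = text_emb @ W_K                            # keys from the text tokens
    V = text_emb @ W_V                            # values from the text tokens
    A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))   # (N, L): per-location attention over words
    return A @ V

rng = np.random.default_rng(0)
N, L, d_img, d_txt, d_k = 16, 8, 32, 24, 32
out = cross_attention(rng.standard_normal((N, d_img)),
                      rng.standard_normal((L, d_txt)),
                      rng.standard_normal((d_img, d_k)),
                      rng.standard_normal((d_txt, d_k)),
                      rng.standard_normal((d_txt, d_k)))
```

Each of the `N` image positions mixes text values according to its own attention distribution over the `L` prompt tokens.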

In Transformer-based architectures (DiT 1, MMDiT 3), token-level text embeddings are either used in cross-attention layers or concatenated with image patch tokens along the sequence dimension for joint self-attention.


6.2.2 Pooled (Global) Embeddings

Pooled embeddings are a single vector $\mathbf{c}_\text{pool} \in \mathbb{R}^{d_p}$ summarizing the entire prompt. In CLIP 34, this is the hidden state at the [EOS] token position, projected through a learned linear layer into the contrastive embedding space. It captures the overall semantic meaning of the prompt but discards fine-grained token-level detail.

Pooled embeddings are typically used for global conditioning via mechanisms that do not have a sequence dimension:

  • Timestep conditioning: The pooled embedding is concatenated with (or added to) the timestep embedding and passed through an MLP to produce parameters for adaptive layer normalization (adaLN) 1.
  • Vector conditioning: In SDXL 29, the pooled text embedding is concatenated with other conditioning vectors (original image size, crop coordinates, target size) and projected through a shared MLP.

In SD 3 3 and FLUX.1 4, the pooled CLIP embedding is combined with the sinusoidal timestep embedding and fed into adaLN-Zero or adaLN-Single layers to modulate both the text and image streams. This provides a coarse, global semantic “context” that influences every layer of the network.

T5 and LLM encoders typically do not provide a natural pooled embedding. When a pooled embedding is needed alongside T5, it is supplied by a companion CLIP encoder. This is exactly the approach taken by SD 3 and FLUX.1.


6.2.3 Layer Selection: Last vs. Penultimate

Not all layers of a text encoder are equally useful. The final layer of a CLIP text encoder is optimized to produce features for the contrastive loss — these features are highly compressed and specialized for global image–text matching. The penultimate layer (second-to-last) often retains richer, more diverse token-level representations that have not yet been “squeezed” through the final contrastive projection.

  • SD 1.x 2: Uses the last hidden layer of CLIP ViT-L/14.
  • SD 2.x: Switches to the penultimate layer of OpenCLIP ViT-H/14.
  • SDXL 29: Uses the penultimate layer for both CLIP encoders.
  • SD 3 3 and FLUX.1 4: Also use penultimate-layer features from CLIP encoders.

For T5 encoders, the last hidden layer of the encoder is used almost universally, as T5’s span-corruption pretraining objective does not create the same “compression bottleneck” at the final layer that CLIP’s contrastive loss does.

For LLM-based encoders, the optimal layer varies. Some works extract hidden states from an intermediate layer (e.g., layer 24 of a 32-layer model) rather than the final layer, as the final layers of decoder-only LLMs are increasingly specialized for next-token prediction logits.


6.2.4 Hybrid Granularity: Token-Level + Pooled

Modern high-performance models use both granularity levels simultaneously:

  • Token-level embeddings → cross-attention or joint-attention for spatially-resolved, fine-grained conditioning.
  • Pooled embeddings → adaLN or vector conditioning for global semantic context.

This dual-granularity approach first appeared prominently in SDXL 29 and has become standard in SD 3 3, FLUX.1 4, and most subsequent models. The rationale is clear: pooled embeddings efficiently convey the global “theme” of the prompt (e.g., art style, overall scene type), while token-level embeddings provide the per-concept details needed for accurate composition.


6.3 Dual-Encoder and Multi-Stream Text Conditioning

A defining trend from 2023 onward is the use of multiple text encoders in a single model, with each encoder contributing complementary capabilities.


6.3.1 SDXL: Dual CLIP Encoders

SDXL 29 was the first widely adopted model to use two text encoders:

  • OpenCLIP ViT-bigG/14: produces token embeddings of shape $77 \times 1280$ and a pooled embedding of dimension 1280.
  • OpenAI CLIP ViT-L/14: produces token embeddings of shape $77 \times 768$ and a pooled embedding of dimension 768.

Token-level fusion: The token embeddings from both encoders are concatenated along the channel dimension to produce a combined sequence of shape $77 \times 2048$. This concatenated embedding is projected and used as keys/values in cross-attention.

Pooled fusion: The two pooled embeddings are concatenated into a vector of dimension $1280 + 768 = 2048$, which is further concatenated with micro-conditioning vectors (original size, crop position, target size) and processed through an MLP.
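A shape-level sketch of the two fusion paths, following the dimensions above (the micro-conditioning values are shown here as raw scalars for simplicity; SDXL actually embeds each value with Fourier features before concatenation):

```python
import numpy as np

L = 77
tok_G = np.zeros((L, 1280))   # OpenCLIP ViT-bigG token embeddings
tok_L = np.zeros((L, 768))    # OpenAI CLIP ViT-L token embeddings
# Token-level fusion: channel-wise concatenation -> (77, 2048)
tokens = np.concatenate([tok_G, tok_L], axis=-1)

pool_G, pool_L = np.zeros(1280), np.zeros(768)
micro = np.zeros(6)           # original size, crop position, target size (2 values each)
# Pooled fusion: concatenate pooled vectors with micro-conditioning
vec = np.concatenate([pool_G, pool_L, micro])
```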

The dual-CLIP approach leverages the complementary representations of two encoders trained on different data distributions (OpenAI CLIP’s curated dataset vs. OpenCLIP’s LAION-based training), improving prompt adherence across a wider range of concepts.


6.3.2 Stable Diffusion 3: Triple Encoder

Stable Diffusion 3 3 takes the multi-encoder approach to its logical extreme by using three text encoders:

  1. CLIP ViT-L/14 (OpenAI): 77 × 768 tokens, 768-dim pooled
  2. CLIP ViT-bigG/14 (OpenCLIP): 77 × 1280 tokens, 1280-dim pooled
  3. T5-XXL: up to 256 × 4096 tokens (some configs extend to 512)

Token-level fusion in MMDiT: The two CLIP token sequences are concatenated along the channel dimension ($768 + 1280 = 2048$) and zero-padded up to the T5 hidden dimension, yielding a $77 \times 4096$ sequence. This is then concatenated along the sequence dimension with the T5 token sequence (up to 256 tokens × 4096 dims) to form the full text conditioning sequence, which is linearly projected to the model width and participates in joint self-attention with image patch tokens in the MMDiT blocks.
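A shape-level numpy sketch of the token fusion (the exact ordering of padding vs. concatenation and the final projection are implementation details, but the resulting shapes are as described):

```python
import numpy as np

clip_l = np.zeros((77, 768))    # CLIP ViT-L token embeddings
clip_g = np.zeros((77, 1280))   # CLIP ViT-bigG token embeddings
t5     = np.zeros((256, 4096))  # T5-XXL token embeddings

clip_cat = np.concatenate([clip_l, clip_g], axis=-1)      # channel-wise -> (77, 2048)
clip_pad = np.pad(clip_cat, ((0, 0), (0, 4096 - 2048)))   # zero-pad channels -> (77, 4096)
ctx = np.concatenate([clip_pad, t5], axis=0)              # sequence-wise -> (333, 4096)
```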

Pooled embedding: The CLIP-L pooled embedding (768-d) and CLIP-G pooled embedding (1280-d) are concatenated to form a 2048-d vector. This is combined with the timestep embedding and used to compute adaLN modulation parameters. Note: T5-XXL does not contribute a pooled embedding — it provides only the token-level sequence.

A notable finding from the SD 3 paper: dropping the T5 encoder at inference time causes a moderate quality decrease (primarily in complex compositional prompts and text rendering) but the model remains functional using only the two CLIP encoders. This enables a memory-efficient deployment mode.


6.3.3 FLUX.1: Dual Encoder

FLUX.1 4 simplifies SD 3’s triple-encoder setup to two encoders:

  1. CLIP ViT-L/14: provides a pooled embedding (768-d) used with the timestep for adaLN modulation.
  2. T5-XXL: provides token-level embeddings (up to 512 × 4096), linearly projected and used as the text token sequence in joint attention.

FLUX.1’s architecture uses a hybrid of double-stream (MMDiT-style) blocks and single-stream (DiT) blocks. In double-stream blocks, text and image tokens have separate parameter sets for attention projections but share the attention computation (joint self-attention). In single-stream blocks, text and image tokens are simply concatenated and processed through a standard transformer block.

This design cleanly separates the roles: CLIP provides the global conditioning vector, and T5 provides the detailed conditioning sequence. The removal of the second CLIP (CLIP-G) relative to SD 3 simplifies the pipeline without significant quality loss, suggesting that T5’s rich token-level features subsume much of what CLIP-G’s token embeddings provided.


6.3.4 Hunyuan-DiT: Bilingual Multi-Encoder

Hunyuan-DiT 39 targets bilingual Chinese–English generation and uses:

  • mT5-XXL: a multilingual T5 variant for token-level embeddings.
  • A bilingual CLIP model: fine-tuned for both Chinese and English, providing pooled embeddings.

This illustrates how multi-encoder strategies naturally accommodate multilingual requirements: the multilingual T5 provides token-level understanding in both languages, while the bilingual CLIP provides globally aligned visual–semantic embeddings.


6.3.5 Design Principles for Multi-Encoder Systems

Across these models, several principles emerge:

  1. Semantic complementarity: CLIP encoders provide visually grounded, globally aligned representations; T5/LLM encoders provide rich compositional and linguistic representations. Combining them covers both axes.
  2. Granularity separation: Contrastive encoders (CLIP) naturally produce strong pooled embeddings for global conditioning. Generative/masked-language encoders (T5, LLMs) naturally produce strong token-level embeddings for local conditioning.
  3. Graceful degradation: Multi-encoder systems can be designed to remain functional when one encoder is dropped (e.g., SD 3 without T5 3), enabling flexible deployment across hardware tiers.
  4. Fusion strategy: Token-level embeddings from different encoders are fused either by channel-wise concatenation (same sequence length, different feature dims) or by sequence-wise concatenation (different sequence lengths, same feature dim after projection).

6.4 Sequence Length, Padding, and Attention Mask Effects

In this section, we discuss several practical factors that shape text embeddings: maximum sequence length, padding strategy, and attention masking.


6.4.1 Sequence Length Constraints

Each text encoder imposes a maximum sequence length:

  • CLIP text encoders 34: Hard limit of 77 tokens (75 usable tokens + [BOS] + [EOS]), due to fixed absolute positional embeddings. This is a fundamental architectural constraint.
  • T5-XXL 35: Maximum of 512 tokens (determined by the relative positional bias range used during pretraining), though models typically use shorter limits (128, 256, or 512) for efficiency.
  • LLM-based encoders: Typically support 2048+ tokens, but in practice, generative models cap the text input length for computational reasons (e.g., 256 or 512 tokens).

The 77-token CLIP limit has been a persistent practical limitation. For complex, detailed prompts — especially those describing multiple objects, spatial relationships, styles, and negative constraints — 75 usable tokens are often insufficient.


6.4.2 Long-Prompt Workarounds for CLIP

Several community-driven and research-driven workarounds exist:

  • Prompt truncation: The simplest approach — tokens beyond position 77 are simply dropped. This is the default behavior in most pipelines and silently loses information.

  • Chunked encoding: Pioneered in community tools, the prompt is split into 77-token chunks, each encoded independently, and the resulting embedding sequences are concatenated. While pragmatic, each chunk lacks cross-chunk attention, so inter-chunk semantic coherence is limited.

  • Reliance on T5/LLM: In models with a secondary T5 or LLM encoder (SD 3 3, FLUX.1 4), the CLIP encoder handles the first 77 tokens while the T5/LLM encoder processes the full prompt. The long-range compositional burden falls on T5/LLM.

For T5-based models, PixArt-Σ 38 extended the effective sequence length to 300 tokens and demonstrated significant improvement in generation quality for detailed prompts.


6.4.3 Padding Strategy

When a prompt is shorter than the maximum sequence length, the remaining positions must be padded. The padding strategy interacts with the attention mechanism:

  • Zero-padding with attention mask: Pad the embedding sequence with zero vectors and apply a binary attention mask in cross-attention or joint-attention layers so that image features do not attend to padding positions. This is the correct approach and is used in most well-implemented systems.

  • Zero-padding without attention mask: Pad with zero vectors but do not mask. The attention softmax will still assign some weight to padding positions: a zero embedding yields a key $k = W_K \mathbf{0} = \mathbf{0}$ (up to bias), so the logit is $q \cdot \mathbf{0} / \sqrt{d_k} = 0$ and $\exp(0) = 1$, not 0. This introduces a mild bias — padding tokens act as a learnable “background” signal. Some early implementations (including early Stable Diffusion code) did not properly mask padding tokens. Models trained this way learn to be robust to this bias, but it is technically suboptimal.

  • [PAD] token embedding + mask: Some tokenizers have an explicit padding token whose learned embedding is used for padding positions, combined with an attention mask. This is standard for T5 35.
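A minimal masked-softmax sketch showing that masked padding positions receive (numerically) zero attention weight:

```python
import numpy as np

def masked_softmax(logits, mask):
    """Softmax over the last axis with masked positions (mask == 0) excluded."""
    logits = np.where(mask.astype(bool), logits, -1e9)  # large negative -> ~0 weight
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# One image query attending over 5 text positions, the last 2 of which are padding.
logits = np.zeros((1, 5))
mask = np.array([[1, 1, 1, 0, 0]])
w = masked_softmax(logits, mask)
# The three real tokens share the weight; padding tokens contribute nothing.
```

Without the mask, the same uniform logits would give each padding position a weight of 1/5, illustrating the “background signal” bias described above.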


6.4.4 Attention Mask Effects

The attention mask determines which text positions the model can attend to. Its effects are subtle but significant:

  • Proper masking prevents information leakage from padding, ensuring that the model’s behavior is invariant to prompt length padding.
  • In joint-attention architectures (MMDiT 3, FLUX.1 4), the attention mask must be carefully constructed to handle the asymmetry between text and image tokens. Image tokens should attend to all image tokens and all non-padding text tokens. Text tokens should attend to all non-padding text tokens and all image tokens. Padding text positions should be masked out entirely.
  • In classifier-free guidance 43 (see Section 5), the unconditional forward pass often uses a fully padded text sequence with appropriate masking, or a special null embedding.
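One plausible construction of the joint-attention mask described above (a boolean sketch; real implementations typically build additive float masks and also discard the outputs at padding positions):

```python
import numpy as np

def joint_attention_mask(n_img, text_mask):
    """Boolean (n_img + L, n_img + L) mask for joint self-attention over
    [image tokens; text tokens]: every query may attend to all image tokens
    and all non-padding text tokens; padding text columns are masked out."""
    L = text_mask.shape[0]
    keys_valid = np.concatenate([np.ones(n_img, dtype=bool), text_mask.astype(bool)])
    # Each query row attends to exactly the valid key columns.
    return np.broadcast_to(keys_valid, (n_img + L, n_img + L)).copy()

m = joint_attention_mask(4, np.array([1, 1, 0]))  # 4 image tokens, 1 padded text token
```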

In practice, incorrect attention masking is a common source of subtle quality degradation, especially when prompt lengths vary significantly within a batch during training.


6.5 Negative Prompting and Null Prompt Embeddings


6.5.1 Classifier-Free Guidance and the Null Embedding

Classifier-Free Guidance (CFG) 43 is the standard technique for strengthening text conditioning at inference time. During training, the text condition is randomly replaced with a null condition (typically with probability 5–20%). At inference, the model predicts both the conditional output \(\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, \mathbf{c})\) and the unconditional output \(\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, \varnothing)\), and the final prediction is:

\[\tilde{\boldsymbol{\epsilon}} = \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, \varnothing) + w \cdot \left[\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, \mathbf{c}) - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, \varnothing)\right]\]

where $w > 1$ is the guidance scale. The null embedding $\varnothing$ can be implemented as:

  • Zero vector: All text embeddings set to zero. Simple but may not align with the learned embedding distribution.
  • Empty string embedding: The text encoder encodes an empty string "", producing [BOS][EOS] followed by padding. This is the most common approach and ensures the null embedding lies within the natural manifold of the encoder.
  • Learned null embedding: A trainable parameter that is optimized during training to represent “no text condition.” Used in some GLIDE 32 and Imagen 27 variants.
  • Dropped embedding with mask: The text sequence is replaced with a zero-length sequence and the attention mask is set to all zeros, effectively disabling cross-attention.

The choice of null embedding is not merely an implementation detail — it defines the anchor point from which classifier-free guidance extrapolates. The quality of the null embedding directly affects the direction and stability of the guidance vector.


6.5.2 Negative Prompting

Negative prompting is a user-facing technique that repurposes the unconditional prediction slot in CFG. Instead of using the null embedding $\varnothing$, a negative prompt embedding $\mathbf{c}_\text{neg}$ is substituted:

\[\tilde{\boldsymbol{\epsilon}} = \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, \mathbf{c}_\text{neg}) + w \cdot \left[\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, \mathbf{c}_\text{pos}) - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, \mathbf{c}_\text{neg})\right]\]

The geometric interpretation is intuitive: the guidance vector points away from the negative prompt embedding and toward the positive prompt embedding in the score function space. By encoding undesired attributes (e.g., “blurry, low quality, distorted, watermark”) into $\mathbf{c}_\text{neg}$, the generation is steered away from those attributes.

From an embedding perspective, negative prompting works because:

  1. The text encoder maps the negative prompt to a region of embedding space associated with undesired visual features.
  2. The guidance vector \(\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, \mathbf{c}_\text{pos}) - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, \mathbf{c}_\text{neg})\) is larger and more directionally specific than \(\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, \mathbf{c}_\text{pos}) - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, \varnothing)\), giving the guidance more leverage.
  3. The technique is training-free — it requires no modification to the model, only a change to the inference procedure.
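Both the null-embedding and negative-prompt variants reduce to the same vector arithmetic; a minimal numpy sketch with stand-in predictions:

```python
import numpy as np

def guided_prediction(eps_anchor, eps_cond, w):
    """CFG combination: extrapolate from an anchor prediction (produced with the
    null embedding, or with a negative-prompt embedding) toward the conditional one."""
    return eps_anchor + w * (eps_cond - eps_anchor)

eps_null = np.zeros(4)   # stand-in unconditional / negative-prompt prediction
eps_pos = np.ones(4)     # stand-in positive-conditional prediction
guided = guided_prediction(eps_null, eps_pos, 7.5)
# w = 1 recovers the plain conditional prediction; w > 1 extrapolates past it.
```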

6.5.3 Negative Prompting in Flow Matching Models

Flow matching models (SD 3 3, FLUX.1 4) based on rectified flows have a different training formulation than DDPM-style diffusion, but CFG applies analogously. The model predicts a velocity field $\mathbf{v}_\theta(\mathbf{x}_t, t, \mathbf{c})$ rather than noise, and the guided velocity is:

\[\tilde{\mathbf{v}} = \mathbf{v}_\theta(\mathbf{x}_t, t, \mathbf{c}_\text{neg}) + w \cdot \left[\mathbf{v}_\theta(\mathbf{x}_t, t, \mathbf{c}_\text{pos}) - \mathbf{v}_\theta(\mathbf{x}_t, t, \mathbf{c}_\text{neg})\right]\]

Notably, FLUX.1 4 was trained with guidance distillation, where the guidance scale is baked into the model as an additional conditioning input, eliminating the need for two forward passes at inference. In this setting, negative prompting in its traditional form is not directly applicable to the distilled model (FLUX.1-schnell), though the non-distilled model (FLUX.1-dev) still supports it.


6.6 Prompt Weighting and Token Emphasis from an Embedding Perspective

Users often want to control the relative importance of different parts of a prompt. For example, in “a portrait of a woman with (blue eyes:1.5) and blonde hair,” the user wants “blue eyes” to be more strongly expressed. This requires manipulating the text embeddings at the token level.


6.6.1 Attention Re-weighting

The most common approach manipulates the cross-attention scores between image and text tokens. For each cross-attention layer, the standard attention weights are:

\[A_{ij} = \frac{\exp(q_i \cdot k_j / \sqrt{d_k})}{\sum_{j'} \exp(q_i \cdot k_{j'} / \sqrt{d_k})}\]

Prompt weighting modifies this by scaling the attention logits for specific tokens before the softmax:

\[A_{ij} = \frac{\exp(w_j \cdot q_i \cdot k_j / \sqrt{d_k})}{\sum_{j'} \exp(w_{j'} \cdot q_i \cdot k_{j'} / \sqrt{d_k})}\]

where $w_j$ is the user-specified weight for token $j$ (default 1.0). A weight $w_j > 1$ increases the attention to token $j$; $w_j < 1$ decreases it. This approach is training-free and operates directly on the attention computation.
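A single-head numpy sketch of this logit re-weighting (dimensions hypothetical; note that scaling a logit by $w_j > 1$ amplifies attention only where the query–key similarity is already positive):

```python
import numpy as np

def weighted_attention(Q, K, V, token_weights):
    """Cross-attention with per-token logit scaling (training-free prompt weighting)."""
    logits = (Q @ K.T) / np.sqrt(K.shape[-1])   # (N, L) image-to-text logits
    logits = logits * token_weights[None, :]    # scale each text token's column
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    A = e / e.sum(axis=-1, keepdims=True)       # re-normalized attention weights
    return A @ V

rng = np.random.default_rng(0)
N, L, d_k = 4, 3, 8
Q = rng.standard_normal((N, d_k))
K = rng.standard_normal((L, d_k))
V = np.eye(L)                        # identity values: output rows equal attention weights
base = weighted_attention(Q, K, V, np.ones(L))
boosted = weighted_attention(Q, K, V, np.array([1.0, 1.5, 1.0]))  # emphasize token 1
```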


6.6.2 Embedding-Space Scaling

An alternative approach scales the text embeddings themselves rather than the attention logits:

\[\mathbf{c}'_j = w_j \cdot \mathbf{c}_j\]

where $\mathbf{c}_j$ is the token-level embedding at position $j$. Scaling the embedding directly amplifies (or attenuates) the “signal” that a particular token contributes as a key and value in cross-attention. This has a different mathematical effect than logit scaling: it scales both the key (affecting attention distribution) and the value (affecting the weighted sum), resulting in a more aggressive emphasis.


6.6.3 Prompt Interpolation and Blending

Beyond per-token weighting, users sometimes blend entire prompt embeddings:

\[\mathbf{C}_\text{blend} = \alpha \cdot \mathbf{C}_A + (1 - \alpha) \cdot \mathbf{C}_B\]

where $\mathbf{C}_A$ and $\mathbf{C}_B$ are the token-level embedding matrices for two different prompts. This produces images that semantically interpolate between the two prompts. The interpolation is meaningful because the text encoders (especially CLIP 34) produce embeddings in a relatively smooth semantic space.

This technique extends to per-step prompt interpolation (e.g., use prompt A for early denoising steps that establish composition, and prompt B for later steps that refine details), sometimes called prompt scheduling or prompt switching.


6.6.4 Limitations

Prompt weighting operates in the embedding space of a frozen text encoder, which means it cannot introduce concepts that are not already representable in that space. Additionally, because tokens in a Transformer interact with each other through self-attention within the text encoder, scaling one token’s output embedding does not cleanly isolate that concept — neighboring tokens’ representations are already entangled with it.


6.7 Text Conditioning Injection Mechanisms

How text embeddings, once computed, are fed into the denoising network is a critical design choice that directly affects what information from the text condition the model can leverage.


6.7.1 Cross-Attention (U-Net Models)

The classic approach, introduced in Latent Diffusion Models 2 and used in SD 1.x, SD 2.x, SDXL 29, and many other U-Net–based models:

  • Token-level text embeddings $\mathbf{C} \in \mathbb{R}^{L \times d_c}$ are projected to keys and values: $K = \mathbf{C} W_K$, $V = \mathbf{C} W_V$.
  • Image features at each spatial resolution provide the queries: $Q = \mathbf{h} W_Q$.
  • Cross-attention is applied at multiple resolutions within the U-Net, typically in the middle and deeper blocks.

Cross-attention allows each spatial location in the image to selectively attend to relevant tokens, forming an implicit spatial–semantic alignment. Studies (e.g., Prompt-to-Prompt 44) have shown that the cross-attention maps reveal interpretable spatial layouts: attention to the word “cat” concentrates on the region where the cat is being generated.


6.7.2 Adaptive Layer Normalization (adaLN / adaLN-Zero)

Introduced by DiT 1 for class-conditional generation and extended to text conditioning in subsequent works:

  • A global conditioning vector $\mathbf{c}_\text{global}$ (typically the sum or concatenation of the timestep embedding and the pooled text embedding) is passed through an MLP to produce per-layer scale ($\gamma$) and shift ($\beta$) parameters: \(\mathbf{h}' = \gamma \cdot \text{LayerNorm}(\mathbf{h}) + \beta\)
  • In adaLN-Zero, additional gate parameters ($\alpha$) are produced, initializing the residual contributions to zero at the start of training for stable optimization.

adaLN efficiently broadcasts global semantic information to every token in the image sequence. It is used for the pooled text embedding in SD 3 3, FLUX.1 4, PixArt-α 37, and most DiT-based models. However, it cannot convey per-token, per-position information — it provides the same modulation to all image patches.
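A minimal numpy sketch of adaLN-Zero modulation (hypothetical shapes; this uses the $(1 + \gamma)$ parameterization common in DiT implementations, so zero-initialized modulation weights give an identity-like start):

```python
import numpy as np

def layer_norm(h, eps=1e-6):
    mu = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    return (h - mu) / np.sqrt(var + eps)

def ada_ln_zero(h, c_global, W, b):
    """Regress per-layer scale/shift/gate from the global conditioning vector."""
    gamma, beta, alpha = np.split(c_global @ W + b, 3, axis=-1)
    modulated = (1 + gamma) * layer_norm(h) + beta  # adaLN modulation
    return modulated, alpha                          # alpha gates the residual branch

d = 8
h = np.random.default_rng(1).standard_normal((4, d))
c = np.zeros(d)             # stand-in timestep + pooled-text conditioning vector
W = np.zeros((d, 3 * d))    # zero-init: gamma = beta = alpha = 0 at training start
b = np.zeros(3 * d)
out, gate = ada_ln_zero(h, c, W, b)
# With zero-initialized modulation, the block reduces to plain LayerNorm
# and the gated residual contribution is zero.
```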


6.7.3 Joint Self-Attention (MMDiT / Single-Stream DiT)

SD 3’s MMDiT 3 pioneered the multi-modal DiT approach:

  • Text token embeddings and image patch embeddings are treated as two separate “streams” with independent linear projections for Q, K, V.
  • Within each MMDiT block, both streams are concatenated along the sequence dimension, and a standard self-attention operation is performed over the combined sequence: \(\text{Attention}([Q_\text{img}; Q_\text{txt}], [K_\text{img}; K_\text{txt}], [V_\text{img}; V_\text{txt}])\)
  • After attention, the combined output is split back into image and text portions, and each passes through its own feed-forward network.

This joint attention allows bidirectional information flow between text and image tokens — text tokens can attend to image tokens and vice versa. This is more powerful than cross-attention (where information flows only from text to image) and enables the model to dynamically refine its “understanding” of the text condition as it processes the image.
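A single-head numpy sketch of the concatenate–attend–split pattern (hypothetical dimensions; per-stream normalization, MLPs, and multi-head structure are omitted):

```python
import numpy as np

def joint_attention_block(img, txt, proj_img, proj_txt):
    # img: (N, d) patch tokens; txt: (L, d) text tokens.
    # MMDiT-style: separate Q, K, V projections per stream, shared attention.
    Qi, Ki, Vi = (img @ W for W in proj_img)
    Qt, Kt, Vt = (txt @ W for W in proj_txt)
    Q = np.concatenate([Qi, Qt]); K = np.concatenate([Ki, Kt]); V = np.concatenate([Vi, Vt])
    logits = Q @ K.T / np.sqrt(K.shape[-1])          # joint self-attention over both streams
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    out = (e / e.sum(axis=-1, keepdims=True)) @ V
    return out[: img.shape[0]], out[img.shape[0]:]   # split back into image/text streams

rng = np.random.default_rng(0)
d, N, L = 16, 6, 4
img = rng.standard_normal((N, d))
txt = rng.standard_normal((L, d))
proj_img = [rng.standard_normal((d, d)) for _ in range(3)]
proj_txt = [rng.standard_normal((d, d)) for _ in range(3)]
img_out, txt_out = joint_attention_block(img, txt, proj_img, proj_txt)
```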

FLUX.1 4 extends this with a hybrid architecture: the first ~19 blocks use double-stream (MMDiT-style) layers where text and image have separate MLPs, followed by ~38 single-stream blocks where text and image tokens are simply concatenated and processed through a unified transformer block (shared projections, shared MLP). This transition from separated to unified processing mirrors a progressive fusion of modalities.


6.7.4 In-Context Conditioning (Prepend)

Some architectures simply prepend text token embeddings to the image token sequence and process everything with standard self-attention:

\[\text{Input sequence} = [\mathbf{c}_1, \mathbf{c}_2, \ldots, \mathbf{c}_L, \mathbf{z}_1, \mathbf{z}_2, \ldots, \mathbf{z}_N]\]

This is the simplest approach and treats text tokens identically to image tokens. It appears in some early DiT variants; PixArt-α 37 itself uses cross-attention, but many subsequent DiT works adopt prepend-style conditioning. The downside is that it does not allow modality-specific processing (e.g., separate normalization or MLPs for text vs. image), which can limit performance.


6.7.5 Summary and Comparison

| Mechanism | Text→Image | Image→Text | Global Info | Per-token Info | Separate Params |
|---|---|---|---|---|---|
| Cross-Attention | ✅ | ❌ | ❌ | ✅ | ✅ (K,V projections) |
| adaLN | ✅ | ❌ | ✅ | ❌ | ✅ (MLP) |
| Joint Self-Attention (MMDiT) | ✅ | ✅ | ❌ | ✅ | ✅ (separate Q,K,V) |
| Single-Stream Prepend | ✅ | ✅ | ❌ | ✅ | ❌ (shared) |

State-of-the-art models combine adaLN (for pooled/global conditioning) with joint attention (for token-level conditioning), achieving both global and local text conditioning in a unified architecture.


6.8 Text Encoder Training Strategy and Embedding Preprocessing


6.8.1 Frozen vs. Fine-Tuned Text Encoders

The dominant approach is to freeze the text encoder during diffusion model training. The text encoder serves as a fixed feature extractor, and all adaptation happens in the denoising network (via learned linear projections on the text embeddings). This is computationally efficient, prevents catastrophic forgetting of the text encoder’s pretrained knowledge, and enables pre-computation of text embeddings — embeddings can be computed once for all training captions and cached to disk, eliminating the text encoder from the training loop entirely.

However, some methods fine-tune or adapt the text encoder:

  • DreamBooth 45: Optionally fine-tunes the text encoder alongside the U-Net to better bind a new concept identifier to its visual appearance.
  • Textual Inversion 46: Learns a new embedding vector for a novel concept token while keeping the entire model frozen (see Section 9.3).
  • LoRA on text encoder 47: A common practice in community fine-tuning is to apply LoRA (Low-Rank Adaptation) to the text encoder alongside the U-Net/DiT, with a much lower learning rate for the text encoder. This allows subtle adaptation of the text embeddings to the fine-tuning domain.

6.8.2 Embedding Projection and Normalization

The raw text embeddings from a pretrained encoder rarely match the expected input distribution of the denoising network. A learned linear projection (or MLP) is used to map the text encoder’s hidden dimension to the denoising network’s internal dimension:

\[\mathbf{C}' = \mathbf{C} W_\text{proj} + b_\text{proj}\]

In SD 3 3 and FLUX.1 4, each text encoder has its own projection layer. For the pooled embedding, a separate MLP processes the concatenated CLIP pooled embeddings into the timestep conditioning space.


6.8.3 Pre-Computation and Caching

Because text encoders are frozen and deterministic, text embeddings can be pre-computed once and stored alongside the training data. This is standard practice for large-scale training:

  • PixArt-α 37 pre-computes T5-XXL embeddings for all training captions.
  • Sana 41 pre-computes Gemma embeddings.

This yields substantial training speedups (the text encoder is often 30–50% of the total parameter count) and enables training on hardware that could not fit both the text encoder and the diffusion model simultaneously.

The tradeoff is inflexibility: if the training captions are augmented or modified (e.g., random caption dropout, prompt augmentation with synonyms), the embeddings must be recomputed.
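A minimal caching sketch of the compute-once pattern (the encoder here is a hypothetical deterministic stub, not a real text encoder; large-scale systems persist embeddings to disk rather than an in-memory dict):

```python
import hashlib
import numpy as np

_cache = {}

def encode_text_stub(caption):
    # Stand-in for a frozen, deterministic text encoder (hypothetical shapes).
    seed = int(hashlib.sha256(caption.encode()).hexdigest()[:8], 16)
    rng = np.random.default_rng(seed)
    return rng.standard_normal((77, 768)).astype(np.float32)

def get_text_embedding(caption):
    """Compute-once caching: because the encoder is frozen and deterministic,
    each caption's embedding can be computed once and reused every epoch."""
    key = hashlib.sha256(caption.encode()).hexdigest()
    if key not in _cache:
        _cache[key] = encode_text_stub(caption)
    return _cache[key]

emb = get_text_embedding("a cat sitting on a mat")
```

Caption augmentation (dropout, synonym replacement) must happen before this caching step, which is exactly the inflexibility noted above.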


6.9 Additional Topics


6.9.1 Text Conditioning for Video Generation

The text conditioning paradigms described above extend directly to video diffusion and flow matching models:

  • Sora 48, CogVideoX [^cogvideo]: These video generation models use CLIP and/or T5 text encoders with the same token-level + pooled embedding paradigm. The text embeddings are shared across all temporal frames, providing a fixed semantic anchor for the entire video.
  • Movie Gen 49: Uses a combination of text encoders (MetaCLIP and a UL2 encoder) for conditioning a flow matching video model.
  • Some video models introduce temporal prompt conditioning, where different text embeddings are associated with different temporal segments, enabling prompt-driven temporal transitions.

6.9.2 Re-Captioning and Synthetic Captions

A significant recent trend is to re-caption training data using a large vision–language model (VLM) before training the diffusion model. This directly affects text conditioning embeddings because the quality and style of captions determine which regions of the text embedding space are well-covered by training data.

  • DALL·E 3 42 demonstrated that training on synthetic, detailed captions (generated by a fine-tuned captioning model) dramatically improves prompt following, because the synthetic captions are more descriptive and precise than the noisy, short alt-text captions found in web-scraped datasets.
  • PixArt-α 37 uses LLaVA to recaption SA-1B dataset images.

From an embedding perspective, re-captioning shifts the training distribution of text embeddings from a sparse, noisy coverage of embedding space (web alt-text) to a denser, more uniform coverage that better represents the visual content. This is why re-captioned models exhibit markedly better prompt adherence.


6.9.3 Textual Inversion and Embedding-Space Manipulation

Textual Inversion 46 learns a new embedding vector \(\mathbf{v}_{\star}\) for a pseudo-word \(S_{\star}\) that represents a user-specified concept (e.g., a specific object or style). The rest of the model is frozen; only \(\mathbf{v}_{\star}\) is optimized to reconstruct the concept from a few images. At inference, the pseudo-word is composed with natural language (e.g., “a painting of $S_*$ in the style of Van Gogh”), and its learned embedding seamlessly integrates into the text encoder’s token sequence.

This demonstrates that the text embedding space of frozen encoders is sufficiently expressive and smooth to accommodate entirely novel concepts through local optimization, providing evidence that text embeddings form a well-structured semantic manifold.
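The mechanics of adding one trainable pseudo-token can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the frozen token-embedding table is extended by a single trainable row, and the diffusion reconstruction loss that actually drives the optimization is omitted.

```python
import torch
import torch.nn as nn

# Minimal sketch of Textual Inversion's trainable pseudo-token: the frozen
# token-embedding table gains one extra row (the concept S*), and only that
# row receives gradients. Vocabulary size and token ids are illustrative.
vocab_size, d = 49408, 768
frozen_table = nn.Embedding(vocab_size, d)
frozen_table.weight.requires_grad_(False)           # encoder stays frozen

v_star = nn.Parameter(torch.randn(1, d) * 0.02)     # embedding of S*

def embed(token_ids):
    # Row index `vocab_size` corresponds to the pseudo-word S*.
    table = torch.cat([frozen_table.weight, v_star], dim=0)
    return table[token_ids]

prompt = torch.tensor([[101, vocab_size, 202]])     # "... S* ..." (ids illustrative)
e = embed(prompt)                                   # (1, 3, 768)
e.sum().backward()                                  # stand-in for the diffusion loss
assert v_star.grad is not None and frozen_table.weight.grad is None
```

In the real method the gradient comes from the denoising objective on a few concept images, but the trainable-row structure is the same.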


6.9.4 Classifier-Free Guidance Alternatives

Recent work has explored alternatives to CFG that modify how text conditioning is used:

  • Autoguidance 50: Uses a lower-quality (smaller) model for the unconditional prediction instead of the same model with null text, decoupling quality guidance from prompt guidance.
  • Guidance distillation (as in FLUX.1-schnell 4): The guidance scale is provided as an input to the model, which is trained to directly output the guided prediction in a single forward pass.

These approaches change the role of null/negative text embeddings, and some eliminate the need for them entirely.


6.10 Summary and Outlook

The design of text conditioning embeddings has evolved from a single CLIP encoder feeding cross-attention layers (SD 1.x 2, 2022) to sophisticated multi-encoder systems combining visually grounded CLIP embeddings, linguistically rich T5/LLM embeddings, pooled global vectors, token-level sequences, and joint attention mechanisms (SD 3 3, FLUX.1 4, 2024–2025).

Key takeaways:

  1. Text encoder capacity matters more than generative backbone scale for prompt adherence — Imagen’s 27 key finding, repeatedly validated.
  2. Multi-encoder approaches that combine CLIP (for visual grounding and pooled embeddings) with T5 or LLMs (for compositional token-level embeddings) represent the current best practice.
  3. Dual-granularity conditioning — pooled embeddings via adaLN for global semantics, token-level embeddings via attention for local details — is now standard.
  4. Joint attention mechanisms (MMDiT 3) that allow bidirectional text–image interaction outperform unidirectional cross-attention for compositional generation.
  5. LLM-based text encoders 41 18 are an emerging replacement for T5, offering better language understanding at comparable or lower computational cost.

Open challenges and future directions:

  • Dynamic text conditioning: Adapting text embeddings during the denoising process (e.g., different emphasis at different noise levels) rather than using static embeddings throughout.
  • Instruction-following: Moving beyond descriptive captions to instructional or conversational text conditioning (e.g., “make the sky more dramatic”).
  • Unified multimodal encoders: Using a single multimodal model (e.g., a VLM) as both the text encoder and image understanding module, enabling richer conditioning that goes beyond pure text.
  • Efficiency: Reducing the memory and compute footprint of large text encoders through quantization, distillation, or more parameter-efficient architectures, without sacrificing embedding quality.



7. Image Embeddings in Diffusion and Flow Matching Generative Models

Text-to-image diffusion models 2 27 and flow matching models 3 have demonstrated remarkable ability to generate high-quality images from text descriptions. However, natural language is inherently ambiguous and limited in its capacity to convey fine-grained visual details — the precise texture of a fabric, the exact pose of a person, the specific identity of a face, or the spatial layout of a scene. Image embeddings bridge this gap by providing a dense, information-rich conditioning signal that captures aspects of visual content that text alone cannot express.

From an architectural standpoint, leveraging image embeddings introduces a core design question: how should visual information from a conditioning image be encoded, and where should it enter the denoising network? The answer depends critically on what type of visual information the image carries (semantic content vs. spatial structure vs. identity) and what level of control the downstream task requires. Over the past several years, the field has converged on a rich toolkit of solutions — from CLIP-based global embeddings injected via cross-attention 51, to multi-scale spatial features routed through zero-convolution side branches 52, to full reference networks that augment self-attention 53.


7.1 Taxonomy of Image Conditions

Not all image conditions are created equal. A natural photograph carries rich semantic and textural information; a Canny edge map carries only binary contour structure. This distinction has profound implications for how the image should be encoded and injected. Below we categorize the major types of image conditions and their roles.


7.1.1 Natural / Reference Images

A natural (RGB) image used as a condition typically provides semantic, stylistic, or identity-level guidance. Use cases include:

  • Image variation: generating diverse outputs that share the semantic content of a reference image. DALL·E 2 33 conditions its decoder on a CLIP image embedding to produce variations of a given image. Versatile Diffusion 54 jointly models text-to-image and image-to-image flows.
  • Style transfer: transferring the artistic style, color palette, or mood from a reference. IP-Adapter 51 enables this by injecting CLIP image features via cross-attention alongside text.
  • Subject-driven generation: generating novel scenes containing a specific subject (person, object, pet). DreamBooth 45 fine-tunes the model on a few subject images; Textual Inversion [^textualinversion] learns a special token embedding; BLIP-Diffusion 55 and ELITE 56 encode subject appearance via Q-Former or mapping networks without per-subject fine-tuning.
  • Identity preservation: generating images of a specific person. InstantID 57 and PhotoMaker 58 use dedicated face encoders to extract identity features from face photographs.
  • Exemplar-based editing: Paint by Example 59 replaces a masked region with content semantically matching a reference image, using a CLIP image embedding in place of the text embedding.
  • Video generation from a still: Stable Video Diffusion (SVD) 60 takes a single image and animates it into a video clip, conditioning on both the CLIP embedding and the VAE-encoded latent of the first frame.

The key property of natural image conditions is that they carry holistic, high-dimensional visual information — far richer than any text prompt — and different downstream tasks demand different subsets of this information (global semantics, local details, or identity-specific features).


7.1.2 Depth Maps

A depth map is a single-channel image where pixel intensity encodes the distance from the camera to the scene surface. Depth maps provide 3D structural guidance — they tell the model where objects are in space and how they occlude each other, without prescribing texture, color, or fine detail.

Depth maps are typically extracted from natural images using monocular depth estimation models such as MiDaS 61 or Depth Anything 62. They are widely used as conditions in ControlNet 52 and T2I-Adapter 63 to guide the spatial layout of generated images while allowing full creative freedom in appearance.


7.1.3 Human Pose and Skeleton Maps

Pose maps encode the articulated structure of human bodies as joint locations connected by limbs. The most common format is an OpenPose skeleton 64, which renders 18 or 25 body keypoints and their connections as colored lines on a black background. Variants include:

  • Body-only skeletons: 18-keypoint body pose.
  • Body + hand + face: high-fidelity pose including finger and facial landmark positions (up to 135 keypoints).
  • DensePose: a UV-mapped surface representation of the human body, providing a dense correspondence between 2D image pixels and 3D body surface coordinates.

Pose maps are essential for controllable human image generation, enabling precise specification of body posture, gesture, and facial expression. They are heavily used in human-centric generation 52 and character animation 53.


7.1.4 Edge and Contour Maps

Edge maps capture the boundary structure of a scene at different levels of abstraction:

  • Canny edges: a classical gradient-based edge detector that produces thin, binary contours. Sensitive to hyperparameters (low/high thresholds) and noise, but captures fine structural detail.
  • HED (Holistically-Nested Edge Detection): a learned edge detector that produces soft, multi-scale boundary maps. Produces smoother, more perceptually meaningful edges than Canny.
  • Scribbles and sketches: hand-drawn or algorithmically simplified contours that convey rough shape without precise detail.

Edge conditions constrain the outline and shape structure of generated objects while leaving interior textures, colors, and fine details to the model’s generation process. They are useful for creative workflows where an artist provides a rough structural guide.


7.1.5 Segmentation Maps

Semantic segmentation maps assign a class label (sky, road, building, person, etc.) to each pixel, typically visualized as color-coded regions. Instance segmentation further distinguishes individual objects of the same class.

Segmentation maps provide region-level semantic control: they specify what should appear where, without constraining appearance within each region. This makes them powerful for scene composition tasks. ControlNet 52, T2I-Adapter 63, and Composer 65 all support segmentation-conditioned generation.


7.1.6 Other Structural Conditions

Several additional condition types have been explored:

  • Surface normal maps: encode the orientation of surfaces at each pixel, providing fine-grained geometric detail beyond what depth maps capture.
  • Lineart: clean line drawings extracted by dedicated models, useful for illustration-style generation.
  • Inpainting masks: binary masks indicating regions to be regenerated. While not encoding visual content, they define the spatial scope of generation and are concatenated channel-wise to the input latent 2.
  • Optical flow: encodes per-pixel motion between frames, used in video generation and editing.
  • Color palettes and low-frequency color maps: provide coarse color guidance without structural detail.

7.1.7 Summary

A useful conceptual axis is the information spectrum from purely structural to purely semantic:

| Condition Type | Information Level | Spatial Alignment | Typical Use |
|---|---|---|---|
| Canny / HED edges | Structural (outline) | Pixel-aligned | Shape control |
| Depth map | Structural (3D layout) | Pixel-aligned | Scene layout |
| Pose / skeleton | Structural (articulated) | Keypoint-aligned | Human pose control |
| Segmentation map | Semantic (region-level) | Pixel-aligned | Scene composition |
| Normal map | Structural (surface) | Pixel-aligned | Geometric detail |
| Face photograph | Semantic (identity) | Not aligned | Identity preservation |
| Natural image | Semantic (holistic) | Not aligned | Style/content transfer |

This spectrum directly influences the choice of embedding algorithm and injection method, as we discuss next.


7.2 Image Embedding Algorithms and Granularity

The choice of encoder determines what information is extracted from the conditioning image and at what granularity. We identify four levels of representational granularity:

  1. Global vector \(\left(1 \times d\right)\): a single embedding capturing the holistic semantics of the entire image.
  2. Token sequence \(\left(N \times d\right)\): a set of \(N\) embedding vectors, each capturing a localized or abstract aspect of the image.
  3. Spatial feature map \(\left(h \times w \times c\right)\): a tensor preserving the 2D spatial layout, where each position encodes local features.
  4. Multi-scale feature pyramid \(\left\{\left(h_i \times w_i \times c_i\right)\right\}_{i=1}^{L}\): features at multiple spatial resolutions, capturing both fine and coarse structure.

Different encoders produce outputs at different granularities, and projection modules can convert between them. Due to space constraints, we only briefly introduce the major image embedding algorithms here; image embedding is a form of visual representation learning, which we cover in detail in a separate article.


7.2.1 Contrastive Vision-Language Encoders

CLIP (Contrastive Language-Image Pretraining) 34 trains a vision encoder (ViT or ResNet) and a text encoder jointly using a contrastive objective over 400M image-text pairs. The resulting image encoder produces representations that are semantically aligned with natural language — images of “a dog on a beach” and the text “a dog on a beach” are mapped to nearby points in the shared embedding space.

CLIP ViT models output two types of representations:

  • CLS token (global): the [CLS] token after the final transformer layer provides a single \(d\)-dimensional vector (e.g., \(d = 768\) for ViT-L/14, \(d = 1024\) for ViT-H/14) summarizing the entire image. This is the representation used for contrastive matching with text.
  • Patch tokens (local): the transformer also produces one token per spatial patch (e.g., a 224 × 224 image at patch size 14 yields a 16 × 16 grid, i.e., 256 patch tokens). These tokens from intermediate or penultimate layers retain spatial information and are richer in visual detail than the CLS token.

The CLS token captures high-level semantics (object categories, scene type, style) but discards fine spatial detail. Patch tokens preserve local visual features (textures, part-level structure, spatial relationships) at the cost of a longer token sequence.

In practice: IP-Adapter 51 in its base version uses the CLIP ViT-H/14 CLS token and projects it to 4 conditioning tokens. IP-Adapter Plus uses the penultimate-layer patch tokens (257 tokens) and compresses them to 16 tokens via a Perceiver Resampler. DALL·E 2 33 conditions on the CLIP ViT-L/14 CLS token. Paint by Example 59 uses the CLIP image embedding to replace the text embedding in cross-attention.

Limitation: CLIP is trained with a contrastive objective that favors discriminative features for retrieval. This means CLIP embeddings may discard visual details that are not useful for distinguishing image-text pairs but are important for generation (e.g., fine textures, exact colors, small objects). Furthermore, CLIP’s training data bias can cause it to overweight salient or stereotypical features.


7.2.2 Self-Supervised Vision Encoders

DINOv2 is a self-supervised ViT trained with a combination of self-distillation and masked image modeling on 142M curated images. Unlike CLIP, DINOv2 is not aligned with text — its representations capture pure visual structure without linguistic bias.

DINOv2 features are particularly strong at:

  • Spatial correspondence: patch tokens encode fine-grained local structure and are well-suited for dense prediction tasks.
  • Visual similarity: the CLS token captures perceptual similarity that correlates well with human judgment of visual likeness.
  • Structural understanding: DINOv2 features capture shape, texture, and part structure more faithfully than CLIP, which tends to abstract away visual details in favor of semantic categories.

Trade-off with CLIP: CLIP embeddings excel at semantic image conditioning (generating images that match the conceptual content of a reference). DINOv2 embeddings excel at visual image conditioning (preserving appearance, texture, and structure). Several works combine both: for example, some IP-Adapter configurations use CLIP for semantic alignment and DINOv2 for structural fidelity.


7.2.3 VAE Encoders

Latent diffusion models 2 use a pretrained Variational Autoencoder (VAE) to map images between pixel space and a lower-dimensional latent space. The VAE encoder \(\mathcal{E}\) maps an image \(\mathbf{x} \in \mathbb{R}^{H \times W \times 3}\) to a latent \(\mathbf{z} = \mathcal{E}(\mathbf{x}) \in \mathbb{R}^{h \times w \times c}\), typically with a spatial downsampling factor of 8 (so \(h = H/8\) and \(w = W/8\)) and \(c = 4\) channels.
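The shape bookkeeping can be illustrated with a toy convolutional encoder. This is only a stand-in for the actual VAE (which uses residual blocks and attention); the point is the 8× spatial reduction and the 4-channel latent.

```python
import torch
import torch.nn as nn

# Toy stand-in for an SD-style VAE encoder: three stride-2 convolutions give
# the 8x downsampling factor, and a final conv maps to c = 4 latent channels.
toy_encoder = nn.Sequential(
    nn.Conv2d(3, 32, 3, stride=2, padding=1),    # H   -> H/2
    nn.SiLU(),
    nn.Conv2d(32, 64, 3, stride=2, padding=1),   # H/2 -> H/4
    nn.SiLU(),
    nn.Conv2d(64, 128, 3, stride=2, padding=1),  # H/4 -> H/8
    nn.SiLU(),
    nn.Conv2d(128, 4, 3, padding=1),             # 4 latent channels
)

x = torch.randn(1, 3, 512, 512)    # an RGB image, H = W = 512
z = toy_encoder(x)
print(z.shape)  # torch.Size([1, 4, 64, 64]) -- h = H/8, w = W/8, c = 4
```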

VAE-encoded latents are pixel-aligned spatial feature maps that preserve the full visual content of the input image at reduced resolution. They are the natural representation for tasks that require spatial fidelity to the conditioning image:

  • Img2Img / SDEdit 66: the source image is VAE-encoded, noise is added to a specified level \(t_0 < T\), and the model denoises from there. This preserves the coarse structure while allowing controlled variation.
  • Inpainting: the unmasked region is VAE-encoded and concatenated to the noisy latent, so the model generates only the masked portion.
  • InstructPix2Pix 67: the source image is VAE-encoded and concatenated channel-wise to the noisy latent, providing a pixel-aligned reference for editing.
  • Stable Video Diffusion 60: the conditioning frame is VAE-encoded and concatenated to each frame’s noisy latent.
  • OmniGen 68: reference images are VAE-encoded and their latent tokens are interleaved with text tokens for joint processing.

Unlike CLIP or DINOv2 embeddings, VAE latents are not semantically abstracted — they contain enough information to reconstruct the original image via the decoder. This makes them ideal when spatial fidelity is paramount but less useful when the goal is to extract only the semantic essence of an image.


7.2.4 Domain-Specific Encoders

Face Recognition Models

For identity-preserving generation, generic vision encoders like CLIP are insufficient — they capture high-level semantics but do not reliably encode the discriminative features that distinguish one person’s face from another’s. Face recognition models such as ArcFace 69 and InsightFace are trained with metric learning objectives specifically designed to produce embeddings where same-identity faces are close and different-identity faces are far apart in embedding space.

ArcFace 69 uses an additive angular margin loss:

\[L = -\log \frac{e^{s \cdot \cos(\theta_{y_i}+m)}} {e^{s \cdot \cos(\theta_{y_i}+m)} + \sum_{j \ne y_i} e^{s \cdot \cos \theta_j}}\]

where \(\theta_{y_i}\) is the angle between the feature and the weight vector of the ground-truth class, \(m\) is the angular margin, and \(s\) is a scale factor. The resulting embeddings (typically 512-dimensional global vectors) capture identity-discriminative facial features.
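The margin mechanism can be sketched directly from the formula above. This is a simplified single-batch sketch (clamping and function names are ours, not ArcFace's reference code): the margin \(m\) is added to the angle of the ground-truth class only, and the scaled cosines are fed to a standard cross-entropy loss.

```python
import torch
import torch.nn.functional as F

def arcface_logits(features, weight, margin=0.5, scale=64.0, labels=None):
    """Additive angular margin logits, a sketch of the ArcFace objective.

    features: (B, d) face embeddings; weight: (C, d) class centers.
    """
    # Cosine similarity between L2-normalized features and class centers.
    cos = F.normalize(features) @ F.normalize(weight).t()       # (B, C)
    if labels is None:
        return scale * cos
    theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))          # angles
    one_hot = F.one_hot(labels, num_classes=weight.shape[0]).bool()
    # cos(theta + m) for the target class, plain cos(theta) elsewhere.
    cos_m = torch.where(one_hot, torch.cos(theta + margin), cos)
    return scale * cos_m                                        # feed to cross-entropy

B, d, C = 8, 512, 10
feats = torch.randn(B, d)
centers = torch.randn(C, d)
labels = torch.randint(0, C, (B,))
loss = F.cross_entropy(arcface_logits(feats, centers, labels=labels), labels)
```

Training against this loss pulls same-identity embeddings together on the hypersphere while enforcing an angular gap of at least \(m\) to other identities.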

InstantID 57 uses an InsightFace encoder to extract a 512-d face embedding, projects it through an IP-Adapter-style module for cross-attention injection, and simultaneously uses detected face keypoints as a ControlNet spatial condition. PhotoMaker 58 proposes a stacked ID embedding approach that fuses CLIP image features with identity features and merges them into the text token sequence, enabling identity-consistent generation without test-time fine-tuning.


7.2.5 Lightweight Condition Encoders

For structural conditions (edges, depth, pose, segmentation maps), the input is already a simplified visual signal — it does not require the representational power of a large pretrained vision encoder. Instead, small convolutional networks are typically sufficient.

ControlNet 52 processes the condition image \(\mathbf{c}\) through a lightweight “hint network” consisting of four convolutional blocks (Conv2d + SiLU activation) that progressively downsample the image from pixel resolution to the latent resolution:

\[\mathbf{c}_f = \mathrm{HintNet}(\mathbf{c}) \in \mathbb{R}^{h \times w \times c_{\text{latent}}}\]

This feature map is then added to the input of ControlNet’s trainable encoder copy.

T2I-Adapter 63 uses an even more lightweight encoder (~77M parameters, compared to ControlNet’s ~361M) that produces multi-resolution feature maps through pixel-unshuffle operations and residual blocks. These features are added directly to the UNet’s intermediate activations at matching resolutions.

These lightweight encoders are trained end-to-end with the adapter/control module, so they learn task-appropriate feature extraction jointly with the injection mechanism.


7.2.6 Intermediate Projection Modules

A critical design element is the projection module that maps encoder outputs to the conditioning space expected by the diffusion model’s attention layers. The choice of projection module determines the final granularity and expressiveness of the conditioning signal.

Linear / MLP Projections

The simplest approach is a linear layer or small MLP that maps the encoder output to the dimension of the diffusion model’s cross-attention:

\[\mathbf{e}_{\text{img}} = W \cdot \mathbf{f}_{\text{CLS}} + \mathbf{b}\]

where \(\mathbf{f}_{\text{CLS}} \in \mathbb{R}^{d_{\text{enc}}}\) is the CLIP CLS token and \(\mathbf{e}_{\text{img}} \in \mathbb{R}^{N \times d_{\text{model}}}\) is reshaped to \(N\) tokens. IP-Adapter base 51 uses this approach with \(N = 4\), producing a short token sequence from a single global vector.

For face embeddings, InstantID 57 uses an MLP to project the 512-d ArcFace embedding to the cross-attention dimension.
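The global-vector-to-tokens projection can be sketched as follows. Dimensions are illustrative (CLIP ViT-H CLS of size 1024, SD 1.x cross-attention dimension 768, \(N = 4\) tokens as in IP-Adapter base); the module name is ours.

```python
import torch
import torch.nn as nn

class ImageProjection(nn.Module):
    """Project one global image embedding to N conditioning tokens
    (a sketch of the IP-Adapter-base style projection)."""
    def __init__(self, d_enc=1024, d_model=768, num_tokens=4):
        super().__init__()
        self.num_tokens = num_tokens
        self.proj = nn.Linear(d_enc, num_tokens * d_model)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, cls_embed):                    # (B, d_enc)
        tokens = self.proj(cls_embed)                # (B, N * d_model)
        tokens = tokens.view(cls_embed.shape[0], self.num_tokens, -1)
        return self.norm(tokens)                     # (B, N, d_model)

proj = ImageProjection()
e_img = proj(torch.randn(2, 1024))
print(e_img.shape)  # torch.Size([2, 4, 768])
```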

Perceiver Resampler

The Perceiver Resampler, introduced in Flamingo [^flamingo], addresses the problem of variable-length or high-dimensional encoder outputs by compressing them to a fixed number of tokens through cross-attention with learnable queries:

\[\mathbf{e}_{\text{img}} = \mathrm{CrossAttention}(Q_{\text{learn}}, \mathbf{f}_{\text{patch}}, \mathbf{f}_{\text{patch}}) \in \mathbb{R}^{N_q \times d}\]

where \(Q_{\text{learn}} \in \mathbb{R}^{N_q \times d}\) are \(N_q\) learnable query vectors and \(\mathbf{f}_{\text{patch}} \in \mathbb{R}^{M \times d_{\text{enc}}}\) are the \(M\) patch tokens from the image encoder. The Perceiver Resampler typically consists of several layers of cross-attention and self-attention among the queries.

IP-Adapter Plus 51 uses a Perceiver Resampler to compress 257 CLIP patch tokens to 16 conditioning tokens, striking a balance between detail preservation and computational cost. This enables much finer-grained image conditioning than the CLS-only baseline.
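A single resampler layer can be sketched with standard attention primitives. This is a one-layer toy (real resamplers stack several cross-attention + feed-forward layers, and hyperparameters here are illustrative):

```python
import torch
import torch.nn as nn

class PerceiverResampler(nn.Module):
    """Compress M patch tokens to N_q tokens via cross-attention with
    learnable queries (single-layer sketch of the Perceiver Resampler)."""
    def __init__(self, d=768, d_enc=1024, num_queries=16, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, d) * 0.02)
        self.kv_proj = nn.Linear(d_enc, d)        # encoder dim -> model dim
        self.cross_attn = nn.MultiheadAttention(d, num_heads, batch_first=True)
        self.ff = nn.Sequential(nn.LayerNorm(d), nn.Linear(d, 4 * d),
                                nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, patch_tokens):              # (B, M, d_enc)
        kv = self.kv_proj(patch_tokens)           # (B, M, d)
        q = self.queries.expand(patch_tokens.shape[0], -1, -1)
        out, _ = self.cross_attn(q, kv, kv)       # queries attend to patches
        return out + self.ff(out)                 # (B, N_q, d)

resampler = PerceiverResampler()
tokens = resampler(torch.randn(2, 257, 1024))     # 257 CLIP patch tokens -> 16
print(tokens.shape)  # torch.Size([2, 16, 768])
```

The fixed query count decouples the conditioning cost from the encoder's token count, which is why the same design scales from images to video frames.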

Q-Former

The Q-Former from BLIP-2 70 is a more complex variant that inserts learnable query tokens into a transformer that jointly attends to frozen image encoder features. The queries interact with image features through cross-attention layers interleaved with self-attention layers, producing output tokens that bridge the visual and textual modalities.

BLIP-Diffusion 55 uses the Q-Former to produce subject representations from reference images. These representations replace or augment the text embeddings in the diffusion model’s cross-attention layers, enabling zero-shot subject-driven generation.


7.3 Injection Methods

Given an image embedding, the next design decision is where and how to inject it into the denoising network. The diffusion model architecture (UNet 2 vs. DiT/MMDiT 3 11) provides several natural injection points, each with distinct properties. We survey the major injection paradigms, then discuss which methods are appropriate for which types of image conditions.


7.3.1 Standard Cross-Attention Replacement

The most direct approach to image conditioning in latent diffusion models is to replace the text embedding with an image embedding in the existing cross-attention layers. In LDM 2 and Stable Diffusion, each UNet block contains cross-attention layers of the form:

\[Q = W_Q \cdot z, \qquad K = W_K \cdot c, \qquad V = W_V \cdot c\] \[\mathrm{CrossAttn}(z, c) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V\]

where $z$ is the intermediate UNet feature (queries) and $c$ is the conditioning sequence (keys/values). For text-to-image, $c = c_{\text{text}}$ (CLIP text encoder output). For image conditioning, one can simply set $c = c_{\text{img}}$, where $c_{\text{img}}$ is the projected image embedding.

Versatile Diffusion 54 trains a unified model that can switch between text and image conditioning by swapping the cross-attention keys/values. Paint by Example 59 replaces the CLIP text embedding with a CLIP image embedding (with information bottleneck regularization to prevent copying) for exemplar-based inpainting.

This approach is simple but mutually exclusive — it is difficult to simultaneously condition on both text and image, since they compete for the same cross-attention pathway.


7.3.2 Decoupled Cross-Attention (IP-Adapter)

IP-Adapter 51 elegantly solves the text-image competition problem by introducing a separate cross-attention pathway for image features, decoupled from the text cross-attention:

\[z_{\text{out}} = \underbrace{\mathrm{CrossAttn}(z, c_{\text{text}})}_{\text{text branch}} + \lambda \cdot \underbrace{\mathrm{CrossAttn}'(z, c_{\text{img}})}_{\text{image branch}}\]

The image branch uses its own learned projection matrices $W'_K$ and $W'_V$ while sharing the same queries

\[Q = W_Q \cdot z\]

from the UNet features. The scalar $\lambda$ controls the relative influence of image vs. text conditioning. During training, only the image cross-attention projection matrices and the image projection network are optimized; the rest of the UNet remains frozen.
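A single-head sketch of the decoupled pathway, under the assumptions that both condition streams share the conditioning dimension and that only the image-branch projections would be trained (class and variable names are ours):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoupledCrossAttention(nn.Module):
    """IP-Adapter-style decoupled cross-attention (single-head sketch).
    Text and image branches share the query; in training, only the image
    K/V projections are optimized while the rest stays frozen."""
    def __init__(self, d_model=320, d_cond=768):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_model, bias=False)
        self.w_k = nn.Linear(d_cond, d_model, bias=False)      # frozen text K
        self.w_v = nn.Linear(d_cond, d_model, bias=False)      # frozen text V
        self.w_k_img = nn.Linear(d_cond, d_model, bias=False)  # trainable
        self.w_v_img = nn.Linear(d_cond, d_model, bias=False)  # trainable

    @staticmethod
    def attend(q, k, v):
        scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
        return F.softmax(scores, dim=-1) @ v

    def forward(self, z, c_text, c_img, lam=1.0):
        q = self.w_q(z)                                        # shared queries
        text_out = self.attend(q, self.w_k(c_text), self.w_v(c_text))
        img_out = self.attend(q, self.w_k_img(c_img), self.w_v_img(c_img))
        return text_out + lam * img_out        # lambda scales image influence

attn = DecoupledCrossAttention()
z = torch.randn(1, 64, 320)                    # 64 latent tokens
out = attn(z, torch.randn(1, 77, 768), torch.randn(1, 4, 768), lam=0.6)
print(out.shape)  # torch.Size([1, 64, 320])
```

Setting `lam=0` recovers pure text conditioning, which is what makes the image influence a clean inference-time knob.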

This design has several key advantages:

  • Compatibility: text and image conditions work simultaneously without interference.
  • Controllability: $\lambda$ provides a simple knob to control image influence at inference time.
  • Modularity: the adapter parameters are small (~22M) and can be trained independently of the base model.

IP-Adapter has become one of the most widely adopted image conditioning methods, with variants for SD 1.5, SDXL 29, and adaptations for DiT-based architectures. IP-Adapter-FaceID further combines this framework with ArcFace 69 embeddings for identity-preserving generation.


7.3.3 Channel Concatenation

For conditions that are spatially aligned with the target image, a straightforward injection method is to concatenate the condition (or its latent encoding) to the noisy input along the channel dimension:

\[\hat{\epsilon} = \epsilon_\theta\!\left( [z_t \,;\, c_{\text{spatial}}], \, t, \, c_{\text{text}} \right)\]

where $[\,;\,]$ denotes channel-wise concatenation, and $c_{\text{spatial}} \in \mathbb{R}^{h \times w \times c'}$ is a spatially-aligned condition. The first convolutional layer of the UNet is modified to accept $c_{\text{latent}} + c'$ input channels (typically initialized with zero weights for the new channels to preserve pretrained weights).
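The input-layer surgery can be sketched concretely. Assuming the SD-inpainting channel layout (4 noisy latent + 4 condition latent + 1 mask channels), the new channels are zero-initialized so that the extended conv initially reproduces the pretrained one:

```python
import torch
import torch.nn as nn

# Extend a pretrained 4-channel input conv to accept a concatenated condition.
old_conv = nn.Conv2d(4, 320, kernel_size=3, padding=1)       # pretrained
new_conv = nn.Conv2d(9, 320, kernel_size=3, padding=1)       # 4 + 4 + 1 channels
with torch.no_grad():
    new_conv.weight.zero_()                                  # silence new channels
    new_conv.weight[:, :4] = old_conv.weight                 # copy pretrained kernels
    new_conv.bias.copy_(old_conv.bias)

z_t = torch.randn(1, 4, 64, 64)       # noisy latent
cond = torch.randn(1, 4, 64, 64)      # VAE-encoded conditioning image
mask = torch.ones(1, 1, 64, 64)       # binary inpainting mask
h = new_conv(torch.cat([z_t, cond, mask], dim=1))
# With zero-initialized extra channels, the output matches the pretrained conv:
assert torch.allclose(h, old_conv(z_t), atol=1e-5)
```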

This method is used in:

  • Stable Diffusion Inpainting 2: concatenates the VAE-encoded masked image and the binary mask to the noisy latent ($4 + 4 + 1 = 9$ input channels).
  • InstructPix2Pix 67: concatenates the VAE-encoded source image to the noisy latent ($4 + 4 = 8$ input channels), enabling the model to “see” the original image at every denoising step.
  • Stable Video Diffusion 60: concatenates the VAE-encoded conditioning frame to each frame’s noisy latent.

Channel concatenation preserves full spatial alignment between the condition and the generation target, making it ideal for pixel-level transformations. However, it requires modification of the input layer and is less flexible for conditions at different resolutions. It can be combined with other injection methods (e.g., SVD 60 uses channel concatenation for spatial conditioning and cross-attention/AdaLN for global CLIP conditioning).


7.3.4 Timestep / AdaLN Conditioning

Global image embeddings can be injected by adding them to the timestep embedding, which modulates the network’s behavior through adaptive layer normalization (AdaLN) or scale-shift operations.

In UNet-based models, the timestep is typically encoded as a sinusoidal embedding, projected through an MLP, and used to modulate intermediate features via FiLM (Feature-wise Linear Modulation):

\[h = \gamma \odot \mathrm{LayerNorm}(x) + \beta\]

where

\[[\gamma, \beta] = \mathrm{MLP}(e_t + e_{\text{img}})\]

$e_t$ is the timestep embedding, and $e_{\text{img}}$ is the global image embedding.
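A minimal sketch of this modulation path, assuming the timestep and image embeddings share a dimension (the class name is ours; DiT's adaLN-Zero additionally predicts a zero-initialized gate $\alpha$, omitted here):

```python
import torch
import torch.nn as nn

class AdaLNBlock(nn.Module):
    """FiLM-style modulation from a combined timestep + global image embedding."""
    def __init__(self, d_model=768, d_embed=768):
        super().__init__()
        # Affine-free LayerNorm: scale/shift come from the conditioning MLP.
        self.norm = nn.LayerNorm(d_model, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.SiLU(), nn.Linear(d_embed, 2 * d_model))

    def forward(self, x, e_t, e_img):            # x: (B, N, d)
        gamma, beta = self.mlp(e_t + e_img).chunk(2, dim=-1)   # (B, d) each
        # Broadcast the per-sample scale/shift over the token dimension.
        return gamma.unsqueeze(1) * self.norm(x) + beta.unsqueeze(1)

block = AdaLNBlock()
x = torch.randn(2, 64, 768)                      # 64 tokens
h = block(x, torch.randn(2, 768), torch.randn(2, 768))
print(h.shape)  # torch.Size([2, 64, 768])
```

Because $\gamma$ and $\beta$ are per-sample vectors, the same modulation applies at every spatial position, which is exactly why this path carries only global information.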

DALL·E 2 33 adds the CLIP image embedding to the timestep embedding in its decoder, providing a global conditioning signal that influences all layers. Stable Video Diffusion 60 similarly concatenates the CLIP image embedding of the conditioning frame to the timestep embedding.

In DiT-based architectures 11 3, AdaLN-Zero is the primary conditioning mechanism, where the modulation parameters $[\gamma, \beta, \alpha]$ (scale, shift, gate) are predicted from the sum of timestep and conditioning embeddings. This provides an efficient way to inject global image embeddings in transformer-based diffusion models without adding cross-attention overhead.

Properties: Timestep/AdaLN injection provides a global, spatially uniform modulation. It is well-suited for holistic conditioning (overall style, identity, semantic class) but cannot convey spatially varying information. It is almost always used in combination with other injection methods.


7.3.5 ControlNet: Zero-Convolution Side Branches

ControlNet 52 introduced an influential architecture for injecting pixel-aligned spatial conditions (depth, pose, edges, segmentation) into pretrained diffusion models. The key ideas are:

  1. Trainable copy: create a trainable copy of the UNet’s encoder blocks (and middle block), initialized from the pretrained weights. This copy has the same architecture as the frozen encoder, giving it sufficient capacity to process the condition.

  2. Zero convolutions: connect the trainable copy to the frozen UNet through 1 × 1 convolution layers initialized to zero weights and biases:

\[\mathrm{ZeroConv}(x) = W_z \cdot x + b_z, \qquad W_z = 0,\; b_z = 0 \; (\text{at init})\]

This ensures that at the start of training, the ControlNet branch contributes nothing to the output, preserving the pretrained model’s generation quality.

  3. Condition injection: the spatial condition $c$ is processed by a lightweight hint network and added to the input of the trainable copy:
\[y_i = \mathcal{F}^{\text{frozen}}_i(x_i) + \mathrm{ZeroConv}_i\!\left( \mathcal{F}^{\text{trainable}}_i \bigl(x_i + \mathrm{ZeroConv}_0(c_f)\bigr) \right)\]

where $\mathcal{F}^{\text{frozen}}_i$ and $\mathcal{F}^{\text{trainable}}_i$ are the $i$-th block of the frozen and trainable encoders, and $c_f$ is the hint network output.
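The wiring pattern, including the zero-init guarantee, can be verified with toy blocks (single convs stand in for full UNet encoder stages; only the connectivity is faithful):

```python
import torch
import torch.nn as nn

def zero_conv(channels):
    """1x1 convolution initialized to zero, so the branch is silent at init."""
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

# Toy stand-ins for one frozen encoder block and its trainable copy.
frozen_block = nn.Conv2d(8, 8, 3, padding=1)
trainable_copy = nn.Conv2d(8, 8, 3, padding=1)
trainable_copy.load_state_dict(frozen_block.state_dict())  # init from frozen
zc_in, zc_out = zero_conv(8), zero_conv(8)

x = torch.randn(1, 8, 32, 32)      # block input
c_f = torch.randn(1, 8, 32, 32)    # hint-network output for the condition

y = frozen_block(x) + zc_out(trainable_copy(x + zc_in(c_f)))
# At initialization the zero convolutions silence the control branch:
assert torch.allclose(y, frozen_block(x))
```

As training proceeds, gradients flow through the zero convolutions, gradually letting the copy steer the frozen backbone.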

ControlNet has been adapted for virtually every type of spatial condition (Canny, HED, depth, pose, segmentation, normal maps, scribbles) and has been extended to SDXL 29, DiT architectures, and video models. Its success lies in the principle that a duplicate encoder with zero-initialized connections can learn to modulate a frozen model without destroying its pretrained capabilities.

Uni-ControlNet [^unicontrolnet] extends this idea to handle multiple spatial conditions simultaneously. It introduces a unified framework with a shared condition encoder and adapters for each condition type, allowing composition of multiple controls (e.g., depth + pose + edge) in a single forward pass.


7.3.6 Additive Feature Injection (T2I-Adapter)

T2I-Adapter 63 takes a more lightweight approach than ControlNet. Instead of duplicating the UNet encoder, it uses a small, independent adapter network that extracts multi-scale features from the condition image and adds them to the UNet’s intermediate features at corresponding resolutions:

\[h_i^{\text{UNet}} \leftarrow h_i^{\text{UNet}} + F_i^{\text{adapter}}(c)\]

where $h_i^{\text{UNet}}$ is the $i$-th resolution feature in the UNet encoder, and $F_i^{\text{adapter}}$ is the adapter’s output at the same resolution.
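A toy version of the adapter's feature pyramid, assuming an 8× pixel-unshuffle stem followed by strided conv stages (channel counts and depths here are illustrative, not the paper's):

```python
import torch
import torch.nn as nn

class TinyAdapter(nn.Module):
    """Multi-scale condition encoder in the spirit of T2I-Adapter:
    pixel-unshuffle downsampling plus conv stages, emitting one
    feature map per UNet resolution."""
    def __init__(self, channels=(64, 128, 256)):
        super().__init__()
        self.unshuffle = nn.PixelUnshuffle(8)     # pixel space -> latent res
        self.stem = nn.Conv2d(3 * 64, channels[0], 3, padding=1)
        self.stages = nn.ModuleList(
            nn.Conv2d(c_in, c_out, 3, stride=2, padding=1)
            for c_in, c_out in zip(channels[:-1], channels[1:]))

    def forward(self, cond):                      # (B, 3, H, W)
        feats = [self.stem(self.unshuffle(cond))]
        for stage in self.stages:
            feats.append(stage(feats[-1]))
        # Each map is added to the UNet feature at the matching resolution.
        return feats

adapter = TinyAdapter()
feats = adapter(torch.randn(1, 3, 512, 512))
print([f.shape[-1] for f in feats])  # [64, 32, 16]
```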

The adapter produces features through a series of residual blocks with pixel-unshuffle downsampling, yielding a multi-resolution feature pyramid. At ~77M parameters (vs. ControlNet’s ~361M for SD 1.5), T2I-Adapter is significantly more parameter-efficient and faster to train, while achieving competitive control quality for many condition types.

Trade-off: T2I-Adapter is less expressive than ControlNet for complex spatial conditions because it lacks the rich feature interactions that ControlNet’s duplicated encoder (with self-attention and cross-attention) provides. However, its lightweight nature makes it attractive for deployment and for composing multiple adapters simultaneously.


7.3.7 Self-Attention Augmentation (ReferenceNet)

For tasks like character animation or virtual try-on, the model must transfer detailed appearance (clothing texture, hair style, facial features) from a reference image to a novel pose. This requires more fine-grained feature transfer than cross-attention with a global or short-sequence embedding can provide.

Animate Anyone 53 introduces a ReferenceNet: a full copy of the UNet that processes the reference image (without noise) and extracts self-attention features. These features are then injected into the main UNet’s self-attention layers by concatenating the reference key/value tensors with those of the main network:

\[K_{\text{aug}} = [K_{\text{self}} \,;\, K_{\text{ref}}], \qquad V_{\text{aug}} = [V_{\text{self}} \,;\, V_{\text{ref}}]\] \[\mathrm{SelfAttn}_{\text{aug}}(z) = \mathrm{softmax}\!\left( \frac{Q \cdot K_{\text{aug}}^T}{\sqrt{d_k}} \right)V_{\text{aug}}\]

This allows each spatial position in the generation process to attend to every spatial position in the reference image, enabling dense, spatially-flexible appearance transfer. Unlike cross-attention with CLIP features (which abstracts away spatial detail), self-attention augmentation preserves texture-level information.
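The key/value concatenation can be expressed as a small function; tensor shapes and names here are illustrative, assuming single-head attention on already-projected tokens:

```python
import torch

def augmented_self_attention(q, k, v, k_ref, v_ref):
    """Self-attention whose keys/values are extended with ReferenceNet features.

    q, k, v:      (B, N, d) projections of the main UNet's tokens
    k_ref, v_ref: (B, M, d) projections of the reference image's tokens
    Every generated token can attend to every spatial position of the reference.
    """
    k_aug = torch.cat([k, k_ref], dim=1)  # (B, N+M, d)
    v_aug = torch.cat([v, v_ref], dim=1)
    scores = q @ k_aug.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    return torch.softmax(scores, dim=-1) @ v_aug  # (B, N, d)
```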

The ReferenceNet approach is powerful but computationally expensive (it doubles the memory and compute for the UNet forward pass). It has been adopted in several character animation and virtual try-on systems where appearance fidelity is critical.


7.3.8 Gated Self-Attention (GLIGEN)

GLIGEN 71 addresses grounded generation — generating images with specific objects at specified spatial locations (bounding boxes). It introduces new gated self-attention layers into each UNet block:

\[v = \mathrm{SelfAttn}(z, [z \,;\, h_{\text{entity}}])\] \[z \leftarrow z + \tanh(\gamma)\cdot v\]

where $h_{\text{entity}}$ are grounding tokens encoding entity information (class, caption, or visual features) with spatial position embeddings (from bounding box coordinates), and $\gamma$ is a learnable gate initialized to zero.

The zero-initialized gate ensures that the pretrained model is preserved at the start of training (similar to ControlNet’s zero convolutions). The gating mechanism allows the model to gradually learn to incorporate spatial layout information during training.
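A minimal sketch of the gated layer, using `nn.MultiheadAttention` as a stand-in for GLIGEN's actual attention implementation (names and dimensions are illustrative):

```python
import torch
import torch.nn as nn

class GatedSelfAttention(nn.Module):
    """GLIGEN-style gated self-attention over [visual ; grounding] tokens."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gamma = nn.Parameter(torch.zeros(1))  # zero-initialized gate

    def forward(self, z: torch.Tensor, h_entity: torch.Tensor) -> torch.Tensor:
        ctx = torch.cat([z, h_entity], dim=1)   # [z ; h_entity]
        v, _ = self.attn(z, ctx, ctx)           # queries come from visual tokens only
        return z + torch.tanh(self.gamma) * v   # identity map when gamma == 0
```

Since `tanh(0) = 0`, the layer is exactly the identity at initialization, which is what preserves the pretrained model's behavior when the new layers are first inserted.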

While GLIGEN’s primary use case is layout-to-image generation with bounding boxes, the gated self-attention mechanism is general and can be adapted for other types of spatially-localized conditions.


7.3.9 Token Prepending in Transformer Architectures

With the shift from UNet to Diffusion Transformer (DiT) 1 and MM-DiT 3 architectures, new injection paradigms have emerged. In MM-DiT (used by SD3 3 and FLUX), text tokens and noisy image patch tokens are concatenated into a single sequence and processed through joint self-attention:

\[[z_{\text{out}}^{\text{img}} \,;\, z_{\text{out}}^{\text{text}}] = \mathrm{JointAttention}([z^{\text{img}} \,;\, z^{\text{text}}])\]

This architecture provides a natural injection point for image conditioning: prepend or append image condition tokens to the text token sequence, so they participate in the joint attention:

\[[z_{\text{out}}^{\text{img}} \,;\, z_{\text{out}}^{\text{text}} \,;\, z_{\text{out}}^{\text{cond}}] = \mathrm{JointAttention}([z^{\text{img}} \,;\, z^{\text{text}} \,;\, e_{\text{img}}])\]

where $e_{\text{img}}$ are projected image embedding tokens (e.g., from SigLIP or CLIP). This allows natural multi-modal interaction among all conditioning signals — text tokens, image condition tokens, and noisy image tokens all attend to each other.

This approach is architecturally clean: no additional modules, attention layers, or side branches are needed. The image tokens simply become part of the conditioning context. The FLUX Redux image prompt adapter uses this strategy, projecting SigLIP image features through a learned MLP and concatenating them with text tokens for joint attention.
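The projection-and-concatenation pattern can be sketched as follows; module names, the MLP shape, and all dimensions are illustrative assumptions rather than the actual FLUX Redux implementation:

```python
import torch
import torch.nn as nn

class ImagePromptProjector(nn.Module):
    """Project image-encoder features (e.g., SigLIP-like) into model-dim tokens."""
    def __init__(self, vision_dim: int, model_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, model_dim),
            nn.SiLU(),
            nn.Linear(model_dim, model_dim),
        )

    def forward(self, img_feats: torch.Tensor) -> torch.Tensor:
        return self.proj(img_feats)  # (B, M, model_dim) condition tokens

def joint_sequence(img_tokens, text_tokens, cond_tokens):
    """Build [z_img ; z_text ; e_img]: one sequence for joint self-attention,
    so all conditioning signals attend to each other with no extra modules."""
    return torch.cat([img_tokens, text_tokens, cond_tokens], dim=1)
```

Downstream, the backbone runs its usual joint attention over this longer sequence; only the image-token slice of the output is carried forward as the denoising prediction.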

For spatial conditions in DiT architectures, ControlNet-style approaches can be adapted by creating a trainable copy of a subset of transformer blocks, analogous to the UNet encoder copy in the original ControlNet. The spatial condition features are then added to the intermediate representations of the frozen transformer blocks via zero-initialized linear layers.


7.3.10 Matching Image Types to Injection Methods

Different types of image conditions carry fundamentally different types of information, and this dictates the most appropriate injection strategy:

| Image Condition | Key Information | Optimal Granularity | Primary Injection Method | Representative Work |
|---|---|---|---|---|
| Reference image (style/content) | Holistic semantics | Global or token sequence | Decoupled cross-attention | IP-Adapter 51 |
| Reference image (subject) | Subject appearance | Token sequence | Cross-attention (Q-Former) | BLIP-Diffusion 55 |
| Face image (identity) | Identity features | Global (ID emb) + spatial (landmarks) | Cross-attention + ControlNet | InstantID 57 |
| Face image (identity) | Identity features | Token sequence | Cross-attention + token merging | PhotoMaker 58 |
| Depth map | 3D scene structure | Multi-scale spatial | ControlNet / T2I-Adapter | ControlNet 52 |
| Pose / skeleton | Body articulation | Multi-scale spatial | ControlNet / T2I-Adapter | ControlNet 52 |
| Edge map (Canny/HED) | Contour structure | Multi-scale spatial | ControlNet / T2I-Adapter | ControlNet 52 |
| Segmentation map | Region semantics | Multi-scale spatial | ControlNet | ControlNet 52 |
| Normal map | Surface geometry | Multi-scale spatial | ControlNet | ControlNet 52 |
| Source image (editing) | Full pixel content | Spatial latent | Channel concatenation | InstructPix2Pix 67 |
| Source image (video init) | Full pixel content + semantics | Spatial latent + global | Channel concat + AdaLN | SVD 60 |
| Layout (bounding boxes) | Spatial positions | Per-region tokens | Gated self-attention | GLIGEN 71 |
| Reference (appearance transfer) | Fine-grained texture | Multi-scale self-attn features | Self-attention augmentation | Animate Anyone 53 |

Several guiding principles emerge:

Principle 1: Spatial conditions require spatially-aligned injection. Depth maps, pose skeletons, edge maps, and segmentation maps all carry information that is meaningful only in spatial correspondence with the generated image. These conditions are best served by ControlNet’s zero-convolution side branches or T2I-Adapter’s additive feature injection, both of which maintain explicit spatial alignment across multiple resolutions.

Principle 2: Semantic conditions benefit from attention-based injection. When the conditioning image provides style, content, or identity guidance without strict spatial alignment, cross-attention injection is preferred because it allows the model to selectively attend to relevant aspects of the conditioning signal. The decoupled cross-attention of IP-Adapter 51 is particularly effective because it preserves compatibility with text conditioning.

Principle 3: Pixel-level conditions favor channel concatenation. When the conditioning image is a pixel-aligned source that the model should reference at full spatial resolution (e.g., for editing or video conditioning), channel concatenation provides the most direct pathway. This ensures the conditioning signal enters the network at the earliest possible stage and remains spatially registered throughout.

Principle 4: Appearance transfer at texture level requires self-attention augmentation. When the task demands transferring fine-grained textures and patterns from a reference to a novel configuration (e.g., dressing a character in a new pose), methods like ReferenceNet 53 that augment self-attention are necessary because they enable dense spatial correspondence between reference and target features.

Principle 5: Multi-mechanism combination is often necessary. Many practical systems combine multiple injection methods. InstantID 57 uses cross-attention for the global identity embedding and ControlNet for facial landmark structure. SVD 60 uses channel concatenation for the full conditioning frame and AdaLN for the global CLIP embedding. Composer 65 decomposes generation into multiple composable conditions, each injected through appropriate mechanisms (cross-attention for global attributes, spatial injection for layout conditions).


8. References

  1. Peebles W, Xie S. Scalable diffusion models with transformers[C]//Proceedings of the IEEE/CVF international conference on computer vision. 2023: 4195-4205.

  2. Rombach R, Blattmann A, Lorenz D, et al. High-Resolution Image Synthesis with Latent Diffusion Models[C]. CVPR, 2022.

  3. Esser P, Kulal S, Blattmann A, et al. Scaling Rectified Flow Transformers for High-Resolution Image Synthesis[C]. ICML, 2024.

  4. Black Forest Labs. FLUX.1 Technical Report[R]. https://blackforestlabs.ai, 2024.

  5. Li Z, Zhang J, Lin Q, et al. Hunyuan-dit: A powerful multi-resolution diffusion transformer with fine-grained chinese understanding[J]. arXiv preprint arXiv:2405.08748, 2024.

  6. Yang Z, Teng J, Zheng W, et al. CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer[J]. arXiv preprint arXiv:2408.06072, 2024.

  7. Kong W, Tian Q, Zhang Z, et al. Hunyuanvideo: A systematic framework for large video generative models[J]. arXiv preprint arXiv:2412.03603, 2024.

  8. Ma N, Goldstein M, Albergo M S, et al. Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers[C]//European Conference on Computer Vision. Cham: Springer Nature Switzerland, 2024: 23-40.

  9. Bao F, Nie S, Xue K, et al. All are worth words: A vit backbone for diffusion models[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2023: 22669-22679.

  10. Gao S, Zhou P, Cheng M M, et al. Masked diffusion transformer is a strong image synthesizer[C]//Proceedings of the IEEE/CVF international conference on computer vision. 2023: 23164-23173.

  11. Chen J, Yu J, Ge C, et al. PixArt-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis[J]. arXiv preprint arXiv:2310.00426, 2023.

  12. Dosovitskiy A, Beyer L, Kolesnikov A, et al. An image is worth 16x16 words: Transformers for image recognition at scale[J]. arXiv preprint arXiv:2010.11929, 2020.

  13. Dehghani M, Mustafa B, Djolonga J, et al. Patch n’pack: Navit, a vision transformer for any aspect ratio and resolution[J]. Advances in Neural Information Processing Systems, 2023, 36: 2252-2274.

  14. Crowson K, Baumann S A, Birch A, et al. Scalable high-resolution pixel-space image synthesis with hourglass diffusion transformers[C]//Forty-first International Conference on Machine Learning. 2024.

  15. Lu Z, Wang Z, Huang D, et al. Fit: Flexible vision transformer for diffusion model[J]. arXiv preprint arXiv:2402.12376, 2024.

  16. Ma X, Wang Y, Chen X, et al. Latte: Latent diffusion transformer for video generation[J]. arXiv preprint arXiv:2401.03048, 2024.

  17. Chen J, Ge C, Xie E, et al. Pixart-σ: Weak-to-strong training of diffusion transformer for 4k text-to-image generation[C]//European Conference on Computer Vision. Cham: Springer Nature Switzerland, 2024: 74-91.

  18. Gao Z, Pan L, Xie E, et al. Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers[J]. arXiv preprint arXiv:2405.05945, 2024.

  19. Press O, Smith N A, Lewis M. Train short, test long: Attention with linear biases enables input length extrapolation[J]. arXiv preprint arXiv:2108.12409, 2021.

  20. Ho J, Jain A, Abbeel P. Denoising Diffusion Probabilistic Models[C]. NeurIPS, 2020.

  21. Song Y, Sohl-Dickstein J, Kingma D P, et al. Score-Based Generative Modeling through Stochastic Differential Equations[C]. ICLR, 2021.

  22. Lipman Y, Chen R T Q, Ben-Hamu H, et al. Flow matching for generative modeling[J]. arXiv preprint arXiv:2210.02747, 2022.

  23. Liu X, Gong C, Liu Q. Flow straight and fast: Learning to generate and transfer data with rectified flow[J]. arXiv preprint arXiv:2209.03003, 2022.

  24. Karras T, Aittala M, Aila T, et al. Elucidating the design space of diffusion-based generative models[J]. Advances in neural information processing systems, 2022, 35: 26565-26577.

  25. Karras T, Aittala M, Lehtinen J, et al. Analyzing and improving the training dynamics of diffusion models[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024: 24174-24184.

  26. Hoogeboom E, Heek J, Salimans T. simple diffusion: End-to-end diffusion for high resolution images[C]//International Conference on Machine Learning. PMLR, 2023: 13213-13232.

  27. Saharia C, Chan W, Saxena S, et al. Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding[C]. NeurIPS, 2022.

  28. Dhariwal P, Nichol A. Diffusion models beat gans on image synthesis[J]. Advances in neural information processing systems, 2021, 34: 8780-8794.

  29. Podell D, English Z, Lacey K, et al. SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis[C]. ICLR, 2024.

  30. Lin S, Liu B, Li J, et al. Common diffusion noise schedules and sample steps are flawed[C]//Proceedings of the IEEE/CVF winter conference on applications of computer vision. 2024: 5404-5411.

  31. Song Y, Dhariwal P, Chen M, et al. Consistency models[J]. 2023.

  32. Nichol A, Dhariwal P, Ramesh A, et al. GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models[C]. ICML, 2022.

  33. Ramesh A, Dhariwal P, Nichol A, et al. Hierarchical Text-Conditional Image Generation with CLIP Latents[J]. arXiv preprint arXiv:2204.06125, 2022.

  34. Radford A, Kim J W, Hallacy C, et al. Learning Transferable Visual Models From Natural Language Supervision[C]. ICML, 2021.

  35. Raffel C, Shazeer N, Roberts A, et al. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer[J]. Journal of Machine Learning Research, 2020.

  36. Balaji Y, Nah S, Huang X, et al. eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers[J]. arXiv preprint arXiv:2211.01324, 2022.

  37. Chen J, Yu J, Ge C, et al. PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis[C]. ICLR, 2024.

  38. Chen J, Yu J, Ge C, et al. PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation[J]. arXiv preprint arXiv:2403.04692, 2024.

  39. Li Z, Chen J, Wang Y, et al. Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding[J]. arXiv preprint arXiv:2405.08748, 2024.

  40. Kolors Team. Kolors: Effective Training of Diffusion Model for Photorealistic Text-to-Image Synthesis[R]. Kuaishou Technology Technical Report, 2024.

  41. Xie E, Yao J, Chen J, et al. Sana: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformer[J]. arXiv preprint arXiv:2410.10629, 2024.

  42. Betker J, Goh G, Jing L, et al. Improving Image Generation with Better Captions[R]. OpenAI Technical Report, 2023.

  43. Ho J, Salimans T. Classifier-Free Diffusion Guidance[J]. arXiv preprint arXiv:2207.12598, 2022.

  44. Hertz A, Mokady R, Tenenbaum J, et al. Prompt-to-Prompt Image Editing with Cross Attention Control[C]. ICLR, 2023.

  45. Ruiz N, Li Y, Jampani V, et al. DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation[C]. CVPR, 2023.

  46. Gal R, Alaluf Y, Atzmon Y, et al. An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion[C]. ICLR, 2023.

  47. Hu E J, Shen Y, Wallis P, et al. LoRA: Low-Rank Adaptation of Large Language Models[C]. ICLR, 2022.

  48. Brooks T, Peebles B, Holmes C, et al. Video Generation Models as World Simulators[R]. OpenAI Technical Report, 2024.

  49. Polyak A, Zohar A, Brown A, et al. Movie Gen: A Cast of Media Foundation Models[J]. arXiv preprint arXiv:2410.13720, 2024.

  50. Karras T, Aittala M, Kynkäänniemi T, et al. Guiding a Diffusion Model with a Bad Version of Itself[C]. NeurIPS, 2024.

  51. Ye H, Zhang J, Liu S, et al. Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models[J]. arXiv preprint arXiv:2308.06721, 2023.

  52. Zhang L, Rao A, Agrawala M. Adding conditional control to text-to-image diffusion models[C]//Proceedings of the IEEE/CVF international conference on computer vision. 2023: 3836-3847.

  53. Hu L. Animate anyone: Consistent and controllable image-to-video synthesis for character animation[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2024: 8153-8163.

  54. Xu X, Wang Z, Zhang G, et al. Versatile diffusion: Text, images and variations all in one diffusion model[C]//Proceedings of the IEEE/CVF international conference on computer vision. 2023: 7754-7765.

  55. Li D, Li J, Hoi S. Blip-diffusion: Pre-trained subject representation for controllable text-to-image generation and editing[J]. Advances in Neural Information Processing Systems, 2023, 36: 30146-30166.

  56. Lee W, Lee D, Choi E, et al. Elite: Enhanced language-image toxicity evaluation for safety[J]. arXiv preprint arXiv:2502.04757, 2025.

  57. Wang Q, Bai X, Wang H, et al. Instantid: Zero-shot identity-preserving generation in seconds[J]. arXiv preprint arXiv:2401.07519, 2024.

  58. Li Z, Cao M, Wang X, et al. Photomaker: Customizing realistic human photos via stacked id embedding[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2024: 8640-8650.

  59. Yang B, Gu S, Zhang B, et al. Paint by example: Exemplar-based image editing with diffusion models[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2023: 18381-18391.

  60. Blattmann A, Dockhorn T, Kulal S, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets[J]. arXiv preprint arXiv:2311.15127, 2023.

  61. Chen M, Cui L, Zhang W, et al. Midas: Multimodal interactive digital-human synthesis via real-time autoregressive video generation[J]. arXiv preprint arXiv:2508.19320, 2025.

  62. Yang L, Kang B, Huang Z, et al. Depth anything: Unleashing the power of large-scale unlabeled data[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2024: 10371-10381.

  63. Mou C, Wang X, Xie L, et al. T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models[C]//Proceedings of the AAAI conference on artificial intelligence. 2024, 38(5): 4296-4304.

  64. Cao Z, Hidalgo G, Simon T, et al. Openpose: Realtime multi-person 2d pose estimation using part affinity fields[J]. IEEE transactions on pattern analysis and machine intelligence, 2019, 43(1): 172-186.

  65. Huang L, Chen D, Liu Y, et al. Composer: Creative and controllable image synthesis with composable conditions[J]. arXiv preprint arXiv:2302.09778, 2023.

  66. Meng C, He Y, Song Y, et al. Sdedit: Guided image synthesis and editing with stochastic differential equations[J]. arXiv preprint arXiv:2108.01073, 2021.

  67. Brooks T, Holynski A, Efros A A. Instructpix2pix: Learning to follow image editing instructions[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2023: 18392-18402.

  68. Xiao S, Wang Y, Zhou J, et al. Omnigen: Unified image generation[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2025: 13294-13304.

  69. Deng J, Guo J, Xue N, et al. Arcface: Additive angular margin loss for deep face recognition[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2019: 4690-4699.

  70. Li J, Li D, Savarese S, et al. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models[C]//International conference on machine learning. PMLR, 2023: 19730-19742.

  71. Li Y, Liu H, Wu Q, et al. Gligen: Open-set grounded text-to-image generation[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2023: 22511-22521.