Diffusion Architectures Part I: Stability-Oriented Designs


In our previous article, we examined how diffusion training stability can be improved from the perspective of loss functions and optimization strategies. Yet, stability is not only a matter of training schedules or objectives — the network architecture itself plays a decisive role in determining whether gradients vanish, explode, or propagate smoothly across noise levels. In this article, we turn to the architectural dimension of stability, comparing U-Net and Transformer-based (DiT) backbones. We highlight how residual pathways, skip connections, and normalization schemes can either stabilize or destabilize training, and present design principles that have proven essential for scaling diffusion models to billions of parameters. Understanding these stability-oriented architectural choices is critical for both researchers aiming to push model robustness and practitioners deploying diffusion systems in real-world settings.


1. Why Architecture Matters for Stability

Network architecture is more than a vessel for function approximation in diffusion models — it is the key component that determines whether training succeeds or fails.


1.1 Gradient Flow, Conditioning, and Stability

The central difficulty of diffusion training lies in its objective: a single network must operate effectively under wildly different conditions. At each training step, the model is fed an input $x_t$ with a different signal-to-noise ratio. This creates a dual-front war for gradient stability:

  • At low-noise levels: The input is almost a clean image. The network’s task is subtle refinement. Here, the risk is gradient explosion. A small error in the network’s output can lead to a massive loss gradient, as the model is penalized heavily for deviating even slightly from the ground truth. An unstable architecture will amplify these gradients, causing divergent training steps.

  • At high-noise levels: The input is nearly pure Gaussian noise. The network’s task is to perceive the faintest hint of structure and begin the process of hallucination. Here, the risk is gradient vanishing. The underlying signal is so weak that gradients can easily diminish to zero as they propagate through a deep network, effectively halting learning for these crucial early stages of generation.

An effective diffusion architecture must therefore act as a precision instrument, maintaining well-behaved gradients across this entire spectrum. This is where architectural choices like residual scaling, normalization layers, and skip connections become critical. They are not just performance-enhancing modules; they are the fundamental regulators of gradient flow, ensuring the network is neither too aggressive at low noise levels nor too passive at high ones.


1.2 Balancing Capacity vs. Robustness

A second challenge lies in the tension between capacity (the ability of the architecture to represent complex distributions) and robustness (the ability to generalize under noisy, unstable conditions).

  • Early U-Net designs offered robustness through simplicity and skip connections, but limited capacity for scaling.
  • Transformer-based diffusion models (DiT, MMDiT-X) introduced massive representational power, but at the cost of more fragile training dynamics.
  • Newer architectures explore hybrid or modular designs — combining convolutional inductive biases, residual pathways, and attention — to find a stable equilibrium between these two competing goals.

1.3 Architecture–Noise Schedule Coupling

Finally, the stability of diffusion training cannot be isolated from the noise schedule. Architectural design interacts tightly with how noise levels are distributed and parameterized:

  • A model with time-dependent normalization layers may remain stable under variance-preserving schedules but collapse under variance-exploding ones.
  • EDM (Elucidated Diffusion Models) highlight that architecture and preconditioning must be co-designed with the training noise distribution, rather than treated as independent modules.

This coupling implies that progress in diffusion training stability comes not only from better solvers or schedules, but from holistic architectural design that accounts for gradient dynamics, representation capacity, and their interplay with noise parameterization.


2. Evolution of Diffusion Architectures

The architectural journey of diffusion models mirrors the evolution of deep learning itself: from simple convolutional backbones to large-scale Transformers, and now toward specialized multi-modal and efficiency-driven designs. Each stage has sought to reconcile two opposing pressures — increasing representational power while preserving training stability. In this section, we trace this trajectory across six key phases.


2.1 Classical U-Net Foundations

The U-Net architecture 1 is the canonical backbone of early diffusion models such as DDPM 2. Although originally proposed for biomedical image segmentation 1, its encoder–decoder structure with skip connections turned out to be perfectly suited for denoising across different noise levels. The elegance of U-Net lies not only in its symmetry, but also in how it balances global context extraction with local detail preservation. A typical U-Net structure used for training diffusion models is shown below.

U-Net Architecture

where ResB denotes a residual block consisting of multiple “norm-act-conv2d” layers, and Attent denotes a self-attention block.


2.1.1 Encoder: From Local Features to Global Context

The encoder path consists of repeated convolutional residual blocks and downsampling operations (e.g., strided convolutions or pooling). As the spatial resolution decreases and channel width expands, the network progressively shifts its representational emphasis:

  • High-resolution feature maps (early layers) capture fine-grained local structures — edges, textures, and small patterns that are critical when denoising images at low noise levels.
  • Low-resolution feature maps (deeper layers) aggregate global context — object shapes, spatial layout, and long-range dependencies. This is especially important at high noise levels, when much of the local structure has been destroyed and only global semantic cues can guide reconstruction.

Thus, the encoder effectively builds a multi-scale hierarchy of representations, transitioning from local to global as resolution decreases.


2.1.2 Bottleneck: Abstract Representation

At the center lies the bottleneck block, where feature maps have the smallest spatial size but the largest channel capacity. This stage acts as the semantic aggregator:

  • It condenses the global context extracted from the encoder.
  • It often includes attention layers (in later refinements) to explicitly model long-range interactions. In the classical U-Net used by DDPM, the bottleneck is still purely convolutional, yet it already plays the role of a semantic “bridge” between encoding and decoding.

2.1.3 Decoder: Reconstructing Local Detail under Global Guidance

The decoder path mirrors the encoder, consisting of upsampling operations followed by convolutional residual blocks. The role of the decoder is not merely to increase resolution, but to inject global semantic context back into high-resolution predictions:

  • Upsampling layers expand the spatial resolution but initially lack fine detail.
  • Skip connections from the encoder reintroduce high-frequency local features (edges, boundaries, textures) that would otherwise be lost in downsampling.
  • By concatenating or adding these skip features to the decoder inputs, the network fuses global context (from low-res encoder features) with local precision (from high-res encoder features).

This synergy ensures that the denoised outputs are both semantically coherent and visually sharp.


2.1.4 Timestep Embedding and Conditioning

Unlike the U-Net’s original role in segmentation, a diffusion U-Net must also be conditioned on the diffusion timestep $t$, since the network’s task changes continuously as noise levels vary. In the classical DDPM implementation, this conditioning is realized in a relatively simple but effective way:

  1. Sinusoidal embedding. Each integer timestep $t$ is mapped to a high-dimensional vector using sinusoidal position encodings (analogous to Transformers), ensuring that different timesteps are represented as distinct, smoothly varying signals.

  2. MLP transformation. The sinusoidal embedding is passed through a small multilayer perceptron (usually two linear layers with a SiLU activation) to produce a richer time embedding vector $\mathbf{z}_t$.

  3. Additive injection into residual blocks. In every residual block of the U-Net, $\mathbf{z}_t$ is projected to match the number of feature channels and then added as a bias term to the intermediate activations (typically after the first convolution).

This additive conditioning allows each residual block to adapt its computation based on the current noise level, without introducing extra normalization or complex modulation. The following figure shows how the timestep $t$ is injected into each residual block.

time embedding injection
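To make this concrete, here is a minimal PyTorch sketch of the DDPM-style conditioning described above: a sinusoidal timestep embedding, a small MLP, and additive injection into a residual block. Module and variable names are illustrative, not taken from any particular codebase.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def sinusoidal_embedding(t, dim):
    """Map timesteps t (shape [B]) to sinusoidal vectors of shape [B, dim]."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, dtype=torch.float32) / half)
    args = t.float()[:, None] * freqs[None, :]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)

class TimeMLP(nn.Module):
    """Two linear layers with SiLU, producing the richer time embedding z_t."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, dim))

    def forward(self, t_emb):
        return self.net(t_emb)

class ResBlockAdditive(nn.Module):
    """Residual block with DDPM-style additive time conditioning.

    `channels` is assumed divisible by 32 (the GroupNorm group count).
    """
    def __init__(self, channels, t_dim):
        super().__init__()
        self.norm1 = nn.GroupNorm(32, channels)
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.t_proj = nn.Linear(t_dim, channels)     # project z_t to the channel width
        self.norm2 = nn.GroupNorm(32, channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x, z_t):
        h = self.conv1(F.silu(self.norm1(x)))
        h = h + self.t_proj(z_t)[:, :, None, None]   # additive injection: per-channel bias
        h = self.conv2(F.silu(self.norm2(h)))
        return x + h
```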

However, additive injection suffers from several inherent limitations, making it rarely used in modern state-of-the-art models. By simply adding conditional embeddings to the intermediate features, it only provides a uniform shift across all dimensions, which restricts its expressiveness. This coarse modulation often leads to an information bottleneck, as rich and structured conditions (e.g., long text prompts or spatial guidance) cannot be aligned effectively with image features. Moreover, additive injection lacks adaptability to the varying statistics across different layers of the network, which can cause instability during training and suboptimal conditioning performance. We will discuss more sophisticated injection strategies in Section 4.1 and Section 5.1.


2.1.5 Why U-Net Works Well for Diffusion

In diffusion training, inputs vary drastically in signal-to-noise ratio:

  • At low noise levels, local details still survive; skip connections ensure these details propagate to the output.
  • At high noise levels, local detail is destroyed; the decoder relies more on global semantics from the bottleneck.
  • Across all levels, the encoder–decoder interaction guarantees that both local fidelity and global plausibility are preserved.

This explains why U-Nets became the default backbone: their multi-scale design matches perfectly with the multi-scale nature of noise in diffusion models. Later improvements (attention layers, latent-space U-Nets, Transformer backbones) all build upon this foundation, but the core idea remains: stability in diffusion training emerges from balanced local–global feature fusion.


2.2 ADM Improvements (Ablated Diffusion Models)

While the classical U-Net backbone of DDPM demonstrated the feasibility of diffusion-based generation, it was still limited in stability and scalability. In the landmark work “Diffusion Models Beat GANs on Image Synthesis” 3, the authors performed extensive ablations to identify which architectural and training choices were critical at ImageNet scale. The resulting recipe is commonly referred to as ADM (Ablated Diffusion Models). Rather than introducing a single new module, ADM represents a carefully engineered upgrade to the baseline U-Net, designed to balance capacity, conditioning, and stability.


2.2.1 Scaling the U-Net: Wider Channels and Deeper Residual Blocks

The most straightforward but highly effective change was scaling up the model. The ADM UNet is significantly larger than the one used in the original DDPM paper.

  • Wider Channels: The base channel count was increased (e.g., from 128 to 256), and the channel multipliers for deeper layers were adjusted, resulting in a much wider network.
  • More Residual Blocks: The number of residual blocks per resolution level was increased, making the network deeper.

Why it helps: A larger model capacity allows the network to learn more complex and subtle details of the data distribution, leading to a direct improvement in sample fidelity.


2.2.2 Multi-Resolution Self-Attention

While DDPM’s UNet used self-attention, it was typically applied only at a single, low resolution (e.g., 16x16). ADM recognized that long-range dependencies are important at various scales.

In ADM, Self-attention blocks were added at multiple resolutions (e.g., 32x32, 16x16, and 8x8). Additionally, the number of attention heads was increased.

  • Attention at higher resolutions (32x32) helps capture relationships between medium-sized features and textures;
  • Attention at lower resolutions (8x8) helps coordinate the global structure and semantic layout of the image.

Why it helps: This multi-scale approach gives the model a more holistic understanding of the image, preventing structural inconsistencies and improving overall coherence.


2.2.3 Conditioning via Adaptive Group Normalization (AdaGN)

This is arguably the most significant architectural contribution of ADM. It fundamentally changes how conditional information (like timesteps and class labels) is integrated into the network.

  • In DDPM: The time embedding was processed by an MLP and then simply added to the feature maps within each residual block. This acts as a global bias, which is a relatively weak form of conditioning.

  • In ADM (AdaGN) 4: The model learns to modulate the activations using the conditional information. The process is as follows:

    a) The timestep embedding and the class embedding (for class-conditional models) are combined into a single conditioning vector.
    b) This vector is passed through a linear layer to predict two new vectors: a scale ($\gamma$) and a shift ($\beta$) parameter for each channel.
    c) Within each residual block, the feature map undergoes Group Normalization, and then its output is modulated by these predicted parameters.

adagn

Why it helps: Modulation is a much more powerful mechanism than addition. It allows the conditional information to control the mean and variance of each feature map on a channel-by-channel basis. This gives the model fine-grained control over the generated features, dramatically improving its ability to adhere to the given conditions (i.e., generating a specific class at a specific noise level).
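To make the mechanism concrete, below is a hedged PyTorch sketch of an AdaGN layer: the combined timestep/class embedding predicts a per-channel scale and shift that modulate the GroupNorm output. The `(1 + gamma)` form is a common initialization-friendly choice and an assumption here, as are the layer names.

```python
import torch
import torch.nn as nn

class AdaGN(nn.Module):
    """GroupNorm whose per-channel scale/shift are predicted from a conditioning vector."""
    def __init__(self, channels, cond_dim, groups=32):
        super().__init__()
        self.norm = nn.GroupNorm(groups, channels, affine=False)
        self.to_scale_shift = nn.Linear(cond_dim, 2 * channels)

    def forward(self, x, cond):
        # cond: combined timestep + class embedding, shape [B, cond_dim]
        gamma, beta = self.to_scale_shift(cond).chunk(2, dim=-1)
        gamma = gamma[:, :, None, None]   # broadcast over H and W
        beta = beta[:, :, None, None]
        # (1 + gamma) keeps the layer close to plain GroupNorm at initialization (assumption)
        return (1 + gamma) * self.norm(x) + beta
```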


2.2.4 BigGAN-inspired Residual Blocks for Up/Downsampling

ADM also identifies that the choice of downsampling and upsampling operations affects stability.

  • In DDPM: Downsampling might be a simple pooling (max pooling or mean pooling) or strided convolution, and upsampling might be a standard upsample layer followed by a convolution.

  • In ADM: The upsampling and downsampling operations were integrated into specialized residual blocks (as shown below), a design inspired by the highly successful BigGAN architecture 5. This ensures that information flows more smoothly as the resolution changes, minimizing information loss.

biggan-style residual down block

It favors strided convolutions for downsampling, and nearest-neighbor interpolation followed by a convolution for upsampling.

Why it helps: This leads to better preservation of features across different scales, contributing to sharper and more detailed final outputs.


2.2.5 Rescaling of Residual Connections

For very deep networks, it’s crucial to maintain well-behaved activations. ADM introduced a simple but effective trick: the output of each residual block was scaled by a constant factor of $1/\sqrt{2}$ before being added back to the skip connection.

Why it helps: This technique helps to balance the variance contribution from the skip connection and the residual branch, preventing the signal from exploding in magnitude as it passes through many layers. This improves training stability for very deep models.


2.2.6 Why ADM Matters for Stability

In conclusion, relative to the classical DDPM U-Net, the ADM U-Net is a masterclass in architectural refinement. By systematically enhancing every major component—from its overall scale to the precise mechanism of conditional injection—it provided the powerful backbone necessary for diffusion models to finally surpass GANs in image synthesis quality.


2.3 Latent U-Net: The Efficiency Revolution with Stable Diffusion

While the ADM architecture (Section 2.2) marked the pinnacle of pixel-space diffusion models, achieving state-of-the-art quality by meticulously refining the U-Net, it faced a significant and inherent limitation: computational cost. Training and running diffusion models directly on high-resolution images (e.g., 512x512 or 1024x1024) is incredibly demanding in terms of both memory and processing power. The U-Net must process massive tensors at every denoising step, making the process slow and resource-intensive.

The introduction of Latent Diffusion Models (LDMs) 6, famously realized in Stable Diffusion, proposed a revolutionary solution: instead of performing the expensive diffusion process in the high-dimensional pixel space, why not perform it in a much smaller, perceptually equivalent latent space? This insight effectively decouples the task of perceptual compression from the generative learning process, leading to a massive leap in efficiency and accessibility.


2.3.1 The Core Idea: Diffusion in a Compressed Latent Space

The training architecture of LDM is a two-stage process.

Stage 1: Perceptual Compression. A powerful, pretrained Variational Autoencoder (VAE) is trained to map high-resolution images into a compact latent representation and back. The encoder, $E$, compresses an image $x$ into a latent vector $z = E(x)$. The decoder, $D$, reconstructs the image from the latent, \(\tilde x = D(z)\). Crucially, this is not just any compression; it is perceptual compression, meaning the VAE is trained to discard high-frequency details that are imperceptible to the human eye while preserving critical semantic and structural information.

Stage 2: Latent Space Diffusion. Instead of training a U-Net on images $x$, we train it on the latent codes $z$. The forward diffusion process adds noise to $z$ to get $z_t$, and the U-Net’s task is to predict the noise in this latent space.

The impact of this shift is dramatic. A 512x512x3 pixel image (786,432 dimensions) can be compressed by the VAE into a 64x64x4 latent tensor (16,384 dimensions)—a 48x reduction in dimensionality. The U-Net now operates on these much smaller tensors, enabling faster training and significantly lower inference requirements.

The full generative (inference) process for a text-to-image model like Stable Diffusion is as follows:

  • Stage 1: Text Prompt $\to$ Text Encoder $\to$ Conditioning Vector $c$.
  • Stage 2: Random Noise $z_T$ $\to$ U-Net Denoising Loop in Latent Space, conditioned on $c$ $\to$ Clean Latent \(z_0\).
  • Stage 3: Clean Latent $z_0$ $\to$ VAE Decoder $\to$ Final Image $x$.
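A compact sketch of this three-stage flow is shown below. The `text_encoder`, `unet`, `vae`, and `scheduler` objects are assumed to be pretrained/preconfigured components passed in by the caller; their method names are illustrative placeholders rather than a specific library API.

```python
import torch

@torch.no_grad()
def generate(prompt, text_encoder, unet, vae, scheduler, steps=50):
    """Sketch of the three-stage Stable-Diffusion-style inference flow.

    All arguments are assumed to be pretrained/preconfigured components;
    `scheduler.timesteps` and `scheduler.step` stand in for any standard
    sampler interface.
    """
    c = text_encoder(prompt)               # Stage 1: prompt -> conditioning tokens
    z = torch.randn(1, 4, 64, 64)          # Stage 2: start from a random latent z_T
    for t in scheduler.timesteps(steps):
        eps = unet(z, t, context=c)        # predict noise in latent space
        z = scheduler.step(eps, t, z)      # one denoising update
    return vae.decode(z)                   # Stage 3: decode the clean latent to pixels
```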

2.3.2 Architectural Breakdown of the Latent U-Net

While the Latent U-Net, exemplified by Stable Diffusion, inherits its foundational structure from the ADM architecture (i.e., a U-Net with residual blocks and multi-resolution self-attention), it introduces several profound modifications. These are not mere tweaks but fundamental redesigns necessary to operate efficiently in a latent space and to handle the sophisticated conditioning required for text-to-image synthesis.


1️⃣ Conditioning Paradigm Shift: From AdaGN to Cross-Attention

This is the most critical evolution from the ADM architecture. ADM perfected conditioning for global, categorical information (like a single class label), whereas Stable Diffusion required a mechanism for sequential, localized, and compositional information (a text prompt).

  • ADM’s Approach (Global Conditioning): ADM injected conditions (time embedding, class embedding) via adaptive normalization (AdaGN/FiLM): A single class embedding vector is combined with the time embedding. This unified vector is then projected to predict scale and shift parameters that modulate the entire feature map within a ResBlock.

    Limitation: This is an “all-at-once” conditioning signal. The network knows it needs to generate a “cat,” and this instruction is applied globally across all spatial locations. It cannot easily handle a prompt like “a cat sitting on a chair,” because the conditioning signal for “cat” and “chair” cannot be spatially disentangled.

  • Latent U-Net’s Approach (Localized Conditioning): Latent U-Nets instead integrate text tokens (from a frozen text encoder, e.g., CLIP) using cross-attention at many/most resolutions: Instead of modulating activations, the U-Net directly incorporates the text prompt’s token embeddings at multiple layers. A text encoder (e.g., CLIP) first converts the prompt “a cat on a chair” into a sequence of token embeddings: [⟨start⟩, a, cat, on, a, chair, ⟨end⟩].

    These embeddings form the Key (K) and Value (V) for the cross-attention mechanism. The U-Net’s spatial features act as the Query (Q). At each location in the feature map, the Query can “look at” the sequence of text tokens and decide which ones are most relevant. A region of the feature map destined to become the cat will learn to place high attention scores on the “cat” token, while a region for the chair will attend to the “chair” token.

    \[\begin{aligned} \small & \text{Attn}(Q,\,K,\,V)=\text{softmax}(QK^T/\sqrt{d})V,\quad \\[10pt] & Q=\text{latent features},\quad K,V=\text{text tokens} \end{aligned}\]
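Below is a minimal PyTorch sketch of such a cross-attention layer, with queries taken from flattened latent features and keys/values from text token embeddings. Dimension and module names are illustrative.

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Spatial latent features attend to text tokens: Q from latents, K/V from text.

    `dim` must be divisible by `heads` (an nn.MultiheadAttention requirement).
    """
    def __init__(self, dim, text_dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, kdim=text_dim, vdim=text_dim,
                                          batch_first=True)

    def forward(self, x, text_tokens):
        # x: [B, C, H, W] latent features; text_tokens: [B, L, text_dim]
        b, c, h, w = x.shape
        q = x.flatten(2).transpose(1, 2)          # [B, H*W, C]: one query per spatial location
        out, _ = self.attn(q, text_tokens, text_tokens)
        return out.transpose(1, 2).reshape(b, c, h, w)
```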

2️⃣ Self-Attention vs. Cross-Attention

The modern Latent U-Net block now has a clear and elegant division of labor between self-attention and cross-attention. This dual-attention design is profoundly more powerful than ADM’s global conditioning, enabling the complex compositional generation we see in modern text-to-image models. Cross-attention is usually applied at multiple scales, often after each residual block.

  • Self-Attention: captures long-range dependencies and ensures global structural/semantic consistency. Convolutions and skip connections excel at local detail, but struggle to enforce coherence across distant regions; self-attention fills this gap. It is typically applied only at low-resolution stages (e.g., 16×16 or 8×8).

  • Cross-Attention: enables multimodal alignment, projecting language semantics onto spatial locations. Coarse-scale cross-attention controls global layout, style, and subject placement; fine-scale cross-attention refines local textures, materials, and fine details.


3️⃣ Adapting the Training Objective for Latent Space: $v$-prediction

Another key adaptation that makes training deep and powerful U-Nets in latent space more robust and effective is the choice of objective.

We have discussed four common prediction targets for training diffusion models (post). Through analysis and comparison, we found that $v$-prediction offers the best stability, which led latent diffusion models to shift toward predicting $v$ instead of $\epsilon$ for more stable training.
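For reference, here is a minimal sketch of the $v$-prediction target and loss, following the standard definition $v = \sqrt{\bar\alpha_t}\,\epsilon - \sqrt{1-\bar\alpha_t}\,x_0$. The `model` interface and the shape of `alpha_bar_t` are assumptions.

```python
import torch

def v_target(x0, eps, alpha_bar_t):
    """v = sqrt(alpha_bar_t) * eps - sqrt(1 - alpha_bar_t) * x0."""
    return alpha_bar_t.sqrt() * eps - (1 - alpha_bar_t).sqrt() * x0

def v_prediction_loss(model, x0, t, alpha_bar_t):
    """MSE against the v target; alpha_bar_t is assumed broadcastable to x0's shape."""
    eps = torch.randn_like(x0)
    z_t = alpha_bar_t.sqrt() * x0 + (1 - alpha_bar_t).sqrt() * eps   # noised latent
    return ((model(z_t, t) - v_target(x0, eps, alpha_bar_t)) ** 2).mean()
```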


4️⃣ Perceptual Weighting in Latent Space

In Latent Diffusion Models, diffusion is not trained directly in pixel space but in the latent space of a perceptual autoencoder (a VAE). The latent representation is much smaller (e.g., 64×64×4 for a 512×512 image) and is designed to preserve perceptually relevant information.

However, if we train the diffusion model with a plain mean squared error (MSE) loss on the latent vectors, we implicitly treat all latent dimensions and spatial positions as equally important. In practice:

  • Some latent channels carry critical perceptual information (edges, textures, semantics).
  • Other channels encode redundant or imperceptible details.

Without adjustment, the diffusion model may spend too much gradient budget on parts of the latent space that have little impact on perceptual quality. Perceptual weighting introduces a weighting factor $w(z)$ into the diffusion loss, so that errors in perceptually important latent components are emphasized:

\[\mathcal{L} = \mathbb{E}_{z,t,\epsilon}\big[\, w(z)\,\|\epsilon - \epsilon_\theta(z_t, t)\|^2 \,\big],\]

There are different ways to define $w(z)$:

  1. Channel-based weighting from the VAE

    • Estimate how much each latent channel contributes to perceptual fidelity (e.g., by measuring sensitivity of the VAE decoder to perturbations in that channel).
    • Assign larger weights to channels that strongly affect the decoded image.
  2. Feature-based weighting (perceptual features)

    • Decode the latent $z$ back to image space $x=D(z)$.
    • Extract perceptual features $\phi(x)$ (e.g., from a VGG network or LPIPS).
    • Estimate how sensitive these features are to changes in $z$. Latent dimensions with high sensitivity get higher weights.
  3. Static vs. adaptive weighting

    • Static: Precompute a set of per-channel weights (averaged over the dataset).
    • Adaptive: Compute weights on the fly per sample or per timestep using Jacobian-vector product tricks.

In summary:

  • Focus on perceptual quality: Gradients are concentrated on latent components that most affect the decoded image quality.
  • Suppress irrelevant gradients: Channels that mostly encode imperceptible high-frequency noise are downweighted.
  • More stable training: The denoiser learns where its predictions matter most, reducing wasted updates and improving convergence.
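As an illustration, here is a minimal sketch of the static, channel-based variant: precomputed per-channel weights rescale the squared error in latent space. The weights `w_c` are assumed to have been estimated offline (e.g., from decoder sensitivity); names are illustrative.

```python
import torch

def weighted_latent_loss(eps_pred, eps, w_c):
    """Per-channel perceptual weighting of the latent-space MSE (static variant).

    eps_pred, eps: [B, C, H, W] predicted / true noise in latent space
    w_c:           [C] precomputed per-channel weights
    """
    err = (eps_pred - eps) ** 2
    return (w_c[None, :, None, None] * err).mean()
```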

2.3.3 Evolution to SDXL: Refining the Latent U-Net Formula

While Stable Diffusion models 1.x and 2.x established the power of the Latent Diffusion paradigm, Stable Diffusion XL (SDXL) 7 represents a significant architectural leap forward. It is not merely a larger model but a systematically re-engineered system designed to address the core limitations of its predecessors, including native resolution, prompt adherence, and aesthetic quality. The following sections provide a detailed technical breakdown of the key architectural innovations in SDXL.


1️⃣ A Two-Stage Cascade Pipeline: The Base and Refiner Models

To achieve the highest level of detail and aesthetic polish, SDXL introduces an optional but highly effective two-stage generative process, employing two distinct models.

The Core Problem: The Global and Local Dilemma. A single diffusion model faces a challenging balancing act. During the denoising process, it must simultaneously:

  •  Establish Global Coherence: Correctly interpret the prompt to create a plausible composition, with proper object placement, anatomy, lighting, and color harmony. This is a low-frequency task.
  •  Render Fine Details: Generate intricate, high-frequency textures like skin pores, fabric weaves, hair strands, and sharp edges. This is a high-frequency task.

For a single model, these two objectives are often in conflict. Over-focusing on global structure can lead to a "painterly" or smoothed-out look, while over-focusing on detail can sometimes compromise the overall composition, leading to artifacts or anatomical errors. The cascade pipeline elegantly solves this by assigning each task to a specialized expert.

Stage 1: The Base Model. The base model is the workhorse of the pipeline. Its primary responsibility is to create a structurally sound and semantically correct foundation for the image.

  • Function: It performs the bulk of the denoising process, starting from pure Gaussian noise ($z_T$) and running for a majority of the sampling steps (e.g., from step 1000 down to step 200). It is tasked with the “heavy lifting” of interpreting the prompt and translating it into a coherent visual scene.
  • Strengths: The base model, being very large and powerful (e.g., the 2.6 billion parameter U-Net in SDXL), excels at:
    • Composition and Layout: Arranging objects in the scene according to the prompt.
    • Color and Lighting: Establishing a consistent and harmonious color palette and lighting scheme.
    • Semantic Accuracy: Ensuring subjects and concepts from the prompt are correctly represented and interact plausibly.
  • Output: It produces a latent representation that is structurally complete and visually strong. You can think of its output as a high-quality, fully-formed painting that is conceptually finished but could still benefit from a final layer of polish and fine-tuning.

Stage 2: The Refiner Model. The refiner model takes over where the base model leaves off. It is not a general-purpose generator; it is a specialist trained for a very specific task: adding the final touch of realism and quality.

  • Function: It takes the latent output from the base model as its starting point. This is crucial—it does not start from pure noise. It starts from a latent that is already in a low-noise state. It then performs a small number of final denoising steps (e.g., from step 200 down to 0).
  • Specialized Training: This is its secret weapon. The refiner is trained exclusively on images with a low level of noise. This makes it an “expert” at high-fidelity rendering. It doesn’t need to know how to create a face from chaos, but it is exceptionally good at taking an almost-perfect face and adding realistic skin texture, subtle reflections in the eyes, and individual hair strands.
  • Strengths: The refiner focuses solely on: High-Frequency Detail Injection: sharpening edges, clarifying text, and adding intricate textures; Artifact Correction: Smoothing out minor imperfections or noise left over by the base model; Aesthetic Enhancement: Applying a final “polish” that pushes the image towards a higher level of photorealism or artistic finish.

Impact: This ensemble-of-experts approach allows for a division of labor. The base model ensures robust composition, while the refiner specializes in aesthetic finalization. The result is an image that benefits from both global coherence and local, high-frequency richness, achieving a level of quality that is difficult for a single model to produce consistently.


2️⃣ A Substantially Larger and More Robust U-Net Backbone

The most apparent upgrade in SDXL is its massively scaled-up U-Net, which serves as the core of the base model. This expansion goes beyond a simple increase in parameter count to include strategic design choices.

  • Increased Capacity: The SDXL base U-Net contains approximately 2.6 billion parameters, a nearly threefold increase compared to the ~860 million parameters of the U-Net in SD 1.5. This additional capacity is crucial for learning the more complex and subtle features required for high-resolution 1024x1024 native image generation.

  • Deeper and Wider Architecture: The network’s depth (number of residual blocks) and width (channel count) have been significantly increased. Notably, the channel count is expanded more aggressively in the middle blocks of the U-Net. These blocks operate on lower-resolution feature maps (e.g., 32x32) where high-level semantic information is most concentrated. By allocating more capacity to these semantic-rich stages, the model enhances its ability to reason about object composition and global scene structure, directly mitigating common issues like malformed anatomy (e.g., extra limbs) seen in earlier models at high resolutions.

  • Refined Attention Mechanisms: The distribution and configuration of the attention blocks (both self-attention and cross-attention) across different resolution levels were re-evaluated and optimized. This ensures a more effective fusion of spatial information (from the image features) and semantic guidance (from the text prompt) at all levels of abstraction.

Impact: This fortified U-Net backbone is the primary reason SDXL can generate coherent, detailed, and aesthetically pleasing images at a native 1024x1024 resolution, a feat that was challenging for previous versions without significant post-processing or specialized techniques.


3️⃣ The Dual Text Encoder: A Hybrid Approach to Prompt Understanding

Perhaps the most innovative architectural change in SDXL is its departure from a single text encoder. SDXL employs a dual text encoder strategy to achieve a more nuanced and comprehensive understanding of user prompts.

  • OpenCLIP ViT-bigG: This is the larger of the two encoders and serves as the primary source of high-level semantic and conceptual understanding. Its substantial size allows it to grasp complex relationships, abstract concepts, and the overall sentiment or artistic intent of a prompt (e.g., “a majestic castle on a hill under a starry night”).

  • CLIP ViT-L: The second encoder is the standard CLIP model used in previous Stable Diffusion versions. It excels at interpreting more literal, granular, and stylistic details in the prompt, such as specific objects, colors, or artistic styles (e.g., “a red car,” “in the style of Van Gogh”).

Mechanism of Fusion: During inference, the input prompt is processed by both encoders simultaneously. The resulting sequences of token embeddings are then concatenated along the channel dimension before being fed into the U-Net’s cross-attention layers. This combined embedding provides the U-Net with a richer, multi-faceted conditioning signal.

Impact: This hybrid approach allows SDXL to reconcile two often competing demands: conceptual coherence and stylistic specificity. The model can understand the “what” (from ViT-L) and the “how” (from ViT-bigG) of a prompt with greater fidelity, leading to superior prompt adherence and the ability to generate complex, well-composed scenes that match the user’s intent more closely.


4️⃣ Micro-Conditioning for Resolution and Cropping Robustness

SDXL introduces a subtle yet powerful form of conditioning that directly addresses a common failure mode in generative models: sensitivity to image aspect ratio and object cropping.

  • The Problem: Traditional models are often trained on square-cropped images of a fixed size. When asked to generate images with different aspect ratios, they can struggle, often producing unnatural compositions or cropped subjects.

  • SDXL’s Solution: During training, the model is explicitly conditioned on several metadata parameters in addition to the text prompt:

    • original height and original width: The dimensions of the original image before any resizing or cropping.
    • crop top and crop left: The coordinates of the top-left corner of the crop.
    • target height and target width: The dimensions of the final generated image.

Mechanism of Injection: These scalar values are converted into a fixed-dimensional embedding vector. This vector is then added to the sinusoidal time embedding before being passed through the AdaGN layers of the U-Net’s residual blocks.

Impact: By making the model “aware” of the resolution and framing context, SDXL learns to generate content that is appropriate for the specified canvas. This significantly improves its ability to handle diverse aspect ratios and dramatically reduces instances of unwanted cropping, leading to more robust and predictable compositional outcomes.
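A hedged sketch of how such scalar metadata can be folded into the time embedding is shown below: each value is mapped to a sinusoidal embedding, the embeddings are concatenated, projected, and added to the timestep embedding. The exact layout in SDXL may differ; `sinusoidal_embedding` is the helper sketched in Section 2.1.4, and the remaining names are illustrative.

```python
import torch
import torch.nn as nn

class MicroConditioning(nn.Module):
    """Embed (orig_h, orig_w, crop_top, crop_left, target_h, target_w) and add to the t-embedding."""
    def __init__(self, t_dim, per_scalar_dim=256, n_scalars=6):
        super().__init__()
        self.per_scalar_dim = per_scalar_dim
        self.proj = nn.Linear(per_scalar_dim * n_scalars, t_dim)

    def forward(self, t_emb, scalars):
        # t_emb: [B, t_dim] timestep embedding; scalars: [B, 6] metadata values
        embs = [sinusoidal_embedding(scalars[:, i], self.per_scalar_dim)
                for i in range(scalars.shape[1])]
        return t_emb + self.proj(torch.cat(embs, dim=-1))
```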


2.4 Transformer-based Designs

While latent U-Nets (Section 2.3) significantly improved efficiency and multimodal conditioning, they still retained convolutional inductive biases and hierarchical skip pathways. Due to the success of Transformers 8 in large-scale NLP tasks, the next stage in the evolution of diffusion architectures explores whether Transformers can serve as the primary backbone for diffusion models. This marks a decisive shift from U-Net-dominated designs to Transformer-native backbones, most notably exemplified by the Diffusion Transformer (DiT) 9 family and its successors.


2.4.1 Motivation for Transformer Backbones

Convolution-based U-Nets provide strong locality and translation invariance, but they impose rigid inductive biases:

  • Locality and Global Context: Convolutions capture local patterns well but require deep hierarchies to model long-range dependencies. U-Nets solve this partially through down/upsampling and skip connections, but global coherence still relies on explicit attention layers carefully placed at coarse scales.

    Transformers, by contrast, model all-pair interactions directly via attention, making them natural candidates for tasks where global semantics dominate.

  • Benefit from Scaling laws: Recent work shows that Transformers scale more predictably with dataset and parameter count, whereas CNNs saturate earlier. Diffusion training, often performed at very large scales, benefits from architectures that exhibit similar scaling behavior.

  • Unified multimodal processing: Many diffusion models condition on text or other modalities. Transformers provide a token-based interface: both images (as patch embeddings) and text (as word embeddings) can be treated uniformly, simplifying multimodal alignment.

Thus, a Transformer-based backbone promises to simplify design and leverage established scaling laws, potentially achieving higher fidelity with cleaner training dynamics.


2.4.2 Architectural Characteristics of Diffusion Transformers

The Diffusion Transformer (DiT) proposed by Peebles & Xie (2022) was the first systematic exploration of replacing U-Nets with ViT-style Transformers for diffusion.

  • Patch tokenization: Instead of convolutions producing feature maps, the input (pixels or latents) is divided into patches (e.g., 16×16), each mapped to a token embedding. This yields a sequence of tokens that a Transformer can process natively.
  • Class and time conditioning: As in ADM, timestep and class embeddings are injected not by concatenation but by scale-and-shift modulation of normalization parameters. The difference is that, instead of AdaGN, DiT uses Adaptive LayerNorm (AdaLN).
  • Global self-attention: Unlike U-Nets, where attention is inserted at selected resolutions, DiT-style models apply self-attention at every layer. This uniformity eliminates the need to decide “where” to place global reasoning — it is omnipresent.
  • Scalability: Transformers scale more gracefully with depth and width. With large batch training and data-parallelism, models like DiT-XL can be trained efficiently on modern accelerators.

DiT demonstrates that diffusion models do not require convolutional backbones. However, it also reveals that training Transformers for denoising is more fragile: optimization can collapse without careful normalization (AdaLN-Zero) and initialization tricks.

The Diffusion Transformer (DiT) architecture is shown below.

dit
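To make the per-block design concrete, here is a hedged PyTorch sketch of a DiT block with AdaLN-Zero: the conditioning vector predicts a shift, scale, and gate for both the attention and MLP branches, with the gates initialized to zero so each block starts close to the identity. Names and sizes are illustrative.

```python
import torch
import torch.nn as nn

class DiTBlock(nn.Module):
    """Transformer block with AdaLN-Zero conditioning (sketch)."""
    def __init__(self, dim, cond_dim, heads=8, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
                                 nn.Linear(mlp_ratio * dim, dim))
        # Predict (shift, scale, gate) for both the attention and MLP branches.
        self.ada = nn.Linear(cond_dim, 6 * dim)
        nn.init.zeros_(self.ada.weight)   # "Zero": gates start at 0, so the block starts as identity
        nn.init.zeros_(self.ada.bias)

    def forward(self, x, cond):
        # x: [B, N, dim] token sequence; cond: [B, cond_dim] timestep (+ class) embedding
        sh1, sc1, g1, sh2, sc2, g2 = self.ada(cond).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + sc1.unsqueeze(1)) + sh1.unsqueeze(1)
        x = x + g1.unsqueeze(1) * self.attn(h, h, h)[0]
        h = self.norm2(x) * (1 + sc2.unsqueeze(1)) + sh2.unsqueeze(1)
        return x + g2.unsqueeze(1) * self.mlp(h)
```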


2.4.3 Hybrid Designs: Marrying U-Net and Transformer Strengths

Pure Transformers are computationally expensive, especially at high resolutions. To balance efficiency and quality, several hybrid architectures emerged:

  • U-Net with Transformer blocks — many models, including Stable Diffusion v2 and SDXL, interleave attention layers (which are Transformer sub-blocks) into convolutional U-Nets. This compromise preserves locality while still modeling long-range dependencies.

  • Perceiver-style cross-attention. Conditioning (e.g., text embeddings) can be injected via cross-attention, a Transformer-native mechanism that naturally fuses multimodal tokens.

  • MMDiT (Multimodal DiT) in Stable Diffusion 3. Here, both image latents and text tokens are treated as tokens in a single joint Transformer sequence. Queries, keys, and values are drawn from both modalities, enabling a fully symmetric text–image fusion mechanism without the asymmetry of U-Net cross-attention layers.


2.5 Extensions to Video and 3D Diffusion

The success of diffusion models on static images naturally prompted their extension to more complex, higher-dimensional data like video and 3D scenes. This required significant architectural innovations to handle the temporal dimension in video and the complex geometric representations of 3D objects, all while maintaining consistency and stability.


2.5.1 Video U-Net: Introducing the Temporal Dimension

The most direct way to adapt an image U-Net for video generation is to augment it with mechanisms for processing the time axis. This gave rise to the Spatio-Temporal U-Net.


1️⃣ Temporal Layers for Consistency

A video can be seen as a sequence of image frames, i.e., a tensor of shape $(B, T, C, H, W)$. A standard 2D U-Net processes each frame independently, leading to flickering and temporal incoherence. To solve this, temporal layers are interleaved with the existing spatial layers:

  • Temporal Convolutions: 3D convolution layers (e.g., with a kernel size of $(3, 3, 3)$ for $(T, H, W)$) replace or supplement the standard 2D convolutions. This allows features to be aggregated from neighboring frames.
  • Temporal Attention: This is the more powerful and common approach. After a spatial self-attention block that operates within each frame, a temporal self-attention block is added. In this block, a token at frame $t$ attends to corresponding tokens at other frames ($t-1$, $t+1$, etc.). This explicitly models long-range motion and appearance consistency across the entire video clip.

Models like Stable Video Diffusion (SVD) build upon a pretrained image LDM and insert these temporal attention layers into its U-Net. By first training on images and then fine-tuning on video data, the model learns temporal dynamics while leveraging the powerful prior of the image model.
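A minimal sketch of the reshaping trick commonly used for such temporal attention layers is shown below: the spatial dimensions are folded into the batch so that attention runs only along the frame axis, and a residual connection preserves the behavior of the pretrained spatial backbone. Names are illustrative.

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Self-attention along the frame axis for video features of shape [B, T, C, H, W]."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        b, t, c, h, w = x.shape
        # Fold spatial positions into the batch: each location becomes a length-T sequence.
        seq = x.permute(0, 3, 4, 1, 2).reshape(b * h * w, t, c)
        out, _ = self.attn(seq, seq, seq)
        out = out.reshape(b, h, w, t, c).permute(0, 3, 4, 1, 2)
        return x + out   # residual connection keeps the pretrained spatial behavior intact
```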


2️⃣ Conditioning on Motion

To control the dynamics of the generated video, these models are often conditioned on extra information like frames per second (FPS) or motion “bucket” IDs representing the amount of camera or object motion. This conditioning is typically injected alongside the timestep embedding, allowing the model to generate videos with varying levels of activity.


2.5.2 3D Diffusion: Generating Representations Instead of Pixels

Generating 3D assets is even more challenging due to the complexity of 3D representations (meshes, voxels, NeRFs) and the need for multi-view consistency. A breakthrough approach has been to use diffusion models to generate the parameters of a 3D representation itself, rather than rendering pixels directly.


1️⃣ Diffusion on 3D Gaussian Splatting Parameters

3D Gaussian Splatting (3D-GS) has emerged as a high-quality, real-time-renderable 3D representation. A scene is represented by a collection of 3D Gaussians, each defined by parameters like position (XYZ), covariance (scale and rotation), color (RGB), and opacity.

Instead of a U-Net that outputs an image, models like 3D-GS Diffusion use an architecture (often a Transformer) to denoise a set of flattened Gaussian parameters. The process works as follows:

  1. Canonical Representation: A set of initial Gaussian parameters is created (e.g., a sphere or a random cloud).
  2. Diffusion Process: Noise is added to this set of parameters (position, color, etc.) over time.
  3. Denoising Network: A Transformer-based model takes the noisy parameter set and the conditioning signal (e.g., text or a single image) and predicts the clean parameters.
  4. Rendering: Once the denoised set of Gaussian parameters is obtained, it can be rendered from any viewpoint using a differentiable 3D-GS renderer to produce a 2D image.

This approach elegantly separates the generative process (in the abstract parameter space) from the rendering process. By operating on the compact and structured space of Gaussian parameters, the model can ensure 3D consistency by design, avoiding the view-incoherence problems that plague naive image-space 3D generation.


3. Stability-Oriented Architectural Designs

Training stability is a fundamental requirement for scaling diffusion models. While optimization strategies such as learning-rate schedules or variance weighting are important, the architecture itself largely determines whether gradients vanish, explode, or propagate smoothly across hundreds of layers. In diffusion models, two major architectural paradigms dominate: U-Net backbones (used in DDPM, ADM, Stable Diffusion) and Transformer backbones (DiT, MMDiT, SD3). These two paradigms embody different design philosophies, which in turn dictate distinct stabilization strategies.


Architectural Philosophies: U-Net vs. DiT

Before diving into specific mechanisms, we must first understand the high-level topological differences between U-Nets and DiTs. The very shape of these architectures dictates their inherent strengths, weaknesses, and, consequently, where the primary “pressure points” for stability lie.


3.1 U-Net Macro Topology

We have already covered most of the U-Net structure in Section 2, so here we only provide a brief summary. The U-Net family is characterized by its encoder–decoder symmetry with long skip connections that link features at the same spatial resolution.

  • Strengths: Skip connections preserve fine-grained details lost during downsampling, and they dramatically shorten gradient paths, alleviating vanishing gradients in very deep convolutional stacks.

  • Weaknesses: The powerful influence of the skip connections can be a double-edged sword. Overly strong skips can dominate the decoder, reducing reliance on deeper semantic representations. They can also destabilize training when the variance of encoder features overwhelms decoder activations.

  • Implication: For U-Nets, stabilization hinges on how residual and skip pathways are regulated — via normalization, scaling, gating, or progressive fading.


3.2 DiT Macro Topology

Similarly, we have already covered most of the DiT structure in Section 2, so here we only provide a brief summary. Diffusion Transformers (DiT) abandon encoder–decoder symmetry in favor of a flat stack of homogeneous blocks. Every layer processes a sequence of tokens with identical embedding dimensionality.

  • Strengths: This design is remarkably simple, uniform, and highly scalable. It aligns perfectly with the scaling laws that have driven progress in large language models.

  • Weaknesses: Without long skips, there are no direct gradient highways. The deep, uninterrupted stack can easily amplify variance or degrade gradients with each successive block. A small numerical error in an early layer can be compounded dozens of times, leading to catastrophic failure. Stability pressure is concentrated entirely on per-block design.

  • Implication: For DiTs, the central question is how to stabilize each block internally, rather than balancing long-range skips.


3.3 Summary of Divergence

Overall, the two architectures differ significantly in how stability must be managed.

  • U-Net: Stability amounts to managing the interplay between skip connections and residual blocks.

  • DiT: Stability amounts to ensuring that each block is numerically stable under deep stacking. This divergence explains why U-Nets emphasize skip/residual design, while DiTs emphasize normalization, residual scaling, and gated residual paths.


4. Stabilization in U-Net Architectures

U-Net stability can be understood through two complementary levers that act on different failure modes. First, a control system that sets the operating point of every block at a given noise level and condition (timestep, class, text): it prevents variance drift and miscalibrated activations by modulating normalized features (e.g., FiLM on top of GN). Second, the signal pathways that carry information through residual and skip connections: they must be regulated so that high-variance shallow features do not overwhelm deep semantics, and gradients remain well-conditioned across scales. Concretely, U-Nets suffer from (i) variance explosion or attenuation when features are fused naively across resolutions, (ii) skip dominance that drowns out semantic bottleneck representations, and (iii) noise leakage at high-noise timesteps. The next subsection addresses the first lever via AdaGN (a stable, per-sample modulation compatible with small batches), while Section 4.2 formalizes pathway regulation (scaling, gating, attention-based fusion) to keep information flow numerically and semantically balanced.


4.1 The Control System: Conditioning via AdaGN

U-Nets are typically trained with small batch sizes on high-resolution images, making BatchNorm 10 unreliable. Instead, GroupNorm (GN) 11 is the default choice: it normalizes channels in groups, independent of batch statistics.

Adaptive GroupNorm (AdaGN) extends this by predicting scale and shift parameters from conditioning vectors (timestep, class, text).

\[\mathrm{AdaGN}(x,c)=\gamma(c)\cdot \mathrm{GN}(x)+\beta(c)\]

This design enables it to balance stability and controllability.

  • Stability: GN prevents variance drift under small batches.

  • Control: AdaGN injects noise-level and semantic awareness at every block.

The following figure shows how a conditional signal is injected using AdaGN in a U-Net residual block.

adagn


4.2 The Signal Pathways: Skip Connections and Residual Innovations

With a control system in place, the focus shifts to the structural integrity of the network’s information pathways. Skip connections are the hallmark of U-Net architectures, but also their most delicate component: they can stabilize gradient flow and preserve details, yet, if uncontrolled, they destabilize training or amplify noise. Below, we unify the notation and then categorize stabilization strategies into five major families. Each strategy is explained in terms of its idea, formula, solved problem, and scope (generic U-Net vs diffusion-specific).

Before discussing different optimization strategies, let us first uniformly define the symbols used in this section.

| Symbol | Definition / Description | Shape / Type / Constraint |
|---|---|---|
| Decoder feature ($\mathbf{d}$) | Feature map from a deep decoder layer | $\mathbf{d} \in \mathbb{R}^{C \times H \times W}$ |
| Skip feature ($\mathbf{s}$) | Feature map from a shallow encoder layer (via skip connection) | $\mathbf{s} \in \mathbb{R}^{C \times H \times W}$ |
| Fusion output ($\mathbf{y}$) | Result after fusing $\mathbf{d}$ and $\mathbf{s}$ | $\mathbf{y}$ (same shape as input) |
| Normalization ($\mathrm{Norm}(\cdot)$) | Normalization function (e.g., GroupNorm, AdaGN, LayerNorm) | $\mathrm{Norm}(\cdot)$ |
| Residual input ($\mathbf{x}$) | Input to a residual block | $\mathbf{x}$ |
| Residual function ($f(\cdot)$) | Transformation within a residual block (e.g., conv, attention) | $f(\cdot)$ |
| Scaling factor ($\alpha$) | Scalar or per-channel weight controlling skip feature magnitude | $\alpha \in \mathbb{R}$ or $\alpha \in \mathbb{R}^C$ |
| Gate ($g_t$) | Channel-wise gating mask, often conditioned on timestep or prompt | $g_t \in [0,1]^C$, $g_t = h(t, \text{cond})$ |
| Noise scale ($\sigma_t$) | Noise level at diffusion timestep $t$ (used for gating/scaling) | $\sigma_t \in \mathbb{R}_{\geq 0}$ |
| Drop mask ($m$) | Bernoulli mask for stochastic skip connection dropout | $m \sim \mathrm{Bernoulli}(p)$ |
| Fourier transform ($\mathcal{F}(\cdot)$) | 2D Fourier transform of a feature map | $\mathcal{F}(\cdot)$ |
| Low-frequency part ($\mathbf{s}_{LP}$) | Low-pass component of $\mathbf{s}$ in the frequency domain | $\mathbf{s}_{LP}$ |
| High-frequency part ($\mathbf{s}_{HP}$) | High-pass component of $\mathbf{s}$ in the frequency domain | $\mathbf{s}_{HP}$ |

4.2.1 Variance and Amplitude Control

Skip connections enable the direct transmission of high-frequency encoder features into the decoder pathway. While this design greatly enriches spatial detail and accelerates gradient propagation, uncontrolled fusion of skip and decoder features can cause several critical challenges. Take additive fusion as an example: decoder features $\mathbf{d}$ are additively fused with skip features $\mathbf{s}$ (from the encoder) to form:

\[\mathbf{y} = \mathbf{d} + \mathbf{s}.\]
  • Variance explosion: The variance of the fused output is:

    \[\mathrm{Var}(y) = \mathrm{Var}(d) + \mathrm{Var}(s) + 2\,\mathrm{Cov}(d,s).\]

    Shallow features ($s$) are directly derived from the input or a few convolutional layers, so their numerical distribution (i.e., variance and magnitude) can be highly volatile. In deep networks, if every skip connection behaves this way, the variance accumulates layer by layer and inevitably leads to gradient explosion.

  • Distribution Mismatch: $\mathbf{s}$ comes from shallow layers and preserves high-frequency details (edges, textures), so its distribution is “sharp” with high variance. In contrast, the decoder features $\mathbf{d}$ have undergone multiple downsampling steps, nonlinear transformations, and deep processing, resulting in a typically smoother distribution with smaller variance.

    When these two types of feature maps with vastly different statistical properties are fused, the output distribution becomes “torn,” and the network struggles to interpret the hybrid signal.

  • Semantic Suppression: $\mathbf{s}$ carries “detail semantics” and $\mathbf{d}$ carries “global semantics”. If $\mathbf{s}$ is too strong (unscaled), then $y \approx s$: deep semantic features are drowned out by shallow details, the network degenerates into “copying shallow features,” and it loses deep abstraction capability.

To address this, we introduce a three-step fusion pipeline consisting of (i) skip normalization, (ii) skip scaling, and (iii) post-fusion variance re-normalization.

\[\mathbf{s}\ \xrightarrow{\text{Norm}}\ \hat{\mathbf{s}}=\text{Norm}(\mathbf{s}) \ \xrightarrow{\times \alpha}\ \tilde{\mathbf{s}} = \alpha\,\hat{\mathbf{s}} \ \xrightarrow{\text{Add+Renorm}}\ \mathbf{y} = \frac{\mathbf{d}+\tilde{\mathbf{s}}}{\sqrt{1+\alpha^2}}.\]

Together, these steps align the distributions of encoder and decoder features, regulate the relative strength of skip connections, and ensure that the variance of the fused output remains stable across depth.

  • Distributionally Aligned: since the distribution of $\mathbf{s}$ has a “gap” (both semantic and statistical) relative to that of $\mathbf{d}$, normalizing the skip activation before fusion, \(\hat{\mathbf{s}} = \mathrm{Norm}(\mathbf{s})\), mitigates this problem.

    By applying the same normalization method used in the layer of $d$ (e.g., GroupNorm or AdaGN with conditional modulation) to $s$, this statistical mismatch can be mitigated, making the fusion process smoother and more “balanced.”

  • Amplitude-Controlled: Even after normalization, the relative contribution of skip features must be explicitly regulated. Excessive skip strength can cause the decoder to ignore abstract semantic representations, while insufficient skip strength results in blurred or incomplete reconstructions. A scalar or channel-wise scaling coefficient $\alpha$ is applied to the normalized skip:

    \[\tilde{\mathbf{s}} = \alpha \,\hat{\mathbf{s}}.\]
  • Variance-Stable: When decoder and skip branches are additively fused across multiple scales, variance may accumulate layer by layer, resulting in distributional drift and unstable gradients.

    After addition, we normalize the fused output to preserve variance equilibrium:

    \[\mathbf{y} = \frac{\mathbf{d} + \tilde{\mathbf{s}}}{\sqrt{1 + \alpha^2}} = \frac{\mathbf{d} + \alpha\,\hat{\mathbf{s}}}{\sqrt{1+\alpha^2}}.\]
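The following PyTorch sketch bundles these three steps into a single fusion module, using GroupNorm as the normalizer and a learnable per-channel $\alpha$; the names and the choice of normalizer are assumptions.

```python
import torch
import torch.nn as nn

class StabilizedSkipFusion(nn.Module):
    """Norm -> scale -> add -> variance re-normalization for skip fusion (sketch).

    `channels` is assumed divisible by `groups`.
    """
    def __init__(self, channels, groups=32):
        super().__init__()
        self.norm = nn.GroupNorm(groups, channels)
        self.alpha = nn.Parameter(torch.ones(channels))   # per-channel skip gain

    def forward(self, d, s):
        # d: decoder feature, s: encoder skip feature, both [B, C, H, W]
        s_hat = self.norm(s)                               # (i) distributional alignment
        a = self.alpha[None, :, None, None]
        s_tilde = a * s_hat                                # (ii) amplitude control
        return (d + s_tilde) / torch.sqrt(1 + a ** 2)      # (iii) variance re-normalization
```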

4.2.2 Conditional and Gated Control

Variance and amplitude control strategies make fusion numerically well-posed, addressing the static and universal problem of numerical instability. However, they are content-agnostic and unconditioned, treating all samples, timesteps, channels, and spatial positions the same way.

However, in many tasks, beyond the issue of global stability, there is also a need to address a dynamic information-flow problem that is closely tied to the current input content or conditions. This involves enabling the network to “adapt to the context,” dynamically determining what information the skip connections should transmit, and how much, based on the specific situation. For example, in diffusion models, inputs with varying noise levels and different text conditions should have distinct impacts on the information flow through skip connections.

Conditional and gated control strategies are effective tools for solving this problem. Let $\mathbf{d}$ be the decoder feature and $\mathbf{s}$ the skip feature. Instead of a uniform gain $\alpha$, we learn a gate $g \in [0,1]$ that depends on time/noise, content, and scale; $g$ can be a scalar, channel-wise ($\mathbb{R}^C$), or spatial ($\mathbb{R}^{C\times H\times W}$) gate.

In summary, category B decides what information to pass through the skip connection, and how much of it, under the given conditions. The three-step fusion pipeline can be expressed as:

\[\mathbf{s}\ \xrightarrow{\text{Norm}}\ \hat{\mathbf{s}}=\text{Norm}(\mathbf{s}) \ \xrightarrow{\text{B: gate}}\ \tilde{\mathbf{s}} = g\,\odot\,\hat{\mathbf{s}} \ \xrightarrow{\text{Add+Renorm}}\ \mathbf{y} = \frac{\mathbf{d}+\tilde{\mathbf{s}}}{\sqrt{1+ \|g\|_2^2 }}.\]
  • B1: Time/Noise-Conditioned Gating: In diffusion models, different SNR regions contain varying amounts of information, which requires separate consideration during fusion. The core idea is to automatically “tighten” the skip connection pathway at high noise levels (low SNR) to prevent noise leakage, and gradually “open” it at low noise levels (high SNR) to allow details to pass through.

    \[g_t \;=\; \sigma\big(W\,h(t) + b\big)\]

    where $\sigma$ is the logistic sigmoid and $h(t)$ is a timestep/noise embedding. It is necessary to ensure that $g$ is monotonic: if $h(t)$ represents the noise level, $g$ should be a monotonically decreasing function of the noise; if $h(t)=\text{SNR}(t)$ is used, $g$ should be a monotonically increasing function of the SNR. This ensures suppression at high noise levels and amplification at low noise levels.

  • B2: Content-Aware Gating: B1 (time/noise gating) adapts to the SNR over time, but within a given timestep it still cannot tell which channels/regions carry signal versus clutter. Content-Aware Gating makes the skip contribution content-dependent (channel-wise and/or spatial), passing salient structures while suppressing distractors (background textures, misaligned edges, ringing, condition-inconsistent patterns). This improves semantic alignment and reduces artifacts without sacrificing detail (a code sketch combining B1 and B2 follows this list).

    Common strategies include the channel-wise SE gate: perform global average pooling (GAP) on the skip $s$ and decoder features $d$, then use a small MLP to predict a weight for each channel, squashing the weights into $[0,1]$ with a sigmoid so that useful channels are retained while noisy or redundant ones are suppressed.

    \[g \;=\; \sigma\big(\mathrm{MLP}(\mathrm{GAP}([\mathbf{d},\hat{\mathbf{s}}]))\big)\in[0,1]^C\]

    Spatial gate (mask): Concatenate $d$ and $s$, then use a convolution to predict a 2D mask. This lets skip connections contribute detail only in important image regions (object edges, main subjects), while suppressing background or irrelevant parts.

    \[g=\sigma\big(\mathrm{Conv}_{3\times3}([\mathbf{d},\hat{\mathbf{s}}])\big)\in(0,1)^{1\times H\times W}\]

    Cross-attention gate: Use the decoder features $d$ as the Query and the skip features $s$ as Key/Value, employing attention to determine which information to extract from the skip connections.

    \[\begin{align} & Q=\psi(\mathbf{d})\in\mathbb{R}^{(HW)\times d_k},\quad K=V=\phi(\hat{\mathbf{s}})\in\mathbb{R}^{(HW)\times d_k} \\[10pt] & g = \mathrm{Attn}(Q,K,V)=\mathrm{softmax}\Big(\frac{QK^\top}{\sqrt{d_k}}\Big)V \end{align}\]
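To make the gating concrete, the sketch below combines a B1 time/SNR-conditioned gate with a B2 channel-wise SE gate. The module name `GatedSkipFusion`, the MLP widths, and the per-channel renormalization (a simplification of the $\|g\|$-based renormalization above) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GatedSkipFusion(nn.Module):
    """Minimal sketch: normalize the skip, gate it by time/SNR (B1) and content (B2), then fuse."""
    def __init__(self, channels: int, time_dim: int, num_groups: int = 8):
        super().__init__()
        self.norm = nn.GroupNorm(num_groups, channels)
        # B1: per-channel gate predicted from the timestep / noise-level embedding.
        self.time_gate = nn.Sequential(nn.SiLU(), nn.Linear(time_dim, channels))
        # B2: SE-style content gate predicted from pooled decoder + skip features.
        self.content_gate = nn.Sequential(
            nn.Linear(2 * channels, channels // 4), nn.SiLU(),
            nn.Linear(channels // 4, channels),
        )

    def forward(self, d: torch.Tensor, s: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        s_hat = self.norm(s)
        # Sigmoid (not softmax) so each channel gets an independent gate in (0, 1).
        g_time = torch.sigmoid(self.time_gate(t_emb))[:, :, None, None]
        pooled = torch.cat([d.mean(dim=(2, 3)), s_hat.mean(dim=(2, 3))], dim=1)
        g_content = torch.sigmoid(self.content_gate(pooled))[:, :, None, None]
        g = g_time * g_content                         # combined gate in (0, 1)
        s_tilde = g * s_hat
        return (d + s_tilde) / torch.sqrt(1 + g ** 2)  # per-channel variance renormalization
```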

4.2.3 Fusion Design and Frequency Control

Categories A and B are designed around the additive fusion case, where encoder features $s$ and decoder features $d$ are directly summed. This simple operator is lightweight but fragile: it easily suffers from variance explosion, distribution mismatch, and uncontrolled information flow, necessitating additional stabilization (A) and gating (B).

A natural question follows: are there alternative fusion operators that can inherently alleviate some of these issues? Category C explores this design space. Instead of relying solely on additive fusion, one can concatenate features and learn a projection, use attention to align semantics, or even manipulate features in the frequency domain. These alternatives do not eliminate all problems, but they shift the trade-offs: some stabilize variance by design, some improve content alignment, and some suppress noise more effectively.

  • C1: Concat + 1×1 Conv: Instead of summing $s$ and $d$, concatenate them and learn a projection layer. The projection can automatically calibrate scale and combine information more flexibly. The formulation is

    \[y = W[d,\,s], \quad W \in \mathbb{R}^{C_{\text{out}}\times(C_d+C_s)}.\]

    Compared with additive fusion, variance and distribution mismatch are absorbed by the learnable projection $W$, so explicit A-style normalization and variance rescaling are less critical. The convolutional projection also provides richer combinations of semantics and detail.

    However, the skip still carries irrelevant or noisy content, so B-style content gating (B2) or time/noise gating (B1) can still improve robustness: $y = W[d,\,g\,\odot\,s]$ (see the sketch after this list).

  • C2: Attention-Based Fusion: Use attention to align skip features with decoder queries, letting the decoder select what is relevant rather than blindly accepting all skip information. The formulation is

    \[y = \mathrm{Attn}(Q=d,\;K=V=\phi(s)).\]

    Compared with additive fusion, the attention softmax normalizes contributions, avoiding variance explosion (no need for A-style variance renormalization). It also provides inherent content selection, covering much of B2’s role.

    However, at high noise levels in diffusion, attention may still amplify random patterns; B1 time/noise gating is still necessary.
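As a sketch of the C1 operator (with an optional B1-style gate on the skip), consider the following module; the name `ConcatConvFusion` and the layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ConcatConvFusion(nn.Module):
    """Minimal sketch of C1: concatenate decoder and skip features, mix with a 1x1 conv."""
    def __init__(self, c_d: int, c_s: int, c_out: int, time_dim: int = 0):
        super().__init__()
        self.proj = nn.Conv2d(c_d + c_s, c_out, kernel_size=1)  # learnable projection y = W[d, s]
        # Optional B1-style time/noise gate applied to the skip before concatenation.
        self.time_gate = (
            nn.Sequential(nn.SiLU(), nn.Linear(time_dim, c_s)) if time_dim > 0 else None
        )

    def forward(self, d: torch.Tensor, s: torch.Tensor, t_emb: torch.Tensor = None) -> torch.Tensor:
        if self.time_gate is not None and t_emb is not None:
            g = torch.sigmoid(self.time_gate(t_emb))[:, :, None, None]
            s = g * s                                           # y = W[d, g ⊙ s]
        return self.proj(torch.cat([d, s], dim=1))
```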


4.2.4 Initialization and Regularization

Even with careful fusion operators (A–C), a UNet remains a very deep residual network: dozens of convolutional and attention blocks connected by long skip pathways. Such depth creates risks of unstable optimization: gradient explosion/vanishing, early over-reliance on skip shortcuts, and difficulty in learning deep semantic representations.

Category D introduces techniques that stabilize training from the inside out. These strategies ensure that the model starts from a near-identity mapping, keeps gradients well-conditioned, and does not overly depend on shallow features. While many originate in general deep residual learning, they apply naturally to UNets and are especially beneficial in diffusion models, where stable training under long horizons and noisy inputs is critical.

  • D1: Residual Scaling: Scale residual branches so that their variance does not explode with depth. For a residual block with input $x$ and transformation $f(x)$, the output of the residual block is

    \[y=x+\beta\,f(x)\,\qquad\,\beta \approx \frac{1}{\sqrt{2}}\]

    Residual scaling is common across deep residual networks; it prevents variance growth across stacked residual blocks and improves optimization stability (a combined D1 + D2 code sketch follows this list).

  • D2: Zero-Init / Zero-Gamma: Initialize the final layer of each residual branch close to zero, so the network initially behaves like an identity map.

    \[y=x+\varepsilon\,f(x)\,\qquad\,\varepsilon \approx 0\,\text{at init.}\]
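A minimal sketch combining D1 and D2 for a convolutional residual block follows; the two-conv layout, GroupNorm, and SiLU are illustrative choices, not a fixed recipe.

```python
import torch
import torch.nn as nn

class ScaledZeroInitResBlock(nn.Module):
    """Minimal sketch of D1 (residual scaling) + D2 (zero-init of the last layer)."""
    def __init__(self, channels: int, num_groups: int = 8, beta: float = 2 ** -0.5):
        super().__init__()
        self.beta = beta                                   # D1: residual scaling factor ~ 1/sqrt(2)
        self.branch = nn.Sequential(
            nn.GroupNorm(num_groups, channels), nn.SiLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.GroupNorm(num_groups, channels), nn.SiLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        # D2: zero-init the final conv so f(x) = 0 and the block is an identity at init.
        nn.init.zeros_(self.branch[-1].weight)
        nn.init.zeros_(self.branch[-1].bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.beta * self.branch(x)
```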

5. Stabilization in DiT Architectures

In contrast to U-Nets, stabilization in DiT-style architectures must confront a different set of fragilities. Since DiTs are built from homogeneous Transformer blocks without long skip-connections, the entire burden of stability falls on the internal normalization and conditioning layers. This makes the design of modulation schemes—such as adaptive LayerNorm—central to controlling the signal scale and preserving gradient health across dozens or even hundreds of layers. At the same time, attention pathways replace convolutional skip fusions, so numerical drift manifests not as skip dominance but as unbalanced token interactions and layerwise variance escalation. The following subsection formalizes the first stabilization lever: AdaLN, a per-token modulation scheme that provides fine-grained control and has become the de facto choice in modern DiT implementations.


5.1 The Control System: Conditioning via AdaLN

Transformers operate on a fundamentally different data structure: a sequence of tokens of shape $(N, S, D)$. Here, the entire $D$-dimensional vector represents the complete set of features for a single token. This makes Layer Normalization (LayerNorm) 12 the ideal choice, as it normalizes across the $D$-dimensional embedding for each token independently.

Consequently, Diffusion Transformers (DiT) 9 employ Adaptive Layer Normalization (AdaLN). The principle is identical to AdaGN, but LayerNorm replaces GroupNorm. While the concept has roots in models like StyleGAN2 13, its application in Transformers for diffusion was popularized by DiT.

\[\mathrm{AdaLN}(x,c)=\gamma(c)\cdot\mathrm{LN}(x)+\beta(c)\]

Gate parameter. Many implementations augment AdaLN with a learnable gate $g$; gating provides a way to dynamically control information flow. The most impactful application has been within the MLP layers, through Gated Linear Units (GLU) and their variants. The SwiGLU variant, proposed by Shazeer (2020) 14, was shown to significantly improve performance by introducing a data-driven, multiplicative gate that selectively passes information through the feed-forward network.

\[y = x + g \cdot \big(\gamma(c) \cdot \text{LN}(x) + \beta(c)\big).\]
  • $g$ is often initialized near 0, ensuring that the residual branch is “silent” at initialization.
  • During training, $g$ learns how strongly conditioning should influence the layer.
  • This mechanism stabilizes optimization and allows gradients to flow through a clean identity path early in training, while the conditioning pathway is switched on gradually.

The following figure shows how to inject conditional signal using AdaLN in a transformer block.

[Figure: conditioning injection via AdaLN in a Transformer block]
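To make the gated modulation concrete, here is a minimal sketch of the update $y = x + g \cdot (\gamma(c)\cdot\mathrm{LN}(x) + \beta(c))$. The module name `GatedAdaLN`, the single-linear conditioning heads, and the zero-initialized gate are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GatedAdaLN(nn.Module):
    """Minimal sketch of gated AdaLN: y = x + g(c) * (gamma(c) * LN(x) + beta(c))."""
    def __init__(self, dim: int, cond_dim: int):
        super().__init__()
        self.ln = nn.LayerNorm(dim, elementwise_affine=False)  # affine comes from the condition
        self.to_scale_shift = nn.Linear(cond_dim, 2 * dim)     # predicts gamma(c), beta(c)
        self.to_gate = nn.Linear(cond_dim, dim)                # predicts the gate g(c)
        nn.init.zeros_(self.to_gate.weight)                    # g ~ 0 at init: the branch starts "silent"
        nn.init.zeros_(self.to_gate.bias)

    def forward(self, x: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
        # x: (N, S, D) token sequence, c: (N, cond_dim) condition embedding
        gamma, beta = self.to_scale_shift(c).chunk(2, dim=-1)  # each (N, D)
        g = self.to_gate(c)                                    # (N, D), zero at initialization
        mod = gamma[:, None, :] * self.ln(x) + beta[:, None, :]
        return x + g[:, None, :] * mod
```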

Finally, to solve the unique stability challenges of training extremely deep Transformers, the AdaLN-Zero strategy was introduced in DiT. This is an initialization trick that also functions as a form of gating—a “master switch” that gates the entire residual branch to zero at the start of training. The mechanism is as follows:

  • The AdaLN parameters $\gamma$ and $\beta$ are initialized to produce an identity transformation.
  • Crucially, the output projection of each attention block and the second linear layer of each MLP block are initialized with all-zero weights.

This ensures that at the start of training, every residual block initially computes the identity function. This creates a pristine “skip-path” for gradients, ensuring stable convergence from the outset. As training progresses, the network learns non-zero weights, gradually “opening the gate” to the residual connections. AdaLN-Zero, combined with the power of gated MLPs and adaptive normalization, provides the trifecta of control and stability needed to scale Transformers to billions of parameters for diffusion models.

❓ From Stabilizer to Controller: Why Normalization Became the Injection Point

💡  Normalization has been one of the most fundamental tools in deep learning. Initially, its role was purely that of a stabilizer: preventing exploding/vanishing gradients, reducing internal covariate shift, and enabling deeper networks to converge. However, in diffusion models normalization has undergone a conceptual shift. It is no longer only a numerical safeguard but has become the primary controller for injecting conditioning information such as timesteps, noise scales, class labels, or multimodal embeddings.

  Diffusion models forced normalization to evolve. Unlike discriminative models, the denoising network operates under extreme noise regimes, from nearly clean signals to pure Gaussian noise. This requires the model to adapt its feature statistics dynamically depending on the timestep, noise level, or conditioning prompt. Normalization layers became the natural site for this adaptation because:

  •  Ubiquity: every residual block already contains a normalization step, so conditioning can permeate the entire network.
  • Direct statistical control: all of the normalization schemes rely on the learnable affine parameters, the scale ($\gamma$) and shift ($\beta$), to restore feature representation flexibility. These parameters provided a perfect, pre-existing interface for control. By replacing $\gamma$ and $\beta$ with dynamic functions of the conditional vectors, the normalization layer could be "hijacked" to modulate the characteristics of every feature map.
  •  Lightweight but global influence: a small MLP projecting a condition vector can control feature distributions across all layers without altering the convolutional or attention weights directly.

  Thus, normalization transitioned into a controller: not just stabilizing activations, but embedding semantic and structural conditions into the very statistics of feature maps.


5.2 The Signal Pathways: Enabling Deep Stacks in DiT

With U-Nets, stability hinges on balancing skip and residual pathways. In contrast, Diffusion Transformers (DiT) eliminate skip connections entirely: the model is a deep stack of homogeneous Transformer blocks. Consequently, all signal propagation depends solely on residual pathways inside each block. If these are not properly regulated, numerical instabilities such as variance explosion or vanishing gradients compound across depth, leading to catastrophic divergence. Below, we establish unified notation and present the major stabilization strategies tailored for DiTs.

| Symbol | Description | Notes / Constraints |
|---|---|---|
| Input ($\mathbf{x}$) | Token sequence entering a Transformer block | $\mathbf{x} \in \mathbb{R}^{N \times D}$ (N tokens, D dims) |
| Residual transform ($f(\cdot)$) | Sub-layer function (attention or MLP) | Includes linear projections, softmax, activations |
| Residual scaling factor ($\beta$) | Scalar applied to residual branch | $\beta \in \mathbb{R}$ |
| Normalization ($\mathrm{LN}$) | LayerNorm or Adaptive LayerNorm (AdaLN) | Applied pre/post sub-layer |
| Gating ($g$) | Learnable or condition-dependent multiplicative gate | $g \in [0,1]$ |
| DropPath mask ($m$) | Stochastic depth mask | $m \sim \mathrm{Bernoulli}(p)$ |
| Block output ($\mathbf{y}$) | Result after residual fusion | $\mathbf{y} = \mathbf{x} + \text{residual}$ |

5.2.1 Normalization-Centric Stabilization

In DiTs, all gradient flow relies on residual branches. The placement of normalization (LayerNorm) in each Transformer block critically determines whether gradients remain stable. Two canonical designs exist:

  • Post-LN 8: Normalization is applied after the residual addition.

    \[\mathbf{y} = \mathrm{LN}\big(\mathbf{x} + f(\mathbf{x})\big)\]
  • Pre-LN 9: Normalization is applied before the sub-layer transformation, with the residual connection wrapping it.

    \[\mathbf{y} = \mathbf{x} + f\big(\mathrm{LN}(\mathbf{x})\big)\]

Post-LN vs. Pre-LN

The difference between Pre-LN and Post-LN lies in the LayerNorm Jacobian and where it sits in the residual pathway. For an input vector $\mathbf{x} =[x_1, x_2, \dots, x_D]^\top \,\in\,\mathbb{R}^D$, LN computes:

\[\mathbf{y} \;=\; \mathrm{LN}(\mathbf x)\;=\;\frac{\mathbf{\tilde x}}{\sigma}\]

For convenience, we omit the learnable scale and shift; this does not affect the final conclusion.

\[\mu \;=\; \tfrac{1}{D}\mathbf{1}^\top x,\quad \tilde x \;=\; x-\mu\mathbf{1},\quad \sigma^2 \;=\; \tfrac{1}{D}\|\tilde x\|_2^2,\quad \sigma \;=\; \sqrt{\sigma^2+\varepsilon},\]

Differentiating $\mathbf{y}=\mathbf{\tilde x}/\sigma$, the Jacobian of LN with respect to $x$, denoted $J_{\rm LN} \in \mathbb{R}^{D \times D}$, is

\[J_{\rm LN}\;=\;\frac{1}{\sigma}\left(I-\frac{1}{D}\mathbf{1}\mathbf{1}^\top-\frac{1}{D}\mathbf{y}\mathbf{y}^\top\right)\]

Component form (useful for implementation checks):

\[\frac{\partial y_i}{\partial x_j}=\frac{1}{\sigma}\left(\delta_{ij}-\frac{1}{D}-\frac{1}{D}y_i y_j\right)\]

where $\delta_{ij} = 1$ if $i=j$ and $0$ otherwise. $J_{\rm LN}$ has two distinct eigenvalues, $0$ and $1/\sigma$; the eigenvalue $0$ has a two-dimensional eigenspace spanned by the mean direction $\mathbf{1}$ and the current normalized direction $\mathbf{y}$:

\[J_{\rm LN} \cdot \mathbf{1}=\mathbf{0}\quad(\text{shift invariance}),\qquad J_{\rm LN} \cdot \mathbf{y}=\mathbf{0}\quad(\text{scale invariance}).\]

LN kills the two directions to which it is invariant: $J_{\rm LN}$ acts as a projection onto the subspace orthogonal to $\mathbf{1}$ and $\mathbf{y}$, followed by a uniform scaling $1/\sigma$. In other words, up to the factor $1/\sigma$, $J_{\rm LN}$ is the orthogonal projector onto the $(D-2)$-dimensional subspace

\[\mathcal{S} \;=\;\{v\in\mathbb{R}^D:\ \mathbf{1}^\top v=0,\; y^\top v=0\}\]

Geometrically, for any upstream gradient $g=\partial\mathcal{L}/\partial y$,

\[J_{\rm LN} \cdot g \;=\; \frac{1}{\sigma}\Big(g-\overline{g}\,\mathbf{1}-\overline{gy}\,y\Big), \quad \overline{g}=\tfrac{1}{D}\mathbf{1}^\top g,\;\overline{gy}=\tfrac{1}{D}y^\top g.\]

So LN backprop amounts to an orthogonal projection onto $\mathcal{S}$ followed by a scalar scaling $1/\sigma$. Let us first analyze the Post-LN block (LN after the residual sum):

\[x_{l+1} \;=\; \mathrm{LN}(x_l + f(x_l))\]

Let \(g_{l+1}=\partial\mathcal{L}/\partial x_{l+1}\). Backprop:

\[\begin{align} \frac{\partial \mathcal{L}}{\partial x_l} & = \frac{\partial \mathcal{L}}{\partial x_{l+1}} \cdot \frac{\partial x_{l+1}}{\partial x_{l}} = g_{l+1} \cdot J_{\rm LN}(x_l + f(x_l)) \cdot (I+J_{f}(x_l)) \\[10pt] & \approx \frac{1}{\sigma}\, g_{l+1} \cdot (I + J_{f}(x_l)). \end{align}\]

Over $L$ stacked post-LN layers,

\[\|g_0\|_2 \;\approx\; \Big(\prod_{\ell=0}^{L-1} \|I+J_{f_\ell}\|_2 \Big) \Big(\prod_{\ell=0}^{L-1} \frac{1}{\sigma_\ell}\Big) \|g_L\|_2 \,\]

So the product of LN scalings $\prod (1/\sigma_\ell)$ can force vanishing or exploding gradients. Similarly, we analyze the Pre-LN block

\[x_{l+1}=x_l + f(\mathrm{LN}(x_l))\]

Backprop with \(g_{l+1}=\partial\mathcal{L}/\partial x_{l+1}\):

\[\begin{align} \frac{\partial \mathcal{L}}{\partial x_l} & = \frac{\partial \mathcal{L}}{\partial x_{l+1}} \frac{\partial x_{l+1}}{\partial x_l} \\[10pt] & = g_{l+1} (I+J_{f}({\rm LN}(x_l))J_{\rm LN}(x_l)) \end{align}\]

Over $L$ stacked pre-LN layers,

\[\|g_0\|_2 \;\approx\; \Big(\prod_{\ell=0}^{L-1} \|I+J_{f}({\rm LN}(x_\ell))J_{\rm LN}(x_\ell)\|_2 \Big) \|g_{L}\|_2\]

Key consequence. There is an identity bypass term \(g_{l+1}\) that does not pass through any LN Jacobian. Thus, even though the LN Jacobian does appear inside the residual branch, the cross-layer gradient always includes a clean identity path:

\[g_l \;=\; \underbrace{g_{l+1}}_{\text{bypass}} \;+\; \underbrace{(J_f)(J_{\rm LN}) g_{l+1}}_{\text{perturbation}}.\]

so no multiplicative chain of $1/\sigma$ factors hits the through-layer gradient.
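The two invariance directions and the $1/\sigma$ eigenvalue are easy to verify numerically. The short script below is a sanity check of the Jacobian derivation (not training code); it uses `torch.autograd.functional.jacobian` on an affine-free LayerNorm.

```python
import torch

torch.manual_seed(0)
D = 16
x = torch.randn(D, dtype=torch.float64)

def ln(v):
    # LayerNorm without the affine parameters, matching the derivation above.
    mu = v.mean()
    var = ((v - mu) ** 2).mean()
    return (v - mu) / torch.sqrt(var + 1e-12)

y = ln(x)
J = torch.autograd.functional.jacobian(ln, x)        # (D, D) LN Jacobian at x
sigma = torch.sqrt(((x - x.mean()) ** 2).mean() + 1e-12)

ones = torch.ones(D, dtype=torch.float64)
print(torch.allclose(J @ ones, torch.zeros(D), atol=1e-8))  # shift invariance: J . 1 = 0
print(torch.allclose(J @ y, torch.zeros(D), atol=1e-8))     # scale invariance: J . y = 0

# Spectrum: eigenvalue 0 with multiplicity 2, eigenvalue 1/sigma with multiplicity D - 2.
print(torch.linalg.eigvalsh(J))    # J is symmetric without the affine parameters
print(1.0 / sigma)
```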


5.2.2 Residual Scaling for Deep Transformers

As in the U-Net case, residual blocks in DiTs prevent variance accumulation by explicitly scaling residual contributions.

\[\mathbf{y} = \mathbf{x} + \beta f(\mathbf{x}), \quad \beta \approx \tfrac{1}{\sqrt{2}}\]

Without scaling, stacking hundreds of residual paths amplifies variance uncontrollably, destabilizing gradients.
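A quick toy simulation illustrates the effect. Here the residual branch is replaced by independent unit-variance noise, which is enough to expose how the output scale grows with depth and how $\beta$ tempers it; the depth and width are arbitrary choices.

```python
import torch

torch.manual_seed(0)
depth, dim = 200, 4096

for beta in (1.0, 2 ** -0.5):
    x = torch.randn(dim)                    # unit-variance input
    for _ in range(depth):
        f_x = torch.randn(dim)              # stand-in for an independent residual-branch output
        x = x + beta * f_x
    # Variance grows roughly as 1 + depth * beta^2, i.e., std ~ sqrt(1 + depth * beta^2):
    # beta = 1 gives std around sqrt(201) ~ 14, beta ~ 0.707 gives std around sqrt(101) ~ 10.
    print(f"beta = {beta:.3f} -> output std ~ {x.std():.2f}")
```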


5.2.3 Zero-Init and AdaLN-Zero Initialization

Problem. Training very deep DiT stacks is fragile because each block multiplies gradients by the Jacobian of its sublayers. If these Jacobians are even slightly ill-conditioned, the product across dozens or hundreds of layers can explode or collapse. A robust recipe is to make every block start life as an identity map in both the forward and backward passes, then let learning gradually “turn on” the residual paths. This is exactly what Zero-Init and AdaLN-Zero do.


1️⃣ From AdaLN to AdaLN-Zero

Recall AdaLN modulates per-token activations via LayerNorm with learnable scale/shift produced from a condition $c$ (timestep, text, etc.):

\[\operatorname{AdaLN}(u;c)=\gamma(c)\odot \operatorname{LN}(u)+\beta(c).\]

AdaLN-Zero augments this with (i) zero-initialized modulation deltas and (ii) zero-initialized residual gates / heads, so that each residual branch contributes exactly 0 at step 0 and the whole block is \(x_{\text{out}}=x_{\text{in}}\). Concretely:

  • Parameterize modulation as $(1+\Delta\gamma(c))$ and $\Delta\beta(c)$ with \(\Delta\gamma=\mathbf 0,\ \Delta\beta=\mathbf 0\) at init.
  • Add a learnable gate $g(c)$ (scalar / channel-wise / per-head), with $g=\mathbf 0$ at init.
  • Zero-initialize the attention output projection and the second linear of the MLP (the one after the activation), so the residual produces zero regardless of inputs.

These choices ensure that, at initialization, the residual path is “muted” and the modulation is the identity—hence the whole block behaves as an exact identity. The following table summarizes the differences in mathematical form between AdaLN and AdaLN-Zero; a code sketch of this initialization recipe follows the table.

| Model | Output formula (example: single sublayer) | Initialization |
|---|---|---|
| AdaLN | \(x' = x + F(\mathrm{LN}(x);\ \gamma(c),\ \beta(c))\) | $\gamma(c)$ and $\beta(c)$ are initialized normally (random or standard init). |
| AdaLN-Zero | \(x' = x + g(c)\odot F(\mathrm{LN}(x);\ 1+\Delta\gamma(c),\ \Delta\beta(c))\) | All \((\Delta\gamma,\ \Delta\beta,\ g)\) are initialized to 0. |
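One possible implementation of the AdaLN-Zero row is sketched below for a pre-LN block with attention and MLP sublayers. The use of `nn.MultiheadAttention`, the 4× MLP width, and the (shift, scale, gate)-per-sublayer chunking are illustrative assumptions rather than the exact DiT code.

```python
import torch
import torch.nn as nn

class AdaLNZeroBlock(nn.Module):
    """Minimal sketch of a pre-LN Transformer block with AdaLN-Zero initialization."""
    def __init__(self, dim: int, cond_dim: int, num_heads: int = 8, mlp_ratio: int = 4):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.ln2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim), nn.GELU(), nn.Linear(mlp_ratio * dim, dim),
        )
        # One head predicts (shift, scale, gate) for each of the two sublayers.
        self.modulation = nn.Sequential(nn.SiLU(), nn.Linear(cond_dim, 6 * dim))
        # Zero-init the modulation head: delta-gamma = delta-beta = g = 0 at step 0.
        nn.init.zeros_(self.modulation[-1].weight)
        nn.init.zeros_(self.modulation[-1].bias)
        # Zero-init the residual "exits": attention out-projection and second MLP linear.
        nn.init.zeros_(self.attn.out_proj.weight)
        nn.init.zeros_(self.attn.out_proj.bias)
        nn.init.zeros_(self.mlp[-1].weight)
        nn.init.zeros_(self.mlp[-1].bias)

    def forward(self, x: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
        # x: (N, S, D) tokens, c: (N, cond_dim) condition embedding
        sh1, sc1, g1, sh2, sc2, g2 = self.modulation(c).chunk(6, dim=-1)
        h = (1 + sc1[:, None]) * self.ln1(x) + sh1[:, None]
        x = x + g1[:, None] * self.attn(h, h, h, need_weights=False)[0]
        h = (1 + sc2[:, None]) * self.ln2(x) + sh2[:, None]
        x = x + g2[:, None] * self.mlp(h)
        return x   # exactly x_in at initialization, since g1 = g2 = 0
```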

2️⃣ Why this stabilizes deep DiTs (forward & backward)

Consider a pre-LN Transformer block (Self-Attn + MLP), written schematically as

\[x\ \mapsto\ x + g(c)\odot F\!\Big(\operatorname{LN}(x); \ 1+\Delta\gamma(c),\ \Delta\beta(c)\Big).\]

At initialization \(g=\mathbf 0,\ \Delta\gamma=\mathbf 0,\ \Delta\beta=\mathbf 0\), this implies that \(x_{\text{out}}=x\). The block Jacobian is

\[J_{\text{block}} \;=\; I\;+\;\mathrm{diag}(g)\,J_F(u)\,J_{\operatorname{LN}}(x).\]

So \(J_{\text{block}}=I\) at init; stacking $L$ blocks gives

\[J_{\text{net}}=\prod_{\ell=1}^L (I + \underbrace{\mathrm{diag}(g_\ell)\,J_{F_\ell}\,J_{\operatorname{LN}_\ell}}_{\text{=0 at init}})\;\approx\; I,\]

so gradients have a clean identity highway across depth. As learning proceeds, \(g,\Delta\gamma,\Delta\beta\) grow away from zero, gradually enabling the residual computations without ever clobbering the signal scale. This matches the “identity bypass” view of pre-LN stacks: the through-layer gradient always contains an unscaled identity component.


3️⃣ What benefits does AdaLN-Zero bring?

AdaLN-Zero offers a crucial advantage: it guarantees that, at the very beginning of training, the entire network is a clean highway composed of stacked identity mappings. Gradients can flow from the output back to the input without any distortion or loss.

As training progresses, the model gradually and smoothly learns how much each residual branch should contribute, rather than being flooded at initialization by large, randomly scaled residual signals. This gentle “warm-up” of the residual pathways prevents unstable activations and chaotic gradient dynamics.

Such a property is vital for the stability of very deep Transformer-based architectures—especially diffusion Transformers (DiT)—where even small initialization imbalances can accumulate across hundreds of layers and derail optimization. AdaLN-Zero effectively transforms the entire network into a stable, well-conditioned system that can scale depth without sacrificing trainability.


5.2.4 Gated FFN for Stable Transformers

Motivation: Limitations of Standard FFN. In the Transformer architecture, each block contains a feed-forward network (FFN) applied position-wise after the attention sublayer. The standard FFN is defined as

\[\mathrm{FFN}(x) = \phi(xW_1 + b_1)W_2 + b_2,\]

where $\phi$ is a pointwise activation function such as ReLU or GELU. Although simple and effective, this linear–activation–linear design has two notable drawbacks:

  1. Uniform activation: all hidden channels are modulated only by the same scalar nonlinearity, lacking fine-grained channel selection.
  2. Gradient scaling: the Jacobian spectrum depends entirely on $\phi$, making deep stacks sensitive to the choice of activation and initialization.

To address this, researchers introduced gated feed-forward networks, where two parallel projections are combined multiplicatively, enabling channel-wise gating.

Unified Formulation of Gated FFN: A gated FFN replaces the single expansion with two parallel linear layers:

\[a = xW_a+b_a,\qquad b=xW_b+b_b,\]

and defines the hidden activation as

\[h(x) = \phi(a)\odot \psi(b),\]

Here, $\phi$ is the content activation and $\psi$ is the gate. This design provides dimension-wise control: channels can be selectively suppressed or emphasized, improving stability in deep stacks. The hidden activation is then projected back to the model dimension:

\[\mathrm{GatedFFN}(x) = h(x)W_o+b_o, \quad W_o\in\mathbb{R}^{d_h\times d}.\]

The idea originates from the Gated Linear Unit (GLU) 15, introduced in language modeling. Subsequent work (Shazeer, 2020) 14 proposed several variants that differ only in the choice of activation for the content stream. We summarize the differences between the standard FFN, the gated FFN, and its variants in the following table; a SwiGLU sketch follows the table.

| Variant | Formula for $h$ | Gate $\psi$ | Content $\phi$ | Notes |
|---|---|---|---|---|
| Standard FFN | $\phi(xW_1)$ | – | ReLU / GELU | Single stream; no gating |
| GLU | $(xW_a)\odot\sigma(xW_b)$ | Sigmoid | Identity | First gated FFN |
| ReGLU | $\mathrm{ReLU}(xW_a)\odot(xW_b)$ | Identity | ReLU | Simpler, stronger than GLU |
| GeGLU | $\mathrm{GELU}(xW_a)\odot(xW_b)$ | Identity | GELU | Aligns with Transformer defaults |
| SwiGLU | $\mathrm{Swish}(xW_a)\odot(xW_b)$ | Identity | Swish | Current SOTA; adopted in PaLM, LLaMA |
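As a concrete example, here is a minimal sketch of the SwiGLU row of the table; the 4× hidden width is an illustrative default (real implementations often shrink it to keep the parameter count comparable to a standard FFN).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """Minimal sketch of a gated FFN: h = Swish(x W_a) * (x W_b), then project back with W_o."""
    def __init__(self, dim: int, hidden: int = 0):
        super().__init__()
        hidden = hidden or 4 * dim
        self.w_a = nn.Linear(dim, hidden)   # content stream, passed through Swish / SiLU
        self.w_b = nn.Linear(dim, hidden)   # gate stream (identity activation)
        self.w_o = nn.Linear(hidden, dim)   # projection back to the model dimension

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_o(F.silu(self.w_a(x)) * self.w_b(x))
```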

6. Conclusion

Diffusion models succeed or fail not only because of losses, schedules, or optimizers, but because of their architectural choices. Across this article, we framed stability through two complementary levers: a control system that calibrates activations and embeds conditions (e.g., AdaGN/AdaLN), and signal pathways that govern how information flows (residual and skip routes). In U-Nets, stability is primarily about regulating skip fusion so shallow, high-variance details do not drown out deep semantics; in DiTs, which remove long skips, stability lives or dies by per-block design of normalization, residual scaling, and initialization. This separation of concerns explains why U-Nets emphasize skip regulation while DiTs emphasize normalization-centric residual design.

  • U-Net (encoder–decoder with long skips). Stable training relies on distributional alignment and amplitude regulation at fusion: normalize the skip, scale it, and re-normalize the sum to keep variance in check; then add gating (time/noise-aware, content-aware, or attention-based) so the network learns when and where detail should pass. Conditioning is most robust when implemented as modulation on normalization (AdaGN), which simultaneously stabilizes statistics and injects semantics across depth.

  • DiT (flat Transformer stacks). Without long skips, all gradient flow rides the residual branch, so architectural stability hinges on: (i) Pre-LN placement to preserve a direct gradient highway; (ii) residual scaling (e.g., \(\beta \approx 1/\sqrt{2}\)) to prevent variance blow-up across depth; (iii) identity-preserving initialization (Zero-Init / AdaLN-Zero) so each block starts near an identity in forward and backward passes; and (iv) gated FFN variants to tame activation dynamics.

A practical checklist for stability-oriented design.

  1. Choose normalization as a controller, not a patch. Use GN/AdaGN in U-Nets (small batches, per-sample stability), AdaLN in DiTs; prefer Pre-LN in Transformers to keep gradients well-conditioned across hundreds of layers.
  2. Regulate additive paths. For U-Net skips, apply (normalize → scale → renormalize); for residuals in any backbone, use explicit scaling (e.g., \(\beta\)) to avoid cumulative variance growth.
  3. Gate by signal quality. Make skip (or residual) contribution a function of timestep/SNR and content, so high-noise states don’t leak clutter and low-noise states pass crisp detail.
  4. Start from identity. Zero-init residual-path scalars (AdaLN-Zero) to ensure early-training stability, then learn controlled deviations from identity.

Broader lesson. Stability is an emergent property of architecture–conditioning–noise-schedule co-design. Treat normalization as the global controller, and treat skip/residual pathways as regulated channels whose variance and semantics must be balanced. In U-Nets this means fusing cautiously; in DiTs it means stacking safely.

Looking ahead. Promising directions include: (i) hybrid backbones that marry U-Net multi-scale inductive biases with Transformer token-level global reasoning; (ii) tighter architecture–preconditioning coupling for different noise parameterizations; and (iii) stability-aware lightweighting (gated residuals, linearized attention, and teacher-guided distillation) to push the Pareto frontier of fidelity vs. efficiency.

By elevating stability to a first-class architectural goal—and by applying the concrete rules distilled here—practitioners can scale diffusion systems more reliably, and researchers can explore deeper, broader, and more conditioned models without trading away robustness.


7. References

  1. Ronneberger O, Fischer P, Brox T. U-net: Convolutional networks for biomedical image segmentation[C]//International Conference on Medical image computing and computer-assisted intervention. Cham: Springer international publishing, 2015: 234-241.

  2. Ho J, Jain A, Abbeel P. Denoising diffusion probabilistic models[J]. Advances in neural information processing systems, 2020, 33: 6840-6851. 

  3. Dhariwal P, Nichol A. Diffusion models beat gans on image synthesis[J]. Advances in neural information processing systems, 2021, 34: 8780-8794. 

  4. Perez E, Strub F, De Vries H, et al. Film: Visual reasoning with a general conditioning layer[C]//Proceedings of the AAAI conference on artificial intelligence. 2018, 32(1). 

  5. Brock A, Donahue J, Simonyan K. Large scale GAN training for high fidelity natural image synthesis[J]. arXiv preprint arXiv:1809.11096, 2018. 

  6. Rombach R, Blattmann A, Lorenz D, et al. High-resolution image synthesis with latent diffusion models[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2022: 10684-10695. 

  7. Podell D, English Z, Lacey K, et al. Sdxl: Improving latent diffusion models for high-resolution image synthesis[J]. arXiv preprint arXiv:2307.01952, 2023. 

  8. Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need[J]. Advances in neural information processing systems, 2017, 30.

  9. Peebles W, Xie S. Scalable diffusion models with transformers[C]//Proceedings of the IEEE/CVF international conference on computer vision. 2023: 4195-4205.

  10. Ioffe S, Szegedy C. Batch normalization: Accelerating deep network training by reducing internal covariate shift[C]//International conference on machine learning. pmlr, 2015: 448-456. 

  11. Wu Y, He K. Group normalization[C]//Proceedings of the European conference on computer vision (ECCV). 2018: 3-19. 

  12. Ba J L, Kiros J R, Hinton G E. Layer normalization[J]. arXiv preprint arXiv:1607.06450, 2016. 

  13. Karras T, Laine S, Aittala M, et al. Analyzing and improving the image quality of StyleGAN[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2020.

  14. Shazeer N. Glu variants improve transformer[J]. arXiv preprint arXiv:2002.05202, 2020.

  15. Dauphin Y N, Fan A, Auli M, et al. Language modeling with gated convolutional networks[C]//International conference on machine learning. PMLR, 2017: 933-941. 
