Inverse Problems with Generative Priors


Inverse problems, which aim to recover a signal of interest from indirect and often corrupted measurements, are a cornerstone of computational science and engineering. These problems are typically ill-posed, necessitating the use of prior knowledge to regularize the solution space and ensure a unique and stable reconstruction. This paper provides a structured exposition of the evolution of priors in solving inverse problems, from classical formulations to the modern paradigm of deep generative models. We begin by formalizing the inverse problem from a probabilistic perspective, deriving the Maximum Likelihood Estimation (MLE) and Maximum A Posteriori (MAP) frameworks, which correspond to solutions without and with priors, respectively. We then review classical, non-generative regularization techniques, focusing on the influential Plug-and-Play (PnP) and Regularization by Denoising (RED) frameworks, which leverage off-the-shelf denoisers as implicit priors. The core of this paper is dedicated to the contemporary approach of employing deep generative models—including Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), Normalizing Flows, and Diffusion Models—as powerful, explicit priors. We detail how the unique properties of each model class can be integrated into the inverse problem objective, enabling state-of-the-art performance by constraining solutions to a learned data manifold. This work aims to bridge the conceptual gap between classical and modern techniques, offering a unified view of the role of priors in solving generative inverse problems.


1. The Probabilistic Formulation of Inverse Problems

An inverse problem seeks to recover an unknown signal $\mathbf{x} \in \mathbb{R}^n$ from a set of observed measurements $\mathbf{y} \in \mathbb{R}^m$. This process is typically modeled by a linear forward operator $\mathbf{A}: \mathbb{R}^n \rightarrow \mathbb{R}^m$ and corrupted by additive noise $\mathbf{n} \in \mathbb{R}^m$:

\[\mathbf{y} = \mathbf{A}\mathbf{x} + \mathbf{n}\]

The operator $\mathbf{A}$ models the physics of the measurement process, such as a blurring kernel in deblurring, a Radon transform in computed tomography (CT), or a subsampling mask in magnetic resonance imaging (MRI). The problem is deemed “inverse” because we seek to reverse the effect of $\mathbf{A}$. It is often ill-posed, meaning that a solution may not exist, may not be unique, or may not depend continuously on the data. This ill-posedness arises when $\mathbf{A}$ is rank-deficient or ill-conditioned ($m < n$ or singular values decay rapidly), making the recovery of $\mathbf{x}$ from $\mathbf{y}$ an ambiguous task.
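As a toy illustration of how $\mathbf{A}$ encodes the measurement physics, and why $m < n$ makes recovery ambiguous, consider inpainting: $\mathbf{A}$ is simply a row-selection matrix that keeps the observed pixels. A minimal NumPy sketch (the sizes are made up for illustration):

# Inpainting as a linear forward operator: A keeps m observed entries out of n
import numpy as np

n, m = 16, 10                                  # toy sizes: n unknowns, m < n measurements
rng = np.random.default_rng(0)
obs = rng.choice(n, size=m, replace=False)     # indices of the observed entries
A = np.eye(n)[obs]                             # m x n row-selection matrix
x = rng.standard_normal(n)                     # ground-truth signal
y = A @ x + 0.01 * rng.standard_normal(m)      # noisy measurements y = A x + n
# A has a non-trivial null space (m < n): infinitely many x' satisfy A x' = A x.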


1.1 Inverse Problems without Priors: Maximum Likelihood Estimation

In the absence of any prior knowledge about the signal $\mathbf{x}$, our estimation relies solely on the data formation model. A common and mathematically convenient assumption is that the noise $\mathbf{n}$ is independent and identically distributed (i.i.d.) Gaussian with zero mean and variance $\sigma^2$, i.e.,

\[\mathbf{n} \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I})\]

This assumption allows us to formulate the problem within the framework of Maximum Likelihood Estimation (MLE). The likelihood function $p(\mathbf{y}\mid \mathbf{x})$ describes the probability of observing the measurements $\mathbf{y}$ given a specific signal $\mathbf{x}$. Under the Gaussian noise assumption, the relationship $\mathbf{y} = \mathbf{A}\mathbf{x} + \mathbf{n}$ implies that $\mathbf{y}$ is also Gaussian-distributed, centered at $\mathbf{A}\mathbf{x}$:

\[p(\mathbf{y}\mid \mathbf{x}) = \mathcal{N}(\mathbf{y}; \mathbf{A}\mathbf{x}, \sigma^2\mathbf{I})\]

The probability density function for this multivariate Gaussian distribution is given by:

\[p(\mathbf{y}\mid \mathbf{x}) = \frac{1}{(2\pi\sigma^2)^{m/2}} \exp \left(\,-\frac{1}{2\sigma^2} \|\mathbf{y} - \mathbf{A}\mathbf{x}\|_2^2\, \right)\]

The MLE principle seeks the estimate $\hat{\mathbf{x}}_{\text{MLE}}$ that maximizes this likelihood. For computational stability and simplicity, it is standard practice to maximize the log-likelihood instead:

\[\hat{\mathbf{x}}_{\text{MLE}} = \arg\max_{\mathbf{x}} \log p(\mathbf{y}\mid \mathbf{x}) = \arg\max_{\mathbf{x}} \left[ -\frac{m}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \|\mathbf{y} - \mathbf{A}\mathbf{x}\|_2^2 \right]\]

Since the first term is a constant with respect to $\mathbf{x}$, and maximizing the negative of a function is equivalent to minimizing the function itself, the MLE objective simplifies to a least-squares problem:

\[\hat{\mathbf{x}}_{\text{MLE}} = \arg\min_{\mathbf{x}} \underbrace{\,\|\mathbf{y} - \mathbf{A}\mathbf{x}\|_2^2\, }_{\text{data consistency}}\]

This objective is the familiar least-squares (or MSE) data-fidelity term. While principled, it is insufficient for ill-posed problems: many different signals $\mathbf{x}$ can yield a similarly small residual $\|\mathbf{y} - \mathbf{A}\mathbf{x}\|_2^2$, often resulting in solutions dominated by noise.
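To make this failure mode concrete, suppose for a moment that $\mathbf{A}$ has full column rank with singular value decomposition $\mathbf{A} = \sum_i s_i \mathbf{u}_i \mathbf{v}_i^T$. The least-squares minimizer then has a closed form, and the measurement noise is amplified by the reciprocal singular values:

\[\hat{\mathbf{x}}_{\text{MLE}} = (\mathbf{A}^T\mathbf{A})^{-1}\mathbf{A}^T\mathbf{y} = \mathbf{x} + \mathbf{A}^{+}\mathbf{n}, \qquad \mathbf{A}^{+}\mathbf{n} = \sum_{i}\frac{\mathbf{u}_i^T\mathbf{n}}{s_i}\,\mathbf{v}_i.\]

When the $s_i$ decay rapidly, even small noise components along the corresponding directions dominate the reconstruction; when $m < n$, the minimizer is not even unique.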


1.2 Inverse Problems with Priors: Maximum A Posteriori Estimation

To overcome the limitations of MLE, we introduce prior knowledge about the expected properties of the signal $\mathbf{x}$. This is formalized using a Bayesian approach. We define a prior distribution $p(\mathbf{x})$ that assigns higher probability to signals that are plausible (e.g., natural images with smooth regions and sharp edges) and lower probability to others (e.g., pure noise).

Using Bayes’ theorem, we can combine the likelihood $p(\mathbf{y}\mid \mathbf{x})$ with the prior $p(\mathbf{x})$ to obtain the posterior distribution $p(\mathbf{x}\mid \mathbf{y})$:

\[p(\mathbf{x}\mid \mathbf{y}) = \frac{p(\mathbf{y}\mid \mathbf{x})p(\mathbf{x})}{p(\mathbf{y})}\]

The posterior represents our updated belief about $\mathbf{x}$ after observing the data $\mathbf{y}$. The goal of Maximum A Posteriori (MAP) estimation is to find the signal $\hat{\mathbf{x}}_{\text{MAP}}$ that maximizes this posterior probability:

\[\hat{\mathbf{x}}_{\text{MAP}} = \arg\max_{\mathbf{x}} p(\mathbf{x}\mid \mathbf{y}) = \arg\max_{\mathbf{x}} \frac{p(\mathbf{y}\mid \mathbf{x})p(\mathbf{x})}{p(\mathbf{y})}\]

Since the evidence $p(\mathbf{y})$ is constant with respect to $\mathbf{x}$, the MAP objective becomes:

\[\hat{\mathbf{x}}_{\text{MAP}} = \arg\max_{\mathbf{x}} p(\mathbf{y}\mid \mathbf{x})p(\mathbf{x})\]

Again, we work with the log-posterior for convenience:

\[\hat{\mathbf{x}}_{\text{MAP}} = \arg\max_{\mathbf{x}} \left[ \log p(\mathbf{y}|\mathbf{x}) + \log p(\mathbf{x}) \right]\]

Substituting the log-likelihood term from the Gaussian noise model, we get:

\[\hat{\mathbf{x}}_{\text{MAP}} = \arg\max_{\mathbf{x}} \left[ -\frac{1}{2\sigma^2} \|\mathbf{y} - \mathbf{A}\mathbf{x}\|_2^2 + \log p(\mathbf{x}) \right]\]

This is equivalent to minimizing the negative log-posterior:

\[\hat{\mathbf{x}}_{\text{MAP}} = \arg\min_{\mathbf{x}} \left[ \frac{1}{2\sigma^2} \|\mathbf{y} - \mathbf{A}\mathbf{x}\|_2^2 - \log p(\mathbf{x}) \right]\]

This formulation elegantly connects Bayesian inference to regularized optimization. If we define a regularization function $R(\mathbf{x}) = -\log p(\mathbf{x})$ and a regularization parameter $\lambda = 2\sigma^2$, the MAP estimation is precisely equivalent to solving the following optimization problem:

\[\hat{\mathbf{x}}_{\text{MAP}} = \arg\min_{\mathbf{x}} \left[ \underbrace{\|\mathbf{y} - \mathbf{A}\mathbf{x}\|_2^2}_{\text{data consistency}} + \underbrace{\lambda R(\mathbf{x})}_{\text{Prior / Regularization}} \right]\]

Here, the data fidelity term ensures consistency with the measurements, while the regularization term $R(\mathbf{x})$ enforces the prior. The choice of $R(\mathbf{x})$ is paramount and has been the subject of extensive research.
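To make the correspondence $R(\mathbf{x}) = -\log p(\mathbf{x})$ concrete (up to additive constants that do not affect the minimizer): a zero-mean Gaussian prior yields the classical L2 penalty, while a Laplacian prior yields the L1 penalty that underlies the sparsity priors discussed below:

\[p(\mathbf{x}) \propto \exp\!\left(-\tfrac{1}{2\tau^2}\|\mathbf{x}\|_2^2\right) \;\Rightarrow\; R(\mathbf{x}) = \tfrac{1}{2\tau^2}\|\mathbf{x}\|_2^2, \qquad p(\mathbf{x}) \propto \exp\!\left(-\tfrac{1}{b}\|\mathbf{x}\|_1\right) \;\Rightarrow\; R(\mathbf{x}) = \tfrac{1}{b}\|\mathbf{x}\|_1.\]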


2. Classical Regularization via Implicit Priors

Before the widespread adoption of deep generative models, the regularization term $R(\mathbf{x})$ in the MAP objective was primarily designed based on handcrafted statistical assumptions about the signal. These priors, while not explicitly modeling the entire data distribution $p(\mathbf{x})$, were instrumental in making ill-posed problems solvable. They can be broadly categorized based on their underlying assumptions.


2.1 Smoothness Priors

Core Assumption: Natural signals, such as images, are predominantly smooth or piecewise-constant; high-frequency variations are often attributable to noise rather than to the underlying signal structure. Representative regularizers include:

  1. Tikhonov Regularization: This is one of the earliest and most fundamental regularizers, penalizing the L2-norm of the signal’s derivatives. The functional is $R(\mathbf{x}) = \|\mathbf{L}\mathbf{x}\|_2^2$, where $\mathbf{L}$ is a linear operator, typically a discrete approximation of the gradient or Laplacian operator (a closed-form solution is worked out just after this list).

  2. Total Variation (TV): A significant advancement over Tikhonov, the TV regularizer is defined as the L1-norm of the signal’s gradient magnitude: $R(\mathbf{x}) = \|\nabla \mathbf{x}\|_1$. Unlike the L2-norm, which penalizes large gradients heavily, the L1-norm promotes solutions where gradients are sparse, effectively preserving sharp edges while enforcing smoothness in flat regions.
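As referenced in item 1, the Tikhonov objective is smooth and quadratic, so setting its gradient to zero gives a closed-form solution (assuming the matrix below is invertible); TV, by contrast, is non-smooth and is typically minimized with iterative proximal or primal-dual methods:

\[\hat{\mathbf{x}}_{\text{Tik}} = \arg\min_{\mathbf{x}}\; \|\mathbf{y}-\mathbf{A}\mathbf{x}\|_2^2 + \lambda\|\mathbf{L}\mathbf{x}\|_2^2 = \left(\mathbf{A}^T\mathbf{A} + \lambda\,\mathbf{L}^T\mathbf{L}\right)^{-1}\mathbf{A}^T\mathbf{y}.\]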


2.2 Sparsity Priors

Core Assumption: Many signals, while not sparse in their native domain (e.g., pixel space), can be represented sparsely in a suitable transform domain. This means that when the signal $\mathbf{x}$ is transformed by a basis or dictionary $\mathbf{\Psi}$, the resulting coefficient vector $\mathbf{\alpha} = \mathbf{\Psi}\mathbf{x}$ has very few non-zero entries.

The most common regularizer to enforce sparsity is the L1-norm of the transform coefficients:

\[R(\mathbf{x}) = \|\mathbf{\Psi}\mathbf{x}\|_1\]

The choice of $\mathbf{\Psi}$ is critical and domain-specific, with common choices including the Wavelet transform for piecewise-smooth images, the Discrete Cosine Transform (DCT) for photographic images, or a learned dictionary. This forms the cornerstone of compressed sensing theory.
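A standard way to minimize the resulting objective is the iterative shrinkage-thresholding algorithm (ISTA), which alternates a gradient step on the data term with soft-thresholding of the transform coefficients. A minimal sketch, assuming A and Psi are explicit matrices (PyTorch tensors), Psi is orthonormal, and step is no larger than the reciprocal of the largest eigenvalue of $\mathbf{A}^T\mathbf{A}$ (large-scale solvers use fast operators rather than dense matrices):

# ISTA sketch for  min_x 0.5 * ||A x - y||^2 + lam * ||Psi x||_1,  with Psi orthonormal
import torch

def soft_threshold(v, tau):
    return torch.sign(v) * torch.clamp(v.abs() - tau, min=0.0)

def ista(A, Psi, y, lam, step, iters):
    x = torch.zeros(A.shape[1])
    for _ in range(iters):
        v = x - step * (A.T @ (A @ x - y))            # gradient step on the data term
        alpha = soft_threshold(Psi @ v, step * lam)   # shrink the transform coefficients
        x = Psi.T @ alpha                             # map back to the signal domain
    return x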


2.3 Self-Similarity and Low-Rank Priors

Core Assumption: Natural images exhibit a high degree of redundancy; small patches often repeat throughout the image, either exactly or with slight variations. A collection of such similar patches, when vectorized and stacked as columns of a matrix, forms a matrix with low rank. Representative regularizers include:

  1. Non-Local Priors: Algorithms like Non-Local Means (NL-Means) leverage this assumption for denoising by averaging pixels weighted by the similarity of their surrounding patches, not just spatial proximity.

  2. Low-Rank Matrix Regularization: This idea is formalized by constructing a matrix $\mathbf{P}_{\mathbf{x}}$ from similar patches in the image $\mathbf{x}$ and imposing a low-rank constraint. Since the rank function is non-convex and difficult to optimize, its convex relaxation, the nuclear norm $\|\cdot\|_*$ (the sum of singular values), is used as the regularizer: $R(\mathbf{x}) = \|\mathbf{P}_{\mathbf{x}}\|_*$ (a minimal sketch of the corresponding proximal step follows this list). Landmark methods like BM3D combine patch grouping (self-similarity) with collaborative sparse coding of the patch groups.
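The proximal operator of the nuclear norm $\tau\|\cdot\|_*$ is singular value thresholding (SVT): compute an SVD of the patch matrix and soft-threshold its singular values. A minimal PyTorch sketch:

# Singular value thresholding: the proximal operator of  tau * ||M||_*  (nuclear norm)
import torch

def svt(M, tau):
    U, S, Vh = torch.linalg.svd(M, full_matrices=False)
    S_shrunk = torch.clamp(S - tau, min=0.0)          # soft-threshold the singular values
    return U @ torch.diag(S_shrunk) @ Vh

Low-rank patch methods apply an operator of this kind to each group of similar patches before aggregating the results back into the image.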


2.4 Learning-Based Implicit Priors

This class represents a bridge between handcrafted priors and explicit generative models. Instead of defining a fixed analytical regularizer, these methods learn an implicit prior from data, encapsulated within a powerful denoising algorithm.

Core Assumption: The manifold of natural signals can be implicitly characterized by the action of a state-of-the-art denoiser. A clean signal should be a fixed point of the denoiser, i.e., $\mathcal{D}(\mathbf{x}) \approx \mathbf{x}$, while a noisy signal will be moved towards the manifold. Representative frameworks include:

  1. Plug-and-Play (PnP): The PnP framework leverages optimization algorithms that split the MAP objective into separate data-fidelity and regularization steps (e.g., ADMM, HQS). The key insight is that the proximal operator associated with the regularizer, $\text{prox}_{\lambda R}(\cdot)$, can be interpreted as a denoising operation (the MAP denoiser under the prior $R$). PnP replaces this abstract proximal step with a call to a powerful, pre-trained denoiser $\mathcal{D}(\cdot)$, such as BM3D or a deep network like DnCNN.

  2. Regularization by Denoising (RED): RED provides a more formal foundation by constructing an explicit regularizer $R(\mathbf{x})$ from a denoiser $\mathcal{D}(\mathbf{x})$, whose gradient is defined as $\nabla R(\mathbf{x}) = \mathbf{x} - \mathcal{D}(\mathbf{x})$. This allows the use of standard gradient-based optimization methods on the MAP objective $\arg\min_{\mathbf{x}} \|\mathbf{y} - \mathbf{A}\mathbf{x}\|_2^2 + \lambda R(\mathbf{x})$, with the gradient of the second term provided by the denoiser’s residual (see the sketch after this list).
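As promised above, here is a minimal gradient-descent sketch of the RED recipe. The names A, At, and D are placeholders for the forward operator, its adjoint, and any pre-trained denoiser; the residual identity $\nabla R(\mathbf{x}) = \mathbf{x} - \mathcal{D}(\mathbf{x})$ holds under the regularity conditions assumed in the RED framework:

# RED sketch: gradient descent with the denoiser residual as the prior gradient
import torch

def red_gradient_descent(A, At, D, y, lam, eta, iters, x0):
    x = x0.clone()
    for _ in range(iters):
        grad_data = At(A(x) - y)                       # gradient of 0.5 * ||A x - y||^2
        grad_prior = x - D(x)                          # RED: grad R(x) = x - D(x)
        x = x - eta * (grad_data + lam * grad_prior)   # plain gradient step
    return x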


2.5 Summary of Classical Regularization Priors

While powerful, these classical and implicit methods do not model the full data distribution $p(\mathbf{x})$. This limits their ability to, for instance, quantify uncertainty or generate diverse solutions consistent with the measurements. This limitation motivates the paradigm shift towards using explicit deep generative models as priors, as we will explore in the next section.

| Prior Category | Core Assumption | Representative Regularizers / Methods | Advantages | Disadvantages |
|---|---|---|---|---|
| Smoothness Priors | Signals are predominantly smooth or piecewise-constant; high-frequency variations are noise. | Tikhonov: $\Vert\mathbf{L}\mathbf{x}\Vert_2^2$ <br> Total Variation (TV): $\Vert\nabla \mathbf{x}\Vert_1$ | Simple, convex, and computationally efficient. <br> Effective for inherently smooth or blocky signals. | Tikhonov oversmooths, blurring sharp edges. <br> TV can create “staircasing” artifacts. <br> Both are too simple for complex textures. |
| Sparsity Priors | Signals have a sparse representation in a suitable transform domain (e.g., Wavelets, DCT). | L1-norm of transform coefficients: $\Vert\mathbf{\Psi}\mathbf{x}\Vert_1$; foundational to compressed sensing. | Strong theoretical recovery guarantees under certain conditions. <br> Excellent for signals with a known sparsifying basis. | Relies on finding a suitable transform $\mathbf{\Psi}$. <br> Ineffective for signals without a sparse representation. <br> Can bias coefficient amplitudes. |
| Self-Similarity & Low-Rank Priors | Natural signals contain repetitive patterns; grouped similar patches form a low-rank matrix. | Methods: NL-Means, BM3D. <br> Regularizer: nuclear norm of the patch matrix, $\Vert\mathbf{P}_{\mathbf{x}}\Vert_*$ | Captures complex non-local structures and textures effectively. <br> Achieved state-of-the-art results for many years. | Computationally very expensive due to patch searching and grouping. <br> Can introduce patch-based artifacts. <br> Difficult to formulate as a simple, explicit regularizer. |
| Learning-Based Implicit Priors | The manifold of natural signals can be implicitly defined by the action of a powerful denoiser. | Plug-and-Play (PnP): replaces the proximal operator with a denoiser $\mathcal{D}(\cdot)$. <br> Regularization by Denoising (RED): defines the prior gradient as $\nabla R(\mathbf{x}) = \mathbf{x} - \mathcal{D}(\mathbf{x})$. | Highly flexible and modular; leverages SOTA denoisers (incl. deep nets). <br> Excellent empirical performance without training a full generative model. | A “black box” prior. <br> Convergence guarantees can be weak, especially for PnP with complex denoisers. <br> Performance depends entirely on the quality and domain of the pre-trained denoiser. |

3. Deep Generative Models as Explicit Priors

PnP and RED use denoisers as powerful but implicit priors. The latest paradigm shift involves using deep generative models that explicitly learn the entire data distribution $p(\mathbf{x})$ from a training set. This allows for a more direct and expressive enforcement of the prior. The core idea is to constrain the solution $\mathbf{x}$ to lie on the low-dimensional manifold learned by the model.


3.1 Generative Adversarial Networks: Manifold Priors

A standard GAN consists of a generator $\mathcal{G}(\mathbf{z})$ and a discriminator $\mathcal{D}$. Neither component provides a tractable method for evaluating $p(\mathbf{x})$ for an arbitrary signal $\mathbf{x}$.

Therefore, applying a GAN prior does not involve formulating an explicit regularizer

\[R(\mathbf{x})=-\log p(\mathbf{x})\]

that is then added to the data consistency term. Instead, the prior is enforced through a reparameterization of the solution space. It acts as a powerful manifold constraint, fundamentally changing the nature of the optimization problem itself.

Suppose a pre-trained GAN generator

\[\mathcal{G}: \mathcal{Z} \rightarrow \mathbb{R}^n, \qquad \text{where } \mathcal{Z} \subseteq \mathbb{R}^d \text{ and } d \ll n,\]

learns a mapping from a simple, low-dimensional latent space $\mathcal{Z}$ to the high-dimensional signal space $\mathbb{R}^n$. The core assumption when using a GAN as a prior is that all plausible signals (e.g., natural images) lie on or very close to the low-dimensional manifold defined by the range of the generator. We denote this manifold as $\mathcal{M}$:

\[\mathcal{M} = \{ \mathcal{G}(\mathbf{z}) \mid \mathbf{z} \in \mathcal{Z} \}\]

This manifold assumption can be translated into an explicit, albeit impractical, regularizer: an indicator-like function that is finite only on the manifold $\mathcal{M}$ and incorporates the latent prior there:

\[R_{\text{GAN}}(\mathbf{x}) = \begin{cases} -\log p(\mathbf{z}) & \text{if } \mathbf{x}=\mathcal{G}(\mathbf{z}) \in \mathcal{M} \\ \\ +\infty & \text{if } \mathbf{x} \notin \mathcal{M} \end{cases}\]

If we were to plug this into the MAP objective, we would get:

\[\begin{align} \hat{\mathbf{x}} & = \arg\min_{\mathbf{x} \in \mathbb{R}^n} \left( \|\mathbf{y} - \mathbf{A}\mathbf{x}\|_2^2 + \lambda R_{\text{GAN}}(\mathbf{x}) \right) \\[10pt] \Longrightarrow \hat{\mathbf{z}} & = \arg\min_{\mathbf{z}} \|\mathbf{y} - \mathbf{A}(\mathcal{G}(\mathbf{z}))\|_2^2 + \lambda \left( -\log p(\mathbf{z}) \right) \end{align}\]

This objective forces the solution $\mathbf{x}$ to be strictly within the manifold $\mathcal{M}$ (to avoid the infinite penalty), and among all possible solutions on the manifold, it finds the one that best fits the measurements $\mathbf{y}$. For the standard Gaussian latent prior $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$, the latent regularizer reduces to a simple squared norm:

\[-\log p(\mathbf{z}) = -\log \left( \frac{1}{(2\pi)^{d/2}} \exp\left(-\frac{1}{2}\|\mathbf{z}\|_2^2\right) \right) = \frac{1}{2}\|\mathbf{z}\|_2^2 + \text{const.}\]

This leads to the most common and practical objective function for GAN-based inverse problems:

\[\hat{\mathbf{z}} = \arg\min_{\mathbf{z} \in \mathcal{Z}} \left( \|\mathbf{y} - \mathbf{A}(\mathcal{G}(\mathbf{z}))\|_2^2 + \lambda \|\mathbf{z}\|_2^2 \right)\]

A minimal PyTorch-style sketch (the pre-trained generator G_phi, the forward operator A applied as a function, the measurements y, and the hyperparameters d, eta, lam, sigma2, iters are assumed to be defined):

# GAN latent optimization: x = G_phi(z), z ~ N(0, I)
import torch

z = torch.randn(1, d, requires_grad=True)            # latent initialization, z ~ N(0, I)
opt = torch.optim.Adam([z], lr=eta)
for k in range(iters):
    x = G_phi(z)                                      # candidate signal on the generator manifold
    data = (A(x) - y).pow(2).sum() / (2 * sigma2)     # data consistency ||A x - y||^2 / (2 sigma^2)
    prior = 0.5 * z.pow(2).sum()                      # -log p(z) up to a constant
    # (optional) add identity/perceptual/CLIP regularizers as needed
    loss = data + lam * prior
    opt.zero_grad(); loss.backward(); opt.step()
x_hat = G_phi(z).detach()                             # final reconstruction G_phi(z*)

3.2 Variational Autoencoders: Latent-Variable Priors

A VAE defines a latent-variable generative model with a decoder (generator):

\[x\sim p_\theta(x\mid z), \qquad \text{where } z\sim p(z)=\mathcal{N}(0,I),\]

and an encoder $q_\phi(z\mid x)$. This allows the marginal likelihood $\log p_\theta(x)$ to be lower-bounded by the well-known Evidence Lower Bound (ELBO):

\[\log p_\theta(x)\;\ge\; \mathbb{E}_{q_\phi(z\mid x)}\!\left[\log p_\theta(x\mid z)\right]- D_{\mathrm{KL}}\!\left(q_\phi(z\mid x)\,\|\, p(z)\right) \triangleq \mathrm{ELBO}(x;\theta,\phi).\]

At first glance, the ELBO appears to offer a tractable surrogate for $-\log p(\mathbf{x})$, suggesting the regularized objective

\[\hat{\mathbf{x}} = \arg\min_{\mathbf{x}} \left( \|\mathbf{y} - \mathbf{A}\mathbf{x}\|_2^2 - \lambda \cdot \text{ELBO}(\mathbf{x}) \right)\]

However, this approach is problematic for several reasons:

  1. Problem 1: Optimization through the encoder. The ELBO depends on $\mathbf{x}$ only through the encoder $q_{\phi}(\mathbf{z}\mid \mathbf{x})$, yet during inversion we are searching for an unknown, clean signal $\mathbf{x}$.

    Minimizing the objective with respect to $\mathbf{x}$ therefore requires the gradient $\nabla_{\mathbf{x}} \text{ELBO}(\mathbf{x})$, which means backpropagating through the entire encoder network $q_{\phi}$, the KL divergence, and the expectation term. This creates an extremely complex, high-dimensional, and likely non-convex optimization landscape in which the optimization variable $\mathbf{x}$ is simultaneously the input to the encoder.

  2. Problem 2: The “gap” problem. The ELBO is only a lower bound; the inequality

    \[\log p(\mathbf{x}) \ge \text{ELBO}(\mathbf{x})\]

    is central, and the gap between the two sides can be non-zero and varies from signal to signal, so minimizing the negative ELBO does not faithfully minimize $-\log p(\mathbf{x})$.

Due to the above limitations, VAE-based inverse problem solvers almost universally adopt a latent-space MAP formulation. Since the VAE defines a simple Gaussian prior over $z$, one instead parameterizes the reconstructed signal as

\[x = G_\theta(z),\]

where $G_\theta$ is the VAE decoder (the conditional mean of $p_\theta(x\mid z)$). The negative log-prior reduces, up to an additive constant, to

\[-\log p(z) = \tfrac{1}{2}\|z\|_2^2\]

yielding the latent-variable MAP objective:

\[z^\star = \arg\min_{z} \|A(G_\theta(z)) - y\|^2\;+\;\lambda \|z\|_2^2 ,\qquad x^\star = G_\theta(z^\star).\]

This approach eliminates the encoder entirely and transforms the inverse problem into a low-dimensional optimization over $z$, yielding stable and efficient solvers.

A minimal PyTorch-style sketch, analogous to the GAN case (the VAE decoder G_theta, the forward operator A applied as a function, the measurements y, and the hyperparameters d, eta, lam, sigma2, iters are assumed to be defined):

# VAE prior: x = G_theta(z), z ~ N(0, I)
import torch

z = torch.randn(1, d, requires_grad=True)             # latent initialization, z ~ N(0, I)
opt = torch.optim.Adam([z], lr=eta)
for k in range(iters):
    x = G_theta(z)                                     # decoder mean of p_theta(x | z)
    data = (A(x) - y).pow(2).sum() / (2 * sigma2)      # or -log p(x|z) if the decoder has an explicit likelihood
    prior = 0.5 * z.pow(2).sum()                       # -log p(z) up to a constant
    loss = data + lam * prior
    opt.zero_grad(); loss.backward(); opt.step()
x_hat = G_theta(z).detach()

3.3 Normalizing Flows: Exact Log-Density Priors

Normalizing Flows are a class of generative models constructed from a sequence of invertible transformations. A flow defines an invertible map $f_\theta: z \mapsto x$ with a tractable Jacobian determinant:

\[\log p(x) \;=\; \log p(z) \;-\; \log\big|\det J_{f_\theta}(z)\big|, \qquad z=f_\theta^{-1}(x).\]

where $J_{f_\theta}$ is the Jacobian of the transformation. A key advantage is that the exact log-likelihood of any sample can be computed, so the regularizer $R(x) = -\log p(x)$ is exact and differentiable.

This ability to compute $\log p(\mathbf{x})$ directly makes Normalizing Flows a perfect candidate for the MAP estimation framework. The objective function becomes:

\[\hat{\mathbf{x}} = \arg\min_{\mathbf{x}} \left[ \|\mathbf{y} - \mathbf{A}\mathbf{x}\|_2^2 + \lambda \left( - \log p(z) + \log \big| \det J_{f_\theta}(z) \big| \right) \right], \qquad z = f_\theta^{-1}(\mathbf{x}).\]

This allows optimization directly in the signal space $\mathbf{x}$ while being explicitly guided by a learned, tractable prior. The main drawback is the architectural constraint that all transformations must be invertible with a tractable Jacobian determinant, which can limit their expressiveness compared to other models.
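As a rough sketch of this signal-space MAP optimization (assuming a pre-trained flow object that exposes a differentiable log_prob, as common flow libraries do, plus placeholder names A, y, sigma2, lam, eta, iters, x0):

# Flow-prior MAP: optimize directly in signal space with the exact -log p(x)
import torch

def flow_map(A, flow, y, sigma2, lam, eta, iters, x0):
    x = x0.clone().requires_grad_(True)
    opt = torch.optim.Adam([x], lr=eta)
    for _ in range(iters):
        data = (A(x) - y).pow(2).sum() / (2 * sigma2)  # data consistency
        prior = -flow.log_prob(x).sum()                # exact negative log-density from the flow
        loss = data + lam * prior
        opt.zero_grad(); loss.backward(); opt.step()
    return x.detach()

A common variant instead optimizes over the latent $z$ with $x = f_\theta(z)$, analogous to the GAN and VAE objectives above.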


3.4 Diffusion Models: Score-Based Priors

Diffusion models have recently emerged as the state-of-the-art in generative modeling. They learn to reverse a diffusion process that gradually adds noise to data. This is equivalent to learning the score function of the data distribution, $\nabla_{\mathbf{x}} \log p_t(\mathbf{x}_t)$, at different noise levels $t$.

This property is exceptionally well-suited for solving inverse problems via posterior sampling or optimization. The score of the posterior distribution $\log p(\mathbf{x}\mid \mathbf{y})$ can be derived using Bayes’ rule:

\[\nabla_{\mathbf{x}} \log p(\mathbf{x}|\mathbf{y}) = \nabla_{\mathbf{x}} \log p(\mathbf{y}|\mathbf{x}) + \nabla_{\mathbf{x}} \log p(\mathbf{x})\]

The two terms on the right are both tractable:

  1. Likelihood Score: For Gaussian noise, $\log p(\mathbf{y}\mid \mathbf{x}) = -\|\mathbf{y}-\mathbf{A}\mathbf{x}\|_2^2 / (2\sigma^2) + \text{const.}$, so the gradient is $\nabla_{\mathbf{x}} \log p(\mathbf{y}\mid \mathbf{x}) = \frac{1}{\sigma^2}\mathbf{A}^T(\mathbf{y} - \mathbf{A}\mathbf{x})$.

  2. Prior Score: The term $\nabla_{\mathbf{x}} \log p(\mathbf{x})$ is precisely what the pre-trained diffusion model provides.

This allows for iterative algorithms that guide the reverse diffusion (sampling) process. At each step of the reverse process, the standard denoising update (which corresponds to the prior score) is perturbed by a step in the direction of the likelihood score. This effectively steers the generation process towards samples that are not only realistic (high $p(\mathbf{x})$) but also consistent with the measurements $\mathbf{y}$ (high $p(\mathbf{y}\mid \mathbf{x})$). This approach, seen in methods like Diffusion Posterior Sampling (DPS), has achieved remarkable results across a wide range of challenging inverse problems.
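To make the guidance idea concrete, here is a deliberately simplified, annealed-Langevin-style sketch rather than a faithful DPS implementation; score_net, A, At, y, sigma2, step_sizes, and shape are placeholders. In practice the likelihood of the measurements given a noisy iterate $\mathbf{x}_t$ is intractable, so methods like DPS approximate its gradient through a posterior-mean estimate of the clean signal (via Tweedie's formula) rather than plugging in the noisy iterate directly as done here:

# Schematic score-guided sampling: prior score + likelihood score at each noise level
import torch

def guided_sampling(score_net, A, At, y, sigma2, step_sizes, shape):
    x = torch.randn(shape)                               # start from pure noise
    for t in reversed(range(len(step_sizes))):           # sweep noise levels from high to low
        eps = step_sizes[t]
        prior_score = score_net(x, t)                    # approximates grad_x log p_t(x)
        lik_score = At(y - A(x)) / sigma2                # grad_x log p(y | x) for Gaussian noise
        noise = torch.randn_like(x)
        x = x + eps * (prior_score + lik_score) + (2 * eps) ** 0.5 * noise   # Langevin-type step
    return x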


4. References