Visual Representation Learning
📘 TABLE OF CONTENTS
- Part I — Foundation and Preliminary
- Part II — Contrastive-Based Representation Learning
- Part III — Negative-free Self-Supervised Learning
- Part IV — Masked Modeling
- Part V — Generative Representation Learning
- Part VI — Multimodal Representation Learning
- Part VII — Joint-Embedding Predictive Learning
- 18. References
As we approach the threshold of Artificial General Intelligence (AGI), the landscape of AI has shifted from task-specific engineering to the pursuit of two grand unifications: native multimodality and the seamless integration of understanding and generation. We no longer seek models that merely “label” an image or “predict” a word; we strive for systems that possess a “World Model”—an internal engine capable of simulating physical reality, reasoning across sensory boundaries, and creating high-fidelity content from abstract concepts.
At the heart of this evolution lies Representation Learning. It is the “internal currency” of intelligence. Whether a model is observing a video (understanding) or synthesizing a new scene (generation), it operates on a latent mathematical substrate. If this representation is poorly structured, the model’s multimodality remains “patchy,” and its generative ability remains “blind.”
Part I — Foundation and Preliminary
In this first part, we lay the groundwork by defining what representation learning is in the modern context and why it serves as the critical bridge between raw sensory input and high-level cognitive intelligence.
0.1 What is Representation Learning
Representation learning studies how to transform raw data—pixels, words, waveforms, or video frames—into internal variables that make downstream computation easier. Formally, for an input \(x \in \mathcal{X}\), an encoder (or representation function) learns a mapping
\[f_\theta:\mathcal{X}\rightarrow \mathcal{Z},\]where \(\mathcal{Z}\) is a latent space in which semantics, structure, and task-relevant factors are easier to express. The output representation can take multiple shapes:
Global embedding: one vector per input. Used for image classification, retrieval, clustering, similarity search, etc.
\[z = f_\theta(x) \in \mathbb{R}^d\]Patch / token representations: a sequence of token/patch embeddings. Typical for Vision Transformers (ViTs): each patch becomes a token embedding, sometimes with a special ([CLS]) token.
\[Z = f_\theta(x) \in \mathbb{R}^{N \times d}\]In this post, we mainly focus on continuous representation learning; for discrete representation learning, please refer to my other post:
Multi-scale / pyramid features. Common in CNN backbones and dense prediction tasks (detection/segmentation).
\[\{F_\ell\}_{\ell=1}^L, \quad F_\ell \in \mathbb{R}^{H_\ell \times W_\ell \times C_\ell}\]
The crucial point is not the format, but the purpose: a representation is “good” if it can be reused across tasks and settings, so that a task-specific head \(g_\phi\) can solve many objectives with little additional supervision:
\[\hat{y} = g_\phi(f_\theta(x)).\]This framing also clarifies what makes representation learning different from ordinary supervised training for a single task. In supervised learning, the representation is shaped mainly to minimize a task loss (e.g., classification).
In representation learning—especially self-supervised learning—the objective is to learn general-purpose features without relying on human labels, by constructing learning signals from the data itself (e.g., through augmentation, masking, temporal structure, or cross-modal correspondence). Consequently, representation learning is typically judged not by how low its pretraining loss becomes, but by how well the learned features transfer: how much semantic information is linearly accessible, how robust the geometry of the embedding space is, and how effectively the encoder can be adapted to new tasks and domains.
Because the goal is general utility, evaluation is also standardized around transfer protocols rather than pretext accuracy. Common protocols include linear probing (freeze \(f_\theta\), train a linear classifier), kNN probing (non-parametric classification in embedding space), and end-to-end fine-tuning under different label budgets. Beyond classification, dense transfer tasks (detection/segmentation) and robustness/OOD transfer are increasingly used to test whether representations preserve spatial structure and generalize beyond the pretraining distribution. These protocols operationalize the otherwise vague idea of a “good representation”: it is one that makes many tasks easy, not one that solves a single pretext objective.
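To make these protocols concrete, here is a minimal PyTorch-style sketch of linear probing; the frozen encoder `f_theta`, the dataloader, and the hyperparameters are placeholders for illustration, not a specific paper’s recipe. kNN probing simply replaces the linear head with a nearest-neighbor lookup in the same frozen embedding space.

```python
import torch
import torch.nn as nn

def linear_probe(f_theta, loader, feat_dim, num_classes, epochs=10, device="cpu"):
    """Linear probing: freeze the encoder f_theta and train only a linear head on top."""
    f_theta.eval()                                        # frozen backbone
    head = nn.Linear(feat_dim, num_classes).to(device)
    opt = torch.optim.SGD(head.parameters(), lr=0.1, momentum=0.9)

    for _ in range(epochs):
        for x, y in loader:                               # labeled downstream data
            x, y = x.to(device), y.to(device)
            with torch.no_grad():                         # representations stay fixed
                h = f_theta(x)                            # (B, feat_dim) global embeddings
            loss = nn.functional.cross_entropy(head(h), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return head
```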
0.2 Why representation learning matters even more in the VLM / MLLMs era
In the foundation-model era, representation learning is no longer merely a pretraining trick for better classifiers. It becomes the internal currency of multimodal intelligence: the shared vectors or tokens through which a single model can integrate heterogeneous sensory streams and support both discriminative understanding and generative synthesis. The reason is simple: large models reason and act through a small set of computational primitives—attention, similarity, and conditional prediction—and all of these primitives operate on representations. If the representation space is poorly designed (misaligned across modalities, unstable under perturbations, weakly grounded), no amount of scaling in the reasoning core will fully compensate.
Native multimodality aims to process text, images, audio, and video within one coherent system, rather than assembling separate modality-specific pipelines. This requires that each modality be converted into a compatible internal form—typically a sequence of latent vectors or tokens—so that the same Transformer-style machinery can contextually integrate them. For each modality $m$, representation learning defines an interface
\[E^{(m)}_\theta: \mathcal{X}^{(m)} \rightarrow \mathcal{Z},\]where \(\mathcal{Z}\) is shared or at least mutually interpretable. Practically, each encoder/tokenizer emits a sequence
\[S^{(m)} = E^{(m)}_\theta(x^{(m)}) \in \mathbb{R}^{L_m \times d}\quad (\text{continuous tokens})\]or
\[S^{(m)} \in \{1,\dots,V_m\}^{L_m}\quad (\text{discrete tokens}),\]which can be concatenated into a unified stream \(S\) with modality/type embeddings and processed by a shared model. Under this view, “native multimodality” is fundamentally a representational constraint: cross-modal interaction becomes stable and meaningful only if representations have comparable scale, geometry, and semantics, enabling attention to function as true information fusion rather than brittle correlation matching. When this fails, models may appear competent but exhibit classic symptoms: weak grounding (language conclusions not tightly bound to visual/audio evidence), fragile cross-modal reference resolution, and degraded temporal reasoning in long audio/video contexts.
The second major direction—unifying understanding with generation—imposes an even stronger requirement: representations must support both reading (inference, grounding, decision-making) and writing (synthesis, editing, planning) within one framework. A common unification strategy is to cast everything as conditional prediction over a unified token stream. Let $S$ denote the interleaved multimodal sequence; then a generic objective is
\[\max_\theta \sum_t \log p_\theta(S_{t+1}\mid S_{\le t}),\]where “understanding” corresponds to generating answer tokens conditioned on perceptual tokens, and “generation” corresponds to generating perceptual tokens conditioned on language or other modalities. Whether the output modality is produced autoregressively in a discrete token space or decoded through an iterative generator (e.g., diffusion), the core dependency remains representational: the model must maintain internal states that are semantically aligned enough for reasoning and instruction following, yet expressive enough to support high-fidelity controllable synthesis.
This perspective also explains why full unification is difficult. Understanding benefits from abstraction and invariance—representations that suppress nuisance variation while preserving stable semantic factors (objects, relations, events). Generation, by contrast, requires retaining or reconstructing fine-grained details (texture, prosody, motion), and often demands controllable factorization (what changes vs what stays). If representations become too invariant, generation loses detail and controllability; if they remain too literal, understanding becomes noisy and brittle. Much of the progress in unified models can be interpreted as seeking a representational “sweet spot,” often through hierarchical or factorized representations, staged objectives, or architectures that allocate different layers/subspaces to semantic reasoning vs perceptual synthesis.
In short, representation learning plays a decisive role in both directions: it defines the shared computational substrate for native multimodality, and it defines the read/write interface that makes understanding and generation mutually compatible rather than competing objectives.
0.3 Representation Learning Methods (Connecting to Part II–VII)
Despite the diversity of methods, most modern representation learning algorithms can be described by a common template: construct correlated signals from data and train the model to produce representations that are consistent under these correlations while avoiding trivial solutions. Let $x$ be an input sample. Many methods create two related “views,” either by applying transformations \(t_1,t_2\sim\mathcal{T}\) (augmentations) or by splitting the input into visible and hidden parts (masking):
\[x^{(1)}=t_1(x), \qquad x^{(2)}=t_2(x).\]These views are encoded to representations
\[h^{(1)} = f_\theta(x^{(1)}), \quad h^{(2)} = f_{\theta'}(x^{(2)})\]Optionally transform with projection heads:
\[z^{(1)} = g_\theta(h^{(1)}), \quad z^{(2)} = g_{\theta'}(h^{(2)})\]Train a predictor \(p_\theta\) if needed:
\[\tilde{z}^{(1)} = p_\theta(z^{(1)})\]and learning minimizes a matching loss between them:
\[\mathcal{L}=\text{Match}\big(z^{(1)}, \text{Target}(z^{(2)},x^{(2)})\big)+\text{Regularize}(\cdot).\]Different families mainly differ in (i) how the target is formed, (ii) what “match” means, and (iii) how they prevent collapse or enforce diversity.
Part II (Contrastive-based learning) uses discriminative objectives that pull together representations of positive pairs while pushing apart negatives. The canonical form is InfoNCE:
\[\mathcal{L}_{\text{NCE}} = - \log \frac{\exp(\mathrm{sim}(q,k^+)/\tau)} {\exp(\mathrm{sim}(q,k^+)/\tau) + \sum_i \exp(\mathrm{sim}(q,k_i)/\tau)}.\]Negatives (from the batch or a memory queue) provide a strong geometric constraint that discourages trivial constant representations. In this sense, contrastive methods treat representation learning as building an embedding space with useful relative distances.
Part III (Self-distillation / negative-free families) replaces explicit negatives with structured targets such as a slowly moving teacher network, distribution matching, prototype assignment, or variance/covariance constraints. A regression-style form matches predicted embeddings to a stop-gradient target:
\[\mathcal{L}_{\text{reg}}=\left\|\;p_\theta(z^{(1)})-\mathrm{sg}(z^{(2)})\;\right\|_2^2,\]while a distribution-style form matches teacher and student distributions:
\[\mathcal{L}_{\text{distill}}=-\sum_{c}p_T(c\mid x^{(2)})\log p_S(c\mid x^{(1)}).\]Because “make two views consistent” admits collapse (\(f_\theta(x)=\text{const}\)), these methods rely on anti-collapse mechanisms such as stop-gradient, momentum/EMA teachers, architectural asymmetry (predictors), centering/sharpening, or explicit redundancy reduction. Conceptually, Part III methods emphasize learning stable semantics without requiring explicit negative pairs.
Part IV (Masked modeling) learns representations by predicting missing information from visible context. The mask defines the correlation: the same sample provides both context and target. Pixel-level objectives reconstruct raw signals (e.g., MAE-style):
\[\mathcal{L}_{\text{pix}}=\sum_{i\in \mathcal{I}_M}\|\hat{x}_i-x_i\|_2^2,\]token-level objectives predict discrete visual/audio tokens:
\[\mathcal{L}_{\text{tok}}=-\sum_{i\in \mathcal{I}_M}\log p_\theta(v_i\mid x_{\text{visible}}),\]and feature-level objectives predict teacher-produced latent targets, bridging masked modeling and distillation. Masked modeling can be viewed as enforcing “predictive completeness”: representations should capture enough structure to infer what is missing.
Part V (Generative representation learning) can be framed as learning representations that support explicit density modeling or reconstruction through a generative decoder. In the context of unification, the key idea is that generation is feasible only if the latent representation is both expressive and controllable. Even when the final generator is diffusion or autoregressive, it still depends on a representation interface that can be conditioned on, manipulated, and decoded.
Part VI (Multimodal representation learning) extends the same principles across modalities. Alignment-based methods learn shared geometry across modalities (often via contrastive objectives over image–text or audio–text pairs), while fusion-based methods learn joint contextual representations through cross-attention and masked/predictive objectives. In both cases, representation learning defines whether multimodal tokens are mutually interpretable and whether grounding and cross-modal reasoning remain stable.
Part VII (Joint-Embedding Predictive Learning, e.g., JEPA) emphasizes learning representations by predicting latent targets rather than reconstructing raw pixels or discrete tokens. This can be understood as pushing masked learning toward semantic prediction: the target lives in representation space, reducing the pressure to model low-level details while preserving the benefits of context-based prediction. From the unification perspective, JEPA-style objectives are appealing precisely because they seek a representation space that is “readable for understanding” without forcing the model to solve full perceptual reconstruction.
Across Parts II–VII, the unifying thread is that representation learning is a controlled compromise: it must enforce invariances and semantic stability for understanding, preserve sufficient information for generation, and maintain cross-modal comparability for native multimodality. The specific methods differ in how they construct learning signals and where they place the burden—on negatives, on teachers, on masks, on prototypes, or on latent prediction—but they are all attempts to engineer a representation space that makes downstream intelligence possible.
Part II — Contrastive-Based Representation Learning
In Part I (Sec. 0.3), we introduced the shared template of representation learning and briefly previewed contrastive learning through the InfoNCE objective. This Part II expands that preview into a systematic view: what exactly is being optimized, where negatives come from, why temperature matters, and how representative methods (SimCLR / MoCo family) instantiate the same core principle with different engineering trade-offs.
At a high level, contrastive learning frames representation learning as discriminative matching: Given an anchor representation, identify the correct positive among a set of candidates.
This “classification among candidates” view is the key to both the math and the intuition.
1. General Framework for Contrastive-Based Representation Learning
Given an input sample $x$, we create two correlated views via augmentations:
\[x^{(1)} = t_1(x), \qquad x^{(2)} = t_2(x), \qquad t_1,t_2\sim \mathcal{T}.\]Encode them into representations:
\[h^{(1)} = f_\theta(x^{(1)}),\quad h^{(2)} = f_{\theta'}(x^{(2)}),\]and (often) project to a contrastive space:
\[z^{(1)} = g_\theta(h^{(1)}),\quad z^{(2)} = g_{\theta'}(h^{(2)}).\]Most contrastive methods L2-normalize embeddings and use cosine similarity:
\[\mathrm{sim}(u,v)=\frac{u^\top v}{\|u\|\,\|v\|}\;\;\Rightarrow\;\;\mathrm{sim}(u,v)=u^\top v\quad \text{if } \|u\|=\|v\|=1.\]Positives: two views of the same instance.
Negatives: views coming from other instances (in the batch, or in a memory structure).
1.1 InfoNCE as “softmax classification among candidates”
Let $q$ be the anchor (query) embedding and $k^+$ be its positive key embedding. Let \(\{k_i^-\}\) be negatives. The canonical InfoNCE form is:
\[\mathcal{L}_{\text{NCE}} = - \log \frac{\exp(\mathrm{sim}(q,k^+)/\tau)} {\exp(\mathrm{sim}(q,k^+)/\tau) + \sum_i \exp(\mathrm{sim}(q,k_i^-)/\tau)}.\]Equivalently: pick the correct key via softmax over similarities. If we arrange keys into a matrix and define logits
\[\ell_i = \frac{\mathrm{sim}(q,k_i)}{\tau},\]then InfoNCE is simply cross-entropy where the “class label” is the positive key index. This viewpoint is extremely useful:
- It explains why more candidates (more negatives) often helps.
- It clarifies the role of the temperature \(\tau\).
- It makes implementations easy: matrix multiply \(QK^\top\) + CE loss.
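To ground the “softmax classification among candidates” view, here is a minimal PyTorch-style sketch of InfoNCE where the positive key of query `i` is key `i`; the function name and the default temperature are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def info_nce(q, k, tau=0.2):
    """InfoNCE as cross-entropy over the similarity matrix QK^T.

    q, k: (B, d) query/key embeddings; the positive of q[i] is k[i],
    and every other row of k serves as a negative.
    """
    q = F.normalize(q, dim=1)                 # cosine similarity via unit-norm dot products
    k = F.normalize(k, dim=1)
    logits = q @ k.t() / tau                  # (B, B) temperature-scaled similarities
    labels = torch.arange(q.size(0), device=q.device)  # "class label" = positive key index
    return F.cross_entropy(logits, labels)
```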
1.2 Why temperature focuses on “hard negatives”
Write the loss for one query as:
\[\mathcal{L}=-\log \frac{\exp(s^+/\tau)}{\sum_j \exp(s_j/\tau)},\quad s_j=\mathrm{sim}(q,k_j).\]Let
\[p_j=\frac{\exp(s_j/\tau)}{\sum_m \exp(s_m/\tau)},\]be the softmax probability. Then:
\[\frac{\partial \mathcal{L}}{\partial s_j}= \begin{cases} p_j-1, & j\in \text{positive} \\[10pt] p_j, & j\in \text{negatives}. \end{cases}\]So every negative gets a gradient proportional to \(p_j\). Now observe:
- Smaller \(\tau\) makes softmax sharper, concentrating probability mass on the largest similarity negatives (the hardest negatives).
- Therefore, with small \(\tau\), training tends to “care more” about the negatives that are closest to the anchor in embedding space.
Intuition: \(\tau\) controls how “peaky” the contrastive classification is. A very small \(\tau\) makes the model act as if it must separate the most confusing impostors first.
1.3 Where negatives come from (the core engineering axis)
In practice, contrastive learning differs mainly in how it constructs the candidate set:
- In-batch negatives: all other samples in the batch are negatives (simple, fast, but often needs large batch).
- Memory bank / queue: keep a large set of past keys to increase negatives without exploding batch size.
- Hybrid: in-batch negatives plus a memory structure.
1.4 Why add a Projection Head (Projector)?
Up to now, we defined the contrastive objective (InfoNCE) and discussed how the candidate set (in-batch vs. queue) shapes optimization. A surprisingly important—and now almost standard—design choice is to insert a projection head
\[h=f_\theta(x)\quad\rightarrow\quad z=g_\phi(h),\]compute the contrastive loss on \(z\), but use \(h\) (the pre-projection features) as the representation for downstream tasks. SimCLR reports that a nonlinear projection head can significantly improve the quality of the learned representation when evaluated on downstream tasks, even when the head is discarded at test time. MoCo v2 explicitly verifies that adding an MLP projector (plus stronger augmentations) transfers SimCLR’s gains into the MoCo framework. Why does this help? The most useful mental model is “two spaces, two jobs”:
Representation space $h$ (what you want to keep): rich, transferable semantics that support many downstream tasks (classification, detection, retrieval, segmentation, etc.).
Contrastive space $z$ (what you need to optimize): a geometry tailored for the contrastive classifier (dot-product similarity + temperature-scaled softmax), where the training signal is clean and stable.
Many studies 1 2 3 have shown that incorporating a projection head, rather than directly using the encoder output, plays a crucial role in the quality of model training and representation learning.
Decoupling “task-specific invariances” from “general-purpose features”
Contrastive learning enforces invariance to the chosen augmentation family \(\mathcal{T}\). Some invariances are desirable (e.g., small crops), but others can be too strong for certain downstream tasks (e.g., color/texture cues for fine-grained recognition). If the loss is applied directly on \(h\), the encoder is forced to reshape its entire representation space to satisfy the contrastive discrimination problem, potentially discarding information that is useful downstream.
A projector allows the model to “pay” the invariance constraints mainly in $z$, while letting $h$ remain a more general feature space. Concretely: \(g_\phi\) can learn to compress, re-weight, or discard augmentation-unstable factors to make contrastive matching easier—without forcing the backbone to permanently delete them from $h$.
Gradient shaping: the projector filters and re-weights the signal reaching the backbone
The key is the chain rule. The gradient that updates the encoder features is
\[\frac{\partial \mathcal{L}}{\partial h}=J_g(h)^\top\,\frac{\partial \mathcal{L}}{\partial z},\]where \(J_g(h)\) is the Jacobian of the projector. This means \(g_\phi\) is not merely “extra capacity”; it actively reshapes which directions in feature space get emphasized by the contrastive objective. In practice, an MLP projector can learn a transformation such that the contrastive loss is optimized in a subspace that is well-suited for the dot-product softmax classifier, while the backbone is freer to maintain features that generalize.
This view aligns with analyses that interpret the projection head as performing an implicit subspace selection—letting the contrastive objective operate on a selected subset of features while preserving flexibility in the backbone representations.
Matching the geometry of InfoNCE: “make $z$ a good space for softmax over similarities”
InfoNCE is effectively a softmax classifier over similarities:
\[p(k\mid q)=\frac{\exp(\mathrm{sim}(q,k)/\tau)}{\sum_j \exp(\mathrm{sim}(q,k_j)/\tau)}.\]This computation is sensitive to representation geometry: normalization, scale, anisotropy, and the temperature \(\tau\) all strongly influence gradients. A projector (often with normalization such as BN/LN and an L2-normalization on output) can stabilize the contrastive logits and make optimization less dependent on the raw geometry produced by the backbone. In other words, the backbone does not need to “naturally” produce features that are perfectly behaved under temperature-scaled cosine-softmax; the projector can adapt \(h\) into a contrastive-friendly \(z\).
A clean “train-time head” that you can discard
From a systems perspective, the projector behaves like a training-only task head: it absorbs contrastive-specific quirks (augmentation-induced shortcuts, logit scaling, softmax-friendly shaping). Discarding it at test time is then a principled choice: you keep the more general-purpose representation \(h\), and drop the part that was optimized specifically for the pretext discrimination problem.
Recent work explicitly studies this phenomenon and reports that projection heads are particularly beneficial in regimes where feature imbalance or augmentation mismatch would otherwise hurt the backbone representation.
This “two-space” perspective will also reappear in later parts: when we move to negative-free families, the role of additional heads (projection/prediction/teacher targets) becomes even more explicit as a mechanism to shape optimization without collapsing the backbone’s transferable representation.
2. Representative Algorithms
Below we walk through three representative “checkpoints” that instantiate the same InfoNCE principle but make different choices on negatives, encoders, and stability tricks.
2.1 SimCLR
SimCLR 1 aims for simplicity. It avoids complex architecture changes (like memory banks) and relies on scaling up:
- Use strong data augmentations: random crop followed by resize back to the original size, color distortion (jitter), and Gaussian blur. Without color distortion, the model learns to distinguish images based on color histograms rather than semantic content.
- Use a projection head: a non-linear multi-layer perceptron (MLP) with one hidden layer (ReLU activation) placed after the ResNet pooling layer.
- Use many negatives via large-batch training: small batches result in insufficient negative samples, leading to poor convergence.
Canonical SimCLR pipeline:
- Sample a mini-batch of $N$ images \(\{x_i\}\).
Generate two augmented views for each image:
\[x_i^{(1)}=t_1(x_i),\quad x_i^{(2)}=t_2(x_i).\]After this you have \(2N\) views.
Encode + project:
\[h_i^{(a)}=f_\theta(x_i^{(a)}),\quad z_i^{(a)}=g_\theta(h_i^{(a)}),\quad a\in\{1,2\}.\]Apply NT-Xent (an InfoNCE-style loss) symmetrically over both directions: each view treats its paired view as positive; all other \(2(N-1)\) views are negatives.
\[\small \mathcal{L} = - \frac{1}{2N} \sum_{(i,j)\in\mathcal{P}} \left( \log \frac{\exp(\mathrm{sim}(z_i,z_{j})/\tau)} {\sum_{k\neq i}\exp(\mathrm{sim}(z_i,z_k)/\tau)} + \log \frac{\exp(\mathrm{sim}(z_j,z_{i})/\tau)} {\sum_{k\neq j}\exp(\mathrm{sim}(z_j,z_k)/\tau)} \right),\]where \(\mathcal{P}\) is the set of the \(N\) positive pairs \((i,j)\).
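A minimal PyTorch-style sketch of NT-Xent, under the assumption that `z1[i]` and `z2[i]` are the two projected views of sample `i`; the default temperature is illustrative rather than SimCLR’s tuned value.

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, tau=0.5):
    """NT-Xent over 2N views: z1[i] and z2[i] are the two views of sample i;
    the remaining 2(N-1) views in the batch act as negatives."""
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)      # (2N, d)
    sim = z @ z.t() / tau                                    # (2N, 2N) similarity logits
    sim.fill_diagonal_(float("-inf"))                        # a view is never its own candidate
    # the positive of view i is view i+N (and vice versa)
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)
```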
2.2 MoCo & MoCo-V2
SimCLR’s simplicity is appealing, but large-batch training is expensive. MoCo’s central contribution is to decouple the number of negatives from batch size by building a large, consistent dictionary of keys. 4
MoCo treats contrastive learning as a Dictionary Look-up task. It decouples the batch size from the number of negative samples using a Queue and a Momentum Encoder. MoCo maintains the dictionary as a FIFO queue of encoded keys, so it can reuse keys from previous mini-batches. This makes the dictionary size a tunable hyper-parameter independent of the current batch size.
The key challenge: keys in the queue were encoded at different times—how to keep them consistent? MoCo uses a momentum-updated key encoder:
\[\theta_k \leftarrow m\,\theta_k + (1-m)\,\theta_q,\]where only \(\theta_q\) is updated by backprop, and \(\theta_k\) moves smoothly.
Objective Function: InfoNCE: Mathematically similar to NT-Xent, but the denominator summation comes from the Queue, not the current batch.
\[\mathcal{L}_q = -\log \frac{\exp(q \cdot k_+ / \tau)}{\exp(q \cdot k_+ / \tau) + \sum_{k_-} \exp(q \cdot k_- / \tau)}\]- $q$: Query representation.
- $k_+$: Key representation of the same image (from Momentum Encoder).
- $k_-$: Negative keys stored in the Queue.
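The two MoCo-specific mechanics, the momentum update and the queue-based InfoNCE, can be sketched as follows; this is a simplified functional version with placeholder tensors, not the official implementation.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def momentum_update(encoder_q, encoder_k, m=0.999):
    """theta_k <- m * theta_k + (1 - m) * theta_q; only theta_q receives gradients."""
    for p_q, p_k in zip(encoder_q.parameters(), encoder_k.parameters()):
        p_k.data.mul_(m).add_(p_q.data, alpha=1 - m)

def moco_loss(q, k_pos, queue, tau=0.07):
    """q: (B, d) queries; k_pos: (B, d) keys from the momentum encoder;
    queue: (K, d) negatives from previous batches (stored detached)."""
    q, k_pos = F.normalize(q, dim=1), F.normalize(k_pos, dim=1)
    l_pos = (q * k_pos).sum(dim=1, keepdim=True)             # (B, 1) positive logits
    l_neg = q @ queue.t()                                     # (B, K) negative logits
    logits = torch.cat([l_pos, l_neg], dim=1) / tau
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)  # positive is index 0
    return F.cross_entropy(logits, labels)

@torch.no_grad()
def dequeue_and_enqueue(queue, new_keys):
    """FIFO dictionary: drop the oldest keys, append the newest normalized keys."""
    return torch.cat([queue[new_keys.size(0):], F.normalize(new_keys, dim=1)], dim=0)
```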
MoCo v2 5 is essentially “MoCo + SimCLR-style improvements”: it verifies that adding an MLP projection head and stronger augmentation into the MoCo framework yields a stronger baseline—often without needing SimCLR-scale batch sizes.
2.3 MoCo-V3
MoCo v3 is introduced in the context of training self-supervised Vision Transformers (ViT) and focuses heavily on stability and clean baselines.
There are two key design moves worth highlighting:
In-batch keys, no memory queue (when batch is large enough)
MoCo v3 explicitly states it uses keys that “naturally co-exist in the same batch” and abandons the memory queue, observing diminishing gains from the queue when batch size is sufficiently large (e.g., 4096).
So, in spirit, MoCo v3 becomes closer to SimCLR on the “negative source” axis, but it retains the momentum-encoder flavor and adds a predictor head for robustness.
Symmetrized contrastive loss + predictor head (BYOL/Siamese flavor)
MoCo v3 uses two views and computes a symmetrized contrastive loss:
\[\mathcal{L} = \mathrm{ctr}(q_1,k_2) + \mathrm{ctr}(q_2,k_1),\]where \(\mathrm{ctr}\) can be implemented as cross-entropy over the in-batch similarity matrix \(QK^\top\).
It also includes a prediction MLP head on the query encoder \(f_q\) (backbone + proj + pred), while the momentum encoder \(f_k\) excludes the predictor (and is updated by EMA).
Importantly, the paper notes that the predictor helps but is not strictly required for contrastive methods to work (contrasting with negative-free methods where predictor is more central).
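A minimal sketch of the symmetrized MoCo v3 objective, assuming `q1, q2` come from the query branch (backbone + projector + predictor) and `k1, k2` from the momentum branch; the temperature is an illustrative default.

```python
import torch
import torch.nn.functional as F

def ctr(q, k, tau=0.2):
    """One contrastive term: cross-entropy over the in-batch similarity matrix QK^T."""
    q, k = F.normalize(q, dim=1), F.normalize(k, dim=1)
    logits = q @ k.t() / tau
    labels = torch.arange(q.size(0), device=q.device)       # positives lie on the diagonal
    return F.cross_entropy(logits, labels)

def moco_v3_loss(q1, q2, k1, k2, tau=0.2):
    """Symmetrized loss ctr(q1, k2) + ctr(q2, k1); the keys come from the EMA
    momentum encoder (no predictor) and are treated as constants."""
    return ctr(q1, k2.detach(), tau) + ctr(q2, k1.detach(), tau)
```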
Part III — Negative-free Self-Supervised Learning
In Part II, contrastive learning avoided trivial solutions by explicitly introducing negatives—the embedding space is shaped by “pull positives together, push negatives apart.”
However, negatives are not the only way to prevent collapse. In fact, a major line of progress in self-supervised learning shows that we can obtain strong representations without explicit negative pairs, by replacing “repulsion” with structured targets and regularizers.
This part focuses on negative-free SSL, a family of methods that still follows the same backbone template from Sec. 0.3: two augmented views of the same sample are forced to be consistent, but collapse is prevented by one (or a combination) of the following strategies:
- Self-distillation targets: a teacher network, often EMA / momentum updated.
- Stop-gradient + architectural asymmetry: predictor head.
- Distribution matching with centering/sharpening: student matches teacher probabilities.
- Prototype / clustering assignments: predict cluster IDs instead of raw features.
- Redundancy reduction constraints: explicit variance/covariance/correlation regularizers.
A recurring theme is: once we remove explicit negatives, “matching two views” becomes under-constrained, so the method must introduce additional structure to avoid a constant mapping.
3. Regression-style Self-Distillation
Regression-style methods directly regress a predicted embedding from one view to a target embedding from the other view. A canonical form is:
\[\begin{align} x^{(1)}=t_1(x)\quad \Longrightarrow \quad z^{(1)}=g(f(x^{(1)})) \\[10pt] x^{(2)}=t_2(x)\quad \Longrightarrow \quad z^{(2)}=g'(f'(x^{(2)})) \end{align}\]and the loss function is:
\[\mathcal{L}_{\text{reg}} = \left\| \, {p(\bar{z}^{(1)})} - \mathrm{sg}(\bar{z}^{(2)}) \,\right\|_2^2\]where:
- $g(\cdot)$ is a projection head (as in Part II, we often “optimize on $z$ but use $h$”),
- $p(\cdot)$ is a predictor head (often essential in negative-free methods),
- $\mathrm{sg}(\cdot)$ is the stop-gradient operator,
- $\bar{z}$ denotes a normalized vector (e.g., L2 normalization).
The key question is: why doesn’t the model collapse to a constant representation? In this family, the answer typically lies in asymmetry: either the target branch is updated by EMA (BYOL) or its gradient is stopped (SimSiam), and the online branch has an extra predictor that the target branch does not.
3.1 BYOL
BYOL (Bootstrap Your Own Latent) introduces a simple but powerful idea: learn by predicting the representation produced by a slow-moving target network from the representation produced by an online network, without any negatives. 6
BYOL uses two networks:
Online network: encoder + projector + predictor
\[h_o=f_{\theta}(x),\quad z_o=g_{\theta}(h_o),\quad q_o=p_{\theta}(z_o)\]Target network: encoder + projector (no predictor)
\[h_t=f_{\xi}(x),\quad z_t=g_{\xi}(h_t)\]
The target parameters are updated by exponential moving average (EMA):
\[\xi \leftarrow m\,\xi + (1-m)\,\theta.\]Given two views $(x^{(1)},x^{(2)})$, BYOL regresses online prediction to target projection:
\[\mathcal{L} = \left\|\,\bar{q}_o^{(1)} - \mathrm{sg}(\bar{z}_t^{(2)}) \,\right\|_2^2 \quad + \quad \left\|\,\bar{q}_o^{(2)} - \mathrm{sg}(\bar{z}_t^{(1)}) \,\right\|_2^2.\]Why does it avoid collapse (intuition)? BYOL replaces explicit “negatives” with a moving target:
- The online branch is always chasing a target that is changing slowly (EMA), creating a stable learning signal.
- The stop-gradient blocks trivial feedback loops.
- The predictor introduces asymmetry, so the easiest constant solution is no longer an attractive fixed point in practice.
You can interpret BYOL as a bootstrapping process: the model generates targets from its own past versions, and gradually improves the representation by making two views consistent in a non-degenerate way. The high-level training procedure is as follows (a minimal code sketch follows the list):
- Sample image $x$, sample augmentations to obtain $x^{(1)},x^{(2)}$.
Online branch:
\[q_o^{(1)}=p(g(f(x^{(1)}))),\quad q_o^{(2)}=p(g(f(x^{(2)})))\]Target branch:
\[z_t^{(1)}=g'(f'(x^{(1)})),\quad z_t^{(2)}=g'(f'(x^{(2)}))\]- Compute regression loss with stop-grad on target.
- Update online parameters by backprop.
- Update target parameters by EMA.
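A minimal sketch of the BYOL loss and EMA update under these conventions; the inputs are assumed to be the relevant head outputs and all names and defaults are illustrative.

```python
import torch
import torch.nn.functional as F

def byol_loss(q1, q2, z1, z2):
    """Symmetric BYOL regression on L2-normalized vectors.
    q1, q2: online predictor outputs; z1, z2: EMA-target projector outputs."""
    def one_side(q, z):
        q = F.normalize(q, dim=1)
        z = F.normalize(z.detach(), dim=1)                   # stop-gradient on the target
        return (2 - 2 * (q * z).sum(dim=1)).mean()           # == ||q_bar - z_bar||^2 for unit vectors
    return one_side(q1, z2) + one_side(q2, z1)

@torch.no_grad()
def ema_update(online, target, m=0.996):
    """xi <- m * xi + (1 - m) * theta."""
    for p_o, p_t in zip(online.parameters(), target.parameters()):
        p_t.data.mul_(m).add_(p_o.data, alpha=1 - m)
```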
3.2 SimSiam
SimSiam shows that we can remove the EMA teacher entirely and still avoid collapse, as long as we keep two ingredients: (1) stop-gradient, and (2) predictor asymmetry. 7
Both branches share the same encoder+projector (same weights), but only the online side uses a predictor:
\[\begin{align} z^{(1)} = g(f(x^{(1)})) \quad \Longrightarrow \quad q^{(1)} = p(z^{(1)}) \\[10pt] z^{(2)} = g(f(x^{(2)})) \quad \Longrightarrow \quad q^{(2)} = p(z^{(2)}) \end{align}\]SimSiam uses a symmetric negative cosine similarity (or equivalently normalized MSE) with stop-grad:
\[\mathcal{L} = - \bar{q}^{(1)\top} \cdot \mathrm{sg}(\bar{z}^{(2)}) - \bar{q}^{(2)\top} \cdot \mathrm{sg}(\bar{z}^{(1)}).\]Why it avoids collapse (intuition)? Without stop-gradient, both branches can reinforce each other into a constant mapping. Stop-gradient breaks this symmetry: one side provides a fixed target for the other in each update step.
The predictor further prevents trivial alignment by making the mapping task non-trivial: the network must learn a meaningful transformation so that $p(z^{(1)})$ can predict $z^{(2)}$ across augmentations.
A practical takeaway (also connecting back to Sec. 1.4) is: in negative-free SSL, heads are not “optional polishing”—they often become part of the core anti-collapse mechanism.
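For completeness, the SimSiam objective reduces to a few lines; a sketch assuming `p1, p2` are predictor outputs and `z1, z2` are projector outputs of the two views.

```python
import torch.nn.functional as F

def simsiam_loss(p1, p2, z1, z2):
    """Symmetric negative cosine similarity with stop-gradient on the target branch."""
    def neg_cos(p, z):
        return -F.cosine_similarity(p, z.detach(), dim=1).mean()   # sg(z) via detach
    return 0.5 * neg_cos(p1, z2) + 0.5 * neg_cos(p2, z1)
```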
3.3 DirectPred
DirectPred studies the dynamics of non-contrastive SSL and argues that the predictor’s role can be understood more explicitly: a good predictor behaves like a whitening / subspace shaping operator, helping the online branch avoid degenerate solutions. 8
Instead of learning the predictor by gradient descent, DirectPred sets a linear predictor directly from the statistics of its inputs (e.g., via covariance/eigendecomposition), showing that carefully chosen linear prediction can match (or approach) the performance of more complex MLP predictors in certain regimes.
Why it matters in this Part III narrative: DirectPred provides conceptual evidence that “no negatives” can work not by magic, but because the method implicitly enforces feature diversity / non-degeneracy through prediction geometry, normalization, and optimization bias.
4. Distribution-style Self-Distillation
Regression-style methods match vectors (point targets). Distribution-style methods match probability distributions produced by a teacher.
The prototype is DINO 9 (self-Distillation with NO labels): rather than regressing embeddings directly, DINO matches the teacher and student softmax distributions over a shared prototype space. A generic formulation:
Student probabilities:
\[p_S(\cdot\mid x^{(1)}) = \mathrm{softmax}\Big(\frac{s_\theta(x^{(1)})}{\tau_S}\Big)\]Teacher probabilities (often centered + sharpened):
\[p_T(\cdot\mid x^{(2)}) = \mathrm{softmax}\Big(\frac{s_\xi(x^{(2)}) - c}{\tau_T}\Big)\]Distillation loss:
\[\mathcal{L}_{\text{distill}} = - \sum_{k} p_T(k\mid x^{(2)}) \log p_S(k\mid x^{(1)}).\]
Here $c$ is a running center vector and $\tau_T$ is typically smaller than $\tau_S$ (“sharper teacher”). These two tricks—centering and sharpening—are central to preventing collapse.
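A minimal sketch of this distillation loss with centering and sharpening; the temperatures and the center momentum are illustrative values, not DINO’s exact schedule.

```python
import torch
import torch.nn.functional as F

def dino_loss(student_logits, teacher_logits, center, tau_s=0.1, tau_t=0.04, c_mom=0.9):
    """Cross-entropy between the sharpened teacher and the student over K prototypes.
    student_logits, teacher_logits: (B, K); center: running mean of teacher logits, shape (K,)."""
    log_p_s = F.log_softmax(student_logits / tau_s, dim=1)
    p_t = F.softmax((teacher_logits - center) / tau_t, dim=1).detach()  # centering + sharpening
    loss = -(p_t * log_p_s).sum(dim=1).mean()
    # update the center as an EMA of teacher logits (the "centering" statistic)
    new_center = c_mom * center + (1 - c_mom) * teacher_logits.mean(dim=0).detach()
    return loss, new_center
```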
4.1 DINO V1
DINO 9 introduces a teacher-student framework with the following signature choices:
- Student and teacher share the same backbone architecture (often ViT).
- Both branches output logits over $K$ prototypes (implemented as a linear layer after the projection head).
- Teacher is updated by EMA of student weights.
DINO popularized multi-crop: instead of only two global crops, use multiple views at different resolutions. Typically, the teacher sees only global crops (more stable), while the student sees both global and local crops (more diverse).
This design improves locality and makes the learned representations particularly strong for dense tasks (e.g., segmentation-like emergence in ViT attention maps).
If we only match teacher and student distributions, collapse is easy: the teacher can become constant; the student copies it. DINO stabilizes training via:
- Centering: subtract a running mean from teacher logits to avoid a single prototype dominating.
- Sharpening: use a low teacher temperature $\tau_T$ to avoid overly uniform outputs, encouraging confident assignments.
The combination acts like a self-regulating mechanism: it discourages both “everything maps to one prototype” and “everything maps to uniform noise.”
4.2 DINO V2
DINOv2 pushes the DINO recipe into the “foundation feature” regime by scaling and system improvements: 10
- Large-scale curated data (emphasizing diversity and quality).
- Scale in model size and training.
- Distillation into smaller, practical backbones.
Conceptually, DINOv2 strengthens a key message of Part III: negative-free self-distillation is not only a clever trick to remove negatives—it can become a scalable route to general-purpose vision features when combined with enough data and stable training.
4.3 DINO V3
DINOv3 11 continues this trajectory and positions itself as a more “versatile” vision foundation model. At a high level, the emphasis is not a brand-new loss form, but rather:
- pushing dense feature quality further,
- improving flexibility across resolutions and model sizes,
- and exploring post-hoc strategies for broader usability (including better integration with downstream pipelines).
In the storyline of this post, DINOv3 reinforces why Part III matters for the VLM/MLLM era (Sec. 0.2): a strong negative-free pretrained visual encoder is a natural building block for multimodal systems that need robust dense and global representations.
5. Clustering and Prototype-assignment
Another negative-free route is to convert SSL into a form of self-labeling: generate pseudo-labels by clustering (or prototype assignment), then train the network to predict those labels under augmentation consistency.
This family is closely related to distribution-style distillation, but conceptually emphasizes discrete assignments (cluster IDs) rather than matching a full soft distribution.
5.1 SwAV
SwAV (Swapped Assignments between Views) learns representations by predicting cluster assignments across views, rather than contrasting raw features. 12
Let $C\in\mathbb{R}^{K\times d}$ be $K$ prototype vectors. For each view embedding $z$, compute prototype scores:
\[s = C z \in \mathbb{R}^{K}.\]Then the model produces a prediction distribution:
\[p = \mathrm{softmax}(s/\tau).\]Separately, it computes a balanced assignment $q$ (often via Sinkhorn-Knopp), encouraging non-degenerate use of prototypes. The swapped loss predicts the assignment of one view from the other view:
\[\mathcal{L} = - q^{(1)\top}\log p^{(2)} \;-\; q^{(2)\top}\log p^{(1)}.\]SwAV avoids collapse by enforcing balanced cluster usage (via assignment constraints), so the trivial solution “everything maps to one cluster” is disallowed. It also benefits from multi-crop training: multiple small crops create richer learning signals without explicit negatives.
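A sketch of the Sinkhorn-Knopp balancing step and the swapped loss, following the common implementation pattern; hyperparameters and function names are illustrative.

```python
import torch

@torch.no_grad()
def sinkhorn(scores, eps=0.05, n_iters=3):
    """Sinkhorn-Knopp: turn prototype scores (B, K) into balanced assignments Q,
    so that prototypes are used roughly equally across the batch."""
    q = torch.exp(scores / eps).t()                 # (K, B)
    q /= q.sum()
    K, B = q.shape
    for _ in range(n_iters):
        q /= q.sum(dim=1, keepdim=True)             # normalize over samples per prototype
        q /= K
        q /= q.sum(dim=0, keepdim=True)             # normalize over prototypes per sample
        q /= B
    return (q * B).t()                              # (B, K), rows sum to 1

def swav_loss(scores_1, scores_2, tau=0.1):
    """Swapped prediction: the balanced assignment of one view supervises the other view."""
    q1, q2 = sinkhorn(scores_1), sinkhorn(scores_2)
    log_p1 = torch.log_softmax(scores_1 / tau, dim=1)
    log_p2 = torch.log_softmax(scores_2 / tau, dim=1)
    return -(q1 * log_p2).sum(dim=1).mean() - (q2 * log_p1).sum(dim=1).mean()
```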
5.2 DeepCluster
DeepCluster 13 is an earlier and very influential self-labeling method: iteratively cluster features and use cluster IDs as supervision. Repeat for epochs/rounds:
- Extract features for all images using the current network: $h_i=f_\theta(x_i)$.
- Run $k$-means clustering on $\{h_i\}$ to obtain cluster assignments $y_i\in\{1,\dots,K\}$.
Train the network to predict $y_i$ from augmented images using cross-entropy:
\[\mathcal{L}_{\text{cls}} = -\sum_i \log p_\theta(y_i \mid t(x_i)).\]
The clustering step introduces an external structure (k-means partition) that changes as representations change. Although degenerate solutions exist in principle, the alternating optimization (re-clustering + re-training) and practical balancing heuristics help the method discover non-trivial partitions.
DeepCluster is best seen as a prototype of “pseudo-label bootstrapping,” a theme that later appears repeatedly in both self-supervised and weakly-supervised systems.
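A simplified sketch of one DeepCluster-style round, with a plain Lloyd’s k-means producing pseudo-labels; the helper names, the naive k-means, and the classification head are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def kmeans_pseudo_labels(feats, k, n_iters=10):
    """Plain Lloyd's k-means on (N, d) features; returns one pseudo-label per sample."""
    centroids = feats[torch.randperm(feats.size(0))[:k]].clone()
    for _ in range(n_iters):
        assign = torch.cdist(feats, centroids).argmin(dim=1)   # nearest centroid
        for j in range(k):
            members = feats[assign == j]
            if members.numel() > 0:
                centroids[j] = members.mean(dim=0)             # recompute centroid
    return assign

def deepcluster_step(f_theta, head, x_aug, pseudo_labels):
    """Supervised step: predict the cluster ID of each augmented image with cross-entropy."""
    logits = head(f_theta(x_aug))                              # (B, K) classification logits
    return F.cross_entropy(logits, pseudo_labels)
```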
6. Redundancy Reduction
A different negative-free philosophy is: instead of relying on negatives or teachers, explicitly prevent collapse by constraining the statistics of the representation.
The core observation: matching two views encourages invariance, but collapse happens when embeddings lose diversity. So these methods add terms that preserve variance and reduce redundancy across dimensions.
6.1 Barlow Twins
Barlow Twins proposes an objective that makes the cross-correlation matrix between two views close to identity. 14
Let $Z^{(1)}, Z^{(2)} \in \mathbb{R}^{B\times d}$ be batch embeddings (often normalized per-dimension). Define the cross-correlation:
\[\mathcal{C}_{ij} = \frac{1}{B}\sum_{b=1}^{B} Z^{(1)}_{b,i}\,Z^{(2)}_{b,j}.\]Define the loss function:
\[\mathcal{L}_{\text{BT}} = \sum_i (\mathcal{C}_{ii}-1)^2 + \lambda \sum_{i\ne j} \mathcal{C}_{ij}^2.\]The loss enforces two properties:
- Invariance: diagonal entries close to 1
- Redundancy reduction: off-diagonals close to 0
If embeddings collapse to constants, correlations become degenerate and cannot match identity. So the objective itself discourages collapse without explicit negatives, without teacher EMA, and without stop-grad asymmetry.
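A minimal sketch of the Barlow Twins loss with per-dimension standardization over the batch; `lam` is an illustrative weight.

```python
import torch

def barlow_twins_loss(z1, z2, lam=5e-3):
    """Drive the cross-correlation of the two views' standardized embeddings toward identity."""
    B, d = z1.shape
    z1 = (z1 - z1.mean(0)) / (z1.std(0) + 1e-6)          # standardize each dimension over the batch
    z2 = (z2 - z2.mean(0)) / (z2.std(0) + 1e-6)
    c = (z1.t() @ z2) / B                                 # (d, d) cross-correlation matrix
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()        # invariance: C_ii -> 1
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()  # redundancy reduction: C_ij -> 0
    return on_diag + lam * off_diag
```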
6.2 VICReg
VICReg 15 explicitly decomposes the requirements of SSL into three terms. Let $Z^{(1)},Z^{(2)}\in\mathbb{R}^{B\times d}$ be the embeddings of two views.
Invariance: two views should match with each other.
\[\mathcal{L}_{\text{inv}} = \frac{1}{B}\sum_{b}\left\| Z^{(1)}_b - Z^{(2)}_b \right\|_2^2\]Variance: each embedding dimension should keep enough spread to avoid per-dimension collapse.
\[\mathcal{L}_{\text{var}} = \sum_{i=1}^{d} \max(0,\gamma - \sigma(Z^{(1)}_{\cdot,i})) \;+\; \sum_{i=1}^{d} \max(0,\gamma - \sigma(Z^{(2)}_{\cdot,i}))\]Covariance: different dimensions should not become redundant:
\[\mathcal{L}_{\text{cov}} = \sum_{i\ne j} \big(\mathrm{Cov}(Z^{(1)})_{ij}\big)^2 + \sum_{i\ne j} \big(\mathrm{Cov}(Z^{(2)})_{ij}\big)^2\]
Final loss:
\[\mathcal{L}_{\text{VICReg}} = \alpha\mathcal{L}_{\text{inv}} + \beta\mathcal{L}_{\text{var}} + \lambda\mathcal{L}_{\text{cov}}.\]Why does it avoid collapse? The variance term prevents the trivial constant solution (no spread), and the covariance term prevents “information collapse” where all dimensions carry the same signal. Together, VICReg provides an explicit and interpretable alternative to “negatives” or “teacher EMA.”
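A minimal sketch of the three VICReg terms; the coefficients and epsilons are illustrative, not the paper’s exact settings.

```python
import torch
import torch.nn.functional as F

def vicreg_loss(z1, z2, alpha=25.0, beta=25.0, lam=1.0, gamma=1.0):
    """Invariance + variance + covariance terms on batch embeddings z1, z2 of shape (B, d)."""
    B, d = z1.shape
    inv = F.mse_loss(z1, z2)                                          # invariance term

    std1 = torch.sqrt(z1.var(dim=0) + 1e-4)
    std2 = torch.sqrt(z2.var(dim=0) + 1e-4)
    var = torch.relu(gamma - std1).mean() + torch.relu(gamma - std2).mean()  # keep per-dim spread

    def off_diag_cov(z):
        z = z - z.mean(dim=0)
        cov = (z.t() @ z) / (B - 1)
        return (cov - torch.diag(torch.diagonal(cov))).pow(2).sum() / d

    cov = off_diag_cov(z1) + off_diag_cov(z2)                         # decorrelate dimensions
    return alpha * inv + beta * var + lam * cov
```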
Part IV — Masked Modeling
In Part II and Part III, the core pretext signal comes from relating two views: contrastive methods explicitly shape the geometry with positives vs. negatives, while negative-free methods replace “repulsion” with self-distillation targets and regularizers.
Masked Modeling takes a different route: instead of matching view-to-view, it hides a large portion of the input and trains the model to predict the missing content using context. This is conceptually closer to “learn by completing” (BERT-style) than “learn by discriminating,” and it turns out to scale extremely well for vision Transformers. At a high level, masked modeling is defined by two choices:
- What to mask: typically random patches/tokens (sometimes block-wise), with mask ratio often in the 40%–90% range depending on the method.
- What to predict (the prediction target): this choice largely defines the family of methods.
We can unify most masked modeling objectives with the following template.
Let an image be patchified into \(N\) tokens/patches, and let \(M \subset \{1,\dots,N\}\) be the masked set. The model sees a corrupted input \(x_{\setminus M}\) and predicts targets \(y_i\) for masked positions:
\[\mathcal{L}_{\text{MIM}} =\mathbb{E}_{x, M}\left[\frac{1}{|M|}\sum_{i\in M}\ell\big(\hat{y}_i(x_{\setminus M}),\, y_i(x)\big)\right].\]The difference between methods is mainly: what is \(y_i\) and what is \(\ell(\cdot,\cdot)\). In this part, we organize masked modeling into three categories:
- Pixel-level reconstruction: \(y_i\) is raw RGB pixels of masked patches (MAE, SimMIM).
- Token-level reconstruction: \(y_i\) is a discrete visual token produced by a tokenizer (BEiT-style).
- Feature-level reconstruction: \(y_i\) is a continuous feature target (teacher features or handcrafted features), avoiding explicit tokenization (data2vec, MaskFeat).
7. Masked Pixel-Level Reconstruction
Pixel-level methods choose the simplest possible target: recover the missing pixels. At first glance, pixel regression looks “too low-level,” but two design principles make it work:
- High masking ratio makes the task non-trivial: with most content removed, the model cannot solve reconstruction by local interpolation alone and must rely more on global structure and semantics.
- Asymmetric compute: if the encoder processes only visible tokens, training becomes efficient even at large scale.
This category is also a good mental bridge to diffusion/autoencoding: it is “representation learning through reconstruction,” but the goal is not to become a good generator—rather, it is to force the encoder to learn useful internal features.
7.1 SimMIM
SimMIM 16 argues that masked image modeling (MIM) does not need sophisticated designs to work well. Instead, it identifies a simple but critical principle: the effectiveness of MIM largely depends on a “moderate prediction distance”—the masked content should be neither trivially recoverable from nearby visible pixels nor impossibly hard to infer.
SimMIM formulates MIM with four major components: masking strategy, encoder, prediction head, and prediction target. It then shows that simple choices for each component are sufficient:
- Masking: patch-aligned random masking with a moderately large masked patch size (default 32×32) and a mask ratio around 0.6. Masking is implemented by replacing each masked patch embedding with a learnable mask token (same dimension as visible patch embeddings).
- Encoder: standard ViT or hierarchical Swin-style Transformer.
- Prediction target: raw RGB pixels of masked regions (continuous regression).
- Prediction head: can be extremely lightweight—often a single linear layer is enough.
SimMIM predicts raw RGB values for masked regions and uses a simple \(\ell_1\) loss:
\[\mathcal{L}_{\text{SimMIM}} =\frac{1}{|M|}\sum_{i\in M}\left\|x_i-\hat{x}_i\right\|_1.\]The key empirical finding is that direct pixel regression performs no worse than more complex classification/tokenization targets, provided masking is designed reasonably.
7.2 MAE
MAE 17 (“Masked Autoencoders”) is the most influential pixel-reconstruction MIM framework for ViTs. It is defined by two core ideas: asymmetric encoder–decoder and very high masking ratio.
- Encoder: only processes the visible subset (no mask tokens), producing latent representations for visible patches.
- Decoder: a lightweight Transformer that takes encoder latents, plus learned mask tokens and reconstructs the full set of patches.
This is crucial because self-attention cost scales roughly quadratically with token count.
If you mask a fraction \(r\) of the patches, the encoder sees only \((1-r)N\) tokens, so compute can drop dramatically while the task becomes harder.
Let \(x_i\in\mathbb{R}^{P^2C}\) be the flattened RGB pixels of patch \(i\). The decoder outputs \(\hat{x}_i\). MAE uses a simple regression loss over masked patches:
\[\mathcal{L}_{\text{MAE}} =\frac{1}{|M|}\sum_{i\in M}\left\|\mathrm{norm}(x_i)-\hat{x}_i\right\|_2^2,\]where \(\mathrm{norm}(\cdot)\) is often per-patch normalization (stabilizes optimization and scale). We summarize the MAE training pipeline step by step (a minimal sketch follows the list):
- Patchify image into $N$ patches and embed them.
- Sample a random mask set $M$ (often \(r\approx 75\%\)).
- Feed only the visible tokens \(x_{\setminus M}\) into the encoder to produce latents.
- Append mask tokens at masked positions, run lightweight decoder.
- Predict pixels for masked patches, compute \(\mathcal{L}_{\text{MAE}}\) on masked positions only.
- After pretraining, discard decoder and use the encoder for downstream tasks.
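A sketch of the two MAE-specific pieces, random masking of patch tokens and the masked-only normalized pixel loss; shapes and helper names are assumptions for illustration.

```python
import torch

def random_masking(tokens, mask_ratio=0.75):
    """Split patch tokens (B, N, d) into a visible subset (for the encoder) and masked indices."""
    B, N, d = tokens.shape
    n_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N, device=tokens.device)            # one random score per patch
    shuffle = noise.argsort(dim=1)                            # low score = keep
    keep_idx, mask_idx = shuffle[:, :n_keep], shuffle[:, n_keep:]
    visible = torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, d))
    return visible, keep_idx, mask_idx

def mae_loss(pred, target, mask_idx):
    """Per-patch normalized pixel regression, averaged over masked positions only.
    pred, target: (B, N, P*P*C) flattened patch pixels; mask_idx: (B, N_masked)."""
    mu = target.mean(dim=-1, keepdim=True)
    var = target.var(dim=-1, keepdim=True)
    target = (target - mu) / (var + 1e-6).sqrt()              # per-patch normalization
    loss = (pred - target).pow(2).mean(dim=-1)                # (B, N) per-patch MSE
    return torch.gather(loss, 1, mask_idx).mean()             # masked patches only
```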
A useful summary: MAE makes reconstruction hard by hiding most of the content, and makes training cheap by letting the encoder see only the small visible subset. In practice, MAE pretraining offers:
- Strong transfer for classification, detection, and segmentation (the encoder learns global structure).
- Good scaling with model size thanks to its compute efficiency.
- Strong results even without the heavy augmentations that contrastive learning relies on (masking itself provides the main corruption).
8. Masked Token-Level Reconstruction
Token-level MIM is the most literal transfer of BERT to vision: mask some positions, then classify the missing content—except now the “words” are visual tokens produced by a discrete tokenizer. The motivation is straightforward:
- Pixels are low-level and continuous, which can make prediction harder (high-dimensional regression).
- Discrete tokens can inject semantic abstraction (if the tokenizer is good), turning reconstruction into a “BERT-like classification” problem.
In this family, the key question becomes: what tokenizer produces good visual tokens?
BEiT v1 uses a dVAE tokenizer; BEiT v2 and PeCo focus on improving token quality/semantics.
8.1 BEiT v1
BEiT 18 (“BERT Pre-Training of Image Transformers”) popularized token-level MIM for vision. BEiT constructs two parallel “views”:
- Input view: image patches embedded and fed into a ViT encoder, with a subset masked (replaced by a learned mask embedding).
- Target view: discrete visual tokens produced by a pretrained tokenizer (e.g., dVAE).
The model sees masked patches but must predict the corresponding visual tokens. Let \(z_i\in\{1,\dots,K\}\) be the tokenizer’s discrete token index for patch $i$. BEiT trains a classifier over vocabulary $K$ at masked positions:
\[\mathcal{L}_{\text{BEiT}} =-\frac{1}{|M|}\sum_{i\in M}\log p_{\theta}(z_i\mid x_{\setminus M}).\]This is “BERT for images” almost verbatim—except the tokens come from a vision tokenizer rather than natural language.
The hope of BEiT v1 is that predicting $z_i$ is more semantic than predicting raw pixels—if the tokenizer produces semantically meaningful codes.
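The BEiT objective itself is a plain cross-entropy at the masked positions; a minimal sketch, assuming the tokenizer’s discrete codes are precomputed and the shapes are as noted in the comments.

```python
import torch.nn.functional as F

def beit_loss(token_logits, token_targets, mask):
    """Masked visual-token prediction.
    token_logits: (B, N, K) ViT predictions over the visual vocabulary;
    token_targets: (B, N) discrete codes from the frozen tokenizer;
    mask: (B, N) boolean, True at masked positions."""
    return F.cross_entropy(token_logits[mask], token_targets[mask])
```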
8.2 BEiT v2
BEiT v2 19 identifies a limitation of BEiT v1: dVAE tokens can be overly tied to low-level appearance. To make token targets more semantic, BEiT v2 introduces a semantic visual tokenizer trained with Vector-Quantized Knowledge Distillation (VQ-KD). The tokenizer is trained to discretize a teacher semantic space:
- A teacher model (e.g., a self-supervised teacher) produces semantic features.
- A tokenizer encoder produces patch features \(h_i\).
Vector quantization maps \(h_i\) to a discrete code \(z_i\) by nearest-neighbor lookup in a learnable codebook \(\{v_j\}_{j=1}^K\):
\[z_i=\arg\min_{j}\left\|\ell_2(h_i)-\ell_2(v_j)\right\|_2.\]- A decoder reconstructs the teacher semantic features conditioned on the discrete codes, so the codebook becomes “semantic-aware.”
After tokenizer training, BEiT v2 follows the same MIM recipe as BEiT v1: mask patches, feed the corrupted patches to the ViT, and predict discrete semantic codes at the masked positions.
BEiT v2’s core claim is: better tokens ⇒ better masked modeling supervision. Token prediction becomes easier and more semantically aligned, reducing the burden of reconstructing high-dimensional pixels while keeping the supervision meaningful.
8.3 PeCo
PeCo 20 (“Perceptual Codebook”) focuses on the same bottleneck as BEiT: token quality.
Its key observation is that a good prediction target for masked image modeling should agree with human perceptual similarity—but vanilla dVAE tokenizers may not.
PeCo improves the discrete tokenizer (dVAE-style) by encouraging the learned token space to preserve perceptual similarity:
- perceptually similar patches/images should map to nearby representations during tokenizer learning,
- which leads to discrete tokens that better align with semantic groupings.
Once the perceptual tokenizer is trained, the downstream pretraining is still “BEiT-style” masked token prediction. The improvement comes from a better target space rather than changing the ViT pretraining procedure.
In token-level MIM, the main lever is often not the encoder architecture but the tokenizer:
- BEiT v1: simple tokenizer → workable but sometimes low-level targets
- BEiT v2 / PeCo: semantic/perceptual tokenizer → stronger targets → better transfer
9. Masked Feature-Level Reconstruction
Feature-level MIM keeps the “predict masked content” philosophy, but replaces pixels/tokens with continuous feature targets. This has two major advantages:
- No need for an external discrete tokenizer (unlike BEiT).
- Targets can be semantic while remaining continuous (avoid quantization errors).
Depending on the method, the target feature can come from:
- a teacher network (self-distillation style),
- or a handcrafted descriptor (surprisingly effective in some cases).
This category also creates a conceptual bridge between masked modeling and Part III: it often looks like masked self-distillation.
9.1 data2vec v1 & data2vec v2
data2vec 21 proposes a general SSL framework across vision/speech/language with a clean principle: Predict contextualized latent representations of the full input, using a masked view as input.
Core mechanism: masked self-distillation on features
- A teacher produces contextualized features from the (less corrupted / full) input.
- A student sees the masked input and predicts the teacher’s features at masked positions.
Denote teacher targets as \(y_i\) and student predictions as \(\hat{y}_i\). A typical objective is regression on masked positions:
\[\mathcal{L}_{\text{data2vec}} =\frac{1}{|M|}\sum_{i\in M}\left\|\hat{y}_i - y_i\right\|_2^2.\]The teacher is commonly an EMA version of the student (self-distillation), and the targets are often constructed from the teacher’s top layers to be “contextualized” (global information, not purely local).
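A minimal sketch of this masked feature regression, assuming EMA-teacher layer outputs are available and targets are built by averaging the top layers; the layer averaging and normalization follow the general recipe, with illustrative defaults.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def data2vec_targets(teacher_layer_outputs, top_k=8):
    """Contextualized targets: average the EMA teacher's top-k layer outputs on the
    unmasked input, then normalize. teacher_layer_outputs: list of (B, N, d) tensors."""
    target = torch.stack(teacher_layer_outputs[-top_k:]).mean(dim=0)
    return F.layer_norm(target, (target.size(-1),))

def data2vec_loss(student_pred, target, mask):
    """Regress student predictions (computed from the masked input) onto teacher targets
    at masked positions only. mask: (B, N) boolean."""
    return F.mse_loss(student_pred[mask], target[mask])
```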
data2vec 2.0 22 keeps the same philosophy—predict contextualized targets—but focuses on making training much faster:
- avoid encoding masked tokens (closer to MAE-style efficiency),
- use fast decoders,
- amortize the cost of building teacher representations.
9.2 MaskFeat
MaskFeat 23 is a masked feature prediction framework originally developed for video Transformers, with an important insight: Pixel targets are not the only “simple” targets—handcrafted features can be strong and efficient prediction targets.
Instead of reconstructing RGB pixels or discrete tokens, MaskFeat masks a portion of the input sequence and predicts feature descriptors for masked regions:
\[\mathcal{L}_{\text{MaskFeat}} =\frac{1}{|M|}\sum_{i\in M}\left\|\hat{f}_i - f_i\right\|_2^2,\]where \(f_i\) can be various feature choices (the paper studies multiple types).
MaskFeat reports that HOG features are particularly effective in terms of performance and efficiency, and the method avoids the need for an external tokenizer—especially important for compute-heavy video settings.
Part V — Generative Representation Learning
This part looks at a different—but equally important—perspective: representations that emerge as a by-product of learning to generate. In these methods, the primary goal is to model the data distribution or reconstruct inputs with high fidelity. Generative representation learning studies a seemingly simple idea: A strong generator must build strong internal representations. Can we reuse (or improve) those internal states as transferable features for downstream tasks?
The most straightforward way is to freeze the generative model and then take its intermediate outputs as representations. The generative model introduces some internal variables—sometimes explicit (a latent code), sometimes implicit (hidden states). These internal variables are where “representations” live.
Explicit latent variables: encoder → latent → decoder. Classic example: VAE-family. You get an explicit mapping
\[z = E_\phi(x),\quad x \approx D_\theta(z),\]so using the representation is straightforward: you take $z$ (or its mean).
Discrete tokens: tokenizer/codebook → token IDs. Classic example: VQ-VAE / VQGAN / MaskGIT. An image becomes a sequence of discrete indices:
\[x \;\longrightarrow\; \{k_1,\dots,k_L\}, \quad k_i \in \{1,\dots,V\}.\]These tokens can be treated as a discrete representation (IDs), or you can use their embeddings/hidden states inside a Transformer.
Implicit features inside the generator/denoiser/discriminator. Diffusion and GANs often do not provide an encoder by default (though some variants do). Still, their internal feature maps can serve as representations:
- diffusion denoiser features at specific noise levels $t$
- GAN discriminator intermediate activations
- GAN “inversion” latents (e.g., StyleGAN’s $w$) after solving an inverse problem
However, “just take intermediate features” is not the full story. Generative objectives are dominated by reconstruction/synthesis fidelity, which pushes the model to preserve fine details, high-frequency cues, and nuisance factors that discriminative representation learning often tries to discard. As a result, whether a generative model yields useful representations depends on where the representation is read out, how it is pooled, and sometimes how the generator is trained.
Across generative families, three recurring research directions appear again and again:
- Readout from a frozen generator: decide which layer(s) and which scale(s) to take; for iterative generators (diffusion), also decide which timestep(s). Multi-layer fusion often matters more than any single “best layer.”
- Shape the generator’s internal representations: add auxiliary constraints so that intermediate features become more semantic, more stable, or more transferable—sometimes also improving generation.
- Distill generative features into an efficient encoder: the generator can be a strong teacher but a slow feature extractor; distillation produces a student backbone that is fast at inference time.
The rest of this part reviews these ideas for four major generative families: autoencoders, autoregressive models, diffusion models, and GANs.
10. VAE Family (VAE, β-VAE)
Autoencoders are the most direct bridge between generation and representation because they explicitly contain an encoder. A classical VAE learns an inference network \(q_\phi(z\mid x)\) and a decoder \(p_\theta(x\mid z)\); the objective is to maximize the evidence lower bound (ELBO) on the log-likelihood:
\[\log p_\theta(x) \;\ge\; \mathcal{L}_{\text{VAE}} = \mathbb{E}_{q_\phi(z\mid x)}[\log p_\theta(x\mid z)] \;-\;\mathrm{KL}(q_\phi(z\mid x)\,\|\,p(z)).\]Representation readout is therefore not an afterthought: the latent \(z\) (or its mean \(\mu_\phi(x)\)) is an intended bottleneck. The catch is that the “best” representation is task-dependent.
- A very low-dimensional \(z\) encourages invariance but may lose spatial detail;
- A high-capacity \(z\) can preserve detail but may entangle nuisances.
β-VAE 24 modifies the VAE objective to emphasize factorized, interpretable latents by upweighting the KL term.
\[\mathcal{L}_{\beta\text{-VAE}} = \mathbb{E}_{q_\phi(z\mid x)}[\log p_\theta(x\mid z)] - \beta\;\mathrm{KL}(q_\phi(z\mid x)\,\|\,p(z)).\]- When $\beta>1$, the latent is more constrained, often encouraging factorized / disentangled latents (at the cost of reconstruction fidelity).
- When $\beta<1$, latents can carry more information, often improving fidelity but weakening structure.
From a representation-learning lens, β-VAE is explicitly trying to make individual dimensions of $z$ align with underlying generative factors—useful for controllable representations, attribute manipulation, and “factor discovery” settings.
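To make the objective concrete, here is a minimal β-VAE sketch under a Gaussian-decoder assumption; the MLP sizes, `x_dim`, and the `beta` value are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BetaVAE(nn.Module):
    """Minimal β-VAE on flattened inputs; the encoder outputs (mu, logvar)."""
    def __init__(self, x_dim: int = 784, z_dim: int = 16, beta: float = 4.0):
        super().__init__()
        self.beta = beta
        self.enc = nn.Sequential(nn.Linear(x_dim, 256), nn.ReLU(), nn.Linear(256, 2 * z_dim))
        self.dec = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(), nn.Linear(256, x_dim))

    def forward(self, x):
        mu, logvar = self.enc(x).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)     # reparameterization
        recon = self.dec(z)
        rec = F.mse_loss(recon, x, reduction="sum") / x.size(0)     # Gaussian reconstruction term
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp()) / x.size(0)
        return rec + self.beta * kl, mu          # mu is the usual representation readout

model = BetaVAE()
loss, mu = model(torch.randn(8, 784))
loss.backward()
```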
11. Autoregressive Models (PixelCNN, iGPT)
Autoregressive (AR) generative modeling defines a tractable likelihood by the chain rule:
\[p_\theta(x) = \prod_{i=1}^{N} p_\theta(x_i \mid x_{<i}).\]Because every prediction depends on previous context, AR models must learn features that capture both local structure and global consistency.
11.1 PixelRNN / PixelCNN: representations inside masked dependency modeling
PixelRNN 25 / PixelCNN 25 predict pixels sequentially, typically in raster order. The representation is not an explicit latent $z$, but the hidden state used to predict the next pixel:
- In PixelRNN: recurrent hidden states over the image grid
- In PixelCNN: convolutional stacks with masked convolutions (ensuring causality)
From a representation-learning lens:
- intermediate feature maps can be reused as embeddings,
- but pure pixel-level AR tends to emphasize low-level detail and is computationally expensive.
11.2 iGPT: Transformers on pixels as generative pretraining
iGPT 26 applies the Transformer to image sequences and explores autoregressive objectives (and also masked/BERT-style objectives) to learn transferable features. The key conceptual point for this article is: Even when trained to predict pixels, a sufficiently large and well-optimized model can learn representations useful for downstream recognition.
In iGPT-style usage, a representation is typically extracted from:
- the final-layer hidden states (pooled),
- or a special token representation, then fine-tuned or linearly probed for classification.
12. Diffusion-Based Representation: Denoising Models as Feature Learners
Diffusion models are usually introduced as generators, but their denoising networks implicitly learn a rich time-conditioned feature hierarchy. At a high level, a diffusion model defines a corruption process and trains a denoiser (often a U-Net or a DiT) to predict noise / velocity / score. If we denote the noisy latent at time $t$ as $z_t$, then the denoiser is a conditional network \(F_\theta(z_t, t, c)\) whose intermediate activations form features \(\phi_\ell(z_t, t)\) at layer \(\ell\). Two properties make diffusion features special:
- Time is a semantic knob. Early noise levels tend to preserve local structure; later noise levels emphasize global semantics and category-level alignment.
- Layer is a spatial/semantic knob. Shallow layers encode edges and texture-like signals; deeper layers encode object parts and high-level layout.
This “time $\times$ layer” grid behaves like an unusually expressive feature pyramid. The practical question is then: how do we turn this feature pyramid into usable representations? Recent work clusters into three coherent directions:
- A. Feature extraction: use a pretrained diffusion model as a frozen feature backbone.
- B. Distillation: transfer diffusion representations into a cheaper student network.
- C. Representation-aware training: modify training so that representations and generation improve together.
12.1 Pretrained diffusion models as feature extractors
The simplest route is to treat a pretrained diffusion model as a frozen representation engine: run (part of) the denoiser, read intermediate features, and use them for downstream tasks (correspondence, segmentation, retrieval, editing diagnostics). The central design choices are always the same:
- How to map a real image to \(z_t\): add noise directly (if operating in pixel space) or encode to latent (latent diffusion), sometimes with inversion when trajectory consistency matters.
- Which timestep(s) $t$: a single well-chosen timestep can already be strong; multiple timesteps give robustness.
- Which layer(s) $\ell$: mid/deep layers tend to be better for semantic tasks; shallow/mid for geometry and texture.
- How to aggregate: concatenate, sum, learn a projector, or learn weights across \((t,\ell)\).
Two representative works show how far this “from frozen generator to useful features” idea can go.
12.1.1 DIFT: Diffusion Feature
DIFT 27 operationalizes a clean recipe: a pretrained text-to-image diffusion model contains emergent dense correspondences, and we can extract them with minimal machinery. Concretely, DIFT uses a Stable-Diffusion–style pipeline and extracts U-Net features at a chosen timestep and U-Net block, then uses simple similarity matching for correspondence. The implementation perspective is revealing: you do not need to sample images—just a forward pass (or a small number of passes) to expose the internal descriptors.
Feature extraction template (conceptual):
- Encode image $x$ to latent $z$ (for latent diffusion).
- Choose a timestep $t$, sample noise \(\epsilon\), and form \(z_t\):
\[z_t = \sqrt{\bar \alpha_t}\,z + \sqrt{1-\bar \alpha_t}\,\epsilon.\]
- Run the denoiser once to obtain intermediate activations:
\[\mathbf{F}_{\ell,t}(z_t) \in \mathbb{R}^{H_\ell\times W_\ell\times C_\ell}.\]
- Define a pixel/patch descriptor at location \(p\) as a normalized channel vector:
\[\mathbf{h}_{\ell,t}(p)=\frac{\mathbf{F}_{\ell,t}(p)}{\|\mathbf{F}_{\ell,t}(p)\|_2}.\]
- Use patch-wise cosine similarity / nearest neighbors to match points across images (or to build dense descriptors):
\[s(p,q)=\mathbf{h}_{\ell,t}(p)^\top \mathbf{h}'_{\ell,t}(q), \quad q^*=\arg\max_q s(p,q).\]
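A hedged PyTorch sketch of this recipe, using a forward hook on one intermediate layer of a frozen denoiser. The `denoiser` call signature, the chosen `layer`, and the absence of text conditioning are assumptions for illustration; a real Stable-Diffusion pipeline would add a VAE encoding step and prompt conditioning.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def dift_features(denoiser, z, t, alphas_cumprod, layer):
    """Normalized per-location descriptors from one intermediate layer of a frozen
    denoiser at timestep t (DIFT-style, simplified)."""
    feats = {}
    hook = layer.register_forward_hook(lambda m, i, o: feats.update(out=o))
    a = alphas_cumprod[t]
    z_t = a.sqrt() * z + (1 - a).sqrt() * torch.randn_like(z)   # forward noising
    denoiser(z_t, torch.tensor([t], device=z.device))           # one pass; outputs unused
    hook.remove()
    return F.normalize(feats["out"], dim=1)                     # (B, C, H, W), unit-norm channels

def match(desc_a, desc_b, p):
    """Nearest-neighbour match for location p = (y, x) of image A inside image B."""
    q = desc_a[0, :, p[0], p[1]]                    # (C,) query descriptor
    sim = torch.einsum("c,chw->hw", q, desc_b[0])   # cosine-similarity map over B
    idx = sim.flatten().argmax().item()
    return divmod(idx, sim.shape[1])                # (y*, x*) in image B
```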
Why it works. The denoiser is trained to be consistent under corruption; intermediate activations therefore become stable descriptors for content that remains predictable at that time $t$. In practice, a mid-to-late timestep often provides a good balance: enough noise to suppress superficial texture mismatch, but not so much that local structure disappears.
What it is good at. DIFT is most often used for semantic correspondence—aligning parts across images without explicit supervision. Reported comparisons show large gains over standard self-supervised features (e.g., DINO / CLIP) on correspondence benchmarks.
A practical note. In real usage, DIFT-style features are sensitive to (i) timestep choice and (ii) which internal block you read from; but the crucial point is that the method makes these knobs explicit and controllable, so extraction can be tuned to task needs rather than treated as a black box.
12.1.2 Diffusion Hyperfeatures
If DIFT is the “single-slice” view (pick one \((t,\ell)\) and read features), Diffusion Hyperfeatures 28 takes the natural next step: the best representation is rarely located at a single layer and a single timestep. Instead, the method treats diffusion activations as a large feature bank across time and depth, and learns how to compress them into a single strong descriptor per location.
Diffusion Hyperfeatures pushes the “frozen feature extractor” idea further by explicitly aggregating multi-layer + multi-timestep features into a single hyperfeature representation. One clean way to write their core idea is:
- upsample all \(\mathbf{F}_{\ell,t}\) to a common spatial resolution via \(U_\ell(\cdot)\),
- project to a common channel dimension via a bottleneck \(b_\ell(\cdot)\),
- aggregate with weights \(w_{\ell,t}\):
\[\mathbf{H}(p)=\sum_{t\in\mathcal{T}}\sum_{\ell\in\mathcal{L}} w_{\ell,t}\, b_\ell\!\Big(U_\ell(\mathbf{F}_{\ell,t})\Big)(p).\]
This turns “which layer/time should I use?” into a learnable or tunable weighting problem, while still keeping the diffusion model frozen.
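A minimal sketch of such an aggregator, assuming a precomputed bank of frozen activations indexed by layer and timestep; the channel counts, the weight parameterization, and the output resolution are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HyperfeatureAggregator(nn.Module):
    """Learned aggregation over a frozen (layer, timestep) feature bank,
    in the spirit of Diffusion Hyperfeatures; shapes and names are illustrative."""
    def __init__(self, in_channels, num_timesteps: int, out_dim: int = 256):
        super().__init__()
        # one 1x1 bottleneck per layer, shared across timesteps
        self.bottlenecks = nn.ModuleList([nn.Conv2d(c, out_dim, 1) for c in in_channels])
        # one mixing weight per (layer, timestep) pair
        self.weights = nn.Parameter(torch.zeros(len(in_channels), num_timesteps))

    def forward(self, bank, size):
        # bank[l][t]: (B, C_l, H_l, W_l) frozen diffusion activations
        w = self.weights.flatten().softmax(dim=0).view_as(self.weights)
        out = 0.0
        for l, feats_l in enumerate(bank):
            for t, f in enumerate(feats_l):
                f = F.interpolate(f, size=size, mode="bilinear", align_corners=False)
                out = out + w[l, t] * self.bottlenecks[l](f)
        return out                                   # (B, out_dim, H, W) hyperfeature map

# toy usage: 2 layers x 3 timesteps of fake activations
bank = [[torch.randn(1, 320, 32, 32) for _ in range(3)],
        [torch.randn(1, 640, 16, 16) for _ in range(3)]]
hyper = HyperfeatureAggregator([320, 640], num_timesteps=3)(bank, size=(64, 64))
```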
A key practical ingredient here is the use of diffusion inversion for real images. Inversion provides a trajectory that is consistent with the diffusion dynamics, which stabilizes features across different inputs and makes cross-image matching more reliable.
Interpretation. Diffusion Hyperfeatures is effectively saying: diffusion models already expose a basis of descriptors; the missing piece is how to combine them. Once aggregation is learned, the representation becomes less brittle to timestep choice and more robust across object categories and appearance changes.
12.2 Distilling diffusion representations into efficient students
Frozen diffusion feature extraction is powerful, but it is also expensive: running a large diffusion backbone (plus inversion, plus multi-$t$ feature stacks) is rarely acceptable for large-scale retrieval or dense prediction pipelines.
Distillation methods therefore aim to preserve the representation quality of diffusion features while removing the compute burden. The common pattern is:
- Treat the diffusion model as a teacher that provides supervision signals (features, patch descriptors, or pseudo-labels).
- Train a standard vision backbone \(f_\phi(x)\) (ResNet/ConvNeXt/ViT) as a student so that its representations match the teacher’s, typically under augmentations.
12.2.1 RepFusion
RepFusion 29 distills a recognition encoder \(f(\cdot;\theta_f)\) from a pretrained diffusion teacher \(s(\cdot,\cdot;\theta^*)\), but highlights a central difficulty: the “best” timestep $t$ is not universal.
They first state the basic distillation objective. If
\[\mathbf{z}^{(t)} \leftarrow s(\mathbf{x},t;\theta^*),\quad \mathbf{z}\leftarrow f(\mathbf{x};\theta_f),\]then distillation minimizes
\[\min_{\theta_f}\ \mathbb{E}\big[\mathcal{L}_{\text{kd}}(\mathbf{z}^{(t)},\mathbf{z})\big].\]After that, the student is fine-tuned with a task loss:
\[\min_{\theta_f}\ \mathbb{E}\big[\mathcal{L}_{\text{task}}(y,\hat y)\big],\quad \hat y=f(\mathbf{x};\theta_f).\]To pick timesteps, RepFusion frames a per-sample selection criterion:
\[t^*=\arg\min_{t\in[0,T]}\Big\{\ \inf_{\theta_g}\ \mathcal{L}_{\text{task}}\big(y, g(\mathbf{z}^{(t)};\theta_g)\big)\ \Big\},\]and optimizes a policy \(\pi_{\theta_\pi}(t\mid \mathbf{x})\) with a REINFORCE-style objective:
\[\max_{\theta_\pi} J(\theta_\pi) = \mathbb{E}\Big[\sum_t \pi_{\theta_\pi}(t\mid\mathbf{x})\,\mathcal{R}^t_{\mathbf{x}}\Big] +\lambda_H H(t),\]where the reward is tied to (negative) task loss and $H(t)$ encourages exploration.
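Setting the timestep-selection policy aside, the distill-then-finetune skeleton can be sketched as follows; `diffusion_feat` stands in for the frozen teacher \(s(\cdot,\cdot;\theta^*)\), and all module shapes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def distill_step(student, proj, x, t, diffusion_feat):
    """Stage 1: match student features to frozen diffusion-teacher features at timestep t."""
    with torch.no_grad():
        z_t = diffusion_feat(x, t)          # teacher representation s(x, t; θ*)
    z = proj(student(x))                    # student representation, projected to teacher dim
    return F.mse_loss(z, z_t)

def finetune_step(student, head, x, y):
    """Stage 2: attach a task head and train with the task loss."""
    return F.cross_entropy(head(student(x)), y)

# toy usage with stand-in modules
student = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128))
proj, head = nn.Linear(128, 256), nn.Linear(128, 10)
teacher = lambda x, t: torch.randn(x.size(0), 256)      # pretend diffusion features
x, y = torch.randn(4, 3, 32, 32), torch.randint(0, 10, (4,))
loss_kd = distill_step(student, proj, x, torch.tensor(200), teacher)
loss_task = finetune_step(student, head, x, y)
```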
12.2.2 DreamTeacher
DreamTeacher 30 uses strong generative models as teachers for representation pretraining, combining feature distillation with label distillation. Their feature distillation starts from a “base” loss:
\[\mathcal{L}_{\text{base}}=\mathcal{L}_{\text{regression}}+\lambda\,\mathcal{L}_{\text{dice}},\]with
\[\mathcal{L}_{\text{regression}}=\big\|f_\theta^{g}(s_i)-v_i\big\|_2^2,\]and a Dice-form term
\[\mathcal{L}_{\text{dice}}=1-\frac{2\sum_{c=1}^C \hat y_{ic}y_{ic}}{\sum_{c=1}^C \hat y_{ic}^2+\sum_{c=1}^C y_{ic}^2}.\]For label distillation, they use a soft-target cross entropy:
\[\mathcal{L}_{\text{ld}}=-\sum_{i=1}^C \hat p_i\log p_i,\]and mix it with standard supervised cross-entropy:
\[\mathcal{L}=\gamma\,\mathcal{L}_{\text{ce}}+(1-\gamma)\,\mathcal{L}_{\text{ld}}.\]
Interpretation
- Feature distillation says: “match the teacher’s internal representation.”
- Label distillation says: “match the teacher’s softened outputs.”
- Together, they transfer both geometry (features) and decision structure (labels) from a generative teacher into a fast student.
12.3 Improving representations to improve generation
The third direction flips the perspective: instead of asking “what representations do diffusion models already contain?”, it asks “how should diffusion models be trained so that representations and generation reinforce each other?”
This question matters because diffusion training is not automatically optimized for semantic separability. In fact, REPA’s analysis highlights a concrete tension: diffusion transformers can show a linear-probing peak at intermediate depth, followed by degradation in later layers—suggesting that late computation may drift toward high-frequency detail synthesis rather than maintaining semantically rich features.
Three representative works illustrate different ways to inject representation structure into diffusion training.
12.3.1 SODA
SODA 31 is explicit about representation learning: it couples an encoder $E$ and a diffusion decoder $D$ and trains them end-to-end for novel view generation as self-supervision. The encoder maps a source view to a compact latent:
\[\mathbf{z}=E(\mathbf{x}'),\]and the diffusion decoder denoises the target view while being conditioned on \(\mathbf{z}\). Conditioning is done via a simple but effective layer modulation (AdaGN-style):
\[\mathbf{z}_s \odot \text{GroupNorm}(\mathbf{h})+\mathbf{z}_b,\]where \((\mathbf{z}_s,\mathbf{z}_b)\) are linear projections of \(\mathbf{z}\) injected at each decoder layer.
Training still uses the standard diffusion denoising objective (they explicitly refer to the classic MSE diffusion objective). A representative way to write the conditioned denoising loss is:
\[\mathcal{L}_{\text{SODA}} =\mathbb{E}_{\mathbf{x},\mathbf{x}',t,\epsilon}\Big[\big\|\epsilon-\epsilon_\theta(\mathbf{x}_t,t,\mathbf{z})\big\|_2^2\Big], \quad \mathbf{z}=E(\mathbf{x}').\]What makes it “representation-first” is that the bottleneck is intentional: \(\mathbf{z}\) is meant to be used (linear probe, control, editing), not merely observed.
12.3.2 Representation Learning with Diffusion Models
This line (often discussed under “latent-representation diffusion models”) proposes to train a diffusion model conditioned on a learned representation extracted by a separate encoder, and also introduces a tractable representation prior so unconditional generation is possible.
Because the paper’s core is architectural (encoder → representation → conditional diffusion), the central objective can be written in the standard conditional-denoising form:
\[\mathcal{L}_{\text{cond-diff}} =\mathbb{E}\Big[\big\|\epsilon-\epsilon_\theta(\mathbf{x}_t,t,\mathbf{r})\big\|_2^2\Big], \quad \mathbf{r}=E(\mathbf{x}).\]To make \(\mathbf{r}\) sampleable for unconditional generation, one typically regularizes it toward a tractable prior \(p(\mathbf{r})\) (the paper motivates such a prior). A common instantiation is a KL regularizer:
\[\mathcal{L}_{\text{prior}}=D_{\mathrm{KL}}\big(q(\mathbf{r}\mid \mathbf{x})\ \|\ p(\mathbf{r})\big), \quad \mathcal{L}=\mathcal{L}_{\text{cond-diff}}+\beta\,\mathcal{L}_{\text{prior}}.\]
12.3.3 REPA
REPA 32 explicitly improves diffusion models by aligning their internal representations with a strong (typically discriminative) representation prior. They train with velocity parameterization:
\[\mathcal{L}_{\text{velocity}} =\mathbb{E}\Big[\big\|\mathbf{v}-\hat{\mathbf{v}}_\theta(\mathbf{y}_t,t)\big\|_2^2\Big],\]and add a representation prior alignment term (cosine similarity form):
\[\mathcal{L}_{\text{REPA}} =-\mathbb{E}\left[ \frac{h_\phi(\mathbf{y}^*)^\top \cdot h_\theta(\mathbf{y}^*)}{\|h_\phi(\mathbf{y}^*)\|\ \|h_\theta(\mathbf{y}^*)\|} \right],\]finally combining them as
\[\mathcal{L}_{\text{total}} =\mathcal{L}_{\text{velocity}}+\lambda\,\mathcal{L}_{\text{REPA}}.\]
Interpretation
- \(\mathcal{L}_{\text{velocity}}\) ensures the model stays a valid diffusion generator.
- \(\mathcal{L}_{\text{REPA}}\) nudges the generator’s internal features toward a high-quality “prior” representation space.
- The result is a rare case where “better representations” are not just useful for downstream tasks—they are directly optimized as part of the generative training loop.
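A minimal sketch of the combined objective, assuming the diffusion model's patch features have already been projected to the prior encoder's dimension; the function name and tensor shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def repa_loss(v_pred, v_target, h_theta, h_phi, lam=0.5):
    """Velocity regression plus patch-wise cosine alignment with a frozen prior encoder.
    h_theta: projected diffusion features (B, N, D); h_phi: frozen prior features (B, N, D)."""
    l_vel = F.mse_loss(v_pred, v_target)
    l_align = -F.cosine_similarity(h_theta, h_phi, dim=-1).mean()   # maximize similarity
    return l_vel + lam * l_align

# toy usage
v_hat = torch.randn(2, 64, 32, requires_grad=True)
h_theta = torch.randn(2, 16, 128, requires_grad=True)   # diffusion features h_θ (projected)
h_phi = torch.randn(2, 16, 128)                         # frozen prior features h_φ
loss = repa_loss(v_hat, torch.randn(2, 64, 32), h_theta, h_phi)
loss.backward()
```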
13. GAN-Based Representation Learning
GANs learn via an adversarial game. Even when trained “for generation”, the discriminator often becomes a strong feature extractor, and many GAN variants explicitly include an encoder to produce representations.
13.1 Discriminator features as representations
DCGAN 33 is an early canonical result showing that GAN training can learn a hierarchy of features that transfer to downstream tasks. In practice, one can take intermediate discriminator activations as representations.
13.2 BiGAN / ALI: learning an inference network jointly
A limitation of standard GANs is that they map $z\to x$ but do not provide $x\to z$. BiGAN 34 / ALI 35 introduce an encoder $E(x)$ and train a discriminator on joint pairs:
- real joint: $(x, E(x))$
- fake joint: $(G(z), z)$
The discriminator tries to distinguish these joint samples, pushing the encoder and generator to match the joint distribution. The learned encoder output becomes a usable representation:
\[r(x)=E(x).\]
13.3 BigBiGAN: scaling GAN-based representations
BigBiGAN 36 scales this idea significantly, showing that advances in generation quality can translate into improved representation quality when an encoder is trained in the adversarial framework.
13.4 StyleGAN latent spaces: representations via inversion
StyleGAN-style 37 38 generators define intermediate latent spaces (often called $W$ or $W^+$) that are highly useful for semantic manipulation. However, for representation learning on real images, you usually need an inversion procedure:
\[w^*(x)=\arg\min_w \mathcal{D}(G(w), x) + \lambda\,\mathcal{R}(w),\]so the representation is not “free”—it requires solving an optimization or using a trained encoder/inverter. This makes it powerful for editing but less clean as a general-purpose representation backbone.
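A minimal sketch of optimization-based inversion under these assumptions: a differentiable `generator`, a pixel-space distance (real pipelines usually add perceptual losses such as LPIPS), and a simple quadratic regularizer.

```python
import torch
import torch.nn.functional as F

def invert(generator, x, w_init, steps=200, lr=0.05, reg=1e-3):
    """Find w such that G(w) ≈ x; the detached w* serves as the representation."""
    w = w_init.clone().requires_grad_(True)
    opt = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.mse_loss(generator(w), x) + reg * w.pow(2).mean()
        loss.backward()
        opt.step()                                   # only w is updated; G stays frozen
    return w.detach()

# toy usage with a linear stand-in "generator"
G = torch.nn.Linear(64, 3 * 32 * 32)
generator = lambda w: G(w).view(-1, 3, 32, 32)
w_star = invert(generator, torch.randn(1, 3, 32, 32), w_init=torch.zeros(1, 64))
```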
Part VI — Multimodal Representation Learning
Part V reframed representation learning through the lens of generation: we can learn features by modeling data distributions (VAE/AR/Diffusion/GAN) and then extract intermediate states as transferable representations.
In the VLM / MLLMs era, however, the most decisive leap comes from cross-modal supervision: instead of learning “what an image is” in isolation, we learn what an image means in language, and vice versa.
Multimodal representation learning can be understood as learning a set of encoders (or tokenizers) that map different modalities into a compatible latent space:
\[E^{(m)}: \mathcal{X}^{(m)} \rightarrow \mathcal{Z}, \quad m\in\{\text{image},\text{text},\text{audio},\text{video}\}.\]The key is not that all modalities become identical, but that their representations become mutually interpretable under shared computational primitives (similarity, attention, conditional prediction). This part focuses on the most influential image–text setting, where two dominant paradigms emerged:
- Alignment-based (dual-encoder): learn a shared embedding geometry via contrastive objectives so that paired image–text samples are close. This yields fast retrieval and strong zero-shot transfer, but grounding is limited.
- Fusion-based (cross-attention / encoder–decoder): learn joint contextual representations by fusing image tokens and text tokens, typically with multi-task objectives. This yields stronger compositional grounding and reasoning, but is computationally heavier.
These two paradigms are not mutually exclusive. Many modern systems adopt a hybrid: a fast dual-encoder for retrieval + a fusion model as a reranker / reasoning module. But keeping the distinction explicit is helpful, because it clarifies what kind of representation is learned—and what it is good for.
14. Alignment-Based
Alignment-based multimodal learning treats image–text pairs as positives and mismatched pairs as negatives. Let $(x_i, y_i)$ be an image–text pair. A dual-encoder model learns
\[v_i = f_\theta(x_i)\in\mathbb{R}^d,\qquad t_i = g_\phi(y_i)\in\mathbb{R}^d,\]and typically normalizes them:
\[\bar{v}_i=\frac{v_i}{\|v_i\|},\qquad \bar{t}_i=\frac{t_i}{\|t_i\|}.\]The central object is the similarity matrix:
\[S_{ij}=\frac{\bar{v}_i^\top \bar{t}_j}{\tau}.\]A canonical CLIP-style symmetric contrastive objective is:
\[\mathcal{L}_{\text{align}} = \frac{1}{2} \left( \underbrace{ -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp(S_{ii})}{\sum_{j=1}^{N}\exp(S_{ij})} }_{\text{image}\rightarrow\text{text}} + \underbrace{ -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp(S_{ii})}{\sum_{j=1}^{N}\exp(S_{ji})} }_{\text{text}\rightarrow\text{image}} \right).\]This is exactly the “softmax classification among candidates” view from Part II (Sec. 1.1), except that the candidates are cross-modal.
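A compact sketch of this symmetric loss; the fixed `temperature` (rather than a learnable logit scale) is a simplification of the full CLIP recipe.

```python
import torch
import torch.nn.functional as F

def clip_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired image/text embeddings of shape (B, d)."""
    v = F.normalize(img_emb, dim=-1)
    t = F.normalize(txt_emb, dim=-1)
    logits = v @ t.T / temperature                      # S_ij
    labels = torch.arange(v.size(0), device=v.device)   # positives on the diagonal
    return 0.5 * (F.cross_entropy(logits, labels)       # image -> text
                  + F.cross_entropy(logits.T, labels))  # text -> image

loss = clip_loss(torch.randn(8, 512), torch.randn(8, 512))
```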
The representation learned here is fundamentally a shared metric space: the model is optimized so that dot-product similarity becomes a reliable proxy for semantic correspondence.
What this representation is good at
- Cross-modal retrieval: nearest-neighbor search in embedding space.
- Zero-shot recognition: treat class names (with prompts) as text queries and classify by similarity.
- As a reusable “perception front-end”: a stable global embedding interface for downstream multimodal systems.
What it struggles with
- Fine-grained grounding and compositional reasoning: alignment alone does not force the model to localize which region supports which phrase.
- Hard negatives and “false negatives”: in-batch negatives may include semantically related samples, which can distort geometry. Many follow-ups (data curation, better loss forms, better batching) can be seen as mitigating this tension.
Below we walk through three representative checkpoints: CLIP as the archetype, ALBEF as a hybrid that adds fusion and distillation, and SigLIP as a loss-level redesign improving scaling behavior.
14.1 CLIP
CLIP 39 established a clean recipe: learn image and text encoders jointly using large-scale image–text pairs, with a symmetric contrastive loss. The architectural choice is deliberately simple:
- Image tower: a CNN or ViT encoder produces a single global embedding.
- Text tower: a Transformer text encoder produces a single global embedding (often the [EOS] / pooled token).
The training signal is entirely pairwise correspondence: the model learns that the correct caption should be the most similar text among candidates for an image (and vice versa). Once trained, CLIP can be used as a zero-shot classifier by turning labels into prompts:
- Construct prompts for each class $c$ (e.g., “a photo of a {c}”).
- Encode each prompt to get text embeddings $\{\bar{t}_c\}$.
- For an image $x$, compute $\bar{v}=f(x)$ and predict the class whose prompt embedding is most similar:
\[\hat{c}=\arg\max_{c}\ \bar{v}^\top \bar{t}_c.\]
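A sketch of this zero-shot recipe; `image_encoder`, `text_encoder`, and `tokenizer` are assumed callables standing in for the two CLIP towers.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(image_encoder, text_encoder, tokenizer, image, class_names):
    """Classification by retrieval: pick the class whose prompt embedding is closest."""
    prompts = [f"a photo of a {c}" for c in class_names]
    t = F.normalize(text_encoder(tokenizer(prompts)), dim=-1)   # (C, d) class embeddings
    v = F.normalize(image_encoder(image), dim=-1)               # (1, d) image embedding
    scores = (v @ t.T).softmax(dim=-1)                          # similarity -> class scores
    return class_names[scores.argmax(dim=-1).item()]
```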
This “classification by retrieval” viewpoint is conceptually important: CLIP’s representation is not trained to predict a closed-set label distribution; it is trained to build a shared semantic geometry where language can act as a query interface.
CLIP’s influence also clarified a broader lesson: cross-modal alignment is not merely a multimodal trick; it is a scalable route to universal supervision. If the text side is diverse enough, the representation naturally becomes more transferable.
(Several influential scaling-oriented variants can be understood as staying within the same paradigm but pushing one axis: data scale and noise tolerance 40, training strategy and locked-tuning 41, better batching/reproducibility 42, finer-grained matching 43.)
14.2 ALBEF
While CLIP-style alignment is powerful, it is also limited: it primarily learns global correspondence, and it relies heavily on large batches / large candidate sets. ALBEF 44 addresses these limitations by explicitly adopting a principle:
Align before fuse: learn a good global alignment first, then learn cross-modal fusion with stronger supervision signals.
ALBEF introduces a two-stage representation structure within one framework:
- Alignment module (dual-encoder): produce global image/text embeddings and optimize an image–text contrastive (ITC) objective, similar in spirit to CLIP.
- Fusion module (cross-attention encoder): fuse image tokens and text tokens to produce a joint representation and optimize objectives that require deeper interaction.
A typical objective set in ALBEF includes:
- Image–Text Contrastive (ITC): builds a shared embedding geometry (fast retrieval-friendly).
- Image–Text Matching (ITM): a binary classifier that predicts whether an image–text pair is matched, but crucially this classifier operates on fused representations.
- Masked Language Modeling (MLM) (in the fusion encoder): forces token-level grounding by predicting masked words conditioned on image content.
A defining ingredient is momentum distillation. Similar to the momentum teacher idea in Part III (e.g., BYOL / DINO), ALBEF maintains momentum encoders to generate soft targets that stabilize training and improve data efficiency, particularly under noisy web-scale supervision. Concretely, the momentum model provides softened similarity / matching signals, and the online model is trained to match them in addition to hard labels. This distillation can be interpreted as injecting a smoother geometry prior into training, reducing brittleness from noisy pairs.
From a representation-learning perspective, ALBEF is important because it makes the trade-off explicit:
- dual-encoder representations give scalable alignment and fast retrieval,
- fusion representations give grounded token-level reasoning.
This “two representations, two jobs” mirrors the projector idea in Part II (Sec. 1.4), but now the split happens across modalities and interaction depth rather than within a single encoder.
14.3 SigLIP
A subtle but practically critical bottleneck in CLIP-style training is the softmax normalization over candidates. InfoNCE treats each batch as a classification problem: one positive must beat all negatives. This makes optimization heavily dependent on batch composition and can create undesirable coupling across samples.
SigLIP 45 replaces the softmax-based contrastive loss with a sigmoid (logistic) loss applied to image–text pairs. Instead of normalizing over the batch, SigLIP treats each pair independently with a binary objective.
Let $s_{ij}=\bar{v}_i^\top \bar{t}_j$ be similarity (optionally scaled). Define a label matrix:
\[y_{ij}= \begin{cases} +1, & i=j\\ -1, & i\neq j \end{cases}\]Then a sigmoid loss takes the form:
\[\mathcal{L}_{\text{sig}} = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N} \log\left(1+\exp\left(-y_{ij}\cdot s_{ij}\right)\right).\]Key consequence: no softmax competition. Each positive is encouraged to have high similarity, each negative low similarity, but there is no global partition function tying all logits together. This seemingly small change has outsized effects in scaling regimes: it reduces sensitivity to batch size and can improve stability when the candidate set becomes extremely large or heterogeneous.
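A minimal sketch of this pairwise objective; in the actual method the scale `t` and bias `b` are learnable scalars, fixed here for brevity.

```python
import torch
import torch.nn.functional as F

def siglip_loss(img_emb, txt_emb, t=10.0, b=-10.0):
    """Sigmoid loss over the B x B similarity matrix: +1 targets on the diagonal, -1 elsewhere."""
    v = F.normalize(img_emb, dim=-1)
    u = F.normalize(txt_emb, dim=-1)
    logits = v @ u.T * t + b                                   # scaled, shifted s_ij
    labels = 2 * torch.eye(v.size(0), device=v.device) - 1     # y_ij
    return F.softplus(-labels * logits).mean()                 # log(1 + exp(-y_ij * s_ij))

loss = siglip_loss(torch.randn(8, 512), torch.randn(8, 512))
```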
In the storyline of this post, SigLIP highlights an important theme: many “new” multimodal representation methods are not new because they changed the encoder architecture, but because they changed the optimization geometry—exactly the kind of lens we used in Part II when discussing temperature and hard negatives.
15. Fusion-Based
Fusion-based multimodal learning does not stop at “are these two samples aligned in a metric space?”.
Instead, it learns representations by token-level interaction between modalities, typically via cross-attention.
Let an image be encoded into patch tokens
\[V = [v_1,\dots,v_M]\in\mathbb{R}^{M\times d},\]and text into token embeddings
\[T = [t_1,\dots,t_L]\in\mathbb{R}^{L\times d}.\]A fusion model learns joint contextual representations through cross-attention blocks:
\[H = \mathrm{Fusion}(V,T),\]where $H$ can be:
- a fused sequence used for discrimination (ITM, VQA-style objectives),
- a hidden state used for generation (captioning / language modeling),
- or both.
Compared with dual-encoders, fusion models are more expensive at inference (since they must process a pair jointly), but they can produce stronger grounded representations: the model must learn which visual tokens support which textual tokens to succeed.
Fusion-based pretraining typically uses multi-task objectives, most commonly a combination of:
- contrastive alignment (for global semantics and retrieval),
- matching / discrimination (for grounding),
- generation (captioning / language modeling) which forces rich multimodal conditioning.
This category is also a natural bridge between “representation learning” and “understanding + generation unification” discussed in Sec. 0.2, because fusion models often expose representations that are simultaneously useful for retrieval and for text generation conditioned on images.
15.1 BLIP
BLIP 46 (Bootstrapping Language-Image Pre-training) is a representative fusion-based foundation because it makes unification explicit: it is designed for both vision-language understanding and vision-language generation.
BLIP’s core idea is that web-scale image–text pairs are abundant but noisy. Instead of treating the dataset as fixed, BLIP introduces bootstrapping:
- use a captioning model to generate better captions for images,
- use a filtering model to remove noisy mismatched pairs,
- then train a unified model on the improved pairs.
Architecturally, BLIP uses a multimodal mixture of encoder/decoder behaviors (often referred to as a “MED” design): the same Transformer can be configured as
- a text encoder (bidirectional) for understanding,
- a text decoder (causal) for generation,
- and a multimodal encoder with cross-attention for fusion.
A typical training objective bundle includes:
- Image–Text Contrastive (ITC): provides global alignment and retrieval capability.
- Image–Text Matching (ITM): a fusion-based binary classification objective.
- Language Modeling (LM) / Captioning: generate text conditioned on the image, forcing the model to preserve richer semantics and details.
From a representation perspective, BLIP yields multiple useful “interfaces”:
- a CLIP-like global embedding (from ITC) for retrieval,
- fused token representations (from ITM / multimodal encoder) for grounded reasoning,
- decoder hidden states that naturally support generation.
This multi-interface property is exactly why BLIP-style designs became a stepping stone toward modern VLMs/MLLMs (e.g., bootstrapping stronger instruction-following variants 47 and building frozen-encoder + LLM hybrids 48, although those belong to a broader “system integration” story beyond this Part VI).
15.2 CoCa
CoCa 49 (“Contrastive Captioners are Image-Text Foundation Models”) presents a particularly clean unification: a single model is trained to be both
- a contrastive aligner (like CLIP),
- and a captioning generator (like an image-conditioned language model).
CoCa’s training combines two complementary losses:
Contrastive loss on global embeddings (alignment / retrieval):
\[\mathcal{L}_{\text{ctr}} \ \text{(CLIP-style)}\]Captioning loss (conditional language modeling):
\[\mathcal{L}_{\text{cap}} = -\sum_{t} \log p_\theta(w_t \mid w_{<t}, x).\]
The key representational insight is that these two losses push the model toward a “sweet spot” discussed in Sec. 0.2:
- the contrastive objective encourages semantic invariance and global geometry (useful for transfer),
- the captioning objective encourages semantic completeness and grounded detail (useful for generation and compositional understanding).
As a result, CoCa can be viewed as a principled bridge between alignment-based and fusion-based paradigms: it retains the retrieval-friendly embedding space while also learning a fusion mechanism through the captioning pathway.
Closing perspective.
Alignment-based models make multimodal learning scalable by reducing cross-modal supervision to geometry. Fusion-based models make it grounded by forcing token-level interaction and often incorporating generative objectives. Together, they form the representational backbone of the VLM/MLLM era: retrieval interfaces, grounding interfaces, and generation interfaces all emerge from how representations are trained and structured.
In the next part (Part VII), we move one step further toward a “representation-first” philosophy: instead of predicting pixels/tokens or matching explicit pairs, Joint-Embedding Predictive Learning (JEPA) predicts targets directly in representation space (e.g., I-JEPA 50, V-JEPA 51). This shift can be read as an attempt to keep the benefits of masked prediction while avoiding low-level reconstruction pressure—pushing representation learning closer to semantic world modeling.
Part VII — Joint-Embedding Predictive Learning
The multimodal story in Part VI focused on how representations from different modalities are aligned (dual encoders) or fused (cross-attention / Q-Former / Perceiver-style). This part shifts the question: what objective should we use to learn a representation that captures semantics while discarding unpredictable low-level details?
A particularly influential answer is the Joint-Embedding Predictive Architecture (JEPA) viewpoint: learn by predicting representations (embeddings) of missing / transformed / future content from the representations of observed content—without reconstructing pixels, and without using explicit negatives. In its modern instantiations, JEPA typically combines:
- a context encoder that processes visible content,
- a target encoder (often momentum-updated) that produces stable targets,
- a predictor that maps context representations to target representations, conditioned on where / what to predict (mask tokens, positional tokens, or time offsets),
- an embedding-space regression loss (often L2 or SmoothL1) to match predicted and target embeddings.
The “predict in latent space” decision is the key: it lets the model allocate capacity to predictable, semantic structure, while allowing nuisance details (texture noise, lighting quirks, stochastic motion) to be ignored if they are not necessary for downstream discrimination.
Formally, let a sample produce a context $x$ and a target $y$ from the same underlying instance (image / video), along with side information $r$ describing the relationship between them (mask geometry, target locations, time offset):
\[s_x=f_\theta(x),\qquad s_y=f_{\bar\theta}(y),\qquad \hat s_y=g_\psi(s_x, r).\]The training objective is a representation-matching loss:
\[\mathcal{L}_{\text{JEPA}}=\mathbb{E}\big[D(\hat s_y,\ \text{stopgrad}(s_y))\big],\]with \(f_{\bar\theta}\) typically updated by EMA:
\[\bar\theta \leftarrow m\bar\theta + (1-m)\theta.\]In the remainder of this part we focus on two canonical, “clean” instantiations of JEPA in vision: I-JEPA for images and V-JEPA for videos.
16. I-JEPA
I-JEPA 50 makes the JEPA principle concrete for images with a minimal recipe: predict the representations of masked target blocks from a sparse context block, entirely in embedding space.
The conceptual stance is important: unlike masked autoencoding (MAE/BEiT), I-JEPA is non-generative—it does not aim to reconstruct pixels or discrete tokens. Unlike contrastive alignment (SimCLR/CLIP), it avoids explicit negatives and does not require multi-view positive pairs. Instead, the supervision comes from the structure of the image itself: if a representation is meaningful, it should be possible to infer missing semantics from visible context.
16.1 Prediction in representation space
The “feature prediction” choice resolves a tension that repeatedly appears in self-supervised learning:
- Pixel reconstruction forces the model to preserve everything, including high-frequency detail that may be irrelevant for semantics.
- Contrastive invariance forces the model to throw away information that changes under augmentations, which can be brittle if augmentations are poorly matched to the downstream task.
- JEPA/I-JEPA targets a middle ground: learn a representation that is predictive of missing content at the level of abstractions encoded by the target encoder.
This framing also makes the goal explicit: we are not learning to generate images; we are learning an embedding space where semantic completion is possible.
16.2 Architecture: context encoder, target encoder, predictor
I-JEPA uses ViTs for all components, but with an asymmetric role split:
- Target encoder \(f_{\bar\theta}\): processes the image into patch-level features and provides the regression targets.
- Context encoder \(f_\theta\): processes only the visible context (efficient, MAE-style).
- Predictor \(g_\psi\): a narrow ViT that takes context features and predicts the target block features, conditioned on target locations via mask tokens + positional embeddings.
Concretely, for an image \(y\) split into $N$ patches, the target encoder yields patch-level representations:
\[s_y=\{s_{y1},\ldots,s_{yN}\}.\]Target blocks are defined by patch index sets \(B_i\subset{1,\ldots,N}\), so the target for block $i$ is
\[s_y^{(i)}=\{s_{yj}\}_{j\in B_i}.\]The context block corresponds to another index set \(B_x\), producing:
\[s_x=\{s_{xj}\}_{j\in B_x}.\]The predictor takes $s_x$ and a set of learnable mask tokens (one per target patch, with position added) to output predictions \(\hat s_y^{(i)}\) for the target block.
16.3 Masking strategy: semantic targets, sparse but informative context
The masking scheme is not an implementation detail; it defines the task difficulty and the level of semantics learned. I-JEPA samples:
- Target blocks: typically multiple blocks per image (e.g., $M=4$), each at moderate scale, with randomized aspect ratios. These blocks should be large enough to be semantically meaningful rather than texture-level noise.
- Context block: a single large block (high scale); overlaps with the target blocks are then removed to ensure a non-trivial prediction problem.
A subtle but crucial design choice is that targets are masked on the output of the target encoder, not on the input image. This encourages target features to remain semantically rich even when only a subset of patches is used as the supervision signal.
16.4 Objective and optimization
The loss is a simple average L2 regression over target patches:
\[\mathcal{L}_{\text{I-JEPA}} = \frac{1}{M}\sum_{i=1}^{M} D\left(\hat s_y^{(i)}, s_y^{(i)}\right), \qquad D\left(\hat s_y^{(i)}, s_y^{(i)}\right) = \sum_{j\in B_i}\|\hat s_{yj}-s_{yj}\|_2^2.\]Only the context encoder and predictor receive gradient updates; the target encoder is updated by EMA. This is the same stabilization pattern that repeatedly appears in non-contrastive self-distillation (BYOL/DINO-style), now used to make feature prediction stable at scale.
16.5 Training pipeline (end-to-end)
A compact view of the training loop:
- Patchify an image into $N$ tokens (ViT patch embedding).
- Target features: run the full image through the target encoder \(f_{\bar\theta}\) to obtain \(s_y\).
- Sample targets: draw $M$ target masks \(\{B_i\}_{i=1}^M\) and extract \(s_y^{(i)}\).
- Sample context: draw context mask \(B_x\), remove overlap with \(\{B_i\}\), and run the masked image through context encoder \(f_\theta\) to obtain \(s_x\).
- Predict: for each target block $i$, run the predictor \(g_\psi(s_x,\{m_j\}_{j\in B_i})\) to obtain \(\hat s_y^{(i)}\).
- Loss: compute \(\mathcal{L}_{\text{I-JEPA}}\) as embedding-space regression.
- Update \(\theta,\psi\) by backprop; update \(\bar\theta\) by EMA.
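A minimal sketch of one such step on pre-patchified tokens; the stand-in encoders, the `predictor` interface, and the index sets are illustrative, and the EMA update (Sec. 16.4) is omitted.

```python
import torch
import torch.nn.functional as F

def ijepa_step(context_enc, target_enc, predictor, patches, ctx_idx, tgt_blocks):
    """One I-JEPA step. patches: (B, N, D) tokens; ctx_idx: visible-patch indices;
    tgt_blocks: list of index tensors B_i. All modules are stand-in callables."""
    with torch.no_grad():
        s_y = target_enc(patches)                 # full-image target features
    s_x = context_enc(patches[:, ctx_idx])        # encode only the visible context
    loss = 0.0
    for B_i in tgt_blocks:
        s_hat_i = predictor(s_x, B_i)             # predict block i from context + positions
        loss = loss + F.mse_loss(s_hat_i, s_y[:, B_i])
    return loss / len(tgt_blocks)

# toy usage (the predictor here just broadcasts a pooled context feature)
D = 64
context_enc, target_enc = torch.nn.Linear(D, D), torch.nn.Linear(D, D)
predictor = lambda s_x, idx: s_x.mean(dim=1, keepdim=True).expand(-1, len(idx), -1)
patches = torch.randn(2, 196, D)
loss = ijepa_step(context_enc, target_enc, predictor, patches,
                  torch.arange(0, 120), [torch.arange(150, 166), torch.arange(170, 186)])
loss.backward()   # gradient step on context encoder/predictor; the EMA update follows
```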
16.6 What representation is used downstream?
I-JEPA produces patch-level features naturally. For downstream usage, the common pattern is to treat the context encoder as the backbone:
- Global representation: pool patch tokens (mean pooling or attentive pooling) into a vector.
- Dense representation: use patch tokens directly for detection/segmentation-style heads.
- Transfer protocols: linear probing, attentive probing, or full fine-tuning depending on the task.
Note the practical implication: because the pretraining objective is unnormalized regression, downstream performance may benefit from learnable pooling or lightweight adapters rather than naive averaging, especially when frozen evaluation is required (this theme becomes explicit in V-JEPA).
16.7 Relation to MAE / distillation / “non-generative semantics”
I-JEPA sits in a very specific niche:
- Like MAE, it is masking-based and processes only visible tokens in the main encoder (efficiency).
- Unlike MAE, it predicts representations, not pixels—so it is not trained to be a generator.
- Like BYOL/DINO, it uses a momentum teacher to stabilize training without negatives.
- Unlike DINO, the supervision is not a probability distribution matching; it is direct feature regression tied to where a target block is in the same image.
This combination makes I-JEPA a canonical example of “semantic completion without generation”: the model learns to infer missing content at the level of abstractions encoded by a learned target network.
17. V-JEPA
If images teach semantics via spatial context, videos add an additional supervisory signal that is difficult to obtain from static data: temporal coherence and dynamics. V-JEPA 51 extends the JEPA/I-JEPA idea to video by learning through masked spatio-temporal feature prediction—again without pixel reconstruction, without text, and without negatives.
17.1 From spatial completion to spatio-temporal prediction
The core assumption is simple: a useful video representation should support predicting missing regions that are consistent with:
- objects (appearance and identity across time),
- motion (how things change),
- occlusion and persistence (objects remain coherent when partially hidden).
V-JEPA operationalizes this via masked modeling, but predicts in feature space instead of pixels. This avoids spending capacity on low-level details that are ambiguous in video (e.g., stochastic textures, background flicker), while still forcing the model to encode the structure needed to make coherent predictions.
17.2 Architecture and parameterization
V-JEPA uses a ViT-style backbone adapted to video by flattening a clip into spatio-temporal tokens (tubelets). As in I-JEPA:
- a context encoder processes visible tokens,
- a target encoder (EMA) provides target token representations,
- a predictor (a narrow ViT) maps context features to predicted target features.
Two additional design points matter in video:
- No [CLS] token during pretraining: the network is trained to represent content at the token level; global representations are formed later via pooling/probing.
- Predictor as a capacity bottleneck: the predictor is kept narrower than the backbone, encouraging the backbone to carry the heavy representational load, while the predictor handles “how to fill” a given mask geometry.
17.3 Masking: tubes, multi-block, and multi-mask efficiency
V-JEPA studies masking strategies that define what kind of temporal reasoning the backbone must learn:
- Random tubes: mask spatial patches extended through time (forces spatial inference, weaker temporal supervision).
- Causal multi-block: restrict visible context to early frames and mask later regions (encourages forward prediction).
- Multi-block (full clip): mask multiple spatio-temporal blocks across the clip, requiring joint spatial-temporal completion.
For efficiency, V-JEPA also uses multi-mask prediction: sample multiple masks (e.g., short-range and long-range) for the same clip so the target representation can be amortized, while still training the predictor/context pathway across different prediction problems.
17.4 Objective and evaluation implications (why attentive probing appears)
At the objective level, V-JEPA remains “pure JEPA”: feature regression from visible region to masked region, using a stable target encoder.
A practical consequence is that, because the loss is unnormalized feature regression, a linearly separable global subspace is not guaranteed. This motivates evaluation protocols that do not assume “mean pool + linear head” is always optimal.
V-JEPA therefore emphasizes attentive probing as a strong frozen-evaluation protocol: use a learnable query token and cross-attention to pool token features before classification. This keeps the backbone frozen while allowing a small amount of task-adaptive pooling capacity.
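A minimal sketch of such a probe, assuming frozen token features of dimension `dim`; head sizes and token counts are illustrative.

```python
import torch
import torch.nn as nn

class AttentiveProbe(nn.Module):
    """Frozen-backbone evaluation head: a learnable query cross-attends over token
    features, and a linear classifier reads out the pooled vector."""
    def __init__(self, dim: int, num_classes: int, num_heads: int = 8):
        super().__init__()
        self.query = nn.Parameter(torch.zeros(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, tokens):                       # tokens: (B, N, D) frozen features
        q = self.query.expand(tokens.size(0), -1, -1)
        pooled, _ = self.attn(q, tokens, tokens)     # cross-attention pooling
        return self.head(pooled.squeeze(1))

probe = AttentiveProbe(dim=384, num_classes=400)
logits = probe(torch.randn(2, 1568, 384))            # e.g. tubelet tokens from a clip
```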
17.5 Training pipeline (end-to-end)
A high-level training loop mirrors I-JEPA, with video-specific masking:
- Sample a clip (e.g., \(T\) frames with stride) and tokenize into tubelet patches.
- Run the clip through the target encoder to obtain token features \(s_y\).
- Sample spatio-temporal masks (often multiple masks per clip); define visible tokens (context) and masked tokens (targets).
- Run visible tokens through the context encoder to obtain \(s_x\).
- Run the predictor conditioned on the mask geometry / target positions to obtain predicted token features \(\hat s_y\) for masked regions.
- Regress \(\hat s_y\) to \(\text{stopgrad}(s_y)\) over masked indices.
- Update context encoder + predictor by gradient; update target encoder by EMA.
17.6 Grounding feature prediction (making predictions interpretable)
A recurring critique of “predict in latent space” is interpretability: what exactly is being predicted? V-JEPA addresses this empirically by attaching a conditional diffusion decoder on top of predicted features (with the encoder/predictor frozen) to visualize what the predicted embeddings correspond to in pixel space. Qualitative results show that shared structure across multiple decoded samples reflects the deterministic content captured by predicted embeddings, while variation reflects uncertainty—an intuitive demonstration of why feature prediction can focus on stable semantics without committing to a single pixel-level future.
17.7 Why V-JEPA matters for the “understanding-first” backbone
V-JEPA can be read as a strong statement about representation learning priorities:
- Video provides supervision that images cannot: motion concepts and temporal causality signals emerge naturally from predictive objectives.
- Feature prediction is competitive while being efficient: learning to predict representations can reduce the burden of modeling every pixel detail, enabling shorter schedules and strong frozen-backbone transfer.
- JEPA scales as a backbone philosophy: once a backbone encodes “what is predictable about the world,” it becomes a strong substrate for downstream heads (classification, localization) and for coupling with generative decoders when generation is required.
18. References
Chen T, Kornblith S, Norouzi M, et al. A Simple Framework for Contrastive Learning of Visual Representations[C]//Proceedings of the 37th International Conference on Machine Learning (ICML). PMLR 119, 2020: 1597-1607. ↩ ↩2
Xue Y, Gan E, Ni J, et al. Investigating the benefits of projection head for representation learning[J]. arXiv preprint arXiv:2403.11391, 2024. ↩
Gupta K, Ajanthan T, Hengel A, et al. Understanding and improving the role of projection head in self-supervised learning[J]. arXiv preprint arXiv:2212.11491, 2022. ↩
He K, Fan H, Wu Y, et al. Momentum Contrast for Unsupervised Visual Representation Learning[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2020: 9729-9738. ↩
Chen X, Fan H, Girshick R, et al. Improved baselines with momentum contrastive learning[J]. arXiv preprint arXiv:2003.04297, 2020. ↩
Grill J B, Strub F, Altché F, et al. Bootstrap your own latent-a new approach to self-supervised learning[J]. Advances in neural information processing systems, 2020, 33: 21271-21284. ↩
Chen X, He K. Exploring simple siamese representation learning[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2021: 15750-15758. ↩
Tian Y, Chen X, Ganguli S. Understanding self-supervised learning dynamics without contrastive pairs[C]//International Conference on Machine Learning. PMLR, 2021: 10268-10278. ↩
Caron M, Touvron H, Misra I, et al. Emerging Properties in Self-Supervised Vision Transformers[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). 2021: 9650-9660. ↩ ↩2
Oquab M, Darcet T, Moutakanni T, et al. Dinov2: Learning robust visual features without supervision[J]. arXiv preprint arXiv:2304.07193, 2023. ↩
Siméoni O, Vo H V, Seitzer M, et al. Dinov3[J]. arXiv preprint arXiv:2508.10104, 2025. ↩
Caron M, Misra I, Mairal J, et al. Unsupervised learning of visual features by contrasting cluster assignments[J]. Advances in neural information processing systems, 2020, 33: 9912-9924. ↩
Caron M, Bojanowski P, Joulin A, et al. Deep clustering for unsupervised learning of visual features[C]//Proceedings of the European conference on computer vision (ECCV). 2018: 132-149. ↩
Zbontar J, Jing L, Misra I, et al. Barlow Twins: Self-Supervised Learning via Redundancy Reduction[C]//ICML. 2021. ↩
Bardes A, Ponce J, LeCun Y, et al. VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning[C]//ICLR. 2022. ↩
Xie Z, Zhang Z, Cao Y, et al. Simmim: A simple framework for masked image modeling[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2022: 9653-9663. ↩
He K, Chen X, Xie S, et al. Masked Autoencoders Are Scalable Vision Learners[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2022: 16000-16009. ↩
Bao H, Dong L, Piao S, et al. BEiT: BERT Pre-Training of Image Transformers[C]//International Conference on Learning Representations (ICLR). 2022. ↩
Peng Z, Dong L, Bao H, et al. BEiT v2: Masked Image Modeling with Vector-Quantized Visual Tokenizers. arXiv / OpenReview (ICLR submission), 2022. ↩
Dong X, Bao J, Zhang T, et al. PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers. AAAI, 2023. (arXiv:2111.12710) ↩
Baevski A, Hsu W-N, Xu Q, et al. data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language. ICML, 2022. (arXiv:2202.03555) ↩
Baevski A, Babu A, Hsu W-N, Auli M. Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language (data2vec 2.0). arXiv / OpenReview, 2022–2023. (arXiv:2212.07525) ↩
Wei C, Fan H, Xie S, Wu C-Y, Yuille A, Feichtenhofer C. Masked Feature Prediction for Self-Supervised Visual Pre-Training. CVPR, 2022. (arXiv:2112.09133) ↩
Higgins I, Matthey L, Pal A, et al. beta-vae: Learning basic visual concepts with a constrained variational framework[C]//International conference on learning representations. 2017. ↩
Van Den Oord A, Kalchbrenner N, Kavukcuoglu K. Pixel recurrent neural networks[C]//International conference on machine learning. PMLR, 2016: 1747-1756. ↩ ↩2
Chen M, Radford A, Child R, et al. Generative pretraining from pixels[C]//International conference on machine learning. PMLR, 2020: 1691-1703. ↩
Tang L, Jia M, Wang Q, et al. Emergent correspondence from image diffusion[J]. Advances in Neural Information Processing Systems, 2023, 36: 1363-1389. ↩
Luo G, Dunlap L, Park D H, et al. Diffusion hyperfeatures: Searching through time and space for semantic correspondence[J]. Advances in Neural Information Processing Systems, 2023, 36: 47500-47510. ↩
Yang X, Wang X. Diffusion model as representation learner[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023: 18938-18949. ↩
Li D, Ling H, Kar A, et al. Dreamteacher: Pretraining image backbones with deep generative models[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023: 16698-16708. ↩
Hudson D A, Zoran D, Malinowski M, et al. Soda: Bottleneck diffusion models for representation learning[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024: 23115-23127. ↩
Yu S, Kwak S, Jang H, et al. Representation alignment for generation: Training diffusion transformers is easier than you think[J]. arXiv preprint arXiv:2410.06940, 2024. ↩
Radford A, Metz L, Chintala S. Unsupervised representation learning with deep convolutional generative adversarial networks[J]. arXiv preprint arXiv:1511.06434, 2015. ↩
Donahue J, Krähenbühl P, Darrell T. Adversarial feature learning[J]. arXiv preprint arXiv:1605.09782, 2016. ↩
Dumoulin V, Belghazi I, Poole B, et al. Adversarially learned inference[J]. arXiv preprint arXiv:1606.00704, 2016. ↩
Donahue J, Simonyan K. Large scale adversarial representation learning[J]. Advances in neural information processing systems, 2019, 32. ↩
Karras T, Laine S, Aila T. A style-based generator architecture for generative adversarial networks[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2019: 4401-4410. ↩
Karras T, Laine S, Aittala M, et al. Analyzing and improving the image quality of stylegan[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2020: 8110-8119. ↩
Radford A, Kim J W, Hallacy C, et al. Learning transferable visual models from natural language supervision[J]. Proceedings of the 38th International Conference on Machine Learning, PMLR, 2021, 139: 8748-8763. ↩
Jia C, Yang Y, Xia Y, et al. Scaling up visual and vision-language representation learning with noisy text supervision[J]. Proceedings of the 38th International Conference on Machine Learning, PMLR, 2021, 139: 4904-4916. ↩
Zhai X, Wang X, Mustafa B, et al. Lit: Zero-shot transfer with locked-image text tuning[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2022: 18123-18133. ↩
Cherti M, Beaumont R, Wightman R, et al. Reproducible scaling laws for contrastive language-image learning[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2023: 2818-2829. ↩
Yao L, Huang R, Hou L, et al. Filip: Fine-grained interactive language-image pre-training[J]. arXiv preprint arXiv:2111.07783, 2021. ↩
Li J, Selvaraju R, Gotmare A, et al. Align before fuse: Vision and language representation learning with momentum distillation[J]. Advances in neural information processing systems, 2021, 34: 9694-9705. ↩
Zhai X, Mustafa B, Kolesnikov A, et al. Sigmoid loss for language image pre-training[C]//Proceedings of the IEEE/CVF international conference on computer vision. 2023: 11975-11986. ↩
Li J, Li D, Xiong C, et al. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation[C]//International conference on machine learning. PMLR, 2022: 12888-12900. ↩
Dai W, Li J, Li D, et al. InstructBLIP: Towards general-purpose vision-language models with instruction tuning[J]. Advances in neural information processing systems, 2023, 36: 49250-49267. ↩
Li J, Li D, Savarese S, Hoi S. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models[C]//ICML. 2023. ↩
Yu J, Wang Z, Vasudevan V, et al. CoCa: Contrastive captioners are image-text foundation models[J]. Transactions on Machine Learning Research, 2022. ↩
Assran M, Duval Q, Misra I, et al. Self-Supervised Learning From Images With a Joint-Embedding Predictive Architecture[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2023: 15619-15629. ↩ ↩2
Bardes A, Garrido Q, Ponce J, et al. Revisiting Feature Prediction for Learning Visual Representations from Video[EB/OL]. arXiv:2404.08471, 2024. ↩ ↩2
