Cola DLM

PAGE 01 · ABSTRACT + §1

Why break free of left-to-right?

Autoregressive (AR) language models factorize text with the chain rule and predict one token at a time, left to right. That is enormously effective — but it welds generation to a single token ordering, makes inference inherently sequential, and bakes in a strong hand-crafted inductive bias. The paper's opening claim: high-quality language generation does not require a fixed order, and need not be defined by recovering tokens at all.

A · INTUITION

Three goals nobody hits at once

Existing paradigms each sacrifice one of: generation efficiency, scalable representation learning, and global semantic modeling. The paper's whole motivation is to get all three together.

A · THE PIVOT

Diffusion as prior transport

The key reframing: use diffusion not for token-level observation recovery, but to transport a latent prior. Global semantic organization happens in continuous latent space; local word realization is delegated to a decoder.

A · NAME

Continuous Latent Diffusion Language Model. A hierarchical latent-variable language model: Text VAE → block-causal DiT prior → conditional decoder.

C · LIVE · Two paradigms, side by side

AR (top): token i can only attend to tokens <i. Strictly sequential — generation depth = sequence length.

Cola DLM (bottom): first transport a noise seed into a global semantic latent, then realize all words in parallel through the decoder.

B · THE THESIS, IN ONE LINE

From a unified Markov-path perspective, the diffusion process performs latent prior transport rather than token-level observation recovery — separating global semantic organization from local textual realization. The generative model is just:

$$ p(x)=\int \underbrace{p_\theta(x\mid z_0)}_{\text{decoder: realize words}}\;\underbrace{p_\psi(z_0)}_{\text{prior: global semantics}}\;dz_0 $$

p(x)	the probability the model assigns to a piece of text x.
z₀	the continuous latent — a "meaning vector" / global semantic plan.
p_θ(x|z₀)	the decoder: how likely this meaning is realized as exactly this text.
p_ψ(z₀)	the prior: how plausible that meaning is in the first place.
∫ … dz₀	marginalize — sum over all possible meanings.

In plain English: "The probability of a sentence x = sum, over every possible meaning-vector z₀, of (how likely that meaning is) × (how likely the decoder turns that meaning into exactly this sentence)." Meaning is chosen first; words come second.

PAGE 02 · §1 CONTRIBUTIONS

A new paradigm: hierarchical information decomposition

Cola DLM first learns a stable text↔latent mapping with a Text VAE, then models a global semantic prior in continuous latent space with a block-causal DiT, and finally generates text through a conditional decoder. Because block-causal attention is bidirectional within a block but causal across blocks, it keeps cross-block causal structure while allowing efficient parallel computation inside each block.

AUTOREGRESSIVE

Clear objective, rigid order

Parameterize token-level conditionals → a clean training target. But the fixed order forces sequential inference and a strong hand-crafted bias that limits general generation.

DISCRETE DIFFUSION

Free order, costly recovery

Drops left-to-right, but still does observation recovery in discrete token space → costly multi-step sampling, and intermediate discrete states can't stably hold global semantics.

CONTINUOUS DIFFUSION

Continuous, but token-aligned

Adds continuous spaces, yet most still use the diffusion path to recover token-aligned representations rather than explicitly model a latent prior.

No prior framework jointly delivers non-autoregressive generation + continuous representation + probabilistic modeling. Cola DLM is built to close exactly that gap ↓

A · WHAT IT BUYS YOU

B · TWO LEVELS, ONE PROBABILISTIC FRAME

The decomposition weakens the inductive bias of fixed token order, lets the geometry of continuous space directly support compression and prior fitting, and enables a more flexible non-autoregressive generation process. It is also modular — any latent module (AE, RAE, …) and other continuous modalities plug in.

Plain English: instead of one giant network juggling "what to say" and "how to word it" token by token, Cola splits the job: a prior decides the global gist, a decoder handles the wording. Two easier sub-problems beat one hard one — when the data really has that structure.

GLOBAL SEMANTIC MODELING

prior p_ψ(z₀) · continuous latent

↓

LOCAL TEXTUAL REALIZATION

decoder p_θ(x | z₀) · discrete text

unified by one ELBO →

PAGE 03 · §2 RELATED WORK

The landscape: where does continuity live?

The single sharpest question for placing any text model is: in which space does the model put its "stochastic path", and what is that path's job? Every prior family answers differently — and every one keeps the path tied to recovering an observation. Cola DLM is the first to move the path entirely into a compressed latent prior.

C · EXPLORE THE TAXONOMY

A · TENSOR DOMAIN

Vocabulary-aligned continuous

Diffusion on one-hot vectors / logit-simplices. Representation dim = |V| (vocabulary size, ~10⁴–10⁵). Scales badly.

A · TENSOR DOMAIN

Token-embedding continuous

Map tokens to embedding space ℝ^{L×e} and diffuse there. Still recovers token-aligned targets, with no explicit latent prior.

A · TENSOR DOMAIN

Latent-space continuous

Compress to latent z₀ ∈ ℝ^d via VAE, diffuse there. Cola lives here — but treats the latent as a hierarchical variable with a learned prior, not a fixed code.

PAGE 04 · §3.1.1 THEORETICAL FORMULATION

The generative model: decoder × prior

Everything begins with two players. Let x ∈ 𝒳 be a discrete text sequence, and z₀ ∈ ℝ^d its continuous latent variable. Cola's generative model is just a conditional decoder p_θ(x | z₀) and a latent prior p_ψ(z₀).

A · OBSERVATION

x ∈ 𝒳

A discrete token sequence (length ≤ 512 in experiments). The observable.

A · LATENT

z₀ ∈ ℝ^d

Continuous latent. d = 16 / 64 / 128 studied. Carries global semantics. Sequence-structured into blocks.

A · BASE SEED

z₁ ∼ 𝒩(0, I)

Pure Gaussian noise in ℝ^d. The flow's starting point at time t=1.

B · EQUATION 3.1 — THE GENERATIVE MODEL

$$ p(x,z_0)=p_\theta(x\mid z_0)\,p_\psi(z_0),\qquad p(x)=\int p_\theta(x\mid z_0)\,p_\psi(z_0)\,dz_0 $$

p_ψ(z₀)	the prior over latents — "what meanings are plausible." Parameters ψ (the DiT).
p_θ(x|z₀)	the decoder — "given this meaning, how likely is this exact text." Parameters θ.
∫ … dz₀	marginalize: sum over all possible latents to get the text likelihood.
q_φ(z₀|x)	the encoder — used only for variational inference during training; not part of the generative model.

Read it aloud: "Sample a meaning z₀ from the prior. Hand it to the decoder, which produces text x." The encoder q_φ is the training-time helper that guesses the latent from text — like training wheels you remove before riding (generating). Crucially, the model is defined by prior × decoder alone.
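
The marginal in Eq 3.1 can be made concrete with a toy stand-in. The sketch below assumes a scalar latent with prior N(0, 1) and a Gaussian "decoder" N(z₀, 1) in place of the real flow prior and text decoder (both are illustrative assumptions, not the paper's components), and estimates p(x) by averaging the decoder density over prior samples.

```python
import math
import random

def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian N(mean, var) evaluated at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def sample_prior(rng):
    """Toy prior p(z0) = N(0, 1); a stand-in for the learned flow prior."""
    return rng.gauss(0.0, 1.0)

def decoder_density(x, z0):
    """Toy decoder p(x | z0) = N(z0, 1); a stand-in for the text decoder."""
    return gaussian_pdf(x, z0, 1.0)

def marginal_mc(x, num_samples, seed=0):
    """Estimate p(x) = integral of p(x | z0) p(z0) dz0 by averaging over prior samples."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        total += decoder_density(x, sample_prior(rng))
    return total / num_samples

estimate = marginal_mc(1.0, 50_000)
exact = gaussian_pdf(1.0, 0.0, 2.0)  # closed form for this toy: x ~ N(0, 2)
print(f"Monte Carlo p(x) ~ {estimate:.4f}, exact {exact:.4f}")
```

Sampling from the prior works here only because the toy is one-dimensional. For real text almost no prior sample explains a given sentence, which is why evaluation draws latents from the encoder instead.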

C · INTERACTIVE GRAPHICAL MODEL

B · EQ 3.2 — A CONTINUOUS-FLOW PRIOR

$$ z_1\sim p_1,\quad \frac{dz_t}{dt}=v_\psi(z_t,t),\quad z_0=\Phi^\psi_{0\leftarrow1}(z_1) $$

p₁ = 𝒩(0,I)	base distribution at time t=1 (pure noise).
v_ψ(z_t,t)	a learned vector field — at every point & time it says "which way to flow."
Φ^ψ_0←1	the flow map: integrate the ODE from t=1 down to t=0. Turns noise z₁ into a structured latent z₀.

Plain English: the prior isn't a fixed bell curve — it's "start from noise and follow a learned current." Wherever the current carries the particles, that is where meanings concentrate. This pushed-forward distribution is written p_ψ=(Φ^ψ_0←1)_♯p₁.
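
A minimal sketch of the flow map in Eq 3.2, assuming a 1-D toy in which the target latents are N(MU, SIGMA²) and the velocity field is known in closed form, so no network is needed. The constants and function names are hypothetical. Euler integration from t = 1 down to t = 0 should reproduce the exact map z₀ = MU + SIGMA·z₁ for this toy.

```python
import math

MU, SIGMA = 2.0, 0.5  # hypothetical target latent distribution N(MU, SIGMA^2)

def velocity(z, t):
    """Closed-form marginal velocity for straight paths z_t = (1 - t) z0 + t z1.

    With z0 ~ N(MU, SIGMA^2) and z1 ~ N(0, 1) independent, z_t is Gaussian with
    mean (1 - t) * MU and variance (1 - t)^2 SIGMA^2 + t^2, and
    E[z1 - z0 | z_t = z] is linear in z.
    """
    mean_t = (1.0 - t) * MU
    var_t = (1.0 - t) ** 2 * SIGMA ** 2 + t ** 2
    slope = (t - (1.0 - t) * SIGMA ** 2) / var_t
    return -MU + slope * (z - mean_t)

def flow_map(z1, num_steps=1000):
    """Euler-integrate dz/dt = v(z, t) from t = 1 (noise) down to t = 0 (latent)."""
    z, dt = z1, 1.0 / num_steps
    for step in range(num_steps):
        t = 1.0 - step * dt
        z -= dt * velocity(z, t)  # stepping backwards in time
    return z

z0 = flow_map(1.0)
print(f"noise 1.0 -> latent {z0:.3f} (exact map gives {MU + SIGMA * 1.0:.3f})")
```

In the real model the closed-form `velocity` is replaced by the learned network v_ψ, and the solver and step count are design choices this sketch does not try to match.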

B · EQ 3.3 — BLOCK-CAUSAL FACTORIZATION

$$ z_0=(z_0^{(1)},\dots,z_0^{(B)}),\quad p_\psi(z_0)=p_\psi(z_0^{(1)})\prod_{b=2}^{B}p_\psi\!\big(z_0^{(b)}\mid z_0^{(<b)}\big) $$

z₀^(b)	the b-th block of latents (each block = several tokens; block size 16 works best).
B	number of blocks in the sequence.
z₀^(<b)	all earlier blocks — block b is conditioned on its history.

Plain English: the latent is chopped into chunks. Chunk b depends only on the chunks before it (causal across blocks) — like writing paragraph by paragraph, each informed by what came before, but the words within a paragraph are filled in together. This is exactly the structure the inference loop and the block-causal DiT use later.
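
The chain-rule structure of Eq 3.3 can be sketched with a toy Gaussian chain. The assumptions here are loud ones: each block is a single scalar, and block b depends on its history only through block b − 1 with a hypothetical correlation RHO. The real prior conditions on all earlier blocks through attention.

```python
import math

RHO = 0.8  # hypothetical correlation between consecutive latent blocks

def log_normal(x, mean, var):
    """Log-density of a 1-D Gaussian N(mean, var) at x."""
    return -0.5 * math.log(2.0 * math.pi * var) - 0.5 * (x - mean) ** 2 / var

def block_log_prior(blocks):
    """log p(z0) = log p(z^(1)) + sum over b >= 2 of log p(z^(b) | z^(<b))."""
    total = log_normal(blocks[0], 0.0, 1.0)            # first block: unconditional
    for prev, cur in zip(blocks, blocks[1:]):
        # Later blocks: conditional on history (here, only the previous block).
        total += log_normal(cur, RHO * prev, 1.0 - RHO ** 2)
    return total

print(block_log_prior([0.3, -1.1, 0.5]))
```

For two blocks the sum equals the log-density of a bivariate Gaussian with correlation RHO, which is a quick way to check that the factorization describes one joint distribution rather than B separate ones.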

PAGE 04 · §3.1.1 ELBO & FLOW MATCHING

One objective, three jobs

You can't compute ∫…dz₀ directly, so training maximizes a lower bound — the ELBO. Its real beauty is what it decomposes into: reconstruction, compression, and prior matching. Three analytically separable jobs.

B · EQ 3.4 — THE LOWER BOUND

$$ \log p(x)\ge \E_{q_\phi(z_0|x)}\!\big[\log p_\theta(x\mid z_0)+\log p_\psi(z_0)-\log q_\phi(z_0\mid x)\big]=:\mathcal{L}_{\mathrm{ELBO}}(x) $$

log p(x)	the true marginal log-likelihood of text x — the intractable target we'd love to maximize.
𝔼_{q_φ(z₀|x)}	average over latents drawn from the encoder posterior — the variational helper that guesses z₀ from x.
log p_θ(x|z₀)	reconstruction term — how well the decoder rebuilds the exact text from the latent.
log p_ψ(z₀)	prior term — how much the DiT/flow prior likes this particular latent.
−log q_φ(z₀|x)	entropy term — penalizes an over-confident (too-peaked) encoder; keeps the bound honest.
ℒ_ELBO(x)	the Evidence Lower BOund itself — what training actually maximizes (=: means "defined as").

Plain English: "true likelihood ≥ (decoder can rebuild the text) + (the prior likes this latent) − (penalty for an over-confident encoder)." Training pushes this bound up. By Jensen's inequality the bound never exceeds the true log p(x), so raising ℒ_ELBO raises a guaranteed floor under the real log-likelihood; the remaining gap is the KL between the encoder and the true posterior.
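
The bound can be checked numerically on a model small enough to solve by hand. This sketch assumes a toy linear-Gaussian model (z₀ ∼ N(0, 1), x | z₀ ∼ N(z₀, 1), so p(x) = N(0, 2)) and a Gaussian encoder; none of this is the paper's architecture. All three ELBO terms then have closed forms.

```python
import math

LOG_2PI = math.log(2.0 * math.pi)

def log_marginal(x):
    """Exact log p(x) for the toy model z0 ~ N(0, 1), x | z0 ~ N(z0, 1): p(x) = N(0, 2)."""
    return -0.5 * (LOG_2PI + math.log(2.0)) - x ** 2 / 4.0

def elbo(x, enc_mean, enc_std):
    """Closed-form ELBO for a Gaussian encoder q(z0 | x) = N(enc_mean, enc_std^2)."""
    var = enc_std ** 2
    recon = -0.5 * LOG_2PI - 0.5 * ((x - enc_mean) ** 2 + var)  # E_q[log p(x | z0)]
    prior = -0.5 * LOG_2PI - 0.5 * (enc_mean ** 2 + var)        # E_q[log p(z0)]
    entropy = 0.5 * (LOG_2PI + 1.0 + math.log(var))             # -E_q[log q(z0 | x)]
    return recon + prior + entropy

x = 1.0
print("log p(x)            :", round(log_marginal(x), 4))
print("ELBO, exact encoder :", round(elbo(x, x / 2.0, math.sqrt(0.5)), 4))
print("ELBO, sloppy encoder:", round(elbo(x, 0.0, 1.0), 4))
```

When the encoder equals the true posterior N(x/2, 1/2) the bound is tight; any other encoder gives a strictly smaller value.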

B · EQ 3.5 — THE INFORMATION DECOMPOSITION (the key one)

$$ \E_{p_{\text{data}}}[\mathcal{L}_{\mathrm{ELBO}}(x)]=\underbrace{\E_{q(x,z_0)}[\log p_\theta(x\mid z_0)]}_{\text{1. conditional reconstruction}}-\underbrace{I_q(X;Z_0)}_{\text{2. compression}}-\underbrace{\KL(\bar q_\phi(z_0)\,\|\,p_\psi(z_0))}_{\text{3. prior matching}} $$

𝔼_{p_data}	average over the data distribution — i.e. the ELBO averaged across all real text, not one sample.
q(x,z₀)	the joint p_data(x)·q_φ(z₀|x) — sample real text, then encode it to a latent.
I_q(X;Z₀)	mutual information between text and latent: how many nats the latent stores about x — the rate / compression cost (subtracted).
q̄_φ(z₀)	the aggregated posterior ∫ q_φ(z₀|x) p_data(x) dx — the marginal cloud of all encoder outputs the prior must fit.
KL(·‖·)	Kullback–Leibler divergence — the prior-matching gap between that cloud and the prior p_ψ.

Why this is "the key one": averaging the ELBO splits the encoder's role into three analytically separate jobs — realize text, set the compression rate, and make the prior fittable. The three cards below are exactly these terms.

1 · RECONSTRUCTION

How well the decoder rebuilds x from z₀. Want it high.

2 · I(X;Z₀)

Nats the latent stores about the text (bits if logs are base 2). This is the compression cost: it is subtracted, so the model is pushed to stay compact.

3 · KL PRIOR MATCH

How far the aggregated posterior q̄_φ is from the prior p_ψ. The DiT's job is to shrink this.
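
The three-term split in Eq 3.5 is an exact identity, which a tiny discrete example can confirm. The sketch assumes two possible "texts" and two latent values with made-up probability tables (all numbers are hypothetical), computes the data-averaged ELBO directly, and compares it with reconstruction − rate − prior-matching KL.

```python
import math

P_DATA = [0.6, 0.4]                 # p_data(x) over two toy "texts"
ENCODER = [[0.9, 0.1], [0.2, 0.8]]  # ENCODER[x][z] = q(z | x)
DECODER = [[0.8, 0.2], [0.3, 0.7]]  # DECODER[z][x] = p(x | z)
PRIOR = [0.5, 0.5]                  # p(z)

def expected_elbo():
    """E_{p_data}[ELBO(x)] computed straight from the definition (Eq 3.4)."""
    total = 0.0
    for x, px in enumerate(P_DATA):
        for z, qz in enumerate(ENCODER[x]):
            total += px * qz * (math.log(DECODER[z][x]) + math.log(PRIOR[z]) - math.log(qz))
    return total

def three_terms():
    """Return (reconstruction, mutual information, KL to prior) as in Eq 3.5."""
    q_bar = [sum(P_DATA[x] * ENCODER[x][z] for x in range(2)) for z in range(2)]
    recon = sum(P_DATA[x] * ENCODER[x][z] * math.log(DECODER[z][x])
                for x in range(2) for z in range(2))
    rate = sum(P_DATA[x] * ENCODER[x][z] * math.log(ENCODER[x][z] / q_bar[z])
               for x in range(2) for z in range(2))
    kl = sum(q_bar[z] * math.log(q_bar[z] / PRIOR[z]) for z in range(2))
    return recon, rate, kl

recon, rate, kl = three_terms()
print("direct     :", expected_elbo())
print("decomposed :", recon - rate - kl)
```

Both printed numbers agree to floating-point precision; changing the encoder table moves all three terms at once, which is the trade-off the cards describe.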

C · ELBO BUDGET EXPLORER

Interactive widget: a slider sets the latent information rate I(X;Z₀), from "retain little" (heavy compression) to "retain everything", and the three terms of Eq 3.5 trade off to give the resulting ELBO. It shows why Cola needs the data to have low-rate global semantics.

B · EQ 3.6 — PRIOR LEARNING = DISTRIBUTION MATCHING

$$ \max_\psi \,\E_{z_0\sim\bar q_\phi}[\log p_\psi(z_0)]\;\Longleftrightarrow\;\min_\psi \,\KL(\bar q_\phi(z_0)\,\|\,p_\psi(z_0)) $$

max_ψ	optimize over the prior's parameters ψ (the DiT) — encoder & decoder held fixed.
𝔼_{z₀∼q̄_φ}	average over latents sampled from the aggregated posterior — the encoder's output cloud.
⟺	"is equivalent to" — the two optimizations have the same solution.
KL(q̄_φ‖p_ψ)	divergence from that cloud to the prior; minimizing it = making the prior fit the latents the encoder emits.

Plain English: when the encoder & decoder are frozen, the prior's entire job is to become the distribution of latents the encoder actually produces (the aggregated posterior q̄_φ). Make the prior match that cloud → ELBO goes up.
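
The equivalence is one line of algebra once the KL is expanded. Written with the same symbols as Eq 3.6:

$$ \KL(\bar q_\phi(z_0)\,\|\,p_\psi(z_0))=\underbrace{\E_{z_0\sim\bar q_\phi}[\log \bar q_\phi(z_0)]}_{\text{independent of }\psi}-\E_{z_0\sim\bar q_\phi}[\log p_\psi(z_0)] $$

With the encoder frozen, the first term is a constant, so minimizing the KL over ψ and maximizing the expected prior log-density select the same ψ.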

B · EQ 3.7 — THE FLOW-MATCHING OBJECTIVE

$$ \mathcal{L}_{\mathrm{FM}}=\sum_{b=1}^{B}\E_{t,z_0,z_1}\!\Big[\big\|\,v_\psi(z_t^{(b)},t;z_0^{(<b)})-u_t^{(b)}(z_0,z_1)\,\big\|_2^2\Big] $$

v_ψ	the network's predicted velocity (conditioned on history blocks).
u_t	the target velocity of the straight conditional path (the ground truth direction).
‖·‖²₂	squared error — it's just regression onto a known direction.

Key subtlety: Flow Matching is a solver for the prior, not the definition of the model. The model is still the hierarchical latent-variable one (Eq 3.1); FM is merely an efficient way to learn the vector field that realizes p_ψ.
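
To see what the regression in Eq 3.7 rewards, here is a minimal sketch for one scalar block with no history. It assumes toy latents z₀ ∼ N(MU, SIGMA²), the straight path z_t = (1 − t)z₀ + t·z₁, and a closed-form velocity standing in for the network; names and constants are hypothetical. The conditional-mean field should score a lower loss than a deliberately biased one, though not zero, because the target z₁ − z₀ is not a function of z_t alone.

```python
import random

MU, SIGMA = 2.0, 0.5  # hypothetical latent distribution N(MU, SIGMA^2)

def optimal_velocity(z, t):
    """Conditional-mean velocity E[z1 - z0 | z_t = z] for this Gaussian toy."""
    var_t = (1.0 - t) ** 2 * SIGMA ** 2 + t ** 2
    slope = (t - (1.0 - t) * SIGMA ** 2) / var_t
    return -MU + slope * (z - (1.0 - t) * MU)

def fm_loss(velocity_fn, num_samples=20_000, seed=0):
    """Monte Carlo estimate of E_{t, z0, z1} |v(z_t, t) - u_t|^2 for one scalar block."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        t = rng.random()               # t ~ Uniform[0, 1)
        z0 = rng.gauss(MU, SIGMA)      # latent, as if drawn from the encoder cloud
        z1 = rng.gauss(0.0, 1.0)       # noise endpoint
        z_t = (1.0 - t) * z0 + t * z1  # point on the straight conditional path
        target = z1 - z0               # u_t: constant along a straight path
        total += (velocity_fn(z_t, t) - target) ** 2
    return total / num_samples

loss_opt = fm_loss(optimal_velocity)
loss_off = fm_loss(lambda z, t: optimal_velocity(z, t) + 0.5)
print(f"FM loss, conditional-mean field: {loss_opt:.3f}")
print(f"FM loss, biased field          : {loss_off:.3f}")
```

The residual loss of the best field is the variance of the target around its conditional mean, one reason the FM value should not be read as a negative log-likelihood.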

C · FLOW-MATCHING TRANSPORT · noise (t=1) → latent (t=0)

Animation: particles follow vector-field trails from pure noise z₁ ∼ 𝒩(0,I) at t = 1 to the structured latent z₀ at t = 0.

Each particle integrates dz_t/dt = v_ψ(z_t,t). The conditional training path is the straight line z_t = (1−α(t))z₀ + α(t)z₁ (Eq A.31) — FM regresses the velocity onto these lines, and at inference the learned field carries noise onto the multi-modal latent manifold.

D · APPENDIX A — DERIVATION · Why the flow prior has an exact density, and what the ELBO really decomposes into

The main text states Eq 3.1–3.3 and the ELBO as facts. Appendix A is where they are built. Five results matter, and they explain three things the page above only asserts: why a flow can report a likelihood at all, where the ELBO's slack hides, and why Flow Matching is a side-door, not the front door.

① The CNF prior is not implicit — it has a closed-form log-density (A.12–A.13)

The ODE's density obeys the continuity equation $\partial_t p_t(z) + \nabla\!\cdot\!\big(p_t(z)\,v_\psi(z,t)\big)=0$. Following one trajectory, this collapses to the instantaneous change-of-variables formula, then integrates:

$$ \frac{d}{dt}\log p_t(z_t) = -\nabla\!\cdot\! v_\psi(z_t,t) \;\;\Longrightarrow\;\; \log p_\psi(z_0)=\log p_1(z_1)+\int_0^1 \nabla\!\cdot\! v_\psi(z_t,t)\,dt $$

Plain English: many generative priors are implicit, meaning you can sample from them but not evaluate them. This one hands you the actual probability number: flow the latent forward to noise, read off the Gaussian density where you land, $\log p_1(z_1)$, and add the total log-volume change the flow accumulated along the way, $\int_0^1 \nabla\!\cdot v_\psi\,dt$. The divergence integral is the "how much did space squeeze" ledger.

② The ELBO's slack is an exact KL (A.19)

$$ \log p(x)=\mathcal{L}_{\mathrm{ELBO}}(x)+\KL\!\big(q_\phi(z_0\mid x)\,\|\,p(z_0\mid x)\big) $$

Training maximizes $\mathcal{L}_{\mathrm{ELBO}}$, never $\log p(x)$ itself. The gap is precisely how far the encoder $q_\phi$ is from the true posterior $p(z_0\mid x)$: a non-negative tax that vanishes only when the two coincide. An encoder that fits the posterior poorly therefore pays a standing penalty on the bound (this is the variational gap that returns in Appendix D & F).

③ With encoder/decoder frozen, the prior's only job is to become the aggregated posterior (A.27)

$$ \max_\psi\;\E_{p_{\text{data}}}\big[\mathcal{L}_{\mathrm{ELBO}}(x)\big]\;\;\Longleftrightarrow\;\;\min_\psi\;\KL\!\big(\bar q_\phi(z_0)\,\|\,p_\psi(z_0)\big) $$

This is the formal statement behind the budget widget's bar ③. The "target cloud" $\bar q_\phi(z_0)=\int q_\phi(z_0\mid x)\,p_{\text{data}}(x)\,dx$ is the marginal of all encoder outputs; the DiT prior is trained to match exactly it.

④ The information decomposition — the budget this page animates (A.28)

$$ \E_{p_{\text{data}}}\big[\mathcal{L}_{\mathrm{ELBO}}\big]=\underbrace{\E_{q(x,z_0)}[\log p_\theta(x\mid z_0)]}_{\text{reconstruction}}-\underbrace{I_q(X;Z_0)}_{\text{compression rate}}-\underbrace{\KL(\bar q_\phi\,\|\,p_\psi)}_{\text{prior match}} $$

One equation, three jobs for the encoder: realize text from the latent, decide how many nats the latent carries, $I_q(X;Z_0)$, and set how hard the prior must work. This is why raising the rate does not monotonically help: a higher rate lifts reconstruction, but the rate itself is subtracted, and a richer latent cloud can inflate the KL the prior must close.

⑤ Flow Matching is a solver, not the likelihood objective (A.34–A.36)

$$ v_\psi^\star(z,t)=\E\big[u_t(z_0,z_1)\,\big|\,z_t=z,\,t\big],\qquad \min_\psi \mathcal{L}_{\mathrm{FM}}(\psi;\phi)\;\neq\;\text{(term-by-term)}\;-\log p_\psi(z_0)$$

Directly maximizing $\log p_\psi(z_0)$ needs ODE solves and divergence estimates every step — expensive. FM instead regresses the velocity field onto the straight bridge paths; its pointwise optimum is the conditional-mean velocity. It learns the same prior $p_\psi$ far more cheaply, but you must not read $\mathcal{L}_{\mathrm{FM}}$ as if it were the negative prior log-likelihood. This distinction is the seed of the "likelihood ≠ generation" story (Appendix F).

BIG PICTURE · §3.1 FLOW TRANSPORT

OVERALL UNDERSTANDING · See the whole mechanism move at once

The continuous diffusion field, in 3D

Everything Cola does at the prior level is transport. A learned velocity field v_ψ(z_t,t) carries a cloud of Gaussian noise (t=1) down onto a structured, multi-cluster latent manifold (t=0). The arrows are the field itself; the points are latents riding it along straight conditional paths z_t=(1−t)z₀+t z₁.

C · LIVE 3D FLOW FIELD

Interactive view with field arrows, motion trails, and a time slider from t = 1 (noise z₁) to t = 0 (latent z₀); the flow can be frozen at any t or left to sweep. Generation runs the flow backwards in t (1→0): drop a random seed, integrate along the arrows, land on a meaning.

The field is the model

The DiT prior is v_ψ. Flow Matching (Eq 3.7) only fits these arrows so the cloud lands on the right manifold — nothing else.

Noise → meaning

No token-by-token recovery. A whole global latent is transported at once, then handed to the decoder to realize words.

Why three basins

The target is multi-modal: different global meanings live in different basins. The field routes each seed into one — the geometry RQ1 detects as "global structure."

PAGE 05 · §3.1.2 PROBABILITY ESTIMATION

How do you score a sentence you never wrote directly?

AR models read off log p(x) for free from the token chain. Cola can't, because the latent is marginalized. So at evaluation it approximates log p(x) by importance sampling: draw latents from the encoder, weight them, and combine. Two estimators fall out, and on the same samples the IWAE-style one is never looser than the ELBO-style one.

A · SAMPLES

z₀^(k) ∼ q_φ(z₀|x)

K latent draws from the encoder for one text x.

A · WEIGHTS

log w^(k) ∈ ℝ

A scalar importance weight per sample — high = "this latent explains the text well & the prior likes it."

A · AUGMENTED STATE

[z_t, ℓ_t] ∈ ℝ^{d+1}

Latent plus a running log-density accumulator ℓ, integrated by one ODE.

B · EQ 3.8 — THE IMPORTANCE WEIGHT

$$ \log w^{(k)}=\log p_\theta(x\mid z_0^{(k)})+\log p_\psi(z_0^{(k)})-\log q_\phi(z_0^{(k)}\mid x) $$

w^(k)	the importance weight of the k-th sample — the ratio that corrects for sampling latents from the encoder instead of the true posterior.
z₀^(k)	the k-th latent drawn from the encoder q_φ(z₀|x) (we average over K of them).
log p_θ(x|z₀)	decoder log-likelihood — reward if the latent rebuilds the text.
log p_ψ(z₀)	prior log-density (from the CNF, see Eq 3.9) — reward if the prior likes the latent.
−log q_φ(z₀|x)	subtract the encoder's own log-density — the correction for having sampled from it.

Plain English: three log-numbers added up. (1) does the decoder predict this text from the latent? (2) does the prior think this latent is plausible? (3) minus how confidently the encoder proposed it. The first two reward; the third corrects for the fact we cheated by sampling from the encoder.

B · EQ 3.9–3.10 — THE PRIOR TERM BY ONE ODE

$$ \frac{d}{dt}\begin{bmatrix} z_t \\ \ell_t \end{bmatrix}=\begin{bmatrix} v_\psi(z_t,t) \\ \nabla\!\cdot v_\psi(z_t,t) \end{bmatrix},\quad \begin{bmatrix} z_0 \\ \ell_0 \end{bmatrix}=\begin{bmatrix} z_0^{(k)} \\ 0 \end{bmatrix} $$

$$ \log p_\psi(z_0^{(k)})=\log p_1(z_1^{(k)})+\ell_1^{(k)} $$

z_t	the latent state integrated along the ODE from t=0 (data) to t=1 (noise).
ℓ_t	the passenger accumulator — running total of the log-density change; starts at 0.
v_ψ(z_t,t)	the learned velocity field driving the flow.
∇·v_ψ	its divergence — the instantaneous log-volume change (computed via Eq 3.11).
log p₁(z₁^(k))	Gaussian base density at the noise endpoint z₁.
ℓ₁^(k)	the total accumulated volume change at t=1 — add it to get the prior log-density.

Plain English: to find how likely a latent is under the flow prior, run the flow forward (t: 0→1) back to noise, while a passenger variable ℓ tallies how much the flow stretched/squeezed space (the divergence). Final density = (Gaussian density of where you landed) + (total log-stretch). This is the continuous change-of-variables formula.
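
A minimal sketch of the augmented ODE in Eq 3.9–3.10, assuming a 1-D toy whose flow prior is exactly N(MU, SIGMA²) and whose velocity and divergence are known in closed form (in one dimension the divergence is just dv/dz, so no trace estimator is needed). Names and constants are hypothetical. Integrating the latent and the passenger ℓ together from t = 0 to t = 1 should recover the Gaussian log-density.

```python
import math

MU, SIGMA = 2.0, 0.5  # hypothetical prior p_psi = N(MU, SIGMA^2) realized by a flow

def velocity_and_divergence(z, t):
    """Closed-form v(z, t) and dv/dz for straight Gaussian paths (1-D toy)."""
    var_t = (1.0 - t) ** 2 * SIGMA ** 2 + t ** 2
    slope = (t - (1.0 - t) * SIGMA ** 2) / var_t
    return -MU + slope * (z - (1.0 - t) * MU), slope  # in 1-D, divergence = dv/dz

def flow_log_density(z0, num_steps=2000):
    """Integrate [z_t, l_t] from t = 0 to t = 1, then return log p1(z1) + l_1."""
    z, ell, dt = z0, 0.0, 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        v, div = velocity_and_divergence(z, t)
        z += dt * v      # state moves toward noise
        ell += dt * div  # passenger accumulates the divergence
    log_p1 = -0.5 * math.log(2.0 * math.pi) - 0.5 * z ** 2  # standard normal at z1
    return log_p1 + ell

def exact_log_density(z0):
    """Reference value: log N(z0; MU, SIGMA^2)."""
    return -0.5 * math.log(2.0 * math.pi * SIGMA ** 2) - 0.5 * ((z0 - MU) / SIGMA) ** 2

print(flow_log_density(2.3), exact_log_density(2.3))
```

The two printed values agree up to Euler discretization error, which shrinks as `num_steps` grows.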

B · EQ 3.11 — HUTCHINSON'S TRACE TRICK

$$ \nabla\!\cdot v_\psi=\mathrm{Tr}\!\Big(\tfrac{\partial v_\psi}{\partial z_t}\Big)\approx \epsilon^\top \tfrac{\partial v_\psi}{\partial z_t}\,\epsilon,\quad \epsilon\sim\Nrm(0,I) $$

∇·v_ψ	the divergence we need — exactly the trace of the velocity field's Jacobian.
Tr(∂v_ψ/∂z_t)	trace of the d×d Jacobian — exact but O(d²) expensive.
ε	a random probe vector, ε∼𝒩(0,I); ε⊤Jε is an unbiased one-shot estimate of the trace.
≈	Hutchinson's stochastic estimator — one cheap vector-Jacobian product instead of the full trace.

Why: the exact divergence (trace of a d×d Jacobian) is brutal in high dimensions. Hutchinson's estimator replaces it with a single random projection ε⊤Jε — one cheap vector-Jacobian product. The same ε is frozen across one ODE solve so the trajectory stays consistent.
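
The estimator itself fits in a few lines. This sketch assumes a small explicit matrix as a hypothetical Jacobian so the exact trace is available for comparison. In a real model the product Jε would come from automatic differentiation without ever forming J, and this sketch averages many probes only to show unbiasedness.

```python
import random

def hutchinson_trace(jacobian, num_probes, seed=0):
    """Average eps^T J eps over Gaussian probes; an unbiased estimate of Tr(J)."""
    rng = random.Random(seed)
    dim = len(jacobian)
    total = 0.0
    for _ in range(num_probes):
        eps = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        # J @ eps, written out explicitly for a small dense matrix.
        j_eps = [sum(jacobian[i][j] * eps[j] for j in range(dim)) for i in range(dim)]
        total += sum(e * je for e, je in zip(eps, j_eps))
    return total / num_probes

J = [[1.0, 2.0, 0.0],
     [0.0, 3.0, 1.0],
     [1.0, 0.0, -1.0]]  # hypothetical Jacobian of v at one point
exact = sum(J[i][i] for i in range(3))
print("exact trace   :", exact)
print("1 probe       :", hutchinson_trace(J, 1))
print("20,000 probes :", hutchinson_trace(J, 20_000))
```

A single probe is noisy; the average converges to the exact trace because the off-diagonal contributions cancel in expectation.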

C · HOW log p_ψ(z₀) IS ACTUALLY COMPUTED — STEP BY STEP

Integrate the flow from t=0 (latent z₀) to t=1 (noise). The passenger ℓ_t = ∫₀^t ∇·v dτ accumulates the divergence, i.e. ℓ is the signed area under the ∇·v curve. In the widget, the amber dots are Hutchinson's one-probe estimates ε⊤Jε: jittery per step, yet unbiased, so they integrate to the same total in expectation. Final answer: log p_ψ(z₀) = log p₁(z₁) + ℓ₁.

B · EQ 3.12 — TWO ESTIMATORS

$$ \log\widehat p_{\mathrm{ELBO},K}=\frac1K\sum_{k=1}^K \log w^{(k)} $$

average of the logs — looser bound

$$ \log\widehat p_{\mathrm{IWAE},K}=\log\Big(\frac1K\sum_{k=1}^K e^{\log w^{(k)}}\Big) $$

log of the average: at least as tight (Jensen)

log p̂_ELBO,K	average-of-logs over K weights — the looser bound (the arithmetic mean of the log-weights).
log p̂_IWAE,K	log-of-average (log-sum-exp) — the tighter bound; equals ELBO when K=1 and climbs toward true log p(x) as K→∞.
K	number of importance samples drawn from the encoder.
w^(k)	the per-sample importance weight from Eq 3.8.
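
Both estimators in Eq 3.12 can be sketched on a model where the answer is known. The assumptions: a toy linear-Gaussian model (z₀ ∼ N(0, 1), x | z₀ ∼ N(z₀, 1), so log p(x) is available in closed form) and a hypothetical Gaussian encoder that is slightly off the true posterior. The log-of-average is computed by subtracting the maximum log-weight first, a standard trick to avoid overflow.

```python
import math
import random

def log_normal(x, mean, var):
    """Log-density of a 1-D Gaussian N(mean, var) at x."""
    return -0.5 * math.log(2.0 * math.pi * var) - 0.5 * (x - mean) ** 2 / var

def log_weights(x, enc_mean, enc_var, num_samples, seed=0):
    """Eq 3.8 for the toy model z0 ~ N(0, 1), x | z0 ~ N(z0, 1)."""
    rng = random.Random(seed)
    weights = []
    for _ in range(num_samples):
        z0 = rng.gauss(enc_mean, math.sqrt(enc_var))         # z0 ~ q(z0 | x)
        weights.append(log_normal(x, z0, 1.0)                # decoder term
                       + log_normal(z0, 0.0, 1.0)            # prior term
                       - log_normal(z0, enc_mean, enc_var))  # encoder correction
    return weights

def elbo_estimate(log_w):
    """Average of the log-weights."""
    return sum(log_w) / len(log_w)

def iwae_estimate(log_w):
    """Log of the average weight, computed stably by subtracting the maximum."""
    top = max(log_w)
    return top + math.log(sum(math.exp(w - top) for w in log_w) / len(log_w))

x = 1.0
log_w = log_weights(x, enc_mean=0.3, enc_var=0.8, num_samples=64)
print("ELBO-style:", elbo_estimate(log_w))
print("IWAE-style:", iwae_estimate(log_w))
print("exact     :", log_normal(x, 0.0, 2.0))
```

If the encoder is set to the exact posterior N(x/2, 1/2), every weight equals log p(x) and the two estimators coincide, which shows that the gap between them measures encoder mismatch.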

C · ESTIMATOR PLAYGROUND

Interactive widget: draw K importance weights (default K = 8) and compare the ELBO-style estimate, the IWAE-style estimate, their gap (IWAE − ELBO), and a toy reference value for the true log p(x). On the same weights IWAE is always ≥ ELBO (Jensen), and it tightens toward the true log p(x) as K grows. Both estimators are lower bounds in expectation; IWAE sits between ELBO and the truth.

B · EQ 3.13–3.14 — CONDITIONAL SCORING

$$ \log p(x_{\text{res}}\mid x_{\text{pre}})=\log p(x_{\text{pre}},x_{\text{res}})-\log p(x_{\text{pre}}) $$

$$ \log\widehat p(x_{\text{res}}\mid x_{\text{pre}})=\log\widehat p(x_{\text{pre}},x_{\text{res}})-\log\widehat p(x_{\text{pre}}) $$

x_pre	the prefix / prompt context (already given).
x_res	the response being scored or ranked.
log p(x_res|x_pre)	the conditional log-probability of the response = log of joint − log of prefix.
p̂ (the hats)	plug-in estimators (Eq 3.12) substituted into the exact identity — convenient, but not themselves a certified bound.

Plain English: to score a response given a prompt, score the whole thing, score the prompt alone with the same estimator, and subtract. Probability of the part = probability of the whole ÷ probability of the prefix. Used for multiple-choice & continuation evals.
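
The score-the-whole, score-the-prefix, subtract recipe can be sketched on a toy where the conditional is known exactly. The assumptions: the "prefix" and "response" are two scalars that share one latent (z₀ ∼ N(0, 1), each xᵢ | z₀ ∼ N(z₀, 1)), and the encoder is the exact posterior with an optional variance inflation `enc_var_scale` (a hypothetical knob for making it sloppy).

```python
import math
import random

def log_normal(x, mean, var):
    """Log-density of a 1-D Gaussian N(mean, var) at x."""
    return -0.5 * math.log(2.0 * math.pi * var) - 0.5 * (x - mean) ** 2 / var

def iwae_log_prob(xs, enc_var_scale=1.0, num_samples=256, seed=0):
    """IWAE-style estimate of log p(xs) for z0 ~ N(0, 1), x_i | z0 ~ N(z0, 1).

    The encoder is the exact posterior N(sum(xs) / (n + 1), 1 / (n + 1)) with its
    variance multiplied by enc_var_scale; a scale of 1.0 makes it exact.
    """
    rng = random.Random(seed)
    n = len(xs)
    enc_mean, enc_var = sum(xs) / (n + 1), enc_var_scale / (n + 1)
    log_w = []
    for _ in range(num_samples):
        z0 = rng.gauss(enc_mean, math.sqrt(enc_var))
        log_w.append(sum(log_normal(x, z0, 1.0) for x in xs)
                     + log_normal(z0, 0.0, 1.0)
                     - log_normal(z0, enc_mean, enc_var))
    top = max(log_w)
    return top + math.log(sum(math.exp(w - top) for w in log_w) / num_samples)

def conditional_score(x_pre, x_res, **kwargs):
    """Plug-in estimate: log p-hat(pre, res) - log p-hat(pre)."""
    return iwae_log_prob([x_pre, x_res], **kwargs) - iwae_log_prob([x_pre], **kwargs)

x_pre, x_res = 1.0, 0.2
exact = log_normal(x_res, x_pre / 2.0, 1.5)  # closed form for this toy
print("exact conditional :", exact)
print("plug-in, exact q  :", conditional_score(x_pre, x_res))
print("plug-in, sloppy q :", conditional_score(x_pre, x_res, enc_var_scale=2.0))
```

With a sloppy encoder both terms are biased downward by different amounts, so the difference can land on either side of the exact value.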

C · THE TWO ALGORITHMS

Algorithm 1 · unconditional log p(x)

for k = 1…K: sample z₀^(k) ∼ q_φ(z₀|x)

log w^(k) ← log p_θ(x|z₀^(k)) + log p_ψ(z₀^(k)) − log q_φ(z₀^(k)|x)

output ELBO = mean_k log w^(k) · IWAE = logmeanexp_k log w^(k)

Algorithm 2 · conditional log p(x_res|x_pre): calls Algorithm 1 twice and subtracts

â = Alg1(x_pre, x_res)

b̂ = Alg1(x_pre)

output â − b̂ ≈ log p(x_res|x_pre)

D · APPENDIX B — DERIVATION · How you actually compute these log-likelihoods: the augmented ODE, Hutchinson's trace, and why the conditional plug-in is not a bound

Eq 3.8–3.14 give the estimators; Appendix B gives the machinery that makes them runnable and the caveats that make them honest.

① The prior density needs an augmented ODE + a cheap trace trick (B.23–B.25)

Evaluating $\log p_\psi(z_0^{(k)})$ means solving the state and a log-density accumulator together, then estimating the divergence — which is a $d\times d$ Jacobian trace — with the Hutchinson estimator:

$$ \frac{d}{dt}\begin{bmatrix}z_t\\ \ell_t\end{bmatrix}=\begin{bmatrix}v_\psi(z_t,t)\\ \nabla\!\cdot v_\psi(z_t,t)\end{bmatrix},\qquad \nabla\!\cdot v_\psi=\E_\epsilon\!\big[\epsilon^\top \tfrac{\partial v_\psi}{\partial z_t}\,\epsilon\big] $$

Plain English: integrate a passenger $\ell$ alongside the latent that tallies the log-volume change; at $t=1$, $\log p_\psi(z_0)=\log p_1(z_1)+\ell_1$. The exact trace is brutal in high-$d$, so one random projection $\epsilon^\top J\epsilon$ (a single vector-Jacobian product) estimates it — with the same $\epsilon$ frozen across the whole solve so the trajectory stays self-consistent.

② ELBO vs IWAE — the live widget's two numbers (B.8–B.10)

$$ \log\widehat p_{\mathrm{ELBO},K}=\tfrac1K\!\sum_k \log w^{(k)}\;\le\;\log\widehat p_{\mathrm{IWAE},K}=\log\!\Big(\tfrac1K\!\sum_k e^{\log w^{(k)}}\Big)\;\le\;\log p(x) $$

Same importance weights $\log w^{(k)}=\log p_\theta(x\mid z_0^{(k)})+\log p_\psi(z_0^{(k)})-\log q_\phi(z_0^{(k)}\mid x)$, two ways to average. Average-then-log (IWAE) beats log-then-average (ELBO) by Jensen, and tightens toward the truth as $K\!\to\!\infty$. Both are still lower bounds in expectation, so an ELBO-based PPL is, in expectation, an upper bound on the true perplexity.

③ The conditional score is a plug-in difference — and loses the bound guarantee (B.12–B.14)

$$ \widehat{\log p}_{\text{cond}}(x^{\text{res}}\mid x^{\text{pre}}) := \widehat{\mathcal L}(x^{\text{pre}},x^{\text{res}}) - \widehat{\mathcal L}(x^{\text{pre}}) $$

The exact identity $\log p(x^{\text{res}}\mid x^{\text{pre}})=\log p(x^{\text{pre}},x^{\text{res}})-\log p(x^{\text{pre}})$ is run with two estimators (Algorithm A.2). Caveat the main text glosses: subtracting two bounds does not inherit a bound property — the difference can land on either side of the truth, so this is a practical estimator, not a certified lower bound. For ranking a single fresh block, the block-level score (B.21–B.22) suffices.

PAGE 06 · §3.2 WORKFLOW (FIGURE 1)

Two training stages, one inference cascade

The elegant probabilistic model of §3.1 is realized as a mechanical cascade: Stage 1 learns a stable text↔latent code with a Text VAE; Stage 2 jointly trains the VAE & the block-causal DiT to learn the final prior; Inference encodes the prefix, generates latent blocks autoregressively, and decodes — with a KV cache.

C · FIGURE 1, REBUILT

B · EQ 3.15 — STAGE 1 AUTOENCODING

$$ z_0\sim q_\phi(z_0\mid x),\qquad \hat x\sim p_\theta(x\mid z_0) $$

q_φ(z₀|x)	the encoder — maps text to a latent z₀.
z₀	the per-token continuous latent (Stage 1 does not compress sequence length).
p_θ(x|z₀)	the decoder — reconstructs text from the latent.
x̂	the reconstruction; training drives x̂ ≈ x.

Plain English: encode text → latent, decode latent → text, and make x̂ ≈ x. The goal is not the final prior yet — it's to fix a stable division of labor: what the latent stores vs. what the decoder recovers.

B · EQ 3.16 — THE VAE OBJECTIVE

$$ \mathcal{L}_{\mathrm{VAE}}=-\E_{q_\phi}\!\log p_\theta(x\mid z_0)+\beta\,\KL(q_\phi(z_0|x)\,\|\,p_{\text{base}}(z_0))+\lambda_{\text{mask}}\mathcal{L}_{\text{mask}} $$

−E log p_θ	reconstruction loss — rebuild the text.
β·KL	pull the encoder toward a base distribution p_base (regularize the latent–text interface).
λ_maskL_mask	a BERT-style masking loss: forces the encoder to keep semantics instead of letting the decoder memorize surface text.

Two design choices: the VAE does not compress sequence length (so each token still maps to a latent), and both encoder & decoder are strictly causal — to prevent information leakage and enable streaming generation.

PAGE 06 · §3.2.2 BLOCK-CAUSAL PRIOR

The block-causal mechanism — the heart of the DiT

How do you keep causal structure (so generation is well-defined) and parallel efficiency (so it's fast)? The answer is the attention mask. Bidirectional within a block, causal across blocks. This is the geometric meaning of the factorization in Eq 3.3 — and it's the single most important diagram in the paper.

B · EQ 3.17 — THE VISIBLE SET FOR BLOCK b

$$ \mathcal{V}_b=\big\{\;\sg(z_0^{(<b)}),\;\; z_t^{(b)}\;\big\} $$

sg(z₀^(<b))	the clean latent blocks before b, with a stop-gradient — used as fixed history, gradients don't flow back into them here.
z_t^(b)	the current noisy block being denoised at time t.

Plain English: when working on block b, the model is allowed to look at all finished, clean earlier blocks plus the noisy version of the current block. It cannot peek at future blocks. Within the current block every position sees every other (bidirectional); across blocks it only sees the past (causal).
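
The visibility rule can be written as a boolean mask. This is a minimal sketch of the mask geometry only. It assumes positions are grouped into consecutive equal-size blocks and does not model the fact that history positions hold clean latents while the current block holds noisy ones.

```python
def block_causal_mask(seq_len, block_size):
    """mask[i][j] is True when position i may attend to position j.

    Position j is visible if its block index is not later than the block of i:
    same block means bidirectional attention, earlier block means causal history.
    """
    return [[(j // block_size) <= (i // block_size) for j in range(seq_len)]
            for i in range(seq_len)]

mask = block_causal_mask(seq_len=12, block_size=3)
for row in mask:
    print("".join("#" if visible else "." for visible in row))
```

The printed pattern is a staircase of 3-wide steps: each row sees its own block in full plus everything to the left, and nothing from later blocks.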

C · BLOCK-CAUSAL ATTENTION MASK

Interactive mask diagram (shown for block size 3, sequence length 12). Legend: attends to clean history · bidirectional within block · masked (future).

WHY IT MATTERS

Larger blocks = more parallel positions denoised together, fewer sequential steps. But Eq 3.3's causal structure is preserved across blocks.

THE SWEET SPOT

Experiments (RQ3) find block size 16 best: too small loses local grouping, too large weakens semantic interactions.

AT INFERENCE

A KV cache stores the clean-history columns (green in the mask diagram), so each new block reuses past computation — the 1.6–2.0× generation-depth win over AR.

B · EQ 3.18 — THE STAGE-2 JOINT OBJECTIVE

$$ \mathcal{L}_{\text{stage2}}=\lambda_{\mathrm{VAE}}\!\underbrace{\Big(-\E_{q_\phi}\log p_\theta(x|z_0)+\beta\E_{q_\phi}\log q_\phi(z_0|x)+\lambda_{\text{mask}}\mathcal{L}_{\text{mask}}\Big)}_{\text{① autoencoding with regularized latent}}+\lambda_{\mathrm{fm}}\underbrace{\mathcal{L}_{\mathrm{FM}}}_{\text{② block prior}}+\lambda_{\text{ref}}\underbrace{\E_{p_{\text{data}}}\KL\big(q_\phi(z_0|x)\|q_{\phi_{\text{ref}}}(z_0|x)\big)}_{\text{③ anti-drift}} $$

① AUTOENCODING

Preserves the reconstruction + masking structure so the latent stays meaningful as it evolves.

② FLOW MATCHING

Learns the block-level conditional prior — the actual diffusion/transport loss.

③ REFERENCE KL

Pins the live encoder to a frozen reference encoder φ_ref — suppresses latent drift during joint training.

λ_VAE, λ_fm, λ_ref	scalar loss weights balancing the three blocks (autoencoding / flow-matching / anti-drift).
β	the KL/rate weight inside the VAE term — controls how strongly the latent is regularized (the rate knob from §3.2).
λ_mask · ℒ_mask	weight × the masked-reconstruction loss that keeps the latent structured.
ℒ_FM	the block-causal Flow-Matching loss (Eq 3.7) — learns the conditional prior.
q_{φ_ref}	a frozen reference encoder; the KL to it penalizes the live encoder for drifting.

In one breath: "keep autoencoding honest (①), learn the prior (②), and don't let the latent wander off while both train together (③)." This controlled co-adaptation is what lets the latent and prior improve each other — the empirical reason joint training beats a frozen VAE (RQ2).
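
The arithmetic of Eq 3.18 is a weighted sum, sketched below with made-up per-batch numbers. Two assumptions to flag: the encoder is treated as a scalar Gaussian so the anti-drift KL has a closed form, and the default weights are illustrative rather than values reported by the paper.

```python
import math

def gaussian_kl(mean_a, std_a, mean_b, std_b):
    """KL( N(mean_a, std_a^2) || N(mean_b, std_b^2) ) for scalar Gaussians."""
    return (math.log(std_b / std_a)
            + (std_a ** 2 + (mean_a - mean_b) ** 2) / (2.0 * std_b ** 2) - 0.5)

def stage2_loss(autoencoding, flow_matching, reference_kl,
                lam_vae=1.0, lam_fm=1.0, lam_ref=0.3):
    """Weighted sum of the three terms in Eq 3.18; default weights are illustrative."""
    return lam_vae * autoencoding + lam_fm * flow_matching + lam_ref * reference_kl

# Hypothetical per-batch values, chosen only to show the arithmetic.
drift = gaussian_kl(mean_a=0.4, std_a=0.9, mean_b=0.0, std_b=1.0)
total = stage2_loss(autoencoding=2.10, flow_matching=0.35, reference_kl=drift)
print(f"anti-drift KL = {drift:.4f}, total = {total:.4f}")
```

The anti-drift term is zero when the live encoder matches the frozen reference and grows as its mean or spread moves away, which is what lets λ_ref act as a leash during joint training.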

C · ASSEMBLE THE LOSS — Eq 3.18 LIVE

Each block is a base loss × a weight λ. Slide the weights to see the single scalar Cola minimizes get assembled — and what breaks when one term is starved.

Weighted contributions (①, ②, ③) sum to the total stage-2 loss. Default weights: λ_VAE (autoencode) = 1.0, λ_fm (flow prior) = 1.0, λ_ref (anti-drift) = 0.3.

PAGE 07 · §3.2.3 INFERENCE

Generation: encode the prefix, transport blocks, decode

Generation is "autoregressive in latent space." Encode the prompt into clean latent conditions, then produce the response one latent block at a time — each block is a fresh noise seed transported by the flow under the historical condition — and finally decode everything into words.

C · BLOCK-WISE GENERATION SIMULATOR

Legend: prefix (encoded, clean) · generated latent block · noise seed → transporting · KV-cached history

B · EQ 3.19

① encode the prefix

$$ z^{\text{pre}}\sim q_\phi(z^{\text{pre}}\mid x^{\text{pre}}) $$

Run the prompt through the encoder once → clean conditioning latents. These never get re-noised (that's the "clean condition" winner from §5.2).

B · EQ 3.20

② transport block b

$$ \hat z_0^{(b)}=\Phi^\psi_{0\leftarrow1}\!\big(\epsilon^{(b)};z^{\text{pre}},\hat z_0^{(<b)}\big),\ \epsilon^{(b)}\!\sim\!\Nrm(0,I) $$

Draw fresh noise, then flow it down to a clean latent block — conditioned on the prefix and all previously generated blocks. Repeat for b = 1…B.

B · EQ 3.21

③ decode the response

$$ \hat x^{\text{res}}\sim p_\theta\big(x^{\text{res}}\mid z^{\text{pre}},\hat z_0^{(1:B)}\big) $$

Hand all generated latent blocks + prefix to the decoder → the text response. Two stages: sample a global latent, then realize text.

x^pre, x^res	the prompt prefix and the response to be generated.
z^pre ∼ q_φ	the prefix encoded once into a clean conditioning latent (never re-noised).
Φ^ψ_0←1(ε; ·)	the prior's flow map — integrates noise ε down to a clean latent, conditioned on the prefix and earlier blocks.
ẑ₀^(b) · ẑ₀^(<b)	the b-th generated latent block, conditioned on all blocks before it (block-causal).
ε^(b) ∼ 𝒩(0,I)	fresh Gaussian seed for block b.
p_θ(x^res\|·)	the decoder, realizing text from prefix + all generated latent blocks.

Summary (paper's own). The workflow implements the hierarchical probabilistic model through two training stages + one inference stage — a mechanical cascade, not a token-space reverse process. Stage 1's base prior regularizes the latent–text interface but is not the final prior; Stage 2's block-causal DiT learns the real prior p_ψ(z₀) while the VAE keeps autoencoding. At inference: encode prefix → generate latent blocks autoregressively → decode.
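The encode → transport → decode loop of Eq 3.19–3.21 can be sketched with a toy stand-in for the learned velocity field. Everything here is hypothetical: `toy_velocity` replaces v_ψ with the exact velocity of a point-mass target (the mean of the history), latents are scalars, and the decoder step is omitted. Only the control flow is meant to match the text: clean prefix, fresh noise per block, Euler integration from t=1 to t=0, block-causal history.

```python
import random


def toy_velocity(z, t, target):
    # Hypothetical stand-in for v_psi. For the path z_t = (1-t)*target + t*eps,
    # the velocity dz/dt equals (z_t - target) / t.
    return [(zi - target) / t for zi in z]


def transport_block(noise, target, steps=16):
    """Euler-integrate the flow from t=1 (noise) down to t=0 (clean latent)."""
    z = list(noise)
    dt = 1.0 / steps
    for k in range(steps):
        t = 1.0 - k * dt
        v = toy_velocity(z, t, target)
        z = [zi - dt * vi for zi, vi in zip(z, v)]
    return z


def generate(prefix_latents, num_blocks, block_size=16, steps=16, seed=0):
    rng = random.Random(seed)
    history = list(prefix_latents)      # clean condition, never re-noised
    blocks = []
    for _ in range(num_blocks):
        noise = [rng.gauss(0.0, 1.0) for _ in range(block_size)]  # eps^(b)
        target = sum(history) / len(history)   # toy conditioning on history
        block = transport_block(noise, target, steps)
        blocks.append(block)
        history.extend(block)           # later blocks see earlier ones
    return blocks


latent_blocks = generate(prefix_latents=[0.5, 1.5], num_blocks=3)
```

In a real system the history would live in a KV cache and the final blocks would be passed to the decoder p_θ; this sketch stops at the latent blocks.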

PAGES 07–08 · §3.3 UNIFIED VIEW

One frame to rule them all: paths over state spaces

Every text model can be written as a stochastic process τ=(S_t) on a state space, with a transition kernel and an emission rule. The real question isn't "who uses diffusion?" — it's what space the path lives in, and whether that path recovers an observation or transports a prior.

B · EQ 3.22 — THE COMMON OUTER FORM

$$ p_\Theta(x)=\int e_\Theta(x\mid\tau)\,P_\Theta(d\tau),\qquad P_\Theta(d\tau)=\mu_\Theta(ds_0)\prod_{t>0}K_t^\Theta(ds_t\mid s_{<t}) $$

τ = (s_t)	a whole trajectory of states — the path the model factorizes text over.
e_Θ(x\|τ)	the emission kernel — how the final text is read out from the path.
μ_Θ(ds₀)	the distribution of the initial state.
K_t^Θ(ds_t\|s_<t)	the transition kernel advancing the path one step (given the history).
P_Θ(dτ)	the induced law over whole trajectories = initial × product of transitions.

Plain English: "pick a starting state, evolve it by a transition kernel, then emit the text." Everyone fits this mould. The differences are entirely in (a) what the states s_t are, and (b) whether the emission is "read off a recovered observation" or "decode a transported latent." Two knobs explain four model families.

C · TABLE 1, ANIMATED — PICK A PATH

Method	State Space	Path Role	Generative Factorization	Continuity Appears	Explicit Latent?
AR	Prefix Tokens	Direct Generation	∏ᵢ p(xᵢ\|x<ᵢ)	None	✗
LLaDA	Discrete Masked Seqs	Observation-Recovery	p(s_T)∏ₜ p(s_{t-1}\|s_t)	Discrete token space	✗
Plaid	Continuous Token-Aligned	Observation-Recovery	p(h_T)∏ₜ p(h_{t-1}\|h_t)	Continuous token space	✗
Cola DLM	Compressed Latent Seqs	Prior-Transport	∫ p(x\|z₀)p(z₀)dz₀	Latent space	✓

B · EQ 3.26–3.27 — COLA'S PATH IS PRIOR-ONLY

$$ z_1\sim p_1,\ z_0=\Phi^\psi_{0\leftarrow1}(z_1),\ x\sim p_\theta(x\mid z_0) $$

$$ \frac{dz_t}{dt}=v_\psi(z_t,t),\quad p_\psi=(\Phi^\psi_{0\leftarrow1})_\sharp\,p_1 $$

z₁ ∼ p₁	a seed from the simple base (Gaussian) distribution.
Φ^ψ_0←1	the flow map carrying noise z₁ to a structured latent z₀ (integrates the ODE).
v_ψ(z_t,t)	the learned velocity field defining that flow.
(·)_♯ p₁	the pushforward — the prior p_ψ is exactly the base distribution carried through the flow.
p_θ(x\|z₀)	the decoder, which finally realizes text — the only place an observation x appears.

The crux: the path does not depend on any observation x. It only describes how to sample a semantic prior from Gaussian noise. AR generates the observation directly, while LLaDA and Plaid run a path that corrupts and then recovers a given sample. Cola's path transports a prior: a different state space and a different target.

B · EQ 3.28 — WHY A LATENT AT ALL

$$ \E_{p_{\text{data}}}[\mathcal{L}_{\mathrm{ELBO}}]=\E_{q}[\log p_\theta(x|z_0)]-I_q(X;Z_0)-\KL(\bar q_\phi\|p_\psi) $$

𝔼_q[log p_θ(x\|z₀)]	reconstruction — how well the decoder realizes text from the latent.
I_q(X;Z₀)	mutual information = bits of global semantics compressed into z₀ (subtracted).
KL(q̄_φ‖p_ψ)	prior-matching gap between the aggregated posterior and the flow prior.

Plain English: z₀ isn't just a continuous stand-in for tokens — it's an explicit marginalized intermediate variable. Global semantics get compressed into z₀ (the I(X;Z₀) term), while local word realization is delegated to the decoder. That separation is the whole point of using a latent.

D · APPENDIX C — DERIVATION The four families as one Markov process — and the exact identity that says why a better prior helps ▸

Appendix C builds the abstract "process-based generative model" the table above summarizes, then asks one sharp question per family: into what state space, along what path, with the path doing observation-recovery or prior-transport?

① AR is a prefix-filtration Markov chain (C.7–C.9)

Set states $S_i:=x_{1:i}$. Then $(S_i)$ is a Markov chain whose one-step kernel is the AR conditional $p_\eta(x_i\mid x_{<i})$. Its true inductive bias isn't Markovianity — it's that conditioning is locked to the growing prefix $\sigma(X_{1:1})\subset\dots\subset\sigma(X_{1:L})$. Exact token likelihood, but a frozen left-to-right order.

② LLaDA & Plaid both corrupt-then-recover an observation (C.13–C.19)

LLaDA's masking is a continuous-time Markov chain that absorbs each token into a mask state with probability $t$ (C.16) — a reverse recovery over discrete states. Plaid does the same in a continuous token-aligned space $h_0=\mathrm{Embed}(x)$; as noise $\to 0$ its state stays glued to the observation (C.17). In the $\sigma_0^2\!\to\!0$ limit (C.19), Cola would degenerate to Plaid — which pinpoints the genuinely new ingredient: the latent decomposition itself, not the continuity.

③ Cola's path is prior-transport — observation-free (C.21–C.23)

The flow $z_1\!\sim\!p_1,\ z_0=\Phi^\psi_{0\leftarrow1}(z_1)$ never sees $x$. The encoder $q_\phi$ appears only in the variational bound (C.22), so it belongs to inference; in Plaid/LLaDA the forward process is part of the model definition. That is the precise sense in which Cola is "first and foremost a hierarchical latent-variable LM with a CNF prior, where flow is just a way to make the prior family expressive."

④ The exact identity behind "why diffusion / why a richer prior" (C.12)

$$ \E_{\bar q_\phi}\big[\log p_b(z_0)-\log p_a(z_0)\big]=\KL\!\big(\bar q_\phi\|p_a\big)-\KL\!\big(\bar q_\phi\|p_b\big) $$

For any two candidate priors $p_a,p_b$, the average-ELBO gain of swapping $a\!\to\!b$ is exactly the reduction in KL-to-aggregated-posterior. So whenever the flow/CNF prior sits closer to $\bar q_\phi$ than a plain Gaussian does, the average ELBO provably rises. "Why diffusion" is not about escaping max-likelihood — it is about buying a more expressive prior family that closes this KL.
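The identity is easy to verify numerically on a small discrete example. The three distributions below are made up for illustration; `q_bar` plays the aggregated posterior, `p_a` a plain (uniform) prior, and `p_b` a closer-fitting prior.

```python
import math


def kl(q, p):
    """KL(q || p) for discrete distributions given as lists."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)


def expected_log_ratio(q, p_a, p_b):
    """E_q[log p_b - log p_a], the average ELBO gain of swapping a -> b."""
    return sum(qi * (math.log(pb) - math.log(pa))
               for qi, pa, pb in zip(q, p_a, p_b))


q_bar = [0.7, 0.2, 0.1]          # aggregated posterior (toy)
p_a = [1 / 3, 1 / 3, 1 / 3]      # simple prior (toy)
p_b = [0.6, 0.3, 0.1]            # richer prior, closer to q_bar (toy)

lhs = expected_log_ratio(q_bar, p_a, p_b)
rhs = kl(q_bar, p_a) - kl(q_bar, p_b)
```

Here `lhs` and `rhs` agree up to floating-point error, and both are positive because `p_b` sits closer to `q_bar` than `p_a` does.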

PAGES 08–09 · §3.3.2 THEORETICAL ADVANTAGE

When does the latent bottleneck help?

The paper is refreshingly honest: diffusion and continuity guarantee nothing. Cola DLM wins only when the data has a specific shape — low-dimensional global semantics + high-dimensional local realization. This is made precise by a unified statistical-burden criterion and three governing curves.

B · EQ 3.29–3.31 — TOTAL STATISTICAL BURDEN

$$ \mathcal{E}(\mathcal{M}):=\inf_{p\in\mathcal{M}}\KL(p_{\text{data}}\|p) $$

$$ R_{\text{ColaDLM}}=\mathcal{E}(\mathcal{M}_{\text{ColaDLM}})+\inf_{\phi,\theta,\psi}G^{\text{ColaDLM}}_{\text{infer}} $$

ℰ(ℳ)	approximation error: the best the model family could ever do — irreducible mismatch with the truth.
G_infer	inference gap: extra cost from using a variational bound (the encoder's imperfection). AR has none of this.
R	total burden = how wrong the family is + how lossy its training objective is.

B · PROP 3.1 + EQ 3.32 — THE VERDICT

$$ \text{Cola DLM}\succ\text{AR}\iff R_{\text{ColaDLM}}<\mathcal{E}(\mathcal{M}_{\text{AR}}) $$

≻	"is better than" at the population level — i.e. lower total statistical burden.
R_ColaDLM	Cola's total burden = approximation error + inference gap (from Eq 3.29–3.31).
ℰ(ℳ_AR)	AR's only cost — its approximation error (AR reads exact likelihood, so its inference gap is 0).
⟺	"if and only if" — a strict, falsifiable condition, not a heuristic.

Plain English: AR pays only its approximation error (it reads off exact likelihood). Cola pays approximation error plus an inference gap. So Cola beats AR iff its richer latent model shrinks the approximation error by more than the inference gap costs. A real, falsifiable bar — not hand-waving.
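As a worked instance of the criterion, the comparison reduces to one addition and one inequality. The function name is hypothetical, and the numbers are the default slider values of the burden-race widget (arbitrary units), not measurements from the paper.

```python
def cola_beats_ar(approx_err_ar, approx_err_cola, inference_gap):
    """Eq 3.32 as arithmetic: Cola wins iff its total burden is below AR's."""
    total_cola = approx_err_cola + inference_gap
    return total_cola < approx_err_ar, total_cola


# Widget defaults: AR approx error 55, Cola approx error 32, inference gap 14.
wins, total = cola_beats_ar(55, 32, 14)
```

With these defaults Cola's total is 46 against AR's 55, so the condition holds; raising the inference gap above 23 would flip the verdict.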

C · THE BURDEN RACE — Eq 3.32 LIVE

Cola wins iff its total burden ℰ(ℳ_Cola)+𝒢_infer falls below AR's ℰ(ℳ_AR). Tune the three costs; the white line is AR's bar — Cola must finish left of it.

Readouts: AR (approximation error only) vs Cola (approximation error + inference gap). Default slider values: ℰ(ℳ_AR) = 55, ℰ(ℳ_Cola) = 32, 𝒢_infer = 14.

C · IS COLA DLM ADVANTAGEOUS? (Eq 3.35 LIVE)

Drag the data's rate-distortion curve D(R) and the two other knobs. The verdict lights up only when all three conditions hold simultaneously.

D(R) = min achievable reconstruction cost when the latent may transmit ≤ R nats about the text (Eq 3.33). A curve that drops fast at low R = data has a cheap, informative global summary.

Default settings: how fast D(R) falls at low R: steep · ℰ(ℳ_Cola) decreasing? yes · inference gap controllable? yes

B · EQ 3.34 — THE STRUCTURED-GENERATION ASSUMPTION

$$ p_{\text{data}}(x)=\int p^\star(x\mid g)\,p^\star(g)\,dg,\qquad H(X\mid G)\ll H(X),\quad \dim(G)\ll\dim(E(X)) $$

G, p^⋆(g)	a hypothesized global factor (topic, plan, style) and its true distribution.
p^⋆(x\|g)	the true mechanism that realizes the global factor into concrete text.
H(X\|G) ≪ H(X)	knowing G removes most of the text's uncertainty — i.e. G is highly informative.
dim(G) ≪ dim(E(X))	and G is low-dimensional vs the full embedded text — a cheap summary.

Plain English: suppose there's a small global variable G (topic, plan, style) that mostly determines the text. If knowing G removes most of the uncertainty (H(X|G)≪H(X)) and G is low-dimensional, then Cola's factorization matches the true generative mechanism: the prior models G, the decoder realizes it. That's exactly when the bottleneck helps rather than hurts.

CURVE 1

Rate-distortion D(R) — is there a low-rate sufficient representation?

CURVE 2

Prior approximation — can the DiT actually fit that prior?

CURVE 3

Inference gap G_infer — is the encoder good enough?

Summary (paper's own). The central advantage of Cola DLM is not denoising itself, but the latent decomposition that separates text modeling into a global prior and a conditional realization process. The four experiments now test whether real text actually has this structure.

D · APPENDIX D — DERIVATION The "three curves" made rigorous — statistical burden, the rate-distortion bound, and exactly when the bottleneck backfires ▸

The page asserts "Cola wins only when the data has a certain shape." Appendix D turns that into inequalities you can check, via a single population-level accounting.

① One ledger for every model: total statistical burden (D.5–D.9)

$$ \E[-\mathcal{L}_{\mathrm{ELBO}}]=H(p_{\text{data}})+\KL(p_{\text{data}}\|p_{\theta,\psi})+\underbrace{\mathcal{G}^{\text{infer}}_{\text{Cola}}}_{\ge 0},\qquad \mathfrak{R}_{\text{Cola}}:=\mathcal{E}(\mathcal{M}_{\text{Cola}})+\inf\,\mathcal{G}^{\text{infer}}_{\text{Cola}} $$

Every family's population risk is $H(p_{\text{data}})+\text{model mismatch}+\text{objective gap}$. Cola pays an extra inference gap $\mathcal{G}^{\text{infer}}=\E\,\KL(q_\phi\|p_{\theta,\psi}(z_0\mid x))\ge 0$ that AR (exact NLL) never pays. So the clean verdict (D.9): Cola beats AR iff $\mathfrak{R}_{\text{Cola}}<\mathfrak{R}_{\text{AR}}$ — superiority is never automatic from "more machinery."

② The rate-distortion curve behind Curve 1 (D.13–D.14)

The mutual-information identity $H_q(X\mid Z_0)=H(p_{\text{data}})-I_q(X;Z_0)$ turns the reconstruction floor into a rate problem:

$$ \E[-\log p_\theta(x\mid z_0)]\ge H(p_{\text{data}})-I_q(X;Z_0),\qquad \mathcal{D}(R):=\!\!\inf_{q:\,I_q(X;Z_0)\le R}\inf_{p_\theta}\E[-\log p_\theta(x\mid z_0)] $$

Plain English: $\mathcal{D}(R)$ is the best reconstruction you can buy if the latent is allowed at most $R$ nats. If it drops fast at small $R$, a cheap global summary exists → the bottleneck helps. If you only get reconstruction near $R\!\approx\!H(X)$, the data is near-incompressible → the bottleneck is pure overhead. This is the slider in the widget above.
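The identity $H_q(X\mid Z_0)=H(p_{\text{data}})-I_q(X;Z_0)$ that drives this bound can be checked on a toy joint distribution. The four "texts" and the binary latent below are invented for illustration; the conditional entropy and the mutual information are each computed from their own definitions and then compared.

```python
import math

# Toy joint p(x, z): four texts, a binary latent that picks a group.
joint = {("a", 0): 0.4, ("b", 0): 0.1, ("c", 1): 0.1, ("d", 1): 0.4}


def marginal(joint, axis):
    out = {}
    for key, p in joint.items():
        out[key[axis]] = out.get(key[axis], 0.0) + p
    return out


def entropy(dist):
    return -sum(p * math.log(p) for p in dist.values() if p > 0)


p_x, p_z = marginal(joint, 0), marginal(joint, 1)

# I(X;Z) from its definition.
mutual_info = sum(p * math.log(p / (p_x[x] * p_z[z]))
                  for (x, z), p in joint.items() if p > 0)

# H(X|Z) from its definition: average entropy of p(x | z).
cond_entropy = 0.0
for z_val, pz in p_z.items():
    cond = {x: p / pz for (x, z), p in joint.items() if z == z_val}
    cond_entropy += pz * entropy(cond)

reconstruction_floor = entropy(p_x) - mutual_info  # in nats
```

In this toy case one binary latent (ln 2 nats) removes more than half of the text entropy, which is the "fast drop at small R" regime described above.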

③ When it matches the true mechanism — and when it backfires (D.15–D.19)

If $p_{\text{data}}(x)=\int p^\star(x\mid g)p^\star(g)\,dg$ with $H(X\mid G)\ll H(X)$ and $\dim G\ll \dim E(X)$, Cola's inductive bias is the data's structure: it splits one hard problem into "learn $p^\star(g)$ + learn $p^\star(x\mid g)$" (D.17). Where that fails, the three explicit costs (D.18) — inference gap, the elevated reconstruction floor $H(X\mid Z_0)$ from the bottleneck, and joint-training complexity — dominate. And the variational gap is always present (D.19): $\log p(x)-\mathcal{L}_{\mathrm{ELBO}}=\KL(q_\phi\|p(z_0\mid x))$. Success is a competition among three curves — $\mathcal{D}(R)$, prior-approximation, and the inference gap — and only when all three favor Cola is the decomposition a real advantage.

PAGES 10–11 · §4.2 RQ1

RQ1 · Does a global semantic structure exist within the latent space?

Catching invisible structure with a timeshift

SCALE

VAE 500M + DiT 1.8B

≈2B total — matched against AR (LLaMA) & LLaDA with ~1.8B non-embedding backbones.

RECIPE

OLMo 2 tokenizer · AdamW

LR 1e-6 → warm to 1.5e-4 (5k steps) → cosine to 1e-5 by 1M steps. No EMA. Seq len 512.

EVAL

Unified few-shot

Strict string-match accuracy across multiple-choice & generative tasks — because PPL ≠ quality (§5.1).

BENCHMARKS

8 datasets

LAMBADA, MMLU, SIQA (internal) + SQuAD, Story Cloze, OBQA, RACE, HellaSwag (external).
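The learning-rate recipe in the RECIPE card can be written as a small schedule function. This is a sketch under two assumptions that the card does not state: the warm-up is linear, and the cosine decay starts right after warm-up and is measured over the remaining steps up to 1M.

```python
import math

WARMUP_STEPS = 5_000
TOTAL_STEPS = 1_000_000
LR_START, LR_PEAK, LR_END = 1e-6, 1.5e-4, 1e-5


def learning_rate(step):
    """Warm up from LR_START to LR_PEAK, then cosine-decay to LR_END."""
    if step < WARMUP_STEPS:
        frac = step / WARMUP_STEPS          # assumed linear warm-up
        return LR_START + frac * (LR_PEAK - LR_START)
    progress = min(1.0, (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return LR_END + cosine * (LR_PEAK - LR_END)
```

Calling `learning_rate(step)` once per optimizer step and writing the result into the AdamW parameter groups would reproduce the shape described in the card.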

IMPLICATION 1 · THE CONTRAPOSITIVE

"If the latent representation is purely local and fully separable, then the optimal timeshift does not drift as the latent dimension changes. Therefore, if the optimal timeshift is observed to shift systematically with the latent dimension, this indicates the existence of cross-dimensional shared structures — and if it shows up in semantic metrics, those structures relate to high-level semantics."

Plain English: you can't see "global semantics" directly. So the paper sets a trap. Timeshift = where on the noise schedule the model spends its denoising effort. If the latent were just independent per-dimension noise, the best timeshift would be fixed regardless of dimension d. If it drifts with d, something is shared across dimensions — that's the fingerprint of global structure.

C · FIGURE 2, LIVE — TIMESHIFT DRIFT

Normalized task-average score vs. timeshift loc. The peak is the optimal timeshift for that latent dimension — watch it march right as d grows.

Best loc by metric (Fig 2, right panel)

Metric	d=16	d=64	d=128
Task Avg	1.0	1.7	2.3

❶ Systematic drift

Best loc for Task Avg shifts 1.0 → 1.7 → 2.3 as d = 16 → 64 → 128. Clear, near-monotonic. Directly contradicts the separable null hypothesis.

❷ Consistent across metrics

LAMBADA, MMLU, SIQA & Task Avg all favor larger loc at higher d. Not a single-task fluke — a structure shared across semantic tasks.

❸ Matches theory

Empirical peaks sit close to the Appendix-E predicted positions (dashed lines), drift directions fully consistent. Not a hyperparameter accident.

Verdict. Implication 1's contrapositive is satisfied → strong evidence of shared, semantically-relevant global structure in Cola's latent space. This also supports the first condition of Eq 3.35 — the data does have low-dimensional global semantics.

D · APPENDIX E — DERIVATION The falsifiable trap, the proof that drift refutes it, and where the $\delta^\star(d)=a\log d+b$ law comes from ▸

Implication 1 is a contrapositive — to wield it you need (a) a precisely stated null hypothesis, (b) a theorem that the null forbids drift, and (c) a structural model that predicts the shape of the drift when the null fails. Appendix E supplies all three.

① The null hypothesis — "purely local & separable" (Assumption E.1, E.1–E.2)

Suppose the objective decomposes additively over independent, identically-behaving latent dimensions, with a shift-response of identical functional form:

$$ \mathcal{J}_d(\delta)=\sum_{i=1}^d j_i(\delta)\quad\Rightarrow\quad \mathcal{J}_d(\delta)=a_d\,j(\delta)+b_d,\quad a_d>0 $$

That is the formal meaning of "no shared structure": dimension $d$ only rescales/offsets the same per-dimension curve $j(\delta)$.

② The theorem: under the null, the optimal shift cannot move (Prop E.2 → Cor E.3)

$$ \delta_d^\star=\arg\max_\delta\mathcal{J}_d(\delta)=\arg\max_\delta\big[a_d\,j(\delta)+b_d\big]=\arg\max_\delta j(\delta)\quad\text{(independent of }d\text{)} $$

Plain English: a positive rescale $a_d$ and a constant offset $b_d$ never move the location of a maximum. So if the latent were truly separable, the best timeshift would be pinned across $d$. The contrapositive (Cor E.3): an observed stable, monotonic, reproducible drift — not explainable by parameter count, under-training, or noise — rejects the null. The drift in the widget above is that rejection.
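The argument can be seen directly with a grid search. The response curve `j` below is a made-up concave function peaking at δ = 1.0, and the `(a_d, b_d)` pairs are arbitrary; the point is only that a positive rescale plus an offset leaves the maximizer where it was.

```python
def j(delta):
    # Hypothetical per-dimension response curve, peaked at delta = 1.0.
    return -(delta - 1.0) ** 2


GRID = [i / 10 for i in range(0, 31)]   # candidate shifts 0.0 .. 3.0


def best_shift(objective):
    return max(GRID, key=objective)


def separable_objective(a_d, b_d):
    # The null hypothesis: J_d(delta) = a_d * j(delta) + b_d with a_d > 0.
    return lambda delta: a_d * j(delta) + b_d


peaks = [best_shift(separable_objective(a_d, b_d))
         for a_d, b_d in [(16, 0.0), (64, -3.0), (128, 5.0)]]
```

All three entries of `peaks` coincide, so any systematic movement of the optimum with d has to come from structure the separable form cannot express.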

③ Information-theoretic restatement: the schedule controls $I(s;z_t)$ (E.5–E.8)

Write the forward process $z_t=\alpha_t z+\sigma_t\epsilon$ and decompose the latent into a semantic signal plus residual, $z=\phi(s)+u$. Then what reaches the DiT is $z_t=\alpha_t\phi(s)+(\alpha_t u+\sigma_t\epsilon)$ — so what matters is not the raw timestep but how much information about $s$ survives. Under the separable null, $I(s;z_t)=\sum_i I(s_i;z_{t,i})$ and varying $d$ only rescales it — again no shift in the optimal regime.

④ The shared-factor model that predicts the log-law (E.9–E.13)

Let many dimensions observe one low-dimensional shared factor, $z_i=A_i g+\xi_i$. Standard linear-Gaussian inference then gives a recovery SNR that grows with $d$, and a recoverable-information that grows logarithmically:

$$ \mathrm{SNR}_{\text{eff}}(d)\propto d,\qquad I(g;z_t)\approx\tfrac r2\log\!\big(1+c\,d\,\mathrm{SNR}_{\text{eff}}(t)\big)\;\;\Rightarrow\;\;\boxed{\,\delta^\star(d)=a\log d+b\,}$$

More dimensions watching the same factor ⇒ stronger effective SNR ⇒ the shift must compensate logarithmically to keep training in the same semantic-recovery regime. This is the dashed "Appendix-E prediction" line the widget plots — and it is structurally homologous to the resolution-dependent timestep shift in Stable Diffusion (Remark E.4).
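As an illustration of the log-law's shape, one can fit $\delta^\star(d)=a\log d+b$ by least squares to the three Task-Avg peaks quoted on this page (loc 1.0, 1.7, 2.3 at d = 16, 64, 128). This is a sketch, not the paper's fitting procedure, and the resulting coefficients are not values reported by the paper.

```python
import math

# Best loc for Task Avg at each latent dimension, as quoted on this page.
observed = {16: 1.0, 64: 1.7, 128: 2.3}


def fit_log_law(points):
    """Ordinary least squares for loc = a * log(d) + b."""
    xs = [math.log(d) for d in points]
    ys = list(points.values())
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    a = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    b = mean_y - a * mean_x
    return a, b


a, b = fit_log_law(observed)


def predicted_loc(d):
    return a * math.log(d) + b
```

With only three points the fit is weak evidence on its own; it is useful mainly as a check that the observed drift is compatible with a positive slope in log d.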

⑤ Why the VAE logSNR also moves the optimum (E.14–E.15)

Even at fixed $d$, lowering the VAE logSNR raises posterior variance $\Sigma_u$, so the total noise seen by the semantic variable is $\Sigma_{\text{noise}}(t)=\alpha_t^2\Sigma_u+\sigma_t^2 I$. A smoother latent ⇒ the same raw timestep corresponds to a lower effective semantic SNR ⇒ the shift must be recalibrated. Latent dimension and VAE logSNR look like two different knobs but act on one object: the effective mutual-information curve $I(s;z_t)$ along diffusion time. (This is the bridge to the noise-schedule deep-dive on the RQ3 page.)

PAGES 11–15 · §4.3 RQ2

RQ2 · What type of latent space is optimal for text generation?

The latent should evolve — from a stable start

Three sub-questions: should the latent be fixed or evolving? What dimensionality? How much semantic smoothness? The headline: neither frozen nor trained-from-scratch — let it co-evolve with the DiT on top of a good initialization, keep it semantically smooth (BERT loss + learnable logSNR), and bigger latent dims carry more semantics.

C · FIGURE 3 — FIXED vs EVOLVING (toggle strategies)

❶ Joint DiT ×1 has the strongest scaling — Fix VAE saturates early; continuous co-adaptation lifts the ceiling.

❷ Benefit comes from good initialization, not trainability alone — All Scratch stays poor throughout.

❸ Weak updates (×0.01) & periodic freezing (Interval) both lag — evolve continuously & strongly.

C · FIGURE 4 — LATENT GEOMETRY

Table 2 · Latent dimensionality (117 EFLOPs, all-scratch, loc=1)

Larger latent dims raise the overall average — more semantic capacity.

Method	Lambada	MMLU	SIQA	Avg
d=16	14.3	6.9	4.9	8.7
d=64	20.9	5.4	7.6	11.3
d=128	18.5	8.1	8.9	11.8

Avg climbs 8.7 → 11.3 → 11.8. Bigger ≠ pure win though: it partly fixes collapse but also shifts the noise calibration (why timeshift drifts in RQ1).

Table 3 · VAE logSNR (smoothness)

Learnable logSNR (≈4.5) has the best Avg at both budgets; fixed 1.5 is the best fixed setting and leads on SIQA.

logSNR	SIQA (77.86 EF)	Avg (77.86 EF)	SIQA (116.78 EF)	Avg (116.78 EF)
Fixed 1.0	11.3	14.7	18.4	18.8
Fixed 1.5	17.5	18.3	23.6	21.8
Fixed 2.0	14.3	16.8	19.5	20.6
Learnable	16.2	18.9	21.6	22.1

BERT loss (Fig 5): adding a masked-token loss with the VAE trained at the full learning-rate ratio (lr ratio = 1) consistently beats no-BERT, because masked-token recoverability keeps the latent semantically useful. Smoothness matters most when the latent actively evolves.

PAGES 16–19 · §4.4 RQ3

RQ3 · Which diffusion process is most effective for text generation?

Tuning the denoiser: block size, schedule, steps, guidance

Four knobs decide how good the prior gets. The winning recipe: block size 16, noise schedule loc=1.0, ~16–32 denoising steps, and a moderate CFG ≈7. Each is a "Goldilocks" setting: too little or too much hurts.

C · FIGURE 9 — INFERENCE DIALS

Drag denoising steps (saturating gain) & CFG scale (inverted-U). The dashed line is the paper's reference Task Average.

denoising steps: 16

1→128 (log). Most gain by 8–10 steps → with block size 16, that's a 1.6–2.0× generation-depth cut vs AR.

CFG scale: 7.0

Peaks at ~3–7; beyond ≈10 guidance distorts the denoising trajectory. CFG=60 collapses to ~10.
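The 1.6–2.0× figure quoted for the denoising-steps dial follows from simple counting. The sketch below assumes one latent position per token and ignores the encoder and decoder passes, neither of which is stated here; under those assumptions AR needs one sequential step per token, while Cola needs `denoise_steps` sequential steps per block.

```python
def generation_depth_ratio(seq_len, block_size, denoise_steps):
    """AR sequential depth divided by Cola sequential depth (toy accounting)."""
    ar_depth = seq_len                               # one step per token
    num_blocks = -(-seq_len // block_size)           # ceiling division
    cola_depth = num_blocks * denoise_steps          # steps per block, in series
    return ar_depth / cola_depth


ratio_8_steps = generation_depth_ratio(seq_len=512, block_size=16, denoise_steps=8)
ratio_10_steps = generation_depth_ratio(seq_len=512, block_size=16, denoise_steps=10)
```

For full blocks the ratio is just block size over step count, which is why 8 to 10 steps at block size 16 gives the quoted range.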

C · FIGURE 6 — DiT BLOCK SIZE

Block size 16 wins. Size 1 (fully causal) is competitive but weaker — some local grouping helps. Sizes 64/128 degrade badly on MMLU/SIQA: coarse partitioning weakens semantic interactions.

IMPLICATION 2 · SCHEDULE = SEMANTIC CALIBRATION

"If the schedule location shifts the logSNR curve, then it also shifts the effective semantic-information regime the DiT sees during denoising. The best noise schedule is the one whose logSNR trajectory is best aligned with the latent space and the semantic scale to be recovered — not a universally fixed timestep parameterization."

Plain English: the noise schedule isn't a throwaway hyperparameter. It decides where on the noise axis the model spends its effort. loc=1.0 parks that effort in the regime where semantics actually live → best results (Fig 7,8). It's the same "core object" as logSNR, latent dim, and timeshift drift from RQ1.

BLOCK 16

Best balance of local capacity & semantic aggregation, both checkpoints.

LOC 1.0

Highest Task Avg; especially clear gains on MMLU & SIQA. Uniform schedules lag.

~16–32 STEPS

Big early gains 1→8, then flat. More is not better.

CFG ≈ 7

Moderate guidance optimal; excessive guidance severely degrades.

D · APPENDICES G, H.7 & H.9 — DERIVATION What "timeshift" really is: noise-schedule ⟺ logSNR, the two ways logSNR enters the FM loss, and the LogitNormal knob ▸

Implication 2 says "the schedule calibrates the semantic-information regime." Appendix G proves the schedule is not an external hyperparameter at all — it is baked into the training geometry — and H.7/H.9 give the exact quantities the experiment dials.

① Noise schedule and logSNR are the same object (G.1–G.8)

$$ \lambda(t):=\log\frac{\alpha_t^2}{\sigma_t^2},\qquad \alpha_t^2=\mathrm{sigmoid}(\lambda(t)),\ \sigma_t^2=\mathrm{sigmoid}(-\lambda(t))\quad\Rightarrow\quad \text{schedule}\iff\text{logSNR curve} $$

Specifying $\lambda(t)$ fixes $(\alpha_t,\sigma_t)$ and vice-versa. A timeshift $\lambda_\delta(t)=\lambda(t)+\delta$ therefore doesn't reweight a loss after the fact — it re-maps the same raw timestep to a different logSNR interval.
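A few lines make the equivalence tangible. The linear curve `base_logsnr` is a hypothetical schedule chosen only for illustration; the sigmoid relations are the ones in the equation above.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def schedule_from_logsnr(lam):
    """Return (alpha_t^2, sigma_t^2) for a given logSNR value."""
    return sigmoid(lam), sigmoid(-lam)


def base_logsnr(t):
    # Hypothetical schedule: clean at t=0 (logSNR 4), noisy at t=1 (logSNR -4).
    return 4.0 - 8.0 * t


def shifted_logsnr(lam_fn, delta):
    """Timeshift: the same raw timestep now maps to logSNR + delta."""
    return lambda t: lam_fn(t) + delta


alpha2, sigma2 = schedule_from_logsnr(base_logsnr(0.25))
recovered = math.log(alpha2 / sigma2)          # should equal base_logsnr(0.25)
shifted = shifted_logsnr(base_logsnr, 1.0)
```

Because `alpha2 + sigma2` is always 1 and the log-ratio recovers λ, choosing either the schedule or the logSNR curve determines the other.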

② logSNR enters the Flow-Matching loss in two ways (G.14–G.22)

Change variables $t\to\lambda$ in the FM objective. The uniform-$t$ measure pushes forward to a non-uniform measure on the logSNR axis, and the supervised target velocity rescales:

$$ \mathcal{L}_{\mathrm{FM}}=\int w_{\text{eff}}(\lambda)\,\E\big[\|\tilde v_\psi(z_\lambda,\lambda)-u_\lambda\|^2\big]\,d\lambda,\quad w_{\text{eff}}(\lambda)=\Big|\tfrac{dt}{d\lambda}\Big|,\quad u_t=\dot\lambda(t)\,u_\lambda $$

Plain English: shifting the schedule changes (i) which noise regimes get sampled most, and (ii) how hard the regression target is in each regime. So uniform-timestep training is not equivalent to uniform-logSNR training unless $\lambda(t)$ is affine (Prop G.1). The schedule is part of the objective, not a knob bolted on top.

③ What is actually calibrated: the semantic-information curve (G.31–G.34)

$$ I(s;z_t)=\tfrac12\log\det\!\big(I+\alpha_t^2\Sigma_s(\alpha_t^2\Sigma_u+\sigma_t^2 I)^{-1}\big),\qquad \delta^\star=\arg\max_\delta \mathrm{Perf}\big(I_{\text{eff},\delta}(t;d,\Sigma_u,\mathcal{G},B,\vartheta)\big) $$

The schedule controls the curve $t\mapsto I(s;z_t)$. Choosing the timeshift is therefore an effective-semantic-information calibration problem — it depends jointly on latent dimension $d$, posterior uncertainty $\Sigma_u$, latent geometry $\mathcal{G}$, and block size $B$. (Remark G.3: block size has no closed-form law but couples to the schedule through the same curve — which is why block 16 and loc 1.0 are co-selected.)
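In one dimension the determinant formula collapses to a single logarithm, which makes the two dependencies easy to probe. The variances below are arbitrary illustrative values, and the scalar reduction is a simplification of the matrix expression above.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def semantic_information(logsnr, var_s=1.0, var_u=0.1):
    """Scalar case of I(s; z_t): 0.5 * log(1 + a2*var_s / (a2*var_u + s2))."""
    alpha2, sigma2 = sigmoid(logsnr), sigmoid(-logsnr)
    return 0.5 * math.log(1.0 + alpha2 * var_s / (alpha2 * var_u + sigma2))


info_clean = semantic_information(logsnr=2.0)
info_noisy = semantic_information(logsnr=-2.0)
info_sharp_posterior = semantic_information(logsnr=0.0, var_u=0.1)
info_smooth_posterior = semantic_information(logsnr=0.0, var_u=1.0)
```

Higher logSNR lets more semantic information through, and a larger posterior variance `var_u` lets less through at the same timestep, which is the sense in which both knobs act on one curve.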

④ The two quantities the experiment dials (H.7, H.9)

$$ \text{logSNR}_{\text{vae}}=\log\frac{\E_{x,i}[\mu_{\phi,i}(x)^2]}{\E_{x,i}[\sigma_{\phi,i}(x)^2]}\qquad\text{(H.2)};\qquad t=T\cdot\mathrm{sigmoid}(u),\ u\sim\Nrm(\mu,\sigma^2)\ \text{(H.7)} $$

The VAE logSNR (H.7) is the signal-to-noise of the encoder posterior — larger ⇒ cleaner, more deterministic latent. The timestep shift is implemented as a LogitNormal sampler: $s=t/T\sim\mathrm{LogitNormal}(\mu,\sigma^2)$, density $p(s)=\frac{1}{\sigma\sqrt{2\pi}}\frac{1}{s(1-s)}\exp\!\big(-\frac{(\log\frac{s}{1-s}-\mu)^2}{2\sigma^2}\big)$ (H.9). Larger $\mu$ pushes sampling mass toward later timesteps; $\sigma$ controls how concentrated it is. That is precisely the "loc" the widget sliders and Fig 17 visualize — a reshaping of which logSNR regime is emphasized, not a numeric reweighting.
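The LogitNormal sampler and its density are short enough to write out. This is a minimal standard-library sketch of the two formulas above; the function names and default values are assumptions made for illustration.

```python
import math
import random


def sample_timestep(rng, mu=0.0, sigma=1.0, T=1.0):
    """t = T * sigmoid(u) with u ~ N(mu, sigma^2)."""
    u = rng.gauss(mu, sigma)
    return T / (1.0 + math.exp(-u))


def logit_normal_pdf(s, mu=0.0, sigma=1.0):
    """Density of s = t/T on (0, 1), as in the formula above."""
    logit = math.log(s / (1.0 - s))
    norm = sigma * math.sqrt(2.0 * math.pi) * s * (1.0 - s)
    return math.exp(-((logit - mu) ** 2) / (2.0 * sigma ** 2)) / norm


rng = random.Random(0)
mean_loc0 = sum(sample_timestep(rng, mu=0.0) for _ in range(5000)) / 5000
mean_loc1 = sum(sample_timestep(rng, mu=1.0) for _ in range(5000)) / 5000
```

Raising `mu` from 0 to 1 moves the average sampled timestep upward, matching the statement that a larger loc pushes sampling mass toward later timesteps.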

PAGES 20–21 · §4.5 RQ4

RQ4 · Why use a continuous latent diffusion model for language modeling?

Does it scale? Against matched AR & LLaDA — yes

The decisive test. Under the best config (d=16, block 16, joint training lr-ratio 1, BERT loss, loc=1, 16 steps, CFG=7), Cola DLM is compared to strictly-matched AR (LLaMA) and LLaDA — both with 1.8B non-embedding backbones, same data, up to ~2000 EFLOPs. The result: strong, persistent scaling, best final Task Average.

C · FIGURE 10 — SCALING EXPLORER

Compute budget slider (default 1000 EFLOPs), with readouts for AR, LLaDA, and Cola at the selected budget.

❶ Among the strongest overall trends

On Task Average, Cola improves steadily and reaches the best final score. AR is competitive at small budgets; Cola rises more persistently into the high-compute regime.

❷ Especially clear on reasoning & global-semantic tasks

On MMLU, RACE, Story Cloze, OBQA, a strong upward trend and best/near-best performance — exactly the tasks needing global semantic organization.

❸ Encouraging on generative tasks

On LAMBADA, tracks AR closely. On SQuAD, a clear gain with scale — eventually surpasses AR and approaches LLaDA's strong region.

❹ A conservative estimate

This is a restrained config (d=16). RQ2 showed d→128 adds capacity; logSNR analysis shows more headroom. The real ceiling is higher than shown.

Note on absolute numbers. Multiple-choice scores look low because everything is cast into a unified few-shot generative protocol (not likelihood-based classification) — for fairness, and because PPL ≠ generation quality (next section). The relative scaling trends are what matter, and they robustly favor Cola.

PAGES 22–24 · §5.1 DISCUSSION

Why perplexity lies about a latent model

A central, counter-intuitive phenomenon: generation can already be good while likelihood-oriented PPL stays terrible. They measure different things. Generation only needs the prior's mass to reach semantically decoder-valid regions. Likelihood additionally needs accurate local density calibration right around the gold posterior.

B · EQ 5.1 — THE CONDITIONAL MARGINAL

$$ p(x^{\text{res}}\mid c)=\int p_\theta(x^{\text{res}}\mid z,c)\,p_\psi(z\mid c)\,dz $$

x^res	the response being scored.
c	the conditioning context induced by the prefix/prompt.
p_θ(x^res\|z,c)	decoder likelihood of the response given a latent and context.
p_ψ(z\|c)	the conditional prior over latents given context.
∫ … dz	marginalize over all latents — the exact conditional probability (what generation needs).

Plain English: the true probability of a response given context c sums over all latents the prior might produce. For good generation, you just need some high-prior latent z to land where the decoder writes valid text.

B · EQ 5.2 — THE ACCESSIBLE LOCAL SCORE

$$ \mathcal{S}_{\text{resp}}(x)=\E_{q_\phi(z|x,c)}\!\big[\log p_\theta(x^{\text{res}}|z,c)+\log p_\psi(z|c)-\log q_\phi(z|x,c)\big] $$

𝒮_resp(x)	the accessible local score — the ELBO-style / PPL proxy actually evaluated.
𝔼_{q_φ(z\|x,c)}	averaged only over the encoder posterior for the gold text — a narrow neighborhood.
log p_θ(x^res\|z,c)	decoder term (reconstruction of the gold response).
log p_ψ(z\|c) − log q_φ	prior minus encoder — the local calibration term that PPL is sensitive to.

The mismatch: this PPL-style score is computed only near the encoder's posterior for the gold text. It demands the prior be precisely calibrated there — a much harsher requirement than "reach a valid region somewhere."
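A one-dimensional toy shows how the two requirements can disagree. All numbers are invented: the "decoder-good region" is the interval [-2, 2], the gold latent sits at 1.5, and the prior is a single Gaussian. Coverage stands in for what generation needs, and log-density at the gold point stands in for what the PPL-style score rewards.

```python
import math


def normal_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def coverage(mu, sigma, lo=-2.0, hi=2.0):
    """Prior mass inside the (toy) decoder-good interval."""
    return normal_cdf(hi, mu, sigma) - normal_cdf(lo, mu, sigma)


def log_density_at_gold(mu, sigma, gold=1.5):
    """Log prior density at the (toy) gold latent."""
    return (-0.5 * math.log(2.0 * math.pi * sigma ** 2)
            - (gold - mu) ** 2 / (2.0 * sigma ** 2))


# Prior A: tight, inside the good region, but far from the gold latent.
cov_a, logp_a = coverage(-1.0, 0.3), log_density_at_gold(-1.0, 0.3)
# Prior B: same width, centered on the gold latent near the region's edge.
cov_b, logp_b = coverage(1.5, 0.3), log_density_at_gold(1.5, 0.3)
```

Prior A covers the good region almost entirely yet assigns the gold latent a very low density, while prior B scores far better at the gold point with slightly less coverage: the two metrics rank the priors in opposite order.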

C · COVERAGE (Eq 5.1) vs CALIBRATION (Eq 5.2) — WHY PPL LIES

Move the prior around the latent plane. Generation (Eq 5.1) only needs prior mass to land anywhere in the broad decoder-good region. PPL (Eq 5.2) needs prior density piled on the narrow gold tube. Watch the two metrics disagree.

Controls: prior center, prior spread σ. Readouts: GENERATION (coverage), PPL (calibration).

C · FIGURE 11 — LATENT GEOMETRY AROUND A TOKEN

Legend: decoder-valid neighborhood · posterior cloud (gold) · prior cloud · ★ reference latent

IMPLICATION 3

"Good generation & good likelihood-oriented estimation are not equivalent. Generation depends on whether the prior reaches semantically valid latent regions; likelihood additionally depends on local density calibration around the gold posterior neighborhood."

IMPLICATION 4

"Generation quality relates to semantic smoothness of the latent space; likelihood-oriented PPL is more sensitive to probability-space smoothness shaped by the VAE logSNR. These two smoothnesses differ → generation and PPL need not align."

Table 4 evidence — lower PPL ≠ better generation

For the token "at", likelihood-derived PPL improves dramatically 1.15×10⁶ → 641.57 → 245.36 across logSNR settings — yet the generated token degrades from a sensible "on" to a comma. For "her", smaller PPL under fixed logSNR fails to recover the correct token. Direct training has much worse PPL but sometimes preserves the right semantic behavior. Flatter logSNR smooths the density (better PPL) but blurs semantics toward generic words like "in/the/went".

D · APPENDIX F — DERIVATION

Why a continuous latent LM can generate well yet score terrible PPL: the four theorems behind the paradox

The four implications above are proved in Appendix F, by separating two geometric objects: a broad "decoder-good region" that generation needs to reach, and a narrow "gold tube" that PPL needs to calibrate.

① Flow Matching regresses the mean velocity, not a gold-specific density (Prop F.1)

$$ f^\star(z,t,c)=\E\big[U^\star\mid Z_t=z,t,c\big] $$

The squared FM loss has a unique optimum: the conditional-mean velocity. When the conditional response distribution is multimodal or broad-peaked, FM learns an average transport into a reasonable region — it never promises local density calibration around any one gold sample. That is the root cause.
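A small numerical check makes the "mean, not mode" point concrete. This is a minimal sketch assuming a single fixed (z, t, c) at which the valid target velocities are bimodal, half at -1 and half at +1; the numbers are illustrative and not taken from the paper.

```python
def mse(pred, targets):
    """Mean squared error of one constant prediction against all targets."""
    return sum((pred - u) ** 2 for u in targets) / len(targets)

# Toy assumption: two equally likely target velocities at the same (z, t, c).
targets = [-1.0] * 50 + [1.0] * 50

# Brute-force search over candidate predictions in [-1.5, 1.5].
candidates = [i / 100.0 for i in range(-150, 151)]
best = min(candidates, key=lambda v: mse(v, targets))
mean = sum(targets) / len(targets)

print(best, mean)                          # the minimiser coincides with the mean
print(mse(best, targets), mse(1.0, targets))
```

The squared loss is minimised at the average of the two modes, a velocity that neither mode actually uses. Predicting either mode exactly costs more.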

② Multimodality ⇒ prior mean is far from any gold latent (Cor F.2)

$$ p_\psi(z\mid c)=\sum_m \pi_m\Nrm(\mu_m,\Sigma_m),\quad \|\bar\mu_p(c)-\mu_{m^\star}\|\le\sum_{m\neq m^\star}\pi_m\|\mu_m-\mu_{m^\star}\| $$

If the context admits several valid continuations, the prior's global mean sits between the modes — far from the gold latent the posterior selected. Generation is still fine as long as the modes' mass lands in a decoder-good region.
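The bound follows in two steps, assuming $\bar\mu_p(c)$ denotes the mixture mean $\sum_m\pi_m\mu_m$ with $\sum_m\pi_m=1$ (the usual reading; the appendix may organize the steps differently):

$$ \bar\mu_p(c)-\mu_{m^\star}=\sum_m\pi_m(\mu_m-\mu_{m^\star})=\sum_{m\neq m^\star}\pi_m(\mu_m-\mu_{m^\star})\ \Longrightarrow\ \|\bar\mu_p(c)-\mu_{m^\star}\|\le\sum_{m\neq m^\star}\pi_m\|\mu_m-\mu_{m^\star}\| $$

With only two modes the triangle inequality is tight: $\|\bar\mu_p(c)-\mu_1\|=\pi_2\|\mu_2-\mu_1\|$. For example, two equally weighted modes that are 4 units apart leave the prior mean exactly 2 units from either one.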

③ Coverage vs calibration: two different sets (Prop F.3)

$$ A^\tau_{\text{good}}(c)=\{z:\E\,r(x^{\text{res}};c)\ge\tau\}\ \ \text{(broad)}\qquad A^\rho_x=\{z:q_\phi(z\mid x,c)\ge\rho\}\subseteq A^\tau_{\text{good}}\ \ \text{(narrow gold tube)} $$

Plain English: generation only needs the prior to drop an $\alpha$-fraction of mass somewhere in the big good region, which is a coverage requirement. Conditional PPL needs the prior to put high local density on the thin gold tube of one specific response, which is a calibration requirement. Prop F.3 shows that coverage can hold while calibration fails: the samples are good, yet $\mathcal{S}_{\text{resp}}\le B-\Delta$, so the PPL proxy can be biased by an arbitrarily large margin.

④ Even centered priors fail; and why AR/LLaDA don't (F.18–F.20, F.23–F.28)

$$ \mathcal{S}_{\text{resp}}=\underbrace{R(x;c)}_{\text{reconstruction}}-\underbrace{\KL(q_\phi\|p_\psi)}_{\text{posterior–prior gap}},\qquad \KL=\tfrac12\big[\mathrm{tr}(\Sigma_p^{-1}\Sigma_q)+\Delta\mu^\top\Sigma_p^{-1}\Delta\mu-d+\log\tfrac{\det\Sigma_p}{\det\Sigma_q}\big] $$

Good reconstruction $R\to R_{\max}$ does not imply good PPL if the KL gap stays positive (Prop F.4). And even if the centers align ($\mu_p\approx\mu_q$), the scale/orientation/volume terms in the Gaussian KL keep PPL poor (Prop F.5) — a too-sharp posterior amplifies this. AR is immune because training = the object PPL evaluates = the object generation uses (F.25): $-\log p^{\text{AR}}=\sum_i-\log p(x_i\mid x_{<i})$. Continuous latents add latent-integration, posterior–prior matching, and decoder compatibility on top — which is why PPL behaves like a density-calibration metric, not a generation-quality metric.
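The bracketed Gaussian KL can be evaluated directly. The sketch below restricts to diagonal covariances for brevity (an assumption; the formula above allows full matrices), and `gaussian_kl_diag` is a hypothetical helper name. It mirrors the four terms in order: trace, mean shift, dimension, log-determinant ratio.

```python
import math

def gaussian_kl_diag(mu_q, var_q, mu_p, var_p):
    """KL(q || p) for diagonal Gaussians, term by term as in the bracketed formula."""
    d = len(mu_q)
    trace = sum(vq / vp for vq, vp in zip(var_q, var_p))             # tr(Sigma_p^-1 Sigma_q)
    shift = sum((mp - mq) ** 2 / vp
                for mq, mp, vp in zip(mu_q, mu_p, var_p))            # mean-shift term
    logdet = sum(math.log(vp / vq) for vq, vp in zip(var_q, var_p))  # log det ratio
    return 0.5 * (trace + shift - d + logdet)

# Centres aligned, but the posterior is much sharper than the prior.
centred_sharp = gaussian_kl_diag([0.0, 0.0], [0.01, 0.01], [0.0, 0.0], [1.0, 1.0])
identical = gaussian_kl_diag([0.0, 0.0], [1.0, 1.0], [0.0, 0.0], [1.0, 1.0])
print(centred_sharp, identical)
```

Even with a zero mean shift, the scale and volume terms alone leave a gap of a few nats in two dimensions, which is the situation Prop F.5 describes.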

PAGE 24 · §5.2 FIRST-BLOCK CONDITIONING

The tricky first block: known prompt + unknown words

The very first generation block is mixed — it holds known prompt latents and latents to be generated. How you treat the known part decides everything. Four strategies were tried; one dominates: keep the known region clean and fixed throughout denoising.

C · FIGURE 12 — PICK A STRATEGY

Table 5 · Impact of first-block conditioning (avg accuracy)

Task	Repaint t=1 (m=1/.7/.3)	Repaint t=3 (m=1/.7/.3)	Clean cond.	Left pad	Right pad
Lambada	8.5/8.5/6.6	7.0/7.3/5.6	37.1	24.6	24.7
MMLU	7.9/7.9/7.8	7.6/6.7/7.0	11.9	8.4	11.5
SIQA	8.8/8.7/8.2	13.3/13.0/12.0	24.8	14.9	13.8
Avg	8.4/8.4/7.5	9.3/9.0/8.2	24.6	16.0	16.7

Takeaway: clean condition repaint wins everywhere (avg 24.6 vs ~16 for padding, ~8–9 for partial repaint). The first block's mixed denoising needs strong, persistent conditioning — re-noising the known region (partial repaint) or just shuffling layout (padding) both fail. Shortening the guided fraction m hurts; more repaint repetitions t don't help.

D · APPENDIX I.1 — DERIVATION

A Flow-Matching account of why "clean condition" wins: the role mismatch, the variance gap, and the error-accumulation bound

Table 5 is a leaderboard; Appendix I.1 explains why the order is what it is — from the structure of conditional Flow Matching in the first block, where a known prompt region and an unknown region coexist.

① The first block is special: it must transport under a fixed condition

Cola's prior is learned block-by-block as a conditional flow $p_\psi(z_0)=p_\psi(z_0^{(1)})\prod_{b\ge2}p_\psi(z_0^{(b)}\mid z_0^{(<b)})$, predicting a noisy block under clean historical conditions. Decompose the first block as $z^{(1)}=(z_K,z_U)$ (known / unknown). The mathematically correct task is:

$$ \text{generate } z_U \text{ under the fixed condition } (z_{\text{pre}},z_K)\quad\Longleftrightarrow\quad p_\psi(z_U\mid z_{\text{pre}},z_K) $$

The known part is a boundary condition; only the unknown part is transported by the flow.

② Clean condition matches the optimal field; partial repaint corrupts the condition

$$ v^\star_{\text{clean}}(z_{U,t},t)=\E[u_t^U\mid z_{U,t},t,z_{\text{pre}},z_K]\quad\text{vs}\quad v^\star_{\text{partial}}=\E[u_t^U\mid z_{U,t},t,z_{\text{pre}},\tilde z_{K,t}] $$

Clean conditioning solves the transport under exactly the intended condition $(z_{\text{pre}},z_K)$. Partial repaint replaces the true known region with a degraded, time-varying surrogate $\tilde z_{K,t}$ — a different, noisier regression target.
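In sampler terms, clean conditioning means the known latents are written once and never touched by the integrator. The following is a minimal sketch, not the paper's sampler: `toy_velocity` is a hypothetical stand-in for the learned field (it pulls every latent toward the mean of the known ones), latents are scalars, the path is the linear interpolation with noise at t = 1, and integration uses plain Euler steps toward t = 0.

```python
import random

def toy_velocity(z, t, n_known):
    """Stand-in for the learned field: straight-line velocity toward the known mean."""
    target = sum(z[:n_known]) / n_known
    return [(zi - target) / t for zi in z]

def sample_first_block(z_known, n_unknown, steps=50, seed=0):
    """Denoise only the unknown part of a mixed block under a clean, fixed condition."""
    rng = random.Random(seed)
    n_known = len(z_known)
    # Known region starts clean; unknown region starts from Gaussian noise (t = 1).
    z = list(z_known) + [rng.gauss(0.0, 1.0) for _ in range(n_unknown)]
    dt = 1.0 / steps
    for k in range(steps):
        t = 1.0 - k * dt
        v = toy_velocity(z, t, n_known)
        for i in range(n_known, len(z)):
            z[i] -= dt * v[i]       # Euler step toward t = 0, unknown part only
        z[:n_known] = z_known       # clean condition: never re-noised, never moved
    return z

print(sample_first_block([1.0, 3.0], n_unknown=3))
```

A partial-repaint variant would instead overwrite the known entries with a re-noised copy at each step, so the field would be queried under a condition that changes over time.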

③ The variance gap + the role mismatch (Bayes risk)

$$ \mathcal{R}(\tilde c_t)=\E[\mathrm{Var}(u_t^U\mid z_{U,t},t,\tilde c_t)]\;\ge\;\mathcal{R}(c)=\E[\mathrm{Var}(u_t^U\mid z_{U,t},t,c)] $$

Plain English: a noisier condition is compatible with more clean targets, so the irreducible variance of the velocity regression rises. Worse, it's a role mismatch: in Cola the flow path is for prior transport, and historical conditions are supposed to be stable anchors. Partial repaint demotes the known region from "fixed condition" to "partially-denoised state variable" — it changes the task from transport the unknown under a fixed condition to jointly maintain a noisy known part and transport the unknown.

④ Why the error compounds — and why padding sits in between

$$ \tilde v(z,t)=v^\star(z,t;c)+\delta(z,t),\qquad \|\hat z_t-z_t^\star\|\le e^{Lt}\!\int_0^t\|\delta(z_s,s)\|\,ds $$

Because inference integrates the learned field along an ODE, the condition-induced bias $\delta$ accumulates over the trajectory ($L$ = Lipschitz constant). This is why reducing the guided fraction $m$ hurts (more unguided interval = more accumulation) and why more repaint cycles $t$ don't help (repeated early corrections can't turn a transient condition into a persistent one). Left/right padding never re-noises the known region, so it avoids the worst failure — but it only rearranges layout, never locks the condition exactly, and it complicates the block-causal attention pattern. Hence the strict ordering: clean cond ≫ padding ≫ partial repaint, exactly as the table shows.
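The accumulation bound can be checked on a toy ODE. This sketch assumes a scalar field v*(z) = -z (Lipschitz constant L = 1) and a constant bias δ that is active for only part of the trajectory. Both are illustrative choices, and `biased_fraction` is a hypothetical knob rather than the paper's m.

```python
import math

def integrate(z0, bias, biased_fraction=1.0, t_end=1.0, steps=10_000):
    """Euler-integrate dz/dt = -z + delta, with delta = bias on an initial fraction of steps."""
    z, dt = z0, t_end / steps
    for k in range(steps):
        delta = bias if k < biased_fraction * steps else 0.0
        z += dt * (-z + delta)
    return z

def deviation(bias, biased_fraction):
    """Distance between the biased trajectory and the unbiased one at t_end."""
    return abs(integrate(1.0, bias, biased_fraction) - integrate(1.0, 0.0))

L, T, DELTA = 1.0, 1.0, 0.1
bound = math.exp(L * T) * DELTA * T  # e^{Lt} times the integral of a constant |delta|

print(deviation(DELTA, 0.3), deviation(DELTA, 1.0), bound)
```

The deviation grows with the length of the interval on which the field is biased and stays below the bound, which is loose here because this particular field is contractive.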

PAGES 25–26 · §5.3–5.4

Can you compress the latent? Yes — if you align boundaries

Two VAEs are compared at d=128: p1 maps each token to one latent; p2 compresses every two tokens into one. Overall p2 looks worse, but the whole gap comes from odd-length prompts. On even lengths, p2 actually wins on average (18.1 vs 17.3), although p1 stays slightly ahead on SIQA.

C · PATCH-SIZE BOUNDARY DEMO

Interactive demo. Control: prompt length.

Table 6 · Patch size × prompt parity (avg)

Task	Overall p1	Overall p2	Mod0 (even) p1	Mod0 (even) p2	Mod1 (odd) p1	Mod1 (odd) p2
Lambada	31.1	17.4	32.1	34.6	30.1	0.8
MMLU	5.4	3.9	6.9	7.7	3.9	0.0
SIQA	11.1	6.1	12.9	12.1	9.3	0.0
Avg	15.9	9.1	17.3	18.1	14.4	0.3

On even (Mod0) lengths p2 ≥ p1 on average — compression helps! On odd (Mod1) lengths p2 collapses to ~0.

IMPLICATION 5

"The weakness of patch size 2 does not mainly come from compression itself, but from the boundary case where the prompt length is not divisible by the patch size. Once the latent grouping is well aligned with the text sequence, compression can instead become beneficial."

Why it's fatal here: the compressed prompt latent is the clean condition for all later blocks. An odd-length boundary biases that latent, the error propagates through denoising, and conditional decoding fails → near-zero Mod1. Fix the boundary and larger patches give both stronger semantic abstraction and faster generation (more tokens per latent).
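The boundary case is easy to see once tokens are grouped. The sketch below is illustrative only: `patchify` and `boundary_is_aligned` are hypothetical helpers, and how the actual model treats the straddling patch (padding, truncation, or mixing) is not specified here.

```python
def patchify(tokens, patch_size):
    """Group consecutive tokens into patches; each patch becomes one latent."""
    return [tokens[i:i + patch_size] for i in range(0, len(tokens), patch_size)]

def boundary_is_aligned(prompt_len, patch_size):
    """True when no patch mixes prompt tokens with response tokens."""
    return prompt_len % patch_size == 0

prompt6 = [f"p{i}" for i in range(6)]
prompt5 = prompt6[:5]
response = ["r0", "r1"]

print(patchify(prompt6 + response, 2))  # every patch is pure prompt or pure response
print(patchify(prompt5 + response, 2))  # third patch holds one prompt and one response token
```

With an even prompt length every prompt latent is built from prompt tokens only. With an odd length the last prompt latent also has to cover a token that does not exist yet, so the "clean condition" is no longer clean.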

C · FIGURE 13 — VAE RECONSTRUCTION ROBUSTNESS

The VAE reconstructs near-perfectly at t=0 (acc 0.9998) and degrades gracefully as diffusion noise increases: semantics aren't destroyed by small or moderate perturbations.

Interactive demo. Control: diffusion timestep t.

PAGES 26–28 · §5.5–7

A bridge from text to a shared continuous mind

Because Cola already maps discrete text into a continuous latent, it offers a natural bridge to other continuous modalities. Map each modality to its own latent, then let a single block-causal MMDiT prior organize the joint semantics, while modality-specific decoders handle realization. Continuity enters at the level of the prior, not the pixels or tokens.

B · THE JOINT LATENT

$$ z_0^{\text{text}}\!\sim q_{\phi_{\text{text}}}(z|x^{\text{text}}),\ \ z_0^{\text{img}}\!\sim q_{\phi_{\text{img}}}(z|x^{\text{img}}) $$

$$ \tilde z_0=(z_0^{\text{text}},z_0^{\text{img}}),\quad p(x^{\text{text}},x^{\text{img}},\tilde z_0)=p_\theta(x^{\text{text}},x^{\text{img}}\mid\tilde z_0)\,p_\psi(\tilde z_0) $$

z₀^text, z₀^img	per-modality latents from separate encoders q_{φ_text}, q_{φ_img}.
z̃₀	the concatenated joint latent the shared prior organizes.
p_θ(x^text,x^img\|z̃₀)	modality-specific decoders realizing each surface from the joint latent.
p_ψ(z̃₀)	one shared MMDiT prior over the joint latent — where cross-modal semantics live.

Plain English: each modality has its own encoder/decoder for surface detail; the shared prior models the higher-level semantic structure and cross-modal dependency in latent space. Unified modeling = a shared semantic prior over heterogeneous observations, not just one backbone with shared weights.

B · THE UNIFIED ELBO

$$ \E[\mathcal{L}_{\mathrm{ELBO}}]=\E_q[\log p_\theta(x^{\text{text}},x^{\text{img}}|\tilde z_0)]-I\big((X^{\text{text}},X^{\text{img}});\tilde Z_0\big)-\KL(\bar q(\tilde z_0)\|p_\psi(\tilde z_0)) $$

𝔼_q[log p_θ(·\|z̃₀)]	joint reconstruction of both modalities from the shared latent.
I((X^text,X^img);Z̃₀)	information the joint latent stores about both observations — the shared compression rate.
KL(q̄(z̃₀)‖p_ψ)	prior-matching gap for the joint aggregated posterior — same role as Eq 3.5, now multimodal.

Same shape as text-only. The latent carries compressed global semantics; decoders handle modality-specific realization. The exact division of labor from Eq 3.5 — now spanning text and images.

C · FIGURE 14 — SHARED MMDiT, THREE TASKS

§6 · Limitations & Future Prospects

Scale. A controlled-scale study — the true ceiling under bigger models, longer training, more compute is untested.
Design. VAE strategy, compression, latent dim, smoothness, joint logSNR / block size / schedule all matter; stronger latents likely need better noise calibration.
Framework. The value is the decomposition, not denoising. Opens doors to stronger latent modules (AE, RAE) & flexible prior learning (drifting-model distribution matching), and to more modalities.

§7 · Conclusion

Cola DLM decomposes text generation into global semantic prior modeling in latent space + local textual realization via conditional decoding — a principled alternative to strictly token-level LM. The study consistently finds: evidence of shared global semantic structure, effective design choices for latent & diffusion, strong generation quality and encouraging scaling. For this model class, generation quality & scaling trends are more informative than likelihood alone — and the continuous latent offers a concrete path to unified multimodal modeling.

PAGES 29–33 · §8 AFTERWORD

The bigger picture: representation, objective, environment

The afterword zooms out. Learning is never about model structure alone — it's a model–environment interaction system shaped by three jointly-coupled things: how you represent text, what objective you optimize, and what environment you learn in. AR occupies just one self-consistent corner of that design space.

C · THE INTERACTION LOOP (Eq 8.2–8.7)

B · EQ 8.1 + 8.8 — THE SYSTEM

$$ \mathcal{E}=(\Omega,\mathcal{O},\mathcal{A},\mathcal{T},\mathcal{F},\mathcal{G}) $$

$$ \mathcal{J}(\theta;\mathcal{E})=\E_{\tau\sim P(\tau|\theta,\mathcal{E})}\Big[\sum_{t=1}^T\gamma^{t-1}\ell_t\Big] $$

Ω,𝒪,𝒜	state / observation / action spaces.
𝒯,ℱ,𝒢	transition / feedback / gradient rules.
𝒥	discounted return over an interaction trajectory τ.

Plain English: "environment" is broad — it includes the data distribution, task formats, supervision, even the loss rules. Learning optimizes return inside that environment. Change the environment's structure and you change what's worth learning.

C · THE THREE THEMES — AR's CORNER vs COLA's ROUTE

EQ 8.10 · REPRESENTATION

$$ p(x)=\int p_\theta(x|z_0)p_\psi(z_0)\,dz_0 $$

The path no longer acts on observation recovery — it organizes global semantics in a latent state first, then the decoder does local realization. The role of "state" is redefined.

EQ 8.13 · OBJECTIVE

$$ -\mathcal{L}_{\mathrm{ELBO}}=-\log p_{\theta,\psi}(x)+\KL(q_\phi\|p_{\theta,\psi}) $$

Even at the ELBO, the objective is separated from true likelihood by a variational gap. So a PPL mismatch isn't failure — the model is learning something different. Scaling behavior beats any single likelihood number.

EQ 8.17 · ENVIRONMENT

$$ p(\xi,\omega_{t+1}|o_t,a_t)\neq\prod_{m}p_m(\cdots) $$

Real environments are non-separable across modalities — useful feedback depends on joint regularities. So unified models matter not for one backbone, but to learn in an environment that couples modalities. Text needs a continuous interface (Eq 8.18) to join.

The closing thesis. AR is a self-consistent corner: representation bound to surface tokens, objective = direct likelihood, environment = symbolic & text-centered. Cola changes all three at once — a hierarchical latent representation, an objective away from token-level likelihood (weakening PPL's authority), and a continuous interface that lets discrete text enter shared multimodal environments. Not just another text generator — a more systematic way to think about representation, objective alignment, and environment design together.

PAGES 42–53 · APPENDICES A–D

The proofs underneath, in plain sight

Four results make the whole story rigorous: (A) the CNF prior has an explicit log-density; (A) Flow Matching is a solver, not the model; (A/C) the average ELBO decomposes into three information-theoretic roles; (D) a rate-distortion curve decides when the bottleneck is worth it.

APPENDIX A.2 · EXPLICIT CNF DENSITY (A.12–A.13)

$$ \frac{d}{dt}\log p_t(z_t)=-\nabla\!\cdot v_\psi(z_t,t) $$

$$ \log p_\psi(z_0)=\log p_1(z_1)+\int_0^1 \nabla\!\cdot v_\psi(z_t,t)\,dt $$

Plain English: unlike a "prior that exists only through sampling," the flow prior has a computable density. As you follow the flow, the log-density changes exactly by minus the divergence (how the flow expands/contracts volume). Integrate that, add the Gaussian endpoint, done. This is what makes likelihood evaluation possible at all.
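The two formulas can be verified on a flow whose density is known in closed form. This is a toy sketch, assuming a 1-D linear field v(z, t) = a·z (so the divergence is the constant a) and a standard-normal endpoint at t = 1; the function names are hypothetical.

```python
import math

def log_normal(z, var=1.0):
    """Log-density of N(0, var) at z."""
    return -0.5 * (math.log(2.0 * math.pi * var) + z * z / var)

def cnf_log_density(z0, a, steps=10_000):
    """log p_0(z0) = log p_1(z1) + integral of div v, for the toy flow dz/dt = a*z."""
    z, dt, div_integral = z0, 1.0 / steps, 0.0
    for _ in range(steps):
        div_integral += dt * a  # divergence of v(z) = a*z is the constant a
        z += dt * a * z         # Euler step along the flow from t = 0 to t = 1
    return log_normal(z) + div_integral

def exact_log_density(z0, a):
    """Closed form: z1 = z0 * e^a with z1 ~ N(0, 1) implies z0 ~ N(0, e^{-2a})."""
    return log_normal(z0, var=math.exp(-2.0 * a))

print(cnf_log_density(0.5, 0.7), exact_log_density(0.5, 0.7))
```

The integrated value matches the closed form up to Euler discretization error. In a real model the divergence is not constant and would have to be computed or estimated along the trajectory.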

APPENDIX A.4 · FLOW MATCHING IS A SOLVER (A.31–A.36)

$$ z_t=(1-\alpha(t))z_0+\alpha(t)z_1,\quad u_t=\dot\alpha(t)(z_1-z_0) $$

$$ v_\psi^\star(z,t)=\E[u_t(z_0,z_1)\mid z_t=z,t] $$

Two objectives, not one: max E[log p_ψ] (A.35) is the strict prior-learning goal; min ℒ_FM (A.36) is a practical solver for the same prior's vector field. They solve the same problem but are not the same object — ℒ_FM can't be identified term-by-term with −log p_ψ. Hence: Cola is a hierarchical latent-variable LM; flow is just how the prior is made expressive.

C · CONDITIONAL PATH & TARGET VELOCITY (A.31–A.32)

A single (z₀, z₁) pair interpolates along z_t=(1−α)z₀+αz₁. Flow Matching regresses the network's velocity onto the target u_t=α̇(z₁−z₀). Bending the α-schedule changes both the path and the speed along it.

Interactive demo. Controls: α-schedule curvature (linear), time t (0.50).
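The path and its target velocity take only a few lines to write down. This sketch assumes a power-law schedule α(t) = t^k as one possible "curvature" family (k = 1 is the linear case); the original does not specify its schedule family, so that choice and the helper names are assumptions.

```python
def alpha(t, k=1.0):
    """Assumed schedule family: alpha(t) = t^k, linear when k = 1."""
    return t ** k

def alpha_dot(t, k=1.0):
    """Time derivative of the assumed schedule."""
    return k * t ** (k - 1.0)

def path_point(z0, z1, t, k=1.0):
    """z_t = (1 - alpha) * z0 + alpha * z1, per coordinate."""
    a = alpha(t, k)
    return [(1.0 - a) * x0 + a * x1 for x0, x1 in zip(z0, z1)]

def target_velocity(z0, z1, t, k=1.0):
    """u_t = alpha_dot * (z1 - z0): the regression target for Flow Matching."""
    ad = alpha_dot(t, k)
    return [ad * (x1 - x0) for x0, x1 in zip(z0, z1)]

z0, z1 = [0.0, 1.0], [2.0, -1.0]
print(path_point(z0, z1, 0.5, k=2.0), target_velocity(z0, z1, 0.5, k=2.0))
```

The target velocity is the time derivative of the path, so it can be sanity-checked with a finite difference of `path_point`.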

APPENDIX A.3 / A.5 · THREE ROLES OF THE ENCODER (A.28, A.40)

$$ \E[\mathcal{L}_{\mathrm{ELBO}}]=\E_q[\log p_\theta(x|z_0)]-I_q(X;Z_0)-\KL(\bar q_\phi\|p_\psi) $$

$$ L_{\text{Total}}^{\text{strict}}=L_{\text{REC}}+L_{\text{PRIOR}}+L_{\text{REG}}=-\mathcal{L}_{\mathrm{ELBO}} $$

The encoder decides three things at once: the target q̄_φ the prior must fit, the compression rate I_q(X;Z₀), and thus the division of labor between latent-semantics & decoder-realization. The strict training loss is exactly reconstruction + prior + regularization = −ELBO.
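The three-term form comes from splitting the averaged per-sample KL term of the ELBO. With the aggregated posterior $\bar q_\phi(z_0)=\E_{p_{\text{data}}(x)}[q_\phi(z_0\mid x)]$, the standard identity is as follows (stated as a sketch; the appendix may present the steps in a different order):

$$ \E_{p_{\text{data}}(x)}\big[\KL(q_\phi(z_0\mid x)\,\|\,p_\psi(z_0))\big]=\underbrace{\E_{p_{\text{data}}(x)}\big[\KL(q_\phi(z_0\mid x)\,\|\,\bar q_\phi(z_0))\big]}_{I_q(X;Z_0)}+\KL(\bar q_\phi\|p_\psi) $$

Substituting this into $\E[\mathcal{L}_{\mathrm{ELBO}}]=\E_q[\log p_\theta(x|z_0)]-\E_{p_{\text{data}}(x)}[\KL(q_\phi(z_0\mid x)\|p_\psi(z_0))]$ gives the reconstruction, rate, and prior-matching terms in the first equation of this block.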

APPENDIX D · RATE-DISTORTION & STRUCTURE (D.14–D.17)

$$ \mathcal{D}(R):=\inf_{q:\,I_q(X;Z_0)\le R}\ \inf_{p_\theta}\ \E_{q}[-\log p_\theta(x|z_0)] $$

$$ \text{learning }p_{\text{data}}(x)\ \leadsto\ \text{learning }p^\star(g)\text{ and }p^\star(x|g) $$

The deciding curve: D(R) = the best reconstruction cost when the latent may carry ≤ R nats. If D(R) is already low at small R, the data has a cheap sufficient summary G → splitting "learn p(x)" into "learn p(g) + p(x|g)" matches the true mechanism → the bottleneck helps. If reconstruction needs nearly all the bits, compression only hurts.

🥤

The whole paper in one breath

Cola DLM stops treating language as a left-to-right token chain and starts treating it as global meaning (a continuous latent, transported from noise by a block-causal flow prior) + local wording (a conditional decoder). It is honest about when this wins — only when data has low-rate global semantics — and proves it does, empirically (timeshift drift, scaling) and theoretically (three governing curves). Along the way it shows perplexity lies about latent models, and opens a clean bridge to unified multimodal generation.

The model

VAE → block-causal DiT prior → conditional decoder. p(x)=∫p_θ(x|z₀)p_ψ(z₀)dz₀.

The mechanism

Diffusion transports a prior, not an observation. Flow Matching is just the solver.

The evidence

Global structure exists (RQ1); evolve the latent (RQ2); block 16 + loc 1 + CFG 7 (RQ3); best scaling (RQ4).

The lesson

For latent LMs, generation quality & scaling — not perplexity — reflect true capability.
