The Continuous Latent Diffusion Language Model
Text generation need not be tied to a fixed left-to-right order, nor to noisy token-level recovery. Cola DLM reframes it as hierarchical information decomposition: a Text VAE compresses words into a continuous latent, a block-causal DiT learns a prior over that latent by flow transport, and a conditional decoder writes the text.
Why break free of left-to-right?
Autoregressive (AR) language models factorize text with the chain rule and predict one token at a time, left to right. That is enormously effective — but it welds generation to a single token ordering, makes inference inherently sequential, and bakes in a strong hand-crafted inductive bias. The paper's opening claim: high-quality language generation does not require a fixed order, and need not be defined by recovering tokens at all.
Three goals nobody hits at once
Existing paradigms each sacrifice one of: generation efficiency, scalable representation learning, and global semantic modeling. The paper's whole motivation is to get all three together.
Diffusion as prior transport
The key reframing: use diffusion not for token-level observation recovery, but to transport a latent prior. Global semantic organization happens in continuous latent space; local word realization is delegated to a decoder.
CoLa DLM
Continuous Latent Diffusion Language Model. A hierarchical latent-variable language model: Text VAE → block-causal DiT prior → conditional decoder.
From a unified Markov-path perspective, the diffusion process performs latent prior transport rather than token-level observation recovery — separating global semantic organization from local textual realization. The generative model is just:
| p(x) | the probability the model assigns to a piece of text x. |
| z₀ | the continuous latent — a "meaning vector" / global semantic plan. |
| pθ(x|z₀) | the decoder: how likely this meaning is realized as exactly this text. |
| pψ(z₀) | the prior: how plausible that meaning is in the first place. |
| ∫ … dz₀ | marginalize — sum over all possible meanings. |
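Read together, the rows above spell out the marginal likelihood of a text. The display equation is not reproduced on this page, so the form below is a reconstruction from the symbol definitions in the table, not a copy of the paper's typesetting:

```latex
p(x) \;=\; \int p_\theta(x \mid z_0)\, p_\psi(z_0)\, \mathrm{d}z_0
```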
A new paradigm: hierarchical information decomposition
Cola DLM first learns a stable text↔latent mapping with a Text VAE, then models a global semantic prior in continuous latent space with a block-causal DiT, and finally generates text through a conditional decoder. Because block-causal attention is bidirectional within a block but causal across blocks, it keeps cross-block causal structure while allowing efficient parallel computation inside each block.
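The attention rule described above (bidirectional inside a block, causal across blocks) can be sketched as a boolean mask. This is a minimal illustration, not the paper's implementation; the function name `block_causal_mask` and the convention that `True` means "may attend" are assumptions made here.

```python
def block_causal_mask(seq_len: int, block_size: int) -> list[list[bool]]:
    """Return mask[i][j] == True when position i may attend to position j.

    Positions are grouped into consecutive blocks of `block_size`.
    A position sees every position in its own block (bidirectional)
    and every position in earlier blocks (causal across blocks).
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [
        [(j // block_size) <= (i // block_size) for j in range(seq_len)]
        for i in range(seq_len)
    ]


# Six positions in three blocks of two: [0, 1], [2, 3], [4, 5].
mask = block_causal_mask(seq_len=6, block_size=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Setting `block_size=1` recovers an ordinary causal mask, and setting `block_size=seq_len` recovers full bidirectional attention, which is one way to see the block size as a dial between the two.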
Autoregressive models: parameterize token-level conditionals → a clean training target. But the fixed order forces sequential inference and a strong hand-crafted bias that limits general generation.
Discrete diffusion LMs: drop left-to-right, but still do observation recovery in discrete token space → costly multi-step sampling, and intermediate discrete states can't stably hold global semantics.
Continuous diffusion LMs: add continuous spaces, yet most still use the diffusion path to recover token-aligned representations rather than explicitly model a latent prior.
No prior framework jointly delivers non-autoregressive generation + continuous representation + probabilistic modeling. Cola DLM is built to close exactly that gap ↓
The decomposition weakens the inductive bias of fixed token order, lets the geometry of continuous space directly support compression and prior fitting, and enables a more flexible non-autoregressive generation process. It is also modular — any latent module (AE, RAE, …) and other continuous modalities plug in.
The landscape: where does continuity live?
The single sharpest question for placing any text model is: in which space does the model put its "stochastic path", and what is that path's job? Every prior family answers differently — and every one keeps the path tied to recovering an observation. Cola DLM is the first to move the path entirely into a compressed latent prior.
Discrete / simplex space: diffusion on one-hot vectors or logit simplices. Representation dim = |V| (vocabulary size, ~10⁴–10⁵). Scales badly.
Token-embedding space: map tokens to embedding space ℝ^{L×e} and diffuse there. Still recovers token-aligned targets, with no explicit latent prior.
Compressed latent space: compress to a latent z₀ ∈ ℝ^d via a VAE and diffuse there. Cola lives here, but treats the latent as a hierarchical variable with a learned prior, not a fixed code.
The generative model: decoder × prior
Everything begins with two players. Let x ∈ 𝒳 be a discrete text sequence, and z₀ ∈ ℝ^d its continuous latent variable. Cola's generative model is just a conditional decoder pθ(x | z₀) and a latent prior pψ(z₀).
x: a discrete token sequence (length ≤ 512 in experiments). The observable.
z₀: the continuous latent. d = 16 / 64 / 128 studied. Carries global semantics. Sequence-structured into blocks.
z₁: pure Gaussian noise in ℝ^d. The flow's starting point at time t=1.
| pψ(z₀) | the prior over latents — "what meanings are plausible." Parameters ψ (the DiT). |
| pθ(x|z₀) | the decoder — "given this meaning, how likely is this exact text." Parameters θ. |
| ∫ … dz₀ | marginalize: sum over all possible latents to get the text likelihood. |
| qφ(z₀|x) | the encoder — used only for variational inference during training; not part of the generative model. |
| p₁ = 𝒩(0,I) | base distribution at time t=1 (pure noise). |
| vψ(zt,t) | a learned vector field — at every point & time it says "which way to flow." |
| Φ^ψ_{0←1} | the flow map: integrate the ODE from t=1 down to t=0. Turns noise z₁ into a structured latent z₀. |
| z₀(b) | the b-th block of latents (each block = several tokens; block size 16 works best). |
| B | number of blocks in the sequence. |
| z₀(<b) | all earlier blocks — block b is conditioned on its history. |
One objective, three jobs
You can't compute ∫…dz₀ directly, so training maximizes a lower bound — the ELBO. Its real beauty is what it decomposes into: reconstruction, compression, and prior matching. Three analytically separable jobs.
| log p(x) | the true marginal log-likelihood of text x — the intractable target we'd love to maximize. |
| 𝔼qφ(z₀|x) | average over latents drawn from the encoder posterior — the variational helper that guesses z₀ from x. |
| log pθ(x|z₀) | reconstruction term — how well the decoder rebuilds the exact text from the latent. |
| log pψ(z₀) | prior term — how much the DiT/flow prior likes this particular latent. |
| −log qφ(z₀|x) | entropy term — penalizes an over-confident (too-peaked) encoder; keeps the bound honest. |
| ℒELBO(x) | the Evidence Lower BOund itself — what training actually maximizes (=: means "defined as"). |
| 𝔼pdata | average over the data distribution — i.e. the ELBO averaged across all real text, not one sample. |
| q(x,z₀) | the joint pdata(x)·qφ(z₀|x) — sample real text, then encode it to a latent. |
| Iq(X;Z₀) | mutual information between text and latent: how many nats the latent stores about x — the rate / compression cost (subtracted). |
| q̄φ(z₀) | the aggregated posterior ∫ qφ(z₀|x) pdata(x) dx — the marginal cloud of all encoder outputs the prior must fit. |
| KL(·‖·) | Kullback–Leibler divergence — the prior-matching gap between that cloud and the prior pψ. |
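Assembled from the glossary rows above, the bound and its three-way split can be written out as below. This is a reconstruction from the symbol definitions on this page, so the layout is an assumption; the middle line adds the one step that makes the split visible, namely multiplying and dividing by the aggregated posterior $\bar q_\phi(z_0)$.

```latex
\log p(x) \;\ge\;
  \mathbb{E}_{q_\phi(z_0\mid x)}\!\big[\log p_\theta(x\mid z_0) + \log p_\psi(z_0) - \log q_\phi(z_0\mid x)\big]
  \;=:\; \mathcal{L}_{\mathrm{ELBO}}(x)

\log p_\psi(z_0) - \log q_\phi(z_0\mid x)
  \;=\; -\log\frac{q_\phi(z_0\mid x)}{\bar q_\phi(z_0)} \;-\; \log\frac{\bar q_\phi(z_0)}{p_\psi(z_0)}

\mathbb{E}_{p_{\text{data}}}\!\big[\mathcal{L}_{\mathrm{ELBO}}(x)\big]
  \;=\; \mathbb{E}_{q(x,z_0)}\!\big[\log p_\theta(x\mid z_0)\big]
  \;-\; I_q(X;Z_0)
  \;-\; \mathrm{KL}\big(\bar q_\phi(z_0)\,\|\,p_\psi(z_0)\big)
```

Averaging the first log-ratio over $q(x,z_0)$ gives the mutual information, and averaging the second gives the KL to the prior, which is where the reconstruction, rate, and prior-matching terms come from.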
Reconstruction: how well the decoder rebuilds x from z₀. Want it high.
Rate: nats the latent stores about the text. This is the compression cost; it is subtracted, so the model is pushed to stay compact.
Prior matching: how far the aggregated posterior q̄φ is from the prior pψ. The DiT's job is to shrink this.
As the latent information rate I(X;Z₀) varies, the three terms trade off, which is why Cola needs the data to have low-rate global semantics (Eq 3.5).
| maxψ | optimize over the prior's parameters ψ (the DiT) — encoder & decoder held fixed. |
| 𝔼z₀∼q̄φ | average over latents sampled from the aggregated posterior — the encoder's output cloud. |
| ⟺ | "is equivalent to" — the two optimizations have the same solution. |
| KL(q̄φ‖pψ) | divergence from that cloud to the prior; minimizing it = making the prior fit the latents the encoder emits. |
| vψ | the network's predicted velocity (conditioned on history blocks). |
| ut | the target velocity of the straight conditional path (the ground truth direction). |
| ‖·‖²₂ | squared error — it's just regression onto a known direction. |
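The regression described by these rows can be sketched for a single latent vector. The target velocity used here, u_t = z₁ − z₀, is the time derivative of the straight path zt = (1−t)z₀ + t z₁ that this page states elsewhere; the function names are hypothetical, and the history conditioning of the real network is left out for brevity.

```python
def fm_training_pair(z0, z1, t):
    """Build one Flow Matching training pair on the straight path.

    z0: clean latent (data end, t=0); z1: Gaussian noise (t=1).
    Returns the interpolated state z_t and the target velocity u_t.
    """
    zt = [(1.0 - t) * a + t * b for a, b in zip(z0, z1)]
    ut = [b - a for a, b in zip(z0, z1)]  # d z_t / d t along the straight path
    return zt, ut


def fm_loss(v_pred, ut):
    """Squared L2 error between a predicted velocity and the target."""
    return sum((p - u) ** 2 for p, u in zip(v_pred, ut))


zt, ut = fm_training_pair(z0=[1.0, -2.0], z1=[0.0, 0.5], t=0.25)
print(zt, ut, fm_loss([0.0, 0.0], ut))
```

In training, `v_pred` would come from the network evaluated at `(zt, t)` plus the clean history blocks, and the loss would be averaged over data, noise draws, and sampled times.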
Appendix A (derivation): Why the flow prior has an exact density, and what the ELBO really decomposes into
The main text states Eq 3.1–3.3 and the ELBO as facts. Appendix A is where they are built. Four results matter, and they explain three things the page above only asserts: why a flow can report a likelihood at all, where the ELBO's slack hides, and why Flow Matching is a side-door, not the front door.
The ODE's density obeys the continuity equation $\partial_t p_t(z) + \nabla\!\cdot\!\big(p_t(z)\,v_\psi(z,t)\big)=0$. Following one trajectory, this collapses to the instantaneous change-of-variables formula, then integrates:
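The two steps named above can be written out as follows. This is a reconstruction from the surrounding definitions, chosen to agree with the accumulator rule $\log p_\psi(z_0)=\log p_1(z_1)+\ell_1$ stated in the main text; it is not a copy of the paper's typeset equation.

```latex
\frac{\mathrm{d}}{\mathrm{d}t}\,\log p_t(z_t) \;=\; -\,\nabla\!\cdot v_\psi(z_t,t)
\qquad\Longrightarrow\qquad
\log p_\psi(z_0) \;=\; \log p_1(z_1) \;+\; \int_0^1 \nabla\!\cdot v_\psi(z_t,t)\,\mathrm{d}t
```

Here $z_t$ solves $\dot z_t = v_\psi(z_t,t)$ starting from $z_0$ at $t=0$ and ending at $z_1$ at $t=1$.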
Plain English: a normal generative prior you can only sample from. This one hands you the actual probability number: flow the latent forward to noise, read off the Gaussian density where you land $\log p_1(z_1)$, and add the total log-volume the flow stretched along the way $\int \nabla\!\cdot v$. The divergence integral is the "how much did space squeeze" ledger.
Training maximizes $\mathcal{L}_{\mathrm{ELBO}}$, never $\log p(x)$ itself. The gap is precisely how far the encoder $q_\phi$ is from the true posterior $p(z_0\mid x)$ — a fixed, non-negative tax. A poor encoder is a permanent likelihood penalty (this is the variational gap that returns in Appendix D & F).
This is the formal statement behind the budget widget's bar ③. The "target cloud" $\bar q_\phi(z_0)=\int q_\phi(z_0\mid x)\,p_{\text{data}}(x)\,dx$ is the marginal of all encoder outputs; the DiT prior is trained to match exactly it.
One equation, three jobs for the encoder: realize text from the latent, decide how many nats the latent carries, $I_q(X;Z_0)$, and set how hard the prior must work. Raising the rate does not help monotonically: it lifts reconstruction, but the rate itself is subtracted, and a richer latent also inflates the KL the prior must close.
Directly maximizing $\log p_\psi(z_0)$ needs ODE solves and divergence estimates every step — expensive. FM instead regresses the velocity field onto the straight bridge paths; its pointwise optimum is the conditional-mean velocity. It learns the same prior $p_\psi$ far more cheaply, but you must not read $\mathcal{L}_{\mathrm{FM}}$ as if it were the negative prior log-likelihood. This distinction is the seed of the "likelihood ≠ generation" story (Appendix F).
The continuous diffusion field, in 3D
Everything Cola does at the prior level is transport. A learned velocity field vψ(zt,t) carries a cloud of Gaussian noise (t=1) down onto a structured, multi-cluster latent manifold (t=0). The arrows are the field itself; the points are latents riding it along straight conditional paths zt=(1−t)z₀+t z₁.
The DiT prior is vψ. Flow Matching (Eq 3.7) only fits these arrows so the cloud lands on the right manifold — nothing else.
No token-by-token recovery. A whole global latent is transported at once, then handed to the decoder to realize words.
The target is multi-modal: different global meanings live in different basins. The field routes each seed into one — the geometry RQ1 detects as "global structure."
How do you score a sentence you never wrote directly?
AR models read off log p(x) for free from the token chain. Cola can't — the latent is marginalized. So at evaluation it approximates log p(x) by importance sampling: draw latents from the encoder, weight them, and combine. Two estimators fall out — and one is always tighter.
K latent draws from the encoder for one text x.
A scalar importance weight per sample — high = "this latent explains the text well & the prior likes it."
Latent plus a running log-density accumulator ℓ, integrated by one ODE.
| w(k) | the importance weight of the k-th sample — the ratio that corrects for sampling latents from the encoder instead of the true posterior. |
| z₀(k) | the k-th latent drawn from the encoder qφ(z₀|x) (we average over K of them). |
| log pθ(x|z₀) | decoder log-likelihood — reward if the latent rebuilds the text. |
| log pψ(z₀) | prior log-density (from the CNF, see Eq 3.9) — reward if the prior likes the latent. |
| −log qφ(z₀|x) | subtract the encoder's own log-density — the correction for having sampled from it. |
| zt | the latent state integrated along the ODE from t=0 (data) to t=1 (noise). |
| ℓt | the passenger accumulator — running total of the log-density change; starts at 0. |
| vψ(zt,t) | the learned velocity field driving the flow. |
| ∇·vψ | its divergence — the instantaneous log-volume change (computed via Eq 3.11). |
| log p₁(z₁(k)) | Gaussian base density at the noise endpoint z₁. |
| ℓ₁(k) | the total accumulated volume change at t=1 — add it to get the prior log-density. |
| ∇·vψ | the divergence we need — exactly the trace of the velocity field's Jacobian. |
| Tr(∂vψ/∂zt) | trace of the d×d Jacobian — exact but O(d²) expensive. |
| ε | a random probe vector, ε∼𝒩(0,I); ε⊤Jε is an unbiased one-shot estimate of the trace. |
| ≈ | Hutchinson's stochastic estimator — one cheap vector-Jacobian product instead of the full trace. |
Integrate the flow from t=0 (latent z₀) to t=1 (noise). The passenger ℓt accumulates the divergence ∇·v — i.e. ℓ is the signed area under the ∇·v curve. The amber dots are Hutchinson's one-probe estimates ε⊤Jε: jittery per-step, yet they integrate to the same total. Final answer: log pψ(z₀)=log p₁(z₁)+ℓ₁.
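The integration described above can be sketched with a fixed-step Euler solver on a toy field whose answer is known in closed form. Everything here is an illustrative assumption: the linear field v(z) = a·z stands in for the learned vψ, its divergence is supplied exactly instead of being estimated, and a real evaluation would likely use a higher-order ODE solver.

```python
import math


def log_standard_normal(z):
    """Log-density of N(0, I) at z."""
    return -0.5 * sum(v * v for v in z) - 0.5 * len(z) * math.log(2.0 * math.pi)


def prior_log_density(z0, velocity, divergence, n_steps=1000):
    """Euler-integrate dz/dt = v(z, t) and dl/dt = div v(z, t) from t=0 to t=1.

    Returns log p1(z1) + l1, the log-density of z0 under the flow prior.
    """
    z, ell, h = list(z0), 0.0, 1.0 / n_steps
    for k in range(n_steps):
        t = k * h
        v = velocity(z, t)
        ell += h * divergence(z, t)            # accumulate the log-volume change
        z = [zi + h * vi for zi, vi in zip(z, v)]
    return log_standard_normal(z) + ell


# Toy field v(z) = a * z, so z1 = exp(a) * z0 and the prior is N(0, exp(-2a) I).
a, z0 = 0.5, [0.3, -0.2]
numeric = prior_log_density(z0, lambda z, t: [a * zi for zi in z], lambda z, t: a * len(z))
exact = (-0.5 * len(z0) * math.log(2.0 * math.pi) + a * len(z0)
         - 0.5 * math.exp(2.0 * a) * sum(v * v for v in z0))
print(numeric, exact)
```

The two printed numbers agree up to the Euler discretization error, which shrinks as `n_steps` grows.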
| log p̂ELBO,K | average-of-logs over K weights — the looser bound (the ⟨arithmetic mean of log-weights⟩). |
| log p̂IWAE,K | log-of-average (log-sum-exp) — the tighter bound; equals ELBO when K=1 and climbs toward true log p(x) as K→∞. |
| K | number of importance samples drawn from the encoder. |
| w(k) | the per-sample importance weight from Eq 3.8. |
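Given the per-sample log-weights, the two estimators differ only in where the logarithm sits. The sketch below is a minimal illustration with made-up log-density values; the function names are hypothetical, and the log-sum-exp is shifted by the maximum for numerical stability.

```python
import math


def log_weight(log_p_decoder, log_p_prior, log_q_encoder):
    """log w = log p(x|z0) + log p(z0) - log q(z0|x) for one latent sample."""
    return log_p_decoder + log_p_prior - log_q_encoder


def elbo_estimate(log_w):
    """Average of logs: the looser bound."""
    return sum(log_w) / len(log_w)


def iwae_estimate(log_w):
    """Log of the average, computed as a stabilized log-sum-exp minus log K."""
    m = max(log_w)
    return m + math.log(sum(math.exp(lw - m) for lw in log_w)) - math.log(len(log_w))


# Illustrative numbers only, not taken from any experiment.
log_w = [
    log_weight(-8.0, -3.0, 1.0),
    log_weight(-7.0, -2.5, 0.5),
    log_weight(-7.5, -3.0, 0.5),
]
print(elbo_estimate(log_w), iwae_estimate(log_w))
```

With a single sample the two functions return the same value, and for any set of weights the IWAE-style value is at least the ELBO-style value.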
Comparing ELBO-style and IWAE-style estimates over K importance weights: IWAE is always ≥ ELBO (Jensen) and tightens toward the true log p(x) as K grows. Both estimators are lower bounds in expectation; IWAE sits between ELBO and the truth.
| xpre | the prefix / prompt context (already given). |
| xres | the response being scored or ranked. |
| log p(xres|xpre) | the conditional log-probability of the response = log of joint − log of prefix (i.e. the log of joint ÷ prefix). |
| p̂ (the hats) | plug-in estimators (Eq 3.12) substituted into the exact identity — convenient, but not themselves a certified bound. |
Appendix B (derivation): How you actually compute these log-likelihoods, via the augmented ODE and Hutchinson's trace, and why the conditional plug-in is not a bound
Eq 3.8–3.14 give the estimators; Appendix B gives the machinery that makes them runnable and the caveats that make them honest.
Evaluating $\log p_\psi(z_0^{(k)})$ means solving the state and a log-density accumulator together, then estimating the divergence — which is a $d\times d$ Jacobian trace — with the Hutchinson estimator:
Plain English: integrate a passenger $\ell$ alongside the latent that tallies the log-volume change; at $t=1$, $\log p_\psi(z_0)=\log p_1(z_1)+\ell_1$. The exact trace is brutal in high-$d$, so one random projection $\epsilon^\top J\epsilon$ (a single vector-Jacobian product) estimates it — with the same $\epsilon$ frozen across the whole solve so the trajectory stays self-consistent.
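The trace estimate itself is easy to check on a small explicit matrix. The sketch below is illustrative only: it averages many Gaussian probes to show that ε⊤Jε is unbiased for the trace, whereas the procedure described above uses a single probe held fixed over the solve, and a real model would obtain the product from automatic differentiation instead of an explicit matrix. The names `hutchinson_trace` and `toy_jacobian_product` are hypothetical.

```python
import random


def hutchinson_trace(jacobian_product, dim, n_probes, rng):
    """Estimate Tr(J) as the mean of eps^T (J eps) over Gaussian probes eps."""
    total = 0.0
    for _ in range(n_probes):
        eps = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        j_eps = jacobian_product(eps)
        total += sum(e * je for e, je in zip(eps, j_eps))
    return total / n_probes


# Toy Jacobian with trace 2.0 + (-0.5) = 1.5.
J = [[2.0, 1.0], [0.0, -0.5]]


def toy_jacobian_product(vec):
    """Multiply the explicit toy matrix J by a vector."""
    return [sum(J[r][c] * vec[c] for c in range(len(vec))) for r in range(len(J))]


estimate = hutchinson_trace(toy_jacobian_product, dim=2, n_probes=20000, rng=random.Random(0))
print(estimate)
```

Each probe costs one Jacobian product, so the estimate avoids building the full d×d Jacobian at the price of per-probe noise.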
Same importance weights $\log w^{(k)}=\log p_\theta(x\mid z_0^{(k)})+\log p_\psi(z_0^{(k)})-\log q_\phi(z_0^{(k)}\mid x)$, two ways to average. Average-then-log (IWAE) beats log-then-average (ELBO) by Jensen, and tightens toward the truth as $K\!\to\!\infty$. Both are still lower bounds — so an ELBO-based PPL is an upper bound on the true perplexity.
The exact identity $\log p(x^{\text{res}}\mid x^{\text{pre}})=\log p(x^{\text{pre}},x^{\text{res}})-\log p(x^{\text{pre}})$ is run with two estimators (Algorithm A.2). Caveat the main text glosses: subtracting two bounds does not inherit a bound property — the difference can land on either side of the truth, so this is a practical estimator, not a certified lower bound. For ranking a single fresh block, the block-level score (B.21–B.22) suffices.
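A small arithmetic example makes the caveat concrete. All numbers below are invented for illustration: each plug-in value is a valid lower bound on its own target, yet their difference can fall on either side of the true conditional.

```python
def conditional_plugin(log_p_joint_hat, log_p_prefix_hat):
    """Plug-in estimate of log p(res | pre) = log p(pre, res) - log p(pre)."""
    return log_p_joint_hat - log_p_prefix_hat


# Hypothetical true values (never available in practice).
TRUE_JOINT, TRUE_PREFIX = -10.0, -4.0
true_conditional = TRUE_JOINT - TRUE_PREFIX

# Case 1: the joint bound is looser than the prefix bound, so the estimate is too low.
under = conditional_plugin(log_p_joint_hat=-10.5, log_p_prefix_hat=-4.1)

# Case 2: the prefix bound is looser than the joint bound, so the estimate is too high.
over = conditional_plugin(log_p_joint_hat=-10.1, log_p_prefix_hat=-5.0)

print(under, true_conditional, over)
```

The sign of the error depends on which of the two bounds happens to be looser, which is why the difference carries no bound guarantee.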
Two training stages, one inference cascade
The elegant probabilistic model of §3.1 is realized as a mechanical cascade. Stage 1 learns a stable text↔latent code with a Text VAE; Stage 2 jointly trains the VAE and the block-causal DiT to learn the final prior; inference encodes the prefix, generates latent blocks autoregressively, and decodes, using a KV cache.
| qφ(z₀|x) | the encoder — maps text to a latent z₀. |
| z₀ | the per-token continuous latent (Stage 1 does not compress sequence length). |
| pθ(x|z₀) | the decoder — reconstructs text from the latent. |
| x̂ | the reconstruction; training drives x̂ ≈ x. |
| −E log pθ | reconstruction loss — rebuild the text. |
| β·KL | pull the encoder toward a base distribution pbase (regularize the latent–text interface). |
| λmaskLmask | a BERT-style masking loss: forces the encoder to keep semantics instead of letting the decoder memorize surface text. |
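Putting the three rows above together suggests a Stage 1 objective of the following shape. This is a reconstruction from the glossary, not the paper's typeset equation; in particular the direction of the KL (encoder first, base second) follows the usual VAE convention and is an assumption here.

```latex
\mathcal{L}_{\mathrm{VAE}}
  \;=\; -\,\mathbb{E}_{q_\phi(z_0\mid x)}\!\big[\log p_\theta(x\mid z_0)\big]
  \;+\; \beta\,\mathrm{KL}\big(q_\phi(z_0\mid x)\,\|\,p_{\mathrm{base}}(z_0)\big)
  \;+\; \lambda_{\mathrm{mask}}\,\mathcal{L}_{\mathrm{mask}}
```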
The block-causal mechanism — the heart of the DiT
How do you keep causal structure (so generation is well-defined) and parallel efficiency (so it's fast)? The answer is the attention mask. Bidirectional within a block, causal across blocks. This is the geometric meaning of the factorization in Eq 3.3 — and it's the single most important diagram in the paper.
| sg(z₀(<b)) | the clean latent blocks before b, with a stop-gradient — used as fixed history, gradients don't flow back into them here. |
| zt(b) | the current noisy block being denoised at time t. |
Preserves the reconstruction + masking structure so the latent stays meaningful as it evolves.
Learns the block-level conditional prior — the actual diffusion/transport loss.
Pins the live encoder to a frozen reference encoder φref — suppresses latent drift during joint training.
| λVAE, λfm, λref | scalar loss weights balancing the three blocks (autoencoding / flow-matching / anti-drift). |
| β | the KL/rate weight inside the VAE term — controls how strongly the latent is regularized (the rate knob from §3.2). |
| λmask · ℒmask | weight × the masked-reconstruction loss that keeps the latent structured. |
| ℒFM | the block-causal Flow-Matching loss (Eq 3.7) — learns the conditional prior. |
| qφref | a frozen reference encoder; the KL to it penalizes the live encoder for drifting. |
Each block is a base loss × a weight λ, and the weighted blocks sum to the single scalar Cola minimizes. Starving any one weight removes the behavior that term protects (a meaningful latent, a fitted prior, or drift suppression).
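How the scalar is assembled can be sketched in a few lines. The function names, argument names, and numeric values are hypothetical; each input is treated as an already-computed scalar loss, so nothing here says how the individual terms are evaluated.

```python
def vae_loss(recon_nll, kl_to_base, mask_loss, beta, lam_mask):
    """Autoencoding block: reconstruction + beta * KL + lambda_mask * masking loss."""
    return recon_nll + beta * kl_to_base + lam_mask * mask_loss


def stage2_loss(l_vae, l_fm, l_ref, lam_vae, lam_fm, lam_ref):
    """Weighted sum of the autoencoding, flow-matching, and anti-drift blocks."""
    return lam_vae * l_vae + lam_fm * l_fm + lam_ref * l_ref


# Illustrative values only.
l_vae = vae_loss(recon_nll=2.0, kl_to_base=0.5, mask_loss=1.0, beta=0.1, lam_mask=0.5)
total = stage2_loss(l_vae, l_fm=0.8, l_ref=0.2, lam_vae=1.0, lam_fm=1.0, lam_ref=0.5)
print(l_vae, total)
```

Setting one weight to zero in this sketch simply drops that block from the sum, which is the sense in which a term can be starved.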
Generation: encode the prefix, transport blocks, decode
Generation is "autoregressive in latent space." Encode the prompt into clean latent conditions, then produce the response one latent block at a time — each block is a fresh noise seed transported by the flow under the historical condition — and finally decode everything into words.
| xpre, xres | the prompt prefix and the response to be generated. |
| zpre ∼ qφ | the prefix encoded once into a clean conditioning latent (never re-noised). |
| Φ^ψ_{0←1}(ε; ·) | the prior's flow map: integrates noise ε down to a clean latent, conditioned on the prefix and earlier blocks. |
| ẑ₀(b) · ẑ₀(<b) | the b-th generated latent block, conditioned on all blocks before it (block-causal). |
| ε(b) ∼ 𝒩(0,I) | fresh Gaussian seed for block b. |
| pθ(xres|·) | the decoder, realizing text from prefix + all generated latent blocks. |
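The latent part of this procedure can be sketched as a loop over blocks with an inner Euler integration from t=1 down to t=0. This is a toy under stated assumptions: `velocity` stands in for the block-causal DiT, `toy_velocity` is an invented field with a known endpoint, the encoder and decoder are omitted, and no KV cache is modeled.

```python
import random


def generate_latent_blocks(z_prefix, velocity, num_blocks, block_dim, n_steps, rng):
    """Produce latent blocks one at a time, each conditioned on all earlier blocks."""
    history = [list(block) for block in z_prefix]   # clean prefix latents, never re-noised
    generated = []
    h = 1.0 / n_steps
    for _ in range(num_blocks):
        z = [rng.gauss(0.0, 1.0) for _ in range(block_dim)]   # fresh Gaussian seed at t=1
        for k in range(n_steps):
            t = 1.0 - k * h
            v = velocity(z, t, history)
            z = [zi - h * vi for zi, vi in zip(z, v)]          # Euler step from t to t-h
        history.append(z)                                      # becomes clean history
        generated.append(z)
    return generated


def toy_velocity(z, t, history):
    """Invented field: straight path toward the target m = last history block + 1.

    On that path z_t = (1-t) m + t z1, so the velocity is (z_t - m) / t.
    """
    target = [x + 1.0 for x in history[-1]]
    return [(zi - mi) / t for zi, mi in zip(z, target)]


blocks = generate_latent_blocks([[0.0, 0.0]], toy_velocity, num_blocks=2,
                                block_dim=2, n_steps=8, rng=random.Random(0))
print(blocks)
```

In the full pipeline the prefix latents would come from the encoder and the returned blocks, together with the prefix, would be passed to the decoder to realize text.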
One frame to rule them all: paths over state spaces
Every text model can be written as a stochastic process τ=(St) on a state space, with a transition kernel and an emission rule. The real question isn't "who uses diffusion?" — it's what space the path lives in, and whether that path recovers an observation or transports a prior.
| τ = (st) | a whole trajectory of states — the path the model factorizes text over. |
| eΘ(x|τ) | the emission kernel — how the final text is read out from the path. |
| μΘ(ds₀) | the distribution of the initial state. |
| K_t^Θ(ds_t|s_{<t}) | the transition kernel advancing the path one step (given the history). |
| PΘ(dτ) | the induced law over whole trajectories = initial × product of transitions. |
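Written out, the rows above describe a path law and a read-out. The first equation restates "initial × product of transitions" from the table; the second, which marginalizes the emission kernel over paths, is an inferred reading of how the text likelihood follows and is labeled as an assumption.

```latex
P_\Theta(\mathrm{d}\tau) \;=\; \mu_\Theta(\mathrm{d}s_0)\,\prod_{t\ge 1} K_t^{\Theta}(\mathrm{d}s_t \mid s_{<t}),
\qquad
p_\Theta(x) \;=\; \int e_\Theta(x \mid \tau)\, P_\Theta(\mathrm{d}\tau)
```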
| Method | State Space | Path Role | Generative Factorization | Continuity Appears | Explicit Latent? |
|---|---|---|---|---|---|
| AR | Prefix Tokens | Direct Generation | ∏ᵢ p(xᵢ|x<ᵢ) | None | ✗ |
| LLaDA | Discrete Masked Seqs | Observation-Recovery | p(s_T)∏ₜ p(s_{t-1}|s_t) | Discrete token space | ✗ |
| Plaid | Continuous Token-Aligned | Observation-Recovery | p(h_T)∏ₜ p(h_{t-1}|h_t) | Continuous token space | ✗ |
| Cola DLM | Compressed Latent Seqs | Prior-Transport | ∫ p(x|z₀)p(z₀)dz₀ | Latent space | ✓ |
| z₁ ∼ p₁ | a seed from the simple base (Gaussian) distribution. |
| Φ^ψ_{0←1} | the flow map carrying noise z₁ to a structured latent z₀ (integrates the ODE). |
| vψ(zt,t) | the learned velocity field defining that flow. |
| (·)♯ p₁ | the pushforward — the prior pψ is exactly the base distribution carried through the flow. |
| pθ(x|z₀) | the decoder, which finally realizes text — the only place an observation x appears. |
| 𝔼q[log pθ(x|z₀)] | reconstruction — how well the decoder realizes text from the latent. |
| Iq(X;Z₀) | mutual information = bits of global semantics compressed into z₀ (subtracted). |
| KL(q̄φ‖pψ) | prior-matching gap between the aggregated posterior and the flow prior. |
Appendix C (derivation): The four families as one Markov process, and the exact identity that says why a better prior helps
Appendix C builds the abstract "process-based generative model" the table above summarizes, then asks one sharp question per family: into what state space, along what path, with the path doing observation-recovery or prior-transport?
Set states $S_i:=x_{1:i}$. Then $(S_i)$ is a Markov chain whose one-step kernel is the AR conditional $p_\eta(x_i\mid x_{<i})$. Its true inductive bias isn't Markovianity — it's that conditioning is locked to the growing prefix $\sigma(X_{1:1})\subset\dots\subset\sigma(X_{1:L})$. Exact token likelihood, but a frozen left-to-right order.
LLaDA's masking is a continuous-time Markov chain that absorbs each token into a mask state with probability $t$ (C.16) — a reverse recovery over discrete states. Plaid does the same in a continuous token-aligned space $h_0=\mathrm{Embed}(x)$; as noise $\to 0$ its state stays glued to the observation (C.17). In the $\sigma_0^2\!\to\!0$ limit (C.19), Cola would degenerate to Plaid — which pinpoints the genuinely new ingredient: the latent decomposition itself, not the continuity.
The flow $z_1\!\sim\!p_1,\ z_0=\Phi^\psi_{0\leftarrow1}(z_1)$ never sees $x$. The encoder $q_\phi$ appears only in the variational bound (C.22), so it belongs to inference; in Plaid/LLaDA the forward process is part of the model definition. That is the precise sense in which Cola is "first and foremost a hierarchical latent-variable LM with a CNF prior, where flow is just a way to make the prior family expressive."
For any two candidate priors $p_a,p_b$, the average-ELBO gain of swapping $a\!\to\!b$ is exactly the reduction in KL-to-aggregated-posterior. So whenever the flow/CNF prior sits closer to $\bar q_\phi$ than a plain Gaussian does, the average ELBO provably rises. "Why diffusion" is not about escaping max-likelihood — it is about buying a more expressive prior family that closes this KL.
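The identity can be stated in one line. It is reconstructed here from the ELBO decomposition into reconstruction, rate, and prior-matching terms: with the encoder and decoder held fixed, only the KL term depends on the prior, so the other two cancel in the difference. The superscripts $(a)$ and $(b)$ marking which prior is plugged in are notation introduced for this sketch.

```latex
\mathbb{E}_{p_{\text{data}}}\!\big[\mathcal{L}^{(b)}_{\mathrm{ELBO}}(x)\big]
 \;-\;
\mathbb{E}_{p_{\text{data}}}\!\big[\mathcal{L}^{(a)}_{\mathrm{ELBO}}(x)\big]
 \;=\;
\mathrm{KL}\big(\bar q_\phi \,\|\, p_a\big) \;-\; \mathrm{KL}\big(\bar q_\phi \,\|\, p_b\big)
```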
When does the latent bottleneck help?
The paper is refreshingly honest: diffusion and continuity guarantee nothing. Cola DLM wins only when the data has a specific shape — low-dimensional global semantics + high-dimensional local realization. This is made precise by a unified statistical-burden criterion and three governing curves.
| ℰ(ℳ) | approximation error: the best the model family could ever do — irreducible mismatch with the truth. |
| Ginfer | inference gap: extra cost from using a variational bound (the encoder's imperfection). AR has none of this. |
| R | total burden = how wrong the family is + how lossy its training objective is. |
| ≻ | "is better than" at the population level — i.e. lower total statistical burden. |
| RColaDLM | Cola's total burden = approximation error + inference gap (from Eq 3.29–3.31). |
| ℰ(ℳAR) | AR's only cost — its approximation error (AR reads exact likelihood, so its inference gap is 0). |
| ⟺ | "if and only if" — a strict, falsifiable condition, not a heuristic. |
Cola wins iff its total burden ℰ(ℳCola)+𝒢infer falls below AR's ℰ(ℳAR).
The verdict favors Cola only when three conditions hold simultaneously: the data's rate-distortion curve D(R), the prior approximation, and the inference gap must all be favorable.
| G, p⋆(g) | a hypothesized global factor (topic, plan, style) and its true distribution. |
| p⋆(x|g) | the true mechanism that realizes the global factor into concrete text. |
| H(X|G) ≪ H(X) | knowing G removes most of the text's uncertainty — i.e. G is highly informative. |
| dim(G) ≪ dim(E(X)) | and G is low-dimensional vs the full embedded text — a cheap summary. |
Appendix D (derivation): The "three curves" made rigorous (statistical burden, the rate-distortion bound, and exactly when the bottleneck backfires)
The page asserts "Cola wins only when the data has a certain shape." Appendix D turns that into inequalities you can check, via a single population-level accounting.
Every family's population risk is $H(p_{\text{data}})+\text{model mismatch}+\text{objective gap}$. Cola pays an extra inference gap $\mathcal{G}^{\text{infer}}=\mathbb{E}\,\mathrm{KL}(q_\phi\|p_{\theta,\psi}(z_0\mid x))\ge 0$ that AR (exact NLL) never pays. So the clean verdict (D.9): Cola beats AR iff $\mathfrak{R}_{\text{Cola}}<\mathfrak{R}_{\text{AR}}$; superiority is never automatic from "more machinery."
The mutual-information identity $H_q(X\mid Z_0)=H(p_{\text{data}})-I_q(X;Z_0)$ turns the reconstruction floor into a rate problem:
Plain English: $\mathcal{D}(R)$ is the best reconstruction you can buy if the latent is allowed at most $R$ nats. If it drops fast at small $R$, a cheap global summary exists → the bottleneck helps. If you only get reconstruction near $R\!\approx\!H(X)$, the data is near-incompressible → the bottleneck is pure overhead. This is the slider in the widget above.
If $p_{\text{data}}(x)=\int p^\star(x\mid g)p^\star(g)\,dg$ with $H(X\mid G)\ll H(X)$ and $\dim G\ll \dim E(X)$, Cola's inductive bias is the data's structure: it splits one hard problem into "learn $p^\star(g)$ + learn $p^\star(x\mid g)$" (D.17). Where that fails, the three explicit costs (D.18), namely the inference gap, the elevated reconstruction floor $H(X\mid Z_0)$ from the bottleneck, and joint-training complexity, dominate. And the variational gap is always present (D.19): $\log p(x)-\mathcal{L}_{\mathrm{ELBO}}=\mathrm{KL}(q_\phi\|p(z_0\mid x))$. Success is a competition among three curves, $\mathcal{D}(R)$, prior-approximation, and the inference gap, and only when all three favor Cola is the decomposition a real advantage.
Catching invisible structure with a timeshift
≈2B total — matched against AR (LLaMA) & LLaDA with ~1.8B non-embedding backbones.
LR 1e-6 → warm to 1.5e-4 (5k steps) → cosine to 1e-5 by 1M steps. No EMA. Seq len 512.
Strict string-match accuracy across multiple-choice & generative tasks — because PPL ≠ quality (§5.1).
LAMBADA, MMLU, SIQA (internal) + SQuAD, Story Cloze, OBQA, RACE, HellaSwag (external).
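The learning-rate schedule listed above can be sketched as a function of the step count. The endpoints (1e-6, 1.5e-4 after 5k steps, 1e-5 by 1M steps) come from the text; the linear warmup shape, measuring the cosine from the end of warmup, and holding the final value after 1M steps are assumptions, and `learning_rate` is a hypothetical name.

```python
import math


def learning_rate(step, warmup_steps=5_000, total_steps=1_000_000,
                  lr_init=1e-6, lr_peak=1.5e-4, lr_final=1e-5):
    """Linear warmup from lr_init to lr_peak, then cosine decay to lr_final."""
    if step < warmup_steps:
        return lr_init + (lr_peak - lr_init) * step / warmup_steps
    # Fraction of the decay phase completed, clamped so the rate holds at lr_final.
    progress = min(1.0, (step - warmup_steps) / (total_steps - warmup_steps))
    return lr_final + 0.5 * (lr_peak - lr_final) * (1.0 + math.cos(math.pi * progress))


for s in (0, 2_500, 5_000, 500_000, 1_000_000):
    print(s, learning_rate(s))
```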
"If the latent representation is purely local and fully separable, then the optimal timeshift does not drift as the latent dimension changes. Therefore, if the optimal timeshift is observed to shift systematically with the latent dimension, this indicates the existence of cross-dimensional shared structures — and if it shows up in semantic metrics, those structures relate to high-level semantics."
[Figure: metric versus noise-schedule loc, one panel per latent dimension: d=16, d=64, d=128.]
Best loc for Task Avg shifts 1.0 → 1.7 → 2.3 as d = 16 → 64 → 128. Clear, near-monotonic. Directly contradicts the separable null hypothesis.
LAMBADA, MMLU, SIQA & Task Avg all favor larger loc at higher d. Not a single-task fluke — a structure shared across semantic tasks.
Empirical peaks sit close to the Appendix-E predicted positions (dashed lines), drift directions fully consistent. Not a hyperparameter accident.
Appendix E (derivation): The falsifiable trap, the proof that drift refutes it, and where the $\delta^\star(d)=a\log d+b$ law comes from
Implication 1 is a contrapositive — to wield it you need (a) a precisely stated null hypothesis, (b) a theorem that the null forbids drift, and (c) a structural model that predicts the shape of the drift when the null fails. Appendix E supplies all three.
Suppose the objective decomposes additively over independent, identically-behaving latent dimensions, with a shift-response of identical functional form:
That is the formal meaning of "no shared structure": dimension $d$ only rescales/offsets the same per-dimension curve $j(\delta)$.
Plain English: a positive rescale $a_d$ and a constant offset $b_d$ never move the location of a maximum. So if the latent were truly separable, the best timeshift would be pinned across $d$. The contrapositive (Cor E.3): an observed stable, monotonic, reproducible drift — not explainable by parameter count, under-training, or noise — rejects the null. The drift in the widget above is that rejection.
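The invariance claim is easy to check numerically. In the sketch below the per-dimension curve `j` and the (a_d, b_d) pairs are invented for illustration; the only point is that the maximizer of a_d·j(δ) + b_d does not move when a_d > 0 and b_d change.

```python
def argmax_on_grid(f, grid):
    """Return the grid point where f is largest."""
    return max(grid, key=f)


def j(delta):
    """Hypothetical per-dimension shift-response, peaked at delta = 1.0."""
    return -(delta - 1.0) ** 2


grid = [i / 100 for i in range(0, 301)]   # delta from 0.00 to 3.00

# Made-up positive rescales a_d and offsets b_d for three latent dimensions.
separable_params = {16: (1.0, 0.0), 64: (3.5, -2.0), 128: (7.0, 4.0)}

peaks = {}
for d, (a_d, b_d) in separable_params.items():
    peaks[d] = argmax_on_grid(lambda x, a=a_d, b=b_d: a * j(x) + b, grid)

print(peaks)
```

Under this separable form every dimension reports the same best shift, so an optimum that drifts with d cannot be produced by rescaling and offsetting alone.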
Write the forward process $z_t=\alpha_t z+\sigma_t\epsilon$ and decompose the latent into a semantic signal plus residual, $z=\phi(s)+u$. Then what reaches the DiT is $z_t=\alpha_t\phi(s)+(\alpha_t u+\sigma_t\epsilon)$ — so what matters is not the raw timestep but how much information about $s$ survives. Under the separable null, $I(s;z_t)=\sum_i I(s_i;z_{t,i})$ and varying $d$ only rescales it — again no shift in the optimal regime.
Let many dimensions observe one low-dimensional shared factor, $z_i=A_i g+\xi_i$. Standard linear-Gaussian inference then gives a recovery SNR that grows with $d$, and a recoverable-information that grows logarithmically:
More dimensions watching the same factor ⇒ stronger effective SNR ⇒ the shift must compensate logarithmically to keep training in the same semantic-recovery regime. This is the dashed "Appendix-E prediction" line the widget plots — and it is structurally homologous to the resolution-dependent timestep shift in Stable Diffusion (Remark E.4).
Even at fixed $d$, lowering the VAE logSNR raises posterior variance $\Sigma_u$, so the total noise seen by the semantic variable is $\Sigma_{\text{noise}}(t)=\alpha_t^2\Sigma_u+\sigma_t^2 I$. A smoother latent ⇒ the same raw timestep corresponds to a lower effective semantic SNR ⇒ the shift must be recalibrated. Latent dimension and VAE logSNR look like two different knobs but act on one object: the effective mutual-information curve $I(s;z_t)$ along diffusion time. (This is the bridge to the noise-schedule deep-dive on the RQ3 page.)
The latent should evolve — from a stable start
Three sub-questions: should the latent be fixed or evolving? What dimensionality? How much semantic smoothness? The headline: neither frozen nor trained-from-scratch — let it co-evolve with the DiT on top of a good initialization, keep it semantically smooth (BERT loss + learnable logSNR), and bigger latent dims carry more semantics.
Larger latent dims raise the overall average — more semantic capacity.
| Method | Lambada | MMLU | SIQA | Avg |
|---|---|---|---|---|
| d=16 | 14.3 | 6.9 | 4.9 | 8.7 |
| d=64 | 20.9 | 5.4 | 7.6 | 11.3 |
| d=128 | 18.5 | 8.1 | 8.9 | 11.8 |
Learnable logSNR (≈4.5) gives the best Avg in both settings; fixed 1.5 is the best fixed alternative and scores higher on SIQA.
| logSNR | SIQA (77.86 EF) | Avg (77.86 EF) | SIQA (116.78 EF) | Avg (116.78 EF) |
|---|---|---|---|---|
| Fixed 1.0 | 11.3 | 14.7 | 18.4 | 18.8 |
| Fixed 1.5 | 17.5 | 18.3 | 23.6 | 21.8 |
| Fixed 2.0 | 14.3 | 16.8 | 19.5 | 20.6 |
| Learnable | 16.2 | 18.9 | 21.6 | 22.1 |
Tuning the denoiser: block size, schedule, steps, guidance
Four knobs decide how good the prior gets. The winning recipe: block size 16, noise schedule loc=1.0, ~10–32 denoising steps, and a moderate CFG ≈7. Every one of these is a "Goldilocks" — too little or too much hurts.
Drag denoising steps (saturating gain) & CFG scale (inverted-U). The dashed line is the paper's reference Task Average.
"If the schedule location shifts the logSNR curve, then it also shifts the effective semantic-information regime the DiT sees during denoising. The best noise schedule is the one whose logSNR trajectory is best aligned with the latent space and the semantic scale to be recovered — not a universally fixed timestep parameterization."
Appendices G, H.7 & H.9 (derivation). What "timeshift" really is: noise-schedule ⟺ logSNR, the two ways logSNR enters the FM loss, and the LogitNormal knob
Implication 2 says "the schedule calibrates the semantic-information regime." Appendix G proves the schedule is not an external hyperparameter at all — it is baked into the training geometry — and H.7/H.9 give the exact quantities the experiment dials.
With $\lambda(t)=\log(\alpha_t^2/\sigma_t^2)$ and a fixed path normalization, specifying $\lambda(t)$ fixes $(\alpha_t,\sigma_t)$ and vice versa. A timeshift $\lambda_\delta(t)=\lambda(t)+\delta$ therefore doesn't reweight a loss after the fact; it re-maps the same raw timestep to a different logSNR interval.
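The re-mapping can be made concrete with a small sketch. It assumes the linear path $\alpha_t=1-t$, $\sigma_t=t$ (so $\lambda(t)=2\log\frac{1-t}{t}$); the function names are hypothetical and the paper's actual parameterization may differ.

```python
import math

def logsnr(t):
    """lambda(t) = log(alpha_t^2 / sigma_t^2) for the assumed path alpha_t = 1 - t, sigma_t = t."""
    return 2.0 * math.log((1.0 - t) / t)

def shifted_time(t, delta):
    """Timestep on the unshifted schedule whose logSNR equals lambda(t) + delta."""
    k = math.exp(delta / 2.0)
    return t / (t + k * (1.0 - t))

t = 0.5
t_new = shifted_time(t, delta=-2.0)  # negative shift: same raw t now sits at a noisier point
```

Here `logsnr(t_new)` equals `logsnr(t) - 2.0`, so the shift acts as a change of where the raw timestep lands on the logSNR axis rather than as a loss weight.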
Change variables $t\to\lambda$ in the FM objective. The uniform-$t$ measure pushes forward to a non-uniform measure on the logSNR axis, and the supervised target velocity rescales:
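A sketch of that change of variables, assuming $\lambda(t)$ is strictly monotone and differentiable, and writing $\tilde z_\lambda$ for the path re-indexed by logSNR (notation introduced here for illustration, not taken from the paper):

$$
t\sim\mathcal{U}[0,1]\ \Longrightarrow\ p(\lambda)=\left|\frac{dt}{d\lambda}\right|=\frac{1}{\left|\lambda'(t(\lambda))\right|},
\qquad
\frac{dz_t}{dt}=\lambda'(t)\,\frac{d\tilde z_\lambda}{d\lambda}.
$$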
Plain English: shifting the schedule changes (i) which noise regimes get sampled most, and (ii) how hard the regression target is in each regime. So uniform-timestep training is not equivalent to uniform-logSNR training unless $\lambda(t)$ is affine (Prop G.1). The schedule is part of the objective, not a knob bolted on top.
The schedule controls the curve $t\mapsto I(s;z_t)$. Choosing the timeshift is therefore an effective-semantic-information calibration problem — it depends jointly on latent dimension $d$, posterior uncertainty $\Sigma_u$, latent geometry $\mathcal{G}$, and block size $B$. (Remark G.3: block size has no closed-form law but couples to the schedule through the same curve — which is why block 16 and loc 1.0 are co-selected.)
The VAE logSNR (H.7) is the signal-to-noise of the encoder posterior — larger ⇒ cleaner, more deterministic latent. The timestep shift is implemented as a LogitNormal sampler: $s=t/T\sim\mathrm{LogitNormal}(\mu,\sigma^2)$, density $p(s)=\frac{1}{\sigma\sqrt{2\pi}}\frac{1}{s(1-s)}\exp\!\big(-\frac{(\log\frac{s}{1-s}-\mu)^2}{2\sigma^2}\big)$ (H.9). Larger $\mu$ pushes sampling mass toward later timesteps; $\sigma$ controls how concentrated it is. That is precisely the "loc" the widget sliders and Fig 17 visualize — a reshaping of which logSNR regime is emphasized, not a numeric reweighting.
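A minimal sketch of the LogitNormal sampler and its density, written directly from the formula above. The function names and the sample size are illustrative choices, not the paper's code.

```python
import math
import random

def sample_logit_normal(mu, sigma, rng):
    """Draw s in (0, 1) such that logit(s) ~ Normal(mu, sigma^2)."""
    x = rng.gauss(mu, sigma)
    return 1.0 / (1.0 + math.exp(-x))

def logit_normal_pdf(s, mu, sigma):
    """Density p(s) for 0 < s < 1, matching the expression in the text."""
    logit = math.log(s / (1.0 - s))
    norm = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return norm / (s * (1.0 - s)) * math.exp(-((logit - mu) ** 2) / (2.0 * sigma**2))

rng = random.Random(0)
n = 5000
mean_loc0 = sum(sample_logit_normal(0.0, 1.0, rng) for _ in range(n)) / n
mean_loc1 = sum(sample_logit_normal(1.0, 1.0, rng) for _ in range(n)) / n
```

Raising `mu` from 0 to 1 moves the average sampled `s` upward, i.e. toward later timesteps, while `sigma` sets how tightly the samples cluster.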
Does it scale? Against matched AR & LLaDA — yes
The decisive test. Under the best config (d=16, block 16, joint training lr-ratio 1, BERT loss, loc=1, 16 steps, CFG=7), Cola DLM is compared to strictly-matched AR (LLaMA) and LLaDA — both with 1.8B non-embedding backbones, same data, up to ~2000 EFLOPs. The result: strong, persistent scaling, best final Task Average.
On Task Average, Cola improves steadily and reaches the best final score. AR is competitive at small budgets, but Cola rises more persistently into the high-compute regime.
On MMLU, RACE, Story Cloze, and OBQA, Cola shows a strong upward trend and best or near-best performance; these are exactly the tasks that need global semantic organization.
On LAMBADA, Cola tracks AR closely. On SQuAD, it gains clearly with scale, eventually surpassing AR and approaching LLaDA's strong region.
This is a restrained config (d=16). RQ2 showed d→128 adds capacity; logSNR analysis shows more headroom. The real ceiling is higher than shown.
Why perplexity lies about a latent model
A central, counter-intuitive phenomenon: generation can already be good while likelihood-oriented PPL stays terrible. They measure different things. Generation only needs the prior's mass to reach latent regions that the decoder can realize as semantically valid text. Likelihood additionally needs accurate local density calibration right around the gold posterior.
| $x_{\text{res}}$ | the response being scored. |
| $c$ | the conditioning context induced by the prefix/prompt. |
| $p_\theta(x_{\text{res}}\mid z,c)$ | decoder likelihood of the response given a latent and context. |
| $p_\psi(z\mid c)$ | the conditional prior over latents given context. |
| $\int\cdots\,dz$ | marginalize over all latents: the exact conditional probability (what generation needs). |
| $\mathcal{S}_{\text{resp}}(x)$ | the accessible local score: the ELBO-style / PPL proxy actually evaluated. |
| $\mathbb{E}_{q_\phi(z\mid x,c)}$ | averaged only over the encoder posterior for the gold text, a narrow neighborhood. |
| $\log p_\theta(x_{\text{res}}\mid z,c)$ | decoder term (reconstruction of the gold response). |
| $\log p_\psi(z\mid c)-\log q_\phi$ | prior minus encoder: the local calibration term that PPL is sensitive to. |
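Read together, the rows above describe two expressions. What follows is a reconstruction from that legend, not a copy of the paper's Eq 5.1 and Eq 5.2, so the exact typesetting is an assumption:

$$
p(x_{\text{res}}\mid c)=\int p_\theta(x_{\text{res}}\mid z,c)\,p_\psi(z\mid c)\,dz
$$

$$
\mathcal{S}_{\text{resp}}(x)=\mathbb{E}_{q_\phi(z\mid x,c)}\Big[\log p_\theta(x_{\text{res}}\mid z,c)+\log p_\psi(z\mid c)-\log q_\phi(z\mid x,c)\Big]
$$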
Move the prior around the latent plane. Generation (Eq 5.1) only needs prior mass to land anywhere in the broad decoder-good region. PPL (Eq 5.2) needs prior density piled on the narrow gold tube. Watch the two metrics disagree.
"Good generation & good likelihood-oriented estimation are not equivalent. Generation depends on whether the prior reaches semantically valid latent regions; likelihood additionally depends on local density calibration around the gold posterior neighborhood."
"Generation quality relates to semantic smoothness of the latent space; likelihood-oriented PPL is more sensitive to probability-space smoothness shaped by the VAE logSNR. These two smoothnesses differ → generation and PPL need not align."
For the token "at", likelihood-derived PPL improves dramatically across logSNR settings (1.15×10⁶ → 641.57 → 245.36), yet the generated token degrades from a sensible "on" to a comma. For "her", the smaller PPL under fixed logSNR still fails to recover the correct token. Direct training has much worse PPL but sometimes preserves the right semantic behavior. Flatter logSNR smooths the density (better PPL) but blurs semantics toward generic words like "in/the/went".
Appendix F (derivation). Why a continuous latent LM can generate well yet score terrible PPL: the four theorems behind the paradox
The four implications above are proved in Appendix F, by separating two geometric objects: a broad "decoder-good region" that generation needs to reach, and a narrow "gold tube" that PPL needs to calibrate.
The squared FM loss has a unique optimum: the conditional-mean velocity. When the conditional response distribution is multimodal or broad-peaked, FM learns an average transport into a reasonable region — it never promises local density calibration around any one gold sample. That is the root cause.
If the context admits several valid continuations, the prior's global mean sits between the modes — far from the gold latent the posterior selected. Generation is still fine as long as the modes' mass lands in a decoder-good region.
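A toy illustration of the conditional-mean point, assuming just two equally valid target velocities at $-1$ and $+1$ (made-up values). A single prediction trained with squared error settles between them.

```python
def squared_loss(v, targets):
    """Mean squared error of one prediction v against every valid target."""
    return sum((v - u) ** 2 for u in targets) / len(targets)

# Two equally valid continuations give target velocities -1 and +1 (toy values).
targets = [-1.0, 1.0]
candidates = [i / 100.0 for i in range(-150, 151)]
best = min(candidates, key=lambda v: squared_loss(v, targets))
```

The minimizer is the mean of the two targets, which matches neither mode. That is the averaging behavior described above, in its simplest form.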
Plain English: generation only needs the prior to drop an $\alpha$-fraction of mass somewhere in the big good region — a coverage requirement. Conditional PPL needs the prior to put high local density on the thin gold tube of one specific response — a calibration requirement. Prop F.3 shows both can hold at once: good samples, yet $\mathcal{S}_{\text{resp}}\le B-\Delta$, an arbitrarily biased PPL.
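The coverage-versus-calibration split can be sketched in one dimension. All numbers here are assumptions made for illustration: a Gaussian prior, a decoder-good interval $[-2,2]$, and a gold latent at $1.5$.

```python
import math

def normal_cdf(x, mu, sd):
    return 0.5 * (1.0 + math.erf((x - mu) / (sd * math.sqrt(2.0))))

def normal_logpdf(x, mu, sd):
    return -0.5 * math.log(2.0 * math.pi * sd**2) - (x - mu) ** 2 / (2.0 * sd**2)

def coverage(mu_p, sd_p, lo=-2.0, hi=2.0):
    """Prior mass inside the assumed decoder-good interval [lo, hi]."""
    return normal_cdf(hi, mu_p, sd_p) - normal_cdf(lo, mu_p, sd_p)

gold = 1.5                         # latent picked by the posterior for the gold response (toy)
broad = {"mu": 0.0, "sd": 0.8}     # prior spread over the good region
peaked = {"mu": 1.5, "sd": 0.05}   # prior piled on the gold tube

cov_broad = coverage(broad["mu"], broad["sd"])
cov_peaked = coverage(peaked["mu"], peaked["sd"])
logp_broad = normal_logpdf(gold, broad["mu"], broad["sd"])
logp_peaked = normal_logpdf(gold, peaked["mu"], peaked["sd"])
```

Both priors put nearly all their mass in the good region, so both would generate acceptably in this toy, yet their log-density at the gold latent differs by several nats. Only the second quantity moves a PPL-style score.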
Good reconstruction $R\to R_{\max}$ does not imply good PPL if the KL gap stays positive (Prop F.4). And even if the centers align ($\mu_p\approx\mu_q$), the scale/orientation/volume terms in the Gaussian KL keep PPL poor (Prop F.5) — a too-sharp posterior amplifies this. AR is immune because training = the object PPL evaluates = the object generation uses (F.25): $-\log p^{\text{AR}}=\sum_i-\log p(x_i\mid x_{<i})$. Continuous latents add latent-integration, posterior–prior matching, and decoder compatibility on top — which is why PPL behaves like a density-calibration metric, not a generation-quality metric.
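A small check of the aligned-centers claim, using the standard closed-form KL between diagonal Gaussians. The dimension and variances are illustrative assumptions, not values from the paper.

```python
import math

def kl_diag_gaussians(mu_q, var_q, mu_p, var_p):
    """KL(q || p) for diagonal Gaussians given per-dimension means and variances."""
    total = 0.0
    for mq, vq, mp, vp in zip(mu_q, var_q, mu_p, var_p):
        total += 0.5 * (math.log(vp / vq) + (vq + (mq - mp) ** 2) / vp - 1.0)
    return total

d = 16
mu = [0.0] * d  # identical centers for posterior and prior
kl_matched = kl_diag_gaussians(mu, [1.0] * d, mu, [1.0] * d)
kl_sharp_q = kl_diag_gaussians(mu, [0.01] * d, mu, [1.0] * d)  # too-sharp posterior
```

With identical means the KL is zero only when the scales also match; a much sharper posterior leaves a large gap from the log-variance term alone.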
The tricky first block: known prompt + unknown words
The very first generation block is mixed — it holds known prompt latents and latents to be generated. How you treat the known part decides everything. Four strategies were tried; one dominates: keep the known region clean and fixed throughout denoising.
| Task | Repaint t=1 (m=1/.7/.3) | Repaint t=3 (m=1/.7/.3) | Clean cond. | Left pad | Right pad |
|---|---|---|---|---|---|
| Lambada | 8.5/8.5/6.6 | 7.0/7.3/5.6 | 37.1 | 24.6 | 24.7 |
| MMLU | 7.9/7.9/7.8 | 7.6/6.7/7.0 | 11.9 | 8.4 | 11.5 |
| SIQA | 8.8/8.7/8.2 | 13.3/13.0/12.0 | 24.8 | 14.9 | 13.8 |
| Avg | 8.4/8.4/7.5 | 9.3/9.0/8.2 | 24.6 | 16.0 | 16.7 |
Appendix I.1 (derivation). A Flow-Matching account of why "clean condition" wins: the role mismatch, the variance gap, and the error-accumulation bound
Table 5 is a leaderboard; Appendix I.1 explains why the order is what it is — from the structure of conditional Flow Matching in the first block, where a known prompt region and an unknown region coexist.
Cola's prior is learned block-by-block as a conditional flow $p_\psi(z_0)=p_\psi(z_0^{(1)})\prod_{b\ge2}p_\psi(z_0^{(b)}\mid z_0^{(<b)})$, predicting a noisy block under clean historical conditions. Decompose the first block as $z^{(1)}=(z_K,z_U)$ (known / unknown). The mathematically correct task is:
The known part $z_K$ is a boundary condition; only the unknown part $z_U$ is transported by the flow.
Clean conditioning solves the transport under exactly the intended condition $(z_{\text{pre}},z_K)$. Partial repaint replaces the true known region with a degraded, time-varying surrogate $\tilde z_{K,t}$ — a different, noisier regression target.
Plain English: a noisier condition is compatible with more clean targets, so the irreducible variance of the velocity regression rises. Worse, it's a role mismatch: in Cola the flow path is for prior transport, and historical conditions are supposed to be stable anchors. Partial repaint demotes the known region from "fixed condition" to "partially-denoised state variable". That changes the task from "transport the unknown under a fixed condition" to "jointly maintain a noisy known part and transport the unknown".
Because inference integrates the learned field along an ODE, the condition-induced bias $\delta$ accumulates over the trajectory ($L$ = Lipschitz constant). This is why reducing the guided fraction $m$ hurts (more unguided interval = more accumulation) and why more repaint cycles $t$ don't help (repeated early corrections can't turn a transient condition into a persistent one). Left/right padding never re-noises the known region, so it avoids the worst failure — but it only rearranges layout, never locks the condition exactly, and it complicates the block-causal attention pattern. Hence the strict ordering: clean cond ≫ padding ≫ partial repaint, exactly as the table shows.
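The clean-condition strategy can be sketched as an Euler loop in which only the unknown slots are integrated. Everything below is a toy under stated assumptions: `toy_velocity` is a hypothetical stand-in for the DiT, the path is taken as $z_t=(1-t)z_0+t\,\epsilon$ with $t$ running from 1 (noise) to 0 (clean), and the "right answer" for the unknown slots is simply the mean of the known ones.

```python
import random

def toy_velocity(z_unknown, z_known, t):
    """Hypothetical stand-in for the DiT: conditional velocity of the path
    z_t = (1 - t) * z0 + t * noise when the clean target z0 is mean(z_known)."""
    target = sum(z_known) / len(z_known)
    return [(z - target) / t for z in z_unknown]

def denoise_first_block(z_known, n_unknown, steps, rng):
    """Euler integration from t=1 to t=0 with the known region held clean and fixed."""
    z_unknown = [rng.gauss(0.0, 1.0) for _ in range(n_unknown)]
    dt = 1.0 / steps
    for i in range(steps):
        t = 1.0 - i * dt
        v = toy_velocity(z_unknown, z_known, t)  # the condition is never re-noised
        z_unknown = [z - dt * vi for z, vi in zip(z_unknown, v)]
    return list(z_known) + z_unknown

known = [0.4, 0.6]
block = denoise_first_block(known, n_unknown=3, steps=16, rng=random.Random(0))
```

The known entries come back untouched and the unknown entries land on the toy target. A repaint variant would instead overwrite `z_known` with a noised copy inside the loop, which is the time-varying surrogate the derivation argues against.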
Can you compress the latent? Yes — if you align boundaries
Two VAEs are compared at d=128: p1 maps each token to one latent; p2 compresses every two tokens into one. Overall p2 looks worse, but the gap comes almost entirely from odd-length prompts, where p2 collapses to near zero. On even lengths, p2 wins on Lambada, MMLU, and the average (18.1 vs 17.3), trailing only on SIQA.
| Task | Overall p1 | Overall p2 | Mod0 (even) p1 | Mod0 (even) p2 | Mod1 (odd) p1 | Mod1 (odd) p2 |
|---|---|---|---|---|---|---|
| Lambada | 31.1 | 17.4 | 32.1 | 34.6 | 30.1 | 0.8 |
| MMLU | 5.4 | 3.9 | 6.9 | 7.7 | 3.9 | 0.0 |
| SIQA | 11.1 | 6.1 | 12.9 | 12.1 | 9.3 | 0.0 |
| Avg | 15.9 | 9.1 | 17.3 | 18.1 | 14.4 | 0.3 |
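The Mod0/Mod1 split in the table is the prompt length modulo the patch size. A minimal sketch of that bookkeeping follows; the helper name is hypothetical, and the reading that a leftover prompt token ends up sharing a patch with not-yet-generated text is an inference from the table, not a statement from the paper.

```python
def patch_layout(n_prompt_tokens, patch_size):
    """Split a prompt into full patches plus any tokens left over at the boundary."""
    full, leftover = divmod(n_prompt_tokens, patch_size)
    return {"full_patches": full, "leftover_tokens": leftover, "aligned": leftover == 0}

even_prompt = patch_layout(12, patch_size=2)  # Mod0: every patch is fully known
odd_prompt = patch_layout(13, patch_size=2)   # Mod1: one prompt token is left over
```

A practical consequence, if that reading holds, is that padding or truncating prompts to a multiple of the patch size would keep evaluation in the aligned regime.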
"The weakness of patch size 2 does not mainly come from compression itself, but from the boundary case where the prompt length is not divisible by the patch size. Once the latent grouping is well aligned with the text sequence, compression can instead become beneficial."
Drag the diffusion noise. The VAE reconstructs near-perfectly at t=0 (acc 0.9998) and degrades gracefully — semantics aren't destroyed by small/moderate perturbations.
A bridge from text to a shared continuous mind
Because Cola already maps discrete text into a continuous latent, it offers a natural bridge to other continuous modalities. Map each modality to its own latent, then let a single block-causal MMDiT prior organize the joint semantics, while modality-specific decoders handle realization. Continuity enters at the level of the prior, not the pixels or tokens.
| $z_0^{\text{text}}, z_0^{\text{img}}$ | per-modality latents from separate encoders $q_\phi^{\text{text}}$, $q_\phi^{\text{img}}$. |
| $\tilde z_0$ | the concatenated joint latent the shared prior organizes. |
| $p_\theta(x^{\text{text}},x^{\text{img}}\mid\tilde z_0)$ | modality-specific decoders realizing each surface from the joint latent. |
| $p_\psi(\tilde z_0)$ | one shared MMDiT prior over the joint latent, where cross-modal semantics live. |
| $\mathbb{E}_q[\log p_\theta(\cdot\mid\tilde z_0)]$ | joint reconstruction of both modalities from the shared latent. |
| $I((X^{\text{text}},X^{\text{img}});\tilde Z_0)$ | information the joint latent stores about both observations: the shared compression rate. |
| $\mathrm{KL}(\bar q(\tilde z_0)\,\Vert\,p_\psi)$ | prior-matching gap for the joint aggregated posterior: same role as Eq 3.5, now multimodal. |
- Scale. A controlled-scale study — the true ceiling under bigger models, longer training, more compute is untested.
- Design. VAE strategy, compression, latent dim, smoothness, joint logSNR / block size / schedule all matter; stronger latents likely need better noise calibration.
- Framework. The value is the decomposition, not denoising. Opens doors to stronger latent modules (AE, RAE) & flexible prior learning (drifting-model distribution matching), and to more modalities.
Cola DLM decomposes text generation into global semantic prior modeling in latent space + local textual realization via conditional decoding — a principled alternative to strictly token-level LM. The study consistently finds: evidence of shared global semantic structure, effective design choices for latent & diffusion, strong generation quality and encouraging scaling. For this model class, generation quality & scaling trends are more informative than likelihood alone — and the continuous latent offers a concrete path to unified multimodal modeling.
The bigger picture: representation, objective, environment
The afterword zooms out. Learning is never about model structure alone — it's a model–environment interaction system shaped by three jointly-coupled things: how you represent text, what objective you optimize, and what environment you learn in. AR occupies just one self-consistent corner of that design space.
| Ω,𝒪,𝒜 | state / observation / action spaces. |
| 𝒯,ℱ,𝒢 | transition / feedback / gradient rules. |
| 𝒥 | discounted return over an interaction trajectory τ. |
The path no longer acts on observation recovery — it organizes global semantics in a latent state first, then the decoder does local realization. The role of "state" is redefined.
Even at the ELBO, the objective is separated from true likelihood by a variational gap. So a PPL mismatch isn't failure — the model is learning something different. Scaling behavior beats any single likelihood number.
Real environments are non-separable across modalities — useful feedback depends on joint regularities. So unified models matter not for one backbone, but to learn in an environment that couples modalities. Text needs a continuous interface (Eq 8.18) to join.
The proofs underneath, in plain sight
Four results make the whole story rigorous: (A) the CNF prior has an explicit log-density; (A) Flow Matching is a solver, not the model; (A/C) the average ELBO decomposes into three information-theoretic roles; (D) a rate-distortion curve decides when the bottleneck is worth it.
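The three-role split of the average ELBO rests on a standard identity: the average posterior-to-prior KL equals the mutual information between data and latent plus the KL from the aggregated posterior to the prior. The sketch below verifies it numerically on a discrete toy. The distributions are invented, and using a discrete latent in place of the continuous one is a simplifying assumption.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions given as equal-length lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy setup: two equally likely texts x and three latent codes z (illustrative values).
p_x = [0.5, 0.5]
q_z_given_x = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
prior = [0.3, 0.3, 0.4]

# Aggregated posterior q_bar(z) = sum_x p(x) q(z|x).
q_bar = [sum(p_x[i] * q_z_given_x[i][k] for i in range(2)) for k in range(3)]

avg_kl = sum(p_x[i] * kl(q_z_given_x[i], prior) for i in range(2))
mutual_info = sum(p_x[i] * kl(q_z_given_x[i], q_bar) for i in range(2))  # rate I(X; Z)
prior_gap = kl(q_bar, prior)                                             # KL(q_bar || p)
```

`avg_kl` equals `mutual_info + prior_gap` up to floating-point error, so the KL term of the ELBO separates into a rate term and a prior-matching term, with reconstruction as the third role.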
A single (z₀, z₁) pair interpolates along zt=(1−α)z₀+αz₁. Flow Matching regresses the network's velocity onto the target ut=α̇(z₁−z₀). Bend the α-schedule and watch the path & speed change.
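The path and regression target in that caption can be written out directly. The power-law schedule `alpha_schedule` is a hypothetical example of a "bent" schedule chosen for illustration, not one taken from the paper.

```python
def interpolate(z0, z1, alpha):
    """Point on the path z_t = (1 - alpha) * z0 + alpha * z1."""
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(z0, z1)]

def target_velocity(z0, z1, alpha_dot):
    """Flow Matching regression target u_t = alpha_dot * (z1 - z0)."""
    return [alpha_dot * (b - a) for a, b in zip(z0, z1)]

def alpha_schedule(t, power=2.0):
    """Hypothetical bent schedule alpha(t) = t**power; power=1 is the straight line."""
    return t**power

def alpha_dot(t, power=2.0):
    """Time derivative of alpha_schedule."""
    return power * t ** (power - 1.0)

z0, z1, t = [0.0, 1.0], [2.0, -1.0], 0.5
z_t = interpolate(z0, z1, alpha_schedule(t))
u_t = target_velocity(z0, z1, alpha_dot(t))
```

Changing `power` leaves the endpoints and the straight-line geometry of the path unchanged; it only changes how fast the point moves along it, which is what the target velocity picks up through `alpha_dot`.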
The whole paper in one breath
Cola DLM stops treating language as a left-to-right token chain and starts treating it as global meaning (a continuous latent, transported from noise by a block-causal flow prior) + local wording (a conditional decoder). It is honest about when this wins — only when data has low-rate global semantics — and proves it does, empirically (timeshift drift, scaling) and theoretically (three governing curves). Along the way it shows perplexity lies about latent models, and opens a clean bridge to unified multimodal generation.
VAE → block-causal DiT prior → conditional decoder. p(x)=∫pθ(x|z₀)pψ(z₀)dz₀.
Diffusion transports a prior, not an observation. Flow Matching is just the solver.
Global structure exists (RQ1); evolve the latent (RQ2); block 16 + loc 1 + CFG 7 (RQ3); best scaling (RQ4).
For latent LMs, generation quality & scaling — not perplexity — reflect true capability.