Level 4 ·●●○

Compounding Dreams and Other Nightmares

Why dreams drift — and every way your model fails, on demand.

Prerequisites: Level 3 — its trained model is this level's lab bench.

In Level 3 you watched your dream and reality part company, and the site promised an explanation. Here it is, and it's almost insultingly simple: the model eats its own errors. A one-step prediction is wrong by a hair: the verification script measures 0.001419 world units for this model, seeded. Harmless. But step two is computed from step one, so it starts a hair off and adds its own hair. By step 60 the seeded measurement is 0.22449, roughly 158× the single-step error, and visibly a different world. Nobody made a big mistake; sixty small ones compounded.
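The compounding can be reproduced in a few lines. The sketch below is not the Level 3 model: it assumes a hypothetical 2-D world that rotates a point by a fixed angle each step, and a "learned" model whose angle is off by a small, made-up amount. The names `TRUE_ANGLE`, `MODEL_ANGLE`, `one_step_errors`, and `rollout_errors` are illustrative only. It contrasts a model that is reset to reality before every prediction with one that is fed its own previous output.

```python
import math

TRUE_ANGLE = 0.10     # radians per step in the hypothetical real world
MODEL_ANGLE = 0.1014  # made-up "learned" value, off by a hair


def rotate(state, angle):
    """Rotate a 2-D point about the origin by `angle` radians."""
    x, y = state
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y)


def one_step_errors(start, steps):
    """Error when the model always predicts from the true previous state."""
    real = start
    errors = []
    for _ in range(steps):
        guess = rotate(real, MODEL_ANGLE)  # prediction from reality
        real = rotate(real, TRUE_ANGLE)    # what actually happens
        errors.append(math.dist(real, guess))
    return errors


def rollout_errors(start, steps):
    """Error when the model predicts from its own previous prediction."""
    real = dream = start
    errors = []
    for _ in range(steps):
        real = rotate(real, TRUE_ANGLE)
        dream = rotate(dream, MODEL_ANGLE)  # the dream eats its own output
        errors.append(math.dist(real, dream))
    return errors


one_step = one_step_errors((1.0, 0.0), 60)
rollout = rollout_errors((1.0, 0.0), 60)
print(f"largest one-step error: {max(one_step):.6f}")
print(f"rollout error, step 1:  {rollout[0]:.6f}")
print(f"rollout error, step 60: {rollout[-1]:.6f}")
print(f"growth factor:          {rollout[-1] / rollout[0]:.1f}x")
```

In this toy the one-step error stays flat while the rollout error grows with every step, because each dream step inherits the mistake of the step before it. How fast the error grows depends on the dynamics being modeled, so the toy's growth factor should not be expected to match the seeded measurements quoted above.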

That is the honest headline over this entire field. Everything else on this page — the failure lab, the debate, the open problems — is that sentence wearing different costumes.

The failure lab

Don't take failure modes on faith either. Below is the Level 3 training pipeline with three sliders exposed. Each preset reproduces a failure the field argues about, on demand, in your browser — and the divergence meter recomputes from seeded rollouts, so the damage is measured, not narrated.

On the bench: the same model class from Level 3. If this browser holds no trained model from Level 3, the lab trains fresh seeded copies with your chosen settings; visiting Level 3 first makes this page richer.

The failure lab — break your model on purpose

Sliders (default preset): horizon 20 · training data 8000 · observation noise σ 0.000

Plenty of clean data, modest horizon: the dream tracks reality closely. This is the regime every demo video is filmed in.
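One way to picture what the three sliders control, sketched under loud assumptions: the code below is not the lab's pipeline. It assumes a hypothetical 1-D world where a value simply persists (`TRUE_A = 1.0`), a one-parameter model fitted by least squares, and Gaussian observation noise on both inputs and targets. The functions `train` and `divergence` and the preset values other than the default are invented for illustration.

```python
import random

TRUE_A = 1.0  # hypothetical world: next value = TRUE_A * current value


def train(n_samples, noise_sigma, seed=0):
    """Fit next = a * current by least squares on seeded, noisy observations."""
    rng = random.Random(seed)
    num = den = 0.0
    for _ in range(n_samples):
        x = rng.uniform(-1.0, 1.0)
        x_obs = x + rng.gauss(0.0, noise_sigma)           # noisy input
        y_obs = TRUE_A * x + rng.gauss(0.0, noise_sigma)  # noisy target
        num += x_obs * y_obs
        den += x_obs * x_obs
    return num / den


def divergence(a_hat, horizon, start=1.0):
    """Gap between reality and a dream rolled out for `horizon` steps."""
    real = dream = start
    for _ in range(horizon):
        real *= TRUE_A
        dream *= a_hat  # the dream is fed its own previous value
    return abs(real - dream)


# (training data, observation noise sigma, horizon); only the first row
# mirrors the default preset, the others are made-up settings.
presets = {
    "clean, modest horizon": (8000, 0.0, 20),
    "noisy observations": (8000, 0.3, 20),
    "noisy, long horizon": (8000, 0.3, 60),
    "scarce, slightly noisy": (10, 0.05, 20),
}
for name, (n, sigma, horizon) in presets.items():
    a_hat = train(n, sigma)
    print(f"{name:24s} a_hat={a_hat:.4f} divergence={divergence(a_hat, horizon):.4f}")
```

A linear toy like this is learned exactly from clean data, so the data slider only matters here once the noise is nonzero. A nonlinear model such as the one from Level 3 can behave differently on every axis.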

The debate: are video generators world models?

The field's loudest argument, presented at full strength on both sides — then left, deliberately, unresolved.

No — rendering is not simulating.

The taxonomy camp's argument: a video model produces what a viewer would see, not what is. The drone shot through the canyon looks like physics, but nothing inside the model holds where the walls are or what your steering would change. A renderer without state is a beautiful surface. Prediction of appearances is not prediction of consequences — and agents need consequences.

A Functional Taxonomy of World Models — Fei-Fei Li & World Labs, Substack, June 3, 2026

Careful — continuation may be enough.

The counterargument, in its strongest form: predicting how signals continue may buy the behavior without any explicit inner model. A system that learns to continue sequences picks up syntax without grammar rules, melody without music theory — and perhaps 'physics' without a physics engine. Demanding an explicit inner world may be projecting how we think it must work onto systems that found another way.

Note responding to the world-model taxonomy — Elan Barenholtz, Substack, 2026

No verdict is imposed here. The honest state of the argument: one side says agents need action-conditioned consequences, not appearances; the other says the burden of proof on 'explicit inner models' is heavier than it looks. You trained a model on this site — you have the tools to hold the question.

How the field keeps score now

The evaluation culture is shifting from open-loop to closed-loop: not 'does the generated video look right?' but 'does an agent that uses this model perform better?' Benchmarks like World-in-World and WorldArena put world models inside an acting loop and score the outcome — pretty pixels stop counting, task success starts. The bar is becoming utility, and that is a very different bar.

You already know why this bar is the right one — you used it. In Stage 4 the question was never “does the dream look right?” It was “does the paddle that plans inside the dream still hit the ball?” Closed-loop utility is that question, scaled up.
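The two bars can be put side by side in a small sketch. Everything here is assumed for illustration: a hypothetical ball that moves in a straight line, a model whose only flaw is a speed bias, a paddle of made-up half-width `HALF_WIDTH`, and the function names `open_loop_error` and `closed_loop_success`. It is not how World-in-World or WorldArena compute their scores.

```python
import random

STEPS = 30        # frames until the ball reaches the paddle line
HALF_WIDTH = 0.1  # hypothetical paddle half-width, in world units


def make_episodes(n, seed=0):
    """Seeded (start position, velocity) pairs for the hypothetical ball."""
    rng = random.Random(seed)
    return [(rng.uniform(0.2, 0.8), rng.uniform(-0.02, 0.02)) for _ in range(n)]


def real_path(x0, v):
    return [x0 + v * t for t in range(1, STEPS + 1)]


def dream_path(x0, v, speed_bias):
    """A model that is right about direction but wrong about speed."""
    return [x0 + v * (1.0 + speed_bias) * t for t in range(1, STEPS + 1)]


def open_loop_error(episodes, speed_bias):
    """Open-loop bar: mean per-frame gap between dream and reality."""
    total = 0.0
    for x0, v in episodes:
        pairs = zip(real_path(x0, v), dream_path(x0, v, speed_bias))
        total += sum(abs(r - d) for r, d in pairs) / STEPS
    return total / len(episodes)


def closed_loop_success(episodes, speed_bias):
    """Closed-loop bar: how often a paddle placed by the dream hits the ball."""
    hits = 0
    for x0, v in episodes:
        paddle = dream_path(x0, v, speed_bias)[-1]  # plan inside the dream
        landing = real_path(x0, v)[-1]              # what reality does
        hits += abs(paddle - landing) <= HALF_WIDTH
    return hits / len(episodes)


episodes = make_episodes(200)
for bias in (0.0, 0.1, 0.5):
    print(f"speed bias {bias:.1f}: "
          f"open-loop error {open_loop_error(episodes, bias):.4f}, "
          f"closed-loop success {closed_loop_success(episodes, bias):.2f}")
```

In this setup a small speed bias produces a nonzero open-loop error while the paddle still catches every ball, and a large bias costs catches. The closed-loop number only moves when the model's error is big enough to change the outcome of the task.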

Source: Beyond the Video Hype: Why World Models Feel Different in 2026 — Graison Thomas, Medium, April 2026

Open problems, honestly listed

Five things nobody has solved, stated without varnish.

Long-horizon coherence
Dreams still fall apart with time. Keeping a generated world consistent over minutes — object permanence, stable layout, causes that keep their effects — remains the field's most visible failure; you watched a miniature version of it in Level 3.
Beyond the Video Hype: Why World Models Feel Different in 2026 — Graison Thomas, Medium, April 2026
Memory
What a world model saw a thousand steps ago should still constrain what it predicts now. Persistent, queryable memory of a world — not just a long context — is unsolved.
Beyond the Video Hype: Why World Models Feel Different in 2026 — Graison Thomas, Medium, April 2026
Physical realism
Plausible-looking is not physically right. Even flagship systems ship with documented physics violations — NVIDIA's own Cosmos report lists object impermanence and gravity errors.
Beyond the Video Hype: Why World Models Feel Different in 2026 — Graison Thomas, Medium, April 2026 · Cosmos World Foundation Model Platform for Physical AI — NVIDIA, arXiv:2501.03575, 2025
Evaluation comparability
Every lab reports its own numbers on its own tasks. Closed-loop benchmarks are young, and comparing a driving world model to a Minecraft one to a terminal one is barely meaningful yet.
Beyond the Video Hype: Why World Models Feel Different in 2026 — Graison Thomas, Medium, April 2026
Sim-to-real
A policy trained inside a dream inherits the dream's errors. Crossing the gap from imagined practice to real-world competence — without the real world punishing the difference — is the oldest problem here and still open.
Beyond the Video Hype: Why World Models Feel Different in 2026 — Graison Thomas, Medium, April 2026
