Dream Machine

Level 5 ·○○

The 2026 Landscape

Renderers, simulators, planners — who is building what, sourced.

Prerequisites: None to skim; far richer after Levels 1–4.

You've trained a world model, watched it dream, watched it fail, and learned why. Now, and only now, the systems with the press releases. This level is a field guide, and it obeys the house rule harder than any other page: every claim below lives in a JSON file and carries a source id, and a script fails the build if any claim lacks one. No claim here comes from vibes.
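
The house rule is mechanical enough to sketch. What follows is a minimal illustration, not this site's actual build script: it assumes a hypothetical schema in which the JSON file holds a `sources` list (each entry with an `id`) and a `claims` list (each entry with `text` and `source_ids`). The field names and the helper names `find_unsourced_claims` and `main` are assumptions made for the example.

```python
import json
import sys


def find_unsourced_claims(data: dict) -> list[str]:
    """Return one error message per problem: a claim with no source id,
    or a claim citing an id that is not in the sources list."""
    known_ids = {source["id"] for source in data.get("sources", [])}
    errors = []
    for index, claim in enumerate(data.get("claims", [])):
        source_ids = claim.get("source_ids", [])
        if not source_ids:
            errors.append(f"claim {index} has no source id: {claim.get('text', '')!r}")
        for source_id in source_ids:
            if source_id not in known_ids:
                errors.append(f"claim {index} cites unknown source id {source_id!r}")
    return errors


def main(path: str) -> int:
    """Return 0 when every claim is sourced, 1 otherwise (a failing exit code)."""
    with open(path, encoding="utf-8") as handle:
        data = json.load(handle)
    errors = find_unsourced_claims(data)
    for message in errors:
        print(message, file=sys.stderr)
    return 1 if errors else 0
```

A build step could call `sys.exit(main("content/landscape.json"))`, so a single unsourced claim stops the build. Whether the real file follows this shape is not something the page states.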

Three jobs, not one thing

Li's functional taxonomy sorts world models by what they do rather than how they're built: renderers produce what a viewer would see, simulators predict how a world evolves as things act in it, and planners use those predictions to choose actions.

Li's test case is a drone shot gliding through a canyon: a video model can produce the footage — every frame plausible, the whole thing beautiful — without anything inside it knowing where the canyon walls are, or what would happen if the drone banked left. It rendered the view; it did not simulate the world.

You already own one of each. Your Level 3 network is a tiny simulator; the MPC that imagined twenty futures per move is a planner; and the canvas that painted the dream so you could watch it is a humble renderer. The billion-dollar versions below split along exactly these lines.

Source: A Functional Taxonomy of World Models — Fei-Fei Li & World Labs, Substack, June 3, 2026
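
The three jobs fit in a few lines each. The sketch below is a toy under stated assumptions, not the Level 3 code: the world is a hypothetical 1-D point with position and velocity, the planner uses random-shooting MPC (one common way to "imagine twenty futures per move", chosen here for brevity), and the names `simulate`, `plan`, and `render` are illustrative.

```python
import random

State = tuple[float, float]  # (position, velocity) in a toy 1-D world


def simulate(state: State, action: float, dt: float = 0.1) -> State:
    """Simulator: predict how the world evolves when an action is applied."""
    position, velocity = state
    velocity = velocity + action * dt
    position = position + velocity * dt
    return (position, velocity)


def plan(state: State, target: float, rng: random.Random,
         n_futures: int = 20, horizon: int = 10) -> float:
    """Planner: imagine n_futures random action sequences through the
    simulator and return the first action of the one ending nearest target."""
    best_action, best_cost = 0.0, float("inf")
    for _ in range(n_futures):
        actions = [rng.choice([-1.0, 0.0, 1.0]) for _ in range(horizon)]
        imagined = state
        for action in actions:
            imagined = simulate(imagined, action)
        cost = abs(imagined[0] - target)
        if cost < best_cost:
            best_action, best_cost = actions[0], cost
    return best_action


def render(state: State, width: int = 21) -> str:
    """Renderer: draw what a viewer would see; it has no say in what happens next."""
    column = max(0, min(width - 1, round(state[0] * (width - 1))))
    return "." * column + "o" + "." * (width - 1 - column)


rng = random.Random(0)
state: State = (0.0, 0.0)
for _ in range(5):
    state = simulate(state, plan(state, target=1.0, rng=rng))
    print(render(state))
```

Note which function takes an `action` argument and which does not: `render` can be swapped for anything prettier without the world changing, while `plan` is useless without `simulate`.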

The explorer

Nine entries, three filters, one honest limitation each. Try filtering by action-conditioned first — it is, per one survey's framing, the most clarifying single column in the whole field: can what you do change what it predicts?

Framings: World Models, From Zero to Hero — HackMD, 2026 · World Models, Architectures, and the Next Phase of AI — Ken Huang, Substack, May 2026

Filters: function · substrate · action-conditioned?

9 of 9 systems shown · data: content/landscape.json · last updated 2026-07-02

  • Genie 3

    action-conditioned

    Google DeepMind · 2025

    renderer · simulator · pixels / video

    Real-time interactive world generation: playable generated worlds at roughly 24 frames per second at 720p, holding consistent over minutes rather than seconds.

    For: Interactive environments generated on demand — a step toward worlds you can act in that never existed as assets.

    Honest limitation: Closed: available only as a limited API as of 2026 — you can read about it far more easily than you can touch it.

    Genie 3: A new frontier for world models — Google DeepMind blog, 2025

  • Sora / Veo-class video models

    not action-conditioned

    Various labs · 2024–2026

    renderer · pixels / video

    Text-to-video generators — the 'video generation' camp of the world-model debate. They produce strikingly plausible footage of worlds.

    For: Producing views of imagined scenes; the open question is whether that footage implies any inner model of the scene at all.

    Honest limitation: The renderer critique applies in full: they produce what a viewer would see, not what is — and your actions can't change what happens next.

    A Functional Taxonomy of World Models — Fei-Fei Li & World Labs, Substack, June 3, 2026 · World Models, From Zero to Hero — HackMD, 2026

  • Marble

    not action-conditioned

    World Labs · Launched commercially November 2025

    renderer · 3D scenes

    Persistent 3D scene generation from text or images, exporting Gaussian splats and meshes — the flagship of the spatial-intelligence camp.

    For: Making places: coherent, revisitable 3D scenes you can move a camera through and export into standard pipelines.

    Honest limitation: It generates persistent scenes to look at and export — not an action-conditioned simulator of things happening in them.

    Marble — World Labs, launched commercially November 2025 · World Models, From Zero to Hero — HackMD, 2026

  • Cosmos

    action-conditioned

    NVIDIA · 2025

    simulator · pixels / video

    A world foundation model platform for 'physical AI' — open and self-hostable, aimed at robotics and autonomous-vehicle development.

    For: Infrastructure: a pretrained world model other teams fine-tune for their robots and vehicles, plus synthetic data generation.

    Honest limitation: NVIDIA's own report acknowledges failures of object permanence and violations of gravity in generated worlds.

    Cosmos World Foundation Model Platform for Physical AI — NVIDIA, arXiv:2501.03575, 2025

  • Dreamer 4

    action-conditioned

    Hafner et al. · 2025

    simulator · planner · latent space

    The current generation of the Dreamer line: agents optimizing their behavior inside a scalable learned simulator.

    For: The agent-centric recipe at scale — learn the world, then train the policy in imagination instead of by expensive real trial-and-error.

    Honest limitation: The world model exists to serve the agent's task; it is not a general-purpose world you can wander.

    Training Agents Inside of Scalable World Models (Dreamer 4) — Hafner et al., 2025

  • V-JEPA 2

    action-conditioned

    Meta AI · 2025

    simulator · planner · latent space

    Self-supervised video models that predict in representation space rather than pixels; the action-conditioned V-JEPA 2-AC variant plans for robot manipulation.

    For: The latent-prediction bet: understanding, prediction, and planning without ever paying the cost of reconstructing pixels.

    Honest limitation: Its imagination is a trajectory of abstract representations — there is no video of the dream to watch.

    V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning — Meta AI, arXiv:2506.09985, 2025

  • GAIA-2 / GAIA-3

    action-conditioned

    Wayve · 2024–2026

    renderer · simulator · pixels / video

    Generative world models for driving: synthesizing realistic driving scenarios, positioned as offline evaluation infrastructure.

    For: Testing driving policies against rare and dangerous situations in generated worlds instead of waiting to meet them on real roads.

    Honest limitation: Positioned for offline evaluation — infrastructure for testing drivers, not the driver itself.

    GAIA-2 / GAIA-3 generative world models for driving — Wayve, 2024–2026

  • Qwen-AgentWorld

    action-conditioned

    Qwen Team · June 24, 2026

    simulator · language

    A native language world model: it simulates seven agent domains (MCP tools, search, terminal, software engineering, web, OS, Android) by predicting the next environment observation an agent will receive.

    For: A decoupled simulator for agentic reinforcement learning — agents practice against the model instead of live systems — with AgentWorldBench to measure it.

    Honest limitation: Its 'world' is the text an environment prints back; it simulates the interface an agent sees, not the machinery underneath.

    Qwen-AgentWorld: Language World Models for General Agents — Qwen Team, June 24, 2026

  • DreamZero

    action-conditioned

    NVIDIA GEAR · 2026

    simulator · planner · pixels / video

    A World Action Model on a video-diffusion backbone that jointly predicts future world states and actions — the world model is the policy.

    For: Zero-shot robot control: real-time closed-loop action at roughly 150 ms per action chunk, with a reported 2× improvement in generalization over vision-language-action baselines.

    Honest limitation: The headline numbers are the lab's own reported experiments; independent, comparable evaluation is exactly what the field still lacks.

    DreamZero: World Action Models are Zero-shot Policies — NVIDIA GEAR, 2026 · Beyond the Video Hype: Why World Models Feel Different in 2026 — Graison Thomas, Medium, April 2026
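
The three filters over the nine entries above can be sketched as a plain function. The tags are copied from the entries as listed; the `System` record, its field names, and `filter_systems` are hypothetical and need not match the real shape of content/landscape.json.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class System:
    name: str
    functions: tuple[str, ...]
    substrate: str
    action_conditioned: bool


# Tags copied from the nine entries above; the field names are hypothetical.
SYSTEMS = [
    System("Genie 3", ("renderer", "simulator"), "pixels / video", True),
    System("Sora / Veo-class video models", ("renderer",), "pixels / video", False),
    System("Marble", ("renderer",), "3D scenes", False),
    System("Cosmos", ("simulator",), "pixels / video", True),
    System("Dreamer 4", ("simulator", "planner"), "latent space", True),
    System("V-JEPA 2", ("simulator", "planner"), "latent space", True),
    System("GAIA-2 / GAIA-3", ("renderer", "simulator"), "pixels / video", True),
    System("Qwen-AgentWorld", ("simulator",), "language", True),
    System("DreamZero", ("simulator", "planner"), "pixels / video", True),
]


def filter_systems(
    systems: list[System],
    function: Optional[str] = None,
    substrate: Optional[str] = None,
    action_conditioned: Optional[bool] = None,
) -> list[System]:
    """Keep systems matching every filter that is set; None means 'any'."""
    kept = []
    for system in systems:
        if function is not None and function not in system.functions:
            continue
        if substrate is not None and system.substrate != substrate:
            continue
        if action_conditioned is not None and system.action_conditioned != action_conditioned:
            continue
        kept.append(system)
    return kept


acting = filter_systems(SYSTEMS, action_conditioned=True)
print(f"{len(acting)} of {len(SYSTEMS)} systems shown")  # 7 of 9 systems shown
```

Filtering on `action_conditioned=False` leaves only the two pure renderers, Sora / Veo-class video models and Marble, which is the split the action-conditioned column is meant to expose.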

Featured: the world model that speaks

Every system above simulates space: pixels, scenes, roads, arms. Qwen-AgentWorld is the twist — a world model whose 'world' is the terminal, the browser, the operating system. When a software agent runs a command, something has to play the role of reality and answer it. Qwen-AgentWorld learns to be that reality: given the agent's action, it predicts the observation the environment will return.

The reason is the same as everywhere else in this story: practicing against the real thing is slow, expensive, and sometimes destructive. A learned simulator of the digital world lets agents train against imagined terminals and websites — decoupled from live systems — before touching real ones.

Source: Qwen-AgentWorld: Language World Models for General Agents — Qwen Team, June 24, 2026
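
The contract described above, action in and predicted observation out, can be written down as an interface. This is a sketch of the idea only, not Qwen-AgentWorld's API: `LanguageWorldModel`, `LookupWorldModel`, and `rollout` are hypothetical names, and the lookup table is a deliberately crude stand-in for a learned model (it replays logged observations instead of generalizing).

```python
from typing import Callable, Protocol

Transition = tuple[str, str]  # (action the agent took, observation it got back)


class LanguageWorldModel(Protocol):
    """Anything that can play the environment: given the history so far
    and the agent's next action, predict the text the environment returns."""

    def predict_observation(self, history: list[Transition], action: str) -> str: ...


class LookupWorldModel:
    """Stand-in for a learned model: replays observations from logged transitions."""

    def __init__(self, logged: list[Transition]) -> None:
        self._memory = dict(logged)

    def predict_observation(self, history: list[Transition], action: str) -> str:
        # An unseen action gets an empty reply; a learned model would generalize here.
        return self._memory.get(action, "")


def rollout(model: LanguageWorldModel,
            policy: Callable[[list[Transition]], str],
            steps: int) -> list[Transition]:
    """Let an agent policy act against the model instead of a live system."""
    history: list[Transition] = []
    for _ in range(steps):
        action = policy(history)
        observation = model.predict_observation(history, action)
        history.append((action, observation))
    return history
```

The point of the shape is the decoupling: `rollout` never touches a real terminal, so the policy can be trained or evaluated against whatever implements `predict_observation`.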

It sounds abstract until you sit in the model's chair. So sit in it. Below, an agent works on a bug — and you are the environment it acts on.

You are the world model

predictions right: 0/6

# An agent is fixing a bug in a small repo.

# The terminal's replies are hidden — YOU must predict them.

$ ls

What does the environment print back? (1/6)
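
Scoring a round like this one comes down to comparing each predicted reply with what the real environment printed. The sketch below assumes exact match after whitespace normalization, which is an illustrative choice and not necessarily how this page or AgentWorldBench scores predictions; `normalize` and `score_predictions` are hypothetical helpers.

```python
def normalize(text: str) -> str:
    """Collapse runs of whitespace so spacing differences don't count as errors."""
    return " ".join(text.split())


def score_predictions(predicted: list[str], actual: list[str]) -> tuple[int, int]:
    """Return (number right, total) using exact match on normalized text."""
    if len(predicted) != len(actual):
        raise ValueError("need one prediction per actual observation")
    right = sum(normalize(p) == normalize(a) for p, a in zip(predicted, actual))
    return right, len(actual)
```

Exact match is a harsh judge: a prediction that lists the right files in a different order scores zero, which is one reason measuring a language world model is harder than it first looks.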

Why any of this matters

Four honest paragraphs — what world models are actually for, each with its receipts.

The temperature of the race

The scale signals, from secondary reporting: Yann LeCun's AMI Labs raised €500M at a €3B valuation to pursue world-model-centric AI; Genie 3 shipped; Marble launched commercially; Cosmos passed two million downloads. Read these as market temperature, not ground truth — they come from a single roundup.

Source: World Models Race 2026 — Introl blog, January 2026

Closing: you started by catching a ball

The model in your browser has five thousand weights. The systems above have billions, training runs that cost more than buildings, and teams of hundreds. It would be easy to say they're different kinds of thing. They are not. Encode what is; predict what happens next, given what you do; act on the prediction; drift, and be corrected by reality. You watched every link of that chain run in a tab, on code short enough to read with your coffee — the difference is scale, not kind.

And the loop closes further back than Level 3. The first world model in this story was never the network — it was you, catching a ball.