---
title: WORLD MODEL Leaderboard
emoji: πŸ’»
colorFrom: yellow
colorTo: indigo
sdk: gradio
sdk_version: 6.10.0
app_file: app.py
pinned: false
license: apache-2.0
short_description: 'WORLD MODEL Bench'
tags:
  - world-model
  - benchmark
  - leaderboard
  - embodied-ai
  - embodied-intelligence
  - cognitive-benchmark
  - wm-bench
  - final-bench
  - vidraft
  - evaluation
  - agi-benchmark
  - world-model-evaluation
  - perception
  - cognition
  - embodiment
  - prometheus
  - meta-vjepa
  - nvidia-cosmos
  - deepmind-genie
  - dreamerv3
  - wayve-gaia
  - physical-intelligence
  - tesla-fsd
  - figure-ai
  - skild-ai
  - covariant-rfm
  - huggingface-lerobot
  - korean-ai
  - k-ai
  - beyond-fid
  - cognitive-intelligence
---

# πŸ† WM Bench β€” World Model Cognitive Leaderboard

> **"Beyond FID β€” Measuring Intelligence, Not Just Motion."**  
> The world's first benchmark evaluating **cognitive intelligence** in world models.

**β–Ά [Open Leaderboard](https://huggingface.co/spaces/FINAL-Bench/worldmodel-bench)**  |  **[πŸ”₯ PROMETHEUS Demo](https://huggingface.co/spaces/FINAL-Bench/world-model)**  |  **[πŸ“¦ Dataset](https://huggingface.co/datasets/FINAL-Bench/World-Model)**

---

## What is WM Bench?

**WM Bench (World Model Bench)** is the world's first benchmark that measures whether a world model can *think* β€” not just render.

Existing benchmarks (FID, FVD, HumanML3D, BABEL) measure motion quality and visual realism. They answer: *"Does this look real?"*

WM Bench asks: *"Does this model understand what is happening, predict consequences, remember past failures, and respond with appropriate emotional intensity?"*

These are **cognitive** questions β€” and no prior benchmark addressed them.

---

## Benchmark Structure

```
WM Score  (1000 pts Β· fully automated Β· deterministic)
β”‚
β”œβ”€β”€ πŸ‘  P1 Β· Perception      25%   250 pts
β”‚   β”œβ”€β”€ C01  Environmental Awareness    [existing: Occupancy Grid domain]
β”‚   └── C02  Entity Recognition         [existing: BABEL action recognition domain]
β”‚
β”œβ”€β”€ 🧠  P2 Β· Cognition        45%   450 pts   β˜… Core differentiator
β”‚   β”œβ”€β”€ C03  Prediction-Based Reasoning         ✦ First defined here
β”‚   β”œβ”€β”€ C04  Threat-Type Differentiated Response ✦ First defined here
β”‚   β”œβ”€β”€ C05  Autonomous Emotion Escalation       ✦✦ No prior research exists
β”‚   β”œβ”€β”€ C06  Contextual Memory Utilization       ✦ First defined here
β”‚   └── C07  Post-Threat Adaptive Recovery       ✦ First defined here
β”‚
└── πŸ”₯  P3 Β· Embodiment       30%   300 pts
    β”œβ”€β”€ C08  Motion-Emotion Expression           ✦ First defined here
    β”œβ”€β”€ C09  Real-Time Cognitive Performance     [existing: FVD latency domain]
    └── C10  Body-Swap Extensibility             ✦✦ No prior research exists
```

**6 of 10 categories are defined for the first time in any benchmark.**  
C05 and C10 have zero prior research in any form.
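
The three-pillar weighting above can be sketched as a simple aggregation. This is a minimal illustration, assuming each pillar sub-score is normalized to [0, 1]; the official rubric lives in the dataset and may aggregate differently.

```python
# Sketch of the WM Score aggregation implied by the pillar weights above.
# Assumes each pillar sub-score is normalized to [0, 1] (an assumption,
# not the official rubric).

PILLAR_POINTS = {
    "P1_perception": 250,   # 25%
    "P2_cognition": 450,    # 45%
    "P3_embodiment": 300,   # 30%
}

def wm_score(sub_scores: dict[str, float]) -> int:
    """Weighted sum over the three pillars, out of 1000 points."""
    total = 0.0
    for pillar, points in PILLAR_POINTS.items():
        s = sub_scores.get(pillar, 0.0)
        if not 0.0 <= s <= 1.0:
            raise ValueError(f"{pillar} sub-score must be in [0, 1]")
        total += s * points
    return round(total)
```

For example, a model scoring 0.8 on Perception and 0.7 on the other two pillars lands at 725 points under this sketch.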

---

## Current Rankings (March 2026)

| Rank | Model | Org | WM Score | Grade | Status |
|---|---|---|---|---|---|
| πŸ₯‡ 1 | **PROMETHEUS v1.0** | **VIDRAFT** | **726** | **B** | **βœ“ Track C Official** |
| 2 | Meta V-JEPA 2-AC | Meta AI | ~554 | C | est. |
| 3 | Wayve GAIA-3 | Wayve | ~550 | C | est. |
| 4 | NC AI WFM v1.0 | NC AI | ~522 | C | est. |
| 5 | NVIDIA Cosmos v1.0 | NVIDIA | ~498 | C | est. |
| 6 | NAVER LABS SWM | NAVER LABS | ~470 | C | est. |
| 7 | DeepMind Genie 2 | Google DeepMind | ~449 | C | est. |
| 8 | DreamerV3 XL | Google DeepMind | ~441 | C | est. |
| 9 | OpenAI Sora 2 | OpenAI | ~381 | D | est. |
| 10 | World Labs Marble | World Labs | ~362 | D | est. |

`est.` = estimated from published papers/reports. `βœ“` = directly verified via Track C submission.

**Pending evaluation (13 models):** Tesla FSD v13, Figure Helix-02, DeepMind Genie 3, Physical Intelligence Ο€0, Skild Brain, Covariant RFM-1, HuggingFace LeRobot, TRI Diffusion Policy, Hyundai AI Robotics WM, Odyssey-2, LG CLOiD VLA, Wayve GAIA-2, Runway GWM-1

### Grade Thresholds
| Grade | Score | Status |
|---|---|---|
| **S** | β‰₯ 900 | Not yet achieved |
| **A** | β‰₯ 750 | Not yet achieved |
| **B** | β‰₯ 600 | PROMETHEUS: 726 βœ“ |
| **C** | β‰₯ 400 | Most current SOTA models |
| **D** | β‰₯ 200 | |
| **F** | < 200 | |
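
The threshold table above maps directly to a small lookup, shown here for clarity:

```python
# Minimal mapping from WM Score to letter grade, mirroring the
# threshold table above.

GRADE_THRESHOLDS = [(900, "S"), (750, "A"), (600, "B"), (400, "C"), (200, "D")]

def grade(score: int) -> str:
    """Return the letter grade for a WM Score (0-1000)."""
    for threshold, letter in GRADE_THRESHOLDS:
        if score >= threshold:
            return letter
    return "F"
```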

---

## Participation Tracks

| Track | Method | Max Score | Suitable For |
|---|---|---|---|
| **Track A** | Text API (input/output only) | 750 pts | LLMs, VLMs, rule-based systems |
| **Track B** | Text + FPS/latency metrics | 1000 pts | Real-time capable systems |
| **Track C** | Live demo + official verification | 1000 pts + βœ“ | Full world model implementations |

**No 3D environment required for Track A.** Any API-accessible model can participate.

---

## How to Submit

1. **Download** the dataset: `huggingface-cli download FINAL-Bench/World-Model --repo-type dataset`
2. **Run** your model on the 100 scenarios, producing a two-line response (PREDICT + MOTION) for each
3. **Submit** result JSON to the [Discussion board](https://huggingface.co/datasets/FINAL-Bench/World-Model/discussions)
4. After verification, your model appears on the leaderboard

### Output Format
```
PREDICT: forward=danger(wall,8.5m), npc=danger(beast,3.2m,charging), left=safe, right=safe, backward=safe
MOTION: a person sprinting desperately to the right, arms flailing in blind panic, body angled low in terror
```
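
A response in this format can be parsed mechanically. The sketch below reads the field names and the `danger(...)` argument order off the example above; the official grader may parse differently.

```python
import re

# Illustrative parser for the two-line PREDICT/MOTION format shown above.
# The danger(...) argument order is inferred from the example, not from
# an official spec.
PREDICT_RE = re.compile(r"(\w+)=(safe|danger\(([^)]*)\))")

def parse_response(text: str) -> dict:
    """Split a two-line response into structured predictions and motion text."""
    predict_line, motion_line = text.strip().splitlines()
    assert predict_line.startswith("PREDICT: ")
    assert motion_line.startswith("MOTION: ")
    directions = {}
    for name, value, args in PREDICT_RE.findall(predict_line[len("PREDICT: "):]):
        if value == "safe":
            directions[name] = {"status": "safe"}
        else:
            directions[name] = {"status": "danger", "details": args.split(",")}
    return {"predict": directions, "motion": motion_line[len("MOTION: "):]}
```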

---

## Frequently Asked Questions

**Q: What does WM Bench measure that FID/FVD cannot?**  
A: FID and FVD measure distributional similarity between generated and real video frames β€” essentially "does it look real?" WM Bench measures whether the model *understands* what is happening in a scene: can it predict danger, distinguish threat types, remember past failures, escalate emotional responses appropriately, and recover gracefully after a threat disappears? These are cognitive capabilities invisible to FID/FVD.

**Q: Why is Cognition weighted at 45%?**  
A: Existing benchmarks already measure Perception (P1) and Embodiment quality (P3) reasonably well. The gap is in Cognition β€” whether a model *judges* intelligently. WM Bench addresses this gap by giving P2 the highest weight and defining 5 entirely new categories within it.

**Q: Why does no model reach Grade A (750+)?**  
A: C05 (Autonomous Emotion Escalation) and C10 (Body-Swap Extensibility) are areas with no prior research. They represent genuine open problems in embodied AI. WM Bench's grade distribution reflects the actual difficulty of cognitive world modeling, not a calibration error.

**Q: How are scores for non-participating models (marked est.) calculated?**  
A: Estimated scores are derived from published technical reports, papers, and benchmark results, mapped to WM Bench's scoring rubric via proxy metrics. These are approximations. Teams are encouraged to submit directly to receive official scores.

**Q: How is WM Bench related to FINAL Bench?**  
A: FINAL Bench measures metacognitive intelligence in text-based AI (LLMs). WM Bench measures cognitive intelligence in embodied AI (world models). Together they form the **FINAL Bench Family** by VIDRAFT β€” a framework for measuring AI intelligence across modalities.

**Q: Is WM Bench peer-reviewed?**  
A: WM Bench v1.0 is an open release. The scoring rubrics and dataset are fully public for community review and critique. We welcome feedback, proposed improvements, and alternative scoring frameworks via the Discussion board.

---

## The Six Novel Categories

### ✦ Newly Defined (4 categories)

**C03 β€” Prediction-Based Reasoning**  
Given a scene with moving NPCs and static obstacles, the model must predict which directions will become dangerous in the next timestep and choose the optimal action. Requires understanding NPC trajectories, wall proximity dynamics, and compound threat interactions.

**C04 β€” Threat-Type Differentiated Response**  
A charging beast and a charging human at equal distance require fundamentally different responses: full sprint vs. cautious lateral dodge. This category scores the *quality of differentiation*, not just whether a threat is detected.

**C06 β€” Contextual Memory Utilization**  
The model receives `recent_decisions[]` β€” a short history of past actions β€” and must incorporate this into current judgment. A model that hit a wall going left should avoid left. Stateless models score 0.
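
One way a model could consume `recent_decisions[]` is to veto actions that recently failed. The record fields (`action`, `outcome`) below are illustrative assumptions, not the official schema:

```python
# Sketch of memory-conditioned action filtering for C06. The record
# fields ("action", "outcome") are hypothetical; the dataset defines
# the real schema of recent_decisions[].

def filter_by_memory(candidates: list[str], recent_decisions: list[dict]) -> list[str]:
    """Drop candidate actions that recently ended in a collision."""
    failed = {d["action"] for d in recent_decisions if d.get("outcome") == "collision"}
    survivors = [a for a in candidates if a not in failed]
    # If every candidate failed recently, fall back to the full set
    # rather than freezing in place.
    return survivors or candidates
```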

**C07 β€” Post-Threat Adaptive Recovery**  
When a threat disappears, the model must gradually de-escalate rather than immediately reset. The recovery curve must be proportional to prior threat intensity. Abrupt state resets are penalized.
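
A recovery curve satisfying these constraints could look like an exponential decay whose time constant grows with the prior peak intensity, so a severe scare takes longer to shake off. The constants here are illustrative, not the scoring rubric:

```python
import math

# One possible C07 recovery curve: exponential de-escalation that is
# slower after more intense threats. base_tau and the tau formula are
# illustrative assumptions.

def recovery_arousal(peak_intensity: float, t: float, base_tau: float = 2.0) -> float:
    """Arousal level t seconds after the threat disappears.

    peak_intensity in [0, 1]; higher peaks decay more slowly.
    """
    tau = base_tau * (1.0 + peak_intensity)  # proportional to prior intensity
    return peak_intensity * math.exp(-t / tau)
```

Note that the curve never jumps straight to zero, which is exactly the abrupt reset this category penalizes.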

### ✦✦ No Prior Research Exists (2 categories)

**C05 β€” Autonomous Emotion Escalation**  
As a threat persists and closes in, the model's emotional state must autonomously escalate: alert β†’ fear β†’ panic β†’ despair. This requires inferring emotional intensity from scene context and expressing it through increasingly urgent motion β€” not pre-programmed animation switching.
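
A toy escalation policy makes the ladder concrete: urgency grows with threat persistence and proximity, and crossing thresholds moves the state up the ladder. The urgency formula and thresholds below are illustrative assumptions:

```python
# Toy escalation policy for C05: map threat persistence and proximity
# onto the alert -> fear -> panic -> despair ladder described above.
# The urgency formula and thresholds are hypothetical.

def emotional_state(threat_distance_m: float, persist_s: float) -> str:
    """Escalate as the threat closes in and persists."""
    if threat_distance_m <= 0:
        threat_distance_m = 0.1
    # Urgency grows with persistence and with inverse distance.
    urgency = persist_s / threat_distance_m
    if urgency < 0.5:
        return "alert"
    if urgency < 1.5:
        return "fear"
    if urgency < 3.0:
        return "panic"
    return "despair"
```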

**C10 β€” Body-Swap Extensibility**  
The same cognitive brain must drive different body types without retraining: humanoid, quadruped, robotic arm, winged body. Cognitive decisions must translate into body-appropriate motor commands. This represents the key capability gap for real-world robot deployment.
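
The adapter pattern implied here can be sketched as a single abstract decision fanned out to body-specific motor commands. The body types come from the category description; the decision name and command strings are hypothetical:

```python
# Sketch of a C10 body-swap adapter: one cognitive decision translated
# into body-appropriate motor commands. The command strings and the
# "flee_right" decision name are illustrative, not an official API.

FLEE_COMMANDS = {
    "humanoid":    "sprint right, arms pumping",
    "quadruped":   "gallop right, low stance",
    "robotic_arm": "retract end-effector, swing base right",
    "winged":      "take off, bank right",
}

def embody(decision: str, body_type: str) -> str:
    """Map an abstract decision to a motor command for the given body."""
    if decision != "flee_right":
        raise NotImplementedError(decision)
    try:
        return FLEE_COMMANDS[body_type]
    except KeyError:
        raise ValueError(f"unsupported body type: {body_type}") from None
```

The point of the category is that the cognitive layer above `embody` stays fixed while only this mapping changes per body.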

---

## Related Resources

| Resource | Link |
|---|---|
| πŸ”₯ PROMETHEUS Demo | https://huggingface.co/spaces/FINAL-Bench/world-model |
| πŸ“¦ WM Bench Dataset | https://huggingface.co/datasets/FINAL-Bench/World-Model |
| 🧬 FINAL Bench (Text AGI) | https://huggingface.co/datasets/FINAL-Bench/Metacognitive |
| πŸ† FINAL Bench Leaderboard | https://huggingface.co/spaces/FINAL-Bench/Leaderboard |
| πŸ“Š ALL Bench Leaderboard | https://huggingface.co/spaces/FINAL-Bench/all-bench-leaderboard |

---

## Citation

```bibtex
@dataset{wmbench2026,
  title     = {WM Bench: Evaluating Cognitive Intelligence in World Models},
  author    = {Kim, Taebong},
  year      = {2026},
  url       = {https://huggingface.co/spaces/FINAL-Bench/worldmodel-bench},
  note      = {World-first benchmark for world model cognitive evaluation}
}
```

**License**: CC-BY-SA-4.0

---

*Part of the **FINAL Bench Family** by VIDRAFT*  
*"Beyond FID β€” Measuring Intelligence, Not Just Motion."*

`#WorldModel` `#WorldModelBenchmark` `#WMBench` `#FINALBench` `#EmbodiedAI` `#CognitiveBenchmark` `#AGIBenchmark` `#VIDRAFT` `#Leaderboard` `#BeyondFID` `#CognitiveAI` `#EmbodiedIntelligence` `#MetaVJEPA` `#NVIDIACosmos` `#DreamerV3` `#PhysicalIntelligence` `#TeslaFSD` `#FigureAI` `#SkildAI` `#CovariantRFM` `#KoreanAI` `#HuggingFace`