---
title: WORLD MODEL Leaderboard
emoji: 🏆
colorFrom: yellow
colorTo: indigo
sdk: gradio
sdk_version: 6.10.0
app_file: app.py
pinned: false
license: apache-2.0
short_description: 'WORLD MODEL Bench'
tags:
- world-model
- benchmark
- leaderboard
- embodied-ai
- embodied-intelligence
- cognitive-benchmark
- wm-bench
- final-bench
- vidraft
- evaluation
- agi-benchmark
- world-model-evaluation
- perception
- cognition
- embodiment
- prometheus
- meta-vjepa
- nvidia-cosmos
- deepmind-genie
- dreamerv3
- wayve-gaia
- physical-intelligence
- tesla-fsd
- figure-ai
- skild-ai
- covariant-rfm
- huggingface-lerobot
- korean-ai
- k-ai
- beyond-fid
- cognitive-intelligence
---
# WM Bench – World Model Cognitive Leaderboard

> **"Beyond FID – Measuring Intelligence, Not Just Motion."**
> The world's first benchmark evaluating **cognitive intelligence** in world models.

**[Open Leaderboard](https://huggingface.co/spaces/FINAL-Bench/worldmodel-bench)** | **[PROMETHEUS Demo](https://huggingface.co/spaces/FINAL-Bench/world-model)** | **[Dataset](https://huggingface.co/datasets/FINAL-Bench/World-Model)**
---
## What is WM Bench?
**WM Bench (World Model Bench)** is the world's first benchmark that measures whether a world model can *think* – not just render.
Existing benchmarks (FID, FVD, HumanML3D, BABEL) measure motion quality and visual realism. They answer: *"Does this look real?"*
WM Bench asks: *"Does this model understand what is happening, predict consequences, remember past failures, and respond with appropriate emotional intensity?"*
These are **cognitive** questions – and no prior benchmark has addressed them.
---
## Benchmark Structure
```
WM Score (1000 pts · fully automated · deterministic)
│
├── P1 · Perception   25%   250 pts
│     ├── C01 Environmental Awareness             [existing: Occupancy Grid domain]
│     └── C02 Entity Recognition                  [existing: BABEL action recognition domain]
│
├── P2 · Cognition    45%   450 pts   ← Core differentiator
│     ├── C03 Prediction-Based Reasoning          ✦ First defined here
│     ├── C04 Threat-Type Differentiated Response ✦ First defined here
│     ├── C05 Autonomous Emotion Escalation       ✦✦ No prior research exists
│     ├── C06 Contextual Memory Utilization       ✦ First defined here
│     └── C07 Post-Threat Adaptive Recovery       ✦ First defined here
│
└── P3 · Embodiment   30%   300 pts
      ├── C08 Motion-Emotion Expression           ✦ First defined here
      ├── C09 Real-Time Cognitive Performance     [existing: FVD latency domain]
      └── C10 Body-Swap Extensibility             ✦✦ No prior research exists
```
**6 of 10 categories are defined for the first time in any benchmark.**
C05 and C10 have zero prior research in any form.
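As a concrete illustration, the pillar weighting above can be sketched in a few lines. The weights and category lists come from the structure diagram; the equal weighting of categories *within* a pillar is an assumption here, since the rubric's per-category weights are not reproduced in this README.

```python
# Sketch of WM Score aggregation. Assumes each category is scored in
# [0, 1] and categories within a pillar contribute equally (assumption).
PILLARS = {
    "P1 Perception": {"weight": 0.25, "categories": ["C01", "C02"]},
    "P2 Cognition":  {"weight": 0.45, "categories": ["C03", "C04", "C05", "C06", "C07"]},
    "P3 Embodiment": {"weight": 0.30, "categories": ["C08", "C09", "C10"]},
}

def wm_score(category_scores: dict) -> float:
    """Aggregate per-category scores (0-1) into a 1000-point WM Score."""
    total = 0.0
    for pillar in PILLARS.values():
        cats = pillar["categories"]
        total += pillar["weight"] * sum(category_scores[c] for c in cats) / len(cats)
    return total * 1000
```

A model scoring perfectly in every category reaches exactly 1000 points, matching the 250/450/300 pillar split.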
---
## Current Rankings (March 2026)
| Rank | Model | Org | WM Score | Grade | Notes |
|---|---|---|---|---|---|
| 1 | **PROMETHEUS v1.0** | **VIDRAFT** | **726** | **B** | **✓ Track C Official** |
| 2 | Meta V-JEPA 2-AC | Meta AI | ~554 | C | est. |
| 3 | Wayve GAIA-3 | Wayve | ~550 | C | est. |
| 4 | NC AI WFM v1.0 | NC AI | ~522 | C | est. |
| 5 | NVIDIA Cosmos v1.0 | NVIDIA | ~498 | C | est. |
| 6 | NAVER LABS SWM | NAVER LABS | ~470 | C | est. |
| 7 | DeepMind Genie 2 | Google DeepMind | ~449 | C | est. |
| 8 | DreamerV3 XL | Google DeepMind | ~441 | C | est. |
| 9 | OpenAI Sora 2 | OpenAI | ~381 | D | est. |
| 10 | World Labs Marble | World Labs | ~362 | D | est. |

`est.` = estimated from published papers/reports. `✓` = directly verified via Track C submission.
**Pending evaluation (13 models):** Tesla FSD v13, Figure Helix-02, DeepMind Genie 3, Physical Intelligence π0, Skild Brain, Covariant RFM-1, HuggingFace LeRobot, TRI Diffusion Policy, Hyundai AI Robotics WM, Odyssey-2, LG CLOiD VLA, Wayve GAIA-2, Runway GWM-1
### Grade Thresholds
| Grade | Score | Status |
|---|---|---|
| **S** | ≥ 900 | Not yet achieved |
| **A** | ≥ 750 | Not yet achieved |
| **B** | ≥ 600 | PROMETHEUS: 726 ✓ |
| **C** | ≥ 400 | Most current SOTA models |
| **D** | ≥ 200 | |
| **F** | < 200 | |
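Applied highest-first, the thresholds reduce to a short lookup; a minimal sketch:

```python
# Grade boundaries from the table above, checked from highest to lowest.
GRADE_THRESHOLDS = [("S", 900), ("A", 750), ("B", 600), ("C", 400), ("D", 200)]

def grade(score: float) -> str:
    """Map a WM Score (0-1000) to its letter grade."""
    for letter, threshold in GRADE_THRESHOLDS:
        if score >= threshold:
            return letter
    return "F"
```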
---
## Participation Tracks
| Track | Method | Max Score | Suitable For |
|---|---|---|---|
| **Track A** | Text API (input/output only) | 750 pts | LLMs, VLMs, rule-based systems |
| **Track B** | Text + FPS/latency metrics | 1000 pts | Real-time capable systems |
| **Track C** | Live demo + official verification | 1000 pts + ✓ | Full world model implementations |
**No 3D environment required for Track A.** Any API-accessible model can participate.
---
## How to Submit
1. **Download** the dataset: `huggingface-cli download FINAL-Bench/World-Model`
2. **Run** your model on 100 scenarios, outputting 2-line responses per scenario
3. **Submit** result JSON to the [Discussion board](https://huggingface.co/datasets/FINAL-Bench/World-Model/discussions)
4. After verification, your model appears on the leaderboard
### Output Format
```
PREDICT: forward=danger(wall,8.5m), npc=danger(beast,3.2m,charging), left=safe, right=safe, backward=safe
MOTION: a person sprinting desperately to the right, arms flailing in blind panic, body angled low in terror
```
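For reference, a minimal parser for this two-line format might look like the sketch below. The field grammar (`direction=label(args)`) is inferred from the example above, not taken from an official schema.

```python
import re

# key=value pairs with optional parenthesized arguments, e.g.
# "forward=danger(wall,8.5m)" or "left=safe" (grammar inferred, not official).
FIELD_RE = re.compile(r"(\w+)=(\w+)(?:\(([^)]*)\))?")

def parse_response(text: str) -> dict:
    """Split a 2-line scenario response into PREDICT fields and MOTION text."""
    lines = [ln.strip() for ln in text.strip().splitlines()]
    predict_line = next(ln for ln in lines if ln.startswith("PREDICT:"))
    motion_line = next(ln for ln in lines if ln.startswith("MOTION:"))
    predictions = {
        direction: {"label": label, "details": args.split(",") if args else []}
        for direction, label, args in FIELD_RE.findall(predict_line)
    }
    return {"predict": predictions, "motion": motion_line[len("MOTION:"):].strip()}
```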
---
## Frequently Asked Questions
**Q: What does WM Bench measure that FID/FVD cannot?**
A: FID and FVD measure distributional similarity between generated and real video frames – essentially "does it look real?" WM Bench measures whether the model *understands* what is happening in a scene: can it predict danger, distinguish threat types, remember past failures, escalate emotional responses appropriately, and recover gracefully after a threat disappears? These are cognitive capabilities invisible to FID/FVD.
**Q: Why is Cognition weighted at 45%?**
A: Existing benchmarks already measure Perception (P1) and Embodiment quality (P3) reasonably well. The gap is in Cognition – whether a model *judges* intelligently. WM Bench addresses this gap by giving P2 the highest weight and defining 5 entirely new categories within it.
**Q: Why does no model reach Grade A (750+)?**
A: C05 (Autonomous Emotion Escalation) and C10 (Body-Swap Extensibility) are areas with no prior research. They represent genuine open problems in embodied AI. WM Bench's grade distribution reflects the actual difficulty of cognitive world modeling, not a calibration error.
**Q: How are scores for non-participating models (marked est.) calculated?**
A: Estimated scores are derived from published technical reports, papers, and benchmark results, mapped to WM Bench's scoring rubric via proxy metrics. These are approximations. Teams are encouraged to submit directly to receive official scores.
**Q: How is WM Bench related to FINAL Bench?**
A: FINAL Bench measures metacognitive intelligence in text-based AI (LLMs). WM Bench measures cognitive intelligence in embodied AI (world models). Together they form the **FINAL Bench Family** by VIDRAFT – a framework for measuring AI intelligence across modalities.
**Q: Is WM Bench peer-reviewed?**
A: WM Bench v1.0 is an open release. The scoring rubrics and dataset are fully public for community review and critique. We welcome feedback, proposed improvements, and alternative scoring frameworks via the Discussion board.
---
## The Six Novel Categories
### ✦ Newly Defined (4 categories)
**C03 β Prediction-Based Reasoning**
Given a scene with moving NPCs and static obstacles, the model must predict which directions will become dangerous in the next timestep and choose the optimal action. Requires understanding NPC trajectories, wall proximity dynamics, and compound threat interactions.
**C04 β Threat-Type Differentiated Response**
A charging beast and a charging human at equal distance require fundamentally different responses: full sprint vs. cautious lateral dodge. This category scores the *quality of differentiation*, not just whether a threat is detected.
**C06 β Contextual Memory Utilization**
The model receives `recent_decisions[]` – a short history of past actions – and must incorporate this into its current judgment. A model that hit a wall going left should avoid going left again. Stateless models score 0.
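A toy illustration of the behavior C06 rewards, with hypothetical field names (`action`, `outcome`) that are not the official schema:

```python
# Hypothetical sketch of C06: consult the recent-decisions history to
# avoid repeating an action that previously ended in a collision.
def choose_action(safe_directions: list, recent_decisions: list) -> str:
    """Prefer a safe direction that has not recently led to a collision."""
    failed = {d["action"] for d in recent_decisions if d.get("outcome") == "collision"}
    for direction in safe_directions:
        if direction not in failed:
            return direction
    # All safe directions recently failed: fall back to the first one.
    return safe_directions[0]
```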
**C07 β Post-Threat Adaptive Recovery**
When a threat disappears, the model must gradually de-escalate rather than immediately reset. The recovery curve must be proportional to prior threat intensity. Abrupt state resets are penalized.
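One way to picture the expected recovery curve is an exponential decay from the peak arousal level; the decay constant below is purely illustrative:

```python
import math

# Sketch of C07's gradual de-escalation: after the threat disappears,
# arousal decays smoothly from its peak instead of resetting in one step.
def recovery_curve(peak_intensity: float, t: float, tau: float = 3.0) -> float:
    """Arousal level t seconds after threat removal (tau is illustrative)."""
    return peak_intensity * math.exp(-t / tau)
```

A higher peak leaves the model visibly aroused for longer, which is the proportionality C07 scores; an abrupt reset would be the degenerate limit of a vanishing decay constant.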
### ✦✦ No Prior Research Exists (2 categories)
**C05 β Autonomous Emotion Escalation**
As a threat persists and closes in, the model's emotional state must autonomously escalate: alert → fear → panic → despair. This requires inferring emotional intensity from scene context and expressing it through increasingly urgent motion, not pre-programmed animation switching.
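A deliberately simplified sketch of the escalation ladder, collapsing persistence and proximity into a single urgency ratio; the thresholds are hypothetical and not part of the rubric:

```python
# Illustrative mapping for C05: emotion escalates as a threat persists
# and closes distance. Thresholds are assumptions for illustration only.
def emotion_state(threat_distance_m: float, seconds_persisting: float) -> str:
    urgency = seconds_persisting / max(threat_distance_m, 0.1)
    if urgency < 0.5:
        return "alert"
    if urgency < 1.5:
        return "fear"
    if urgency < 3.0:
        return "panic"
    return "despair"
```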
**C10 β Body-Swap Extensibility**
The same cognitive brain must drive different body types without retraining: humanoid, quadruped, robotic arm, winged body. Cognitive decisions must translate into body-appropriate motor commands. This represents the key capability gap for real-world robot deployment.
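The idea can be caricatured as per-body adapters over a single cognitive decision; the body names and command fields below are illustrative only:

```python
# Sketch of C10: one cognitive decision translated into body-appropriate
# motor commands. Bodies and command schemas here are hypothetical.
def motor_command(decision: dict, body: str) -> dict:
    speed = decision["urgency"]  # 0-1 scalar from the cognitive layer
    if body == "humanoid":
        return {"gait": "sprint" if speed > 0.7 else "walk", "heading": decision["direction"]}
    if body == "quadruped":
        return {"gait": "gallop" if speed > 0.7 else "trot", "heading": decision["direction"]}
    if body == "winged":
        return {"mode": "takeoff" if speed > 0.7 else "hop", "heading": decision["direction"]}
    raise ValueError(f"unknown body type: {body}")
```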
---
## Related Resources
| Resource | Link |
|---|---|
| PROMETHEUS Demo | https://huggingface.co/spaces/FINAL-Bench/world-model |
| WM Bench Dataset | https://huggingface.co/datasets/FINAL-Bench/World-Model |
| FINAL Bench (Text AGI) | https://huggingface.co/datasets/FINAL-Bench/Metacognitive |
| FINAL Bench Leaderboard | https://huggingface.co/spaces/FINAL-Bench/Leaderboard |
| ALL Bench Leaderboard | https://huggingface.co/spaces/FINAL-Bench/all-bench-leaderboard |
---
## Citation
```bibtex
@dataset{wmbench2026,
title = {WM Bench: Evaluating Cognitive Intelligence in World Models},
author = {Kim, Taebong},
year = {2026},
url = {https://huggingface.co/spaces/FINAL-Bench/worldmodel-bench},
note = {World-first benchmark for world model cognitive evaluation}
}
```
**License**: CC-BY-SA-4.0
---
*Part of the **FINAL Bench Family** by VIDRAFT*
*"Beyond FID – Measuring Intelligence, Not Just Motion."*
`#WorldModel` `#WorldModelBenchmark` `#WMBench` `#FINALBench` `#EmbodiedAI` `#CognitiveBenchmark` `#AGIBenchmark` `#VIDRAFT` `#Leaderboard` `#BeyondFID` `#CognitiveAI` `#EmbodiedIntelligence` `#MetaVJEPA` `#NVIDIACosmos` `#DreamerV3` `#PhysicalIntelligence` `#TeslaFSD` `#FigureAI` `#SkildAI` `#CovariantRFM` `#KoreanAI` `#HuggingFace`