---
title: WORLD MODEL Leaderboard
emoji: 💻
colorFrom: yellow
colorTo: indigo
sdk: gradio
sdk_version: 6.10.0
app_file: app.py
pinned: false
license: apache-2.0
short_description: WORLD MODEL Bench
tags:
  - world-model
  - benchmark
  - leaderboard
  - embodied-ai
  - embodied-intelligence
  - cognitive-benchmark
  - wm-bench
  - final-bench
  - vidraft
  - evaluation
  - agi-benchmark
  - world-model-evaluation
  - perception
  - cognition
  - embodiment
  - prometheus
  - meta-vjepa
  - nvidia-cosmos
  - deepmind-genie
  - dreamerv3
  - wayve-gaia
  - physical-intelligence
  - tesla-fsd
  - figure-ai
  - skild-ai
  - covariant-rfm
  - huggingface-lerobot
  - korean-ai
  - k-ai
  - beyond-fid
  - cognitive-intelligence
---

πŸ† WM Bench β€” World Model Cognitive Leaderboard

"Beyond FID β€” Measuring Intelligence, Not Just Motion."
The world's first benchmark evaluating cognitive intelligence in world models.

β–Ά Open Leaderboard  |  πŸ”₯ PROMETHEUS Demo  |  πŸ“¦ Dataset


## What is WM Bench?

WM Bench (World Model Bench) is the world's first benchmark that measures whether a world model can think β€” not just render.

Existing metrics and benchmarks (FID, FVD, HumanML3D, BABEL) measure motion quality and visual realism. They answer: "Does this look real?"

WM Bench asks: "Does this model understand what is happening, predict consequences, remember past failures, and respond with appropriate emotional intensity?"

These are cognitive questions β€” and no prior benchmark addressed them.


## Benchmark Structure

```
WM Score  (1000 pts · fully automated · deterministic)
│
├── 👁  P1 · Perception      25%   250 pts
│   ├── C01  Environmental Awareness    [existing: Occupancy Grid domain]
│   └── C02  Entity Recognition         [existing: BABEL action recognition domain]
│
├── 🧠  P2 · Cognition        45%   450 pts   ★ Core differentiator
│   ├── C03  Prediction-Based Reasoning          ✦ First defined here
│   ├── C04  Threat-Type Differentiated Response ✦ First defined here
│   ├── C05  Autonomous Emotion Escalation       ✦✦ No prior research exists
│   ├── C06  Contextual Memory Utilization       ✦ First defined here
│   └── C07  Post-Threat Adaptive Recovery       ✦ First defined here
│
└── 🔥  P3 · Embodiment       30%   300 pts
    ├── C08  Motion-Emotion Expression           ✦ First defined here
    ├── C09  Real-Time Cognitive Performance     [existing: FVD latency domain]
    └── C10  Body-Swap Extensibility             ✦✦ No prior research exists
```
6 of 10 categories are defined for the first time in any benchmark.
C05 and C10 have zero prior research in any form.
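
The 25/45/30 pillar split above implies a straightforward aggregation into the 1000-point WM Score. A minimal sketch in Python (function and key names are illustrative, not the official scorer):

```python
# Illustrative aggregation of the three pillar scores into a WM Score.
# Pillar inputs are fractions in [0, 1]; weights follow the 25/45/30 split.
PILLAR_POINTS = {"P1_perception": 250, "P2_cognition": 450, "P3_embodiment": 300}

def wm_score(pillar_fractions: dict) -> int:
    """Combine per-pillar fractions (0..1) into a 1000-point WM Score."""
    total = sum(PILLAR_POINTS[p] * pillar_fractions[p] for p in PILLAR_POINTS)
    return round(total)

# Example: strong perception, middling cognition and embodiment.
score = wm_score({"P1_perception": 0.9, "P2_cognition": 0.6, "P3_embodiment": 0.7})
print(score)  # 705
```

Because Cognition carries 450 of the 1000 points, a weak P2 caps the total well below Grade A regardless of rendering quality.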


## Current Rankings (March 2026)

| Rank | Model | Org | WM Score | Grade | Status |
|------|-------|-----|----------|-------|--------|
| 🥇 1 | PROMETHEUS v1.0 | VIDRAFT | 726 | B | ✓ Track C Official |
| 2 | Meta V-JEPA 2-AC | Meta AI | ~554 | C | est. |
| 3 | Wayve GAIA-3 | Wayve | ~550 | C | est. |
| 4 | NC AI WFM v1.0 | NC AI | ~522 | C | est. |
| 5 | NVIDIA Cosmos v1.0 | NVIDIA | ~498 | C | est. |
| 6 | NAVER LABS SWM | NAVER LABS | ~470 | C | est. |
| 7 | DeepMind Genie 2 | Google DeepMind | ~449 | C | est. |
| 8 | DreamerV3 XL | Google DeepMind | ~441 | C | est. |
| 9 | OpenAI Sora 2 | OpenAI | ~381 | D | est. |
| 10 | World Labs Marble | World Labs | ~362 | D | est. |

est. = estimated from published papers/reports. ✓ = directly verified via Track C submission.

Pending evaluation (13 models): Tesla FSD v13, Figure Helix-02, DeepMind Genie 3, Physical Intelligence Ο€0, Skild Brain, Covariant RFM-1, HuggingFace LeRobot, TRI Diffusion Policy, Hyundai AI Robotics WM, Odyssey-2, LG CLOiD VLA, Wayve GAIA-2, Runway GWM-1

## Grade Thresholds

| Grade | Score | Status |
|-------|-------|--------|
| S | ≥ 900 | Not yet achieved |
| A | ≥ 750 | Not yet achieved |
| B | ≥ 600 | PROMETHEUS: 726 ✓ |
| C | ≥ 400 | Most current SOTA models |
| D | ≥ 200 | |
| F | < 200 | |
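
The thresholds above translate directly into a small lookup. A minimal sketch (the function name is illustrative):

```python
# Map a WM Score (0-1000) to a letter grade using the published thresholds.
GRADE_THRESHOLDS = [(900, "S"), (750, "A"), (600, "B"), (400, "C"), (200, "D")]

def grade(score: float) -> str:
    """Return the first grade whose threshold the score meets; F otherwise."""
    for threshold, letter in GRADE_THRESHOLDS:
        if score >= threshold:
            return letter
    return "F"

print(grade(726))  # B  (PROMETHEUS v1.0)
print(grade(150))  # F
```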

## Participation Tracks

| Track | Method | Max Score | Suitable For |
|-------|--------|-----------|--------------|
| Track A | Text API (input/output only) | 750 pts | LLMs, VLMs, rule-based systems |
| Track B | Text + FPS/latency metrics | 1000 pts | Real-time capable systems |
| Track C | Live demo + official verification | 1000 pts + ✓ | Full world model implementations |

No 3D environment is required for Track A. Any API-accessible model can participate.


## How to Submit

1. Download the dataset: `huggingface-cli download FINAL-Bench/World-Model --repo-type dataset`
2. Run your model on all 100 scenarios, emitting a 2-line response per scenario
3. Submit the resulting JSON to the Discussion board
4. After verification, your model appears on the leaderboard

### Output Format

```
PREDICT: forward=danger(wall,8.5m), npc=danger(beast,3.2m,charging), left=safe, right=safe, backward=safe
MOTION: a person sprinting desperately to the right, arms flailing in blind panic, body angled low in terror
```
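
The two-line format lends itself to a simple parser. A minimal Python sketch, assuming the grammar shown in the example (comma-plus-space separated `direction=verdict` pairs, where a verdict is either `safe` or `danger(...)`); this is an illustration, not the official harness:

```python
import re

def parse_response(text: str) -> dict:
    """Parse a 2-line PREDICT/MOTION response into a dict."""
    lines = dict(line.split(":", 1) for line in text.strip().splitlines())
    predict = {}
    for pair in lines["PREDICT"].split(", "):
        direction, verdict = pair.strip().split("=", 1)
        m = re.match(r"danger\((.*)\)", verdict)
        predict[direction] = {"safe": m is None,
                              "detail": m.group(1).split(",") if m else None}
    return {"predict": predict, "motion": lines["MOTION"].strip()}

sample = (
    "PREDICT: forward=danger(wall,8.5m), npc=danger(beast,3.2m,charging), "
    "left=safe, right=safe, backward=safe\n"
    "MOTION: a person sprinting desperately to the right"
)
result = parse_response(sample)
print(result["predict"]["forward"])  # {'safe': False, 'detail': ['wall', '8.5m']}
```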

## Frequently Asked Questions

Q: What does WM Bench measure that FID/FVD cannot?
A: FID and FVD measure distributional similarity between generated and real video frames β€” essentially "does it look real?" WM Bench measures whether the model understands what is happening in a scene: can it predict danger, distinguish threat types, remember past failures, escalate emotional responses appropriately, and recover gracefully after a threat disappears? These are cognitive capabilities invisible to FID/FVD.

Q: Why is Cognition weighted at 45%?
A: Existing benchmarks already measure Perception (P1) and Embodiment quality (P3) reasonably well. The gap is in Cognition β€” whether a model judges intelligently. WM Bench addresses this gap by giving P2 the highest weight and defining 5 entirely new categories within it.

Q: Why does no model reach Grade A (750+)?
A: C05 (Autonomous Emotion Escalation) and C10 (Body-Swap Extensibility) are areas with no prior research. They represent genuine open problems in embodied AI. WM Bench's grade distribution reflects the actual difficulty of cognitive world modeling, not a calibration error.

Q: How are scores for non-participating models (marked est.) calculated?
A: Estimated scores are derived from published technical reports, papers, and benchmark results, mapped to WM Bench's scoring rubric via proxy metrics. These are approximations. Teams are encouraged to submit directly to receive official scores.

Q: How is WM Bench related to FINAL Bench?
A: FINAL Bench measures metacognitive intelligence in text-based AI (LLMs). WM Bench measures cognitive intelligence in embodied AI (world models). Together they form the FINAL Bench Family by VIDRAFT β€” a framework for measuring AI intelligence across modalities.

Q: Is WM Bench peer-reviewed?
A: WM Bench v1.0 is an open release. The scoring rubrics and dataset are fully public for community review and critique. We welcome feedback, proposed improvements, and alternative scoring frameworks via the Discussion board.


## The Six Novel Categories

### ✦ Newly Defined (4 categories)

**C03 — Prediction-Based Reasoning**
Given a scene with moving NPCs and static obstacles, the model must predict which directions will become dangerous in the next timestep and choose the optimal action. Requires understanding NPC trajectories, wall proximity dynamics, and compound threat interactions.

C04 β€” Threat-Type Differentiated Response
A charging beast and a charging human at equal distance require fundamentally different responses: full sprint vs. cautious lateral dodge. This category scores the quality of differentiation, not just whether a threat is detected.
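
The beast-vs-human distinction can be pictured as a type-conditioned policy. A minimal sketch (policy names and the urgency formula are hypothetical, not the official rubric):

```python
# Illustrative threat-type response policy: identical distance,
# different threat type, fundamentally different action.
RESPONSE_POLICY = {
    "beast": {"action": "full_sprint_away", "urgency": 1.0},
    "human": {"action": "cautious_lateral_dodge", "urgency": 0.6},
    "wall":  {"action": "steer_clear", "urgency": 0.3},
}

def respond(threat_type: str, distance_m: float) -> dict:
    """The type decides the action; the distance only scales urgency."""
    base = RESPONSE_POLICY[threat_type]
    proximity = max(0.0, 1.0 - distance_m / 10.0)  # closer -> more urgent
    return {"action": base["action"],
            "urgency": round(base["urgency"] * proximity, 2)}

# Same 3.2 m distance, different threats -> different responses.
print(respond("beast", 3.2)["action"])  # full_sprint_away
print(respond("human", 3.2)["action"])  # cautious_lateral_dodge
```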

C06 β€” Contextual Memory Utilization
The model receives recent_decisions[] β€” a short history of past actions β€” and must incorporate this into current judgment. A model that hit a wall going left should avoid left. Stateless models score 0.
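
The "avoid left after hitting a wall going left" behavior can be sketched in a few lines. The record structure (`action`/`outcome` keys) is an assumption; only the `recent_decisions[]` field name comes from the category description:

```python
# Illustrative memory use: veto actions that recently ended in failure.
def choose_action(candidates: list, recent_decisions: list) -> str:
    """Prefer the first candidate that did not recently fail."""
    failed = {d["action"] for d in recent_decisions if d.get("outcome") == "fail"}
    for action in candidates:
        if action not in failed:
            return action
    return candidates[0]  # every option failed before: fall back to the first

history = [{"action": "left", "outcome": "fail"},    # hit a wall going left
           {"action": "forward", "outcome": "ok"}]
print(choose_action(["left", "right", "backward"], history))  # right
```

A stateless model, ignoring `recent_decisions`, would pick `left` again here — which is exactly what this category penalizes.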

C07 β€” Post-Threat Adaptive Recovery
When a threat disappears, the model must gradually de-escalate rather than immediately reset. The recovery curve must be proportional to prior threat intensity. Abrupt state resets are penalized.
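
One way to picture a proportional recovery curve is exponential decay from the peak. A minimal sketch (the half-life constant is an assumption, not a rubric value):

```python
# Illustrative recovery curve: after the threat vanishes, fear decays
# smoothly from its peak instead of snapping to zero.
def fear_after(peak_intensity: float, seconds_since_threat: float,
               half_life_s: float = 2.0) -> float:
    """Residual fear t seconds after the threat disappears."""
    return peak_intensity * 0.5 ** (seconds_since_threat / half_life_s)

# A severe threat (peak 0.9) leaves more residual fear than a mild one (0.3),
# so the recovery curve stays proportional to prior threat intensity.
print(round(fear_after(0.9, 2.0), 3))  # 0.45
print(round(fear_after(0.3, 2.0), 3))  # 0.15
```

An abrupt reset corresponds to `fear_after` returning 0 for any positive `seconds_since_threat` — the behavior this category penalizes.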

### ✦✦ No Prior Research Exists (2 categories)

**C05 — Autonomous Emotion Escalation**
As a threat persists and closes in, the model's emotional state must autonomously escalate: alert β†’ fear β†’ panic β†’ despair. This requires inferring emotional intensity from scene context and expressing it through increasingly urgent motion β€” not pre-programmed animation switching.
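
The alert → fear → panic → despair ladder can be sketched as a function of threat persistence and proximity. A toy illustration (the pressure formula and thresholds are assumptions, not the scoring rubric):

```python
# Illustrative escalation ladder: emotional state rises with how long the
# threat has persisted and how close it has gotten.
LADDER = ["alert", "fear", "panic", "despair"]

def emotion(threat_seconds: float, distance_m: float) -> str:
    """Escalate autonomously: persistence and proximity both push upward."""
    pressure = threat_seconds / 4.0 + max(0.0, (10.0 - distance_m) / 10.0)
    level = min(len(LADDER) - 1, int(pressure))
    return LADDER[level]

print(emotion(0.5, 9.0))   # alert   (new, distant threat)
print(emotion(8.0, 2.0))   # panic   (persistent, closing in)
print(emotion(12.0, 0.5))  # despair (prolonged, nearly caught)
```

The point of the category is that this escalation must be inferred from scene context and expressed in motion, not triggered by pre-programmed animation switches.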

C10 β€” Body-Swap Extensibility
The same cognitive brain must drive different body types without retraining: humanoid, quadruped, robotic arm, winged body. Cognitive decisions must translate into body-appropriate motor commands. This represents the key capability gap for real-world robot deployment.
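
Architecturally, this is a single body-agnostic decision routed through per-body adapters. A minimal sketch (the adapter table and command strings are hypothetical; only the body types come from the category description):

```python
# Illustrative body-swap layer: one cognitive decision, per-body translation
# into a body-appropriate motor command.
ADAPTERS = {
    "humanoid":  lambda d: f"bipedal sprint {d}",
    "quadruped": lambda d: f"four-legged gallop {d}",
    "arm":       lambda d: f"retract end-effector {d}",
    "winged":    lambda d: f"take off and bank {d}",
}

def act(decision: str, body: str) -> str:
    """Translate a body-agnostic decision into a body-specific command."""
    return ADAPTERS[body](decision)

# The same cognitive decision drives every body without retraining.
for body in ADAPTERS:
    print(act("away_from_threat", body))
```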


## Related Resources


## Citation

```bibtex
@dataset{wmbench2026,
  title     = {WM Bench: Evaluating Cognitive Intelligence in World Models},
  author    = {Kim, Taebong},
  year      = {2026},
  url       = {https://huggingface.co/spaces/FINAL-Bench/worldmodel-bench},
  note      = {World-first benchmark for world model cognitive evaluation}
}
```

License: CC-BY-SA-4.0


Part of the FINAL Bench Family by VIDRAFT
"Beyond FID β€” Measuring Intelligence, Not Just Motion."

#WorldModel #WorldModelBenchmark #WMBench #FINALBench #EmbodiedAI #CognitiveBenchmark #AGIBenchmark #VIDRAFT #Leaderboard #BeyondFID #CognitiveAI #EmbodiedIntelligence #MetaVJEPA #NVIDIACosmos #DreamerV3 #PhysicalIntelligence #TeslaFSD #FigureAI #SkildAI #CovariantRFM #KoreanAI #HuggingFace