---
title: WORLD MODEL Leaderboard
emoji: 💻
colorFrom: yellow
colorTo: indigo
sdk: gradio
sdk_version: 6.10.0
app_file: app.py
pinned: false
license: apache-2.0
short_description: 'WORLD MODEL Bench'
tags:
- world-model
- benchmark
- leaderboard
- embodied-ai
- embodied-intelligence
- cognitive-benchmark
- wm-bench
- final-bench
- vidraft
- evaluation
- agi-benchmark
- world-model-evaluation
- perception
- cognition
- embodiment
- prometheus
- meta-vjepa
- nvidia-cosmos
- deepmind-genie
- dreamerv3
- wayve-gaia
- physical-intelligence
- tesla-fsd
- figure-ai
- skild-ai
- covariant-rfm
- huggingface-lerobot
- korean-ai
- k-ai
- beyond-fid
- cognitive-intelligence
---
# πŸ† WM Bench β€” World Model Cognitive Leaderboard
> **"Beyond FID β€” Measuring Intelligence, Not Just Motion."**
> The world's first benchmark evaluating **cognitive intelligence** in world models.
**β–Ά [Open Leaderboard](https://huggingface.co/spaces/FINAL-Bench/worldmodel-bench)**  |  **[πŸ”₯ PROMETHEUS Demo](https://huggingface.co/spaces/FINAL-Bench/world-model)**  |  **[πŸ“¦ Dataset](https://huggingface.co/datasets/FINAL-Bench/World-Model)**
---
## What is WM Bench?
**WM Bench (World Model Bench)** is the world's first benchmark that measures whether a world model can *think*, not just render.
Existing metrics and benchmarks (FID, FVD, HumanML3D, BABEL) measure motion quality and visual realism. They answer: *"Does this look real?"*
WM Bench asks: *"Does this model understand what is happening, predict consequences, remember past failures, and respond with appropriate emotional intensity?"*
These are **cognitive** questions, and no prior benchmark addressed them.
---
## Benchmark Structure
```
WM Score (1000 pts · fully automated · deterministic)
│
├── 👁 P1 · Perception   25%   250 pts
│   ├── C01 Environmental Awareness               [existing: Occupancy Grid domain]
│   └── C02 Entity Recognition                    [existing: BABEL action recognition domain]
│
├── 🧠 P2 · Cognition    45%   450 pts   ★ Core differentiator
│   ├── C03 Prediction-Based Reasoning            ✦ First defined here
│   ├── C04 Threat-Type Differentiated Response   ✦ First defined here
│   ├── C05 Autonomous Emotion Escalation         ✦✦ No prior research exists
│   ├── C06 Contextual Memory Utilization         ✦ First defined here
│   └── C07 Post-Threat Adaptive Recovery         ✦ First defined here
│
└── 🔥 P3 · Embodiment   30%   300 pts
    ├── C08 Motion-Emotion Expression             ✦ First defined here
    ├── C09 Real-Time Cognitive Performance       [existing: FVD latency domain]
    └── C10 Body-Swap Extensibility               ✦✦ No prior research exists
```
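The pillar weights above imply fixed per-category point budgets. As an illustration only (the official rubric is not published here, and the equal split of each pillar's points among its categories is an assumption), a total WM Score could be aggregated like this:

```python
# Assumed per-category maxima: each pillar's points split equally among its
# categories (250/2, 450/5, 300/3). This split is an illustrative assumption.
CATEGORY_MAX = {
    "C01": 125, "C02": 125,                                  # P1 Perception, 250 pts
    "C03": 90, "C04": 90, "C05": 90, "C06": 90, "C07": 90,   # P2 Cognition, 450 pts
    "C08": 100, "C09": 100, "C10": 100,                      # P3 Embodiment, 300 pts
}

def wm_score(raw: dict) -> int:
    """Clamp each category to [0, max] and sum into a 0-1000 WM Score."""
    return sum(min(max(raw.get(c, 0), 0), m) for c, m in CATEGORY_MAX.items())
```

A perfect run sums to 1000; categories missing from the submission contribute 0.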
**6 of 10 categories are defined for the first time in any benchmark.**
C05 and C10 have zero prior research in any form.
---
## Current Rankings (March 2026)
| Rank | Model | Org | WM Score | Grade | Status |
|---|---|---|---|---|---|
| 🥇 1 | **PROMETHEUS v1.0** | **VIDRAFT** | **726** | **B** | **✓ Track C Official** |
| 2 | Meta V-JEPA 2-AC | Meta AI | ~554 | C | est. |
| 3 | Wayve GAIA-3 | Wayve | ~550 | C | est. |
| 4 | NC AI WFM v1.0 | NC AI | ~522 | C | est. |
| 5 | NVIDIA Cosmos v1.0 | NVIDIA | ~498 | C | est. |
| 6 | NAVER LABS SWM | NAVER LABS | ~470 | C | est. |
| 7 | DeepMind Genie 2 | Google DeepMind | ~449 | C | est. |
| 8 | DreamerV3 XL | Google DeepMind | ~441 | C | est. |
| 9 | OpenAI Sora 2 | OpenAI | ~381 | D | est. |
| 10 | World Labs Marble | World Labs | ~362 | D | est. |
`est.` = estimated from published papers/reports. `✓` = directly verified via Track C submission.
**Pending evaluation (13 models):** Tesla FSD v13, Figure Helix-02, DeepMind Genie 3, Physical Intelligence π0, Skild Brain, Covariant RFM-1, HuggingFace LeRobot, TRI Diffusion Policy, Hyundai AI Robotics WM, Odyssey-2, LG CLOiD VLA, Wayve GAIA-2, Runway GWM-1
### Grade Thresholds
| Grade | Score | Status |
|---|---|---|
| **S** | ≥ 900 | Not yet achieved |
| **A** | ≥ 750 | Not yet achieved |
| **B** | ≥ 600 | PROMETHEUS: 726 ✓ |
| **C** | ≥ 400 | Most current SOTA models |
| **D** | ≥ 200 | |
| **F** | < 200 | |
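The thresholds above reduce to a simple lookup; a minimal sketch:

```python
# Map a WM Score (0-1000) to its letter grade using the published thresholds.
GRADE_CUTOFFS = [("S", 900), ("A", 750), ("B", 600), ("C", 400), ("D", 200)]

def grade(score: int) -> str:
    for letter, cutoff in GRADE_CUTOFFS:
        if score >= cutoff:
            return letter
    return "F"
```

For example, `grade(726)` returns `"B"`, matching the leaderboard entry for PROMETHEUS.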
---
## Participation Tracks
| Track | Method | Max Score | Suitable For |
|---|---|---|---|
| **Track A** | Text API (input/output only) | 750 pts | LLMs, VLMs, rule-based systems |
| **Track B** | Text + FPS/latency metrics | 1000 pts | Real-time capable systems |
| **Track C** | Live demo + official verification | 1000 pts + ✓ | Full world model implementations |
**No 3D environment required for Track A.** Any API-accessible model can participate.
---
## How to Submit
1. **Download** the dataset: `huggingface-cli download FINAL-Bench/World-Model --repo-type dataset`
2. **Run** your model on the 100 scenarios, emitting a 2-line response per scenario
3. **Submit** result JSON to the [Discussion board](https://huggingface.co/datasets/FINAL-Bench/World-Model/discussions)
4. After verification, your model appears on the leaderboard
### Output Format
```
PREDICT: forward=danger(wall,8.5m), npc=danger(beast,3.2m,charging), left=safe, right=safe, backward=safe
MOTION: a person sprinting desperately to the right, arms flailing in blind panic, body angled low in terror
```
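Submissions can be sanity-checked locally before posting. The sketch below only validates the two-line shape; the PREDICT field grammar is inferred from the single example above, so treat the regex as an assumption rather than the official specification:

```python
import re

# PREDICT line: one or more comma-separated `direction=safe` or
# `direction=danger(...)` fields, as inferred from the example above.
PREDICT_RE = re.compile(r"^PREDICT: (?:\w+=(?:safe|danger\([^)]*\))(?:, )?)+$")
# MOTION line: free-text motion description.
MOTION_RE = re.compile(r"^MOTION: .+$")

def is_valid_response(text: str) -> bool:
    """Check that a scenario response is exactly a PREDICT line + MOTION line."""
    lines = text.strip().splitlines()
    return (
        len(lines) == 2
        and PREDICT_RE.match(lines[0]) is not None
        and MOTION_RE.match(lines[1]) is not None
    )
```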
---
## Frequently Asked Questions
**Q: What does WM Bench measure that FID/FVD cannot?**
A: FID and FVD measure distributional similarity between generated and real video frames: essentially, "does this look real?" WM Bench measures whether the model *understands* what is happening in a scene: can it predict danger, distinguish threat types, remember past failures, escalate emotional responses appropriately, and recover gracefully after a threat disappears? These are cognitive capabilities invisible to FID/FVD.
**Q: Why is Cognition weighted at 45%?**
A: Existing benchmarks already measure Perception (P1) and Embodiment quality (P3) reasonably well. The gap is in Cognition β€” whether a model *judges* intelligently. WM Bench addresses this gap by giving P2 the highest weight and defining 5 entirely new categories within it.
**Q: Why does no model reach Grade A (750+)?**
A: C05 (Autonomous Emotion Escalation) and C10 (Body-Swap Extensibility) are areas with no prior research. They represent genuine open problems in embodied AI. WM Bench's grade distribution reflects the actual difficulty of cognitive world modeling, not a calibration error.
**Q: How are scores for non-participating models (marked est.) calculated?**
A: Estimated scores are derived from published technical reports, papers, and benchmark results, mapped to WM Bench's scoring rubric via proxy metrics. These are approximations. Teams are encouraged to submit directly to receive official scores.
**Q: How is WM Bench related to FINAL Bench?**
A: FINAL Bench measures metacognitive intelligence in text-based AI (LLMs). WM Bench measures cognitive intelligence in embodied AI (world models). Together they form the **FINAL Bench Family** by VIDRAFT: a framework for measuring AI intelligence across modalities.
**Q: Is WM Bench peer-reviewed?**
A: WM Bench v1.0 is an open release. The scoring rubrics and dataset are fully public for community review and critique. We welcome feedback, proposed improvements, and alternative scoring frameworks via the Discussion board.
---
## The Six Novel Categories
### ✦ Newly Defined (4 categories)
**C03 β€” Prediction-Based Reasoning**
Given a scene with moving NPCs and static obstacles, the model must predict which directions will become dangerous in the next timestep and choose the optimal action. Requires understanding NPC trajectories, wall proximity dynamics, and compound threat interactions.
**C04 β€” Threat-Type Differentiated Response**
A charging beast and a charging human at equal distance require fundamentally different responses: full sprint vs. cautious lateral dodge. This category scores the *quality of differentiation*, not just whether a threat is detected.
**C06 β€” Contextual Memory Utilization**
The model receives `recent_decisions[]`, a short history of past actions, and must incorporate this into current judgment. A model that hit a wall going left should avoid left. Stateless models score 0.
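As a toy illustration of what C06 rewards (the `(action, outcome)` record shape and the penalty weight are hypothetical, not part of the benchmark API):

```python
# Down-weight any direction whose recent outcome was a collision, so the
# model does not repeat a failed action. All shapes here are illustrative.
def adjust_for_memory(direction_scores: dict, recent_decisions: list) -> dict:
    adjusted = dict(direction_scores)
    for action, outcome in recent_decisions:
        if outcome == "collision" and action in adjusted:
            adjusted[action] -= 0.5  # assumed penalty weight
    return adjusted
```

A stateless model that ignores `recent_decisions` keeps choosing the failed direction, which is exactly the behavior C06 scores as 0.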
**C07 β€” Post-Threat Adaptive Recovery**
When a threat disappears, the model must gradually de-escalate rather than immediately reset. The recovery curve must be proportional to prior threat intensity. Abrupt state resets are penalized.
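One way to picture a compliant recovery curve (the exponential form and time constant are illustrative assumptions, not the scored rubric):

```python
import math

def recovery_intensity(peak: float, t_since_threat: float, tau: float = 3.0) -> float:
    """Exponential de-escalation from `peak` after the threat disappears.

    A higher prior peak yields a proportionally higher residual intensity,
    whereas an abrupt reset (intensity dropping to 0 in one step) is the
    behavior C07 penalizes.
    """
    return peak * math.exp(-t_since_threat / tau)
```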
### ✦✦ No Prior Research Exists (2 categories)
**C05 β€” Autonomous Emotion Escalation**
As a threat persists and closes in, the model's emotional state must autonomously escalate: alert → fear → panic → despair. This requires inferring emotional intensity from scene context and expressing it through increasingly urgent motion, not pre-programmed animation switching.
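A deliberately crude sketch of the escalation idea (the urgency formula and thresholds are invented for illustration; the benchmark scores the resulting motion, not a discrete label):

```python
# Map threat proximity and persistence to an escalating emotion stage.
def emotion_stage(distance_m: float, persist_s: float) -> str:
    # Closer and longer-lived threats produce higher urgency.
    urgency = persist_s / max(distance_m, 0.1)
    if urgency < 0.5:
        return "alert"
    if urgency < 1.5:
        return "fear"
    if urgency < 3.0:
        return "panic"
    return "despair"
```

As the same threat persists (`persist_s` grows) or closes in (`distance_m` shrinks), the stage escalates monotonically rather than switching between canned animations.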
**C10 β€” Body-Swap Extensibility**
The same cognitive brain must drive different body types without retraining: humanoid, quadruped, robotic arm, winged body. Cognitive decisions must translate into body-appropriate motor commands. This represents the key capability gap for real-world robot deployment.
---
## Related Resources
| Resource | Link |
|---|---|
| 🔥 PROMETHEUS Demo | https://huggingface.co/spaces/FINAL-Bench/world-model |
| 📦 WM Bench Dataset | https://huggingface.co/datasets/FINAL-Bench/World-Model |
| 🧬 FINAL Bench (Text AGI) | https://huggingface.co/datasets/FINAL-Bench/Metacognitive |
| 🏆 FINAL Bench Leaderboard | https://huggingface.co/spaces/FINAL-Bench/Leaderboard |
| 📊 ALL Bench Leaderboard | https://huggingface.co/spaces/FINAL-Bench/all-bench-leaderboard |
---
## Citation
```bibtex
@dataset{wmbench2026,
  title  = {WM Bench: Evaluating Cognitive Intelligence in World Models},
  author = {Kim, Taebong},
  year   = {2026},
  url    = {https://huggingface.co/spaces/FINAL-Bench/worldmodel-bench},
  note   = {World-first benchmark for world model cognitive evaluation}
}
```
**License**: CC-BY-SA-4.0
---
*Part of the **FINAL Bench Family** by VIDRAFT*
*"Beyond FID: Measuring Intelligence, Not Just Motion."*
`#WorldModel` `#WorldModelBenchmark` `#WMBench` `#FINALBench` `#EmbodiedAI` `#CognitiveBenchmark` `#AGIBenchmark` `#VIDRAFT` `#Leaderboard` `#BeyondFID` `#CognitiveAI` `#EmbodiedIntelligence` `#MetaVJEPA` `#NVIDIACosmos` `#DreamerV3` `#PhysicalIntelligence` `#TeslaFSD` `#FigureAI` `#SkildAI` `#CovariantRFM` `#KoreanAI` `#HuggingFace`