SeaWolf-AI
posted an update 2 days ago
🌍 World Model Bench: does your world model actually think?

FID measures realism. FVD measures smoothness. But neither tells you whether the model understood the scene.

We just released WM Bench, the first benchmark for cognitive intelligence in world models. The core question: when a beast charges from 3 meters away, does the model know to sprint rather than walk? Does it respond differently to a human than to an animal? Does it remember that the left corridor was blocked two steps ago?

Those are cognitive questions. No existing benchmark asks them. So we built one.

3 Pillars · 10 Categories · 100 Scenarios · 1,000-point scale

- 👁 P1 Perception (25%): Can it read the scene?
- 🧠 P2 Cognition (45%): Does it predict threats, escalate emotions, and use memory?
- 🔥 P3 Embodiment (30%): Does the body respond with the right motion?
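
To make the weighting concrete, here is a minimal sketch of how the three pillar weights could map onto the 1,000-point scale. The weights and the scale come from the post; the linear mapping, the function names, and the idea of feeding in a per-pillar "fraction earned" are assumptions for illustration, not the official WM Bench scoring code.

```python
# Pillar weights in percent, as stated in the post.
PILLAR_WEIGHT_PCT = {"P1": 25, "P2": 45, "P3": 30}
TOTAL_POINTS = 1000  # the post's 1,000-point scale


def pillar_max_points() -> dict:
    """Points available per pillar, ASSUMING weights map linearly onto the scale."""
    return {p: TOTAL_POINTS * pct // 100 for p, pct in PILLAR_WEIGHT_PCT.items()}


def total_score(fraction_earned: dict) -> float:
    """Combine per-pillar fractions in [0, 1] into a score out of 1,000.

    `fraction_earned` is a hypothetical input: the share of each pillar's
    points that a model earned. Missing pillars count as 0.
    """
    caps = pillar_max_points()
    for pillar, frac in fraction_earned.items():
        if pillar not in caps:
            raise KeyError(f"unknown pillar: {pillar}")
        if not 0.0 <= frac <= 1.0:
            raise ValueError(f"fraction for {pillar} must be in [0, 1]")
    return sum(caps[p] * fraction_earned.get(p, 0.0) for p in caps)


print(pillar_max_points())  # {'P1': 250, 'P2': 450, 'P3': 300}
print(total_score({"P1": 1.0, "P2": 0.5, "P3": 0.0}))  # 475.0
```

Under this assumption, Cognition alone accounts for 450 of the 1,000 points, so a model that perceives and moves well but reasons poorly cannot score highly.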

All evaluation is via simple JSON I/O: no 3D engine, no special hardware. Any model with an API can participate.
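
As an illustration of what JSON-only evaluation can look like, here is a small sketch that builds a scenario request and validates a model's reply. Every field name here (`scenario_id`, `scene`, `action`) is hypothetical; the actual WM Bench schema is defined by the dataset, not by this snippet.

```python
import json

# HYPOTHETICAL keys a reply must contain; not the official schema.
REQUIRED_KEYS = {"scenario_id", "action"}


def build_request(scenario_id: str, scene: str) -> str:
    """Serialize one scenario as a JSON string to send to a model API."""
    return json.dumps({"scenario_id": scenario_id, "scene": scene})


def parse_response(raw: str) -> dict:
    """Parse a model's JSON reply and check that the assumed keys are present."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("response must be a JSON object")
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"response is missing keys: {sorted(missing)}")
    return data


request = build_request("demo-001", "A beast charges from 3 meters away.")
# A stand-in for whatever the model's API would return.
reply = json.dumps({"scenario_id": "demo-001", "action": "sprint"})
print(parse_response(reply)["action"])
```

Because both sides are plain JSON strings, the same harness works for any model reachable over an API, which is the property the post emphasizes.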

We also built PROMETHEUS as a live reference implementation. It runs in your browser on a T4, no install needed, and combines FloodDiffusion motion generation with an LLM cognitive brain (Perceive → Predict → Decide → Act). It scored 726/1000 (Grade B) on Track C and is the only directly verified model so far. Submissions from other teams are very welcome.
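
The Perceive → Predict → Decide → Act loop can be pictured with a toy agent. This is a rule-based sketch under stated assumptions, not PROMETHEUS itself: the `ToyAgent` class, the observation fields, and the 5-meter threat radius are all invented for illustration, and a real system would replace the rules with an LLM and a motion generator.

```python
from dataclasses import dataclass, field

THREAT_RADIUS_M = 5.0  # hypothetical threshold for an "imminent" threat


@dataclass
class ToyAgent:
    # Memory of exits seen blocked in earlier steps.
    blocked: set = field(default_factory=set)

    def perceive(self, obs: dict) -> dict:
        """Read the scene and remember any exits reported as blocked."""
        self.blocked.update(obs.get("blocked_exits", []))
        return {
            "threat": obs.get("threat"),
            "distance_m": obs.get("distance_m"),
            "exits": obs.get("exits", []),
        }

    def predict(self, percept: dict) -> bool:
        """Return True if a threat is present and within the threat radius."""
        return (
            percept["threat"] is not None
            and percept["distance_m"] <= THREAT_RADIUS_M
        )

    def decide(self, percept: dict, imminent: bool) -> dict:
        """Pick a gait and the first exit not remembered as blocked."""
        open_exits = [e for e in percept["exits"] if e not in self.blocked]
        return {
            "gait": "sprint" if imminent else "walk",
            "exit": open_exits[0] if open_exits else None,
        }

    def act(self, decision: dict) -> str:
        """Turn the decision into a motion command string."""
        return f"{decision['gait']} toward {decision['exit']}"

    def step(self, obs: dict) -> str:
        percept = self.perceive(obs)
        return self.act(self.decide(percept, self.predict(percept)))


agent = ToyAgent()
# Step 1: no threat, but the left corridor is seen to be blocked.
print(agent.step({"exits": ["left", "right"], "blocked_exits": ["left"]}))
# Step 2: a beast charges from 3 m; the blocked corridor is remembered.
print(agent.step({"threat": "beast", "distance_m": 3.0, "exits": ["left", "right"]}))
```

The second step touches all three pillars in miniature: the agent reads the scene, recalls that the left corridor was blocked, and chooses to sprint rather than walk.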

---

🗂 Dataset → FINAL-Bench/World-Model
🌍 Demo → FINAL-Bench/World-Model
🏆 Leaderboard → FINAL-Bench/worldmodel-bench
📝 Article → https://huggingface.co/blog/FINAL-Bench/world-model

Part of the FINAL Bench Family, alongside FINAL Bench (Feb 2026). Feedback on rubrics and missing models is always welcome!