OpenWorldLib: A Unified Codebase and Definition of Advanced World Models
Abstract
OpenWorldLib presents a standardized framework for advanced world models that integrate perception, interaction, and long-term memory capabilities for comprehensive world understanding and prediction.
World models have garnered significant attention as a promising research direction in artificial intelligence, yet a clear and unified definition remains lacking. In this paper, we introduce OpenWorldLib, a comprehensive and standardized inference framework for Advanced World Models. Drawing on the evolution of world models, we propose a clear definition: a world model is a model or framework centered on perception, equipped with interaction and long-term memory capabilities, for understanding and predicting the complex world. We further systematically categorize the essential capabilities of world models. Based on this definition, OpenWorldLib integrates models across different tasks within a unified framework, enabling efficient reuse and collaborative inference. Finally, we present additional reflections and analyses on potential future directions for world model research. Code link: https://github.com/OpenDCAI/OpenWorldLib
Community
Hello everyone, and thank you for your interest in our work. Given the current diversity of research on world models, we aim to provide a unified definition and invocation standard for world models, establishing a clear boundary for this direction. If you are interested, or would like to promote your own work related to world models, please feel free to raise an issue at our code link: https://github.com/OpenDCAI/OpenWorldLib .
One thing to note: because we aim to cover as many methods as possible, the environment setup is relatively complex. This codebase primarily supports inference for the different world model tasks; training, reward settings, and similar aspects are not currently supported. In our next project, we will focus on training and optimizing the lightest and most effective model for each task.
The main results of "OpenWorldLib: A Unified Codebase and Definition of Advanced World Models" are summarized below:
1. Interactive Video Generation Results
The evaluation covers navigation video generation (camera movement) and interactive video generation (physical interactions). Key findings include:
- Matrix-Game-2: Offers fast generation speeds but suffers from noticeable color shifting during long-horizon generation
- Lingbot-World, Hunyuan-GameCraft, and YUME-1.5: Successfully support high-quality navigation video generation
- Hunyuan-WorldPlay: Achieves the best overall visual performance for navigation video generation
- Wan-IT2V: Can execute basic interactive generation but struggles with maintaining physical consistency
- WoW (World Omniscient World Model): Supports diverse functionalities but has significantly inferior generation quality and physical realism compared to Cosmos
2. 3D Generation Results
The 3D generation pipeline supports scene reconstruction with movement controls and camera viewpoint adjustments:
- VGGT and InfiniteVGGT: Can generate 3D scenes from different views but show geometric inconsistency and texture blurring in complex areas when the camera moves significantly
- FlashWorld: Provides faster generation, but balancing stable geometry with sharp details remains a major challenge
- Despite limitations, 3D generation remains crucial for realistic physical simulation in world models
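The camera-viewpoint adjustments mentioned above boil down to constructing a camera pose from a position and a point of interest. Below is a minimal, self-contained sketch of the standard look-at construction; it is an illustration of the general technique, not OpenWorldLib's actual camera API.

```python
import numpy as np

def look_at(eye, target, up=(0.0, 1.0, 0.0)):
    """Build a world-to-camera rotation R and translation t from a
    camera position (eye) and the point it looks at (target)."""
    eye, target, up = (np.asarray(v, dtype=float) for v in (eye, target, up))
    forward = target - eye
    forward /= np.linalg.norm(forward)
    right = np.cross(forward, up)
    right /= np.linalg.norm(right)
    true_up = np.cross(right, forward)
    # Rows of R are the camera axes; t moves the eye to the origin.
    R = np.stack([right, true_up, -forward])
    t = -R @ eye
    return R, t

R, t = look_at(eye=(0, 0, 5), target=(0, 0, 0))
# A camera at (0, 0, 5) looking at the origin maps the eye to the origin:
print(R @ np.array([0.0, 0.0, 5.0]) + t)  # ~ [0, 0, 0]
```

Sweeping `eye` along a trajectory while holding `target` fixed yields the orbiting viewpoints that expose the geometric inconsistencies noted for VGGT-style reconstruction.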
3. Vision-Language-Action (VLA) Generation Results
The framework evaluates embodied AI through two simulation paradigms:
- AI2-THOR: Used for embodied video generation with photorealistic scene rendering
- LIBERO: Used for VLA evaluation with physically grounded manipulation environments
Key models evaluated:
- π₀ and π₀.₅: Leverage PaliGemma vision-language backbone with mixture-of-experts (MoE) action heads for robust multi-task generalization
- LingBot-VA: Approaches tasks from a generative perspective using video diffusion architecture to jointly model visual future predictions and continuous action synthesis
4. Multimodal Reasoning Capabilities
The Reasoning module demonstrates:
- Spatial reasoning: Geometry-centric queries, object relations, and step-by-step spatial deductions from visual inputs
- Omni/general reasoning: Operating over mixed modalities (text, images, audio, video) for broad instruction following
- Function: Converts internal perception and memory into grounded decisions, explanations, and plans that guide downstream generation or control
Framework Architecture Overview
OpenWorldLib unifies these capabilities through modular components for interactive video generation, 3D generation, VLA, and multimodal reasoning, so that models for different tasks can be reused and composed within a single inference framework.
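One common way to realize such a modular design is a task registry that maps task names to model runners behind a single inference entry point. The sketch below is purely illustrative; the class, task names, and runners are hypothetical and do not reflect OpenWorldLib's actual API.

```python
from typing import Any, Callable, Dict

class WorldModelRegistry:
    """Hypothetical registry mapping task names to model runners,
    giving all tasks one shared inference entry point."""

    def __init__(self) -> None:
        self._tasks: Dict[str, Callable[..., Any]] = {}

    def register(self, task: str, runner: Callable[..., Any]) -> None:
        self._tasks[task] = runner

    def infer(self, task: str, **inputs: Any) -> Any:
        if task not in self._tasks:
            raise KeyError(f"unknown task: {task}")
        return self._tasks[task](**inputs)

registry = WorldModelRegistry()
# Stand-in runners; a real system would wrap actual model pipelines.
registry.register("video", lambda prompt, frames=16: f"video[{frames}] for {prompt!r}")
registry.register("reasoning", lambda question: f"answer to {question!r}")

print(registry.infer("video", prompt="walk forward", frames=8))
```

The benefit of this shape is that adding a new world-model task is a one-line `register` call, and downstream code only ever depends on the uniform `infer` interface.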
Key Insight: The paper establishes that while current world models excel at next-frame prediction, significant challenges remain in maintaining physical consistency during long-horizon interactions and balancing generation speed with quality across video, 3D, and embodied action tasks.