OpenWorldLib: A Unified Codebase and Definition of Advanced World Models
Abstract
OpenWorldLib presents a standardized framework for advanced world models that integrate perception, interaction, and long-term memory capabilities for comprehensive world understanding and prediction.
World models have garnered significant attention as a promising research direction in artificial intelligence, yet a clear and unified definition remains lacking. In this paper, we introduce OpenWorldLib, a comprehensive and standardized inference framework for Advanced World Models. Drawing on the evolution of world models, we propose a clear definition: a world model is a model or framework centered on perception, equipped with interaction and long-term memory capabilities, for understanding and predicting the complex world. We further systematically categorize the essential capabilities of world models. Based on this definition, OpenWorldLib integrates models across different tasks within a unified framework, enabling efficient reuse and collaborative inference. Finally, we present additional reflections and analyses on potential future directions for world model research. Code link: https://github.com/OpenDCAI/OpenWorldLib
Community
Hello everyone, and thank you for your interest in our work. Given the current diversity of research on world models, we aim to provide a unified definition and invocation standard for world models, establishing a clear boundary for this direction. If you are interested, or would like to promote your own work related to world models, please feel free to raise an issue at our code link: https://github.com/OpenDCAI/OpenWorldLib .
One thing to note: because we aim to cover as many methods as possible, the environment setup is relatively complex. This codebase primarily supports inference for the different world model tasks; training, reward settings, and similar aspects are not currently supported. In our next project, we will focus on training and optimizing the lightest and most effective model for each task.
The main results of "OpenWorldLib: A Unified Codebase and Definition of Advanced World Models" are summarized below:
1. Interactive Video Generation Results
The evaluation covers navigation video generation (camera movement) and interactive video generation (physical interactions). Key findings include:
- Matrix-Game-2: Offers fast generation speeds but suffers from noticeable color shifting during long-horizon generation
- Lingbot-World, Hunyuan-GameCraft, and YUME-1.5: Successfully support high-quality navigation video generation
- Hunyuan-WorldPlay: Achieves the best overall visual performance for navigation video generation
- Wan-IT2V: Can execute basic interactive generation but struggles with maintaining physical consistency
- WoW (World Omniscient World Model): Supports diverse functionalities but has significantly inferior generation quality and physical realism compared to Cosmos
2. 3D Generation Results
The 3D generation pipeline supports scene reconstruction with movement controls and camera viewpoint adjustments:
- VGGT and InfiniteVGGT: Can generate 3D scenes from different views but show geometric inconsistency and texture blurring in complex areas when the camera moves significantly
- FlashWorld: Provides faster generation, but balancing stable geometry with sharp details remains a major challenge
- Despite limitations, 3D generation remains crucial for realistic physical simulation in world models
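The camera-viewpoint adjustments mentioned above boil down to constructing a camera pose from a position and a point of interest. Below is a minimal, self-contained sketch of the standard look-at construction; it is an illustration of the general technique, not OpenWorldLib's actual camera API.

```python
import numpy as np

def look_at(eye, target, up=(0.0, 1.0, 0.0)):
    """Build a world-to-camera rotation R and translation t from a
    camera position (eye) and the point it looks at (target)."""
    eye, target, up = (np.asarray(v, dtype=float) for v in (eye, target, up))
    forward = target - eye
    forward /= np.linalg.norm(forward)
    right = np.cross(forward, up)
    right /= np.linalg.norm(right)
    true_up = np.cross(right, forward)
    # Rows of R are the camera axes; t moves the eye to the origin.
    R = np.stack([right, true_up, -forward])
    t = -R @ eye
    return R, t

R, t = look_at(eye=(0, 0, 5), target=(0, 0, 0))
# A camera at (0, 0, 5) looking at the origin maps the eye to the origin:
print(R @ np.array([0.0, 0.0, 5.0]) + t)  # ~ [0, 0, 0]
```

Sweeping `eye` along a trajectory while holding `target` fixed yields the orbiting viewpoints that expose the geometric inconsistencies noted for VGGT-style reconstruction.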
3. Vision-Language-Action (VLA) Generation Results
The framework evaluates embodied AI through two simulation paradigms:
- AI2-THOR: Used for embodied video generation with photorealistic scene rendering
- LIBERO: Used for VLA evaluation with physically grounded manipulation environments
Key models evaluated:
- π₀ and π₀.₅: Leverage PaliGemma vision-language backbone with mixture-of-experts (MoE) action heads for robust multi-task generalization
- LingBot-VA: Approaches tasks from a generative perspective using video diffusion architecture to jointly model visual future predictions and continuous action synthesis
4. Multimodal Reasoning Capabilities
The Reasoning module demonstrates:
- Spatial reasoning: Geometry-centric queries, object relations, and step-by-step spatial deductions from visual inputs
- Omni/general reasoning: Operating over mixed modalities (text, images, audio, video) for broad instruction following
- Function: Converts internal perception and memory into grounded decisions, explanations, and plans that guide downstream generation or control
Framework Architecture Overview
OpenWorldLib unifies these capabilities through modular components for interactive video generation, 3D generation, VLA, and multimodal reasoning, so that models for different tasks can be reused and composed within a single inference framework.
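One common way to realize such a modular design is a task registry that maps task names to model runners behind a single inference entry point. The sketch below is purely illustrative; the class, task names, and runners are hypothetical and do not reflect OpenWorldLib's actual API.

```python
from typing import Any, Callable, Dict

class WorldModelRegistry:
    """Hypothetical registry mapping task names to model runners,
    giving all tasks one shared inference entry point."""

    def __init__(self) -> None:
        self._tasks: Dict[str, Callable[..., Any]] = {}

    def register(self, task: str, runner: Callable[..., Any]) -> None:
        self._tasks[task] = runner

    def infer(self, task: str, **inputs: Any) -> Any:
        if task not in self._tasks:
            raise KeyError(f"unknown task: {task}")
        return self._tasks[task](**inputs)

registry = WorldModelRegistry()
# Stand-in runners; a real system would wrap actual model pipelines.
registry.register("video", lambda prompt, frames=16: f"video[{frames}] for {prompt!r}")
registry.register("reasoning", lambda question: f"answer to {question!r}")

print(registry.infer("video", prompt="walk forward", frames=8))
```

The benefit of this shape is that adding a new world-model task is a one-line `register` call, and downstream code only ever depends on the uniform `infer` interface.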
Key Insight: The paper establishes that while current world models excel at next-frame prediction, significant challenges remain in maintaining physical consistency during long-horizon interactions and balancing generation speed with quality across video, 3D, and embodied action tasks.