arxiv:2604.04707

OpenWorldLib: A Unified Codebase and Definition of Advanced World Models

Published on Apr 6 · Submitted by taesiri on Apr 7 · #1 Paper of the day
Abstract

OpenWorldLib presents a standardized framework for advanced world models that integrate perception, interaction, and long-term memory capabilities for comprehensive world understanding and prediction.

AI-generated summary

World models have garnered significant attention as a promising research direction in artificial intelligence, yet a clear and unified definition remains lacking. In this paper, we introduce OpenWorldLib, a comprehensive and standardized inference framework for Advanced World Models. Drawing on the evolution of world models, we propose a clear definition: a world model is a model or framework centered on perception, equipped with interaction and long-term memory capabilities, for understanding and predicting the complex world. We further systematically categorize the essential capabilities of world models. Based on this definition, OpenWorldLib integrates models across different tasks within a unified framework, enabling efficient reuse and collaborative inference. Finally, we present additional reflections and analyses on potential future directions for world model research. Code link: https://github.com/OpenDCAI/OpenWorldLib

Community

Paper author

Hello everyone, and thank you for your interest in our work. Given the current diversity of research on world models, we aim to provide a unified definition and calling convention for world models, establishing a clear boundary for this research direction. If you are interested, or would like to promote your own work related to world models, please feel free to raise an issue at our code link: https://github.com/OpenDCAI/OpenWorldLib .

Paper author

One thing to note: because we aim to cover as many methods as possible, the environment setup is relatively complex. This codebase primarily supports inference across the different world model tasks; training, reward configuration, and similar aspects are not currently supported. In our next project, we will focus on training and optimizing the lightest and most effective model for each task.
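Since the codebase exposes many heterogeneous tasks behind a single inference entry point, a task-registry pattern is one natural way to structure such a framework. The sketch below is purely illustrative: the function and task names are assumptions, not the actual OpenWorldLib API.

```python
from typing import Callable, Dict, Any

# Registry mapping a task name to its inference backend.
# (Hypothetical structure -- NOT the real OpenWorldLib internals.)
_TASK_REGISTRY: Dict[str, Callable[..., Any]] = {}

def register_task(name: str):
    """Decorator that registers an inference function under a task name."""
    def wrap(fn):
        _TASK_REGISTRY[name] = fn
        return fn
    return wrap

@register_task("video_generation")
def run_video_generation(prompt: str, **kwargs):
    # A real backend would load a model (e.g. a navigation video
    # generator) and produce frames; here we return a stub result.
    return {"task": "video_generation", "prompt": prompt}

def infer(task: str, **kwargs):
    """Single entry point: dispatch any supported task to its backend."""
    if task not in _TASK_REGISTRY:
        raise ValueError(f"Unsupported task: {task!r}")
    return _TASK_REGISTRY[task](**kwargs)

result = infer("video_generation", prompt="walk forward through the kitchen")
print(result["task"])  # -> video_generation
```

The registry keeps per-task backends decoupled while giving callers one uniform `infer(...)` interface, which is the kind of reuse a unified inference codebase aims for.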

good job!

Paper author

Thank you very much!

Based on the paper "OpenWorldLib: A Unified Codebase and Definition of Advanced World Models", here is a summary of the main results:

1. Interactive Video Generation Results

The evaluation covers navigation video generation (camera movement) and interactive video generation (physical interactions). Key findings include:

  • Matrix-Game-2: Offers fast generation speeds but suffers from noticeable color shifting during long-horizon generation
  • Lingbot-World, Hunyuan-GameCraft, and YUME-1.5: Successfully support high-quality navigation video generation
  • Hunyuan-WorldPlay: Achieves the best overall visual performance for navigation video generation
  • Wan-IT2V: Can execute basic interactive generation but struggles with maintaining physical consistency
  • WoW (World Omniscient World Model): Supports diverse functionalities but has significantly inferior generation quality and physical realism compared to Cosmos

Figure 4: Demonstration of interactive video generation results showing navigation and interactive scenarios
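The color shifting noted for Matrix-Game-2 during long-horizon generation can be quantified with a simple per-frame statistic. Below is a minimal sketch (not from the paper; `color_drift` is a hypothetical helper) that measures how far each frame's mean color drifts from the first frame:

```python
import numpy as np

def color_drift(frames: np.ndarray) -> np.ndarray:
    """Per-frame mean-color deviation from the first frame.

    frames: array of shape (T, H, W, 3) with values in [0, 1].
    Returns an array of shape (T,) where larger values indicate
    stronger global color shift relative to frame 0.
    """
    means = frames.reshape(frames.shape[0], -1, 3).mean(axis=1)  # (T, 3)
    return np.linalg.norm(means - means[0], axis=1)

# Synthetic clip: a gray video whose red channel slowly brightens,
# mimicking the cumulative color drift seen in long-horizon generation.
T, H, W = 8, 16, 16
frames = np.full((T, H, W, 3), 0.5)
for t in range(T):
    frames[t, ..., 0] += 0.01 * t  # red channel drifts upward

drift = color_drift(frames)
print(drift[0], drift[-1])  # drift grows from 0 toward ~0.07
```

A monotonically growing curve like this one is a cheap diagnostic for the long-horizon color instability described above.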

2. 3D Generation Results

The 3D generation pipeline supports scene reconstruction with movement controls and camera viewpoint adjustments:

  • VGGT and InfiniteVGGT: Can generate 3D scenes from different views but show geometric inconsistency and texture blurring in complex areas when the camera moves significantly
  • FlashWorld: Provides faster generation, but balancing geometric stability with sharp detail remains a major challenge
  • Despite limitations, 3D generation remains crucial for realistic physical simulation in world models

Figure 5: Demonstration of 3D scene generation results

3. Vision-Language-Action (VLA) Generation Results

The framework evaluates embodied AI through two simulation paradigms:

  • AI2-THOR: Used for embodied video generation with photorealistic scene rendering
  • LIBERO: Used for VLA evaluation with physically grounded manipulation environments

Key models evaluated:

  • π₀ and π₀.₅: Leverage PaliGemma vision-language backbone with mixture-of-experts (MoE) action heads for robust multi-task generalization
  • LingBot-VA: Approaches tasks from a generative perspective using video diffusion architecture to jointly model visual future predictions and continuous action synthesis

Figure 6: Demonstration of simulator generation results from LIBERO and AI2-THOR environments showing manipulation tasks

4. Multimodal Reasoning Capabilities

The Reasoning module demonstrates:

  • Spatial reasoning: Geometry-centric queries, object relations, and step-by-step spatial deductions from visual inputs
  • Omni/general reasoning: Operating over mixed modalities (text, images, audio, video) for broad instruction following
  • Function: Converts internal perception and memory into grounded decisions, explanations, and plans that guide downstream generation or control
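The perception-to-decision flow described above (observations accumulate in memory, then reasoning grounds answers in that context) can be sketched as a toy loop. The class and function names below are illustrative assumptions, not the OpenWorldLib interface:

```python
from collections import deque

class LongTermMemory:
    """Toy ring-buffer memory: stores observations, retrieves recent context."""
    def __init__(self, capacity: int = 100):
        self.buffer = deque(maxlen=capacity)

    def write(self, obs: str):
        self.buffer.append(obs)

    def read(self, k: int = 3):
        # Return the k most recent observations as reasoning context.
        return list(self.buffer)[-k:]

def reason(query: str, context: list) -> str:
    # Stand-in for a multimodal reasoner: a real system would call a
    # vision-language model here; we just ground the answer in memory.
    return f"answer({query}) given {len(context)} remembered observations"

memory = LongTermMemory()
for obs in ["saw a red cup on the table", "moved to the kitchen", "cup now occluded"]:
    memory.write(obs)

decision = reason("where is the cup?", memory.read())
print(decision)  # -> answer(where is the cup?) given 3 remembered observations
```

The bounded buffer stands in for long-term memory; in a real world model the retrieved context would condition generation or control rather than a string template.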

Framework Architecture Overview

OpenWorldLib unifies these capabilities through modular components:

Figure 1: Overview of OpenWorldLib framework encompassing perception, understanding, memory, and generation

Figure 2: Detailed illustration of the OpenWorldLib framework showing Operator, Synthesis, Reasoning, Representation, Memory modules and Pipeline
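One common way to realize the modular design shown in Figure 2 is a simple stage-composition pattern. The sketch below borrows the module names from the figure, but the interfaces are illustrative assumptions, not the actual OpenWorldLib code.

```python
class Pipeline:
    """Chain named stages; each stage transforms the running state."""
    def __init__(self, *stages):
        self.stages = stages

    def __call__(self, x):
        for name, fn in self.stages:
            x = fn(x)  # each module reads and extends the state
        return x

# Hypothetical stage ordering inspired by the figure's module names.
pipeline = Pipeline(
    ("representation", lambda obs: {"features": obs}),    # encode input
    ("memory",         lambda s: {**s, "context": []}),   # attach history
    ("reasoning",      lambda s: {**s, "plan": "move"}),  # decide
    ("synthesis",      lambda s: {**s, "frames": ["f0"]}),# generate output
)

out = pipeline("rgb observation")
print(sorted(out))  # -> ['context', 'features', 'frames', 'plan']
```

Composition like this lets individual modules be swapped (e.g. a different Synthesis backend) without touching the rest of the pipeline, which is the kind of collaborative inference the framework describes.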

Key Insight: The paper establishes that while current world models excel at next-frame prediction, significant challenges remain in maintaining physical consistency during long-horizon interactions and balancing generation speed with quality across video, 3D, and embodied action tasks.

