---
license: apache-2.0
pipeline_tag: image-to-video
---
# MultiWorld: Scalable Multi-Agent Multi-View Video World Models

MultiWorld is a unified framework for multi-agent multi-view world modeling that enables accurate control of multiple agents while maintaining multi-view consistency. The framework is formulated as an action-conditioned video generation model: it takes historical frames and current actions as input and predicts future frames.
- Paper: MultiWorld: Scalable Multi-Agent Multi-View Video World Models
- Project Page: https://multi-world.github.io/
- GitHub Repository: https://github.com/CIntellifusion/MultiWorld
## Overview
MultiWorld introduces two key components:
- Multi-Agent Condition Module: Employs Agent Identity Embedding and Adaptive Action Weighting to achieve precise multi-agent controllability.
- Global State Encoder: Uses a frozen VGGT backbone to extract implicit 3D global environmental information, ensuring multi-view consistency.
The model scales effectively across varying agent counts and camera views, supporting autoregressive inference to generate video sequences beyond the training context length.
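The autoregressive rollout described above can be sketched as a sliding-window loop: the model repeatedly conditions on the most recent frames and the current actions, appends the predicted frames, and slides the context window forward. The sketch below is illustrative only; `predict_next_frames`, `CONTEXT_LEN`, and `CHUNK` are placeholders, not the MultiWorld API (see the GitHub repository for the real interface).

```python
# Minimal sketch of autoregressive inference beyond the training context length.
# All names and values here are illustrative placeholders, not the real API.
from collections import deque

CONTEXT_LEN = 4  # frames the model conditions on (illustrative value)
CHUNK = 2        # frames predicted per forward pass (illustrative value)

def predict_next_frames(history, actions):
    """Placeholder for the world model: returns CHUNK dummy frames
    'conditioned' on the history and the current actions."""
    return [f"pred({history[-1]},{actions},{i})" for i in range(CHUNK)]

def rollout(init_frames, action_stream, total_frames):
    frames = list(init_frames)
    # deque with maxlen acts as the sliding context window:
    # appending new frames automatically evicts the oldest ones.
    history = deque(init_frames, maxlen=CONTEXT_LEN)
    while len(frames) < total_frames:
        actions = next(action_stream)          # current multi-agent actions
        new = predict_next_frames(list(history), actions)
        frames.extend(new)
        history.extend(new)                    # slide the window forward
    return frames[:total_frames]

# Generate 10 frames from 2 seed frames and a constant "noop" action stream.
video = rollout(["f0", "f1"], iter(lambda: "noop", None), total_frames=10)
print(len(video))
```

Because the window is bounded, the loop can run for arbitrarily many steps, which is how sequences longer than the training context are produced.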
## Setup and Usage

### Environment Setup
```bash
conda create -n multiworld python=3.13
conda activate multiworld

# install PyTorch (CUDA 12.8 build)
pip install torch==2.7.1 torchvision==0.22.1 torchaudio==2.7.1 \
    --index-url https://download.pytorch.org/whl/cu128

pip install -r requirements.txt
```
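A quick way to confirm the environment resolved correctly is to check that the core packages are importable before launching a distributed run (this check is a generic sanity sketch, not part of the MultiWorld repository):

```python
# Sanity check: verify the packages installed above are importable
# from the active conda environment.
import importlib.util

for pkg in ("torch", "torchvision", "torchaudio"):
    found = importlib.util.find_spec(pkg) is not None
    print(f"{pkg}: {'found' if found else 'MISSING'}")
```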
### Inference Example
To run inference on the "It Takes Two" game dataset:
```bash
python -m torch.distributed.run --nproc_per_node=8 \
    ittakestwo/parallel_inference.py \
    --inference-seed 0 \
    --num-inference-steps 50 \
    --config-path ittakestwo/configs/inference_480P_full.yaml \
    --model-path <path_to_model_checkpoint> \
    --output-dir outputs/eval_480P_full
```
For robotics tasks:
```bash
python -m torch.distributed.run --nproc_per_node=8 \
    robots/parallel_inference.py \
    --config-path robots/configs/inference.yaml \
    --model-path <path_to_model_checkpoint> \
    --output-dir outputs/test_robotics_output
```
## Citation
```bibtex
@article{wu2025multiworld,
  title={MultiWorld: Scalable Multi-Agent Multi-View Video World Models},
  author={Wu, Haoyu and Yu, Jiwen and Zou, Yingtian and Liu, Xihui},
  journal={arXiv preprint arXiv:2604.18564},
  year={2026}
}
```