---
license: apache-2.0
pipeline_tag: image-to-video
---

# MultiWorld: Scalable Multi-Agent Multi-View Video World Models

MultiWorld is a unified framework for multi-agent multi-view world modeling that enables precise control of multiple agents while maintaining multi-view consistency. It is formulated as an action-conditioned video generation model that takes historical frames and current actions as input and predicts future frames.

- **Paper:** [MultiWorld: Scalable Multi-Agent Multi-View Video World Models](https://huggingface.co/papers/2604.18564)
- **Project Page:** [https://multi-world.github.io/](https://multi-world.github.io/)
- **GitHub Repository:** [https://github.com/CIntellifusion/MultiWorld](https://github.com/CIntellifusion/MultiWorld)

## Overview

MultiWorld introduces two key components:
1. **Multi-Agent Condition Module**: Employs Agent Identity Embedding and Adaptive Action Weighting to achieve precise multi-agent controllability.
2. **Global State Encoder**: Uses a frozen VGGT backbone to extract implicit 3D global environmental information, ensuring multi-view consistency.
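
To make the Multi-Agent Condition Module concrete, here is a rough, framework-agnostic sketch of how identity embeddings and adaptive action weighting could combine per-agent action features. All names (`fuse_agent_actions`, the embedding table, the score vector) are hypothetical illustrations, not from the released code:

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_agent_actions(agent_ids, action_feats, id_embeddings, score_vec):
    """Sketch: add each agent's identity embedding to its action features,
    then combine agents via softmax weights (adaptive action weighting)."""
    # agent identity embedding: keeps each agent's actions distinguishable
    fused = [[a + e for a, e in zip(feat, id_embeddings[i])]
             for i, feat in zip(agent_ids, action_feats)]
    # adaptive action weighting: score each agent, normalize with softmax
    scores = [sum(w * f for w, f in zip(score_vec, feat)) for feat in fused]
    weights = softmax(scores)
    # weighted sum over agents yields a single conditioning vector
    dim = len(fused[0])
    return [sum(weights[k] * fused[k][d] for k in range(len(fused)))
            for d in range(dim)]

# two agents, 3-dim action features, toy identity embeddings
emb = {0: [0.1, 0.0, 0.0], 1: [0.0, 0.1, 0.0]}
cond = fuse_agent_actions([0, 1], [[1, 0, 0], [0, 1, 0]], emb, [1.0, 1.0, 1.0])
```

In the paper's actual module the embeddings and weights are learned; this sketch only shows the shape of the computation.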

The model scales effectively across varying agent counts and camera views, supporting autoregressive inference to generate video sequences beyond the training context length.
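
The autoregressive rollout above boils down to a sliding context window: each generated frame is appended to the history that conditions the next prediction. A minimal sketch with a toy predictor (frame = one number; the real model predicts video latents):

```python
from collections import deque

def rollout(predict, init_frames, actions, context_len=4):
    """Autoregressive rollout: keep a sliding window of past frames and
    feed it, with the current action, to the next-frame predictor."""
    ctx = deque(init_frames, maxlen=context_len)
    out = []
    for a in actions:
        nxt = predict(list(ctx), a)
        out.append(nxt)
        ctx.append(nxt)  # generated frame becomes context for the next step
    return out

# toy predictor: next "frame" = mean of context + action
toy = lambda ctx, a: sum(ctx) / len(ctx) + a
frames = rollout(toy, [0.0, 1.0], [1, 1, 1], context_len=2)
# frames == [1.5, 2.25, 2.875]
```

Because the window has fixed length, the sequence can be extended indefinitely past the training context length, at the cost of gradually accumulating prediction error.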

## Setup and Usage

### Environment Setup

```bash
conda create -n multiworld python=3.13
conda activate multiworld
# install torch
pip install torch==2.7.1 torchvision==0.22.1 torchaudio==2.7.1 \
    --index-url https://download.pytorch.org/whl/cu128

pip install -r requirements.txt
```
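
A quick optional self-check that the core dependencies resolved in the new environment (a generic snippet, not part of the repository):

```python
import importlib.util

def check_packages(names):
    """Return {package: bool} for whether each top-level package is importable."""
    return {n: importlib.util.find_spec(n) is not None for n in names}

# report install status of the packages pinned above
report = check_packages(["torch", "torchvision", "torchaudio"])
for name, ok in report.items():
    print(f"{name}: {'installed' if ok else 'missing'}")
```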

### Inference Example

To run inference on the "It Takes Two" game dataset:

```bash
python -m torch.distributed.run --nproc_per_node=8 \
    ittakestwo/parallel_inference.py \
    --inference-seed 0 \
    --num-inference-steps 50 \
    --config-path ittakestwo/configs/inference_480P_full.yaml \
    --model-path <path_to_model_checkpoint> \
    --output-dir outputs/eval_480P_full
```
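
The flags above map naturally onto a standard `argparse` interface. A minimal sketch of what the script's CLI might look like (flag names taken from the command above; defaults and `required` choices are assumptions, and the actual `parallel_inference.py` may differ):

```python
import argparse

def build_parser():
    """CLI mirroring the inference command's flags (defaults assumed)."""
    p = argparse.ArgumentParser(description="MultiWorld parallel inference")
    p.add_argument("--inference-seed", type=int, default=0)
    p.add_argument("--num-inference-steps", type=int, default=50)
    p.add_argument("--config-path", required=True)
    p.add_argument("--model-path", required=True)
    p.add_argument("--output-dir", default="outputs")
    return p

# argparse converts dashed flags to underscored attributes
args = build_parser().parse_args([
    "--config-path", "ittakestwo/configs/inference_480P_full.yaml",
    "--model-path", "checkpoint.safetensors",
])
print(args.num_inference_steps)  # 50 (the default)
```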

For robotics tasks:

```bash
python -m torch.distributed.run --nproc_per_node=8 \
    robots/parallel_inference.py \
    --config-path robots/configs/inference.yaml \
    --model-path <path_to_model_checkpoint> \
    --output-dir outputs/test_robotics_output
```

## Citation

```bibtex
@article{wu2025multiworld,
  title={MultiWorld: Scalable Multi-Agent Multi-View Video World Models},
  author={Wu, Haoyu and Yu, Jiwen and Zou, Yingtian and Liu, Xihui},
  journal={arXiv preprint arXiv:2604.18564},
  year={2026}
}
```