---
license: apache-2.0
pipeline_tag: image-to-video
---

# MultiWorld: Scalable Multi-Agent Multi-View Video World Models

MultiWorld is a unified framework for multi-agent, multi-view world modeling that enables accurate control of multiple agents while keeping the generated views consistent with one another. It is formulated as an action-conditioned video generation model: given historical frames and the agents' current actions, it predicts future frames.

- **Paper:** [MultiWorld: Scalable Multi-Agent Multi-View Video World Models](https://huggingface.co/papers/2604.18564)
- **Project Page:** [https://multi-world.github.io/](https://multi-world.github.io/)
- **GitHub Repository:** [https://github.com/CIntellifusion/MultiWorld](https://github.com/CIntellifusion/MultiWorld)

## Overview

MultiWorld introduces two key components:
1. **Multi-Agent Condition Module**: Employs Agent Identity Embedding and Adaptive Action Weighting to achieve precise multi-agent controllability.
2. **Global State Encoder**: Uses a frozen VGGT backbone to extract implicit 3D global environmental information, ensuring multi-view consistency.

The model scales across varying numbers of agents and camera views, and it supports autoregressive inference, which lets it generate video sequences longer than the training context length.
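
The autoregressive inference mentioned above can be illustrated with a minimal sketch. It is not the repository's implementation: `predict_next`, `toy_model`, and the list-based `Frame` type are hypothetical stand-ins, and the sketch assumes one frame is generated per action. The idea it shows is that each predicted frame is appended to a fixed-size context window, so the rollout can continue past the number of frames the model was trained on.

```python
from collections import deque
from typing import Callable, List, Sequence

Frame = List[float]   # hypothetical stand-in for an image tensor
Action = float        # hypothetical stand-in for a per-step action vector


def autoregressive_rollout(
    predict_next: Callable[[Sequence[Frame], Action], Frame],
    history: Sequence[Frame],
    actions: Sequence[Action],
    context_len: int,
) -> List[Frame]:
    """Generate one frame per action, feeding each prediction back as context."""
    if context_len < 1:
        raise ValueError("context_len must be at least 1")
    if not history:
        raise ValueError("history must contain at least one frame")
    # A deque with maxlen drops the oldest frame once the window is full,
    # so the model never sees more than context_len frames at a time.
    window = deque(history, maxlen=context_len)
    generated: List[Frame] = []
    for action in actions:
        frame = predict_next(list(window), action)
        generated.append(frame)
        window.append(frame)
    return generated


def toy_model(context: Sequence[Frame], action: Action) -> Frame:
    # Hypothetical model: shifts every value of the latest frame by the action.
    return [value + action for value in context[-1]]


frames = autoregressive_rollout(toy_model, [[0.0, 0.0]], [1.0] * 5, context_len=2)
print(len(frames), frames[-1])  # prints: 5 [5.0, 5.0]
```

Here five frames are produced even though the window only ever holds two, which is the sense in which the rollout exceeds the context length.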

## Setup and Usage

### Environment Setup

```bash
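# Create and activate an isolated environment
conda create -n multiworld python=3.13
conda activate multiworld

# Install PyTorch 2.7.1 built against CUDA 12.8
pip install torch==2.7.1 torchvision==0.22.1 torchaudio==2.7.1 \
    --index-url https://download.pytorch.org/whl/cu128

# Install the remaining dependencies (run from the root of the cloned repository)
pip install -r requirements.txt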
```
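
After installation, a quick check that the pinned versions are the ones actually in the environment can save debugging time. The sketch below is an optional helper, not part of the repository; the function names are hypothetical. It uses only the standard library, so it runs even if PyTorch is missing, and it ignores local version suffixes such as `+cu128` when comparing.

```python
from importlib import metadata
from typing import Dict, Optional, Tuple

# Versions pinned in the install command above.
PINNED = {"torch": "2.7.1", "torchvision": "0.22.1", "torchaudio": "2.7.1"}


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def matches_pin(installed: Optional[str], pinned: str) -> bool:
    """Compare versions, ignoring a local suffix such as '+cu128'."""
    if installed is None:
        return False
    return installed.split("+", 1)[0] == pinned


def check_pins(pins: Dict[str, str]) -> Dict[str, Tuple[Optional[str], bool]]:
    """Map each package name to (installed version, whether it matches the pin)."""
    report = {}
    for name, pinned in pins.items():
        found = installed_version(name)
        report[name] = (found, matches_pin(found, pinned))
    return report


for name, (found, ok) in check_pins(PINNED).items():
    status = "ok" if ok else "MISMATCH"
    print(f"{name}: installed={found} pinned={PINNED[name]} [{status}]")
```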

### Inference Example

To run inference on the "It Takes Two" game dataset:

```bash
python -m torch.distributed.run --nproc_per_node=8 \
    ittakestwo/parallel_inference.py \
    --inference-seed 0 \
    --num-inference-steps 50 \
    --config-path ittakestwo/configs/inference_480P_full.yaml \
    --model-path <path_to_model_checkpoint> \
    --output-dir outputs/eval_480P_full 
```

To run inference on robotics tasks:

```bash
python -m torch.distributed.run --nproc_per_node=8 \
    robots/parallel_inference.py \
    --config-path robots/configs/inference.yaml \
    --model-path <path_to_model_checkpoint> \
    --output-dir outputs/test_robotics_output 
```
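
Both commands start eight processes through `--nproc_per_node=8`. `torch.distributed.run` gives each process the `RANK` and `WORLD_SIZE` environment variables, which a script can use to split work. The sketch below shows one common pattern, strided sharding of evaluation clips; whether `parallel_inference.py` divides its work this way is an assumption, and `shard_for_rank` and the clip names are hypothetical.

```python
import os
from typing import List, Sequence


def shard_for_rank(items: Sequence[str], rank: int, world_size: int) -> List[str]:
    """Return the strided slice of items that a given process should handle."""
    if world_size < 1 or not 0 <= rank < world_size:
        raise ValueError("rank must satisfy 0 <= rank < world_size")
    # Process r takes items r, r + world_size, r + 2 * world_size, ...
    return list(items[rank::world_size])


# torch.distributed.run sets these; the defaults cover a plain single-process run.
rank = int(os.environ.get("RANK", "0"))
world_size = int(os.environ.get("WORLD_SIZE", "1"))

clips = [f"clip_{i:03d}" for i in range(20)]  # hypothetical evaluation clips
my_clips = shard_for_rank(clips, rank, world_size)
print(f"rank {rank}/{world_size} handles {len(my_clips)} clips")
```

With fewer GPUs, lowering `--nproc_per_node` changes `WORLD_SIZE`, and each process simply receives a larger shard.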

## Citation

```bibtex
@article{wu2025multiworld,
  title={MultiWorld: Scalable Multi-Agent Multi-View Video World Models},
  author={Wu, Haoyu and Yu, Jiwen and Zou, Yingtian and Liu, Xihui},
  journal={arXiv preprint arXiv:2604.18564},
  year={2026}
}
```