---
license: apache-2.0
pipeline_tag: image-to-video
---

# MultiWorld: Scalable Multi-Agent Multi-View Video World Models

MultiWorld is a unified framework for multi-agent, multi-view world modeling that enables accurate control of multiple agents while maintaining consistency across camera views. It is formulated as an action-conditioned video generation model: given historical frames and the current actions, it predicts future frames.
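
The interaction pattern can be sketched as below. This is a minimal, hypothetical illustration of the interface, not the repository's actual API: the class, names, and shapes are invented, and a toy recurrent backbone stands in for the real video generator.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of an action-conditioned world model interface.
# Names, shapes, and the toy GRU backbone are illustrative only.
class ActionConditionedWorldModel(nn.Module):
    def __init__(self, frame_dim=256, action_dim=16):
        super().__init__()
        self.action_proj = nn.Linear(action_dim, frame_dim)
        self.backbone = nn.GRU(frame_dim, frame_dim, batch_first=True)
        self.head = nn.Linear(frame_dim, frame_dim)

    def forward(self, history, actions):
        # history: (B, T, frame_dim) latent tokens of past frames
        # actions: (B, n_agents, action_dim) current per-agent actions
        cond = self.action_proj(actions).mean(dim=1, keepdim=True)  # fuse agents
        hidden, _ = self.backbone(history + cond)                   # condition each step
        return self.head(hidden[:, -1:])                            # next-frame latent
```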

## Overview

MultiWorld introduces two key components:

1. **Multi-Agent Condition Module**: employs Agent Identity Embedding and Adaptive Action Weighting to achieve precise multi-agent controllability (a rough sketch follows this list).
2. **Global State Encoder**: uses a frozen VGGT backbone to extract implicit 3D global environmental information, ensuring multi-view consistency.
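
As a rough illustration of the first component, the sketch below tags each agent's action with a learned identity embedding and adaptively weights the per-agent features before fusing them into one conditioning signal. All names and shapes are hypothetical; the paper's actual module may differ.

```python
import torch
import torch.nn as nn

# Hypothetical sketch: per-agent identity embeddings plus a learned gate that
# adaptively weights each agent's action features before fusion.
class MultiAgentCondition(nn.Module):
    def __init__(self, n_agents: int, action_dim: int, cond_dim: int):
        super().__init__()
        self.agent_id = nn.Embedding(n_agents, cond_dim)  # Agent Identity Embedding
        self.action_proj = nn.Linear(action_dim, cond_dim)
        self.gate = nn.Linear(cond_dim, 1)                # Adaptive Action Weighting

    def forward(self, actions: torch.Tensor) -> torch.Tensor:
        # actions: (B, n_agents, action_dim)
        ids = self.agent_id.weight.unsqueeze(0)           # (1, n_agents, cond_dim)
        feats = self.action_proj(actions) + ids           # identity-tagged features
        weights = torch.softmax(self.gate(feats), dim=1)  # per-agent weights
        return (weights * feats).sum(dim=1)               # (B, cond_dim) fused signal
```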

The model scales effectively across varying agent counts and camera views, supporting autoregressive inference to generate video sequences beyond the training context length.
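
Autoregressive generation can be pictured as the loop below: predict a frame, append it to the history, and slide the context window forward. This is a schematic under assumed names (a hypothetical `predict_next` method and a fixed `context_len`), not the repository's inference code.

```python
import torch

# Schematic autoregressive rollout beyond the training context length.
# `predict_next` and `context_len` are assumed names, not the real API.
@torch.no_grad()
def rollout(model, seed_frames, actions_per_step, context_len=16):
    frames = list(seed_frames)                               # seed history
    for actions in actions_per_step:
        context = torch.stack(frames[-context_len:], dim=1)  # sliding window
        frames.append(model.predict_next(context, actions))  # feed prediction back
    return torch.stack(frames, dim=1)                        # (B, T_total, ...)
```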

## Setup and Usage

### Environment Setup

```bash
conda create -n multiworld python=3.13
conda activate multiworld

# install PyTorch with CUDA 12.8 wheels
pip install torch==2.7.1 torchvision==0.22.1 torchaudio==2.7.1 \
    --index-url https://download.pytorch.org/whl/cu128

pip install -r requirements.txt
```
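
Before launching multi-GPU inference, a quick sanity check (shown below; not part of the official setup) confirms that the CUDA build of PyTorch is active and that the visible GPU count matches the `--nproc_per_node` value used in the commands that follow.

```python
import torch

print(torch.__version__)          # expect 2.7.1+cu128
print(torch.cuda.is_available())  # expect True on a GPU machine
print(torch.cuda.device_count())  # should match --nproc_per_node
```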

### Inference Example

To run inference on the "It Takes Two" game dataset (the launcher starts one process per GPU; adjust `--nproc_per_node` to match your hardware):

```bash
python -m torch.distributed.run --nproc_per_node=8 \
    ittakestwo/parallel_inference.py \
    --inference-seed 0 \
    --num-inference-steps 50 \
    --config-path ittakestwo/configs/inference_480P_full.yaml \
    --model-path <path_to_model_checkpoint> \
    --output-dir outputs/eval_480P_full
```

For robotics tasks:

```bash
python -m torch.distributed.run --nproc_per_node=8 \
    robots/parallel_inference.py \
    --config-path robots/configs/inference.yaml \
    --model-path <path_to_model_checkpoint> \
    --output-dir outputs/test_robotics_output
```

## Citation

```bibtex
@article{wu2025multiworld,
  title={MultiWorld: Scalable Multi-Agent Multi-View Video World Models},
  author={Wu, Haoyu and Yu, Jiwen and Zou, Yingtian and Liu, Xihui},
  journal={arXiv preprint arXiv:2604.18564},
  year={2026}
}
```