---
pipeline_tag: image-to-video
language:
- en
base_model:
- Wan-AI/Wan2.1-I2V-14B-480P
---

# LongVie 2: Multimodal Controllable Ultra-Long Video World Model

LongVie 2 is a multimodal controllable world model for generating ultra-long videos with depth and pointmap control signals, as presented in the paper [LongVie 2: Multimodal Controllable Ultra-Long Video World Model](https://huggingface.co/papers/2512.13604). It is an end-to-end autoregressive framework trained to enhance controllability, long-term visual quality, and temporal consistency.

- 📝 [Paper on Hugging Face](https://huggingface.co/papers/2512.13604)
- 🌐 [Project Page](https://vchitect.github.io/LongVie2-project/)
- 💻 [GitHub Repository](https://github.com/Vchitect/LongVie)


## 🚀 Quick Start

### Installation
To get started with LongVie 2, follow the installation steps from the GitHub repository:

```bash
conda create -n longvie python=3.10 -y
conda activate longvie
conda install psutil
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121
# ninja speeds up the FlashAttention source build
python -m pip install ninja
python -m pip install git+https://github.com/Dao-AILab/flash-attention.git@v2.7.2.post1
# Clone the repository first if you have not already done so
git clone https://github.com/Vchitect/LongVie.git
cd LongVie
pip install -e .
```

### Download Weights
1. Download the base model `Wan2.1-I2V-14B-480P`:
```bash
python download_wan2.1.py
```

2. Download the [LongVie2 weights](https://huggingface.co/Vchitect/LongVie2) and place them in `./model/LongVie/`.
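
One possible way to script step 2 is sketched below. It assumes the `huggingface-cli` tool (shipped with the `huggingface_hub` package) is installed and on your `PATH`. The helper names `weights_present` and `download_longvie2` and the `WEIGHTS_DIR` variable are hypothetical, introduced only for this example. They are not part of the LongVie repository.

```bash
#!/usr/bin/env bash
# Sketch only: fetch the LongVie2 weights into ./model/LongVie/.
# Assumption: huggingface-cli (from huggingface_hub) is installed.
# The function and variable names below are illustrative, not from the repo.

WEIGHTS_DIR="${WEIGHTS_DIR:-./model/LongVie}"

# Succeeds (exit code 0) if the directory exists and is not empty.
weights_present() {
  local dir="$1"
  [ -d "$dir" ] && [ -n "$(ls -A "$dir" 2>/dev/null)" ]
}

# Downloads a snapshot of the Vchitect/LongVie2 repo from the Hugging Face Hub.
download_longvie2() {
  mkdir -p "$WEIGHTS_DIR"
  huggingface-cli download Vchitect/LongVie2 --local-dir "$WEIGHTS_DIR"
}
```

After sourcing the snippet, running `weights_present "$WEIGHTS_DIR" || download_longvie2` downloads the weights only when the target directory is missing or empty. The check does not verify that every expected file is there, so a partially downloaded directory still passes.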

### Inference
Generate a 5-second video clip (roughly 8-9 minutes on a single A100 GPU) with the following command:
```bash
bash sample_longvideo.sh
```

## 📄 Citation

If you find this work useful, please consider citing:
```bibtex
@misc{gao2025longvie2,
  title={LongVie 2: Multimodal Controllable Ultra-Long Video World Model}, 
  author={Jianxiong Gao and Zhaoxi Chen and Xian Liu and Junhao Zhuang and Chengming Xu and Jianfeng Feng and Yu Qiao and Yanwei Fu and Chenyang Si and Ziwei Liu},
  year={2025},
  eprint={2512.13604},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2512.13604}, 
}
```