---
base_model:
- Qwen/Qwen3-VL-8B-Instruct
- Wan-AI/Wan2.2-TI2V-5B
language:
- en
tags:
- video-generation
- video-editing
- multi-modal
- diffusion
pipeline_tag: text-to-video
---

<p align="center">
  <b style="font-size:1.8em;">LoomVideo: Unifying Multimodal Inputs into<br>Video Generation and Editing</b>
</p>

<p align="center">
  <b>Peking University &middot; Alibaba Group</b>
</p>

<p align="center">
  <a href="https://arxiv.org/abs/2606.06042" target="_blank"><img src="https://img.shields.io/badge/Paper-b5212f.svg?logo=arxiv" height="22px"></a>
  <a href="https://github.com/MSALab-PKU/LoomVideo" target="_blank"><img src="https://img.shields.io/badge/GitHub-bb8a2e.svg?logo=github" height="22px"></a>
  <a href="https://msalab-pku.github.io/projects/LoomVideo/index.html" target="_blank"><img src="https://img.shields.io/badge/Project%20Page-333399.svg?logo=homepage" height="22px"></a>
</p>

This repository contains the weights for **LoomVideo**, a compact 5B-parameter unified architecture for both video generation and editing. For more details, see the paper: [LoomVideo: Unifying Multimodal Inputs into Video Generation and Editing](https://arxiv.org/abs/2606.06042).

# 🔥 News

- [2026-06-05] We release the LoomVideo [paper](https://arxiv.org/abs/2606.06042) on arXiv!
- [2026-06-02] We release the [codebase](https://github.com/MSALab-PKU/LoomVideo) and [model weights](https://huggingface.co/MSALab/LoomVideo) of LoomVideo!
- [2026-06-02] We release the [project page](https://msalab-pku.github.io/projects/LoomVideo/index.html) of LoomVideo!

# 📌 TL;DR

LoomVideo is a compact **5B-parameter** unified architecture built on MLLM + DiT that introduces three key designs:
- **Deepstack Injection**: extracts features from every MLLM layer and injects them into corresponding DiT layers via cross-attention.
- **Scale-and-Add Conditioning**: a zero-overhead approach for video editing that eliminates the need for token concatenation.
- **Negative Temporal RoPE**: seamlessly integrates multiple reference images without architectural modification.
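
To make the "no token concatenation" point concrete, here is a minimal sketch comparing the two conditioning styles on toy token lists. The function names and the exact form `noisy + scale * source` are assumptions for illustration only; the paper may apply the scale differently (for example as a learned parameter or at a different point in the network).

```python
from typing import List

Token = List[float]


def concat_condition(noisy: List[Token], source: List[Token]) -> List[Token]:
    """Baseline: append source-video tokens to the noisy tokens (sequence length doubles)."""
    return noisy + source


def scale_and_add_condition(
    noisy: List[Token], source: List[Token], scale: float
) -> List[Token]:
    """Assumed form: add scale * source to each noisy token element-wise (length unchanged)."""
    if len(noisy) != len(source):
        raise ValueError("noisy and source must have the same number of tokens")
    return [
        [n + scale * s for n, s in zip(noisy_tok, source_tok)]
        for noisy_tok, source_tok in zip(noisy, source)
    ]


def attention_pairs(num_tokens: int) -> int:
    """Full self-attention compares every token with every token."""
    return num_tokens * num_tokens


# Toy latents: 2 tokens with 2 channels each (hypothetical values).
noisy = [[0.0, 1.0], [2.0, 3.0]]
source = [[1.0, 1.0], [1.0, 1.0]]

concat = concat_condition(noisy, source)
added = scale_and_add_condition(noisy, source, scale=0.5)

print(len(concat), attention_pairs(len(concat)))  # 4 16
print(len(added), attention_pairs(len(added)))    # 2 4
```

In this toy setting, concatenation doubles the sequence length and quadruples the number of attention pairs, while the additive form leaves both unchanged.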

Our 5B model achieves state-of-the-art performance across benchmarks, with at least **5.41×** inference speedup over models of similar capabilities.

<p align="center">
  <img src="assets/architecture.png" width="90%">
</p>
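
As a rough illustration of the Negative Temporal RoPE idea, the sketch below places reference images at negative temporal indices and video frames at non-negative ones, then checks that a rotary embedding depends only on the relative offset. The index layout (`-num_refs .. -1` for references) and all names here are assumptions, not the released implementation.

```python
import math
from typing import List, Tuple

Pair = Tuple[float, float]


def temporal_positions(num_refs: int, num_frames: int) -> List[int]:
    """Assumed layout: references at -num_refs..-1, video frames at 0..num_frames-1."""
    return list(range(-num_refs, num_frames))


def rope_rotate(vec: Pair, pos: int, freq: float) -> Pair:
    """Standard RoPE on one 2-D feature pair: rotate by the angle pos * freq."""
    angle = pos * freq
    c, s = math.cos(angle), math.sin(angle)
    x, y = vec
    return (x * c - y * s, x * s + y * c)


def dot(a: Pair, b: Pair) -> float:
    return a[0] * b[0] + a[1] * b[1]


positions = temporal_positions(num_refs=2, num_frames=4)
print(positions)  # [-2, -1, 0, 1, 2, 3]

# Hypothetical query/key pair and frequency.
q, k, freq = (1.0, 0.0), (0.5, 0.5), 0.1

# Frame 1 attending to reference -2 and frame 3 attending to frame 0
# share the same relative offset, so their scores match.
score_with_ref = dot(rope_rotate(q, 1, freq), rope_rotate(k, -2, freq))
score_frames = dot(rope_rotate(q, 3, freq), rope_rotate(k, 0, freq))
print(abs(score_with_ref - score_frames) < 1e-12)
```

Because the rotation is defined for any integer position, negative indices need no new parameters or layers, which is consistent with the "without architectural modification" claim above.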

# 🎯 Supported Tasks

LoomVideo supports **four** unified video generation and editing tasks within a single model:

| Task | Input | Output | Description |
|:-----|:------|:-------|:------------|
| **Text-to-Video** | Text 📝 | Video 🎬 | Generate a video from a text prompt |
| **Instruction Editing** | Video 🎬 + Text 📝 | Video 🎬 | Edit a video following text instructions |
| **Instruction-Image Editing** | Video 🎬 + Image 🖼 + Text 📝 | Video 🎬 | Edit a video with a reference image as guidance |
| **Multi-Image-to-Video** | Images 🖼 + Text 📝 | Video 🎬 | Compose multiple reference images into a coherent video |

# 🔧 Preparation

### 1. Clone the Repository

```bash
git clone https://github.com/MSALab-PKU/LoomVideo
cd LoomVideo
```

### 2. Install Dependencies

```bash
uv sync
source .venv/bin/activate
pip install flash-attn --no-build-isolation
```
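
The inference command in the next section reads weights from `checkpoints/LoomVideo`. One possible way to populate that folder is sketched below using `huggingface_hub.snapshot_download`. The helper name `download_weights` is hypothetical, and it is an assumption that the Hub repository layout matches what `--ckpt_path` expects; check the GitHub README for the authors' recommended procedure.

```python
from pathlib import Path

REPO_ID = "MSALab/LoomVideo"               # from the model-weights link in the News section
LOCAL_DIR = Path("checkpoints/LoomVideo")  # matches --ckpt_path in the inference example


def download_weights(repo_id: str = REPO_ID, local_dir: Path = LOCAL_DIR) -> str:
    """Download a snapshot of the model repo and return the local folder path."""
    # Imported lazily so this file still loads where huggingface_hub is not installed.
    from huggingface_hub import snapshot_download

    local_dir.mkdir(parents=True, exist_ok=True)
    return snapshot_download(repo_id=repo_id, local_dir=str(local_dir))


# Usage (needs network access and `pip install huggingface_hub`):
#     download_weights()
```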

# 🎬 Inference

LoomVideo provides a unified inference script. Below is an example for **Text-to-Video** generation. For other tasks (editing, reference-guided editing), please refer to the [GitHub README](https://github.com/MSALab-PKU/LoomVideo).

```bash
NUM_GPUS=1

accelerate launch --num_processes=${NUM_GPUS} \
    scripts/inference/generate.py \
    --config_path configs/inference/generation.yaml \
    --ckpt_path checkpoints/LoomVideo \
    --task t2v \
    --prompt "Vampire makeup face of beautiful girl, red contact lenses." \
    --height 480 \
    --width 832 \
    --num_frames 97 \
    --num_inference_steps 50 \
    --seed 0 \
    --output_path outputs/t2v_demo.mp4
```
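
To generate several clips in a row, the command above can be assembled programmatically. The sketch below only reuses the flags shown in the example; the helper name `build_t2v_command`, the output naming scheme, and the second prompt are hypothetical. It prints the commands as a dry run rather than executing them.

```python
import shlex
from typing import List


def build_t2v_command(
    prompt: str, output_path: str, seed: int = 0, num_gpus: int = 1
) -> List[str]:
    """Build the text-to-video command from the example above as an argument list."""
    return [
        "accelerate", "launch", f"--num_processes={num_gpus}",
        "scripts/inference/generate.py",
        "--config_path", "configs/inference/generation.yaml",
        "--ckpt_path", "checkpoints/LoomVideo",
        "--task", "t2v",
        "--prompt", prompt,
        "--height", "480",
        "--width", "832",
        "--num_frames", "97",
        "--num_inference_steps", "50",
        "--seed", str(seed),
        "--output_path", output_path,
    ]


prompts = [
    "Vampire makeup face of beautiful girl, red contact lenses.",
    "A paper boat drifting down a rainy street.",  # hypothetical extra prompt
]

commands = [
    build_t2v_command(p, f"outputs/t2v_{i:03d}.mp4", seed=i)
    for i, p in enumerate(prompts)
]

for cmd in commands:
    # Dry run: print the shell-quoted command instead of launching it.
    print(shlex.join(cmd))
    # To execute from the repository root: subprocess.run(cmd, check=True)
```

Passing the command as a list keeps prompts with spaces or quotes intact without manual shell escaping.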

# 📄 Citation

```bibtex
@article{wu2026loomvideo,
  title={LoomVideo: Unifying Multimodal Inputs into Video Generation and Editing},
  author={Wu, Jianzong and Lian, Hao and Yang, Jiongfan and Hao, Dachao and Tian, Ye and Tong, Yunhai and Zhu, Jingyuan and Chen, Biaolong and Qi, Qiaosong and Zhang, Aixi and He, Wanggui and Liu, Mushui and Huang, Pipei and Jiang, Hao},
  journal={arXiv preprint arXiv:2606.06042},
  year={2026}
}
```