base_model:
  - Qwen/Qwen3-VL-8B-Instruct
  - Wan-AI/Wan2.2-TI2V-5B
language:
  - en
tags:
  - video-generation
  - video-editing
  - multi-modal
  - diffusion
pipeline_tag: text-to-video

LoomVideo: Unifying Multimodal Inputs into Video Generation and Editing

Peking University · Alibaba Group

This repository contains the weights for LoomVideo, a compact 5B-parameter unified architecture for both video generation and editing. For more details, see the paper "LoomVideo: Unifying Multimodal Inputs into Video Generation and Editing".

🔥 News

[2026-06-05] We release the LoomVideo paper on arXiv!
[2026-06-02] We release the codebase and model weights of LoomVideo!
[2026-06-02] We release the project page of LoomVideo!

📌 TL;DR

LoomVideo is a compact 5B-parameter unified architecture that pairs a multimodal LLM (MLLM) with a diffusion transformer (DiT) and introduces three key designs:

Deepstack Injection — extracts features from every MLLM layer and injects them into corresponding DiT layers via cross-attention.
Scale-and-Add Conditioning — a zero-overhead approach for video editing that eliminates the need for token concatenation.
Negative Temporal RoPE — seamlessly integrates multiple reference images without architectural modification.
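
This README does not give the tensor-level details of these designs, so the following is only a toy sketch of one plausible reading of the last two. It assumes that Scale-and-Add means adding a scaled copy of the source-video latent to the noisy latent element-wise (so the token count stays fixed, whereas concatenation would double it), and that Negative Temporal RoPE means giving reference images negative temporal position indices so that they sit before frame 0 of the video. The function names, the `scale` value, and the index layout are all hypothetical; the paper has the actual formulation.

```python
# Toy illustration only: function names, the scale value, and the index
# layout are assumptions, not LoomVideo's actual implementation.

def concat_condition(noisy_tokens, source_tokens):
    """Baseline: append source-video tokens, which doubles the sequence length."""
    return noisy_tokens + source_tokens


def scale_and_add(noisy_tokens, source_tokens, scale=0.5):
    """Assumed scale-and-add: an element-wise sum keeps the sequence length unchanged."""
    if len(noisy_tokens) != len(source_tokens):
        raise ValueError("noisy and source latents must have the same length")
    return [n + scale * s for n, s in zip(noisy_tokens, source_tokens)]


def temporal_positions(num_reference_images, num_frames):
    """Assumed negative temporal indices: references get -K..-1, video frames 0..T-1."""
    reference_ids = list(range(-num_reference_images, 0))
    frame_ids = list(range(num_frames))
    return reference_ids + frame_ids


noisy = [0.1, -0.3, 0.7, 0.2]    # stand-in for flattened noisy latent tokens
source = [1.0, 1.0, -1.0, 0.0]   # stand-in for source-video latent tokens

print(len(concat_condition(noisy, source)))  # 8 tokens
print(len(scale_and_add(noisy, source)))     # 4 tokens
print(temporal_positions(2, 5))              # [-2, -1, 0, 1, 2, 3, 4]
```

Under these assumptions, the edited video keeps the same sequence length as plain text-to-video generation, and any number of reference images can be added without colliding with the non-negative frame indices.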

Our 5B model achieves state-of-the-art performance across benchmarks, with at least a 5.41× inference speedup over models of similar capability.

🎯 Supported Tasks

LoomVideo supports four unified video generation and editing tasks within a single model:

Task	Input	Output	Description
Text-to-Video	Text 📝	Video 🎬	Generate a video from a text prompt
Instruction Editing	Video 🎬 + Text 📝	Video 🎬	Edit a video following text instructions
Instruction-Image Editing	Video 🎬 + Image 🖼 + Text 📝	Video 🎬	Edit a video with a reference image as guidance
Multi-Image-to-Video	Images 🖼 + Text 📝	Video 🎬	Compose multiple reference images into a coherent video
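
As a quick reference, the input requirements in the table above can be written down as a small lookup. This is a hypothetical helper, not part of the LoomVideo codebase: the keys mirror the task names in the table and are not necessarily the values accepted by the inference script's `--task` flag.

```python
# Hypothetical helper: keys mirror the task names in the table above and are
# not necessarily the values accepted by the inference script's --task flag.
REQUIRED_INPUTS = {
    "Text-to-Video": {"text"},
    "Instruction Editing": {"video", "text"},
    "Instruction-Image Editing": {"video", "image", "text"},
    "Multi-Image-to-Video": {"images", "text"},
}


def missing_inputs(task, provided):
    """Return the sorted list of inputs a task still needs."""
    if task not in REQUIRED_INPUTS:
        raise KeyError(f"unknown task: {task}")
    return sorted(REQUIRED_INPUTS[task] - set(provided))


print(missing_inputs("Instruction-Image Editing", ["video", "text"]))  # ['image']
print(missing_inputs("Text-to-Video", ["text"]))                       # []
```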

🔧 Preparation

1. Clone the Repository

git clone https://github.com/MSALab-PKU/LoomVideo
cd LoomVideo

2. Install Dependencies

uv sync
source .venv/bin/activate
pip install flash-attn --no-build-isolation
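
After installation, a short check can confirm that the packages used later are importable from the active environment. This is a minimal sketch that assumes the import names `accelerate` (used by the inference command) and `flash_attn` (the module installed by `flash-attn`); the helper `missing_modules` is illustrative and not part of the repository.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumed import names; extend the list to match the project's dependencies.
missing = missing_modules(["accelerate", "flash_attn"])
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages found.")
```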

🎬 Inference

LoomVideo provides a unified inference script. Below is an example for Text-to-Video generation. For other tasks (editing, reference-guided editing), please refer to the GitHub README.

NUM_GPUS=1

accelerate launch --num_processes=${NUM_GPUS} \
    scripts/inference/generate.py \
    --config_path configs/inference/generation.yaml \
    --ckpt_path checkpoints/LoomVideo \
    --task t2v \
    --prompt "Vampire makeup face of beautiful girl, red contact lenses." \
    --height 480 \
    --width 832 \
    --num_frames 97 \
    --num_inference_steps 50 \
    --seed 0 \
    --output_path outputs/t2v_demo.mp4
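
To generate several videos in a row, the command above can be wrapped in a small driver. The sketch below reuses only the flags shown in the example; the helper names, the output file naming, and the second prompt are illustrative assumptions. It defaults to a dry run that only prints the commands.

```python
import shlex
import subprocess


def build_command(prompt, seed, output_path, num_gpus=1):
    """Assemble the same arguments as the example above as a list of strings."""
    return [
        "accelerate", "launch", f"--num_processes={num_gpus}",
        "scripts/inference/generate.py",
        "--config_path", "configs/inference/generation.yaml",
        "--ckpt_path", "checkpoints/LoomVideo",
        "--task", "t2v",
        "--prompt", prompt,
        "--height", "480",
        "--width", "832",
        "--num_frames", "97",
        "--num_inference_steps", "50",
        "--seed", str(seed),
        "--output_path", output_path,
    ]


def run_batch(prompts, seed=0, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    commands = []
    for index, prompt in enumerate(prompts):
        command = build_command(prompt, seed, f"outputs/t2v_{index:03d}.mp4")
        commands.append(command)
        print(shlex.join(command))
        if not dry_run:
            subprocess.run(command, check=True)
    return commands


# The second prompt is an illustrative placeholder.
commands = run_batch([
    "Vampire makeup face of beautiful girl, red contact lenses.",
    "A paper boat drifting down a rainy street.",
])
```

Passing the arguments as a list avoids shell-quoting problems with prompts that contain commas or quotes. Each call launches a separate process, so the checkpoint is reloaded for every prompt; this trades speed for simplicity.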

📄 Citation

@article{wu2026loomvideo,
  title={LoomVideo: Unifying Multimodal Inputs into Video Generation and Editing},
  author={Wu, Jianzong and Lian, Hao and Yang, Jiongfan and Hao, Dachao and Tian, Ye and Tong, Yunhai and Zhu, Jingyuan and Chen, Biaolong and Qi, Qiaosong and Zhang, Aixi and He, Wanggui and Liu, Mushui and Huang, Pipei and Jiang, Hao},
  journal={arXiv preprint arXiv:2606.06042},
  year={2026}
}