	---
	title: daVinci-MagiHuman
	emoji: 🎬
	colorFrom: blue
	colorTo: purple
	sdk: gradio
	sdk_version: 5.23.0
	app_port: 7860
	---

	<div align="center">

	# daVinci-MagiHuman

	### Speed by Simplicity: A Single-Stream Architecture for Fast Audio-Video Generative Foundation Model

	<p align="center">
	<a href="https://www.sjtu.edu.cn/">SII-GAIR</a>  &  <a href="https://sand.ai">Sand.ai</a>
	</p>

	[![Paper](https://img.shields.io/badge/Paper-PDF-red)](https://arxiv.org/abs/2603.21986)
	[![Demo](https://img.shields.io/badge/%F0%9F%A4%97%20Demo-HuggingFace-orange)](https://huggingface.co/spaces/SII-GAIR/daVinci-MagiHuman)
	[![Models](https://img.shields.io/badge/%F0%9F%A4%97%20Models-HuggingFace-yellow)](https://huggingface.co/GAIR-NLP/daVinci-MagiHuman)
	[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
	[![Python](https://img.shields.io/badge/Python-3.12%2B-blue.svg)](https://www.python.org/)
	[![PyTorch](https://img.shields.io/badge/PyTorch-2.9%2B-ee4c2c.svg)](https://pytorch.org/)

	</div>

	## Highlights

	- Single-Stream Transformer — A unified 15B-parameter, 40-layer Transformer that jointly processes text, video, and audio via self-attention only. No cross-attention, no multi-stream complexity.
	- Exceptional Human-Centric Quality — Expressive facial performance, natural speech-expression coordination, realistic body motion, and accurate audio-video synchronization.
	- Multilingual — Supports Chinese (Mandarin & Cantonese), English, Japanese, Korean, German, and French.
	- Blazing Fast Inference — Generates a 5-second 256p video in 2 seconds and a 5-second 1080p video in 38 seconds on a single H100 GPU.
	- State-of-the-Art Results — Achieves 80.0% win rate vs Ovi 1.1 and 60.9% vs LTX 2.3 in pairwise human evaluation over 2,000 comparisons.
	- Fully Open Source — We release the complete model stack: base model, distilled model, super-resolution model, and inference code.

	## Demo

	<!--
	To add demo videos:
	1. Open a GitHub issue on this repo
	2. Drag & drop your .mp4 files into the issue comment box
	3. Copy the generated URLs and paste them below

	Example:
	https://github.com/user-attachments/assets/xxxx-xxxx
	-->

	https://github.com/user-attachments/assets/PLACEHOLDER_VIDEO_1

	https://github.com/user-attachments/assets/PLACEHOLDER_VIDEO_2

	https://github.com/user-attachments/assets/PLACEHOLDER_VIDEO_3

	## Architecture

	<div align="center">
	<img src="assets/architecture.png" width="90%">
	</div>

	daVinci-MagiHuman uses a single-stream Transformer that takes text tokens, a reference image latent, and noisy video and audio tokens as input, and jointly denoises the video and audio within a unified token sequence.
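
As a rough illustration of what "a unified token sequence" could look like, the sketch below lays the four input segments out in one flat sequence and marks which positions are denoised. The segment lengths, their ordering, and the names `Segment` and `build_sequence` are assumptions made for illustration; none of them come from the released code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    name: str       # modality label
    length: int     # number of tokens (hypothetical values below)
    denoised: bool  # True if the model predicts a cleaner version of it


def build_sequence(segments):
    """Flatten segments into per-position (modality, denoised) labels."""
    layout = []
    for seg in segments:
        layout.extend([(seg.name, seg.denoised)] * seg.length)
    return layout


# Hypothetical token counts and ordering. Text and the reference image
# only condition the generation; video and audio are denoised jointly.
segments = [
    Segment("text", 4, denoised=False),
    Segment("ref_image", 2, denoised=False),
    Segment("video", 6, denoised=True),
    Segment("audio", 3, denoised=True),
]

layout = build_sequence(segments)
denoise_mask = [flag for _, flag in layout]

print(len(layout))        # total sequence length seen by self-attention
print(sum(denoise_mask))  # number of positions that are denoised
```

Because every position lives in the same sequence, plain self-attention lets video and audio tokens attend to the text and reference tokens without a separate cross-attention path.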

	Key design choices:

	| Component | Description |
	|---|---|
	| Sandwich Architecture | First and last 4 layers use modality-specific projections; middle 32 layers share parameters across modalities |
	| Timestep-Free Denoising | No explicit timestep embeddings; the model infers the denoising state directly from input latents |
	| Per-Head Gating | Learned scalar gates with sigmoid activation on each attention head for training stability |
	| Unified Conditioning | Denoising and reference signals handled through a minimal unified interface, with no dedicated conditioning branches |
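
Two of these choices are easy to sketch in plain Python: the 4 + 32 + 4 layer split and a sigmoid gate per attention head. The function names, the 0-based layer indexing, and the point at which the gate is applied (directly on each head's output vector) are assumptions for illustration, not details confirmed by the released code.

```python
import math

NUM_LAYERS = 40  # from the Highlights section
EDGE_LAYERS = 4  # first and last 4 layers are modality-specific


def layer_kind(index):
    """Return 'modality_specific' or 'shared' for a 0-based layer index."""
    if not 0 <= index < NUM_LAYERS:
        raise ValueError(f"layer index out of range: {index}")
    if index < EDGE_LAYERS or index >= NUM_LAYERS - EDGE_LAYERS:
        return "modality_specific"
    return "shared"


def gate_heads(head_outputs, gate_logits):
    """Scale each head's output vector by sigmoid(gate logit).

    head_outputs: list of per-head vectors (lists of floats)
    gate_logits:  one learned scalar per head (assumed parameterization)
    """
    if len(head_outputs) != len(gate_logits):
        raise ValueError("need exactly one gate per head")
    gated = []
    for vec, logit in zip(head_outputs, gate_logits):
        g = 1.0 / (1.0 + math.exp(-logit))
        gated.append([g * x for x in vec])
    return gated


kinds = [layer_kind(i) for i in range(NUM_LAYERS)]
print(kinds.count("modality_specific"), kinds.count("shared"))  # 8 32

# A zero logit gives a gate of 0.5, so each head output is halved.
print(gate_heads([[2.0, 4.0], [1.0, 1.0]], [0.0, 0.0]))
```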

	## Performance

	### Quantitative Quality Benchmark

	| Model | Visual Quality ↑ | Text Alignment ↑ | Physical Consistency ↑ | WER ↓ |
	|---|:---:|:---:|:---:|:---:|
	| Ovi 1.1 | 4.73 | 4.10 | 4.41 | 40.45% |
	| LTX 2.3 | 4.76 | 4.12 | 4.56 | 19.23% |
	| daVinci-MagiHuman | 4.80 | 4.18 | 4.52 | 14.60% |

	### Human Evaluation (2,000 Pairwise Comparisons)

	| Matchup | daVinci-MagiHuman Win | Tie | Opponent Win |
	|---|:---:|:---:|:---:|
	| vs Ovi 1.1 | 80.0% | 8.2% | 11.8% |
	| vs LTX 2.3 | 60.9% | 17.2% | 21.9% |
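
One way to read these numbers is to set ties aside and ask how often daVinci-MagiHuman wins among the comparisons that produced a clear preference. The helper below is a small worked calculation on the percentages in the table; `decisive_win_rate` is a name introduced here, and this derived metric is not reported in the original evaluation.

```python
# (win %, tie %, opponent win %) copied from the table above
results = {
    "Ovi 1.1": (80.0, 8.2, 11.8),
    "LTX 2.3": (60.9, 17.2, 21.9),
}


def decisive_win_rate(win, tie, loss):
    """Win share (in percent) among comparisons that did not end in a tie."""
    total = win + tie + loss
    if abs(total - 100.0) > 0.1:
        raise ValueError(f"percentages sum to {total}, expected 100")
    return 100.0 * win / (win + loss)


for opponent, row in results.items():
    rate = decisive_win_rate(*row)
    print(f"vs {opponent}: {rate:.1f}% of non-tied comparisons")
```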

	### Inference Speed (Single H100 GPU, 5-second video)

	| Resolution | Base (s) | Super-Res (s) | Decode (s) | Total (s) |
	|---|:---:|:---:|:---:|:---:|
	| 256p | 1.6 | n/a | 0.4 | 2.0 |
	| 540p | 1.6 | 5.1 | 1.3 | 8.0 |
	| 1080p | 1.6 | 31.0 | 5.8 | 38.4 |
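
The stage timings can be cross-checked against the totals and turned into two derived figures: the share of time spent in super-resolution, and a real-time factor (video seconds produced per wall-clock second). This is a small worked calculation on the table values; `summarize` and the derived field names are introduced here for illustration, and the 256p row is entered with 0.0 seconds of super-resolution because that stage is not run.

```python
VIDEO_SECONDS = 5.0  # clip length used in the benchmark

# resolution -> (base, super-res, decode) seconds, from the table above
timings = {
    "256p": (1.6, 0.0, 0.4),
    "540p": (1.6, 5.1, 1.3),
    "1080p": (1.6, 31.0, 5.8),
}


def summarize(base, super_res, decode):
    """Return total time plus two derived ratios for one resolution."""
    total = base + super_res + decode
    return {
        "total_s": round(total, 1),
        "super_res_share": super_res / total,
        # A value above 1 means generation is faster than playback.
        "realtime_factor": VIDEO_SECONDS / total,
    }


for resolution, stages in timings.items():
    s = summarize(*stages)
    print(
        f"{resolution}: total {s['total_s']} s, "
        f"super-res {s['super_res_share']:.0%} of time, "
        f"{s['realtime_factor']:.2f}x real time"
    )
```

The base stage costs the same 1.6 seconds at every resolution, so at 1080p most of the total comes from super-resolution and decoding rather than from the 15B base model.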

	## Efficient Inference Techniques

	- Latent-Space Super-Resolution — Two-stage pipeline: generate at low resolution, then refine in latent space (not pixel space), avoiding an extra VAE decode-encode round trip.
	- Turbo VAE Decoder — A lightweight re-trained decoder that substantially reduces decoding overhead.
	- Full-Graph Compilation — [MagiCompiler](https://github.com/sandai/MagiCompiler) fuses operators across Transformer layers for ~1.2x speedup.
	- Distillation — DMD-2 distillation enables generation with only 8 denoising steps (no CFG), without sacrificing quality.
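
To see why "8 steps, no CFG" matters for speed, it helps to count Transformer evaluations per sample. The sketch below assumes the common formulation of classifier-free guidance, in which each step needs one conditional and one unconditional evaluation. The base model's step count is not stated in this README, so `HYPOTHETICAL_BASE_STEPS` is a placeholder and not a documented value.

```python
def forward_passes(steps, use_cfg):
    """Transformer evaluations needed to generate one sample.

    Assumes classifier-free guidance (CFG) costs one conditional and one
    unconditional evaluation per step, which is the common formulation.
    """
    if steps <= 0:
        raise ValueError("steps must be positive")
    return steps * (2 if use_cfg else 1)


DISTILLED_STEPS = 8           # from the list above
HYPOTHETICAL_BASE_STEPS = 32  # placeholder, NOT a documented value

distilled = forward_passes(DISTILLED_STEPS, use_cfg=False)
baseline = forward_passes(HYPOTHETICAL_BASE_STEPS, use_cfg=True)
print(distilled, baseline, baseline / distilled)
```

Under these assumptions the saving comes from two independent factors: fewer steps, and one evaluation per step instead of two.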

	## Getting Started

	### Option 1: Docker (Recommended)

	```bash
	# Pull the MagiCompiler Docker image
	docker pull sandai/magi-compiler:latest

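	# Launch the container with GPU access and the local model directory mounted at /models
	docker run -it --gpus all \
	  -v /path/to/models:/models \
	  sandai/magi-compiler:latest bash

	# The remaining commands run inside the container shell.
	# Install MagiCompiler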
	git clone https://github.com/sandai/MagiCompiler
	cd MagiCompiler
	pip install -e . --no-build-isolation --config-settings editable_mode=compat
	cd ..

	# Clone daVinci-MagiHuman
	git clone https://github.com/GAIR-NLP/daVinci-MagiHuman
	cd daVinci-MagiHuman
	```

	### Option 2: Conda

	```bash
	# Create environment
	conda create -n davinci python=3.12
	conda activate davinci

	# Install PyTorch
	pip install torch==2.9.0 torchvision==0.24.0 torchaudio==2.9.0

	# Install Flash Attention (Hopper)
	git clone https://github.com/Dao-AILab/flash-attention
	cd flash-attention/hopper && python setup.py install && cd ../..

	# Install MagiCompiler
	git clone https://github.com/sandai/MagiCompiler
	cd MagiCompiler
	pip install -e . --no-build-isolation --config-settings editable_mode=compat
	cd ..

	# Clone and install daVinci-MagiHuman
	git clone https://github.com/GAIR-NLP/daVinci-MagiHuman
	cd daVinci-MagiHuman
	pip install -r requirements.txt
	```

	### Download Model Checkpoints

	Download the complete model stack from [HuggingFace](https://huggingface.co/GAIR-NLP/daVinci-MagiHuman) and update the paths in the config files under `example/`.
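
If a scripted download is preferable to the web UI, one option is `snapshot_download` from the `huggingface_hub` package, which fetches every file in a repository. This is a minimal sketch: the wrapper name `download_checkpoints` is introduced here, and it assumes `huggingface_hub` is installed (`pip install huggingface_hub`) and that there is enough disk space for the full model stack.

```python
from pathlib import Path

REPO_ID = "GAIR-NLP/daVinci-MagiHuman"  # repository linked above


def download_checkpoints(local_dir):
    """Download the full model repository into local_dir and return its path."""
    # Imported lazily so this module loads even without huggingface_hub.
    from huggingface_hub import snapshot_download

    target = Path(local_dir).expanduser()
    target.mkdir(parents=True, exist_ok=True)
    return snapshot_download(repo_id=REPO_ID, local_dir=str(target))
```

Calling `download_checkpoints("/path/to/models")` would place the files in the same directory that the Docker example mounts at `/models`.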

	## Usage

	Before running, update the checkpoint paths in the config files (`example/*/config.json`) to point to your local model directory.
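
Editing several config files by hand is error-prone, so a small script can do the prefix swap instead. The sketch below does not assume any particular key names: it walks each `example/*/config.json` and rewrites every string value that starts with a given old prefix. The function names and the idea that all checkpoint paths share one common prefix are assumptions; inspect the shipped configs before relying on it.

```python
import json
from pathlib import Path


def replace_prefix(value, old, new):
    """Recursively swap a leading path prefix in every string value."""
    if isinstance(value, str):
        return new + value[len(old):] if value.startswith(old) else value
    if isinstance(value, list):
        return [replace_prefix(v, old, new) for v in value]
    if isinstance(value, dict):
        return {k: replace_prefix(v, old, new) for k, v in value.items()}
    return value


def update_configs(root, old, new):
    """Rewrite example/*/config.json under root; return the files changed."""
    changed = []
    for path in sorted(Path(root).glob("example/*/config.json")):
        original = json.loads(path.read_text(encoding="utf-8"))
        updated = replace_prefix(original, old, new)
        if updated != original:
            path.write_text(json.dumps(updated, indent=2) + "\n", encoding="utf-8")
            changed.append(path)
    return changed
```

For example, `update_configs(".", "/old/prefix", "/models")` run from the repository root would point every matching path at `/models`, where `/old/prefix` stands for whatever placeholder the shipped configs contain. Note that the files are re-serialized, so their original formatting is not preserved.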

	Base Model (256p)
	```bash
	bash example/base/run.sh
	```

	Distilled Model (256p, 8 steps, no CFG)
	```bash
	bash example/distill/run.sh
	```

	Super-Resolution to 540p
	```bash
	bash example/sr_540p/run.sh
	```

	Super-Resolution to 1080p
	```bash
	bash example/sr_1080p/run.sh
	```

	## Citation

	```bibtex
	@misc{davinci-magihuman-2025,
	title = {Speed by Simplicity: A Single-Stream Architecture for Fast Audio-Video Generative Foundation Model},
	author = {SII-GAIR and Sand.ai},
	year = {2025},
	url = {https://github.com/GAIR-NLP/daVinci-MagiHuman}
	}
	```

	## Acknowledgements

	daVinci-MagiHuman builds upon several outstanding open-source projects, including [Wan2.2](https://github.com/Wan-Video/Wan2.2), [Flash Attention](https://github.com/Dao-AILab/flash-attention), and [Turbo-VAED](https://github.com/zou-group/turbo-vaed). We thank the broader open-source community for making this work possible.

	## License

	This project is released under the [Apache License 2.0](https://opensource.org/licenses/Apache-2.0).