---
title: daVinci-MagiHuman
emoji: 🎬
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.23.0
app_port: 7860
---
<div align="center">
# daVinci-MagiHuman
### Speed by Simplicity: A Single-Stream Architecture for Fast Audio-Video Generative Foundation Model
<p align="center">
<a href="https://www.sjtu.edu.cn/">SII-GAIR</a> &nbsp;&amp;&nbsp; <a href="https://sand.ai">Sand.ai</a>
</p>
[![Paper](https://img.shields.io/badge/Paper-PDF-red)](https://arxiv.org/abs/2603.21986)
[![Demo](https://img.shields.io/badge/%F0%9F%A4%97%20Demo-HuggingFace-orange)](https://huggingface.co/spaces/SII-GAIR/daVinci-MagiHuman)
[![Models](https://img.shields.io/badge/%F0%9F%A4%97%20Models-HuggingFace-yellow)](https://huggingface.co/GAIR-NLP/daVinci-MagiHuman)
[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![Python](https://img.shields.io/badge/Python-3.12%2B-blue.svg)](https://www.python.org/)
[![PyTorch](https://img.shields.io/badge/PyTorch-2.9%2B-ee4c2c.svg)](https://pytorch.org/)
</div>
## Highlights
- **Single-Stream Transformer** — A unified 15B-parameter, 40-layer Transformer that jointly processes text, video, and audio via self-attention only. No cross-attention, no multi-stream complexity.
- **Exceptional Human-Centric Quality** — Expressive facial performance, natural speech-expression coordination, realistic body motion, and accurate audio-video synchronization.
- **Multilingual** — Supports Chinese (Mandarin & Cantonese), English, Japanese, Korean, German, and French.
- **Blazing Fast Inference** — Generates a 5-second 256p video in **2 seconds** and a 5-second 1080p video in **38 seconds** on a single H100 GPU.
- **State-of-the-Art Results** — Achieves **80.0%** win rate vs Ovi 1.1 and **60.9%** vs LTX 2.3 in pairwise human evaluation over 2,000 comparisons.
- **Fully Open Source** — We release the complete model stack: base model, distilled model, super-resolution model, and inference code.
## Demo
<!--
To add demo videos:
1. Open a GitHub issue on this repo
2. Drag & drop your .mp4 files into the issue comment box
3. Copy the generated URLs and paste them below
Example:
https://github.com/user-attachments/assets/xxxx-xxxx
-->
https://github.com/user-attachments/assets/PLACEHOLDER_VIDEO_1
https://github.com/user-attachments/assets/PLACEHOLDER_VIDEO_2
https://github.com/user-attachments/assets/PLACEHOLDER_VIDEO_3
## Architecture
<div align="center">
<img src="assets/architecture.png" width="90%">
</div>
daVinci-MagiHuman uses a single-stream Transformer that takes text tokens, a reference image latent, and noisy video and audio tokens as input, and jointly denoises the video and audio within a unified token sequence.
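To make the "unified token sequence" concrete, here is a minimal sketch of how the four input groups could be laid out and which positions are denoising targets. The `Segment` type, the function names, the token counts, and the segment order are all illustrative assumptions; the released code may organize the sequence differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    modality: str   # "text", "ref_image", "video", or "audio"
    length: int     # number of tokens in this segment
    denoised: bool  # True if the model predicts a clean version of these tokens


def build_sequence(n_text, n_ref, n_video, n_audio):
    """Lay out one unified sequence. The segment order is an assumption."""
    return [
        Segment("text", n_text, denoised=False),      # conditioning only
        Segment("ref_image", n_ref, denoised=False),  # conditioning only
        Segment("video", n_video, denoised=True),     # noisy, jointly denoised
        Segment("audio", n_audio, denoised=True),     # noisy, jointly denoised
    ]


def denoise_mask(segments):
    """Per-token boolean mask: True where the token is a denoising target."""
    mask = []
    for seg in segments:
        mask.extend([seg.denoised] * seg.length)
    return mask


# Toy sizes, chosen only for illustration.
segments = build_sequence(n_text=4, n_ref=2, n_video=6, n_audio=3)
mask = denoise_mask(segments)
print(len(mask), sum(mask))
```

Because every token sits in the same sequence, plain self-attention lets video and audio tokens attend to the text and reference tokens without a separate cross-attention path.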
Key design choices:
| Component | Description |
|---|---|
| **Sandwich Architecture** | First and last 4 layers use modality-specific projections; middle 32 layers share parameters across modalities |
| **Timestep-Free Denoising** | No explicit timestep embeddings — the model infers the denoising state directly from input latents |
| **Per-Head Gating** | Learned scalar gates with sigmoid activation on each attention head for training stability |
| **Unified Conditioning** | Denoising and reference signals handled through a minimal unified interface — no dedicated conditioning branches |
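The sandwich layout and per-head gating rows can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the released implementation: the function names are hypothetical, and the gate is modeled as one learned logit per head, which the table does not specify in detail.

```python
import math

NUM_LAYERS = 40  # total Transformer layers, from the Highlights section
EDGE_LAYERS = 4  # first and last 4 layers use modality-specific projections


def layer_kind(index):
    """Return 'modality_specific' or 'shared' for a 0-based layer index."""
    if not 0 <= index < NUM_LAYERS:
        raise IndexError(index)
    if index < EDGE_LAYERS or index >= NUM_LAYERS - EDGE_LAYERS:
        return "modality_specific"
    return "shared"


def gate_heads(head_outputs, gate_logits):
    """Scale each head's output vector by sigmoid(logit) for that head."""
    gated = []
    for vec, logit in zip(head_outputs, gate_logits):
        gate = 1.0 / (1.0 + math.exp(-logit))  # sigmoid keeps the gate in (0, 1)
        gated.append([gate * x for x in vec])
    return gated


kinds = [layer_kind(i) for i in range(NUM_LAYERS)]
print(kinds.count("modality_specific"), kinds.count("shared"))

# One head with a zero logit is scaled by sigmoid(0) = 0.5.
print(gate_heads([[2.0, 4.0]], [0.0]))
```

The layer count works out to 4 + 32 + 4 = 40, matching the 40-layer figure quoted in the Highlights.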
## Performance
### Quantitative Quality Benchmark
| Model | Visual Quality ↑ | Text Alignment ↑ | Physical Consistency ↑ | WER ↓ |
|---|:---:|:---:|:---:|:---:|
| Ovi 1.1 | 4.73 | 4.10 | 4.41 | 40.45% |
| LTX 2.3 | 4.76 | 4.12 | **4.56** | 19.23% |
| **daVinci-MagiHuman** | **4.80** | **4.18** | 4.52 | **14.60%** |
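For readers unfamiliar with the WER column, the sketch below computes word error rate in its standard form: word-level edit distance divided by the number of reference words. The README does not state which speech recognizer or text normalization was used to produce the numbers above, so this shows only the generic metric, and `word_error_rate` is a hypothetical helper name.

```python
def word_error_rate(reference, hypothesis):
    """Standard WER: (substitutions + deletions + insertions) / reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("b" -> "x") and one deletion ("d") over four reference words.
print(word_error_rate("a b c d", "a x c"))
```

Lower is better: a WER of 14.60% corresponds to roughly one word error for every seven reference words.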
### Human Evaluation (2,000 Pairwise Comparisons)
| Matchup | daVinci-MagiHuman Win | Tie | Opponent Win |
|---|:---:|:---:|:---:|
| vs Ovi 1.1 | **80.0%** | 8.2% | 11.8% |
| vs LTX 2.3 | **60.9%** | 17.2% | 21.9% |
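Ties make the raw win percentages harder to compare across matchups. One common way to read such a table is to look at the share of decisive (non-tied) comparisons that were won. The sketch below derives that figure from the numbers above; it is a reader-side calculation, not a metric reported by the authors.

```python
# (win %, tie %, loss %) copied from the table above.
RESULTS = {
    "Ovi 1.1": (80.0, 8.2, 11.8),
    "LTX 2.3": (60.9, 17.2, 21.9),
}


def decisive_win_rate(win, loss):
    """Percentage of non-tied comparisons that were wins."""
    return 100.0 * win / (win + loss)


for name, (win, tie, loss) in RESULTS.items():
    rate = decisive_win_rate(win, loss)
    print(f"vs {name}: {rate:.1f}% of decisive comparisons")
```

With ties excluded, the preference comes out at roughly 87% against Ovi 1.1 and roughly 74% against LTX 2.3.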
### Inference Speed (Single H100 GPU, 5-second video)
| Resolution | Base (s) | Super-Res (s) | Decode (s) | **Total (s)** |
|---|:---:|:---:|:---:|:---:|
| 256p | 1.6 | — | 0.4 | **2.0** |
| 540p | 1.6 | 5.1 | 1.3 | **8.0** |
| 1080p | 1.6 | 31.0 | 5.8 | **38.4** |
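The Total column is the sum of the three stage times. The short sketch below reproduces it and reports which stage dominates at each resolution. The dictionary layout is ours, and the missing super-resolution entry at 256p is treated as 0.0 seconds because that setting skips the stage.

```python
# Stage timings in seconds, copied from the table above.
STAGES = {
    "256p": {"base": 1.6, "super_res": 0.0, "decode": 0.4},
    "540p": {"base": 1.6, "super_res": 5.1, "decode": 1.3},
    "1080p": {"base": 1.6, "super_res": 31.0, "decode": 5.8},
}


def total_seconds(stages):
    """Sum the stage times, rounded to the table's one-decimal precision."""
    return round(sum(stages.values()), 1)


def dominant_stage(stages):
    """Name of the stage that takes the most time."""
    return max(stages, key=stages.get)


for resolution, stages in STAGES.items():
    print(resolution, total_seconds(stages), dominant_stage(stages))
```

Base generation costs the same 1.6 s at every resolution, so the extra time at 540p and 1080p comes from super-resolution and decoding.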
## Efficient Inference Techniques
- **Latent-Space Super-Resolution** — Two-stage pipeline: generate at low resolution, then refine in latent space (not pixel space), avoiding an extra VAE decode-encode round trip.
- **Turbo VAE Decoder** — A lightweight re-trained decoder that substantially reduces decoding overhead.
- **Full-Graph Compilation** — [MagiCompiler](https://github.com/sandai/MagiCompiler) fuses operators across Transformer layers for ~1.2x speedup.
- **Distillation** — DMD-2 distillation enables generation with only 8 denoising steps (no CFG), without sacrificing quality.
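To see why dropping CFG matters alongside the reduced step count, the sketch below counts model evaluations per sample. It assumes standard classifier-free guidance, which evaluates a conditional and an unconditional branch at every step. The README does not state the base model's step count, so no comparison against it is attempted here.

```python
def forward_passes(steps, use_cfg):
    """Model evaluations for one sample, assuming CFG costs two per step."""
    if steps <= 0:
        raise ValueError("steps must be positive")
    return steps * (2 if use_cfg else 1)


# Distilled setting from the README: 8 steps, no CFG.
distilled = forward_passes(steps=8, use_cfg=False)

# Hypothetical comparison: the same 8 steps if CFG were still required.
same_steps_with_cfg = forward_passes(steps=8, use_cfg=True)

print(distilled, same_steps_with_cfg)
```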
## Getting Started
### Option 1: Docker (Recommended)
```bash
# Pull the MagiCompiler Docker image
docker pull sandai/magi-compiler:latest
# Launch container
docker run -it --gpus all \
  -v /path/to/models:/models \
  sandai/magi-compiler:latest bash
# Install MagiCompiler
git clone https://github.com/sandai/MagiCompiler
cd MagiCompiler
pip install -e . --no-build-isolation --config-settings editable_mode=compat
cd ..
# Clone daVinci-MagiHuman
git clone https://github.com/GAIR-NLP/daVinci-MagiHuman
cd daVinci-MagiHuman
```
### Option 2: Conda
```bash
# Create environment
conda create -n davinci python=3.12
conda activate davinci
# Install PyTorch
pip install torch==2.9.0 torchvision==0.24.0 torchaudio==2.9.0
# Install Flash Attention (Hopper)
git clone https://github.com/Dao-AILab/flash-attention
cd flash-attention/hopper && python setup.py install && cd ../..
# Install MagiCompiler
git clone https://github.com/sandai/MagiCompiler
cd MagiCompiler
pip install -e . --no-build-isolation --config-settings editable_mode=compat
cd ..
# Clone and install daVinci-MagiHuman
git clone https://github.com/GAIR-NLP/daVinci-MagiHuman
cd daVinci-MagiHuman
pip install -r requirements.txt
```
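After installing, a quick sanity check can catch version mismatches early. The sketch below checks the minimums shown in the badges at the top of this README (Python 3.12+, PyTorch 2.9+) and whether a CUDA device is visible. The helper names are hypothetical, and the script is not part of the repository.

```python
import re
import sys

MIN_PYTHON = (3, 12)  # from the Python badge
MIN_TORCH = (2, 9)    # from the PyTorch badge


def parse_major_minor(version):
    """Extract (major, minor) from strings such as '2.9.0+cu128'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def check_environment():
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    if sys.version_info[:2] < MIN_PYTHON:
        problems.append(f"Python {MIN_PYTHON[0]}.{MIN_PYTHON[1]}+ required")
    try:
        import torch  # imported lazily so the script still runs without PyTorch
    except ImportError:
        problems.append("PyTorch is not installed")
    else:
        if parse_major_minor(torch.__version__) < MIN_TORCH:
            problems.append(f"PyTorch {MIN_TORCH[0]}.{MIN_TORCH[1]}+ required")
        if not torch.cuda.is_available():
            problems.append("no CUDA device visible")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print("WARNING:", problem)
```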
### Download Model Checkpoints
Download the complete model stack (base, distilled, and super-resolution models) from [Hugging Face](https://huggingface.co/GAIR-NLP/daVinci-MagiHuman) into a local directory, then point the config files under `example/` at that directory.
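One way to fetch the checkpoints programmatically is `snapshot_download` from the `huggingface_hub` package. The sketch below assumes that package is installed (`pip install huggingface_hub`). The target directory is your choice, and `download_checkpoints` is a hypothetical wrapper, not a function from this repository.

```python
REPO_ID = "GAIR-NLP/daVinci-MagiHuman"  # model repo linked in this README


def download_checkpoints(local_dir):
    """Download every file in the model repo into local_dir; return the path."""
    # Imported lazily so that defining this function does not need the package.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=REPO_ID, local_dir=local_dir)


# Example call (downloads the full model stack, so expect a large transfer):
#   download_checkpoints("/path/to/models/daVinci-MagiHuman")
```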
## Usage
Before running, update the checkpoint paths in the config files (`example/*/config.json`) to point to your local model directory.
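If you prefer not to edit each config by hand, the paths can be rewritten with a small script. The sketch below makes no assumption about key names in `config.json`: it swaps a path prefix in every string value it finds. The function names and the prefixes you pass in are illustrative, so inspect the shipped configs first to see which prefix they use.

```python
import json
from pathlib import Path


def replace_prefix(value, old_prefix, new_prefix):
    """Recursively swap old_prefix for new_prefix in every string value."""
    if isinstance(value, str):
        if value.startswith(old_prefix):
            return new_prefix + value[len(old_prefix):]
        return value
    if isinstance(value, list):
        return [replace_prefix(v, old_prefix, new_prefix) for v in value]
    if isinstance(value, dict):
        return {k: replace_prefix(v, old_prefix, new_prefix) for k, v in value.items()}
    return value  # numbers, booleans, and null pass through unchanged


def update_configs(example_dir, old_prefix, new_prefix):
    """Rewrite example/*/config.json in place; return the files that changed."""
    changed = []
    for path in sorted(Path(example_dir).glob("*/config.json")):
        original = json.loads(path.read_text(encoding="utf-8"))
        updated = replace_prefix(original, old_prefix, new_prefix)
        if updated != original:
            path.write_text(json.dumps(updated, indent=2) + "\n", encoding="utf-8")
            changed.append(path)
    return changed


# Example call with hypothetical prefixes:
#   update_configs("example", "/old/checkpoint/root", "/models")
```

Note that the rewrite re-serializes each file with two-space indentation, so unrelated formatting in the config may change.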
**Base Model (256p)**
```bash
bash example/base/run.sh
```
**Distilled Model (256p, 8 steps, no CFG)**
```bash
bash example/distill/run.sh
```
**Super-Resolution to 540p**
```bash
bash example/sr_540p/run.sh
```
**Super-Resolution to 1080p**
```bash
bash example/sr_1080p/run.sh
```
## Citation
```bibtex
@misc{davinci-magihuman-2025,
title = {Speed by Simplicity: A Single-Stream Architecture for Fast Audio-Video Generative Foundation Model},
author = {SII-GAIR and Sand.ai},
year = {2025},
url = {https://github.com/GAIR-NLP/daVinci-MagiHuman}
}
```
## Acknowledgements
daVinci-MagiHuman builds upon several outstanding open-source projects, including [Wan2.2](https://github.com/Wan-Video/Wan2.2), [Flash Attention](https://github.com/Dao-AILab/flash-attention), and [Turbo-VAED](https://github.com/zou-group/turbo-vaed). We thank the broader open-source community for making this work possible.
## License
This project is released under the [Apache License 2.0](https://opensource.org/licenses/Apache-2.0).