| title: daVinci-MagiHuman | |
| emoji: 🎬 | |
| colorFrom: blue | |
| colorTo: purple | |
| sdk: gradio | |
| sdk_version: 5.23.0 | |
| app_port: 7860 | |
| <div align="center"> | |
| # daVinci-MagiHuman | |
| ### Speed by Simplicity: A Single-Stream Architecture for Fast Audio-Video Generative Foundation Model | |
| <p align="center"> | |
| <a href="https://www.sjtu.edu.cn/">SII-GAIR</a> & <a href="https://sand.ai">Sand.ai</a> | |
| </p> | |
| [Paper (arXiv)](https://arxiv.org/abs/2603.21986) | |
| [Demo](https://huggingface.co/spaces/SII-GAIR/daVinci-MagiHuman) | |
| [Models](https://huggingface.co/GAIR-NLP/daVinci-MagiHuman) | |
| [License: Apache 2.0](https://opensource.org/licenses/Apache-2.0) | |
| [Python](https://www.python.org/) | |
| [PyTorch](https://pytorch.org/) | |
| </div> | |
| ## Highlights | |
| - **Single-Stream Transformer** — A unified 15B-parameter, 40-layer Transformer that jointly processes text, video, and audio via self-attention only. No cross-attention, no multi-stream complexity. | |
| - **Exceptional Human-Centric Quality** — Expressive facial performance, natural speech-expression coordination, realistic body motion, and accurate audio-video synchronization. | |
| - **Multilingual** — Supports Chinese (Mandarin & Cantonese), English, Japanese, Korean, German, and French. | |
| - **Blazing Fast Inference** — Generates a 5-second 256p video in **2 seconds** and a 5-second 1080p video in **38 seconds** on a single H100 GPU. | |
| - **State-of-the-Art Results** — Achieves **80.0%** win rate vs Ovi 1.1 and **60.9%** vs LTX 2.3 in pairwise human evaluation over 2,000 comparisons. | |
| - **Fully Open Source** — We release the complete model stack: base model, distilled model, super-resolution model, and inference code. | |
| ## Demo | |
| <!-- | |
| To add demo videos: | |
| 1. Open a GitHub issue on this repo | |
| 2. Drag & drop your .mp4 files into the issue comment box | |
| 3. Copy the generated URLs and paste them below | |
| Example: | |
| https://github.com/user-attachments/assets/xxxx-xxxx | |
| --> | |
| https://github.com/user-attachments/assets/PLACEHOLDER_VIDEO_1 | |
| https://github.com/user-attachments/assets/PLACEHOLDER_VIDEO_2 | |
| https://github.com/user-attachments/assets/PLACEHOLDER_VIDEO_3 | |
| ## Architecture | |
| <div align="center"> | |
| <img src="assets/architecture.png" width="90%"> | |
| </div> | |
| daVinci-MagiHuman uses a single-stream Transformer that takes text tokens, a reference image latent, and noisy video and audio tokens as input, and jointly denoises the video and audio within a unified token sequence. | |
| Key design choices: | |
| | Component | Description | | |
| |---|---| | |
| | **Sandwich Architecture** | First and last 4 layers use modality-specific projections; middle 32 layers share parameters across modalities | | |
| | **Timestep-Free Denoising** | No explicit timestep embeddings — the model infers the denoising state directly from input latents | | |
| | **Per-Head Gating** | Learned scalar gates with sigmoid activation on each attention head for training stability | | |
| | **Unified Conditioning** | Denoising and reference signals handled through a minimal unified interface — no dedicated conditioning branches | | |
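The layer layout and the gating rule in the table can be illustrated with a few lines of plain Python. This is a minimal sketch, not the released implementation: the names `layer_kind` and `gate_heads` are hypothetical, and the gate is assumed to be a per-head learned scalar that multiplies the head output after a sigmoid, which is one straightforward reading of "learned scalar gates with sigmoid activation".

```python
import math

NUM_LAYERS = 40   # total Transformer layers, from the Highlights section
EDGE_LAYERS = 4   # first and last 4 layers use modality-specific projections


def layer_kind(index: int) -> str:
    """Return 'modality_specific' or 'shared' for a 0-based layer index."""
    if not 0 <= index < NUM_LAYERS:
        raise ValueError(f"layer index out of range: {index}")
    is_edge = index < EDGE_LAYERS or index >= NUM_LAYERS - EDGE_LAYERS
    return "modality_specific" if is_edge else "shared"


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def gate_heads(head_outputs, gate_logits):
    """Scale each head's output vector by sigmoid(gate_logit) (assumed form)."""
    if len(head_outputs) != len(gate_logits):
        raise ValueError("need exactly one gate logit per head")
    return [
        [sigmoid(g) * value for value in head]
        for head, g in zip(head_outputs, gate_logits)
    ]


layout = [layer_kind(i) for i in range(NUM_LAYERS)]
print(layout.count("modality_specific"), layout.count("shared"))  # 8 32

# Two toy heads: a gate logit of 0.0 gives a gate of 0.5 and halves the output.
print(gate_heads([[2.0, 4.0], [1.0, 1.0]], [0.0, 0.0]))  # [[1.0, 2.0], [0.5, 0.5]]
```

With 4 modality-specific layers at each end, 8 of the 40 layers carry per-modality projections and the remaining 32 are shared, which matches the counts in the table.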
| ## Performance | |
| ### Quantitative Quality Benchmark | |
| | Model | Visual Quality ↑ | Text Alignment ↑ | Physical Consistency ↑ | WER ↓ | | |
| |---|:---:|:---:|:---:|:---:| | |
| | Ovi 1.1 | 4.73 | 4.10 | 4.41 | 40.45% | |
| | LTX 2.3 | 4.76 | 4.12 | **4.56** | 19.23% | | |
| | **daVinci-MagiHuman** | **4.80** | **4.18** | 4.52 | **14.60%** | | |
| ### Human Evaluation (2,000 Pairwise Comparisons) | |
| | Matchup | daVinci-MagiHuman Win | Tie | Opponent Win | | |
| |---|:---:|:---:|:---:| | |
| | vs Ovi 1.1 | **80.0%** | 8.2% | 11.8% | | |
| | vs LTX 2.3 | **60.9%** | 17.2% | 21.9% | | |
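As a reading aid, the percentages above can be checked and restated without ties. The sketch below only rearranges the numbers in the table. The "decisive win rate" (wins divided by wins plus losses) is a derived quantity introduced here for illustration and is not a metric reported by the authors.

```python
# Percentages from the table above: (win, tie, loss) for daVinci-MagiHuman.
RESULTS = {
    "Ovi 1.1": (80.0, 8.2, 11.8),
    "LTX 2.3": (60.9, 17.2, 21.9),
}


def decisive_win_rate(win: float, loss: float) -> float:
    """Share of wins among comparisons that did not end in a tie, in percent."""
    return 100.0 * win / (win + loss)


for opponent, (win, tie, loss) in RESULTS.items():
    # Each row of the table should account for all comparisons.
    assert abs(win + tie + loss - 100.0) < 1e-6
    rate = decisive_win_rate(win, loss)
    print(f"vs {opponent}: {rate:.1f}% of decisive comparisons")
# vs Ovi 1.1: 87.1% of decisive comparisons
# vs LTX 2.3: 73.6% of decisive comparisons
```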
| ### Inference Speed (Single H100 GPU, 5-second video) | |
| | Resolution | Base (s) | Super-Res (s) | Decode (s) | **Total (s)** | | |
| |---|:---:|:---:|:---:|:---:| | |
| | 256p | 1.6 | — | 0.4 | **2.0** | | |
| | 540p | 1.6 | 5.1 | 1.3 | **8.0** | | |
| | 1080p | 1.6 | 31.0 | 5.8 | **38.4** | | |
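The totals in this table are the sum of the per-stage times, so it is easy to see where the time goes at each resolution. The following sketch recomputes the totals from the table and derives two illustrative quantities that are not reported in the source: the share of time spent in super-resolution, and the ratio of clip length to generation time.

```python
VIDEO_SECONDS = 5.0  # clip length used in the table above

# (base, super-resolution, decode, reported total) in seconds; None = stage not used
TIMINGS = {
    "256p": (1.6, None, 0.4, 2.0),
    "540p": (1.6, 5.1, 1.3, 8.0),
    "1080p": (1.6, 31.0, 5.8, 38.4),
}


def summarize(resolution: str) -> dict:
    """Recompute the total and derive simple ratios for one table row."""
    base, sr, decode, reported = TIMINGS[resolution]
    sr_time = sr or 0.0
    total = base + sr_time + decode
    return {
        "total_s": round(total, 1),
        "matches_reported": abs(total - reported) < 0.05,
        "sr_share": round(sr_time / total, 2),
        "speed_vs_realtime": round(VIDEO_SECONDS / total, 2),
    }


for resolution in TIMINGS:
    print(resolution, summarize(resolution))
```

By this arithmetic the 256p path runs faster than real time (a 5-second clip in 2.0 seconds), while at 1080p the super-resolution stage accounts for roughly four fifths of the total.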
| ## Efficient Inference Techniques | |
| - **Latent-Space Super-Resolution** — Two-stage pipeline: generate at low resolution, then refine in latent space (not pixel space), avoiding an extra VAE decode-encode round trip. | |
| - **Turbo VAE Decoder** — A lightweight re-trained decoder that substantially reduces decoding overhead. | |
| - **Full-Graph Compilation** — [MagiCompiler](https://github.com/sandai/MagiCompiler) fuses operators across Transformer layers for ~1.2x speedup. | |
| - **Distillation** — DMD-2 distillation enables generation with only 8 denoising steps (no CFG), without sacrificing quality. | |
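The "8 steps, no CFG" point in the list above is easiest to appreciate as a count of Transformer forward passes. The sketch below assumes the standard form of classifier-free guidance, in which every sampling step runs one conditional and one unconditional pass. The function name is hypothetical, and the step count of the non-distilled base model is not stated in this README, so it is left out.

```python
def model_evaluations(steps: int, use_cfg: bool) -> int:
    """Count Transformer forward passes for one sampling run.

    Assumes standard classifier-free guidance: two passes per step
    (conditional and unconditional). Without CFG each step is one pass.
    """
    if steps <= 0:
        raise ValueError("steps must be positive")
    passes_per_step = 2 if use_cfg else 1
    return steps * passes_per_step


distilled = model_evaluations(8, use_cfg=False)
same_steps_with_cfg = model_evaluations(8, use_cfg=True)
print(distilled, same_steps_with_cfg)  # 8 16
```

Under that assumption, dropping CFG halves the number of forward passes at a fixed step count, on top of whatever reduction comes from using fewer steps.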
| ## Getting Started | |
| ### Option 1: Docker (Recommended) | |
| ```bash | |
| # Pull the MagiCompiler Docker image | |
| docker pull sandai/magi-compiler:latest | |
| # Launch the container (mounts your local model directory at /models) | |
| docker run -it --gpus all \ | |
|   -v /path/to/models:/models \ | |
|   sandai/magi-compiler:latest bash | |
| # Inside the container: install MagiCompiler | |
| git clone https://github.com/sandai/MagiCompiler | |
| cd MagiCompiler | |
| pip install -e . --no-build-isolation --config-settings editable_mode=compat | |
| cd .. | |
| # Clone daVinci-MagiHuman | |
| git clone https://github.com/GAIR-NLP/daVinci-MagiHuman | |
| cd daVinci-MagiHuman | |
| ``` | |
| ### Option 2: Conda | |
| ```bash | |
| # Create environment | |
| conda create -n davinci python=3.12 | |
| conda activate davinci | |
| # Install PyTorch | |
| pip install torch==2.9.0 torchvision==0.24.0 torchaudio==2.9.0 | |
| # Install Flash Attention (Hopper) | |
| git clone https://github.com/Dao-AILab/flash-attention | |
| cd flash-attention/hopper && python setup.py install && cd ../.. | |
| # Install MagiCompiler | |
| git clone https://github.com/sandai/MagiCompiler | |
| cd MagiCompiler | |
| pip install -e . --no-build-isolation --config-settings editable_mode=compat | |
| cd .. | |
| # Clone and install daVinci-MagiHuman | |
| git clone https://github.com/GAIR-NLP/daVinci-MagiHuman | |
| cd daVinci-MagiHuman | |
| pip install -r requirements.txt | |
| ``` | |
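Both install options assume a handful of command-line tools are already on `PATH`. The following pre-flight check is a small convenience sketch using only the Python standard library. The grouping into required and optional tools is an assumption based on the commands shown above (`git` and `pip` appear in both options; `docker`, `conda`, and `nvidia-smi` depend on the route you take), not a requirement list published by the project.

```python
import shutil
import sys

REQUIRED_TOOLS = ["git", "pip"]                      # used by both install options
OPTIONAL_TOOLS = ["docker", "conda", "nvidia-smi"]   # depends on the option chosen


def find_tools(names):
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


def preflight():
    """Collect the interpreter version and tool availability in one report."""
    report = {
        "python_version": "{}.{}".format(*sys.version_info[:2]),
        "required": find_tools(REQUIRED_TOOLS),
        "optional": find_tools(OPTIONAL_TOOLS),
    }
    report["missing_required"] = [
        name for name, path in report["required"].items() if path is None
    ]
    return report


report = preflight()
for section in ("required", "optional"):
    for name, path in report[section].items():
        print(f"{section:8s} {name:12s} {path or 'not found'}")
print("python", report["python_version"])
```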
| ### Download Model Checkpoints | |
| Download the complete model stack from [Hugging Face](https://huggingface.co/GAIR-NLP/daVinci-MagiHuman), then point the checkpoint paths in the config files under `example/` at your local copy. | |
| ## Usage | |
| Before running, update the checkpoint paths in the config files (`example/*/config.json`) to point to your local model directory. | |
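A quick way to confirm that the paths were updated is to scan each config for path-like strings that do not exist on disk. The sketch below makes no assumption about the key names in `config.json`, since the schema is not documented here. It walks the whole JSON tree and treats any string containing a path separator as a candidate path, so it may also flag URLs or repository IDs.

```python
import glob
import json
import os


def iter_strings(node, prefix=""):
    """Yield (dotted_key, value) for every string found in nested JSON data."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from iter_strings(value, f"{prefix}.{key}" if prefix else str(key))
    elif isinstance(node, list):
        for i, value in enumerate(node):
            yield from iter_strings(value, f"{prefix}[{i}]")
    elif isinstance(node, str):
        yield prefix, node


def missing_paths(config_file):
    """Return (key, path) pairs whose value looks like a path but does not exist."""
    with open(config_file, encoding="utf-8") as f:
        data = json.load(f)
    return [
        (key, value)
        for key, value in iter_strings(data)
        if os.sep in value and not os.path.exists(value)
    ]


def check_configs(pattern="example/*/config.json"):
    """Check every config file matching the glob pattern."""
    return {path: missing_paths(path) for path in sorted(glob.glob(pattern))}


# Run from the repository root; prints nothing if every path resolves.
for path, missing in check_configs().items():
    for key, value in missing:
        print(f"{path}: {key} -> {value} (not found)")
```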
| **Base Model (256p)** | |
| ```bash | |
| bash example/base/run.sh | |
| ``` | |
| **Distilled Model (256p, 8 steps, no CFG)** | |
| ```bash | |
| bash example/distill/run.sh | |
| ``` | |
| **Super-Resolution to 540p** | |
| ```bash | |
| bash example/sr_540p/run.sh | |
| ``` | |
| **Super-Resolution to 1080p** | |
| ```bash | |
| bash example/sr_1080p/run.sh | |
| ``` | |
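To run several of these examples back to back, a thin Python wrapper around the same `bash` commands can help. This is a hypothetical convenience script, not part of the repository. It only uses the script paths listed above and defaults to a dry run that prints the commands without executing them.

```python
import subprocess

# Script paths as listed in the Usage section.
EXAMPLES = {
    "base": "example/base/run.sh",
    "distill": "example/distill/run.sh",
    "sr_540p": "example/sr_540p/run.sh",
    "sr_1080p": "example/sr_1080p/run.sh",
}


def build_command(name):
    """Return the argv list for one example, e.g. ['bash', 'example/base/run.sh']."""
    if name not in EXAMPLES:
        raise KeyError(f"unknown example {name!r}; choose from {sorted(EXAMPLES)}")
    return ["bash", EXAMPLES[name]]


def run_examples(names, dry_run=True):
    """Run the selected examples in order, stopping at the first failure."""
    for name in names:
        command = build_command(name)
        print("would run:" if dry_run else "running:", " ".join(command))
        if not dry_run:
            # check=True raises CalledProcessError if the script exits non-zero.
            subprocess.run(command, check=True)


run_examples(["distill", "sr_540p"])  # dry run: only prints the commands
```

Call `run_examples([...], dry_run=False)` from the repository root to execute the scripts for real.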
| ## Citation | |
| ```bibtex | |
| @misc{davinci-magihuman-2025, | |
| title = {Speed by Simplicity: A Single-Stream Architecture for Fast Audio-Video Generative Foundation Model}, | |
| author = {SII-GAIR and Sand.ai}, | |
| year = {2025}, | |
| url = {https://github.com/GAIR-NLP/daVinci-MagiHuman} | |
| } | |
| ``` | |
| ## Acknowledgements | |
| daVinci-MagiHuman builds upon several outstanding open-source projects, including [Wan2.2](https://github.com/Wan-Video/Wan2.2), [Flash Attention](https://github.com/Dao-AILab/flash-attention), and [Turbo-VAED](https://github.com/zou-group/turbo-vaed). We thank the broader open-source community for making this work possible. | |
| ## License | |
| This project is released under the [Apache License 2.0](https://opensource.org/licenses/Apache-2.0). | |