Highlights
- Single-Stream Transformer: A unified 15B-parameter, 40-layer Transformer that jointly processes text, video, and audio via self-attention only. No cross-attention, no multi-stream complexity.
- Exceptional Human-Centric Quality: Expressive facial performance, natural speech-expression coordination, realistic body motion, and accurate audio-video synchronization.
- Multilingual: Supports Chinese (Mandarin & Cantonese), English, Japanese, Korean, German, and French.
- Blazing Fast Inference: Generates a 5-second 256p video in 2 seconds and a 5-second 1080p video in 38 seconds on a single H100 GPU.
- State-of-the-Art Results: Achieves an 80.0% win rate vs Ovi 1.1 and 60.9% vs LTX 2.3 in pairwise human evaluation over 2,000 comparisons.
- Fully Open Source: We release the complete model stack: base model, distilled model, super-resolution model, and inference code.
Architecture
daVinci-MagiHuman uses a single-stream Transformer that takes text tokens, a reference image latent, and noisy video and audio tokens as input, and jointly denoises the video and audio within a unified token sequence.
Key design choices:
| Component | Description |
| --- | --- |
| Sandwich Architecture | First and last 4 layers use modality-specific projections; middle 32 layers share parameters across modalities |
| Timestep-Free Denoising | No explicit timestep embeddings; the model infers the denoising state directly from input latents |
| Per-Head Gating | Learned scalar gates with sigmoid activation on each attention head for training stability |
| Unified Conditioning | Denoising and reference signals handled through a minimal unified interface, with no dedicated conditioning branches |
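
The layer split and the per-head gate described in the table can be illustrated with a small sketch. This is not the released implementation: the names `layer_kind` and `gate_heads`, the list-based head outputs, and the choice to multiply each head's output by its gate are assumptions made for illustration. Only the 4/32/4 layer split and the sigmoid-activated scalar gate per head come from the description above.

```python
import math

NUM_LAYERS = 40   # from the description: 40-layer Transformer
EDGE_LAYERS = 4   # first and last 4 layers use modality-specific projections


def layer_kind(index: int) -> str:
    """Classify a 0-based layer index as 'modality_specific' or 'shared'."""
    if not 0 <= index < NUM_LAYERS:
        raise ValueError(f"layer index {index} out of range")
    if index < EDGE_LAYERS or index >= NUM_LAYERS - EDGE_LAYERS:
        return "modality_specific"
    return "shared"


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gate_heads(head_outputs, gate_logits):
    """Scale each head's output vector by sigmoid(gate logit).

    head_outputs: list of per-head vectors (lists of floats)
    gate_logits:  one scalar per head (hypothetical learned parameter)
    """
    if len(head_outputs) != len(gate_logits):
        raise ValueError("need exactly one gate per head")
    return [
        [sigmoid(g) * v for v in head]
        for head, g in zip(head_outputs, gate_logits)
    ]


kinds = [layer_kind(i) for i in range(NUM_LAYERS)]
shared_count = kinds.count("shared")

# A gate logit of 0 gives sigmoid(0) = 0.5, so each head is halved.
gated = gate_heads([[1.0, 2.0], [3.0, 4.0]], [0.0, 0.0])
```

Because the gate is bounded in (0, 1), it can only attenuate a head's contribution, which is one plausible reading of why it helps training stability.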
Performance
Quantitative Quality Benchmark
| Model | Visual Quality ↑ | Text Alignment ↑ | Physical Consistency ↑ | WER ↓ |
| --- | --- | --- | --- | --- |
| Ovi 1.1 | 4.73 | 4.10 | 4.41 | 40.45% |
| LTX 2.3 | 4.76 | 4.12 | 4.56 | 19.23% |
| daVinci-MagiHuman | 4.80 | 4.18 | 4.52 | 14.60% |
Human Evaluation (2,000 Pairwise Comparisons)
| Matchup | daVinci-MagiHuman Win | Tie | Opponent Win |
| --- | --- | --- | --- |
| vs Ovi 1.1 | 80.0% | 8.2% | 11.8% |
| vs LTX 2.3 | 60.9% | 17.2% | 21.9% |
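
One way to read this table is to set ties aside and ask how often daVinci-MagiHuman wins among the decided comparisons. The sketch below does that arithmetic on the percentages above. The helper name `decided_win_rate` is introduced here for illustration, and the derived figures are a rereading of the reported numbers, not additional results.

```python
# Percentages copied from the human evaluation table.
results = {
    "Ovi 1.1": {"win": 80.0, "tie": 8.2, "loss": 11.8},
    "LTX 2.3": {"win": 60.9, "tie": 17.2, "loss": 21.9},
}


def decided_win_rate(win: float, loss: float) -> float:
    """Win percentage among comparisons that did not end in a tie."""
    return 100.0 * win / (win + loss)


rates = {}
for opponent, r in results.items():
    # Sanity check: each row should account for all comparisons.
    total = r["win"] + r["tie"] + r["loss"]
    assert abs(total - 100.0) < 1e-6, f"row for {opponent} sums to {total}"
    rates[opponent] = decided_win_rate(r["win"], r["loss"])
    print(f"vs {opponent}: {rates[opponent]:.1f}% of decided comparisons")
```

With ties excluded, the win rate comes to about 87.1% against Ovi 1.1 and about 73.6% against LTX 2.3.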
Inference Speed (5-second video)
| Resolution | Base (s) | Super-Res (s) | Decode (s) | Total (s) |
| --- | --- | --- | --- | --- |
| 256p | 1.6 | - | 0.4 | 2.0 |
| 540p | 1.6 | 5.1 | 1.3 | 8.0 |
| 1080p | 1.6 | 31.0 | 5.8 | 38.4 |
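
The totals follow directly from the stage timings, and dividing the 5-second clip length by the total gives a speed relative to real time. The sketch below checks both. The dictionary layout and the `realtime_factor` helper are illustrative; the numbers are copied from the table above, with the absent 256p super-resolution stage treated as 0 seconds.

```python
CLIP_SECONDS = 5.0  # all timings are for a 5-second video

timings = {
    "256p": {"base": 1.6, "super_res": 0.0, "decode": 0.4, "total": 2.0},
    "540p": {"base": 1.6, "super_res": 5.1, "decode": 1.3, "total": 8.0},
    "1080p": {"base": 1.6, "super_res": 31.0, "decode": 5.8, "total": 38.4},
}


def realtime_factor(total_seconds: float) -> float:
    """Seconds of video produced per second of wall-clock time."""
    return CLIP_SECONDS / total_seconds


summary = {}
for resolution, t in timings.items():
    stage_sum = t["base"] + t["super_res"] + t["decode"]
    # The reported total should equal the sum of the three stages.
    assert abs(stage_sum - t["total"]) < 1e-6, resolution
    summary[resolution] = {
        "realtime_factor": realtime_factor(t["total"]),
        "sr_share": t["super_res"] / t["total"],
    }
    print(
        f"{resolution}: {summary[resolution]['realtime_factor']:.2f}x real time, "
        f"super-resolution is {summary[resolution]['sr_share']:.0%} of total"
    )
```

At 256p the pipeline runs 2.5x faster than real time, while at 1080p super-resolution accounts for roughly 81% of the total, so the base generation cost stays constant and the upscaling stage dominates.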
Efficient Inference Techniques
- Latent-Space Super-Resolution: A two-stage pipeline that generates at low resolution, then refines in latent space (not pixel space), avoiding an extra VAE decode-encode round trip.
- Turbo VAE Decoder: A lightweight re-trained decoder that substantially reduces decoding overhead.
- Full-Graph Compilation: MagiCompiler fuses operators across Transformer layers for ~1.2x speedup.
- Distillation: DMD-2 distillation enables generation with only 8 denoising steps (no CFG), without sacrificing quality.
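
To see why dropping CFG matters alongside the step count, the sketch below counts denoiser forward passes under the common assumption that classifier-free guidance evaluates the model twice per step (once conditional, once unconditional). The 8-step, no-CFG setting is from the list above. The 32-step CFG baseline is a hypothetical figure chosen only to make the comparison concrete, since the base model's step count is not stated here.

```python
def forward_passes(steps: int, use_cfg: bool) -> int:
    """Number of denoiser evaluations for one generation.

    Assumes CFG costs two evaluations per step (conditional and
    unconditional); without CFG each step costs one.
    """
    if steps <= 0:
        raise ValueError("steps must be positive")
    return steps * (2 if use_cfg else 1)


# From the text: distilled model, 8 steps, no CFG.
distilled = forward_passes(8, use_cfg=False)

# Hypothetical baseline for comparison only; not a documented setting.
hypothetical_base = forward_passes(32, use_cfg=True)

reduction = hypothetical_base / distilled
print(f"distilled: {distilled} passes, hypothetical base: {hypothetical_base} passes")
print(f"reduction: {reduction:.0f}x fewer denoiser evaluations")
```

Under these assumptions the saving compounds: fewer steps and a single evaluation per step each cut the number of denoiser calls.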
Getting Started
Option 1: Docker (Recommended)
docker pull sandai/magi-compiler:latest
docker run -it --gpus all \
-v /path/to/models:/models \
sandai/magi-compiler:latest bash
git clone https://github.com/SandAI-org/MagiCompiler
cd MagiCompiler
pip install -e . --no-build-isolation --config-settings editable_mode=compat
cd ..
git clone https://github.com/GAIR-NLP/daVinci-MagiHuman
cd daVinci-MagiHuman
Option 2: Conda
conda create -n davinci python=3.12
conda activate davinci
pip install torch==2.9.0 torchvision==0.24.0 torchaudio==2.9.0
git clone https://github.com/Dao-AILab/flash-attention
cd flash-attention/hopper && python setup.py install && cd ../..
git clone https://github.com/SandAI-org/MagiCompiler
cd MagiCompiler
pip install -e . --no-build-isolation --config-settings editable_mode=compat
cd ..
git clone https://github.com/GAIR-NLP/daVinci-MagiHuman
cd daVinci-MagiHuman
pip install -r requirements.txt
Download Model Checkpoints
Download the complete model stack from Hugging Face and update the paths in the config files under example/.
Usage
Before running, update the checkpoint paths in the config files (example/*/config.json) to point to your local model directory.
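
If the configs contain many checkpoint paths, a small script can rewrite them in bulk. The sketch below is a hypothetical helper, not part of the repository. It assumes each `example/*/config.json` is plain JSON and that checkpoint paths are string values beginning with a known placeholder prefix (`/path/to/models` is used here purely as an example); it makes no assumption about key names. Inspect the actual config files before relying on it.

```python
import json
from pathlib import Path


def replace_prefix(value, old_prefix, new_prefix):
    """Recursively replace old_prefix at the start of any string value."""
    if isinstance(value, str):
        if value.startswith(old_prefix):
            return new_prefix + value[len(old_prefix):]
        return value
    if isinstance(value, list):
        return [replace_prefix(v, old_prefix, new_prefix) for v in value]
    if isinstance(value, dict):
        return {
            k: replace_prefix(v, old_prefix, new_prefix)
            for k, v in value.items()
        }
    return value  # numbers, booleans, and null are left untouched


def update_configs(example_dir, old_prefix, new_prefix):
    """Rewrite every <example_dir>/*/config.json in place.

    Returns the list of files that were actually changed.
    """
    changed = []
    for config_path in sorted(Path(example_dir).glob("*/config.json")):
        original = json.loads(config_path.read_text(encoding="utf-8"))
        updated = replace_prefix(original, old_prefix, new_prefix)
        if updated != original:
            config_path.write_text(
                json.dumps(updated, indent=2) + "\n", encoding="utf-8"
            )
            changed.append(config_path)
    return changed


if __name__ == "__main__":
    # Both prefixes are placeholders; substitute your own values.
    for path in update_configs("example", "/path/to/models", "/models"):
        print(f"updated {path}")
```

Matching on a path prefix rather than on key names keeps the helper independent of the config schema, at the cost of requiring that the placeholder prefix is known in advance.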
Base Model (256p)
bash example/base/run.sh
Distilled Model (256p, 8 steps, no CFG)
bash example/distill/run.sh
Super-Resolution to 540p
bash example/sr_540p/run.sh
Super-Resolution to 1080p
bash example/sr_1080p/run.sh
Acknowledgements
We thank the open-source community, and in particular Wan2.2 and Turbo-VAED, for their valuable contributions.
License
This project is released under the Apache License 2.0.
Citation
@misc{davinci-magihuman-2026,
title = {Speed by Simplicity: A Single-Stream Architecture for Fast Audio-Video Generative Foundation Model},
author = {SII-GAIR and Sand.ai},
year = {2026},
url = {https://github.com/GAIR-NLP/daVinci-MagiHuman}
}