---
license: apache-2.0
language:
- en
- zh
tags:
- text-to-video
- text-to-audio-video
- audio-video-generation
- mmdit
- flow-matching
- wan2.2
pipeline_tag: text-to-video
library_name: custom
base_model: Wan-AI/Wan2.2-TI2V-5B
---
<p align="center">
<img src="assets/logo.png" alt="NAVA" width="160">
</p>
<h1 align="center">NAVA: Native Audio-Visual Alignment for Generation</h1>
<p align="center">
<em>State-of-the-art audio-visual synchronization with only <b>6.3 B</b> parameters.</em>
</p>
<p align="center">
<a href="https://arxiv.org/abs/2605.30073"><img alt="arXiv" src="https://img.shields.io/badge/Paper-arXiv-b31b1b.svg"></a>
<a href="https://github.com/ernie-research/NAVA"><img alt="Code" src="https://img.shields.io/badge/Code-GitHub-181717.svg"></a>
<a href="https://ernie-research.github.io/NAVA/"><img alt="Project Page" src="https://img.shields.io/badge/Project_Page-online-2c8ebb.svg"></a>
<img alt="License" src="https://img.shields.io/badge/license-Apache--2.0-green.svg">
<img alt="Params" src="https://img.shields.io/badge/params-6.3B-orange.svg">
<img alt="Base model" src="https://img.shields.io/badge/base-Wan2.2--TI2V--5B-7c5cff.svg">
</p>
<p align="center">
<b>ERNIE Team</b> · Baidu Inc. · arXiv 2026
</p>
<p align="center">
<b>If you find this model useful, please consider giving our <a href="https://github.com/ernie-research/NAVA">GitHub repo</a> a star!</b>
</p>
<p align="center">
<a href="https://huggingface.co/baidu/NAVA/blob/main/README_zh.md"><b>Chinese README (中文版)</b></a>
</p>
---
## TL;DR
NAVA is a **6.3 B-parameter joint audio-video generator** that synthesizes synchronized video **and** audio from a single prompt, including multi-speaker speech with reference-timbre control and image-conditioned continuations.
Instead of post-hoc-aligned dual towers or fully unified tri-modal stacks, NAVA uses an **Align-then-Fuse MMDiT**: a dedicated alignment space first establishes audio-video correspondence, then context (text, speaker embeddings) is fused via cross-attention. On Verse-Bench it sets a new state of the art on Sync-C, Sync-D, video quality, and audio WER while using **2× to 5× fewer parameters** than open-source baselines.
> **Highlights**
> - **720p 1-min Fast Generation**: 720p synchronized audio-video in ~1 minute via 8-GPU Ulysses sequence parallelism.
> - **Dual-Channel Audio**: stereo audio (scene + speech) jointly denoised with video, no post-hoc vocoder alignment.
> - **Precise Multi-Timbre Control**: reference WAVs bound to `<S>...<E>` speech spans for per-speaker voice identity.
> - **Language-Described Camera Control**: shot composition, motion, and pacing directly from the prompt.
> - **Multi-Resolution**: landscape / portrait / square aspect ratios from the same checkpoint.
---
## Model Details
### Quick Facts
| | |
|---|---|
| **Architecture** | Align-then-Fuse MMDiT (Wan2.2 backbone) |
| **Parameters** | **6.3 B** (backbone, joint AV) |
| **Modality** | Joint audio + video, text-conditioned |
| **Resolution** | 1280×704 (recommended) · 960×960 also supported |
| **Frames / FPS** | 37 frames @ 24 fps ≈ 6 s · 55–61 frames ≈ 9–10 s |
| **Audio** | 25 latent tokens / sec, ≤ 10 s |
| **Sampling** | Flow matching · UniPC scheduler · 50 default steps |
| **Precision** | bf16 |
| **Parallelism** | Single-GPU **or** Ulysses sequence parallel (up to 8 GPUs) |
| **Base model** | [Wan-AI/Wan2.2-TI2V-5B](https://huggingface.co/Wan-AI/Wan2.2-TI2V-5B) |
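The frame counts in the table only line up with the listed durations if they are read as latent frames rather than output frames. The following is a minimal sketch of that arithmetic. It assumes (the card does not state this) that the frame count refers to video-VAE latent frames and that the 4× temporal compression of the causal VAE maps `n` latent frames to `1 + 4(n - 1)` output frames. The helper names are hypothetical.

```python
# Assumption: frame counts are *latent* frames. A causal video VAE keeps the
# first frame as-is and expands each later latent frame into 4 output frames.
TEMPORAL_COMPRESSION = 4    # the "4" in the 16x16x4 VAE compression listed in this card
FPS = 24                    # from the Quick Facts table
AUDIO_TOKENS_PER_SEC = 25   # from the Quick Facts table


def pixel_frames(latent_frames: int) -> int:
    """Number of decoded video frames for a given latent frame count."""
    return 1 + TEMPORAL_COMPRESSION * (latent_frames - 1)


def clip_seconds(latent_frames: int, fps: int = FPS) -> float:
    """Clip duration in seconds at the given playback rate."""
    return pixel_frames(latent_frames) / fps


def audio_tokens(latent_frames: int) -> int:
    """Approximate audio latent length matching the clip duration."""
    return round(clip_seconds(latent_frames) * AUDIO_TOKENS_PER_SEC)


for n in (37, 55, 61):
    print(n, pixel_frames(n), round(clip_seconds(n), 2), audio_tokens(n))
```

Under these assumptions, 37, 55, and 61 frames decode to 145, 217, and 241 output frames, or roughly 6.0 s, 9.0 s, and 10.0 s at 24 fps, which agrees with the table.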
### Architecture
<p align="center">
<img src="assets/arch.png" alt="NAVA Architecture" width="900">
</p>
NAVA instantiates *Native Audio-Visual Alignment* as an **Align-then-Fuse MMDiT** stack:
- **Hierarchical Alignment Layers: 10 double-stream blocks.** Video and audio keep separate QKV projections and FFNs but share a joint self-attention over concatenated `[video_tokens; audio_tokens]`, plus dedicated cross-attention to text. This builds an alignment space where AV correspondence is learned without semantic context interference.
- **Unified Fusion Layers: 20 single-stream blocks.** Video and audio share QKV/FFN; a unified joint attention treats all tokens as one stream, with a single text cross-attention path. This is where context-conditioned denoising happens.
- **Backbone hyperparameters.** `dim=3072`, `ffn_dim=14336`, 24 attention heads, 30 layers (10 double + 20 single), `text_len=512`, patch size `(1, 2, 2)`. RMSNorm on QK; cross-attention norm; ε = 1e-6.
- **Positional encoding.** 3D RoPE for video (temporal + height + width), 1D RoPE for audio, applied jointly inside the joint-attention path.
- **Timbre-in-Context Conditioning.** Reference-WAV speaker embeddings (ReDimNet, 192-d) are injected through the context pathway and bound to `<S>...<E>` speech spans, enabling per-speaker timbre control in multi-speaker scenes.
- **3D cross-modal CFG.** Independent classifier-free guidance scales for video, audio, and the cross-modal alignment direction (`video_align_guidance_scale`, `audio_align_guidance_scale`) keep AV synchronization tight at inference.
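To get a feel for the length of the concatenated `[video_tokens; audio_tokens]` sequence, the sketch below counts tokens using figures from this card: 16×16 spatial VAE compression, patch size `(1, 2, 2)`, and 25 audio latent tokens per second. It assumes that the frame count refers to latent frames and that audio latents are not patchified any further. Both are assumptions, and the function names are hypothetical.

```python
def video_tokens(width, height, latent_frames, vae_stride=16, patch=(1, 2, 2)):
    """Video tokens after VAE encoding and (t, h, w) patchification."""
    pt, ph, pw = patch
    lat_h, lat_w = height // vae_stride, width // vae_stride
    return (latent_frames // pt) * (lat_h // ph) * (lat_w // pw)


def audio_tokens(seconds, tokens_per_sec=25):
    """Audio latent tokens for a clip of the given duration."""
    return round(seconds * tokens_per_sec)


v = video_tokens(1280, 704, latent_frames=37)
a = audio_tokens(6.0)
print(f"video={v} audio={a} total={v + a} audio_share={a / (v + a):.2%}")
```

With these numbers a 1280×704, 37-frame sample has 32,560 video tokens and about 150 audio tokens, so audio makes up well under 1% of the joint sequence.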
### What's Different from Existing Open-Source AV Models
| Design axis | Typical baselines | **NAVA** |
|---|---|---|
| Stream layout | Dual-tower (post-hoc align) **or** fully unified tri-modal | **Align-then-Fuse**: alignment space first, context fused after |
| Speech control | Caption-only, no per-speaker timbre | **Timbre-in-Context** via reference WAVs |
| Param budget | 10 B – 32 B | **6.3 B** |
### Components Shipped Alongside the Backbone
| Component | Description | Size |
|---|---|---|
| **WanAVModel** (backbone) | MMDiT, joint AV attention | 6.3 B |
| **Wan2.2 Video VAE** | Causal 3D ConvNet · 16×16×4 spatial-temporal compression · 48 latent channels | 2.7 GB |
| **LTX Audio VAE + Vocoder** | 128 latent channels · 25 tokens/sec · built-in waveform decoder | 348 MB |
| **umt5-xxl Text Encoder** | T5 · 4096-d embeddings | 11 GB |
| **ReDimNet** | Speaker embedding Β· 192-d | ~50 MB |
---
## Evaluation
### Table 1: Verse-Bench (general AV capability)
NAVA achieves the **best** AV synchronization (Sync-C / Sync-D), video quality, and audio WER, with the smallest parameter budget.
| Model | Params | Resolution | Sync-C ↑ | Sync-D ↓ | IB ↑ | Video Quality ↑ | WER ↓ | PQ ↑ | FD ↓ |
|---|---|---|---|---|---|---|---|---|---|
| Ovi 1.1 | 10 B | 720p | <u>7.4839</u> | 7.9791 | 0.199 | <u>0.636</u> | 0.102 | 5.8432 | 0.9418 |
| MOVA | A18B (32 B) | 720p | 7.2888 | 7.808 | 0.269 | 0.603 | 0.126 | **7.2331** | 0.9222 |
| Davinci | 15 B | 540p | 7.1487 | 7.8158 | 0.269 | 0.600 | 0.151 | 5.9559 | 0.9307 |
| LTX 2.3 | 19 B | 512p | 7.2476 | <u>7.6902</u> | **0.337** | 0.576 | 0.106 | <u>6.9459</u> | **0.8287** |
| **NAVA (ours)** | **6.3 B** | 720p | **7.7914** | **7.5655** | <u>0.313</u> | **0.659** | **0.099** | 6.8609 | <u>0.8328</u> |
<sub>↑ higher is better · ↓ lower is better · **bold** = best · <u>underline</u> = 2nd best.</sub>
### Table 2: Seed-TTS-eval (speech quality)
Among joint AV models, NAVA delivers speech quality close to dedicated audio-only systems. Audio-only rows are listed *for reference*; they are not directly comparable.
| Category | Model | WER (%) ↓ | Speaker Similarity ↑ |
|---|---|---|---|
| Audio-Only *(reference)* | CosyVoice | 4.29 | 60.9 |
| Audio-Only *(reference)* | Qwen2.5-Omni | 2.72 | 63.2 |
| Audio-Video Joint | DreamID-Omni | 33.44 | 34.1 |
| Audio-Video Joint | **NAVA (ours)** | **5.81** | **62.4** |
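For readers unfamiliar with the metric, WER is conventionally the word-level edit distance between a transcript of the generated speech and the reference text, divided by the number of reference words. The sketch below is a generic illustration of that definition. It is not the evaluation code behind either table, which would also involve an ASR model and text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # word missing from the hypothesis
                curr[j - 1] + 1,         # extra word in the hypothesis
                prev[j - 1] + (r != h),  # substitution (cost 0 on a match)
            ))
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("hello there my friend", "hello my friend"))
```

On this scale a value of 0.099 corresponds to 9.9%. Table 1 appears to report WER as a fraction, while the Seed-TTS-eval numbers in Table 2 appear to be percentages.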
---
## How to Use
> **TL;DR command.** After §1 setup is complete:
> ```bash
> bash scripts/inference.sh # General T2AV
> bash scripts/inference_timbre.sh # I2AV + timbre control
> ```
> Outputs land under `eval_results/`.
### 1 · Setup (once)
```bash
git clone https://github.com/ernie-research/NAVA && cd NAVA
# Python deps
pip install torch torchvision torchaudio
pip install diffusers transformers accelerate safetensors einops scipy PyYAML tqdm sentencepiece
pip install flash-attn --no-build-isolation
# All weights in one shot: main checkpoint + Wan2.2 VAE + T5 + LTX audio VAE
huggingface-cli download <NAVA-repo-id> --local-dir .
```
<details>
<summary><b>Expected on-disk layout</b></summary>
```
NAVA/
├── NAVA.ckpt                                       # main checkpoint (24 GB)
├── Wan2.2-TI2V-5B/
│   ├── Wan2.2_VAE.pth                              # 2.7 GB
│   ├── models_t5_umt5-xxl-enc-bf16.pth             # 11 GB
│   └── google/umt5-xxl/{spiece.model, tokenizer.json}
├── params/
│   └── LTX2/
│       ├── ltx-2.3-22b-dev_audio_vae.safetensors   # 348 MB
│       └── LICENSE                                 # LTX-2 Community License
└── configs/                                        # inference YAMLs
```
The LTX audio-VAE Python code is vendored under `nava_src/vendor/ltx_core/` (see its `NOTICE.md`), so no separate clone of the LTX-Video repo is needed. ReDimNet is fetched via `torch.hub` on first run.
</details>
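Before launching inference it can help to confirm that the download produced the layout listed above. The following is a small sketch, not part of the NAVA repo. The paths are copied from the expected layout, with `LTX2/` assumed to sit under `params/`, and `missing_files` is a hypothetical helper name.

```python
from pathlib import Path

# Relative paths taken from the "Expected on-disk layout" listing.
# Assumption: the audio VAE weights live under params/LTX2/.
EXPECTED = [
    "NAVA.ckpt",
    "Wan2.2-TI2V-5B/Wan2.2_VAE.pth",
    "Wan2.2-TI2V-5B/models_t5_umt5-xxl-enc-bf16.pth",
    "Wan2.2-TI2V-5B/google/umt5-xxl/spiece.model",
    "Wan2.2-TI2V-5B/google/umt5-xxl/tokenizer.json",
    "params/LTX2/ltx-2.3-22b-dev_audio_vae.safetensors",
    "configs",
]


def missing_files(root="."):
    """Return the expected paths that do not exist under `root`."""
    root = Path(root)
    return [rel for rel in EXPECTED if not (root / rel).exists()]


if __name__ == "__main__":
    missing = missing_files(".")
    if missing:
        print("missing:\n  " + "\n  ".join(missing))
    else:
        print("all expected files present")
```

Run it from the repository root after the download step. ReDimNet is not checked because it is fetched via `torch.hub` on first run.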
### 2 · One-command inference (recommended, 8 GPU SP)
The repo ships two end-to-end scripts that build a JSONL inline and launch SP=8 inference:
```bash
# General T2AV (text-only)
bash scripts/inference.sh
# I2AV + Timbre Control (first-frame image + reference voice)
bash scripts/inference_timbre.sh
```
Override defaults via env vars:
```bash
CKPT=/path/to/NAVA.ckpt OUT_DIR=eval_results/run1 bash scripts/inference.sh
TIMBRE_SCALE=3.0 SPK_WAV=/path/to/spk.wav bash scripts/inference_timbre.sh
```
### 3 · Custom batches: write your own JSONL
Each line is one prompt:
```jsonl
{"prompt": "一位男子在海边奔跑，镜头跟随。背景是海浪声和风声。"}
{"prompt": "两人对话<S>Hello<E><S>Hi there<E>", "spk_wavs": ["/abs/path/spk1.wav", "/abs/path/spk2.wav"]}
{"prompt": "镜头跟随主体...", "image_path": "/abs/path/first_frame.png"}
```
| Field | Required | Description |
|---|---|---|
| `prompt` | yes | Text caption (also accepts legacy `text` field name) |
| `image_path` | no | Absolute path to first-frame image; auto-enables I2V for this sample |
| `spk_wavs` | no | List of absolute paths to speaker reference WAVs (max 2) |
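For larger batches it may be easier to generate the JSONL programmatically. The sketch below enforces only what the field table states (required `prompt`, absolute paths, at most two speaker WAVs). `make_record` and `write_jsonl` are hypothetical helpers, not functions from the NAVA repo, and the example prompts and paths are placeholders.

```python
import json
import os

MAX_SPEAKERS = 2  # "max 2" in the field table


def make_record(prompt, image_path=None, spk_wavs=None):
    """Build one JSONL record, enforcing the constraints from the field table."""
    if not prompt:
        raise ValueError("prompt is required")
    record = {"prompt": prompt}
    if image_path is not None:
        if not os.path.isabs(image_path):
            raise ValueError(f"image_path must be absolute: {image_path}")
        record["image_path"] = image_path
    if spk_wavs:
        if len(spk_wavs) > MAX_SPEAKERS:
            raise ValueError(f"at most {MAX_SPEAKERS} speaker WAVs are supported")
        for wav in spk_wavs:
            if not os.path.isabs(wav):
                raise ValueError(f"spk_wavs entries must be absolute: {wav}")
        record["spk_wavs"] = list(spk_wavs)
    return record


def write_jsonl(records, path):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            # ensure_ascii=False keeps Chinese prompts human-readable on disk.
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")


if __name__ == "__main__":
    records = [
        make_record("A man runs along the beach while the camera follows."),
        make_record("Two people talk <S>Hello<E><S>Hi there<E>",
                    spk_wavs=["/abs/path/spk1.wav", "/abs/path/spk2.wav"]),
    ]
    write_jsonl(records, "my_prompts.jsonl")
```

The resulting `my_prompts.jsonl` can be passed to `--data_file` unchanged.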
Then launch:
```bash
SETUPTOOLS_USE_DISTUTILS=stdlib torchrun \
--nnodes=1 --nproc_per_node=8 \
--master_addr=127.0.0.1 --master_port=29507 \
inference_nava.py \
--config configs/baseline_t2av_demo_mmdit_no_split_ltx_control_unipc.yaml \
--ckpt NAVA.ckpt \
--out_dir ./outputs \
--data_format json --data_file my_prompts.jsonl \
--width 1280 --height 704 --frames 37 --fps 24 \
--steps 50 --save_sample --gen_turn 1 --use_sp
```
Outputs land at `outputs/{save_path}-{gen_turn}_av.mp4`. For timbre-controlled samples, also pass `--timbre_cfg --timbre_align_guidance_scale 3.0`.
#### Mode cheatsheet
| Goal | JSONL fields | Extra flags |
|---|---|---|
| Text → AV | `prompt` | (none) |
| Image → AV | `prompt` + `image_path` | (auto-detected) |
| Timbre-controlled speech | `prompt` + `spk_wavs` | `--timbre_cfg --timbre_align_guidance_scale 3.0` |
| 9-second video | any | `--frames 55` |
| Single-GPU (slower) | any | omit `--use_sp` |
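The cheatsheet can be folded into a small command builder. The sketch below is a hypothetical wrapper, not something shipped with the repo. It uses only flags that appear in this section, and it assumes that a single-GPU run is launched with `--nproc_per_node=1` and without `--use_sp`.

```python
import shlex

CONFIG = "configs/baseline_t2av_demo_mmdit_no_split_ltx_control_unipc.yaml"


def build_command(data_file, ckpt="NAVA.ckpt", out_dir="./outputs",
                  frames=37, num_gpus=8, timbre=False, timbre_scale=3.0):
    """Assemble the torchrun argv for inference_nava.py."""
    cmd = [
        "torchrun", "--nnodes=1", f"--nproc_per_node={num_gpus}",
        "--master_addr=127.0.0.1", "--master_port=29507",
        "inference_nava.py",
        "--config", CONFIG, "--ckpt", ckpt, "--out_dir", out_dir,
        "--data_format", "json", "--data_file", data_file,
        "--width", "1280", "--height", "704",
        "--frames", str(frames), "--fps", "24",
        "--steps", "50", "--save_sample", "--gen_turn", "1",
    ]
    if num_gpus > 1:
        cmd.append("--use_sp")  # sequence parallel only with multiple GPUs
    if timbre:
        cmd += ["--timbre_cfg", "--timbre_align_guidance_scale", str(timbre_scale)]
    return cmd


# Example: a 9-second, timbre-controlled batch on 8 GPUs.
print(shlex.join(build_command("my_prompts.jsonl", frames=55, timbre=True)))
```

To execute the command rather than print it, pass the list to `subprocess.run` with `SETUPTOOLS_USE_DISTUTILS=stdlib` set in the environment, as in the launch command above.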
### 4 · Prompt rewriting (recommended for short / English inputs)
NAVA is trained on Chinese dense captions; short or English prompts benefit substantially from rewriting before inference. Three pathways are provided, all sharing the same system prompt and sampling profile (so output style stays consistent), with `<S>...<E>` speech spans preserved verbatim.
| Pathway | Backend | Speed | Best for |
|---|---|---|---|
| **vLLM batch server** (`pe_src/`) | Qwen3-4B-Thinking-2507 served via vLLM, async HTTP | **< 2 s** / prompt | Offline batches |
| **Local transformers, single** (`gradio_demo/rewrite_single.py`) | Same model, in-process | 40–80 s / prompt | One-off CLI |
| **Gradio "Rewrite" button** | Same as above, hosted in Gradio | 40–80 s / prompt | Interactive UI |
```bash
# Batch path: start vLLM server, then rewrite a txt of prompts
bash pe_src/start_server.sh --gpu 0 --low-footprint
python pe_src/rewrite.py -i prompts.txt -o prompts_rewritten.txt
```
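Because the rewriter is an LLM, it can be worth checking that the `<S>...<E>` speech spans really did survive verbatim before sending rewritten prompts to inference. This is a minimal sketch of such a check. The helper names are hypothetical and it is not part of `pe_src/`.

```python
import re

# Non-greedy match so adjacent spans such as <S>a<E><S>b<E> stay separate.
SPAN_RE = re.compile(r"<S>(.*?)<E>", re.DOTALL)


def speech_spans(text):
    """Return the contents of every <S>...<E> span, in order."""
    return SPAN_RE.findall(text)


def spans_preserved(original, rewritten):
    """True if the rewritten prompt has the same spans in the same order."""
    return speech_spans(original) == speech_spans(rewritten)


original = "Two people talk <S>Hello<E><S>Hi there<E>"
rewritten = "Two friends chat in a sunlit cafe. <S>Hello<E> The other smiles. <S>Hi there<E>"
print(spans_preserved(original, rewritten))
```

A prompt pair that fails this check can be rewritten again or passed through unrewritten, so that the speaker references stay aligned with the intended lines.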
### 5 · Gradio Web UI
Interactive demo with click-to-rewrite (Qwen3-4B), image upload, and reference-WAV upload:
```bash
bash gradio_demo/start_gradio.sh \
--config configs/baseline_t2av_demo_mmdit_no_split_ltx_control_unipc.yaml \
--ckpt NAVA.ckpt \
--rewrite_model pe_src/Qwen3-4B-Thinking-2507 \
--port 8000 --nproc 8
```
<details>
<summary><b>Debug mode (no models, UI only)</b></summary>
```bash
python gradio_demo/gradio_server.py --debug --port 8000
```
</details>
---
## Bias, Safety, and Misuse
NAVA can synthesize video and speech conditioned on a reference image (`image_path`) and reference voice (`spk_wavs`). Using it to depict real persons without consent, including face-likeness or voice-likeness reproduction, is prohibited by the license and may also be illegal in your jurisdiction. We recommend:
1. Only use **consent-approved** reference media.
2. **Label generated content as synthetic.**
3. Apply **provenance / watermarking** before redistribution.
---
## Citation
```bibtex
@article{nava2026,
title = {NAVA: Native Audio-Visual Alignment for Joint Audio-Video Generation},
author = {ERNIE Team},
journal = {arXiv preprint},
year = {2026},
}
```
## Acknowledgements
NAVA builds on excellent upstream work: **Wan2.2-TI2V-5B** (video backbone & VAE), **LTX 2.3** (audio VAE + built-in vocoder), **umt5-xxl** (text encoder), and **ReDimNet** (speaker embedding). We also thank the open-source AV-generation community (Ovi, MOVA, Davinci, LTX) for releasing strong baselines that made fair benchmarking possible.
## License & Contact
Released under **Apache-2.0**. For research / commercial inquiries, contact the **ERNIE team at Baidu Inc.**