---
license: apache-2.0
language:
  - en
  - zh
tags:
  - text-to-video
  - text-to-audio-video
  - audio-video-generation
  - mmdit
  - flow-matching
  - wan2.2
pipeline_tag: text-to-video
library_name: custom
base_model: Wan-AI/Wan2.2-TI2V-5B
---

<p align="center">
  <img src="assets/logo.png" alt="NAVA" width="160">
</p>

<h1 align="center">NAVA: Native Audio-Visual Alignment for Generation</h1>

<p align="center">
  <em>State-of-the-art audio-visual synchronization with only <b>6.3 B</b> parameters.</em>
</p>

<p align="center">
  <a href="https://arxiv.org/abs/2605.30073"><img alt="arXiv" src="https://img.shields.io/badge/Paper-arXiv-b31b1b.svg"></a>
  <a href="https://github.com/ernie-research/NAVA"><img alt="Code" src="https://img.shields.io/badge/Code-GitHub-181717.svg"></a>
  <a href="https://ernie-research.github.io/NAVA/"><img alt="Project Page" src="https://img.shields.io/badge/Project_Page-online-2c8ebb.svg"></a>
  <img alt="License" src="https://img.shields.io/badge/license-Apache--2.0-green.svg">
  <img alt="Params" src="https://img.shields.io/badge/params-6.3B-orange.svg">
  <img alt="Base model" src="https://img.shields.io/badge/base-Wan2.2--TI2V--5B-7c5cff.svg">
</p>

<p align="center">
  <b>ERNIE Team</b> · Baidu Inc. · arXiv 2026
</p>

<p align="center">
  ⭐ <b>If you find this model useful, please consider giving our <a href="https://github.com/ernie-research/NAVA">GitHub repo</a> a star!</b> ⭐
</p>

<p align="center">
  📖 <a href="https://huggingface.co/baidu/NAVA/blob/main/README_zh.md"><b>Chinese README (中文版)</b></a>
</p>

---

## TL;DR

NAVA is a **6.3 B-parameter joint audio-video generator** that synthesizes synchronized video **and** audio from a single prompt, including multi-speaker speech with reference-timbre control and image-conditioned continuations.

Instead of post-hoc-aligned dual towers or fully unified tri-modal stacks, NAVA uses an **Align-then-Fuse MMDiT**: a dedicated alignment space first establishes audio-video correspondence, then context (text, speaker embeddings) is fused via cross-attention. On Verse-Bench it reports the best Sync-C, Sync-D, video quality, and audio WER among the compared open-source models, while using **roughly 1.6× to 5× fewer parameters** (6.3 B versus 10 B to 32 B).

> **Highlights**
> - **720p 1-min Fast Generation**: 720p synchronized audio-video in ~1 minute via 8-GPU Ulysses sequence parallelism.
> - **Dual-Channel Audio**: stereo audio (scene + speech) jointly denoised with video, with no post-hoc vocoder alignment.
> - **Precise Multi-Timbre Control**: reference WAVs bound to `<S>...<E>` speech spans for per-speaker voice identity.
> - **Language-Described Camera Control**: shot composition, motion, and pacing directly from the prompt.
> - **Multi-Resolution**: landscape / portrait / square aspect ratios from the same checkpoint.

---

## Model Details

### Quick Facts

| | |
|---|---|
| **Architecture** | Align-then-Fuse MMDiT (Wan2.2 backbone) |
| **Parameters** | **6.3 B** (backbone, joint AV) |
| **Modality** | Joint audio + video, text-conditioned |
| **Resolution** | 1280×704 (recommended) · 960×960 also supported |
| **Frames / FPS** | 37 frames @ 24 fps ≈ 6 s · 55–61 frames ≈ 9–10 s |
| **Audio** | 25 latent tokens / sec, ≤ 10 s |
| **Sampling** | Flow matching · UniPC scheduler · 50 default steps |
| **Precision** | bf16 |
| **Parallelism** | Single-GPU **or** Ulysses sequence parallel (up to 8 GPUs) |
| **Base model** | [Wan-AI/Wan2.2-TI2V-5B](https://huggingface.co/Wan-AI/Wan2.2-TI2V-5B) |

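The frame counts in the table match the stated durations if they are read as latent frames rather than pixel frames. The sketch below works through that arithmetic. It assumes the 4× temporal and 16×16 spatial compression listed for the Wan2.2 Video VAE, a causal layout in which the first latent frame covers a single video frame, and the `(1, 2, 2)` patch size given under Architecture. This reading is an inference, not something the card states, and the helper names are hypothetical rather than part of the NAVA codebase.

```python
# Illustrative arithmetic only; helper names are hypothetical, not NAVA APIs.
TEMPORAL_STRIDE = 4        # assumed from the "16x16x4" VAE compression entry
SPATIAL_STRIDE = 16        # assumed from the same entry
PATCH_HW = 2               # assumed from patch size (1, 2, 2)
AUDIO_TOKENS_PER_SEC = 25  # "25 latent tokens / sec" in the table


def pixel_frames(latent_frames: int) -> int:
    """Video frames covered by a causal-VAE latent sequence (assumption)."""
    return (latent_frames - 1) * TEMPORAL_STRIDE + 1


def duration_seconds(latent_frames: int, fps: int = 24) -> float:
    """Clip length implied by the latent frame count at the given frame rate."""
    return pixel_frames(latent_frames) / fps


def video_tokens(latent_frames: int, width: int, height: int) -> int:
    """Video tokens after VAE downsampling and 2x2 spatial patching."""
    cols = width // SPATIAL_STRIDE // PATCH_HW
    rows = height // SPATIAL_STRIDE // PATCH_HW
    return latent_frames * cols * rows


def audio_tokens(seconds: float) -> int:
    """Audio latent tokens for a clip of the given length."""
    return round(seconds * AUDIO_TOKENS_PER_SEC)


if __name__ == "__main__":
    for n in (37, 55, 61):
        print(n, pixel_frames(n), round(duration_seconds(n), 2))
    print(video_tokens(37, 1280, 704), audio_tokens(6))
```

Under these assumptions, 37 latent frames cover 145 video frames (about 6 s at 24 fps), and a 1280×704 clip of that length yields 32,560 video tokens alongside 150 audio tokens in the joint attention.
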
### Architecture

<p align="center">
  <img src="assets/arch.png" alt="NAVA Architecture" width="900">
</p>

NAVA instantiates *Native Audio-Visual Alignment* as an **Align-then-Fuse MMDiT** stack:

- **Hierarchical Alignment Layers (10 double-stream blocks).** Video and audio keep separate QKV projections and FFNs but share a joint self-attention over concatenated `[video_tokens; audio_tokens]`, plus dedicated cross-attention to text. This builds an alignment space where AV correspondence is learned without semantic context interference.
- **Unified Fusion Layers (20 single-stream blocks).** Video and audio share QKV/FFN; a unified joint attention treats all tokens as one stream, with a single text cross-attention path. This is where context-conditioned denoising happens.
- **Backbone hyperparameters.** `dim=3072`, `ffn_dim=14336`, 24 attention heads, 30 layers (10 double + 20 single), `text_len=512`, patch size `(1, 2, 2)`. RMSNorm on QK; cross-attention norm; ε = 1e-6.
- **Positional encoding.** 3D RoPE for video (temporal + height + width), 1D RoPE for audio, applied jointly inside the joint-attention path.
- **Timbre-in-Context Conditioning.** Reference-WAV speaker embeddings (ReDimNet, 192-d) are injected through the context pathway and bound to `<S>...<E>` speech spans, enabling per-speaker timbre control in multi-speaker scenes.
- **3D cross-modal CFG.** Independent classifier-free guidance scales for video, audio, and the cross-modal alignment direction (`video_align_guidance_scale`, `audio_align_guidance_scale`) keep AV synchronization tight at inference.

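As a compact restatement of the backbone hyperparameters, the sketch below collects them in one place and derives the per-head dimension and block order. The `BackboneConfig` class and its field names are hypothetical; only the numeric values come from the list above, and the real configuration object in the repository may look different.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackboneConfig:
    """Values quoted in the model card; the class itself is a hypothetical container."""
    dim: int = 3072
    ffn_dim: int = 14336
    num_heads: int = 24
    double_stream_blocks: int = 10  # Hierarchical Alignment Layers
    single_stream_blocks: int = 20  # Unified Fusion Layers
    text_len: int = 512
    patch_size: tuple = (1, 2, 2)
    eps: float = 1e-6

    @property
    def head_dim(self) -> int:
        """Width of each attention head."""
        return self.dim // self.num_heads

    @property
    def num_layers(self) -> int:
        """Total transformer blocks across both stages."""
        return self.double_stream_blocks + self.single_stream_blocks

    def block_layout(self) -> list:
        """Block types in execution order: alignment first, then fusion."""
        return (["double"] * self.double_stream_blocks
                + ["single"] * self.single_stream_blocks)


if __name__ == "__main__":
    cfg = BackboneConfig()
    print(cfg.head_dim, cfg.num_layers, cfg.block_layout()[9:11])
```

With these values each of the 24 heads is 128 wide, and the 30 blocks run as 10 double-stream blocks followed by 20 single-stream blocks.
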
### What's Different from Existing Open-Source AV Models

| Design axis | Typical baselines | **NAVA** |
|---|---|---|
| Stream layout | Dual-tower (post-hoc align) **or** fully unified tri-modal | **Align-then-Fuse**: alignment space first, context fused after |
| Speech control | Caption-only, no per-speaker timbre | **Timbre-in-Context** via reference WAVs |
| Param budget | 10 B – 32 B | **6.3 B** |

### Components Shipped Alongside the Backbone

| Component | Description | Size |
|---|---|---|
| **WanAVModel** (backbone) | MMDiT, joint AV attention | 6.3 B |
| **Wan2.2 Video VAE** | Causal 3D ConvNet · 16×16×4 spatial-temporal compression · 48 latent channels | 2.7 GB |
| **LTX Audio VAE + Vocoder** | 128 latent channels · 25 tokens/sec · built-in waveform decoder | 348 MB |
| **umt5-xxl Text Encoder** | T5 · 4096-d embeddings | 11 GB |
| **ReDimNet** | Speaker embedding · 192-d | ~50 MB |

---

## Evaluation

### Table 1: Verse-Bench (general AV capability)

NAVA achieves the **best** AV synchronization (Sync-C / Sync-D), video quality, and audio WER, with the smallest parameter budget.

| Model | Params | Resolution | Sync-C ↑ | Sync-D ↓ | IB ↑ | Video Quality ↑ | WER ↓ | PQ ↑ | FD ↓ |
|---|---|---|---|---|---|---|---|---|---|
| Ovi 1.1 | 10 B | 720p | <u>7.4839</u> | 7.9791 | 0.199 | <u>0.636</u> | 0.102 | 5.8432 | 0.9418 |
| MOVA | A18B (32 B) | 720p | 7.2888 | 7.808 | 0.269 | 0.603 | 0.126 | **7.2331** | 0.9222 |
| Davinci | 15 B | 540p | 7.1487 | 7.8158 | 0.269 | 0.600 | 0.151 | 5.9559 | 0.9307 |
| LTX 2.3 | 19 B | 512p | 7.2476 | <u>7.6902</u> | **0.337** | 0.576 | 0.106 | <u>6.9459</u> | **0.8287** |
| **NAVA (ours)** | **6.3 B** | 720p | **7.7914** | **7.5655** | <u>0.313</u> | **0.659** | **0.099** | 6.8609 | <u>0.8328</u> |

<sub>↑ higher is better · ↓ lower is better · **bold** = best · <u>underline</u> = 2nd best.</sub>

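The parameter comparison can be checked directly against the Params column. The short sketch below divides each baseline's size by NAVA's 6.3 B; it uses the 32 B total for MOVA, and the function name is illustrative.

```python
# Parameter counts in billions, copied from Table 1 (MOVA uses its 32 B total).
BASELINE_PARAMS_B = {"Ovi 1.1": 10, "MOVA": 32, "Davinci": 15, "LTX 2.3": 19}
NAVA_PARAMS_B = 6.3


def size_ratios(baselines=BASELINE_PARAMS_B, ours=NAVA_PARAMS_B):
    """How many times larger each baseline is than NAVA, rounded to 2 decimals."""
    return {name: round(size / ours, 2) for name, size in baselines.items()}


if __name__ == "__main__":
    for name, ratio in size_ratios().items():
        print(f"{name}: {ratio}x")
```

The ratios span about 1.6× (Ovi 1.1) to about 5.1× (MOVA).
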
### Table 2: Seed-TTS-eval (speech quality)

Among joint AV models, NAVA delivers speech quality close to dedicated audio-only systems. Audio-only rows are listed *for reference*; they are not directly comparable.

| Category | Model | WER ↓ | Speaker Similarity ↑ |
|---|---|---|---|
| Audio-Only *(reference)* | CosyVoice | 4.29 | 60.9 |
| Audio-Only *(reference)* | Qwen2.5-Omni | 2.72 | 63.2 |
| Audio-Video Joint | DreamID-Omni | 33.44 | 34.1 |
| Audio-Video Joint | **NAVA (ours)** | **5.81** | **62.4** |

---

## How to Use

> **TL;DR command.** Once the setup in §1 is complete:
> ```bash
> bash scripts/inference.sh           # General T2AV
> bash scripts/inference_timbre.sh    # I2AV + timbre control
> ```
> Outputs land under `eval_results/`.

### 1 · Setup (once)

```bash
git clone https://github.com/ernie-research/NAVA && cd NAVA

# Python deps
pip install torch torchvision torchaudio
pip install diffusers transformers accelerate safetensors einops scipy PyYAML tqdm sentencepiece
pip install flash-attn --no-build-isolation

# All weights in one shot: main checkpoint + Wan2.2 VAE + T5 + LTX audio VAE
huggingface-cli download <NAVA-repo-id> --local-dir .
```

<details>
<summary><b>Expected on-disk layout</b></summary>

```
NAVA/
├── NAVA.ckpt                                                    # main checkpoint (24 GB)
├── Wan2.2-TI2V-5B/
│   ├── Wan2.2_VAE.pth                                           # 2.7 GB
│   ├── models_t5_umt5-xxl-enc-bf16.pth                          # 11 GB
│   └── google/umt5-xxl/{spiece.model, tokenizer.json}
├── params/
│   └── LTX2/
│       ├── ltx-2.3-22b-dev_audio_vae.safetensors                # 348 MB
│       └── LICENSE                                              # LTX-2 Community License
└── configs/                                                     # inference YAMLs
```

The LTX audio-VAE Python code is vendored under `nava_src/vendor/ltx_core/` (see its `NOTICE.md`), so no separate clone of the LTX-Video repo is needed. ReDimNet is fetched via `torch.hub` on first run.
</details>

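Before launching inference it can help to confirm that the download produced the expected layout. The following is a minimal sketch of such a check; the `missing_paths` helper is hypothetical and not shipped with the repo, and the path list is copied from the layout shown in the setup section.

```python
from pathlib import Path

# Relative paths taken from the expected on-disk layout in the setup section.
EXPECTED = [
    "NAVA.ckpt",
    "Wan2.2-TI2V-5B/Wan2.2_VAE.pth",
    "Wan2.2-TI2V-5B/models_t5_umt5-xxl-enc-bf16.pth",
    "Wan2.2-TI2V-5B/google/umt5-xxl/spiece.model",
    "Wan2.2-TI2V-5B/google/umt5-xxl/tokenizer.json",
    "params/LTX2/ltx-2.3-22b-dev_audio_vae.safetensors",
    "configs",
]


def missing_paths(root="."):
    """Return the expected files or directories that do not exist under root."""
    base = Path(root)
    return [rel for rel in EXPECTED if not (base / rel).exists()]


if __name__ == "__main__":
    missing = missing_paths()
    print("all files present" if not missing else "missing: " + ", ".join(missing))
```

Run it from the repository root after the download finishes; an empty result means every listed path exists, but it does not verify file sizes or checksums.
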
### 2 · One-command inference (recommended, 8-GPU SP)

The repo ships two end-to-end scripts that build a JSONL inline and launch SP=8 inference:

```bash
# General T2AV (text-only)
bash scripts/inference.sh

# I2AV + Timbre Control (first-frame image + reference voice)
bash scripts/inference_timbre.sh
```

Override defaults via env vars:

```bash
CKPT=/path/to/NAVA.ckpt OUT_DIR=eval_results/run1 bash scripts/inference.sh
TIMBRE_SCALE=3.0 SPK_WAV=/path/to/spk.wav    bash scripts/inference_timbre.sh
```

### 3 · Custom batches: write your own JSONL

Each line is one prompt:

```jsonl
{"prompt": "A man runs along the seashore while the camera follows him. The background is the sound of waves and wind."}
{"prompt": "A dialogue. <S>Hello<E><S>Hi there<E>", "spk_wavs": ["/abs/path/spk1.wav", "/abs/path/spk2.wav"]}
{"prompt": "The camera follows the subject...", "image_path": "/abs/path/first_frame.png"}
```

| Field | Required | Description |
|---|---|---|
| `prompt` | yes | Text caption (also accepts legacy `text` field name) |
| `image_path` | no | Absolute path to first-frame image; auto-enables I2V for this sample |
| `spk_wavs` | no | List of absolute paths to speaker reference WAVs (max 2) |

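For larger batches it may be easier to generate the JSONL programmatically. The sketch below validates each record against the field table (required `prompt` or legacy `text`, absolute paths, at most two speaker WAVs) and then writes one JSON object per line. The `validate_record` and `write_jsonl` helpers are hypothetical and not part of the NAVA repo; the inference script may apply different or additional checks.

```python
import json
import os

MAX_SPEAKERS = 2  # "max 2" in the field table


def validate_record(record: dict) -> list:
    """Return a list of problems; an empty list means the record looks usable."""
    problems = []
    prompt = record.get("prompt", record.get("text"))  # legacy `text` is accepted
    if not isinstance(prompt, str) or not prompt.strip():
        problems.append("missing prompt")
    image = record.get("image_path")
    if image is not None and not os.path.isabs(image):
        problems.append("image_path must be absolute")
    wavs = record.get("spk_wavs", [])
    if len(wavs) > MAX_SPEAKERS:
        problems.append("too many spk_wavs")
    if any(not os.path.isabs(w) for w in wavs):
        problems.append("spk_wavs must be absolute paths")
    return problems


def write_jsonl(records, path):
    """Write one JSON object per line, keeping non-ASCII prompt text readable."""
    with open(path, "w", encoding="utf-8") as fh:
        for rec in records:
            errors = validate_record(rec)
            if errors:
                raise ValueError(f"{rec!r}: {', '.join(errors)}")
            fh.write(json.dumps(rec, ensure_ascii=False) + "\n")


if __name__ == "__main__":
    write_jsonl(
        [{"prompt": "A dialogue. <S>Hello<E><S>Hi there<E>",
          "spk_wavs": ["/abs/path/spk1.wav", "/abs/path/spk2.wav"]}],
        "my_prompts.jsonl",
    )
```

Passing `ensure_ascii=False` keeps Chinese captions human-readable in the output file instead of escaping them.
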
Then launch:

```bash
SETUPTOOLS_USE_DISTUTILS=stdlib torchrun \
    --nnodes=1 --nproc_per_node=8 \
    --master_addr=127.0.0.1 --master_port=29507 \
    inference_nava.py \
    --config configs/baseline_t2av_demo_mmdit_no_split_ltx_control_unipc.yaml \
    --ckpt NAVA.ckpt \
    --out_dir ./outputs \
    --data_format json --data_file my_prompts.jsonl \
    --width 1280 --height 704 --frames 37 --fps 24 \
    --steps 50 --save_sample --gen_turn 1 --use_sp
```

Outputs land at `outputs/{save_path}-{gen_turn}_av.mp4`. For timbre-controlled samples, also pass `--timbre_cfg --timbre_align_guidance_scale 3.0`.

#### Mode cheatsheet

| Goal | JSONL fields | Extra flags |
|---|---|---|
| Text → AV | `prompt` | none |
| Image → AV | `prompt` + `image_path` | (auto-detected) |
| Timbre-controlled speech | `prompt` + `spk_wavs` | `--timbre_cfg --timbre_align_guidance_scale 3.0` |
| 9-second video | any | `--frames 55` |
| Single-GPU (slower) | any | omit `--use_sp` |

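The cheatsheet maps naturally onto a small command builder. The sketch below assembles the `torchrun` argument list from the flags shown in this section. The `build_command` helper is hypothetical, and it assumes that a single-GPU run still goes through `torchrun` with one process, which the README does not state. The `SETUPTOOLS_USE_DISTUTILS=stdlib` environment variable from the launch command is not included and would need to be set separately.

```python
import shlex

CONFIG = "configs/baseline_t2av_demo_mmdit_no_split_ltx_control_unipc.yaml"


def build_command(data_file, out_dir="./outputs", ckpt="NAVA.ckpt",
                  frames=37, timbre=False, use_sp=True):
    """Assemble the inference command as an argv list (hypothetical helper)."""
    # Assumption: single-GPU runs keep torchrun and drop to one process.
    nproc = 8 if use_sp else 1
    cmd = [
        "torchrun", "--nnodes=1", f"--nproc_per_node={nproc}",
        "--master_addr=127.0.0.1", "--master_port=29507",
        "inference_nava.py",
        "--config", CONFIG,
        "--ckpt", ckpt,
        "--out_dir", out_dir,
        "--data_format", "json", "--data_file", data_file,
        "--width", "1280", "--height", "704",
        "--frames", str(frames), "--fps", "24",
        "--steps", "50", "--save_sample", "--gen_turn", "1",
    ]
    if use_sp:
        cmd.append("--use_sp")  # omitted for single-GPU runs
    if timbre:
        cmd += ["--timbre_cfg", "--timbre_align_guidance_scale", "3.0"]
    return cmd


if __name__ == "__main__":
    # 9-second, timbre-controlled run on 8 GPUs.
    print(shlex.join(build_command("my_prompts.jsonl", frames=55, timbre=True)))
```

Returning a list rather than a string lets the result be passed straight to `subprocess.run` without shell quoting concerns.
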
### 4 · Prompt rewriting (recommended for short / English inputs)

NAVA is trained on Chinese dense captions; short or English prompts benefit substantially from rewriting before inference. Three pathways are provided, all sharing the same system prompt and sampling profile (so output style stays consistent), with `<S>...<E>` speech spans preserved verbatim.

| Pathway | Backend | Speed | Best for |
|---|---|---|---|
| **vLLM batch server** (`pe_src/`) | Qwen3-4B-Thinking-2507 served via vLLM, async HTTP | **< 2 s** / prompt | Offline batches |
| **Local transformers, single** (`gradio_demo/rewrite_single.py`) | Same model, in-process | 40–80 s / prompt | One-off CLI |
| **Gradio "Rewrite" button** | Same as above, hosted in Gradio | 40–80 s / prompt | Interactive UI |

```bash
# Batch path: start vLLM server, then rewrite a txt of prompts
bash pe_src/start_server.sh --gpu 0 --low-footprint
python pe_src/rewrite.py -i prompts.txt -o prompts_rewritten.txt
```

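Because the rewriter is expected to leave `<S>...<E>` speech spans untouched, a quick post-check can catch rewrites that altered the spoken text. The sketch below extracts the spans with a regular expression and compares them before and after rewriting. Both helpers are hypothetical and independent of the `pe_src/` tooling.

```python
import re

# Non-greedy match so adjacent spans such as <S>a<E><S>b<E> stay separate.
SPAN = re.compile(r"<S>(.*?)<E>", re.DOTALL)


def speech_spans(prompt: str) -> list:
    """Return the text of every <S>...<E> span, in order of appearance."""
    return SPAN.findall(prompt)


def spans_preserved(original: str, rewritten: str) -> bool:
    """True if the rewrite kept every speech span verbatim and in order."""
    return speech_spans(original) == speech_spans(rewritten)


if __name__ == "__main__":
    before = "A dialogue. <S>Hello<E><S>Hi there<E>"
    after = "Two people talk in a quiet cafe. <S>Hello<E><S>Hi there<E>"
    print(speech_spans(before), spans_preserved(before, after))
```

The same span count can also be compared with the length of `spk_wavs` when checking that each speaker reference has speech to bind to.
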
### 5 · Gradio Web UI

Interactive demo with click-to-rewrite (Qwen3-4B), image upload, and reference-WAV upload:

```bash
bash gradio_demo/start_gradio.sh \
    --config configs/baseline_t2av_demo_mmdit_no_split_ltx_control_unipc.yaml \
    --ckpt NAVA.ckpt \
    --rewrite_model pe_src/Qwen3-4B-Thinking-2507 \
    --port 8000 --nproc 8
```

<details>
<summary><b>Debug mode (no models, UI only)</b></summary>

```bash
python gradio_demo/gradio_server.py --debug --port 8000
```
</details>

---

## Bias, Safety, and Misuse

NAVA can synthesize video and speech conditioned on a reference image (`image_path`) and reference voice (`spk_wavs`). Using it to depict real persons without consent, including face-likeness or voice-likeness reproduction, is prohibited by the license and may also be illegal in your jurisdiction. We recommend:

1. Only use **consent-approved** reference media.
2. **Label generated content as synthetic.**
3. Apply **provenance / watermarking** before redistribution.

---

## Citation

```bibtex
@article{nava2026,
  title   = {NAVA: Native Audio-Visual Alignment for Joint Audio-Video Generation},
  author  = {ERNIE Team},
  journal = {arXiv preprint},
  year    = {2026},
}
```

## Acknowledgements

NAVA builds on excellent upstream work: **Wan2.2-TI2V-5B** (video backbone & VAE), **LTX 2.3** (audio VAE + built-in vocoder), **umt5-xxl** (text encoder), and **ReDimNet** (speaker embedding). We also thank the open-source AV-generation community (Ovi, MOVA, Davinci, LTX) for releasing strong baselines that made fair benchmarking possible.

## License & Contact

Released under **Apache-2.0**. For research / commercial inquiries, contact the **ERNIE team at Baidu Inc.**