license: apache-2.0
language:
- en
- zh
tags:
- text-to-video
- text-to-audio-video
- audio-video-generation
- mmdit
- flow-matching
- wan2.2
pipeline_tag: text-to-video
library_name: custom
base_model: Wan-AI/Wan2.2-TI2V-5B
NAVA: Native Audio-Visual Alignment for Generation
State-of-the-art audio-visual synchronization with only 6.3 B parameters.
ERNIE Team · Baidu Inc. · arXiv 2026
⭐ If you find this model useful, please consider giving our GitHub repo a star! ⭐
📖 README in Chinese (中文版)
TL;DR
NAVA is a 6.3 B-parameter joint audio-video generator that synthesizes synchronized video and audio from a single prompt, including multi-speaker speech with reference-timbre control and image-conditioned continuations.
Instead of post-hoc-aligned dual towers or fully unified tri-modal stacks, NAVA uses an Align-then-Fuse MMDiT: a dedicated alignment space first establishes audio-video correspondence, then context (text, speaker embeddings) is fused via cross-attention. On Verse-Bench it sets new SOTA on Sync-C / Sync-D / video quality / audio WER while using 2× to 5× fewer parameters than open-source baselines.
Highlights
- 720p 1-min Fast Generation: 720p synchronized audio-video in ~1 minute via 8-GPU Ulysses sequence parallel.
- Dual-Channel Audio: stereo audio (scene + speech) jointly denoised with video, no post-hoc vocoder alignment.
- Precise Multi-Timbre Control: reference WAVs bound to `<S>...<E>` speech spans for per-speaker voice identity.
- Language-Described Camera Control: shot composition, motion, and pacing directly from the prompt.
- Multi-Resolution: landscape / portrait / square aspect ratios from the same checkpoint.
Model Details
Quick Facts
| Property | Value |
|---|---|
| Architecture | Align-then-Fuse MMDiT (Wan2.2 backbone) |
| Parameters | 6.3 B (backbone, joint AV) |
| Modality | Joint audio + video, text-conditioned |
| Resolution | 1280×704 (recommended) · 960×960 also supported |
| Frames / FPS | 37 frames @ 24 fps ≈ 6 s · 55–61 frames ≈ 9–10 s |
| Audio | 25 latent tokens / sec, ≤ 10 s |
| Sampling | Flow matching · UniPC scheduler · 50 default steps |
| Precision | bf16 |
| Parallelism | Single-GPU or Ulysses sequence parallel (up to 8 GPUs) |
| Base model | Wan-AI/Wan2.2-TI2V-5B |
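The frame counts and durations in the table can be reconciled with a little arithmetic. The sketch below rests on an assumption that this card does not state: that the listed frame counts are latent frames, and that a 4× temporal compression in the video VAE maps `n` latent frames to `4 * (n - 1) + 1` decoded frames. Under that assumption, 37 frames at 24 fps works out to roughly 6 s, in line with the table. All function names are illustrative and not part of the NAVA codebase.

```python
FPS = 24                    # from the Quick Facts table
TEMPORAL_COMPRESSION = 4    # ASSUMPTION: temporal factor of the video VAE
AUDIO_TOKENS_PER_SEC = 25   # from the Quick Facts table


def output_frames(latent_frames: int) -> int:
    """ASSUMED causal-VAE mapping from latent frames to decoded frames."""
    return TEMPORAL_COMPRESSION * (latent_frames - 1) + 1


def duration_seconds(latent_frames: int, fps: int = FPS) -> float:
    """Clip duration implied by the assumed frame mapping."""
    return output_frames(latent_frames) / fps


def audio_latent_tokens(latent_frames: int) -> int:
    """Audio latent tokens needed to cover the same duration (rounded)."""
    return round(duration_seconds(latent_frames) * AUDIO_TOKENS_PER_SEC)


if __name__ == "__main__":
    for n in (37, 55, 61):
        print(n, output_frames(n), round(duration_seconds(n), 2), audio_latent_tokens(n))
```

If the frame counts turn out to be decoded frames rather than latent frames, this mapping does not apply and the durations should be taken from the table directly.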
Architecture
NAVA instantiates Native Audio-Visual Alignment as an Align-then-Fuse MMDiT stack:
- Hierarchical Alignment Layers: 10 double-stream blocks. Video and audio keep separate QKV projections and FFNs but share a joint self-attention over concatenated `[video_tokens; audio_tokens]`, plus dedicated cross-attention to text. This builds an alignment space where AV correspondence is learned without semantic context interference.
- Unified Fusion Layers: 20 single-stream blocks. Video and audio share QKV/FFN; a unified joint attention treats all tokens as one stream, with a single text cross-attention path. This is where context-conditioned denoising happens.
- Backbone hyperparameters. `dim=3072`, `ffn_dim=14336`, 24 attention heads, 30 layers (10 double + 20 single), `text_len=512`, patch size `(1, 2, 2)`. RMSNorm on QK; cross-attention norm; ε = 1e-6.
- Positional encoding. 3D RoPE for video (temporal + height + width), 1D RoPE for audio, applied jointly inside the joint-attention path.
- Timbre-in-Context Conditioning. Reference-WAV speaker embeddings (ReDimNet, 192-d) are injected through the context pathway and bound to `<S>...<E>` speech spans, enabling per-speaker timbre control in multi-speaker scenes.
- 3D cross-modal CFG. Independent classifier-free guidance scales for video, audio, and the cross-modal alignment direction (`video_align_guidance_scale`, `audio_align_guidance_scale`) keep AV synchronization tight at inference.
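To get a feel for the sequence that the joint attention operates on, the sketch below counts tokens for the recommended 1280×704 setting. It relies on several assumptions that this card does not spell out: that the 37 frames are latent frames, that the video VAE's 16×16 spatial compression is applied before the `(1, 2, 2)` patchify, that the audio track spans about 6 s, and that Ulysses sequence parallel splits tokens (outside attention) and heads (inside attention) evenly across GPUs. The helper names are hypothetical.

```python
import math


def video_tokens(width, height, latent_frames, vae_spatial=16, patch=(1, 2, 2)):
    """Video tokens after VAE spatial compression and (t, h, w) patchify."""
    pt, ph, pw = patch
    latent_h, latent_w = height // vae_spatial, width // vae_spatial
    return (latent_frames // pt) * (latent_h // ph) * (latent_w // pw)


def audio_tokens(seconds, tokens_per_sec=25):
    """Audio latent tokens at the rate listed in Quick Facts."""
    return round(seconds * tokens_per_sec)


def per_gpu_share(total_tokens, num_heads=24, gpus=8):
    """ASSUMED even Ulysses split: (tokens per GPU, attention heads per GPU)."""
    return math.ceil(total_tokens / gpus), num_heads // gpus


if __name__ == "__main__":
    v = video_tokens(1280, 704, 37)   # 37 * (704/16/2) * (1280/16/2)
    a = audio_tokens(6)
    tokens_per_gpu, heads_per_gpu = per_gpu_share(v + a)
    print(v, a, v + a, tokens_per_gpu, heads_per_gpu)
```

Under these assumptions the audio stream contributes well under 1% of the joint sequence, so the cost of joint attention is dominated by the video tokens.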
What's Different from Existing Open-Source AV Models
| Design axis | Typical baselines | NAVA |
|---|---|---|
| Stream layout | Dual-tower (post-hoc align) or fully unified tri-modal | Align-then-Fuse: alignment space first, context fused after |
| Speech control | Caption-only, no per-speaker timbre | Timbre-in-Context via reference WAVs |
| Param budget | 10 B – 32 B | 6.3 B |
Components Shipped Alongside the Backbone
| Component | Description | Size |
|---|---|---|
| WanAVModel (backbone) | MMDiT, joint AV attention | 6.3 B |
| Wan2.2 Video VAE | Causal 3D ConvNet · 16×16×4 spatial-temporal compression · 48 latent channels | 2.7 GB |
| LTX Audio VAE + Vocoder | 128 latent channels · 25 tokens/sec · built-in waveform decoder | 348 MB |
| umt5-xxl Text Encoder | T5 · 4096-d embeddings | 11 GB |
| ReDimNet | Speaker embedding Β· 192-d | ~50 MB |
Evaluation
Table 1: Verse-Bench (general AV capability)
NAVA achieves the best AV synchronization (Sync-C / Sync-D), video quality, and audio WER, with the smallest parameter budget.
| Model | Params | Resolution | Sync-C ↑ | Sync-D ↓ | IB ↑ | Video Quality ↑ | WER ↓ | PQ ↑ | FD ↓ |
|---|---|---|---|---|---|---|---|---|---|
| Ovi 1.1 | 10 B | 720p | <u>7.4839</u> | 7.9791 | 0.199 | <u>0.636</u> | <u>0.102</u> | 5.8432 | 0.9418 |
| MOVA | A18B (32 B) | 720p | 7.2888 | 7.808 | 0.269 | 0.603 | 0.126 | **7.2331** | 0.9222 |
| Davinci | 15 B | 540p | 7.1487 | 7.8158 | 0.269 | 0.600 | 0.151 | 5.9559 | 0.9307 |
| LTX 2.3 | 19 B | 512p | 7.2476 | <u>7.6902</u> | **0.337** | 0.576 | 0.106 | <u>6.9459</u> | **0.8287** |
| NAVA (ours) | 6.3 B | 720p | **7.7914** | **7.5655** | <u>0.313</u> | **0.659** | **0.099** | 6.8609 | <u>0.8328</u> |
↑ higher is better · ↓ lower is better · bold = best · underline = 2nd best.
Table 2: Seed-TTS-eval (speech quality)
Among joint AV models, NAVA delivers speech quality close to dedicated audio-only systems. Audio-only rows are listed for reference; they are not directly comparable.
| Category | Model | WER ↓ | Speaker Similarity ↑ |
|---|---|---|---|
| Audio-Only (reference) | CosyVoice | 4.29 | 60.9 |
| Audio-Only (reference) | Qwen2.5-Omni | 2.72 | 63.2 |
| Audio-Video Joint | DreamID-Omni | 33.44 | 34.1 |
| Audio-Video Joint | NAVA (ours) | 5.81 | 62.4 |
How to Use
TL;DR command. After §1 setup is complete:
bash scripts/inference.sh          # General T2AV
bash scripts/inference_timbre.sh   # I2AV + timbre control
Outputs land under eval_results/.
1 · Setup (once)
git clone https://github.com/ernie-research/NAVA && cd NAVA
# Python deps
pip install torch torchvision torchaudio
pip install diffusers transformers accelerate safetensors einops scipy PyYAML tqdm sentencepiece
pip install flash-attn --no-build-isolation
# All weights in one shot: main checkpoint + Wan2.2 VAE + T5 + LTX audio VAE
huggingface-cli download <NAVA-repo-id> --local-dir .
Expected on-disk layout
NAVA/
├── NAVA.ckpt                                        # main checkpoint (24 GB)
├── Wan2.2-TI2V-5B/
│   ├── Wan2.2_VAE.pth                               # 2.7 GB
│   ├── models_t5_umt5-xxl-enc-bf16.pth              # 11 GB
│   └── google/umt5-xxl/{spiece.model, tokenizer.json}
├── params/
│   └── LTX2/
│       ├── ltx-2.3-22b-dev_audio_vae.safetensors    # 348 MB
│       └── LICENSE                                  # LTX-2 Community License
└── configs/                                         # inference YAMLs
The LTX audio-VAE Python code is vendored under nava_src/vendor/ltx_core/ (see its NOTICE.md), so no separate clone of the LTX-Video repo is needed. ReDimNet is fetched via torch.hub on first run.
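Before launching a multi-GPU job it can be worth confirming that the download produced the layout above. The following is a minimal sketch, not a script shipped with the repo: `missing_files` is a hypothetical helper, and the relative paths are transcribed from the tree above (including the assumption that the audio VAE sits under `params/LTX2/`).

```python
from pathlib import Path

# Relative paths transcribed from the expected on-disk layout in this card.
EXPECTED_FILES = [
    "NAVA.ckpt",
    "Wan2.2-TI2V-5B/Wan2.2_VAE.pth",
    "Wan2.2-TI2V-5B/models_t5_umt5-xxl-enc-bf16.pth",
    "Wan2.2-TI2V-5B/google/umt5-xxl/spiece.model",
    "Wan2.2-TI2V-5B/google/umt5-xxl/tokenizer.json",
    "params/LTX2/ltx-2.3-22b-dev_audio_vae.safetensors",
]


def missing_files(root="."):
    """Return the expected files that are not present under `root`."""
    root = Path(root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_files(".")
    if missing:
        print("Missing files:")
        for rel in missing:
            print("  " + rel)
    else:
        print("All expected weight files are present.")
```

Run it from the repository root after the download finishes; it only checks that the files exist, not their sizes or integrity.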
2 · One-command inference (recommended, 8 GPU SP)
The repo ships two end-to-end scripts that build a JSONL inline and launch SP=8 inference:
# General T2AV (text-only)
bash scripts/inference.sh
# I2AV + Timbre Control (first-frame image + reference voice)
bash scripts/inference_timbre.sh
Override defaults via env vars:
CKPT=/path/to/NAVA.ckpt OUT_DIR=eval_results/run1 bash scripts/inference.sh
TIMBRE_SCALE=3.0 SPK_WAV=/path/to/spk.wav bash scripts/inference_timbre.sh
3 · Custom batches: write your own JSONL
Each line is one prompt:
{"prompt": "A man runs along the seashore while the camera follows him. The background is the sound of waves and wind."}
{"prompt": "Two people in conversation <S>Hello<E><S>Hi there<E>", "spk_wavs": ["spk1.wav", "spk2.wav"]}
{"prompt": "The camera follows the subject...", "image_path": "/abs/path/first_frame.png"}
| Field | Required | Description |
|---|---|---|
| `prompt` | yes | Text caption (also accepts legacy `text` field name) |
| `image_path` | no | Absolute path to first-frame image; auto-enables I2V for this sample |
| `spk_wavs` | no | List of absolute paths to speaker reference WAVs (max 2) |
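For larger batches the JSONL can be generated programmatically. The sketch below checks each sample against the field table before writing it. `validate_sample` and `write_jsonl` are hypothetical helpers rather than part of the NAVA repo, and treating the absolute-path wording as a hard requirement is an assumption (it is stricter than the relative `spk1.wav` names in the example lines).

```python
import json
import os

MAX_SPEAKERS = 2  # "max 2" in the field table


def validate_sample(sample: dict) -> dict:
    """Check one sample against the documented fields; raise ValueError if invalid."""
    prompt = sample.get("prompt", sample.get("text"))  # legacy `text` is accepted
    if not isinstance(prompt, str) or not prompt.strip():
        raise ValueError("each sample needs a non-empty 'prompt'")
    image_path = sample.get("image_path")
    if image_path is not None and not os.path.isabs(image_path):
        raise ValueError(f"image_path must be absolute: {image_path!r}")
    spk_wavs = sample.get("spk_wavs", [])
    if len(spk_wavs) > MAX_SPEAKERS:
        raise ValueError(f"at most {MAX_SPEAKERS} speaker WAVs are supported")
    for wav in spk_wavs:
        if not os.path.isabs(wav):
            raise ValueError(f"spk_wavs entries must be absolute: {wav!r}")
    return sample


def write_jsonl(samples, path):
    """Write validated samples to `path`, one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for sample in samples:
            # ensure_ascii=False keeps Chinese captions human-readable in the file
            f.write(json.dumps(validate_sample(sample), ensure_ascii=False) + "\n")
```

For example, `write_jsonl([{"prompt": "..."}], "my_prompts.jsonl")` produces a file that can be passed to `--data_file`.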
Then launch:
SETUPTOOLS_USE_DISTUTILS=stdlib torchrun \
--nnodes=1 --nproc_per_node=8 \
--master_addr=127.0.0.1 --master_port=29507 \
inference_nava.py \
--config configs/baseline_t2av_demo_mmdit_no_split_ltx_control_unipc.yaml \
--ckpt NAVA.ckpt \
--out_dir ./outputs \
--data_format json --data_file my_prompts.jsonl \
--width 1280 --height 704 --frames 37 --fps 24 \
--steps 50 --save_sample --gen_turn 1 --use_sp
Outputs land at outputs/{save_path}-{gen_turn}_av.mp4. For timbre-controlled samples, also pass --timbre_cfg --timbre_align_guidance_scale 3.0.
Mode cheatsheet
| Goal | JSONL fields | Extra flags |
|---|---|---|
| Text → AV | `prompt` | none |
| Image → AV | `prompt` + `image_path` | (auto-detected) |
| Timbre-controlled speech | `prompt` + `spk_wavs` | `--timbre_cfg --timbre_align_guidance_scale 3.0` |
| 9-second video | any | `--frames 55` |
| Single-GPU (slower) | any | omit `--use_sp` |
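The cheatsheet can be folded into a small helper when launches are scripted. This is a sketch under stated assumptions: `mode_flags` is a hypothetical function, it only knows the two frame counts listed in this card, and it enables the timbre flags for the whole run as soon as any sample carries `spk_wavs` (whether mixed batches behave well is not documented here).

```python
def mode_flags(samples, seconds=6, use_sp=True, timbre_scale=3.0):
    """Return the mode-dependent CLI flags from the cheatsheet as an argv fragment."""
    frames_for = {6: 37, 9: 55}  # frame counts listed in this card
    if seconds not in frames_for:
        raise ValueError("this sketch only knows the 6 s and 9 s settings")
    flags = ["--frames", str(frames_for[seconds])]
    # ASSUMPTION: timbre flags apply to the whole run if any sample has reference WAVs.
    if any(s.get("spk_wavs") for s in samples):
        flags += ["--timbre_cfg", "--timbre_align_guidance_scale", str(timbre_scale)]
    # Image-conditioned samples need no flag: `image_path` is auto-detected.
    if use_sp:
        flags.append("--use_sp")  # omit for single-GPU runs
    return flags


if __name__ == "__main__":
    batch = [{"prompt": "Two people talking <S>Hello<E>", "spk_wavs": ["/abs/spk1.wav"]}]
    print(" ".join(mode_flags(batch, seconds=9)))
```

The returned list is meant to replace the `--frames ... --use_sp` portion of the `torchrun` command, not to be appended after it.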
4 · Prompt rewriting (recommended for short / English inputs)
NAVA is trained on Chinese dense captions; short or English prompts benefit substantially from rewriting before inference. Three pathways are provided, all sharing the same system prompt and sampling profile (so output style stays consistent), with <S>...<E> speech spans preserved verbatim.
| Pathway | Backend | Speed | Best for |
|---|---|---|---|
| vLLM batch server (`pe_src/`) | Qwen3-4B-Thinking-2507 served via vLLM, async HTTP | < 2 s / prompt | Offline batches |
| Local transformers, single (`gradio_demo/rewrite_single.py`) | Same model, in-process | 40–80 s / prompt | One-off CLI |
| Gradio "Rewrite" button | Same as above, hosted in Gradio | 40–80 s / prompt | Interactive UI |
# Batch path: start vLLM server, then rewrite a txt of prompts
bash pe_src/start_server.sh --gpu 0 --low-footprint
python pe_src/rewrite.py -i prompts.txt -o prompts_rewritten.txt
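Because timbre control depends on the `<S>...<E>` spans surviving the rewrite unchanged, a quick check after rewriting can catch a span that was altered or dropped. The sketch below is illustrative only: the helper names are hypothetical, and `mismatched_lines` assumes both files hold one prompt per line in the same order, which this card does not confirm for `pe_src/rewrite.py`.

```python
import re

# Non-greedy match so adjacent spans like <S>a<E><S>b<E> are captured separately.
SPEECH_SPAN = re.compile(r"<S>(.*?)<E>", re.DOTALL)


def speech_spans(prompt: str) -> list:
    """Return the text of every <S>...<E> span, in order of appearance."""
    return SPEECH_SPAN.findall(prompt)


def spans_preserved(original: str, rewritten: str) -> bool:
    """True if the rewrite kept the same spans, verbatim and in the same order."""
    return speech_spans(original) == speech_spans(rewritten)


def mismatched_lines(original_path: str, rewritten_path: str) -> list:
    """1-based line numbers whose speech spans changed (ASSUMES line-aligned files)."""
    with open(original_path, encoding="utf-8") as f_in:
        originals = f_in.read().splitlines()
    with open(rewritten_path, encoding="utf-8") as f_out:
        rewrites = f_out.read().splitlines()
    return [
        i
        for i, (orig, new) in enumerate(zip(originals, rewrites), start=1)
        if not spans_preserved(orig, new)
    ]
```

Calling `mismatched_lines("prompts.txt", "prompts_rewritten.txt")` returns an empty list when every span survived.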
5 · Gradio Web UI
Interactive demo with click-to-rewrite (Qwen3-4B), image upload, and reference-WAV upload:
bash gradio_demo/start_gradio.sh \
--config configs/baseline_t2av_demo_mmdit_no_split_ltx_control_unipc.yaml \
--ckpt NAVA.ckpt \
--rewrite_model pe_src/Qwen3-4B-Thinking-2507 \
--port 8000 --nproc 8
Debug mode (no models, UI only)
python gradio_demo/gradio_server.py --debug --port 8000
Bias, Safety, and Misuse
NAVA can synthesize video and speech conditioned on a reference image (`image_path`) and reference voice (`spk_wavs`). Using it to depict real persons without consent, including face-likeness or voice-likeness reproduction, is prohibited by the license and may also be illegal in your jurisdiction. We recommend:
- Only use consent-approved reference media.
- Label generated content as synthetic.
- Apply provenance / watermarking before redistribution.
Citation
@article{nava2026,
title = {NAVA: Native Audio-Visual Alignment for Joint Audio-Video Generation},
author = {ERNIE Team},
journal = {arXiv preprint},
year = {2026},
}
Acknowledgements
NAVA builds on excellent upstream work: Wan2.2-TI2V-5B (video backbone & VAE), LTX 2.3 (audio VAE + built-in vocoder), umt5-xxl (text encoder), and ReDimNet (speaker embedding). We also thank the open-source AV-generation community (Ovi, MOVA, Davinci, LTX) for releasing strong baselines that made fair benchmarking possible.
License & Contact
Released under Apache-2.0. For research / commercial inquiries, contact the ERNIE team at Baidu Inc.