| license: apache-2.0 | |
| language: | |
| - en | |
| - zh | |
| tags: | |
| - text-to-video | |
| - text-to-audio-video | |
| - audio-video-generation | |
| - mmdit | |
| - flow-matching | |
| - wan2.2 | |
| pipeline_tag: text-to-video | |
| library_name: custom | |
| base_model: Wan-AI/Wan2.2-TI2V-5B | |
| <p align="center"> | |
| <img src="assets/logo.png" alt="NAVA" width="160"> | |
| </p> | |
| <h1 align="center">NAVA: Native Audio-Visual Alignment for Generation</h1> | |
| <p align="center"> | |
| <em>State-of-the-art audio-visual synchronization with only <b>6.3 B</b> parameters.</em> | |
| </p> | |
| <p align="center"> | |
| <a href="https://arxiv.org/abs/2605.30073"><img alt="arXiv" src="https://img.shields.io/badge/Paper-arXiv-b31b1b.svg"></a> | |
| <a href="https://github.com/ernie-research/NAVA"><img alt="Code" src="https://img.shields.io/badge/Code-GitHub-181717.svg"></a> | |
| <a href="https://ernie-research.github.io/NAVA/"><img alt="Project Page" src="https://img.shields.io/badge/Project_Page-online-2c8ebb.svg"></a> | |
| <img alt="License" src="https://img.shields.io/badge/license-Apache--2.0-green.svg"> | |
| <img alt="Params" src="https://img.shields.io/badge/params-6.3B-orange.svg"> | |
| <img alt="Base model" src="https://img.shields.io/badge/base-Wan2.2--TI2V--5B-7c5cff.svg"> | |
| </p> | |
| <p align="center"> | |
| <b>ERNIE Team</b> · Baidu Inc. · arXiv 2026 | |
| </p> | |
| <p align="center"> | |
| <b>If you find this model useful, please consider giving our <a href="https://github.com/ernie-research/NAVA">GitHub repo</a> a star!</b> | |
| </p> | |
| <p align="center"> | |
| <a href="https://huggingface.co/baidu/NAVA/blob/main/README_zh.md"><b>Chinese README (中文版)</b></a> | |
| </p> | |
| --- | |
| ## TL;DR | |
| NAVA is a **6.3 B-parameter joint audio-video generator** that synthesizes synchronized video **and** audio from a single prompt, including multi-speaker speech with reference-timbre control and image-conditioned continuations. | |
| Instead of post-hoc-aligned dual towers or fully unified tri-modal stacks, NAVA uses an **Align-then-Fuse MMDiT**: a dedicated alignment space first establishes audio-video correspondence, then context (text, speaker embeddings) is fused via cross-attention. On Verse-Bench it sets a new state of the art on Sync-C, Sync-D, video quality, and audio WER while using **2× to 5× fewer parameters** than open-source baselines. | |
| > **Highlights** | |
| > - **720p 1-min Fast Generation**: 720p synchronized audio-video in ~1 minute via 8-GPU Ulysses sequence parallelism. | |
| > - **Dual-Channel Audio**: stereo audio (scene + speech) jointly denoised with video, with no post-hoc vocoder alignment. | |
| > - **Precise Multi-Timbre Control**: reference WAVs bound to `<S>...<E>` speech spans for per-speaker voice identity. | |
| > - **Language-Described Camera Control**: shot composition, motion, and pacing directly from the prompt. | |
| > - **Multi-Resolution**: landscape / portrait / square aspect ratios from the same checkpoint. | |
| --- | |
| ## Model Details | |
| ### Quick Facts | |
| | | | | |
| |---|---| | |
| | **Architecture** | Align-then-Fuse MMDiT (Wan2.2 backbone) | | |
| | **Parameters** | **6.3 B** (backbone, joint AV) | | |
| | **Modality** | Joint audio + video, text-conditioned | | |
| **Resolution** | 1280×704 (recommended) · 960×960 also supported | |
| **Frames / FPS** | 37 frames @ 24 fps ≈ 6 s · 55-61 frames ≈ 9-10 s | |
| **Audio** | 25 latent tokens / sec, ≤ 10 s | |
| **Sampling** | Flow matching · UniPC scheduler · 50 default steps | |
| | **Precision** | bf16 | | |
| | **Parallelism** | Single-GPU **or** Ulysses sequence parallel (up to 8 GPUs) | | |
| | **Base model** | [Wan-AI/Wan2.2-TI2V-5B](https://huggingface.co/Wan-AI/Wan2.2-TI2V-5B) | | |
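The duration figures in the table can be reproduced with a little arithmetic. The sketch below assumes that the frame counts (37, 55, 61) are latent frames and that the causal video VAE maps T latent frames to 4·(T − 1) + 1 pixel frames. That mapping is a common convention for causal 3D VAEs, but it is an assumption here, not something the table states. The function names are illustrative and not part of the NAVA code.

```python
def pixel_frames(latent_frames: int, temporal_stride: int = 4) -> int:
    """Assumed causal-VAE mapping: the first latent decodes to 1 frame,
    every later latent decodes to `temporal_stride` frames."""
    return (latent_frames - 1) * temporal_stride + 1


def clip_stats(width: int, height: int, latent_frames: int, fps: int = 24) -> dict:
    spatial_stride = 16        # 16x16 spatial compression of the video VAE
    patch_h, patch_w = 2, 2    # backbone patch size (1, 2, 2)
    audio_tokens_per_sec = 25  # audio latent rate from the table

    frames = pixel_frames(latent_frames)
    seconds = frames / fps
    tokens_per_latent_frame = (
        (height // spatial_stride // patch_h) * (width // spatial_stride // patch_w)
    )
    return {
        "pixel_frames": frames,
        "seconds": round(seconds, 2),
        "video_tokens": latent_frames * tokens_per_latent_frame,
        "audio_tokens": round(seconds * audio_tokens_per_sec),
    }


for t in (37, 55, 61):
    print(t, clip_stats(1280, 704, t))
```

Under these assumptions, 37 latent frames decode to 145 pixel frames (about 6.04 s at 24 fps), and 55 and 61 latent frames give about 9.04 s and 10.04 s, which matches the table's rounded values.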
| ### Architecture | |
| <p align="center"> | |
| <img src="assets/arch.png" alt="NAVA Architecture" width="900"> | |
| </p> | |
| NAVA instantiates *Native Audio-Visual Alignment* as an **Align-then-Fuse MMDiT** stack: | |
| - **Hierarchical Alignment Layers: 10 double-stream blocks.** Video and audio keep separate QKV projections and FFNs but share a joint self-attention over concatenated `[video_tokens; audio_tokens]`, plus dedicated cross-attention to text. This builds an alignment space where AV correspondence is learned without semantic context interference. | |
| - **Unified Fusion Layers: 20 single-stream blocks.** Video and audio share QKV/FFN; a unified joint attention treats all tokens as one stream, with a single text cross-attention path. This is where context-conditioned denoising happens. | |
| - **Backbone hyperparameters.** `dim=3072`, `ffn_dim=14336`, 24 attention heads, 30 layers (10 double + 20 single), `text_len=512`, patch size `(1, 2, 2)`. RMSNorm on QK; cross-attention norm; ε = 1e-6. | |
| - **Positional encoding.** 3D RoPE for video (temporal + height + width), 1D RoPE for audio, applied jointly inside the joint-attention path. | |
| - **Timbre-in-Context Conditioning.** Reference-WAV speaker embeddings (ReDimNet, 192-d) are injected through the context pathway and bound to `<S>...<E>` speech spans, enabling per-speaker timbre control in multi-speaker scenes. | |
| - **3D cross-modal CFG.** Independent classifier-free guidance scales for video, audio, and the cross-modal alignment direction (`video_align_guidance_scale`, `audio_align_guidance_scale`) keep AV synchronization tight at inference. | |
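As a quick sanity check on the hyperparameters listed above, they can be collected into a small config object and used to derive the per-head dimension and the block order. This is only a sketch: the class and field names are hypothetical and do not mirror the actual `WanAVModel` constructor, and the "all double-stream blocks first" ordering is inferred from the Align-then-Fuse description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackboneConfig:
    # Values copied from the hyperparameter bullet; field names are illustrative.
    dim: int = 3072
    ffn_dim: int = 14336
    num_heads: int = 24
    num_double_blocks: int = 10  # Hierarchical Alignment Layers
    num_single_blocks: int = 20  # Unified Fusion Layers
    text_len: int = 512
    patch_size: tuple = (1, 2, 2)
    eps: float = 1e-6

    @property
    def head_dim(self) -> int:
        assert self.dim % self.num_heads == 0, "dim must divide evenly across heads"
        return self.dim // self.num_heads

    def block_layout(self) -> list:
        """Align-then-Fuse order: double-stream blocks first, then single-stream."""
        return ["double"] * self.num_double_blocks + ["single"] * self.num_single_blocks


cfg = BackboneConfig()
layout = cfg.block_layout()
print(cfg.head_dim, len(layout), layout[9], layout[10])
```

With `dim=3072` and 24 heads, each head is 128-dimensional, and the layout has 30 entries that switch from `double` to `single` after the tenth block.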
| ### What's Different from Existing Open-Source AV Models | |
| | Design axis | Typical baselines | **NAVA** | | |
| |---|---|---| | |
| Stream layout | Dual-tower (post-hoc align) **or** fully unified tri-modal | **Align-then-Fuse**: alignment space first, context fused after | |
| Speech control | Caption-only, no per-speaker timbre | **Timbre-in-Context** via reference WAVs | |
| Param budget | 10 B to 32 B | **6.3 B** | |
| ### Components Shipped Alongside the Backbone | |
| | Component | Description | Size | | |
| |---|---|---| | |
| | **WanAVModel** (backbone) | MMDiT, joint AV attention | 6.3 B | | |
| **Wan2.2 Video VAE** | Causal 3D ConvNet · 16×16×4 spatio-temporal compression · 48 latent channels | 2.7 GB | |
| **LTX Audio VAE + Vocoder** | 128 latent channels · 25 tokens/sec · built-in waveform decoder | 348 MB | |
| **umt5-xxl Text Encoder** | T5 · 4096-d embeddings | 11 GB | |
| | **ReDimNet** | Speaker embedding Β· 192-d | ~50 MB | | |
| --- | |
| ## Evaluation | |
| ### Table 1: Verse-Bench (general AV capability) | |
| NAVA achieves the **best** AV synchronization (Sync-C / Sync-D), video quality, and audio WER, with the smallest parameter budget. | |
| Model | Params | Resolution | Sync-C ↑ | Sync-D ↓ | IB ↑ | Video Quality ↑ | WER ↓ | PQ ↑ | FD ↓ | |
| |---|---|---|---|---|---|---|---|---|---| | |
| | Ovi 1.1 | 10 B | 720p | <u>7.4839</u> | 7.9791 | 0.199 | <u>0.636</u> | 0.102 | 5.8432 | 0.9418 | | |
| | MOVA | A18B (32 B) | 720p | 7.2888 | 7.808 | 0.269 | 0.603 | 0.126 | **7.2331** | 0.9222 | | |
| | Davinci | 15 B | 540p | 7.1487 | 7.8158 | 0.269 | 0.600 | 0.151 | 5.9559 | 0.9307 | | |
| | LTX 2.3 | 19 B | 512p | 7.2476 | <u>7.6902</u> | **0.337** | 0.576 | 0.106 | <u>6.9459</u> | **0.8287** | | |
| | **NAVA (ours)** | **6.3 B** | 720p | **7.7914** | **7.5655** | <u>0.313</u> | **0.659** | **0.099** | 6.8609 | <u>0.8328</u> | | |
| <sub>↑ higher is better · ↓ lower is better · **bold** = best · <u>underline</u> = 2nd best.</sub> | |
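To put the parameter budgets in Table 1 side by side, the short sketch below divides each baseline's parameter count by NAVA's 6.3 B. It assumes the total count (32 B) for MOVA, not its active count; the variable names are illustrative.

```python
# Parameter counts in billions, copied from the "Params" column of Table 1.
baseline_params_b = {"Ovi 1.1": 10, "Davinci": 15, "LTX 2.3": 19, "MOVA": 32}
NAVA_PARAMS_B = 6.3

# Ratio of each baseline's size to NAVA's size.
ratios = {name: round(p / NAVA_PARAMS_B, 2) for name, p in baseline_params_b.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio}x the parameters of NAVA")
```

The ratios span roughly 1.6× (Ovi 1.1) to 5.1× (MOVA, total parameters).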
| ### Table 2: Seed-TTS-eval (speech quality) | |
| Among joint AV models, NAVA delivers speech quality close to dedicated audio-only systems. Audio-only rows are listed *for reference*; they are not directly comparable. | |
| Category | Model | WER ↓ | Speaker Similarity ↑ | |
| |---|---|---|---| | |
| | Audio-Only *(reference)* | CosyVoice | 4.29 | 60.9 | | |
| | Audio-Only *(reference)* | Qwen2.5-Omni | 2.72 | 63.2 | | |
| | Audio-Video Joint | DreamID-Omni | 33.44 | 34.1 | | |
| | Audio-Video Joint | **NAVA (ours)** | **5.81** | **62.4** | | |
| --- | |
| ## How to Use | |
| > **TL;DR command.** After the setup in §1 is complete: | |
| > ```bash | |
| > bash scripts/inference.sh # General T2AV | |
| > bash scripts/inference_timbre.sh # I2AV + timbre control | |
| > ``` | |
| > Outputs land under `eval_results/`. | |
| ### 1 · Setup (once) | |
| ```bash | |
| git clone https://github.com/ernie-research/NAVA && cd NAVA | |
| # Python deps | |
| pip install torch torchvision torchaudio | |
| pip install diffusers transformers accelerate safetensors einops scipy PyYAML tqdm sentencepiece | |
| pip install flash-attn --no-build-isolation | |
| # All weights in one shot: main checkpoint + Wan2.2 VAE + T5 + LTX audio VAE | |
| huggingface-cli download <NAVA-repo-id> --local-dir . | |
| ``` | |
| <details> | |
| <summary><b>Expected on-disk layout</b></summary> | |
| ``` | |
| NAVA/ | |
| ├── NAVA.ckpt # main checkpoint (24 GB) | |
| ├── Wan2.2-TI2V-5B/ | |
| │   ├── Wan2.2_VAE.pth # 2.7 GB | |
| │   ├── models_t5_umt5-xxl-enc-bf16.pth # 11 GB | |
| │   └── google/umt5-xxl/{spiece.model, tokenizer.json} | |
| ├── params/ | |
| │   └── LTX2/ | |
| │       ├── ltx-2.3-22b-dev_audio_vae.safetensors # 348 MB | |
| │       └── LICENSE # LTX-2 Community License | |
| └── configs/ # inference YAMLs | |
| ``` | |
| The LTX audio-VAE Python code is vendored under `nava_src/vendor/ltx_core/` (see its `NOTICE.md`), so no separate clone of the LTX-Video repo is needed. ReDimNet is fetched via `torch.hub` on first run. | |
| </details> | |
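Before launching a long run, it can help to confirm that the download produced the expected files. The sketch below checks for the files named in the layout listing; the nesting of `google/umt5-xxl/` under `Wan2.2-TI2V-5B/` and of `LTX2/` under `params/` is read off that listing and should be treated as an assumption. The helper name is hypothetical, and file sizes are not verified.

```python
from pathlib import Path

# Relative paths taken from the expected on-disk layout (sizes are not checked).
EXPECTED_FILES = [
    "NAVA.ckpt",
    "Wan2.2-TI2V-5B/Wan2.2_VAE.pth",
    "Wan2.2-TI2V-5B/models_t5_umt5-xxl-enc-bf16.pth",
    "Wan2.2-TI2V-5B/google/umt5-xxl/spiece.model",
    "Wan2.2-TI2V-5B/google/umt5-xxl/tokenizer.json",
    "params/LTX2/ltx-2.3-22b-dev_audio_vae.safetensors",
]


def missing_files(root: str = ".") -> list:
    """Return the expected files that are absent under `root`."""
    base = Path(root)
    return [rel for rel in EXPECTED_FILES if not (base / rel).is_file()]


missing = missing_files(".")
print("all files present" if not missing else f"missing: {missing}")
```

Run it from the repository root after the download step; an empty result means every listed file was found.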
| ### 2 · One-command inference (recommended, 8-GPU SP) | |
| The repo ships two end-to-end scripts that build a JSONL file inline and launch 8-way sequence-parallel (SP=8) inference: | |
| ```bash | |
| # General T2AV (text-only) | |
| bash scripts/inference.sh | |
| # I2AV + Timbre Control (first-frame image + reference voice) | |
| bash scripts/inference_timbre.sh | |
| ``` | |
| Override defaults via env vars: | |
| ```bash | |
| CKPT=/path/to/NAVA.ckpt OUT_DIR=eval_results/run1 bash scripts/inference.sh | |
| TIMBRE_SCALE=3.0 SPK_WAV=/path/to/spk.wav bash scripts/inference_timbre.sh | |
| ``` | |
| ### 3 · Custom batches: write your own JSONL | |
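| Each line is one sample. The example prompts are Chinese captions; in English they read "A man runs along the seashore, the camera following. The background is the sound of waves and wind.", "Two people talking" (followed by two `<S>...<E>` speech spans), and "The camera follows the subject...": | |
| ```jsonl | |
| {"prompt": "一位男子在海边奔跑，镜头跟随。背景是海浪声和风声。"} | |
| {"prompt": "两人对话<S>Hello<E><S>Hi there<E>", "spk_wavs": ["spk1.wav", "spk2.wav"]} | |
| {"prompt": "镜头跟随主体...", "image_path": "/abs/path/first_frame.png"} | |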
| ``` | |
| | Field | Required | Description | | |
| |---|---|---| | |
| | `prompt` | yes | Text caption (also accepts legacy `text` field name) | | |
| `image_path` | no | Absolute path to first-frame image; auto-enables I2V for this sample | |
| | `spk_wavs` | no | List of absolute paths to speaker reference WAVs (max 2) | | |
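The field rules in the table can be enforced when generating the JSONL programmatically. The following is a minimal sketch, not part of the NAVA repo: `make_record` and `write_jsonl` are hypothetical helper names, and the checks simply mirror the table (required `prompt`, absolute paths, at most two speaker WAVs).

```python
import json
import os


def make_record(prompt, image_path=None, spk_wavs=None):
    """Build one JSONL record, validating it against the field table."""
    if not prompt or not prompt.strip():
        raise ValueError("prompt is required")
    record = {"prompt": prompt}
    if image_path is not None:
        if not os.path.isabs(image_path):
            raise ValueError("image_path must be an absolute path")
        record["image_path"] = image_path
    if spk_wavs:
        if len(spk_wavs) > 2:
            raise ValueError("at most 2 speaker reference WAVs are supported")
        if not all(os.path.isabs(p) for p in spk_wavs):
            raise ValueError("spk_wavs entries must be absolute paths")
        record["spk_wavs"] = list(spk_wavs)
    return record


def write_jsonl(records, path):
    """Write one JSON object per line; keep non-ASCII prompts readable."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


records = [
    make_record("A man runs along the seashore while the camera follows."),
    make_record("Two people talk <S>Hello<E><S>Hi there<E>",
                spk_wavs=["/abs/path/spk1.wav", "/abs/path/spk2.wav"]),
]
for record in records:
    print(json.dumps(record, ensure_ascii=False))
```

Calling `write_jsonl(records, "my_prompts.jsonl")` produces a file that can be passed to `--data_file`.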
| Then launch: | |
| ```bash | |
| SETUPTOOLS_USE_DISTUTILS=stdlib torchrun \ | |
| --nnodes=1 --nproc_per_node=8 \ | |
| --master_addr=127.0.0.1 --master_port=29507 \ | |
| inference_nava.py \ | |
| --config configs/baseline_t2av_demo_mmdit_no_split_ltx_control_unipc.yaml \ | |
| --ckpt NAVA.ckpt \ | |
| --out_dir ./outputs \ | |
| --data_format json --data_file my_prompts.jsonl \ | |
| --width 1280 --height 704 --frames 37 --fps 24 \ | |
| --steps 50 --save_sample --gen_turn 1 --use_sp | |
| ``` | |
| Outputs land at `outputs/{save_path}-{gen_turn}_av.mp4`. For timbre-controlled samples, also pass `--timbre_cfg --timbre_align_guidance_scale 3.0`. | |
| #### Mode cheatsheet | |
| | Goal | JSONL fields | Extra flags | | |
| |---|---|---| | |
| Text → AV | `prompt` | (none) | |
| Image → AV | `prompt` + `image_path` | (auto-detected) | |
| | Timbre-controlled speech | `prompt` + `spk_wavs` | `--timbre_cfg --timbre_align_guidance_scale 3.0` | | |
| | 9-second video | any | `--frames 55` | | |
| | Single-GPU (slower) | any | omit `--use_sp` | | |
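The cheatsheet can also be expressed as a small command builder, which is convenient when scripting several runs. This sketch only reuses flags that appear in the launch command above; `build_command` is a hypothetical helper, and using one process for the single-GPU case is an assumption (the cheatsheet only says to omit `--use_sp`).

```python
import shlex

CONFIG = "configs/baseline_t2av_demo_mmdit_no_split_ltx_control_unipc.yaml"


def build_command(data_file, ckpt="NAVA.ckpt", out_dir="./outputs",
                  frames=37, use_sp=True, timbre=False, nproc=8):
    """Assemble the torchrun argument list for one inference run."""
    # Assumption: without sequence parallelism, launch a single process.
    procs = nproc if use_sp else 1
    cmd = [
        "torchrun", "--nnodes=1", f"--nproc_per_node={procs}",
        "--master_addr=127.0.0.1", "--master_port=29507",
        "inference_nava.py",
        "--config", CONFIG,
        "--ckpt", ckpt,
        "--out_dir", out_dir,
        "--data_format", "json", "--data_file", data_file,
        "--width", "1280", "--height", "704",
        "--frames", str(frames), "--fps", "24",
        "--steps", "50", "--save_sample", "--gen_turn", "1",
    ]
    if use_sp:
        cmd.append("--use_sp")
    if timbre:
        cmd += ["--timbre_cfg", "--timbre_align_guidance_scale", "3.0"]
    return cmd


# Example: a 9-second, timbre-controlled run on 8 GPUs.
print(shlex.join(build_command("my_prompts.jsonl", frames=55, timbre=True)))
```

To execute the result, pass the list to `subprocess.run(cmd, check=True)` with `SETUPTOOLS_USE_DISTUTILS=stdlib` set in the environment, as in the shell command above.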
| ### 4 · Prompt rewriting (recommended for short / English inputs) | |
| NAVA is trained on Chinese dense captions; short or English prompts benefit substantially from rewriting before inference. Three pathways are provided, all sharing the same system prompt and sampling profile (so output style stays consistent), with `<S>...<E>` speech spans preserved verbatim. | |
| | Pathway | Backend | Speed | Best for | | |
| |---|---|---|---| | |
| **vLLM batch server** (`pe_src/`) | Qwen3-4B-Thinking-2507 served via vLLM, async HTTP | **< 2 s** / prompt | Offline batches | |
| **Local transformers, single** (`gradio_demo/rewrite_single.py`) | Same model, in-process | 40-80 s / prompt | One-off CLI | |
| **Gradio "Rewrite" button** | Same as above, hosted in Gradio | 40-80 s / prompt | Interactive UI | |
| ```bash | |
| # Batch path: start vLLM server, then rewrite a txt of prompts | |
| bash pe_src/start_server.sh --gpu 0 --low-footprint | |
| python pe_src/rewrite.py -i prompts.txt -o prompts_rewritten.txt | |
| ``` | |
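Because the rewriting step is expected to keep `<S>...<E>` speech spans verbatim, a quick post-check on the rewritten prompts can catch cases where a span was altered or dropped. The sketch below is an independent check written for illustration, not part of `pe_src/`; the function names are hypothetical.

```python
import re

# Non-greedy match so adjacent spans such as <S>a<E><S>b<E> are split correctly.
SPAN_RE = re.compile(r"<S>(.*?)<E>", re.DOTALL)


def speech_spans(prompt: str) -> list:
    """Return the text inside each <S>...<E> span, in order."""
    return SPAN_RE.findall(prompt)


def spans_preserved(original: str, rewritten: str) -> bool:
    """True when the rewritten prompt carries the same spans, unchanged and in order."""
    return speech_spans(original) == speech_spans(rewritten)


original = "Two people talk <S>Hello<E><S>Hi there<E>"
rewritten = "In a quiet cafe, a man says <S>Hello<E> and a woman replies <S>Hi there<E>."
print(speech_spans(original))
print(spans_preserved(original, rewritten))
```

Any prompt for which `spans_preserved` returns `False` can be sent back through the rewriter or left in its original form.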
| ### 5 · Gradio Web UI | |
| Interactive demo with click-to-rewrite (Qwen3-4B), image upload, and reference-WAV upload: | |
| ```bash | |
| bash gradio_demo/start_gradio.sh \ | |
| --config configs/baseline_t2av_demo_mmdit_no_split_ltx_control_unipc.yaml \ | |
| --ckpt NAVA.ckpt \ | |
| --rewrite_model pe_src/Qwen3-4B-Thinking-2507 \ | |
| --port 8000 --nproc 8 | |
| ``` | |
| <details> | |
| <summary><b>Debug mode (no models, UI only)</b></summary> | |
| ```bash | |
| python gradio_demo/gradio_server.py --debug --port 8000 | |
| ``` | |
| </details> | |
| --- | |
| ## Bias, Safety, and Misuse | |
| NAVA can synthesize video and speech conditioned on a reference image (`image_path`) and reference voice (`spk_wavs`). Using it to depict real persons without consent, including face-likeness or voice-likeness reproduction, is prohibited by the license and may also be illegal in your jurisdiction. We recommend: | |
| 1. Only use **consent-approved** reference media. | |
| 2. **Label generated content as synthetic.** | |
| 3. Apply **provenance / watermarking** before redistribution. | |
| --- | |
| ## Citation | |
| ```bibtex | |
| @article{nava2026, | |
| title = {NAVA: Native Audio-Visual Alignment for Joint Audio-Video Generation}, | |
| author = {ERNIE Team}, | |
| journal = {arXiv preprint arXiv:2605.30073}, | |
| year = {2026}, | |
| } | |
| ``` | |
| ## Acknowledgements | |
| NAVA builds on excellent upstream work: **Wan2.2-TI2V-5B** (video backbone & VAE), **LTX 2.3** (audio VAE + built-in vocoder), **umt5-xxl** (text encoder), and **ReDimNet** (speaker embedding). We also thank the open-source AV-generation community (Ovi, MOVA, Davinci, LTX) for releasing strong baselines that made fair benchmarking possible. | |
| ## License & Contact | |
| Released under **Apache-2.0**. For research / commercial inquiries, contact the **ERNIE team at Baidu Inc.** | |