metadata
license: apache-2.0
language:
  - en
  - zh
tags:
  - text-to-video
  - text-to-audio-video
  - audio-video-generation
  - mmdit
  - flow-matching
  - wan2.2
pipeline_tag: text-to-video
library_name: custom
base_model: Wan-AI/Wan2.2-TI2V-5B


NAVA: Native Audio-Visual Alignment for Generation

State-of-the-art audio-visual synchronization with only 6.3 B parameters.


ERNIE Team · Baidu Inc. · arXiv 2026

⭐ If you find this model useful, please consider giving our GitHub repo a star! ⭐

📖 Chinese README


TL;DR

NAVA is a 6.3 B-parameter joint audio-video generator that synthesizes synchronized video and audio from a single prompt, including multi-speaker speech with reference-timbre control and image-conditioned continuations.

Instead of post-hoc-aligned dual towers or fully unified tri-modal stacks, NAVA uses an Align-then-Fuse MMDiT: a dedicated alignment space first establishes audio-video correspondence, then context (text, speaker embeddings) is fused via cross-attention. On Verse-Bench it sets new SOTA on Sync-C / Sync-D / video quality / audio WER while using 2× to 5× fewer parameters than open-source baselines.

Highlights

  • 720p 1-min Fast Generation: 720p synchronized audio-video in ~1 minute via 8-GPU Ulysses sequence parallelism.
  • Dual-Channel Audio: stereo audio (scene + speech) jointly denoised with video, no post-hoc vocoder alignment.
  • Precise Multi-Timbre Control: reference WAVs bound to <S>...<E> speech spans for per-speaker voice identity.
  • Language-Described Camera Control: shot composition, motion, and pacing directly from the prompt.
  • Multi-Resolution: landscape / portrait / square aspect ratios from the same checkpoint.

Model Details

Quick Facts

| Property | Value |
|---|---|
| Architecture | Align-then-Fuse MMDiT (Wan2.2 backbone) |
| Parameters | 6.3 B (backbone, joint AV) |
| Modality | Joint audio + video, text-conditioned |
| Resolution | 1280×704 (recommended) · 960×960 also supported |
| Frames / FPS | 37 frames @ 24 fps ≈ 6 s · 55–61 frames ≈ 9–10 s |
| Audio | 25 latent tokens / sec, ≤ 10 s |
| Sampling | Flow matching · UniPC scheduler · 50 default steps |
| Precision | bf16 |
| Parallelism | Single-GPU or Ulysses sequence parallel (up to 8 GPUs) |
| Base model | Wan-AI/Wan2.2-TI2V-5B |
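
The frame counts and durations in the table only line up if the frame count is read as latent frames rather than pixel frames. The sketch below shows one reading that reproduces the numbers. It assumes a causal video VAE with 4× temporal compression (the figure listed for the Wan2.2 Video VAE in this README), in which the first latent frame decodes to one pixel frame and each later latent frame to four. The function names are illustrative and are not part of the NAVA codebase.

```python
def pixel_frames(latent_frames: int, temporal_stride: int = 4) -> int:
    """Pixel frames decoded from `latent_frames` under a causal video VAE.

    Assumption: the first latent frame maps to 1 pixel frame and every
    later latent frame maps to `temporal_stride` pixel frames.
    """
    if latent_frames < 1:
        raise ValueError("latent_frames must be >= 1")
    return (latent_frames - 1) * temporal_stride + 1


def clip_seconds(latent_frames: int, fps: int = 24, temporal_stride: int = 4) -> float:
    """Clip duration in seconds at the given playback frame rate."""
    return pixel_frames(latent_frames, temporal_stride) / fps


for n in (37, 55, 61):
    print(f"{n} latent frames -> {pixel_frames(n)} pixel frames -> {clip_seconds(n):.2f} s")
# 37 latent frames -> 145 pixel frames -> 6.04 s
# 55 latent frames -> 217 pixel frames -> 9.04 s
# 61 latent frames -> 241 pixel frames -> 10.04 s
```

Under this reading, 61 frames is also the largest setting that stays within the 10 s audio limit to within rounding.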

Architecture

NAVA Architecture

NAVA instantiates Native Audio-Visual Alignment as an Align-then-Fuse MMDiT stack:

  • Hierarchical Alignment Layers: 10 double-stream blocks. Video and audio keep separate QKV projections and FFNs but share a joint self-attention over concatenated [video_tokens; audio_tokens], plus dedicated cross-attention to text. This builds an alignment space where AV correspondence is learned without semantic context interference.
  • Unified Fusion Layers: 20 single-stream blocks. Video and audio share QKV/FFN; a unified joint attention treats all tokens as one stream, with a single text cross-attention path. This is where context-conditioned denoising happens.
  • Backbone hyperparameters. dim=3072, ffn_dim=14336, 24 attention heads, 30 layers (10 double + 20 single), text_len=512, patch size (1, 2, 2). RMSNorm on QK; cross-attention norm; ε = 1e-6.
  • Positional encoding. 3D RoPE for video (temporal + height + width), 1D RoPE for audio, applied jointly inside the joint-attention path.
  • Timbre-in-Context Conditioning. Reference-WAV speaker embeddings (ReDimNet, 192-d) are injected through the context pathway and bound to <S>...<E> speech spans, enabling per-speaker timbre control in multi-speaker scenes.
  • 3D cross-modal CFG. Independent classifier-free guidance scales for video, audio, and the cross-modal alignment direction (video_align_guidance_scale, audio_align_guidance_scale) keep AV synchronization tight at inference.
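
To make the double-stream idea in the first bullet concrete, here is a toy, dependency-free sketch of joint self-attention with per-modality projections. It is not the NAVA implementation: it uses a single head, tiny dimensions, and omits RoPE, QK RMSNorm, the FFNs, and text cross-attention. All names (joint_attention, w_video, w_audio) are hypothetical.

```python
import math
import random


def matmul(a, b):
    """Plain-Python matrix product of two row-major nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax_rows(matrix):
    """Numerically stable softmax applied to each row."""
    out = []
    for row in matrix:
        top = max(row)
        exps = [math.exp(x - top) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def joint_attention(video, audio, w_video, w_audio):
    """Separate Q/K/V projections per modality, one attention over all tokens."""
    # Each modality is projected with its own weights; list concatenation
    # stacks the results along the token axis: [video_tokens; audio_tokens].
    q = matmul(video, w_video["q"]) + matmul(audio, w_audio["q"])
    k = matmul(video, w_video["k"]) + matmul(audio, w_audio["k"])
    v = matmul(video, w_video["v"]) + matmul(audio, w_audio["v"])
    scale = 1.0 / math.sqrt(len(q[0]))
    scores = [[scale * sum(a * b for a, b in zip(qi, kj)) for kj in k] for qi in q]
    weights = softmax_rows(scores)  # every token attends to both modalities
    out = matmul(weights, v)
    n_video = len(video)
    return out[:n_video], out[n_video:], weights  # split back into two streams


rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


dim = 4
video, audio = rand_matrix(6, dim), rand_matrix(3, dim)
w_video = {name: rand_matrix(dim, dim) for name in "qkv"}
w_audio = {name: rand_matrix(dim, dim) for name in "qkv"}
video_out, audio_out, weights = joint_attention(video, audio, w_video, w_audio)
print(len(video_out), len(audio_out), len(weights), len(weights[0]))  # 6 3 9 9
```

In a single-stream block of the kind described in the second bullet, w_video and w_audio would be the same set of weights.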

What's Different from Existing Open-Source AV Models

| Design axis | Typical baselines | NAVA |
|---|---|---|
| Stream layout | Dual-tower (post-hoc align) or fully unified tri-modal | Align-then-Fuse: alignment space first, context fused after |
| Speech control | Caption-only, no per-speaker timbre | Timbre-in-Context via reference WAVs |
| Param budget | 10 B – 32 B | 6.3 B |

Components Shipped Alongside the Backbone

| Component | Description | Size |
|---|---|---|
| WanAVModel (backbone) | MMDiT, joint AV attention | 6.3 B |
| Wan2.2 Video VAE | Causal 3D ConvNet · 16×16×4 spatial-temporal compression · 48 latent channels | 2.7 GB |
| LTX Audio VAE + Vocoder | 128 latent channels · 25 tokens/sec · built-in waveform decoder | 348 MB |
| umt5-xxl Text Encoder | T5 · 4096-d embeddings | 11 GB |
| ReDimNet | Speaker embedding · 192-d | ~50 MB |
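
As a rough sizing aid, the compression figures above give an estimate of how many tokens the backbone attends over. The sketch assumes 16× spatial compression in each dimension, the (1, 2, 2) patch size from the backbone hyperparameters, a frame count expressed in latent frames, and 25 audio latent tokens per second. The helper names are illustrative, and the exact rounding used by the real pipeline is not stated in this README.

```python
import math


def video_tokens(width, height, latent_frames, spatial_stride=16, patch=(1, 2, 2)):
    """Video tokens after VAE compression and patchification (estimate)."""
    pt, ph, pw = patch
    latent_h, latent_w = height // spatial_stride, width // spatial_stride
    return (latent_frames // pt) * (latent_h // ph) * (latent_w // pw)


def audio_tokens(seconds, tokens_per_second=25):
    """Audio latent tokens for a clip of the given length (rounded up)."""
    return math.ceil(seconds * tokens_per_second)


print(video_tokens(1280, 704, 37))  # 32560
print(video_tokens(960, 960, 37))   # 33300
print(audio_tokens(6.0))            # 150
print(audio_tokens(10.0))           # 250
```

Under these assumptions the two supported resolutions land on nearly the same video token count, and the audio stream contributes well under 1% of the joint sequence.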

Evaluation

Table 1: Verse-Bench (general AV capability)

NAVA achieves the best AV synchronization (Sync-C / Sync-D), video quality, and audio WER, with the smallest parameter budget.

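| Model | Params | Resolution | Sync-C ↑ | Sync-D ↓ | IB ↑ | Video Quality ↑ | WER ↓ | PQ ↑ | FD ↓ |
|---|---|---|---|---|---|---|---|---|---|
| Ovi 1.1 | 10 B | 720p | <u>7.4839</u> | 7.9791 | 0.199 | <u>0.636</u> | <u>0.102</u> | 5.8432 | 0.9418 |
| MOVA | A18B (32 B) | 720p | 7.2888 | 7.808 | 0.269 | 0.603 | 0.126 | **7.2331** | 0.9222 |
| Davinci | 15 B | 540p | 7.1487 | 7.8158 | 0.269 | 0.600 | 0.151 | 5.9559 | 0.9307 |
| LTX 2.3 | 19 B | 512p | 7.2476 | <u>7.6902</u> | **0.337** | 0.576 | 0.106 | <u>6.9459</u> | **0.8287** |
| NAVA (ours) | 6.3 B | 720p | **7.7914** | **7.5655** | <u>0.313</u> | **0.659** | **0.099** | 6.8609 | <u>0.8328</u> |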

↑ higher is better · ↓ lower is better · bold = best · underline = 2nd best.

Table 2: Seed-TTS-eval (speech quality)

Among joint AV models, NAVA delivers speech quality close to dedicated audio-only systems. Audio-only rows are listed for reference; they are not directly comparable.

| Category | Model | WER ↓ | Speaker Similarity ↑ |
|---|---|---|---|
| Audio-Only (reference) | CosyVoice | 4.29 | 60.9 |
| Audio-Only (reference) | Qwen2.5-Omni | 2.72 | 63.2 |
| Audio-Video Joint | DreamID-Omni | 33.44 | 34.1 |
| Audio-Video Joint | NAVA (ours) | 5.81 | 62.4 |

How to Use

TL;DR command. After the §1 setup is complete:

bash scripts/inference.sh           # General T2AV
bash scripts/inference_timbre.sh    # I2AV + timbre control

Outputs land under eval_results/.

1 · Setup (once)

git clone https://github.com/ernie-research/NAVA && cd NAVA

# Python deps
pip install torch torchvision torchaudio
pip install diffusers transformers accelerate safetensors einops scipy PyYAML tqdm sentencepiece
pip install flash-attn --no-build-isolation

# All weights in one shot: main checkpoint + Wan2.2 VAE + T5 + LTX audio VAE
huggingface-cli download <NAVA-repo-id> --local-dir .

Expected on-disk layout:

NAVA/
├── NAVA.ckpt                                                    # main checkpoint (24 GB)
├── Wan2.2-TI2V-5B/
│   ├── Wan2.2_VAE.pth                                           # 2.7 GB
│   ├── models_t5_umt5-xxl-enc-bf16.pth                          # 11 GB
│   └── google/umt5-xxl/{spiece.model, tokenizer.json}
├── params/
│   └── LTX2/
│       ├── ltx-2.3-22b-dev_audio_vae.safetensors                # 348 MB
│       └── LICENSE                                              # LTX-2 Community License
└── configs/                                                     # inference YAMLs

The LTX audio-VAE Python code is vendored under nava_src/vendor/ltx_core/ (see its NOTICE.md), so no separate clone of the LTX-Video repo is needed. ReDimNet is fetched via torch.hub on first run.
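
Before launching a long run it can help to confirm that the download produced the layout above. The following is a small convenience sketch, not a script shipped with the repo: the EXPECTED list is transcribed from the tree in this README, and missing_files is a hypothetical helper name.

```python
from pathlib import Path

# Paths transcribed from the expected on-disk layout in this README.
EXPECTED = [
    "NAVA.ckpt",
    "Wan2.2-TI2V-5B/Wan2.2_VAE.pth",
    "Wan2.2-TI2V-5B/models_t5_umt5-xxl-enc-bf16.pth",
    "Wan2.2-TI2V-5B/google/umt5-xxl/spiece.model",
    "Wan2.2-TI2V-5B/google/umt5-xxl/tokenizer.json",
    "params/LTX2/ltx-2.3-22b-dev_audio_vae.safetensors",
    "configs",
]


def missing_files(root="."):
    """Return the expected relative paths that do not exist under `root`."""
    base = Path(root)
    return [rel for rel in EXPECTED if not (base / rel).exists()]


missing = missing_files(".")
if missing:
    print("Missing:", ", ".join(missing))
else:
    print("All expected files are present.")
```

Run it from the repository root after the download step; an empty result means every path in the tree was found.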

2 · One-command inference (recommended, 8-GPU SP)

The repo ships two end-to-end scripts that build a JSONL inline and launch SP=8 inference:

# General T2AV (text-only)
bash scripts/inference.sh

# I2AV + Timbre Control (first-frame image + reference voice)
bash scripts/inference_timbre.sh

Override defaults via env vars:

CKPT=/path/to/NAVA.ckpt OUT_DIR=eval_results/run1 bash scripts/inference.sh
TIMBRE_SCALE=3.0 SPK_WAV=/path/to/spk.wav    bash scripts/inference_timbre.sh

3 · Custom batches: write your own JSONL

Each line is one prompt:

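{"prompt": "一位男子在海边奔跑,镜头跟随。背景是海浪声和风声。"}
{"prompt": "一人对话<S>Hello<E><S>Hi there<E>", "spk_wavs": ["spk1.wav", "spk2.wav"]}
{"prompt": "镜头跟随主体...", "image_path": "/abs/path/first_frame.png"}

In English, the first prompt reads "A man runs along the seashore, followed by the camera. The background is the sound of waves and wind."; the second opens with "一人对话" (literally "one-person dialogue") before its two speech spans; the third begins "The camera follows the subject...".

| Field | Required | Description |
|---|---|---|
| prompt | yes | Text caption (the legacy field name text is also accepted) |
| image_path | no | Absolute path to first-frame image; auto-enables I2V for this sample |
| spk_wavs | no | List of absolute paths to speaker reference WAVs (max 2) |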
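
A short script can generate the JSONL file and catch the most common field mistakes before a multi-GPU job starts. This is a minimal sketch based only on the field table above: validate_sample and write_jsonl are hypothetical helpers, the sample paths are placeholders, and the real loader may accept or reject inputs differently.

```python
import json
import os


def validate_sample(sample: dict) -> list:
    """Return a list of problems found in one JSONL record (empty if none)."""
    problems = []
    prompt = sample.get("prompt", sample.get("text"))  # legacy name: text
    if not isinstance(prompt, str) or not prompt.strip():
        problems.append("prompt is required and must be a non-empty string")
    image_path = sample.get("image_path")
    if image_path is not None and not os.path.isabs(image_path):
        problems.append("image_path must be an absolute path")
    wavs = sample.get("spk_wavs", [])
    if len(wavs) > 2:
        problems.append("spk_wavs accepts at most 2 reference WAVs")
    for wav in wavs:
        if not os.path.isabs(wav):
            problems.append(f"spk_wavs entry is not absolute: {wav}")
    return problems


def write_jsonl(samples, path):
    """Validate every record, then write one JSON object per line."""
    for index, sample in enumerate(samples):
        problems = validate_sample(sample)
        if problems:
            raise ValueError(f"sample {index}: {'; '.join(problems)}")
    with open(path, "w", encoding="utf-8") as handle:
        for sample in samples:
            # ensure_ascii=False keeps Chinese prompts readable in the file
            handle.write(json.dumps(sample, ensure_ascii=False) + "\n")


samples = [
    {"prompt": "A man runs along the seashore while the camera follows."},
    {"prompt": "Two friends greet each other. <S>Hello<E><S>Hi there<E>",
     "spk_wavs": ["/abs/path/spk1.wav", "/abs/path/spk2.wav"]},  # placeholder paths
]
write_jsonl(samples, "my_prompts.jsonl")
```

The check is deliberately stricter than the examples above about absolute paths, following the field table.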

Then launch:

SETUPTOOLS_USE_DISTUTILS=stdlib torchrun \
    --nnodes=1 --nproc_per_node=8 \
    --master_addr=127.0.0.1 --master_port=29507 \
    inference_nava.py \
    --config configs/baseline_t2av_demo_mmdit_no_split_ltx_control_unipc.yaml \
    --ckpt NAVA.ckpt \
    --out_dir ./outputs \
    --data_format json --data_file my_prompts.jsonl \
    --width 1280 --height 704 --frames 37 --fps 24 \
    --steps 50 --save_sample --gen_turn 1 --use_sp

Outputs land at outputs/{save_path}-{gen_turn}_av.mp4. For timbre-controlled samples, also pass --timbre_cfg --timbre_align_guidance_scale 3.0.

Mode cheatsheet

| Goal | JSONL fields | Extra flags |
|---|---|---|
| Text → AV | prompt | (none) |
| Image → AV | prompt + image_path | (auto-detected) |
| Timbre-controlled speech | prompt + spk_wavs | --timbre_cfg --timbre_align_guidance_scale 3.0 |
| 9-second video | any | --frames 55 |
| Single-GPU (slower) | any | omit --use_sp |
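
The cheatsheet can also be expressed as a small command builder, which is handy when sweeping over modes from a script. The flags are copied from the launch command in this README. The function name build_command is hypothetical, and keeping torchrun with a single process for the single-GPU case is an assumption, since the README only says to omit --use_sp.

```python
import shlex

CONFIG = "configs/baseline_t2av_demo_mmdit_no_split_ltx_control_unipc.yaml"


def build_command(data_file, ckpt="NAVA.ckpt", out_dir="./outputs",
                  frames=37, timbre=False, nproc=8):
    """Assemble the inference argv for one of the modes in the cheatsheet."""
    cmd = [
        "torchrun", "--nnodes=1", f"--nproc_per_node={nproc}",
        "--master_addr=127.0.0.1", "--master_port=29507",
        "inference_nava.py",
        "--config", CONFIG,
        "--ckpt", ckpt,
        "--out_dir", out_dir,
        "--data_format", "json", "--data_file", data_file,
        "--width", "1280", "--height", "704",
        "--frames", str(frames), "--fps", "24",
        "--steps", "50", "--save_sample", "--gen_turn", "1",
    ]
    if timbre:
        # Extra flags for samples that carry spk_wavs
        cmd += ["--timbre_cfg", "--timbre_align_guidance_scale", "3.0"]
    if nproc > 1:
        # Sequence parallel only when more than one GPU is used (assumption)
        cmd.append("--use_sp")
    return cmd


# 9-second, timbre-controlled run on 8 GPUs
print(shlex.join(build_command("my_prompts.jsonl", frames=55, timbre=True)))
```

The SETUPTOOLS_USE_DISTUTILS=stdlib variable from the launch command is not part of the argv and still has to be set in the environment.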

4 · Prompt rewriting (recommended for short / English inputs)

NAVA is trained on Chinese dense captions; short or English prompts benefit substantially from rewriting before inference. Three pathways are provided, all sharing the same system prompt and sampling profile (so output style stays consistent), with <S>...<E> speech spans preserved verbatim.

| Pathway | Backend | Speed | Best for |
|---|---|---|---|
| vLLM batch server (pe_src/) | Qwen3-4B-Thinking-2507 served via vLLM, async HTTP | < 2 s / prompt | Offline batches |
| Local transformers, single (gradio_demo/rewrite_single.py) | Same model, in-process | 40–80 s / prompt | One-off CLI |
| Gradio "Rewrite" button | Same as above, hosted in Gradio | 40–80 s / prompt | Interactive UI |

# Batch path: start vLLM server, then rewrite a txt of prompts
bash pe_src/start_server.sh --gpu 0 --low-footprint
python pe_src/rewrite.py -i prompts.txt -o prompts_rewritten.txt
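
Because timbre control depends on the <S>...<E> spans surviving the rewrite unchanged, a quick sanity check on rewritten prompts can be useful. The sketch below is independent of the repo's rewriting code: it simply compares the ordered list of span contents before and after rewriting. The names speech_spans and spans_preserved are hypothetical.

```python
import re

# Non-greedy match so adjacent spans such as <S>a<E><S>b<E> stay separate.
SPEECH_SPAN = re.compile(r"<S>(.*?)<E>", re.DOTALL)


def speech_spans(prompt: str) -> list:
    """Return the text inside each <S>...<E> span, in order of appearance."""
    return SPEECH_SPAN.findall(prompt)


def spans_preserved(original: str, rewritten: str) -> bool:
    """True if the rewrite kept every speech span verbatim and in order."""
    return speech_spans(original) == speech_spans(rewritten)


original = "Two people talk. <S>Hello<E><S>Hi there<E>"
good = "In a sunlit cafe, two friends face each other. <S>Hello<E> The other smiles. <S>Hi there<E>"
bad = "In a sunlit cafe, two friends face each other. <S>Hello!<E><S>Hi there<E>"

print(speech_spans(original))            # ['Hello', 'Hi there']
print(spans_preserved(original, good))   # True
print(spans_preserved(original, bad))    # False
```

A rewritten prompt that fails the check can be sent through the rewriter again or replaced with the original.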

5 · Gradio Web UI

Interactive demo with click-to-rewrite (Qwen3-4B), image upload, and reference-WAV upload:

bash gradio_demo/start_gradio.sh \
    --config configs/baseline_t2av_demo_mmdit_no_split_ltx_control_unipc.yaml \
    --ckpt NAVA.ckpt \
    --rewrite_model pe_src/Qwen3-4B-Thinking-2507 \
    --port 8000 --nproc 8

Debug mode (no models, UI only):

python gradio_demo/gradio_server.py --debug --port 8000

Bias, Safety, and Misuse

NAVA can synthesize video and speech conditioned on a reference image (image_path) and reference voice (spk_wavs). Using it to depict real persons without consent, including face-likeness or voice-likeness reproduction, is prohibited by the license and may also be illegal in your jurisdiction. We recommend:

  1. Only use consent-approved reference media.
  2. Label generated content as synthetic.
  3. Apply provenance / watermarking before redistribution.

Citation

@article{nava2026,
  title   = {NAVA: Native Audio-Visual Alignment for Joint Audio-Video Generation},
  author  = {ERNIE Team},
  journal = {arXiv preprint},
  year    = {2026},
}

Acknowledgements

NAVA builds on excellent upstream work: Wan2.2-TI2V-5B (video backbone & VAE), LTX 2.3 (audio VAE + built-in vocoder), umt5-xxl (text encoder), and ReDimNet (speaker embedding). We also thank the open-source AV-generation community (Ovi, MOVA, Davinci, LTX) for releasing strong baselines that made fair benchmarking possible.

License & Contact

Released under Apache-2.0. For research / commercial inquiries, contact the ERNIE team at Baidu Inc.