---
license: apache-2.0
pipeline_tag: any-to-any
tags:
  - image-to-video
  - image-text-to-video
  - image-to-audio-video
  - image-text-to-audio-video
  - MOVA
  - OpenMOSS
  - SII
  - MOSI
  - sglang-diffusion
---

# MOVA: Towards Scalable and Synchronized Video–Audio Generation

We introduce MOVA (MOSS Video and Audio), a foundation model designed to break the "silent era" of open-source video generation. Unlike cascaded pipelines that generate sound as an afterthought, MOVA synthesizes video and audio simultaneously for perfect alignment.

## 🌟 Key Highlights

  • Native Bimodal Generation: Moves beyond clunky cascaded pipelines. MOVA generates high-fidelity video and synchronized audio in a single inference pass, eliminating error accumulation.
  • Precise Lip-Sync & Sound FX: Achieves state-of-the-art performance in multilingual lip-synchronization and environment-aware sound effects.
  • Fully Open-Source: In a field dominated by closed-source models (Sora 2, Veo 3, Kling), we are releasing model weights, inference code, training pipelines, and LoRA fine-tuning scripts.
  • Asymmetric Dual-Tower Architecture: Leverages the power of pre-trained video and audio towers, fused via a bidirectional cross-attention mechanism for rich modality interaction.

## Demo

## Model Details

### Model Description

MOVA addresses the limitations of proprietary systems such as Sora 2 and Veo 3 by offering a fully open-source framework for Image-Text-to-Video-Audio (IT2VA) and Text-to-Video-Audio (T2VA) tasks. The model employs an asymmetric dual-tower architecture fused via a bidirectional cross-attention mechanism, using a Mixture-of-Experts (MoE) design with 32B total parameters (18B active during inference) to deliver high-quality synthesis with efficient deployment. Alongside the model weights, we provide a fine-grained bimodal data pipeline and LoRA fine-tuning support, enabling the community to advance research in synchronized cinematic synthesis.
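The bidirectional cross-attention fusion described above can be sketched as two cross-attention passes, one in each direction, with residual connections back into each tower. This is a minimal illustrative sketch: the module names, hidden sizes, and head counts below are assumptions for exposition, not MOVA's actual implementation.

```python
import torch
import torch.nn as nn

class BidirectionalCrossAttention(nn.Module):
    """Sketch of dual-tower fusion: video and audio token streams attend to
    each other. Dimensions here are placeholders, not MOVA's real config."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        # v2a carries video information into the audio tower, a2v the reverse.
        self.v2a = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.a2v = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_v = nn.LayerNorm(dim)
        self.norm_a = nn.LayerNorm(dim)

    def forward(self, video: torch.Tensor, audio: torch.Tensor):
        # Video queries attend over audio keys/values, and vice versa;
        # residual connections preserve each tower's own features.
        v_out, _ = self.a2v(self.norm_v(video), audio, audio)
        a_out, _ = self.v2a(self.norm_a(audio), video, video)
        return video + v_out, audio + a_out

fusion = BidirectionalCrossAttention()
v = torch.randn(2, 100, 512)  # (batch, video tokens, dim)
a = torch.randn(2, 250, 512)  # (batch, audio tokens, dim)
v_fused, a_fused = fusion(v, a)
assert v_fused.shape == (2, 100, 512) and a_fused.shape == (2, 250, 512)
```

Note that the token counts of the two streams can differ (video and audio run at different rates); cross-attention handles the mismatch naturally since each query sequence keeps its own length.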

### Model Sources

## Model Usage

Please refer to the GitHub repository for environment setup and detailed instructions.

### Sample Inference

Generate a video of a single person speaking:

```shell
export CP_SIZE=1
export CKPT_PATH=/path/to/MOVA-360p/

torchrun \
    --nproc_per_node=$CP_SIZE \
    scripts/inference_single.py \
    --ckpt_path $CKPT_PATH \
    --cp_size $CP_SIZE \
    --height 352 \
    --width 640 \
    --prompt "A man in a blue blazer and glasses speaks in a formal indoor setting, framed by wooden furniture and a filled bookshelf. Quiet room acoustics underscore his measured tone as he delivers his remarks. At one point, he says, \"I would also say that this election in Germany wasn’t surprising.\"" \
    --ref_path "./assets/single_person.jpg" \
    --output_path "./data/samples/single_person.mp4" \
    --seed 42 \
    --offload cpu
```

## Evaluation

We evaluate our model through both objective benchmarks and subjective human evaluations. Below are the Elo scores and win rates comparing MOVA to existing open-source models.
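Elo scores for pairwise human preference data are conventionally computed with the standard Elo update rule. The sketch below shows that generic procedure; the K-factor, initial rating, and match data are illustrative assumptions, not the paper's actual evaluation protocol or results.

```python
def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """Standard Elo update: score_a is 1.0 if A wins, 0.5 for a tie, 0.0 if A loses."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return r_a_new, r_b_new

def rate(matches, initial: float = 1000.0):
    """Fold the Elo update over a list of (model_a, model_b, score_a) judgments."""
    ratings = {}
    for a, b, score_a in matches:
        ra = ratings.setdefault(a, initial)
        rb = ratings.setdefault(b, initial)
        ratings[a], ratings[b] = elo_update(ra, rb, score_a)
    return ratings

# Hypothetical pairwise judgments for illustration only, not the paper's data.
matches = [
    ("MOVA", "baseline", 1.0),
    ("MOVA", "baseline", 0.5),
    ("baseline", "MOVA", 0.0),
]
ratings = rate(matches)
assert ratings["MOVA"] > ratings["baseline"]
```

A win rate is then simply the fraction of pairwise judgments a model wins; Elo additionally accounts for the strength of the opponent in each comparison.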

## Citation

```bibtex
@article{yu2026mova,
  title={MOVA: Towards Scalable and Synchronized Video-Audio Generation},
  author={Donghua Yu and Mingshu Chen and Qi Chen and Qi Luo and Qianyi Wu and Qinyuan Cheng and others},
  journal={arXiv preprint arXiv:2602.08794},
  year={2026}
}
```