---
library_name: diffusers
license: apache-2.0
pipeline_tag: any-to-any
tags:
- image-to-video
- image-text-to-video
- image-to-audio-video
- image-text-to-audio-video
- MOVA
- OpenMOSS
- SII
- MOSI
- sglang-diffusion
---

## MOVA: Towards Scalable and Synchronized Video–Audio Generation

We introduce **MOVA** (**MO**SS **V**ideo and **A**udio), a foundation model designed to break the "silent era" of open-source video generation. Unlike cascaded pipelines that bolt sound on as an afterthought, MOVA synthesizes video and audio jointly in a single pass, keeping the two modalities tightly aligned.

## 🌟 Key Highlights

- **Native Bimodal Generation**: Moves beyond clunky cascaded pipelines. MOVA generates high-fidelity video and synchronized audio in a single inference pass, eliminating error accumulation.
- **Precise Lip-Sync & Sound FX**: Achieves state-of-the-art performance in multilingual lip-synchronization and environment-aware sound effects.
- **Fully Open-Source**: In a field dominated by closed-source models (Sora 2, Veo 3, Kling), we are releasing model weights, inference code, training pipelines, and LoRA fine-tuning scripts.
- **Asymmetric Dual-Tower Architecture**: Leverages the power of pre-trained video and audio towers, fused via a bidirectional cross-attention mechanism for rich modality interaction.

## Demo

<div align="center">
<video width="70%" controls>
<source src="https://cdn-uploads.huggingface.co/production/uploads/64817b8550b759c75d5d1eeb/FyB5TeOkXgAhb76fA5Pbg.mp4" type="video/mp4">
</video>
</div>

## Model Details

### Model Description

MOVA addresses the limitations of proprietary systems such as Sora 2 and Veo 3 by offering a fully open-source framework for Image-Text-to-Video-Audio (IT2VA) and Text-to-Video-Audio (T2VA) tasks. The model employs an asymmetric dual-tower architecture fused via a bidirectional cross-attention mechanism, leveraging a Mixture-of-Experts (MoE) design with 32B total parameters (18B active during inference) to combine high-quality synthesis with efficient deployment. Alongside the model weights, we provide a fine-grained bimodal data pipeline and support for LoRA fine-tuning, empowering the community to advance research in synchronized cinematic synthesis.
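
The dual-tower fusion described above can be sketched in a few lines. This is a minimal, single-head illustration with no learned projection matrices; the token counts, hidden size, and residual layout are assumptions for demonstration, not MOVA's actual implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context, d):
    # queries: (Tq, d) from one tower; context: (Tc, d) from the other.
    # Simplified: the context serves as both keys and values.
    scores = queries @ context.T / np.sqrt(d)    # (Tq, Tc)
    return softmax(scores, axis=-1) @ context    # (Tq, d)

def bidirectional_fusion(video_h, audio_h):
    # Each tower attends to the other's hidden states and adds the result
    # residually, so both modalities condition on each other symmetrically.
    d = video_h.shape[-1]
    video_out = video_h + cross_attention(video_h, audio_h, d)
    audio_out = audio_h + cross_attention(audio_h, video_h, d)
    return video_out, audio_out

rng = np.random.default_rng(0)
video_tokens = rng.standard_normal((16, 64))  # e.g. 16 video latent tokens
audio_tokens = rng.standard_normal((8, 64))   # e.g. 8 audio latent tokens
v, a = bidirectional_fusion(video_tokens, audio_tokens)
print(v.shape, a.shape)  # each tower keeps its own sequence length
```

A full implementation would typically add learned Q/K/V projections and multiple heads, and interleave such fusion blocks with each tower's own layers; see the GitHub repository for MOVA's actual code.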

### Model Sources

- **GitHub:** https://github.com/OpenMOSS/MOVA
- **Paper:** [MOVA: Towards Scalable and Synchronized Video-Audio Generation](https://huggingface.co/papers/2602.08794)
- **Project Page:** https://mosi.cn/models/mova

### Model Usage

Please refer to the [Quick Start](https://github.com/OpenMOSS/MOVA#quick-start) section on the GitHub page for model usage and inference scripts.

## Evaluation

We evaluate our model through both objective benchmarks and subjective human evaluations. Below are the Elo scores and win rates comparing MOVA to existing open-source models.
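
For context, Elo ratings in pairwise human studies are conventionally derived from the standard Elo update rule. The sketch below shows that generic formula only; the paper's exact rating procedure may differ, and the K-factor of 32 is an assumed convention:

```python
def elo_expected(r_a, r_b):
    # Probability that A beats B under the Elo model.
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a, r_b, score_a, k=32):
    # score_a: 1.0 if A wins the pairwise comparison, 0.5 for a tie, 0.0 for a loss.
    e_a = elo_expected(r_a, r_b)
    return r_a + k * (score_a - e_a), r_b + k * ((1.0 - score_a) - (1.0 - e_a))

# Two equally rated models; A wins one human preference vote.
ra, rb = elo_update(1000.0, 1000.0, 1.0)
print(round(ra), round(rb))  # 1016 984
```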

<p align="center">
<img src="https://cdn-uploads.huggingface.co/production/uploads/64817b8550b759c75d5d1eeb/Jr7I1qaSWK3x_Tfsxn9nP.png" width="600"/>
</p>

<p align="center">
<img src="https://cdn-uploads.huggingface.co/production/uploads/64817b8550b759c75d5d1eeb/i5lgZI3NmxLXdJIxndcOp.png" width="1000"/>
</p>

## Citation

```bibtex
@article{yu2026mova,
  title={MOVA: Towards Scalable and Synchronized Video-Audio Generation},
  author={Yu, Donghua and Chen, Mingshu and Chen, Qi and Luo, Qi and Wu, Qianyi and Cheng, Qinyuan and Li, Ruixiao and Liang, Tianyi and Zhang, Wenbo and Tu, Wenming and others},
  journal={arXiv preprint arXiv:2602.08794},
  year={2026}
}
```