---
license: apache-2.0
pipeline_tag: any-to-any
tags:
- image-to-video
- image-text-to-video
- image-to-audio-video
- image-text-to-audio-video
- MOVA
- OpenMOSS
- SII
- MOSI
- sglang-diffusion
---

## MOVA: Towards Scalable and Synchronized Video–Audio Generation
We introduce **MOVA** (**MO**SS **V**ideo and **A**udio), a foundation model designed to break the "silent era" of open-source video generation. Unlike cascaded pipelines that generate sound as an afterthought, MOVA synthesizes video and audio simultaneously, keeping the two modalities tightly aligned.

### 🌟 Key Highlights
- **Native Bimodal Generation**: Moves beyond clunky cascaded pipelines. MOVA generates high-fidelity video and synchronized audio in a single inference pass, eliminating error accumulation.
- **Precise Lip-Sync & Sound FX**: Achieves state-of-the-art performance in multilingual lip-synchronization and environment-aware sound effects.
- **Fully Open-Source**: In a field dominated by closed-source models (Sora 2, Veo 3, Kling), we release model weights, inference code, training pipelines, and LoRA fine-tuning scripts.
- **Asymmetric Dual-Tower Architecture**: Leverages the power of pre-trained video and audio towers, fused via a bidirectional cross-attention mechanism for rich modality interaction.

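The bidirectional cross-attention fusion mentioned above can be pictured as each modality's tokens attending over the other's. Below is a minimal, illustrative sketch in pure Python, not MOVA's actual implementation; the token values and dimensions are made up:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def cross_attend(queries, keys, values):
    """Each query token attends over all key/value tokens of the other modality."""
    d = len(queries[0])
    fused = []
    for q in queries:
        # Scaled dot-product attention scores against the other modality.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the other modality's value vectors.
        fused.append([sum(w * v[j] for w, v in zip(weights, values))
                      for j in range(len(values[0]))])
    return fused

# Toy 2-dimensional token streams (made up for illustration).
video_tokens = [[1.0, 0.0], [0.0, 1.0]]
audio_tokens = [[0.5, 0.5]]

# Bidirectional fusion: video attends to audio, and audio attends to video.
video_fused = cross_attend(video_tokens, audio_tokens, audio_tokens)
audio_fused = cross_attend(audio_tokens, video_tokens, video_tokens)
```

Because the attention runs in both directions, each stream can condition on the other at every fusion layer, which is what keeps the generated sound and picture in step.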

## Demo

<div align="center">
  <video width="70%" controls>
    <source src="https://cdn-uploads.huggingface.co/production/uploads/64817b8550b759c75d5d1eeb/FyB5TeOkXgAhb76fA5Pbg.mp4" type="video/mp4">
  </video>
</div>

## Model Details

### Model Description

MOVA addresses the limitations of proprietary systems such as Sora 2 and Veo 3 by offering a fully open-source framework for Image-to-Video-Audio (IT2VA) and Text-to-Video-Audio (T2VA) generation. The model employs an asymmetric dual-tower architecture fused via a bidirectional cross-attention mechanism, and its Mixture-of-Experts (MoE) design uses 32B total parameters (18B active during inference) to combine high-quality synthesis with efficient deployment. Alongside the model weights, we provide a fine-grained bimodal data pipeline and support for LoRA fine-tuning, empowering the community to advance research in synchronized cinematic synthesis.

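The gap between 32B total and 18B active parameters comes from MoE routing: each token only runs through a few selected experts. A minimal sketch of top-k routing, where the expert counts, sizes, and router scores are hypothetical and chosen only to show the idea, not MOVA's real configuration:

```python
# Hypothetical configuration for illustration only: these numbers are NOT
# MOVA's real expert layout; they just show the total-vs-active idea.
NUM_EXPERTS = 8
TOP_K = 2
PARAMS_PER_EXPERT = 1_000_000_000  # 1B parameters per expert (made up)

def route(router_scores, k=TOP_K):
    """Return the indices of the k highest-scoring experts for one token."""
    return sorted(range(len(router_scores)),
                  key=lambda i: router_scores[i], reverse=True)[:k]

# One token's router scores over the experts (made-up values).
router_scores = [0.1, 0.9, 0.3, 0.8, 0.2, 0.05, 0.4, 0.7]
active_experts = route(router_scores)

# Only the routed experts run for this token, so the active parameter
# count is a fraction of the total, analogous to 18B active out of 32B.
active_params = len(active_experts) * PARAMS_PER_EXPERT
total_params = NUM_EXPERTS * PARAMS_PER_EXPERT
```

This is why inference cost tracks the active-parameter count rather than the full model size.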
### Model Sources

- **Project Page:** https://mosi.cn/models/mova
- **GitHub:** https://github.com/OpenMOSS/MOVA
- **Paper:** [MOVA: Towards Scalable and Synchronized Video-Audio Generation](https://huggingface.co/papers/2602.08794)

## Model Usage

Please refer to the [GitHub repository](https://github.com/OpenMOSS/MOVA) for environment setup and detailed instructions.

### Sample Inference

Generate a video of a single person speaking:
```bash
export CP_SIZE=1
export CKPT_PATH=/path/to/MOVA-360p/

torchrun \
    --nproc_per_node=$CP_SIZE \
    scripts/inference_single.py \
    --ckpt_path $CKPT_PATH \
    --cp_size $CP_SIZE \
    --height 352 \
    --width 640 \
    --prompt "A man in a blue blazer and glasses speaks in a formal indoor setting, framed by wooden furniture and a filled bookshelf. Quiet room acoustics underscore his measured tone as he delivers his remarks. At one point, he says, \"I would also say that this election in Germany wasn’t surprising.\"" \
    --ref_path "./assets/single_person.jpg" \
    --output_path "./data/samples/single_person.mp4" \
    --seed 42 \
    --offload cpu
```

## Evaluation

We evaluate our model through both objective benchmarks and subjective human evaluations. Below are the Elo scores and win rates comparing MOVA to existing open-source models.


<p align="center">
  <img src="https://cdn-uploads.huggingface.co/production/uploads/64817b8550b759c75d5d1eeb/Jr7I1qaSWK3x_Tfsxn9nP.png" width="600"/>
</p>

<p align="center">
  <img src="https://cdn-uploads.huggingface.co/production/uploads/64817b8550b759c75d5d1eeb/i5lgZI3NmxLXdJIxndcOp.png" width="1000"/>
</p>

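As a reading aid for the Elo charts: under the standard Elo model, a rating gap maps directly to an expected head-to-head win probability. A minimal sketch, where the ratings are hypothetical and not the actual scores reported above:

```python
def elo_expected(rating_a, rating_b):
    """Expected probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

# Equal ratings imply a 50% expected win rate; a 100-point lead
# implies roughly a 64% expected win rate (hypothetical ratings).
even = elo_expected(1000, 1000)
ahead = elo_expected(1100, 1000)
```

So small Elo gaps between models already correspond to consistent head-to-head preferences in the pairwise comparisons.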
## Citation

```bibtex
@article{yu2026mova,
  title={MOVA: Towards Scalable and Synchronized Video-Audio Generation},
  author={Donghua Yu and Mingshu Chen and Qi Chen and Qi Luo and Qianyi Wu and Qinyuan Cheng and others},
  journal={arXiv preprint arXiv:2602.08794},
  year={2026}
}
```