---
license: apache-2.0
pipeline_tag: any-to-any
tags:
- image-to-video
- image-text-to-video
- image-to-audio-video
- image-text-to-audio-video
- MOVA
- OpenMOSS
- SII
- MOSI
- sglang-diffusion
---
## MOVA: Towards Scalable and Synchronized Video-Audio Generation
We introduce **MOVA** (**MO**SS **V**ideo and **A**udio), a foundation model built to end the "silent era" of open-source video generation. Unlike cascaded pipelines that bolt sound on as an afterthought, MOVA synthesizes video and audio simultaneously, keeping the two modalities tightly aligned.
### 🌟 Key Highlights
- **Native Bimodal Generation**: Moves beyond clunky cascaded pipelines. MOVA generates high-fidelity video and synchronized audio in a single inference pass, eliminating error accumulation.
- **Precise Lip-Sync & Sound FX**: Achieves state-of-the-art performance in multilingual lip-synchronization and environment-aware sound effects.
- **Fully Open-Source**: In a field dominated by closed-source models (Sora 2, Veo 3, Kling), we are releasing model weights, inference code, training pipelines, and LoRA fine-tuning scripts.
- **Asymmetric Dual-Tower Architecture**: Leverages the power of pre-trained video and audio towers, fused via a bidirectional cross-attention mechanism for rich modality interaction.
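To build intuition for the bidirectional cross-attention fusion, here is a minimal single-head sketch in plain Python. This is an illustration only: real MOVA layers use learned projections, multiple heads, and far larger dimensions, and all token counts and sizes below are made up.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def cross_attend(queries, keys, values):
    """Single-head cross-attention: each query token attends over the
    other modality's key/value tokens and returns a weighted mix."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, values)) for j in range(d)])
    return out

# Toy token streams: 3 video tokens and 2 audio tokens of dim 4 (hypothetical sizes).
video = [[0.1, 0.2, 0.0, 0.5], [0.4, 0.1, 0.3, 0.0], [0.2, 0.2, 0.2, 0.2]]
audio = [[0.3, 0.0, 0.1, 0.4], [0.0, 0.5, 0.2, 0.1]]

# Bidirectional: video queries attend to audio, and audio queries attend to video,
# so each tower's representation is conditioned on the other modality.
video_fused = cross_attend(video, audio, audio)
audio_fused = cross_attend(audio, video, video)
```

Each fused token is a convex combination of the other modality's tokens, which is what lets timing cues (e.g. lip motion and speech onsets) flow between the towers.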
## Demo
<div align="center">
<video width="70%" controls>
<source src="https://cdn-uploads.huggingface.co/production/uploads/64817b8550b759c75d5d1eeb/FyB5TeOkXgAhb76fA5Pbg.mp4" type="video/mp4">
</video>
</div>
## Model Details
### Model Description
MOVA addresses the limitations of proprietary systems like Sora 2 and Veo 3 by offering a fully open-source framework for Image-Text-to-Video-Audio (IT2VA) and Text-to-Video-Audio (T2VA) generation. The model employs an asymmetric dual-tower architecture fused via a bidirectional cross-attention mechanism, together with a Mixture-of-Experts (MoE) design with 32B total parameters (18B active during inference), combining high-quality synthesis with efficient deployment. Alongside the model weights, we provide a fine-grained bimodal data pipeline and support for LoRA fine-tuning, empowering the community to advance research in synchronized cinematic synthesis.
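The gap between 32B total and 18B active parameters comes from sparse expert routing: each token runs only its top-scoring experts, so most parameters sit idle on any given forward pass. A toy top-k routing sketch (expert count, gate scores, and k are illustrative, not MOVA's actual configuration):

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def moe_forward(x, gate_scores, experts, k=2):
    """Route an input to its top-k experts; only those experts execute,
    which is why active parameters can be far fewer than total parameters."""
    topk = sorted(range(len(experts)), key=lambda i: gate_scores[i], reverse=True)[:k]
    weights = softmax([gate_scores[i] for i in topk])
    return sum(w * experts[i](x) for w, i in zip(weights, topk))

# Hypothetical toy setup: 8 scalar "experts"; only 2 run per input.
experts = [lambda x, s=s: s * x for s in range(1, 9)]
gate_scores = [0.1, 2.0, 0.3, 1.5, 0.0, 0.2, 0.4, 0.1]
y = moe_forward(1.0, gate_scores, experts, k=2)  # mixes experts 1 and 3 only
```

Here 6 of 8 experts never execute for this input, mirroring (in miniature) how an MoE model keeps only a fraction of its weights active per token.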
### Model Sources
- **Project Page:** https://mosi.cn/models/mova
- **Github:** https://github.com/OpenMOSS/MOVA
- **Paper:** [MOVA: Towards Scalable and Synchronized Video-Audio Generation](https://huggingface.co/papers/2602.08794)
## Model Usage
Please refer to the [GitHub repository](https://github.com/OpenMOSS/MOVA) for environment setup and detailed instructions.
### Sample Inference
Generate a video of a single person speaking:
```bash
# Number of GPUs to launch; passed to torchrun and to the script's --cp_size flag
export CP_SIZE=1
# Path to the downloaded MOVA-360p checkpoint directory
export CKPT_PATH=/path/to/MOVA-360p/
torchrun \
    --nproc_per_node=$CP_SIZE \
    scripts/inference_single.py \
    --ckpt_path $CKPT_PATH \
    --cp_size $CP_SIZE \
    --height 352 \
    --width 640 \
    --prompt "A man in a blue blazer and glasses speaks in a formal indoor setting, framed by wooden furniture and a filled bookshelf. Quiet room acoustics underscore his measured tone as he delivers his remarks. At one point, he says, \"I would also say that this election in Germany wasn’t surprising.\"" \
    --ref_path "./assets/single_person.jpg" \
    --output_path "./data/samples/single_person.mp4" \
    --seed 42 \
    --offload cpu
```
## Evaluation
We evaluate our model through both objective benchmarks and subjective human evaluations. Below are the Elo scores and win rates comparing MOVA to existing open-source models.
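For readers unfamiliar with Elo scoring in model evaluation: ratings are built from pairwise human preference votes, where each head-to-head comparison shifts both models' scores. A minimal sketch of the standard Elo update (the k-factor and starting ratings here are conventional defaults, not necessarily the exact protocol used in our evaluation):

```python
def elo_update(r_a, r_b, score_a, k=32):
    """One pairwise comparison: score_a is 1.0 if A wins, 0.5 for a tie,
    0.0 if A loses. Returns the updated (r_a, r_b)."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

# Toy example: two models start at 1000; the first wins one head-to-head vote.
r1, r2 = elo_update(1000.0, 1000.0, 1.0)  # -> 1016.0, 984.0
```

Upsets against higher-rated models move scores more than expected wins, so aggregate Elo reflects consistent preference across many comparisons rather than raw win counts alone.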
<p align="center">
<img src="https://cdn-uploads.huggingface.co/production/uploads/64817b8550b759c75d5d1eeb/Jr7I1qaSWK3x_Tfsxn9nP.png" width="600"/>
</p>
<p align="center">
<img src="https://cdn-uploads.huggingface.co/production/uploads/64817b8550b759c75d5d1eeb/i5lgZI3NmxLXdJIxndcOp.png" width="1000"/>
</p>
## Citation
```bibtex
@article{yu2026mova,
title={MOVA: Towards Scalable and Synchronized Video-Audio Generation},
author={Donghua Yu and Mingshu Chen and Qi Chen and Qi Luo and Qianyi Wu and Qinyuan Cheng and others},
journal={arXiv preprint arXiv:2602.08794},
year={2026}
}
``` |