---
license: apache-2.0
pipeline_tag: any-to-any
tags:
- image-to-video
- image-text-to-video
- image-to-audio-video
- image-text-to-audio-video
- MOVA
- OpenMOSS
- SII
- MOSI
- sglang-diffusion
---
## MOVA: Towards Scalable and Synchronized Video–Audio Generation
We introduce **MOVA** (**MO**SS **V**ideo and **A**udio), a foundation model designed to break the "silent era" of open-source video generation. Unlike cascaded pipelines that bolt sound on as an afterthought, MOVA synthesizes video and audio jointly in a single pass, keeping the two modalities tightly aligned.
### 🌟 Key Highlights
- **Native Bimodal Generation**: Moves beyond clunky cascaded pipelines. MOVA generates high-fidelity video and synchronized audio in a single inference pass, eliminating error accumulation.
- **Precise Lip-Sync & Sound FX**: Achieves state-of-the-art performance in multilingual lip-synchronization and environment-aware sound effects.
- **Fully Open-Source**: In a field dominated by closed-source models (Sora 2, Veo 3, Kling), we are releasing model weights, inference code, training pipelines, and LoRA fine-tuning scripts.
- **Asymmetric Dual-Tower Architecture**: Leverages the power of pre-trained video and audio towers, fused via a bidirectional cross-attention mechanism for rich modality interaction.
## Demo
## Model Details
### Model Description
MOVA addresses the limitations of proprietary systems like Sora 2 and Veo 3 by offering a fully open-source framework for Image-to-Video-Audio (I2VA) and Text-to-Video-Audio (T2VA) tasks. The model employs an asymmetric dual-tower architecture fused via a bidirectional cross-attention mechanism, and leverages a Mixture-of-Experts (MoE) design with 32B total parameters (18B active during inference) to combine high-quality synthesis with efficient deployment. Alongside the model weights, we provide a fine-grained bimodal data pipeline and support for LoRA fine-tuning, empowering the community to advance research in synchronized cinematic synthesis.
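To make the fusion mechanism concrete, here is a minimal, illustrative sketch of bidirectional cross-attention between two token streams. This is not the actual MOVA implementation; all shapes, names, and the single-head formulation are simplifying assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, context, d):
    # Scaled dot-product cross-attention: tokens of one modality (queries)
    # attend over tokens of the other modality (context).
    scores = queries @ context.T / np.sqrt(d)
    return softmax(scores) @ context

rng = np.random.default_rng(0)
d = 8                                      # shared hidden size (hypothetical)
video_tokens = rng.normal(size=(16, d))    # e.g. patchified video latents
audio_tokens = rng.normal(size=(10, d))    # e.g. audio latent frames

# Bidirectional fusion: each tower queries the other, and the result is added
# back residually, so both streams carry cross-modal information.
video_fused = video_tokens + cross_attend(video_tokens, audio_tokens, d)
audio_fused = audio_tokens + cross_attend(audio_tokens, video_tokens, d)

print(video_fused.shape, audio_fused.shape)  # (16, 8) (10, 8)
```

In the real model each tower would use learned query/key/value projections and multiple heads; the sketch only shows why the interaction is "bidirectional": video attends to audio and audio attends to video in the same fusion step.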
### Model Sources
- **Project Page:** https://mosi.cn/models/mova
- **Github:** https://github.com/OpenMOSS/MOVA
- **Paper:** [MOVA: Towards Scalable and Synchronized Video-Audio Generation](https://huggingface.co/papers/2602.08794)
## Model Usage
Please refer to the [GitHub repository](https://github.com/OpenMOSS/MOVA) for environment setup and detailed instructions.
### Sample Inference
Generate a video of a single person speaking:
```bash
export CP_SIZE=1
export CKPT_PATH=/path/to/MOVA-360p/
torchrun \
    --nproc_per_node=$CP_SIZE \
    scripts/inference_single.py \
    --ckpt_path $CKPT_PATH \
    --cp_size $CP_SIZE \
    --height 352 \
    --width 640 \
    --prompt "A man in a blue blazer and glasses speaks in a formal indoor setting, framed by wooden furniture and a filled bookshelf. Quiet room acoustics underscore his measured tone as he delivers his remarks. At one point, he says, \"I would also say that this election in Germany wasn’t surprising.\"" \
    --ref_path "./assets/single_person.jpg" \
    --output_path "./data/samples/single_person.mp4" \
    --seed 42 \
    --offload cpu
```
## Evaluation
We evaluate our model through both objective benchmarks and subjective human evaluations, reporting Elo scores and win rates that compare MOVA against existing open-source models.
## Citation
```bibtex
@article{yu2026mova,
  title={MOVA: Towards Scalable and Synchronized Video-Audio Generation},
  author={Donghua Yu and Mingshu Chen and Qi Chen and Qi Luo and Qianyi Wu and Qinyuan Cheng and others},
  journal={arXiv preprint arXiv:2602.08794},
  year={2026}
}
```