---
library_name: diffusers
license: apache-2.0
pipeline_tag: any-to-any
tags:
- image-to-video
- image-text-to-video
- image-to-audio-video
- image-text-to-audio-video
- MOVA
- OpenMOSS
- SII
- MOSI
- sglang-diffusion
---

## MOVA: Towards Scalable and Synchronized Video–Audio Generation

We introduce **MOVA** (**MO**SS **V**ideo and **A**udio), a foundation model designed to break the "silent era" of open-source video generation. Unlike cascaded pipelines that generate sound as an afterthought, MOVA synthesizes video and audio simultaneously for perfect alignment.

🌟 Key Highlights

- **Native Bimodal Generation**: Moves beyond clunky cascaded pipelines. MOVA generates high-fidelity video and synchronized audio in a single inference pass, eliminating error accumulation.
- **Precise Lip-Sync & Sound FX**: Achieves state-of-the-art performance in multilingual lip-synchronization and environment-aware sound effects.
- **Fully Open-Source**: In a field dominated by closed-source models (Sora 2, Veo 3, Kling), we are releasing model weights, inference code, training pipelines, and LoRA fine-tuning scripts.
- **Asymmetric Dual-Tower Architecture**: Leverages the power of pre-trained video and audio towers, fused via a bidirectional cross-attention mechanism for rich modality interaction.

## Demo
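
The bidirectional cross-attention fusion named in the highlights can be illustrated with a toy sketch. The NumPy code below is not MOVA's implementation: the single-head attention, the residual update, the token counts and widths, and every name (`cross_attend`, `bidirectional_fusion`, `params`) are assumptions chosen for illustration. It only shows the general idea of two towers with different hidden sizes, each reading from the other.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, context, w_q, w_k, w_v):
    """Single-head cross-attention: `queries` tokens read from `context` tokens."""
    q = queries @ w_q                      # (n_q, d_attn)
    k = context @ w_k                      # (n_c, d_attn)
    v = context @ w_v                      # (n_c, d_queries)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v             # (n_q, d_queries)

def bidirectional_fusion(video, audio, params):
    # Both directions read the pre-update states, then add a residual.
    video_out = video + cross_attend(video, audio, *params["video_from_audio"])
    audio_out = audio + cross_attend(audio, video, *params["audio_from_video"])
    return video_out, audio_out

rng = np.random.default_rng(0)
d_video, d_audio, d_attn = 16, 8, 4        # hypothetical, asymmetric widths
video = rng.normal(size=(6, d_video))      # 6 video tokens
audio = rng.normal(size=(10, d_audio))     # 10 audio tokens

params = {
    "video_from_audio": (
        rng.normal(size=(d_video, d_attn)),   # query projection
        rng.normal(size=(d_audio, d_attn)),   # key projection
        rng.normal(size=(d_audio, d_video)),  # value projection
    ),
    "audio_from_video": (
        rng.normal(size=(d_audio, d_attn)),
        rng.normal(size=(d_video, d_attn)),
        rng.normal(size=(d_video, d_audio)),
    ),
}

video_out, audio_out = bidirectional_fusion(video, audio, params)
print(video_out.shape, audio_out.shape)
```

Each tower keeps its own token count and hidden size after fusion, so in this sketch the two streams exchange information without being forced into a shared shape.
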
## Citation

```bibtex
@article{yu2026mova,
  title={MOVA: Towards Scalable and Synchronized Video-Audio Generation},
  author={Yu, Donghua and Chen, Mingshu and Chen, Qi and Luo, Qi and Wu, Qianyi and Cheng, Qinyuan and Li, Ruixiao and Liang, Tianyi and Zhang, Wenbo and Tu, Wenming and others},
  journal={arXiv preprint arXiv:2602.08794},
  year={2026}
}
```