---
library_name: diffusers
license: apache-2.0
pipeline_tag: any-to-any
tags:
- image-to-video
- image-text-to-video
- image-to-audio-video
- image-text-to-audio-video
- MOVA
- OpenMOSS
- SII
- MOSI
- sglang-diffusion
---

## MOVA: Towards Scalable and Synchronized Video–Audio Generation

We introduce **MOVA** (**MO**SS **V**ideo and **A**udio), a foundation model designed to break the "silent era" of open-source video generation. Unlike cascaded pipelines that bolt sound on as an afterthought, MOVA synthesizes video and audio jointly in a single pass, keeping the two modalities tightly aligned.

## 🌟 Key Highlights

- **Native Bimodal Generation**: Moves beyond clunky cascaded pipelines. MOVA generates high-fidelity video and synchronized audio in a single inference pass, eliminating error accumulation.
- **Precise Lip-Sync & Sound FX**: Achieves state-of-the-art performance in multilingual lip-synchronization and environment-aware sound effects.
- **Fully Open-Source**: In a field dominated by closed-source models (Sora 2, Veo 3, Kling), we are releasing model weights, inference code, training pipelines, and LoRA fine-tuning scripts.
- **Asymmetric Dual-Tower Architecture**: Leverages the power of pre-trained video and audio towers, fused via a bidirectional cross-attention mechanism for rich modality interaction.

## Demo

<div align="center">
<video width="70%" controls>
<source src="https://cdn-uploads.huggingface.co/production/uploads/64817b8550b759c75d5d1eeb/FyB5TeOkXgAhb76fA5Pbg.mp4" type="video/mp4">
</video>
</div>

## Model Details

### Model Description

MOVA addresses the limitations of proprietary systems such as Sora 2 and Veo 3 by offering a fully open-source framework for Image-Text-to-Video-Audio (IT2VA) and Text-to-Video-Audio (T2VA) tasks. The model employs an asymmetric dual-tower architecture fused via a bidirectional cross-attention mechanism, leveraging a Mixture-of-Experts (MoE) design with 32B total parameters (18B active during inference) to combine high-quality synthesis with efficient deployment. Alongside the model weights, we provide a fine-grained bimodal data pipeline and support for LoRA fine-tuning, empowering the community to advance research in synchronized cinematic synthesis.
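
The dual-tower fusion described above can be sketched in a few lines. This is a minimal, single-head illustration with no learned projection matrices; the token counts, hidden size, and residual layout are assumptions for demonstration, not MOVA's actual implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context, d):
    # queries: (Tq, d) from one tower; context: (Tc, d) from the other.
    # Simplified: the context serves as both keys and values.
    scores = queries @ context.T / np.sqrt(d)    # (Tq, Tc)
    return softmax(scores, axis=-1) @ context    # (Tq, d)

def bidirectional_fusion(video_h, audio_h):
    # Each tower attends to the other's hidden states and adds the result
    # residually, so both modalities condition on each other symmetrically.
    d = video_h.shape[-1]
    video_out = video_h + cross_attention(video_h, audio_h, d)
    audio_out = audio_h + cross_attention(audio_h, video_h, d)
    return video_out, audio_out

rng = np.random.default_rng(0)
video_tokens = rng.standard_normal((16, 64))  # e.g. 16 video latent tokens
audio_tokens = rng.standard_normal((8, 64))   # e.g. 8 audio latent tokens
v, a = bidirectional_fusion(video_tokens, audio_tokens)
print(v.shape, a.shape)  # each tower keeps its own sequence length
```

A full implementation would typically add learned Q/K/V projections and multiple heads, and interleave such fusion blocks with each tower's own layers; see the GitHub repository for MOVA's actual code.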

### Model Sources

- **GitHub:** https://github.com/OpenMOSS/MOVA
- **Paper:** [MOVA: Towards Scalable and Synchronized Video-Audio Generation](https://huggingface.co/papers/2602.08794)
- **Project Page:** https://mosi.cn/models/mova

### Model Usage

Please refer to the [Quick Start](https://github.com/OpenMOSS/MOVA#quick-start) section on the GitHub page for model usage and inference scripts.

## Evaluation

We evaluate our model through both objective benchmarks and subjective human evaluations. Below are the Elo scores and win rates comparing MOVA to existing open-source models.
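
For context, Elo ratings in pairwise human studies are conventionally derived from the standard Elo update rule. The sketch below shows that generic formula only; the paper's exact rating procedure may differ, and the K-factor of 32 is an assumed convention:

```python
def elo_expected(r_a, r_b):
    # Probability that A beats B under the Elo model.
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a, r_b, score_a, k=32):
    # score_a: 1.0 if A wins the pairwise comparison, 0.5 for a tie, 0.0 for a loss.
    e_a = elo_expected(r_a, r_b)
    return r_a + k * (score_a - e_a), r_b + k * ((1.0 - score_a) - (1.0 - e_a))

# Two equally rated models; A wins one human preference vote.
ra, rb = elo_update(1000.0, 1000.0, 1.0)
print(round(ra), round(rb))  # 1016 984
```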

<p align="center">
<img src="https://cdn-uploads.huggingface.co/production/uploads/64817b8550b759c75d5d1eeb/Jr7I1qaSWK3x_Tfsxn9nP.png" width="600"/>
</p>

<p align="center">
<img src="https://cdn-uploads.huggingface.co/production/uploads/64817b8550b759c75d5d1eeb/i5lgZI3NmxLXdJIxndcOp.png" width="1000"/>
</p>

## Citation

```bibtex
@article{yu2026mova,
  title={MOVA: Towards Scalable and Synchronized Video-Audio Generation},
  author={Yu, Donghua and Chen, Mingshu and Chen, Qi and Luo, Qi and Wu, Qianyi and Cheng, Qinyuan and Li, Ruixiao and Liang, Tianyi and Zhang, Wenbo and Tu, Wenming and others},
  journal={arXiv preprint arXiv:2602.08794},
  year={2026}
}
```