---
library_name: MOVA
license: apache-2.0
tags:
- image-to-video
- image-text-to-video
- image-to-audio-video
- image-text-to-audio-video
- MOVA
- OpenMOSS
- SII
- MOSI
- sglang-diffusion
---

## MOVA: Towards Scalable and Synchronized Video–Audio Generation

We introduce **MOVA** (**MO**SS **V**ideo and **A**udio), a foundation model designed to break the "silent era" of open-source video generation. Unlike cascaded pipelines that generate sound as an afterthought, MOVA synthesizes video and audio simultaneously, keeping the two modalities tightly aligned.

## 🌟 Key Highlights

- **Native Bimodal Generation**: Moves beyond clunky cascaded pipelines; MOVA generates high-fidelity video and synchronized audio in a single inference pass, eliminating cross-stage error accumulation.
- **Precise Lip-Sync & Sound FX**: Achieves state-of-the-art performance in multilingual lip synchronization and environment-aware sound effects.
- **Fully Open-Source**: In a field dominated by closed-source models (Sora 2, Veo 3, Kling), we release model weights, inference code, training pipelines, and LoRA fine-tuning scripts.
- **Asymmetric Dual-Tower Architecture**: Builds on pre-trained video and audio towers, fused via a bidirectional cross-attention mechanism for rich cross-modal interaction.

## Demo

<div align="center">
<video width="70%" controls>
<source src="https://cdn-uploads.huggingface.co/production/uploads/64817b8550b759c75d5d1eeb/FyB5TeOkXgAhb76fA5Pbg.mp4" type="video/mp4">
</video>
</div>

## Model Details

### Model Description

MOVA addresses the limitations of proprietary systems such as Sora 2 and Veo 3 by offering a fully open-source framework for Image-to-Video-Audio (I2VA) and Image-Text-to-Video-Audio (IT2VA) tasks. The model employs an asymmetric dual-tower architecture fused via a bidirectional cross-attention mechanism, together with a Mixture-of-Experts (MoE) design with 32B total parameters (18B active during inference), ensuring high-quality synthesis with efficient deployment. Alongside the model weights, we provide a fine-grained bimodal data pipeline and support for LoRA fine-tuning, empowering the community to advance research in synchronized cinematic synthesis.
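
The bidirectional cross-attention fusion described above can be sketched in a few lines of numpy. This is an illustrative toy, not MOVA's actual implementation: the single-head formulation, projection shapes, token counts, and residual wiring are all assumptions made for the sake of a minimal example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values, w_q, w_k, w_v):
    """Single-head cross-attention: `queries` attend to `keys_values`."""
    q = queries @ w_q
    k = keys_values @ w_k
    v = keys_values @ w_v
    scores = (q @ k.T) / np.sqrt(q.shape[-1])   # (n_q, n_kv) attention logits
    return softmax(scores) @ v                  # weighted sum of values

rng = np.random.default_rng(0)
d = 16                                   # shared hidden size (toy value)
video_tokens = rng.normal(size=(8, d))   # 8 video latent tokens
audio_tokens = rng.normal(size=(5, d))   # 5 audio latent tokens

# Separate projections per direction (hypothetical names, toy init).
w = {name: rng.normal(size=(d, d)) / np.sqrt(d)
     for name in ("vq", "vk", "vv", "aq", "ak", "av")}

# Bidirectional fusion: each tower queries the other, then adds the
# result back to its own stream (residual connection).
video_tokens = video_tokens + cross_attention(
    video_tokens, audio_tokens, w["vq"], w["ak"], w["av"])
audio_tokens = audio_tokens + cross_attention(
    audio_tokens, video_tokens, w["aq"], w["vk"], w["vv"])

print(video_tokens.shape, audio_tokens.shape)  # (8, 16) (5, 16)
```

The point of the sketch is that each modality keeps its own token stream and tower; only the cross-attention step exchanges information, which is what allows the two pre-trained towers to remain asymmetric while staying synchronized.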

### Model Sources

- **GitHub:** https://github.com/OpenMOSS/MOVA
- **Paper:** Coming soon.

### Model Usage

Please refer to the GitHub repository above for installation and usage instructions.

## Evaluation

We evaluate our model through both objective benchmarks and subjective human evaluations. Below are the Elo scores and win rates comparing MOVA to existing open-source models.

<p align="center">
<img src="https://cdn-uploads.huggingface.co/production/uploads/64817b8550b759c75d5d1eeb/Jr7I1qaSWK3x_Tfsxn9nP.png" width="600"/>
</p>

<p align="center">
<img src="https://cdn-uploads.huggingface.co/production/uploads/64817b8550b759c75d5d1eeb/i5lgZI3NmxLXdJIxndcOp.png" width="1000"/>
</p>