OpenMOSS-Team/MOVA-360p

image-text-to-video

image-to-audio-video

image-text-to-audio-video

sglang-diffusion

Add pipeline tag and link to paper #4

by nielsr (HF Staff), opened Feb 11

base: refs/heads/main ← from: refs/pr/4

Files changed (1)

README.md +37 -4

@@ -1,6 +1,6 @@
 ---
-library_name: diffusers
 license: apache-2.0
 tags:
 - image-to-video
 - image-text-to-video
@@ -38,11 +38,34 @@ MOVA addresses the limitations of proprietary systems like Sora 2 and Veo 3 by o
 ### Model Sources
 - **Github:** https://github.com/OpenMOSS/MOVA
-- **Paper:** Coming soon.
-### Model Usage
-Please refer to the github page for model usage.
 ## Evaluation
 We evaluate our model through both objective benchmarks and subjective human evaluations. Below are the Elo scores and win rates comparing MOVA to existing open-source models.
@@ -55,3 +78,13 @@ We evaluate our model through both objective benchmarks and subjective human eva
     <img src="https://cdn-uploads.huggingface.co/production/uploads/64817b8550b759c75d5d1eeb/i5lgZI3NmxLXdJIxndcOp.png" width="1000"/>
 <p>

 ---
 license: apache-2.0
+pipeline_tag: any-to-any
 tags:
 - image-to-video
 - image-text-to-video
 ### Model Sources
+- **Project Page:** https://mosi.cn/models/mova
 - **Github:** https://github.com/OpenMOSS/MOVA
+- **Paper:** [MOVA: Towards Scalable and Synchronized Video-Audio Generation](https://huggingface.co/papers/2602.08794)
+## Model Usage
+Please refer to the [GitHub repository](https://github.com/OpenMOSS/MOVA) for environment setup and detailed instructions.
+### Sample Inference
+Generate a video of a single person speaking:
+```bash
+export CP_SIZE=1
+export CKPT_PATH=/path/to/MOVA-360p/
+torchrun \
+    --nproc_per_node=$CP_SIZE \
+    scripts/inference_single.py \
+    --ckpt_path $CKPT_PATH \
+    --cp_size $CP_SIZE \
+    --height 352 \
+    --width 640 \
+    --prompt "A man in a blue blazer and glasses speaks in a formal indoor setting, framed by wooden furniture and a filled bookshelf. Quiet room acoustics underscore his measured tone as he delivers his remarks. At one point, he says, \"I would also say that this election in Germany wasn’t surprising.\"" \
+    --ref_path "./assets/single_person.jpg" \
+    --output_path "./data/samples/single_person.mp4" \
+    --seed 42 \
+    --offload cpu
+```
 ## Evaluation
 We evaluate our model through both objective benchmarks and subjective human evaluations. Below are the Elo scores and win rates comparing MOVA to existing open-source models.
     <img src="https://cdn-uploads.huggingface.co/production/uploads/64817b8550b759c75d5d1eeb/i5lgZI3NmxLXdJIxndcOp.png" width="1000"/>
 <p>
+## Citation
+```bibtex
+@article{yu2026mova,
+  title={MOVA: Towards Scalable and Synchronized Video-Audio Generation},
+  author={Donghua Yu and Mingshu Chen and Qi Chen and Qi Luo and Qianyi Wu and Qinyuan Cheng and others},
+  journal={arXiv preprint arXiv:2602.08794},
+  year={2026}
+}
+```
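
The sample inference command added in this diff takes a checkpoint directory (`--ckpt_path`), a reference image (`--ref_path`), and an output file (`--output_path`). A minimal pre-flight sketch that checks those paths before launching `torchrun` could look like the following. The function name `mova_preflight` is hypothetical and not part of the MOVA repository, and creating the output directory assumes the inference script does not create it itself.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the MOVA repository): checks the inputs
# used by the sample inference command before any GPU process is launched.
mova_preflight() {
    local ckpt_path="$1" ref_path="$2" output_path="$3"
    local status=0

    # The checkpoint is passed as a directory in the sample command.
    if [ ! -d "$ckpt_path" ]; then
        echo "checkpoint directory not found: $ckpt_path" >&2
        status=1
    fi

    # The reference image conditions the generated video.
    if [ ! -f "$ref_path" ]; then
        echo "reference image not found: $ref_path" >&2
        status=1
    fi

    # Assumption: the inference script may not create the output directory,
    # so create it here once all inputs have been found.
    if [ "$status" -eq 0 ]; then
        mkdir -p "$(dirname "$output_path")"
    fi

    return "$status"
}
```

It can be chained in front of the command from the README, for example `mova_preflight "$CKPT_PATH" ./assets/single_person.jpg ./data/samples/single_person.mp4 && torchrun ...`, so that a missing checkpoint or image is reported before the launch.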