OpenMOSS-Team/MOSS-VL-Instruct-0408

Video-Text-to-Text

feature-extraction

Video-Understanding

Image-Understanding

vision-language

findcard12138 committed on Apr 8

Commit 03be5cf · verified · 1 Parent(s): fd2f3e8

Upload folder using huggingface_hub

Files changed (1)

README.md +1 -1

@@ -34,7 +34,7 @@ Built on top of MOSS-VL-Base-0408 through supervised fine-tuning (SFT), this che
 ### ✨ Highlights
-- 🎬 **Outstanding Video Understanding** — A core strength of MOSS-VL. The model excels at long-form video comprehension, temporal reasoning, action recognition, and second-level event localization, delivering top-tier results on benchmarks such as VideoMME, MLVU, and EgoSchema.
+- 🎬 **Outstanding Video Understanding** — A core strength of MOSS-VL. The model excels at long-form video comprehension, temporal reasoning, action recognition, and second-level event localization, delivering top-tier results on benchmarks such as VideoMME, and MLVU.
 - 🖼️ **Strong General Multimodal Perception** — Robust image understanding, fine-grained object recognition, OCR, and document parsing.
 - 💬 **Reliable Instruction Following** — Substantially improved alignment with user intent through supervised fine-tuning on diverse multimodal instruction data.