Video-Text-to-Text
Transformers
Safetensors
English
moss_vl
feature-extraction
SFT
Video-Understanding
Image-Understanding
MOSS-VL
OpenMOSS
multimodal
video
vision-language
custom_code
Instructions to use OpenMOSS-Team/MOSS-VL-Instruct-0408 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use OpenMOSS-Team/MOSS-VL-Instruct-0408 with Transformers:
# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("OpenMOSS-Team/MOSS-VL-Instruct-0408", trust_remote_code=True, dtype="auto")
- Notebooks
- Google Colab
- Kaggle
Upload folder using huggingface_hub
README.md
CHANGED
@@ -34,7 +34,7 @@ Built on top of MOSS-VL-Base-0408 through supervised fine-tuning (SFT), this che
 
 ### ✨ Highlights
 
-- 🎬 **Outstanding Video Understanding** — A core strength of MOSS-VL. The model excels at long-form video comprehension, temporal reasoning, action recognition, and second-level event localization, delivering top-tier results on benchmarks such as VideoMME,
+- 🎬 **Outstanding Video Understanding** — A core strength of MOSS-VL. The model excels at long-form video comprehension, temporal reasoning, action recognition, and second-level event localization, delivering top-tier results on benchmarks such as VideoMME, and MLVU.
 - 🖼️ **Strong General Multimodal Perception** — Robust image understanding, fine-grained object recognition, OCR, and document parsing.
 - 💬 **Reliable Instruction Following** — Substantially improved alignment with user intent through supervised fine-tuning on diverse multimodal instruction data.
 