Upload folder using huggingface_hub
- .gitattributes +1 -0
- README.md +14 -5
- assets/MOSS-VL-Benchmark.png +3 -0
.gitattributes
CHANGED

```diff
@@ -38,3 +38,4 @@ assets/3d-rope.png filter=lfs diff=lfs merge=lfs -text
 assets/logo.png filter=lfs diff=lfs merge=lfs -text
 assets/structure.png filter=lfs diff=lfs merge=lfs -text
 tokenizer.json filter=lfs diff=lfs merge=lfs -text
+assets/MOSS-VL-Benchmark.png filter=lfs diff=lfs merge=lfs -text
```
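The hunk above adds an LFS rule so the new benchmark image is stored as a pointer rather than a blob. A minimal sketch of how such a rule is matched against a path — using `fnmatchcase` as a rough stand-in for git's attribute pattern syntax, with a helper name of our own:

```python
from fnmatch import fnmatchcase

def lfs_tracked(path: str, gitattributes: str) -> bool:
    """Return True if `path` matches a filter=lfs rule in .gitattributes text.

    Approximation only: real git attribute patterns follow gitignore-style
    rules, which fnmatch does not fully reproduce.
    """
    for line in gitattributes.splitlines():
        parts = line.split()
        if len(parts) >= 2 and "filter=lfs" in parts[1:]:
            if fnmatchcase(path, parts[0]):
                return True
    return False

# The rules as they stand after this commit.
rules = """\
assets/logo.png filter=lfs diff=lfs merge=lfs -text
assets/structure.png filter=lfs diff=lfs merge=lfs -text
tokenizer.json filter=lfs diff=lfs merge=lfs -text
assets/MOSS-VL-Benchmark.png filter=lfs diff=lfs merge=lfs -text
"""

print(lfs_tracked("assets/MOSS-VL-Benchmark.png", rules))  # True
```

Without the added line, the new PNG would be committed as a regular file and bloat the repo history.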
README.md
CHANGED

```diff
@@ -1,5 +1,5 @@
 ---
-title: MOSS-VL-
+title: MOSS-VL-Instruct-0408
 date: 2026-04-08
 category: Multimodal-LLM
 status: SFT
@@ -8,7 +8,7 @@ language:
 library_name: transformers
 pipeline_tag: video-text-to-text
 license: apache-2.0
-base_model: fnlp-vision/
+base_model: fnlp-vision/moss-video-preview-base
 tags:
 - SFT
 - Video-Understanding
@@ -24,11 +24,11 @@ tags:
   <img src="assets/logo.png" width="320"/>
 </p>

-# MOSS-VL-
+# MOSS-VL-Instruct-0408

 ## Introduction

-We introduce **MOSS-VL-
+We introduce **MOSS-VL-Instruct-0408**, the supervised fine-tuned checkpoint in the **MOSS-VL** series (part of the **OpenMOSS** ecosystem).

 > [!IMPORTANT]
 > This is an **SFT** checkpoint (instruction-tuned). It is **NOT** the Real-Time SFT streaming checkpoint.
@@ -57,7 +57,7 @@ This model is designed as a high-performance offline engine for multimodal tasks

 ## Model Architecture

-**MOSS-VL-
+**MOSS-VL-Instruct-0408** adopts a decoupled multimodal design, utilizing a cross-attention mechanism to bridge high-resolution visual encoding with advanced language reasoning.

 <p align="center">
   <img src="assets/structure.png" alt="MOSS-VL Architecture" width="90%"/>
@@ -87,6 +87,15 @@ MOSS-VL uses multimodal rotary position encoding to align text tokens and visual
 </p>


+## Model Performance
+
+We evaluate **MOSS-VL-Instruct-0408** across several key multimodal benchmarks, focusing on both video and image understanding.
+
+<p align="center">
+  <img src="assets/MOSS-VL-Benchmark.png" alt="MOSS-VL Benchmark Results" width="100%"/>
+  <br>
+  <em>Figure 4: Performance comparison on mainstream multimodal benchmarks.</em>
+</p>


 ## Quickstart
```
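The updated README describes a cross-attention mechanism that bridges visual encoding with language reasoning. A toy numpy sketch of that general mechanism — text hidden states attend over vision features and receive a residual update; the dimensions, weights, and function name are illustrative, not MOSS-VL's actual implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(text_h, vision_h, d_k=64, seed=0):
    """Toy cross-attention: queries from text, keys/values from vision."""
    rng = np.random.default_rng(seed)
    d_t, d_v = text_h.shape[-1], vision_h.shape[-1]
    # Random projections stand in for learned weights.
    Wq = rng.standard_normal((d_t, d_k)) / np.sqrt(d_t)
    Wk = rng.standard_normal((d_v, d_k)) / np.sqrt(d_v)
    Wv = rng.standard_normal((d_v, d_t)) / np.sqrt(d_v)
    q = text_h @ Wq                             # (n_text, d_k)
    k = vision_h @ Wk                           # (n_vision, d_k)
    v = vision_h @ Wv                           # (n_vision, d_t)
    attn = softmax(q @ k.T / np.sqrt(d_k))      # (n_text, n_vision)
    return text_h + attn @ v                    # residual update of text states

text_h = np.zeros((5, 128))     # 5 text tokens
vision_h = np.ones((16, 256))   # 16 visual patches
out = cross_attention(text_h, vision_h)
print(out.shape)  # (5, 128)
```

The key property of the decoupled design is visible even in the toy version: vision features only enter the language stream through the attention bridge, so the two encoders can use different widths.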
assets/MOSS-VL-Benchmark.png
ADDED
Git LFS Details
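The README hunk context above also mentions that MOSS-VL uses multimodal rotary position encoding to align text and visual tokens. A toy 1-D RoPE sketch in numpy — the standard rotate-pairs formulation, not the multimodal 3-D variant the model card refers to:

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Apply 1-D rotary position embedding to x of shape (seq, dim), dim even.

    Each pair of channels is rotated by an angle proportional to the
    token's position, so relative offsets survive dot-product attention.
    """
    seq, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)    # per-pair rotation frequency
    angles = positions[:, None] * freqs[None, :]  # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)

x = np.ones((4, 8))
out = rope(x, np.arange(4.0))
print(out.shape)  # (4, 8)
```

Because the transform is a pure rotation, it changes directions but not magnitudes, and position 0 is left untouched; a multimodal variant splits the channel pairs across temporal and spatial axes so video patches and text tokens share one coordinate scheme.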