Upload folder using huggingface_hub
- .gitattributes +1 -0
- README.md +14 -5
- assets/MOSS-VL-Benchmark.png +3 -0
.gitattributes
CHANGED

```diff
@@ -38,3 +38,4 @@ assets/3d-rope.png filter=lfs diff=lfs merge=lfs -text
 assets/logo.png filter=lfs diff=lfs merge=lfs -text
 assets/structure.png filter=lfs diff=lfs merge=lfs -text
 tokenizer.json filter=lfs diff=lfs merge=lfs -text
+assets/MOSS-VL-Benchmark.png filter=lfs diff=lfs merge=lfs -text
```
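The hunk above adds an LFS rule so the new benchmark image is stored as a pointer rather than a blob. A minimal sketch of how such a rule is matched against a path — using `fnmatchcase` as a rough stand-in for git's attribute pattern syntax, with a helper name of our own:

```python
from fnmatch import fnmatchcase

def lfs_tracked(path: str, gitattributes: str) -> bool:
    """Return True if `path` matches a filter=lfs rule in .gitattributes text.

    Approximation only: real git attribute patterns follow gitignore-style
    rules, which fnmatch does not fully reproduce.
    """
    for line in gitattributes.splitlines():
        parts = line.split()
        if len(parts) >= 2 and "filter=lfs" in parts[1:]:
            if fnmatchcase(path, parts[0]):
                return True
    return False

# The rules as they stand after this commit.
rules = """\
assets/logo.png filter=lfs diff=lfs merge=lfs -text
assets/structure.png filter=lfs diff=lfs merge=lfs -text
tokenizer.json filter=lfs diff=lfs merge=lfs -text
assets/MOSS-VL-Benchmark.png filter=lfs diff=lfs merge=lfs -text
"""

print(lfs_tracked("assets/MOSS-VL-Benchmark.png", rules))  # True
```

Without the added line, the new PNG would be committed as a regular file and bloat the repo history.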
README.md
CHANGED

```diff
@@ -1,5 +1,5 @@
 ---
-title: MOSS-VL-
+title: MOSS-VL-Instruct-0408
 date: 2026-04-08
 category: Multimodal-LLM
 status: SFT
@@ -8,7 +8,7 @@ language:
 library_name: transformers
 pipeline_tag: video-text-to-text
 license: apache-2.0
-base_model: fnlp-vision/
+base_model: fnlp-vision/moss-video-preview-base
 tags:
 - SFT
 - Video-Understanding
@@ -24,11 +24,11 @@ tags:
   <img src="assets/logo.png" width="320"/>
 </p>

-# MOSS-VL-
+# MOSS-VL-Instruct-0408

 ## Introduction

-We introduce **MOSS-VL-
+We introduce **MOSS-VL-Instruct-0408**, the supervised fine-tuned checkpoint in the **MOSS-VL** series (part of the **OpenMOSS** ecosystem).

 > [!IMPORTANT]
 > This is an **SFT** checkpoint (instruction-tuned). It is **NOT** the Real-Time SFT streaming checkpoint.
@@ -57,7 +57,7 @@ This model is designed as a high-performance offline engine for multimodal tasks

 ## Model Architecture

-**MOSS-VL-
+**MOSS-VL-Instruct-0408** adopts a decoupled multimodal design, utilizing a cross-attention mechanism to bridge high-resolution visual encoding with advanced language reasoning.

 <p align="center">
   <img src="assets/structure.png" alt="MOSS-VL Architecture" width="90%"/>
@@ -87,6 +87,15 @@ MOSS-VL uses multimodal rotary position encoding to align text tokens and visual
 </p>


+## Model Performance
+
+We evaluate **MOSS-VL-Instruct-0408** across several key multimodal benchmarks, focusing on both video and image understanding.
+
+<p align="center">
+  <img src="assets/MOSS-VL-Benchmark.png" alt="MOSS-VL Benchmark Results" width="100%"/>
+  <br>
+  <em>Figure 4: Performance comparison on mainstream multimodal benchmarks.</em>
+</p>


 ## Quickstart
```
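The updated README describes a cross-attention mechanism that bridges visual encoding with language reasoning. A toy numpy sketch of that general mechanism — text hidden states attend over vision features and receive a residual update; the dimensions, weights, and function name are illustrative, not MOSS-VL's actual implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(text_h, vision_h, d_k=64, seed=0):
    """Toy cross-attention: queries from text, keys/values from vision."""
    rng = np.random.default_rng(seed)
    d_t, d_v = text_h.shape[-1], vision_h.shape[-1]
    # Random projections stand in for learned weights.
    Wq = rng.standard_normal((d_t, d_k)) / np.sqrt(d_t)
    Wk = rng.standard_normal((d_v, d_k)) / np.sqrt(d_v)
    Wv = rng.standard_normal((d_v, d_t)) / np.sqrt(d_v)
    q = text_h @ Wq                             # (n_text, d_k)
    k = vision_h @ Wk                           # (n_vision, d_k)
    v = vision_h @ Wv                           # (n_vision, d_t)
    attn = softmax(q @ k.T / np.sqrt(d_k))      # (n_text, n_vision)
    return text_h + attn @ v                    # residual update of text states

text_h = np.zeros((5, 128))     # 5 text tokens
vision_h = np.ones((16, 256))   # 16 visual patches
out = cross_attention(text_h, vision_h)
print(out.shape)  # (5, 128)
```

The key property of the decoupled design is visible even in the toy version: vision features only enter the language stream through the attention bridge, so the two encoders can use different widths.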
assets/MOSS-VL-Benchmark.png
ADDED
Git LFS Details
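The README hunk context above also mentions that MOSS-VL uses multimodal rotary position encoding to align text and visual tokens. A toy 1-D RoPE sketch in numpy — the standard rotate-pairs formulation, not the multimodal 3-D variant the model card refers to:

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Apply 1-D rotary position embedding to x of shape (seq, dim), dim even.

    Each pair of channels is rotated by an angle proportional to the
    token's position, so relative offsets survive dot-product attention.
    """
    seq, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)    # per-pair rotation frequency
    angles = positions[:, None] * freqs[None, :]  # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)

x = np.ones((4, 8))
out = rope(x, np.arange(4.0))
print(out.shape)  # (4, 8)
```

Because the transform is a pure rotation, it changes directions but not magnitudes, and position 0 is left untouched; a multimodal variant splits the channel pairs across temporal and spatial axes so video patches and text tokens share one coordinate scheme.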