CCCCyx committed
Commit b06e30d · verified · 1 Parent(s): f43bbd3

Upload README.md

Files changed (1): README.md (+16 -10)
README.md CHANGED
@@ -1,8 +1,8 @@
 ---
-title: MOSS-VL-SFT-0408
+title: MOSS-VL-Base-0408
 date: 2026-04-08
 category: Multimodal-LLM
-status: SFT
+status: Base
 language:
 - en
 library_name: transformers
@@ -10,7 +10,7 @@ pipeline_tag: video-text-to-text
 license: apache-2.0
 base_model: fnlp-vision/moss-video-preview-base
 tags:
-- SFT
+- Base
 - Video-Understanding
 - Image-Understanding
 - MOSS-VL
@@ -24,21 +24,27 @@ tags:
 <img src="assets/logo.png" width="320"/>
 </p>
 
-# MOSS-VL-SFT-0408
+# MOSS-VL-Base-0408
 
 ## 📌 Introduction
 
-We introduce **MOSS-VL-SFT-0408**, the supervised fine-tuned checkpoint in the **MOSS-VL** series (part of the **OpenMOSS** ecosystem).
+We introduce **MOSS-VL-Base-0408**, the base checkpoint in the **MOSS-VL** series (part of the **OpenMOSS** ecosystem).
 
 > [!IMPORTANT]
-> This is an **SFT** checkpoint (instruction-tuned). It is **NOT** the Real-Time SFT streaming checkpoint.
+> This is a **base** checkpoint. It has **NOT** undergone supervised fine-tuning (SFT) or instruction tuning.
 
-This model is designed as a high-performance offline engine for multimodal tasks, bridging the gap between static image understanding and dynamic real-time interaction.
+This model is trained through four stages of pretraining only:
+1. Stage 1: Vision-language alignment
+2. Stage 2: Large-scale multimodal pretraining
+3. Stage 3: High-quality multimodal pretraining
+4. Stage 4: Annealing and long-context extension
+
+This model is designed as a high-performance offline engine for multimodal tasks and serves as a strong base foundation for downstream adaptation.
 
 ### This checkpoint is intended for:
 
-- **video/image understanding** with significantly improved instruction following capabilities.
-- Serving as a **strong starting point** for further **Real-Time SFT** or specific domain adaptation.
+- **video/image understanding** and general multimodal representation learning.
+- Serving as a **strong starting point** for future SFT, alignment, or specific domain adaptation.
 
 ---
 
@@ -57,7 +63,7 @@ This model is designed as a high-performance offline engine for multimodal tasks
 
 ## 🏗 Model Architecture
 
-**MOSS-VL-SFT-0408** adopts a decoupled multimodal design, utilizing a cross-attention mechanism to bridge high-resolution visual encoding with advanced language reasoning.
+**MOSS-VL-Base-0408** adopts a decoupled multimodal design, utilizing a cross-attention mechanism to bridge high-resolution visual encoding with advanced language modeling.
 
 <p align="center">
 <img src="assets/structure.png" alt="MOSS-VL Architecture" width="90%"/>
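The architecture hunk above describes a decoupled design in which the language side cross-attends over tokens from a separate visual encoder. The sketch below illustrates that mechanism generically in NumPy; it is not MOSS-VL's actual implementation — the learned Q/K/V projections, multi-head splitting, and residual connections are omitted, and all shapes are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(text_states, vision_tokens):
    """Text positions (queries) attend over visual tokens (keys/values).

    Single head, no learned projections -- a real block would add
    W_q/W_k/W_v, multiple heads, and a residual connection.
    """
    d_k = vision_tokens.shape[-1]
    scores = text_states @ vision_tokens.T / np.sqrt(d_k)  # (T_text, T_vis)
    weights = softmax(scores, axis=-1)                     # rows sum to 1
    return weights @ vision_tokens                         # (T_text, d)

rng = np.random.default_rng(0)
text = rng.normal(size=(4, 8))     # 4 text positions, hidden dim 8 (assumed)
vision = rng.normal(size=(16, 8))  # 16 visual tokens from the encoder (assumed)
out = cross_attention(text, vision)
print(out.shape)  # (4, 8)
```

Because the visual pathway only feeds in through these cross-attention reads, the language backbone's own layers stay untouched — which is what makes the design "decoupled".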
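For reference, the model-card front matter as it reads after this commit can be reconstructed from the two YAML hunks in the diff. The sketch below embeds that reconstruction (key order assumed from the diff context) and parses it with a minimal hand-rolled parser; a real tool would use PyYAML or the Hugging Face Hub client instead.

```python
# Front matter after commit b06e30d, reconstructed from the diff hunks
# (key order between hunks is an assumption).
FRONT_MATTER = """\
title: MOSS-VL-Base-0408
date: 2026-04-08
category: Multimodal-LLM
status: Base
language:
- en
library_name: transformers
pipeline_tag: video-text-to-text
license: apache-2.0
base_model: fnlp-vision/moss-video-preview-base
tags:
- Base
- Video-Understanding
- Image-Understanding
- MOSS-VL
"""

def parse_front_matter(text: str) -> dict:
    """Minimal parser for the flat 'key: value' / '- item' subset used here."""
    data, current_key = {}, None
    for line in text.splitlines():
        if line.startswith("- ") and current_key:
            data.setdefault(current_key, []).append(line[2:].strip())
        elif ":" in line:
            key, _, value = line.partition(":")
            current_key = key.strip()
            value = value.strip()
            data[current_key] = value if value else []
    return data

meta = parse_front_matter(FRONT_MATTER)
print(meta["title"], meta["status"], meta["tags"])
# → MOSS-VL-Base-0408 Base ['Base', 'Video-Understanding', 'Image-Understanding', 'MOSS-VL']
```

The `status: Base` / `- Base` pair is what the commit flips from the SFT values, matching the retitled heading in the README body.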