CCCCyx committed
Commit b06e30d · verified · 1 Parent(s): f43bbd3

Upload README.md

Files changed (1): README.md (+16 -10)
README.md CHANGED
@@ -1,8 +1,8 @@
 ---
-title: MOSS-VL-SFT-0408
+title: MOSS-VL-Base-0408
 date: 2026-04-08
 category: Multimodal-LLM
-status: SFT
+status: Base
 language:
 - en
 library_name: transformers
@@ -10,7 +10,7 @@ pipeline_tag: video-text-to-text
 license: apache-2.0
 base_model: fnlp-vision/moss-video-preview-base
 tags:
-- SFT
+- Base
 - Video-Understanding
 - Image-Understanding
 - MOSS-VL
@@ -24,21 +24,27 @@ tags:
 <img src="assets/logo.png" width="320"/>
 </p>
 
-# MOSS-VL-SFT-0408
+# MOSS-VL-Base-0408
 
 ## 📌 Introduction
 
-We introduce **MOSS-VL-SFT-0408**, the supervised fine-tuned checkpoint in the **MOSS-VL** series (part of the **OpenMOSS** ecosystem).
+We introduce **MOSS-VL-Base-0408**, the base checkpoint in the **MOSS-VL** series (part of the **OpenMOSS** ecosystem).
 
 > [!IMPORTANT]
-> This is an **SFT** checkpoint (instruction-tuned). It is **NOT** the Real-Time SFT streaming checkpoint.
+> This is a **base** checkpoint. It has **NOT** undergone supervised fine-tuning (SFT) or instruction tuning.
 
-This model is designed as a high-performance offline engine for multimodal tasks, bridging the gap between static image understanding and dynamic real-time interaction.
+This model is trained through four stages of pretraining only:
+1. Stage 1: Vision-language alignment
+2. Stage 2: Large-scale multimodal pretraining
+3. Stage 3: High-quality multimodal pretraining
+4. Stage 4: Annealing and long-context extension
+
+This model is designed as a high-performance offline engine for multimodal tasks and serves as a strong base foundation for downstream adaptation.
 
 ### This checkpoint is intended for:
 
-- **video/image understanding** with significantly improved instruction following capabilities.
-- Serving as a **strong starting point** for further **Real-Time SFT** or specific domain adaptation.
+- **video/image understanding** and general multimodal representation learning.
+- Serving as a **strong starting point** for future SFT, alignment, or specific domain adaptation.
 
 ---
 
@@ -57,7 +63,7 @@ This model is designed as a high-performance offline engine for multimodal tasks
 
 ## 🏗 Model Architecture
 
-**MOSS-VL-SFT-0408** adopts a decoupled multimodal design, utilizing a cross-attention mechanism to bridge high-resolution visual encoding with advanced language reasoning.
+**MOSS-VL-Base-0408** adopts a decoupled multimodal design, utilizing a cross-attention mechanism to bridge high-resolution visual encoding with advanced language modeling.
 
 <p align="center">
 <img src="assets/structure.png" alt="MOSS-VL Architecture" width="90%"/>
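The architecture hunk above describes a decoupled design in which the language side cross-attends over tokens from a separate visual encoder. The sketch below illustrates that mechanism generically in NumPy; it is not MOSS-VL's actual implementation — the learned Q/K/V projections, multi-head splitting, and residual connections are omitted, and all shapes are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(text_states, vision_tokens):
    """Text positions (queries) attend over visual tokens (keys/values).

    Single head, no learned projections -- a real block would add
    W_q/W_k/W_v, multiple heads, and a residual connection.
    """
    d_k = vision_tokens.shape[-1]
    scores = text_states @ vision_tokens.T / np.sqrt(d_k)  # (T_text, T_vis)
    weights = softmax(scores, axis=-1)                     # rows sum to 1
    return weights @ vision_tokens                         # (T_text, d)

rng = np.random.default_rng(0)
text = rng.normal(size=(4, 8))     # 4 text positions, hidden dim 8 (assumed)
vision = rng.normal(size=(16, 8))  # 16 visual tokens from the encoder (assumed)
out = cross_attention(text, vision)
print(out.shape)  # (4, 8)
```

Because the visual pathway only feeds in through these cross-attention reads, the language backbone's own layers stay untouched — which is what makes the design "decoupled".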
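For reference, the model-card front matter as it reads after this commit can be reconstructed from the two YAML hunks in the diff. The sketch below embeds that reconstruction (key order assumed from the diff context) and parses it with a minimal hand-rolled parser; a real tool would use PyYAML or the Hugging Face Hub client instead.

```python
# Front matter after commit b06e30d, reconstructed from the diff hunks
# (key order between hunks is an assumption).
FRONT_MATTER = """\
title: MOSS-VL-Base-0408
date: 2026-04-08
category: Multimodal-LLM
status: Base
language:
- en
library_name: transformers
pipeline_tag: video-text-to-text
license: apache-2.0
base_model: fnlp-vision/moss-video-preview-base
tags:
- Base
- Video-Understanding
- Image-Understanding
- MOSS-VL
"""

def parse_front_matter(text: str) -> dict:
    """Minimal parser for the flat 'key: value' / '- item' subset used here."""
    data, current_key = {}, None
    for line in text.splitlines():
        if line.startswith("- ") and current_key:
            data.setdefault(current_key, []).append(line[2:].strip())
        elif ":" in line:
            key, _, value = line.partition(":")
            current_key = key.strip()
            value = value.strip()
            data[current_key] = value if value else []
    return data

meta = parse_front_matter(FRONT_MATTER)
print(meta["title"], meta["status"], meta["tags"])
# → MOSS-VL-Base-0408 Base ['Base', 'Video-Understanding', 'Image-Understanding', 'MOSS-VL']
```

The `status: Base` / `- Base` pair is what the commit flips from the SFT values, matching the retitled heading in the README body.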