findcard12138 committed on
Commit 178eb0e · verified · 1 Parent(s): 3cae53c

Upload folder using huggingface_hub

Files changed (3)
  1. .gitattributes +1 -0
  2. README.md +14 -5
  3. assets/MOSS-VL-Benchmark.png +3 -0
.gitattributes CHANGED
@@ -38,3 +38,4 @@ assets/3d-rope.png filter=lfs diff=lfs merge=lfs -text
 assets/logo.png filter=lfs diff=lfs merge=lfs -text
 assets/structure.png filter=lfs diff=lfs merge=lfs -text
 tokenizer.json filter=lfs diff=lfs merge=lfs -text
+assets/MOSS-VL-Benchmark.png filter=lfs diff=lfs merge=lfs -text
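The `.gitattributes` entries above use Git's attribute-pattern syntax to route matching paths through Git LFS. As a rough illustration only (not Git's actual matching engine, which has extra rules for `**`, directory anchoring, and basename matching), the literal patterns in this file can be checked with `fnmatch`:

```python
from fnmatch import fnmatch

# Literal LFS patterns taken from the .gitattributes diff above.
LFS_PATTERNS = [
    "assets/logo.png",
    "assets/structure.png",
    "tokenizer.json",
    "assets/MOSS-VL-Benchmark.png",
]

def is_lfs_tracked(path: str) -> bool:
    """Approximate check: does `path` match any LFS attribute pattern?"""
    return any(fnmatch(path, pattern) for pattern in LFS_PATTERNS)

print(is_lfs_tracked("assets/MOSS-VL-Benchmark.png"))  # True
print(is_lfs_tracked("README.md"))                     # False
```

In practice such a line is usually added with `git lfs track "assets/MOSS-VL-Benchmark.png"`, which edits `.gitattributes` for you.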
README.md CHANGED
@@ -1,5 +1,5 @@
 ---
-title: MOSS-VL-SFT-0408
+title: MOSS-VL-Instruct-0408
 date: 2026-04-08
 category: Multimodal-LLM
 status: SFT
@@ -8,7 +8,7 @@ language:
 library_name: transformers
 pipeline_tag: video-text-to-text
 license: apache-2.0
-base_model: fnlp-vision/mossvl_base_0408
+base_model: fnlp-vision/moss-video-preview-base
 tags:
 - SFT
 - Video-Understanding
@@ -24,11 +24,11 @@ tags:
   <img src="assets/logo.png" width="320"/>
 </p>
 
-# MOSS-VL-SFT-0408
+# MOSS-VL-Instruct-0408
 
 ## 📌 Introduction
 
-We introduce **MOSS-VL-SFT-0408**, the supervised fine-tuned checkpoint in the **MOSS-VL** series (part of the **OpenMOSS** ecosystem).
+We introduce **MOSS-VL-Instruct-0408**, the supervised fine-tuned checkpoint in the **MOSS-VL** series (part of the **OpenMOSS** ecosystem).
 
 > [!IMPORTANT]
 > This is an **SFT** checkpoint (instruction-tuned). It is **NOT** the Real-Time SFT streaming checkpoint.
@@ -57,7 +57,7 @@ This model is designed as a high-performance offline engine for multimodal tasks
 
 ## 🏗 Model Architecture
 
-**MOSS-VL-SFT-0408** adopts a decoupled multimodal design, utilizing a cross-attention mechanism to bridge high-resolution visual encoding with advanced language reasoning.
+**MOSS-VL-Instruct-0408** adopts a decoupled multimodal design, utilizing a cross-attention mechanism to bridge high-resolution visual encoding with advanced language reasoning.
 
 <p align="center">
   <img src="assets/structure.png" alt="MOSS-VL Architecture" width="90%"/>
@@ -87,6 +87,15 @@ MOSS-VL uses multimodal rotary position encoding to align text tokens and visual
 </p>
 
 
+## 📊 Model Performance
+
+We evaluate **MOSS-VL-Instruct-0408** across several key multimodal benchmarks, focusing on both video and image understanding.
+
+<p align="center">
+  <img src="assets/MOSS-VL-Benchmark.png" alt="MOSS-VL Benchmark Results" width="100%"/>
+  <br>
+  <em>Figure 4: Performance comparison on mainstream multimodal benchmarks.</em>
+</p>
 
 
 ## 🚀 Quickstart
assets/MOSS-VL-Benchmark.png ADDED

Git LFS Details

  • SHA256: 512a159fa15715fab83429e83130eaface7a7fc7080654302859c41a60a0ece3
  • Pointer size: 131 Bytes
  • Size of remote file: 865 kB
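The commit message states the folder was pushed with `huggingface_hub`. A minimal sketch of such an upload, assuming prior authentication via `huggingface-cli login`; the local path and repo id shown in the comments are placeholders, not taken from this commit:

```python
from huggingface_hub import HfApi

def upload_checkpoint(local_dir: str, repo_id: str) -> None:
    """Push a local checkpoint folder to the Hugging Face Hub.

    Requires prior authentication (e.g. `huggingface-cli login`);
    LFS-tracked files such as the benchmark PNG are stored as LFS objects.
    """
    api = HfApi()
    api.upload_folder(
        folder_path=local_dir,  # e.g. "./MOSS-VL-Instruct-0408" (placeholder)
        repo_id=repo_id,        # e.g. "fnlp-vision/MOSS-VL-Instruct-0408" (placeholder)
        repo_type="model",
        commit_message="Upload folder using huggingface_hub",
    )
```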