---
title: MOSS-VL-Base-0408
date: 2026-04-08
category: Multimodal-LLM
status: Base
language:
- en
library_name: transformers
pipeline_tag: video-text-to-text
license: apache-2.0
base_model: fnlp-vision/moss-video-preview-base
tags:
- Base
- Video-Understanding
- Image-Understanding
- MOSS-VL
---

<p align="center">
<img src="assets/logo.png" width="320"/>
</p>

# MOSS-VL-Base-0408

## 📌 Introduction

We introduce **MOSS-VL-Base-0408**, the base checkpoint in the **MOSS-VL** series (part of the **OpenMOSS** ecosystem).

> [!IMPORTANT]
> This is a **base** checkpoint. It has **NOT** undergone supervised fine-tuning (SFT) or instruction tuning.

This model was trained with pretraining only, in four stages:

1. Stage 1: Vision-language alignment
2. Stage 2: Large-scale multimodal pretraining
3. Stage 3: High-quality multimodal pretraining
4. Stage 4: Annealing and long-context extension

It is designed as a high-performance offline engine for multimodal tasks and serves as a strong foundation for downstream adaptation.

### This checkpoint is intended for:

- **Video/image understanding** and general multimodal representation learning.
- Serving as a **strong starting point** for future SFT, alignment, or domain-specific adaptation.
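
Downstream adaptation of a large base checkpoint is often done parameter-efficiently, e.g. by freezing the pretrained weights and training a small low-rank update (LoRA-style). The sketch below is a generic illustration of that idea in numpy; the dimensions, rank, and zero-initialization are illustrative, and it says nothing about how MOSS-VL itself is adapted:

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen pretrained weight (stands in for one projection of the base checkpoint).
W = rng.standard_normal((512, 512))

# Low-rank adapters: only A and B are trained during adaptation.
r = 8
A = rng.standard_normal((512, r)) * 0.01
B = np.zeros((r, 512))  # zero-init, so adaptation starts exactly at the base model

def adapted_forward(x):
    # Base output plus a low-rank correction; W itself is never updated.
    return x @ W + x @ A @ B

x = rng.standard_normal((4, 512))
# With B = 0, the adapted model reproduces the frozen base exactly.
assert np.allclose(adapted_forward(x), x @ W)
```

Because the update `A @ B` has rank at most `r`, the number of trainable parameters stays small while the base model's representations are preserved as the starting point.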

---

## 🏗 Model Architecture

**MOSS-VL-Base-0408** adopts a decoupled multimodal design, using a cross-attention mechanism to bridge high-resolution visual encoding with advanced language modeling.
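
The cross-attention bridge can be illustrated with a minimal single-head sketch: language hidden states form the queries, visual encoder features form the keys and values, and the output stays aligned with the text sequence. All shapes, projections, and the single-head setup here are toy assumptions, not the model's actual configuration:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(text_h, vis_h, d_k=64, seed=0):
    """Single-head cross-attention: text tokens (queries) attend over visual features (keys/values)."""
    rng = np.random.default_rng(seed)
    d_text, d_vis = text_h.shape[1], vis_h.shape[1]
    W_q = rng.standard_normal((d_text, d_k)) / np.sqrt(d_text)
    W_k = rng.standard_normal((d_vis, d_k)) / np.sqrt(d_vis)
    W_v = rng.standard_normal((d_vis, d_text)) / np.sqrt(d_vis)
    Q = text_h @ W_q                           # (T, d_k)
    K = vis_h @ W_k                            # (V, d_k)
    V = vis_h @ W_v                            # (V, d_text)
    weights = softmax(Q @ K.T / np.sqrt(d_k))  # (T, V): each text token attends over visual tokens
    return weights @ V                         # (T, d_text): text states enriched with visual context

# Toy sizes: 5 text tokens (dim 128), 16 visual patch features (dim 256).
text_h = np.random.default_rng(1).standard_normal((5, 128))
vis_h = np.random.default_rng(2).standard_normal((16, 256))
out = cross_attention(text_h, vis_h)
print(out.shape)  # (5, 128)
```

Note that the visual and text hidden sizes can differ freely: the key/value projections map the visual features into the text model's spaces, which is what makes the design "decoupled".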

<p align="center">
<img src="assets/structure.png" alt="MOSS-VL Architecture" width="90%"/>