## Introduction

MOSS-VL-Base-0408 is the foundation checkpoint of the MOSS-VL series, part of the OpenMOSS ecosystem dedicated to advancing visual understanding.

Built solely through four stages of multimodal pretraining, this checkpoint serves as a high-capacity offline multimodal foundation model. It provides strong general-purpose visual-linguistic representations across image and video inputs, and is intended primarily as the base model for downstream supervised fine-tuning, alignment, and domain adaptation.

1. Stage 1: Vision-language alignment
2. Stage 2: Large-scale multimodal pretraining

- 🖼️ **Strong General Multimodal Perception** — Covers single-image, multi-image, and mixed-modality offline understanding workloads.
- 🧱 **Robust Base for Adaptation** — Serves as the pretrained backbone for future SFT, alignment, and task-specific adaptation.

### Note on Variants

> [!IMPORTANT]
> **This is the base checkpoint.** It has **NOT** undergone supervised fine-tuning (SFT) or instruction tuning, and it is not the streaming variant. If you are looking for a user-facing instruction-following model, please refer to the corresponding instruct release.

---

## Model Architecture