CCCCyx committed (verified)
Commit 82b620c · 1 Parent(s): f0dde1e

Update README.md

Files changed (1):
  1. README.md (+2, -8)
README.md CHANGED

```diff
@@ -27,9 +27,9 @@ tags:
 
 ## πŸ“Œ Introduction
 
-MOSS-VL-Base-0408 is the foundation checkpoint of the MOSS-VL series, part of the OpenMOSS ecosystem dedicated to advancing open multimodal foundation models.
+MOSS-VL-Base-0408 is the foundation checkpoint of the MOSS-VL series, part of the OpenMOSS ecosystem dedicated to advancing visual understanding.
 
-Built through four stages of multimodal pretraining only, this checkpoint serves as a high-capacity offline multimodal foundation model. It provides strong general-purpose visual-linguistic representations across image and video inputs, and is intended primarily as the base model for downstream supervised fine-tuning, alignment, and domain adaptation:
+Built through four stages of multimodal pretraining only, this checkpoint serves as a high-capacity offline multimodal foundation model. It provides strong general-purpose visual-linguistic representations across image and video inputs, and is intended primarily as the base model for downstream supervised fine-tuning, alignment, and domain adaptation
 
 1. Stage 1: Vision-language alignment
 2. Stage 2: Large-scale multimodal pretraining
@@ -42,12 +42,6 @@ Built through four stages of multimodal pretraining only, this checkpoint serves
 
 - πŸ–ΌοΈ **Strong General Multimodal Perception** β€” Covers single-image, multi-image, and mixed-modality offline understanding workloads.
 - 🧱 **Robust Base for Adaptation** β€” Serves as the pretrained backbone for future SFT, alignment, and task-specific adaptation.
 
-### πŸ“ Note on Variants
-
-> [!IMPORTANT]
-> **This is the base checkpoint.** It has **NOT** undergone supervised fine-tuning (SFT) or instruction tuning, and it is not the streaming variant. If you are looking for a user-facing instruction-following model, please refer to the corresponding instruct release.
-
----
 
 ## πŸ— Model Architecture
 
```