CCCCyx committed (verified)
Commit 82b620c · 1 Parent(s): f0dde1e

Update README.md

Files changed (1):
  1. README.md (+2, -8)
README.md CHANGED

```diff
@@ -27,9 +27,9 @@ tags:
 
 ## πŸ“Œ Introduction
 
-MOSS-VL-Base-0408 is the foundation checkpoint of the MOSS-VL series, part of the OpenMOSS ecosystem dedicated to advancing open multimodal foundation models.
+MOSS-VL-Base-0408 is the foundation checkpoint of the MOSS-VL series, part of the OpenMOSS ecosystem dedicated to advancing visual understanding.
 
-Built through four stages of multimodal pretraining only, this checkpoint serves as a high-capacity offline multimodal foundation model. It provides strong general-purpose visual-linguistic representations across image and video inputs, and is intended primarily as the base model for downstream supervised fine-tuning, alignment, and domain adaptation:
+Built through four stages of multimodal pretraining only, this checkpoint serves as a high-capacity offline multimodal foundation model. It provides strong general-purpose visual-linguistic representations across image and video inputs, and is intended primarily as the base model for downstream supervised fine-tuning, alignment, and domain adaptation
 
 1. Stage 1: Vision-language alignment
 2. Stage 2: Large-scale multimodal pretraining
@@ -42,12 +42,6 @@ Built through four stages of multimodal pretraining only, this checkpoint serves
 
 - πŸ–ΌοΈ **Strong General Multimodal Perception** β€” Covers single-image, multi-image, and mixed-modality offline understanding workloads.
 - 🧱 **Robust Base for Adaptation** β€” Serves as the pretrained backbone for future SFT, alignment, and task-specific adaptation.
 
-### πŸ“ Note on Variants
-
-> [!IMPORTANT]
-> **This is the base checkpoint.** It has **NOT** undergone supervised fine-tuning (SFT) or instruction tuning, and it is not the streaming variant. If you are looking for a user-facing instruction-following model, please refer to the corresponding instruct release.
-
----
 
 ## πŸ— Model Architecture
 
```