OpenMOSS-Team/MOSS-VL-Base-0408

Video-Text-to-Text

feature-extraction

Video-Understanding

Image-Understanding

vision-language

CCCCyx committed on Apr 8

Commit 22a5abe · verified · 1 Parent(s): 7379a34

Update README.md

Files changed (1)

README.md +2 -1

@@ -29,7 +29,8 @@ tags:
 MOSS-VL-Base-0408 is the foundation checkpoint of the MOSS-VL series, part of the OpenMOSS ecosystem dedicated to advancing visual understanding.
-Built through four stages of multimodal pretraining only, this checkpoint serves as a high-capacity offline multimodal base model. It provides strong general-purpose visual-linguistic representations across image and video inputs, and is intended primarily as the base model for downstream supervised fine-tuning, alignment, and domain adaptation. Specifically, the pretraining pipeline is structured into the following four progressive stages:
 - Stage 1: Vision-language alignment
 - Stage 2: Large-scale multimodal pretraining

 MOSS-VL-Base-0408 is the foundation checkpoint of the MOSS-VL series, part of the OpenMOSS ecosystem dedicated to advancing visual understanding.
+Built through four stages of multimodal pretraining only, this checkpoint serves as a high-capacity offline multimodal base model. It provides strong general-purpose visual-linguistic representations across image and video inputs, and is intended primarily as the base model for downstream supervised fine-tuning, alignment, and domain adaptation.
+Specifically, the pretraining pipeline is structured into the following four progressive stages:
 - Stage 1: Vision-language alignment
 - Stage 2: Large-scale multimodal pretraining