OpenMOSS-Team/MOSS-VL-Base-0408


CCCCyx committed on Apr 8

Commit db9bcfa · verified · 1 Parent(s): 961831a

Update README.md

Files changed (1)

README.md +1 -1


@@ -45,7 +45,7 @@ Built through four stages of multimodal pretraining only, this checkpoint serves
 ## 🏗 Model Architecture
-**MOSS-VL-Base-0408** adopts a cross-attention-based architecture that decouples visual encoding from cognitive reasoning. Natively supporting interleaved modalities, it provides a flexible multimodal backbone for image and video understanding while preserving a clean foundation for downstream alignment and adaptation.
+**MOSS-VL-Base-0408** adopts a cross-attention-based architecture that decouples visual encoding from cognitive reasoning. Natively supporting interleaved modalities, it provides a multimodal backbone for image and video understanding.
 <p align="center">
     <img src="assets/structure.png" alt="MOSS-VL Architecture" width="90%"/>