Video-Text-to-Text
Transformers
Safetensors
English
moss_vl
feature-extraction
Base
Video-Understanding
Image-Understanding
MOSS-VL
OpenMOSS
multimodal
video
vision-language
custom_code
Instructions to use OpenMOSS-Team/MOSS-VL-Base-0408 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use OpenMOSS-Team/MOSS-VL-Base-0408 with Transformers:
# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("OpenMOSS-Team/MOSS-VL-Base-0408", trust_remote_code=True, dtype="auto")
- Notebooks
- Google Colab
- Kaggle
Update README.md
README.md
CHANGED
@@ -45,7 +45,7 @@ Built through four stages of multimodal pretraining only, this checkpoint serves
 
 ## 🏗 Model Architecture
 
-**MOSS-VL-Base-0408** adopts a cross-attention-based architecture that decouples visual encoding from cognitive reasoning. Natively supporting interleaved modalities, it provides a
+**MOSS-VL-Base-0408** adopts a cross-attention-based architecture that decouples visual encoding from cognitive reasoning. Natively supporting interleaved modalities, it provides a multimodal backbone for image and video understanding.
 
 <p align="center">
 <img src="assets/structure.png" alt="MOSS-VL Architecture" width="90%"/>
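
For readers unfamiliar with the term, the "cross-attention-based architecture" named in the changed line can be illustrated with a generic sketch. This is not MOSS-VL's implementation: the function names `softmax` and `cross_attention`, the single-head form, the absence of learned projections, and the toy vectors are all assumptions made for illustration. It only shows the general idea that text-side states act as queries over separately encoded visual features.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(text_states, visual_feats):
    """Single-head scaled dot-product cross-attention (hypothetical simplification).

    Each text state is a query; the visual features serve as both keys and
    values. No learned projection matrices are applied.
    """
    dim = len(visual_feats[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for query in text_states:
        # Similarity between this text query and every visual feature.
        scores = [scale * sum(q * k for q, k in zip(query, key))
                  for key in visual_feats]
        weights = softmax(scores)
        # Weighted average of the visual features.
        attended = [sum(w * feat[i] for w, feat in zip(weights, visual_feats))
                    for i in range(dim)]
        outputs.append(attended)
    return outputs


# Toy inputs: two text states and three visual features, all 2-dimensional.
text_states = [[1.0, 0.0], [0.0, 1.0]]
visual_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused = cross_attention(text_states, visual_feats)
```

In this toy setup the visual features are computed independently of the text states and are only read through the attention weights, which is one way to picture visual encoding being decoupled from the reasoning pathway.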