OpenMOSS-Team/MOSS-VL-Base-0408

Tags: Video-Text-to-Text, feature-extraction, Video-Understanding, Image-Understanding, vision-language


CCCCyx committed on Apr 8

Commit df5bb20 · verified · 1 Parent(s): 3411f0a

Update README.md

Files changed (1)

README.md +1 -1

@@ -41,7 +41,7 @@ Specifically, the pretraining pipeline is structured into the following four pro
 ### ✨ Highlights
 - 📐 **Native Dynamic Resolution** MOSS-VL-Base-0408 natively processes images and video frames at their original aspect ratios and resolutions. By preserving the raw spatial layout, it faithfully captures fine visual details across diverse formats—from high-resolution photographs and dense document scans to ultra-wide screenshots.
-- 🎞️ **Native Interleaved Image & Video Inputs** The model accepts arbitrary combinations of images and videos within a single sequence. Through a unified end-to-end pipeline, it seamlessly handles complex mixed-modality prompts, multi-image comparisons, and interleaved visual narratives without requiring modality-specific pre-processing or separate routing logic.
+- 🎞️ **Native Interleaved Image & Video Inputs** The model accepts arbitrary combinations of images and videos within a single sequence. Through a unified end-to-end pipeline, it seamlessly handles complex mixed-modality prompts, multi-image comparisons, and interleaved visual narratives without requiring modality-specific pre-processing.
 ## 🏗 Model Architecture