OpenMOSS-Team/moss-video-preview-base

Video-Text-to-Text

text-generation

vision-language

text-generation-inference


findcard12138 committed 21 days ago

Commit 2d66ef2 · verified · 1 Parent(s): b46713f

Upload moss-video-preview-base

Files changed (1)

README.md +1 -1


@@ -41,7 +41,7 @@ This repo contains the **pretrained weights** that are intended to serve as the
   <img src="assets/model_structure.png" width="90%" alt="Model Architecture"/>
 </p>
-- **Native Isomorphic Design**: Unlike traditional projection-based models, this architecture provides native, unified support for both image and video streams, ensuring seamless temporal consistency and visual-language decoupling.
+- **Native Unified Design**: Unlike traditional projection-based models, this architecture provides native, unified support for both image and video streams, ensuring seamless temporal consistency and visual-language decoupling.
 - **Cross-Modal Projector**: Powered by the proprietary `VideoMllamaTextCrossAttention` mechanism, it achieves high-efficiency semantic alignment between temporal visual features and linguistic context.
 - **Unified Spatio-Temporal Encoding**: Aligns video frame sequences with text tokens, providing a robust backbone for long-context multimodal reasoning.