findcard12138 committed on
Commit 2d66ef2 · verified · 1 Parent(s): b46713f

Upload moss-video-preview-base

Files changed (1)
  1. README.md +1 -1
README.md CHANGED
@@ -41,7 +41,7 @@ This repo contains the **pretrained weights** that are intended to serve as the
  <img src="assets/model_structure.png" width="90%" alt="Model Architecture"/>
  </p>
 
- - **Native Isomorphic Design**: Unlike traditional projection-based models, this architecture provides native, unified support for both image and video streams, ensuring seamless temporal consistency and visual-language decoupling.
+ - **Native Unified Design**: Unlike traditional projection-based models, this architecture provides native, unified support for both image and video streams, ensuring seamless temporal consistency and visual-language decoupling.
  - **Cross-Modal Projector**: Powered by the proprietary `VideoMllamaTextCrossAttention` mechanism, it achieves high-efficiency semantic alignment between temporal visual features and linguistic context.
  - **Unified Spatio-Temporal Encoding**: Aligns video frame sequences with text tokens, providing a robust backbone for long-context multimodal reasoning.
 