Upload moss-video-preview-base
README.md
CHANGED
@@ -41,7 +41,7 @@ This repo contains the **pretrained weights** that are intended to serve as the
 <img src="assets/model_structure.png" width="90%" alt="Model Architecture"/>
 </p>
 
-- **Native
+- **Native Unified Design**: Unlike traditional projection-based models, this architecture provides native, unified support for both image and video streams, ensuring seamless temporal consistency and visual-language decoupling.
 - **Cross-Modal Projector**: Powered by the proprietary `VideoMllamaTextCrossAttention` mechanism, it achieves high-efficiency semantic alignment between temporal visual features and linguistic context.
 - **Unified Spatio-Temporal Encoding**: Aligns video frame sequences with text tokens, providing a robust backbone for long-context multimodal reasoning.
 