Upload moss-video-sft
README.md
CHANGED
@@ -30,7 +30,7 @@ This checkpoint is intended for:
 
 #### Model Architecture
 
-MOSS-Video-Preview is built on a **Llama-3.2-Vision** backbone, featuring a **Pioneering Image-Video
+MOSS-Video-Preview is built on a **Llama-3.2-Vision** backbone, featuring a **Pioneering Image-Video Unified Cross-Attention Architecture**:
 
 - **Native Unified Design**: Unlike traditional projection methods, our architecture provides native, unified support for both image and video understanding, ensuring seamless temporal consistency.
 - **Deep Multimodal Fusion**: Leveraging specialized Cross-Attention mechanisms to achieve high-fidelity alignment between visual temporal features and linguistic context.