Upload moss-video-preview-base
README.md (changed)
- **Offline SFT**: instruction-following and reasoning on full video segments
- **Real-Time SFT**: low-latency streaming video understanding and response

## 🌟 Key Highlights

- **🧩 First Cross-Attention Base**: A unique foundation model architecture designed for native video-language understanding, moving beyond simple feature concatenation.
- **🔄 Streaming-Ready Backbone**: The underlying architecture is natively designed to support "Silence-Speak" switching and real-time interruption (requires subsequent Real-Time SFT).
- **⚡ Extreme Efficiency**: Optimized for **Flash Attention 2** and compatible with **NPU/CUDA** platforms, providing a high-throughput starting point for long-video research.

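The "Silence-Speak" switching named in the highlights can be illustrated with a toy controller. Everything below (class name, threshold, scores) is hypothetical and for intuition only; the released checkpoint is a backbone, and real switching behavior only emerges after Real-Time SFT.

```python
# Toy sketch of "Silence-Speak" switching for a streaming assistant.
# All names and thresholds are hypothetical, not part of any released API.

class SilenceSpeakController:
    def __init__(self, speak_threshold: float = 0.7):
        self.speak_threshold = speak_threshold
        self.state = "silence"

    def step(self, speak_score: float) -> str:
        """Switch to 'speak' when the per-frame response score crosses the
        threshold; drop back to 'silence' otherwise (models interruption)."""
        self.state = "speak" if speak_score >= self.speak_threshold else "silence"
        return self.state

controller = SilenceSpeakController(speak_threshold=0.7)
states = [controller.step(s) for s in [0.1, 0.4, 0.9, 0.8, 0.2]]
print(states)  # ['silence', 'silence', 'speak', 'speak', 'silence']
```

The point of the sketch is only that the decision is made per incoming frame, which is what lets the backbone interrupt or resume mid-stream.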
#### Model Architecture

**MOSS-Video-Preview-Base** is the foundational checkpoint of the series, featuring a **Pioneering Image-Video Isomorphic Cross-Attention Architecture**:

<p align="center">
<img src="assets/model_structure.png" width="90%" alt="Model Architecture"/>
</p>

- **Native Isomorphic Design**: Unlike traditional projection-based models, this architecture provides native, unified support for both image and video streams, ensuring seamless temporal consistency and visual-language decoupling.
- **Cross-Modal Projector**: Powered by the proprietary `VideoMllamaTextCrossAttention` mechanism, it achieves high-efficiency semantic alignment between temporal visual features and linguistic context.
- **Unified Spatio-Temporal Encoding**: Aligns video frame sequences with text tokens, providing a robust backbone for long-context multimodal reasoning.

For architecture diagrams and full system details, see the top-level repository: [fnlp-vision/MOSS-Video-Preview](https://github.com/fnlp-vision/MOSS-Video-Preview).
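The cross-attention alignment described above can be sketched as a minimal single-head cross-attention in NumPy, where text tokens (queries) attend over video frame features (keys/values). Shapes, names, and the function itself are illustrative assumptions; the actual `VideoMllamaTextCrossAttention` module is defined in the repository code.

```python
import numpy as np

# Illustrative single-head cross-attention: 5 text tokens attend to
# 8 video frame features. Schematic only, not the released module.
rng = np.random.default_rng(0)
d = 16                             # shared hidden size (assumed)
text = rng.normal(size=(5, d))     # text-token queries
video = rng.normal(size=(8, d))    # per-frame visual keys/values

def cross_attention(q: np.ndarray, kv: np.ndarray, d_k: int) -> np.ndarray:
    scores = q @ kv.T / np.sqrt(d_k)                       # (5, 8) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)         # softmax over frames
    return weights @ kv                                    # (5, d) fused features

fused = cross_attention(text, video, d)
print(fused.shape)  # (5, 16)
```

Each output row is a video-conditioned text representation, which is the sense in which cross-attention aligns temporal visual features with linguistic context rather than concatenating them.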

</details>

## ✅ Intended Use

- **Research Foundation**: An ideal starting point for researchers focusing on **Representation Learning** or **Model Efficiency** in video understanding.
- **SFT Starting Point**: The recommended backbone for training your own **Offline SFT** or **Real-Time Streaming** variants.
- **Architecture Exploration**: A testbed for new multimodal alignment techniques, temporal encodings, and domain-specific adaptation.

## ⚠️ Limitations & Future Outlook

- **Base Model Nature**: This checkpoint is **pretrained only** and has not undergone instruction tuning. It may generate repetitive text or fail to follow complex instructions without further SFT.
- **Performance Benchmarking**: While the model leads in real-time architectural innovation, a performance gap remains relative to top-tier models such as **Qwen2.5-VL**; closing this gap is the core focus of our ongoing iterations.
- **Scalable Distributed Training**: The current training pipeline is optimized for architectural validation. We are migrating to the **Megatron-LM framework** to leverage **3D parallelism (Tensor, Pipeline, and Data Parallelism)** for larger-scale pre-training.
- **Open-Source Commitment**: In the next major release, we will officially open-source the **complete training codebase (integrated with Megatron-LM)** and release more diverse datasets to the community.
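The 3D-parallel decomposition mentioned above constrains how devices are assigned: the world size must factor into the tensor- (TP), pipeline- (PP), and data-parallel (DP) group sizes. The group sizes below are illustrative examples, not our actual training configuration.

```python
# Toy accounting for Megatron-LM-style 3D parallelism:
# world_size == tensor_parallel * pipeline_parallel * data_parallel.
def dp_size(world_size: int, tp: int, pp: int) -> int:
    """Data-parallel replicas left over after TP and PP groups are carved out."""
    assert world_size % (tp * pp) == 0, "TP * PP must divide the world size"
    return world_size // (tp * pp)

# e.g. 64 GPUs with 8-way tensor and 4-way pipeline parallelism
print(dp_size(64, tp=8, pp=4))  # 2
```

In practice TP is usually kept within a node (to stay on fast interconnect) and PP spans nodes, with DP absorbing the remaining devices.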

## 🧩 Requirements

note = {GitHub repository}
}
```