findcard12138 committed
Commit b46713f · verified · 1 Parent(s): a71b33d

Upload moss-video-preview-base

Files changed (1): README.md +17 -12
README.md CHANGED
@@ -27,18 +27,23 @@ This repo contains the **pretrained weights** that are intended to serve as the
 - **Offline SFT**: instruction-following and reasoning on full video segments
 - **Real-Time SFT**: low-latency streaming video understanding and response
 
+## 🌟 Key Highlights
 
+- **🧩 First Cross-Attention Base**: A unique foundation model architecture designed for native video-language understanding, moving beyond simple feature concatenation.
+- **🔄 Streaming-Ready Backbone**: The underlying architecture is natively designed to support "Silence-Speak" switching and real-time interruption (requires subsequent Real-Time SFT).
+- **⚡ Extreme Efficiency**: Optimized for **Flash Attention 2** and compatible with **NPU/CUDA** platforms, providing a high-throughput starting point for long-video research.
 
 #### Model Architecture
 
-MOSS-Video-Preview is built on a **Llama-3.2-Vision** multimodal backbone with native support for **video / image + text**:
+**MOSS-Video-Preview-Base** is the foundational checkpoint of the series, featuring a **Pioneering Image-Video Isomorphic Cross-Attention Architecture**:
 
 <p align="center">
 <img src="assets/model_structure.png" width="90%" alt="Model Architecture"/>
 </p>
 
-- **Multimodal projector + LLM**: maps visual features into the language model space for generation.
-- **Unified spatio-temporal position encoding**: aligns video frame order and text tokens for long-context multimodal reasoning.
+- **Native Isomorphic Design**: Unlike traditional projection-based models, this architecture provides native, unified support for both image and video streams, ensuring seamless temporal consistency and visual-language decoupling.
+- **Cross-Modal Projector**: Powered by the proprietary `VideoMllamaTextCrossAttention` mechanism, it achieves high-efficiency semantic alignment between temporal visual features and linguistic context.
+- **Unified Spatio-Temporal Encoding**: Aligns video frame sequences with text tokens, providing a robust backbone for long-context multimodal reasoning.
 
 For architecture diagrams and full system details, see the top-level repository: [fnlp-vision/MOSS-Video-Preview](https://github.com/fnlp-vision/MOSS-Video-Preview).
 
@@ -150,17 +155,18 @@ print(processor.decode(output_ids[0], skip_special_tokens=True))
 
 </details>
 
-## ✅ Intended use
+## ✅ Intended Use
 
-- **Foundation checkpoint**: continue pretraining, run domain adaptation, or perform supervised fine-tuning (offline SFT / Real-Time SFT).
-- **System plumbing validation**: test multimodal IO, temporal position encoding, and long-context behavior.
-- **If you want instruction-following quality**: use `models/moss-video-sft` or `models/moss-video-realtime-sft` instead of this base checkpoint.
+- **Research Foundation**: An ideal starting point for researchers focusing on **Representation Learning** or **Model Efficiency** in video understanding.
+- **SFT Starting Point**: The recommended backbone for training your own **Offline SFT** or **Real-Time Streaming** variants.
+- **Architecture Exploration**: Test new multimodal alignment techniques, temporal encodings, or domain-specific adaptation.
 
-## ⚠️ Limitations
+## ⚠️ Limitations & Future Outlook
 
-- **Not instruction-tuned**: as a pretrain-only checkpoint, responses may be less aligned/helpful than SFT variants.
-- **Real-Time streaming not supported by default**: streaming generation APIs are typically provided by Real-Time SFT checkpoints.
-- **Performance is hardware/config dependent**: enabling FlashAttention 2 and using `bfloat16` on modern GPUs generally improves throughput and memory efficiency.
+- **Base Model Nature**: This checkpoint is **pretrained only** and has not undergone instruction tuning. It may generate repetitive text or fail to follow complex instructions without further SFT.
+- **Performance Benchmarking**: While leading in real-time architectural innovation, a performance gap still exists compared to top-tier models like **Qwen2.5-VL**. Closing this gap is the core focus of our ongoing iterations.
+- **Scalable Distributed Training**: The current training pipeline is optimized for architectural validation. We are migrating to the **Megatron-LM framework** to leverage **3D parallelism (Tensor, Pipeline, and Data Parallelism)** for larger-scale pre-training.
+- **Open-Source Commitment**: In the next major release, we will officially open-source the **complete training codebase (integrated with Megatron-LM)** and more diverse datasets to the community.
 
 ## 🧩 Requirements
 
@@ -202,4 +208,3 @@ For full environment setup (including optional FlashAttention2 extras), see the
 note = {GitHub repository}
 }
 ```
-
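The README advises enabling FlashAttention 2 and `bfloat16` on modern GPUs for better throughput and memory efficiency. A minimal sketch of how a user might apply that advice when loading the checkpoint follows; the repo id, the `AutoModelForVision2Seq` class choice, and the fallback logic are assumptions for illustration, not taken from the model card:

```python
# Illustrative sketch only: applying the README's FlashAttention 2 + bfloat16
# advice. Repo id and model class below are assumptions -- follow the model
# card's own quickstart for the authoritative loading code.

def build_load_kwargs(has_flash_attn: bool, supports_bf16: bool) -> dict:
    """Choose dtype/attention settings, falling back gracefully on older hardware."""
    return {
        # bfloat16 on modern GPUs generally improves throughput and memory use.
        "torch_dtype": "bfloat16" if supports_bf16 else "float16",
        # FlashAttention 2 requires the optional flash-attn extra;
        # "sdpa" is the safe fallback that ships with PyTorch.
        "attn_implementation": "flash_attention_2" if has_flash_attn else "sdpa",
    }

if __name__ == "__main__":
    from transformers import AutoModelForVision2Seq, AutoProcessor  # assumed classes

    repo_id = "fnlp-vision/moss-video-preview-base"  # assumed repo id
    model = AutoModelForVision2Seq.from_pretrained(
        repo_id, device_map="auto", **build_load_kwargs(True, True)
    )
    processor = AutoProcessor.from_pretrained(repo_id)
```

On hardware without the flash-attn extra installed, passing `attn_implementation="flash_attention_2"` raises at load time, which is why the sketch gates it behind a flag.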
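The "Unified Spatio-Temporal Encoding" bullet says video frame order and text tokens share one position scheme. A toy illustration of that idea follows; this is NOT the model's actual encoding (which is defined in the fnlp-vision/MOSS-Video-Preview repository), only a sketch of how a shared, monotone time axis can cover both modalities:

```python
# Toy illustration of a unified spatio-temporal position scheme -- an
# assumption for exposition, not the encoding MOSS-Video-Preview uses.
# Each video token gets a (frame, patch) pair; text tokens continue the
# same time axis after the last frame, so ordering is globally monotone.

def unified_positions(num_frames: int, patches_per_frame: int, num_text_tokens: int):
    """Return (modality, time, space) triples: video tokens first, then text."""
    positions = [
        ("video", t, s)
        for t in range(num_frames)
        for s in range(patches_per_frame)
    ]
    # Text continues the time axis past the final frame index.
    positions += [("text", num_frames + i, 0) for i in range(num_text_tokens)]
    return positions

pos = unified_positions(num_frames=2, patches_per_frame=3, num_text_tokens=2)
# The time coordinate never decreases across the interleaved sequence.
assert all(a[1] <= b[1] for a, b in zip(pos, pos[1:]))
```

The point of the sketch is the invariant checked at the end: a single non-decreasing time coordinate lets the language model attend over frames and text with consistent long-context ordering.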