findcard12138 committed
Commit b46713f · verified · 1 Parent(s): a71b33d

Upload moss-video-preview-base

Files changed (1): README.md +17 -12
README.md CHANGED
@@ -27,18 +27,23 @@ This repo contains the **pretrained weights** that are intended to serve as the
 - **Offline SFT**: instruction-following and reasoning on full video segments
 - **Real-Time SFT**: low-latency streaming video understanding and response
 
+## 🌟 Key Highlights
 
+- **🧩 First Cross-Attention Base**: A unique foundation model architecture designed for native video-language understanding, moving beyond simple feature concatenation.
+- **🔄 Streaming-Ready Backbone**: The underlying architecture is natively designed to support "Silence-Speak" switching and real-time interruption (requires subsequent Real-Time SFT).
+- **⚡ Extreme Efficiency**: Optimized for **Flash Attention 2** and compatible with **NPU/CUDA** platforms, providing a high-throughput starting point for long-video research.
 
 #### Model Architecture
 
-MOSS-Video-Preview is built on a **Llama-3.2-Vision** multimodal backbone with native support for **video / image + text**:
+**MOSS-Video-Preview-Base** is the foundational checkpoint of the series, featuring a **Pioneering Image-Video Isomorphic Cross-Attention Architecture**:
 
 <p align="center">
 <img src="assets/model_structure.png" width="90%" alt="Model Architecture"/>
 </p>
 
-- **Multimodal projector + LLM**: maps visual features into the language model space for generation.
-- **Unified spatio-temporal position encoding**: aligns video frame order and text tokens for long-context multimodal reasoning.
+- **Native Isomorphic Design**: Unlike traditional projection-based models, this architecture provides native, unified support for both image and video streams, ensuring seamless temporal consistency and visual-language decoupling.
+- **Cross-Modal Projector**: Powered by the proprietary `VideoMllamaTextCrossAttention` mechanism, it achieves high-efficiency semantic alignment between temporal visual features and linguistic context.
+- **Unified Spatio-Temporal Encoding**: Aligns video frame sequences with text tokens, providing a robust backbone for long-context multimodal reasoning.
 
 For architecture diagrams and full system details, see the top-level repository: [fnlp-vision/MOSS-Video-Preview](https://github.com/fnlp-vision/MOSS-Video-Preview).
 
@@ -150,17 +155,18 @@ print(processor.decode(output_ids[0], skip_special_tokens=True))
 
 </details>
 
-## ✅ Intended use
+## ✅ Intended Use
 
-- **Foundation checkpoint**: continue pretraining, run domain adaptation, or perform supervised fine-tuning (offline SFT / Real-Time SFT).
-- **System plumbing validation**: test multimodal IO, temporal position encoding, and long-context behavior.
-- **If you want instruction-following quality**: use `models/moss-video-sft` or `models/moss-video-realtime-sft` instead of this base checkpoint.
+- **Research Foundation**: An ideal starting point for researchers focusing on **Representation Learning** or **Model Efficiency** in video understanding.
+- **SFT Starting Point**: The recommended backbone for training your own **Offline SFT** or **Real-Time Streaming** variants.
+- **Architecture Exploration**: Test new multimodal alignment techniques, temporal encodings, or domain-specific adaptation.
 
-## ⚠️ Limitations
+## ⚠️ Limitations & Future Outlook
 
-- **Not instruction-tuned**: as a pretrain-only checkpoint, responses may be less aligned/helpful than SFT variants.
-- **Real-Time streaming not supported by default**: streaming generation APIs are typically provided by Real-Time SFT checkpoints.
-- **Performance is hardware/config dependent**: enabling FlashAttention 2 and using `bfloat16` on modern GPUs generally improves throughput and memory efficiency.
+- **Base Model Nature**: This checkpoint is **pretrained only** and has not undergone instruction tuning. It may generate repetitive text or fail to follow complex instructions without further SFT.
+- **Performance Benchmarking**: While leading in real-time architectural innovation, a performance gap still exists compared to top-tier models like **Qwen2.5-VL**. Closing this gap is the core focus of our ongoing iterations.
+- **Scalable Distributed Training**: The current training pipeline is optimized for architectural validation. We are migrating to the **Megatron-LM framework** to leverage **3D parallelism (Tensor, Pipeline, and Data Parallelism)** for larger-scale pre-training.
+- **Open-Source Commitment**: In the next major release, we will officially open-source the **complete training codebase (integrated with Megatron-LM)** and more diverse datasets to the community.
 
 ## 🧩 Requirements
 
@@ -202,4 +208,3 @@ For full environment setup (including optional FlashAttention2 extras), see the
 note = {GitHub repository}
 }
 ```
-
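The README advises enabling FlashAttention 2 and `bfloat16` on modern GPUs for better throughput and memory efficiency. A minimal sketch of how a user might apply that advice when loading the checkpoint follows; the repo id, the `AutoModelForVision2Seq` class choice, and the fallback logic are assumptions for illustration, not taken from the model card:

```python
# Illustrative sketch only: applying the README's FlashAttention 2 + bfloat16
# advice. Repo id and model class below are assumptions -- follow the model
# card's own quickstart for the authoritative loading code.

def build_load_kwargs(has_flash_attn: bool, supports_bf16: bool) -> dict:
    """Choose dtype/attention settings, falling back gracefully on older hardware."""
    return {
        # bfloat16 on modern GPUs generally improves throughput and memory use.
        "torch_dtype": "bfloat16" if supports_bf16 else "float16",
        # FlashAttention 2 requires the optional flash-attn extra;
        # "sdpa" is the safe fallback that ships with PyTorch.
        "attn_implementation": "flash_attention_2" if has_flash_attn else "sdpa",
    }

if __name__ == "__main__":
    from transformers import AutoModelForVision2Seq, AutoProcessor  # assumed classes

    repo_id = "fnlp-vision/moss-video-preview-base"  # assumed repo id
    model = AutoModelForVision2Seq.from_pretrained(
        repo_id, device_map="auto", **build_load_kwargs(True, True)
    )
    processor = AutoProcessor.from_pretrained(repo_id)
```

On hardware without the flash-attn extra installed, passing `attn_implementation="flash_attention_2"` raises at load time, which is why the sketch gates it behind a flag.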
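The "Unified Spatio-Temporal Encoding" bullet says video frame order and text tokens share one position scheme. A toy illustration of that idea follows; this is NOT the model's actual encoding (which is defined in the fnlp-vision/MOSS-Video-Preview repository), only a sketch of how a shared, monotone time axis can cover both modalities:

```python
# Toy illustration of a unified spatio-temporal position scheme -- an
# assumption for exposition, not the encoding MOSS-Video-Preview uses.
# Each video token gets a (frame, patch) pair; text tokens continue the
# same time axis after the last frame, so ordering is globally monotone.

def unified_positions(num_frames: int, patches_per_frame: int, num_text_tokens: int):
    """Return (modality, time, space) triples: video tokens first, then text."""
    positions = [
        ("video", t, s)
        for t in range(num_frames)
        for s in range(patches_per_frame)
    ]
    # Text continues the time axis past the final frame index.
    positions += [("text", num_frames + i, 0) for i in range(num_text_tokens)]
    return positions

pos = unified_positions(num_frames=2, patches_per_frame=3, num_text_tokens=2)
# The time coordinate never decreases across the interleaved sequence.
assert all(a[1] <= b[1] for a, b in zip(pos, pos[1:]))
```

The point of the sketch is the invariant checked at the end: a single non-decreasing time coordinate lets the language model attend over frames and text with consistent long-context ordering.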