# MOSS-Video-Preview-Base 🤗

`moss-video-preview-base` is the **base checkpoint after pretraining** in the MOSS-Video-Preview series. It provides the core **video-capable, MLLama-style** multimodal capabilities (video/image + text) and serves as the foundation for downstream SFT and real-time optimized variants.

## 🚀 Training Stages

The training process for this model consists of three key stages:

### Stage 1: Vision-Language Alignment (PT1)

- **Objective**: Establish initial alignment between visual features and the language model, enabling basic visual understanding of video frames.
- **Configuration**:
  - **Frozen Parameters**: Language Model (LLM) and Vision Tower.
  - **Trainable Parameters**: Vision Projector.
  - **Data**: Large-scale image-text pairs and short video clips.
- **Key Feature**: Introduces `mllama_add_video_position_encoding` to provide temporal position information for video frames.
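The stage-1 freezing scheme can be sketched in PyTorch. The module names below (`vision_tower`, `projector`, `language_model`) and the layer shapes are illustrative placeholders, not the model's actual attributes:

```python
import torch.nn as nn

# Hypothetical stand-in modules; real names and shapes come from the
# actual model implementation, not from this README.
class ToyVLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.vision_tower = nn.Linear(768, 768)
        self.projector = nn.Linear(768, 1024)
        self.language_model = nn.Linear(1024, 1024)

model = ToyVLM()

# Stage 1 (PT1): freeze the vision tower and the LLM; train only the projector.
for module in (model.vision_tower, model.language_model):
    for p in module.parameters():
        p.requires_grad = False

trainable = sorted(n for n, p in model.named_parameters() if p.requires_grad)
print(trainable)  # only the projector's parameters remain trainable
```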

### Stage 2: Full Spatio-Temporal Pretraining (PT2)

- **Objective**: Enhance the model's understanding of long videos and complex temporal relationships.
- **Configuration**:
  - **Method**: Full-parameter fine-tuning.
  - **Trainable Parameters**: All modules (Vision Tower, Projector, and LLM) are unfrozen.
  - **Data**: Video data with longer durations (supporting 256+ frames).
- **Key Feature**: Uses `mllama_use_full_attn` to enable full attention, improving cross-frame modeling.
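The difference that full attention makes can be illustrated with attention masks. This is a sketch of the general idea, not the actual implementation behind `mllama_use_full_attn`: per-frame (block-diagonal) attention restricts each token to its own frame, while full attention lets every token attend across all frames.

```python
import torch

# Toy sizes, chosen for illustration only.
num_frames, tokens_per_frame = 4, 3
seq_len = num_frames * tokens_per_frame

frame_ids = torch.arange(seq_len) // tokens_per_frame  # frame index per token

# Block-diagonal mask: each token attends only within its own frame.
per_frame_mask = frame_ids[:, None] == frame_ids[None, :]

# Full-attention mask: every token attends to every token in every frame,
# which is what enables cross-frame modeling.
full_mask = torch.ones(seq_len, seq_len, dtype=torch.bool)

print(per_frame_mask.sum().item(), full_mask.sum().item())
```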

### Stage 3: Supervised Fine-Tuning (SFT)

- **Objective**: Enable the model to follow complex instructions for real-time streaming video dialogue and task processing.
- **Configuration**:
  - **Template**: Uses the `mllama` instruction template.
  - **Data**: High-quality video instruction-following datasets (e.g., real-time description, action recognition, video Q&A).
  - **Optimization**: Optimized for streaming inference to produce coherent textual responses with low latency.

## 🛠️ Key Technical Features

- **Native Streaming Architecture**: Supports continuous input and processing of video frames rather than discrete frame sampling.
- **Unified Position Encoding**: A shared synchronization mechanism for position encoding across both visual and textual modalities.
- **Efficient Pooling Strategy**: Employs `average` pooling with `stride=4` to balance computational efficiency and feature preservation.
- **Flash Attention 2**: Full support for FA2 acceleration to optimize memory usage during long-sequence training.
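The pooling strategy can be sketched as follows. The tensor shapes are illustrative assumptions, not the model's real dimensions; the point is that average pooling with `stride=4` cuts the visual token sequence length by 4x:

```python
import torch
import torch.nn.functional as F

# Illustrative shape: (batch, hidden_dim, num_visual_tokens),
# the layout expected by avg_pool1d.
visual_tokens = torch.randn(1, 1024, 64)

# Average pooling with kernel/stride 4 along the token dimension:
# 64 tokens are reduced to 16, trading resolution for efficiency
# while preserving mean feature content.
pooled = F.avg_pool1d(visual_tokens, kernel_size=4, stride=4)
print(pooled.shape)
```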

## 🏗️ Model Architecture

The architecture of MOSS-Video-Preview is designed for scalable, efficient processing of multimodal temporal data. For more detailed information, please refer to the official repository: [fnlp-vision/MOSS-Video-Preview](https://github.com/fnlp-vision/MOSS-Video-Preview)

## 📥 Usage

This repository provides inference entry points under `inference/`. For end-to-end usage examples and detailed instructions, please refer to the official repository:

- [fnlp-vision/MOSS-Video-Preview](https://github.com/fnlp-vision/MOSS-Video-Preview)

## ⚠️ Notes

- This is a **base** model checkpoint. Quality and latency characteristics (offline SFT, real-time streaming, etc.) depend on the specific fine-tuned checkpoints and inference pipeline.
- The Python source files in this directory are referenced via `auto_map` in `config.json`, so `trust_remote_code=True` is typically required when loading from this local folder.
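Given the `auto_map` note above, loading from this folder might look like the following sketch. The local path and the use of the generic `AutoModel`/`AutoTokenizer` classes are assumptions; check the official repository for the exact entry point and processor:

```python
from transformers import AutoModel, AutoTokenizer

# Hypothetical local path; point this at the directory containing config.json.
model_path = "./moss-video-preview-base"

# config.json's auto_map references Python files shipped in this folder,
# so trust_remote_code=True is typically required.
model = AutoModel.from_pretrained(model_path, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
```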