# MOSS-Video-Preview-Base 🤗

`moss-video-preview-base` is the **base checkpoint after pretraining** in the MOSS-Video-Preview series. It provides the core **video-capable, MLLama-style** multimodal capabilities (video/image + text) and serves as the foundation for downstream SFT and real-time optimized variants.

## 🚀 Training Stages

The training process for this model consists of three key stages:

### Stage 1: Vision-Language Alignment (PT1)

- **Objective**: Establish initial alignment between visual features and the language model, enabling basic visual understanding of video frames.
- **Configuration**:
  - **Frozen Parameters**: Language Model (LLM) and Vision Tower.
  - **Trainable Parameters**: Vision Projector.
  - **Data**: Large-scale image-text pairs and short video clips.
- **Key Feature**: Introduces `mllama_add_video_position_encoding` to provide temporal position information for video frames.
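The stage-1 freezing scheme can be sketched in PyTorch. The module names below (`vision_tower`, `projector`, `language_model`) and the layer shapes are illustrative placeholders, not the model's actual attributes:

```python
import torch.nn as nn

# Hypothetical stand-in modules; real names and shapes come from the
# actual model implementation, not from this README.
class ToyVLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.vision_tower = nn.Linear(768, 768)
        self.projector = nn.Linear(768, 1024)
        self.language_model = nn.Linear(1024, 1024)

model = ToyVLM()

# Stage 1 (PT1): freeze the vision tower and the LLM; train only the projector.
for module in (model.vision_tower, model.language_model):
    for p in module.parameters():
        p.requires_grad = False

trainable = sorted(n for n, p in model.named_parameters() if p.requires_grad)
print(trainable)  # only the projector's parameters remain trainable
```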

### Stage 2: Full Spatio-Temporal Pretraining (PT2)

- **Objective**: Enhance the model's understanding of long videos and complex temporal relationships.
- **Configuration**:
  - **Method**: Full-parameter fine-tuning.
  - **Trainable Parameters**: All modules (Vision Tower, Projector, and LLM) are unfrozen.
  - **Data**: Video data with longer durations (supporting 256+ frames).
- **Key Feature**: Uses `mllama_use_full_attn` to enable full attention, improving cross-frame modeling.
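The difference that full attention makes can be illustrated with attention masks. This is a sketch of the general idea, not the actual implementation behind `mllama_use_full_attn`: per-frame (block-diagonal) attention restricts each token to its own frame, while full attention lets every token attend across all frames.

```python
import torch

# Toy sizes, chosen for illustration only.
num_frames, tokens_per_frame = 4, 3
seq_len = num_frames * tokens_per_frame

frame_ids = torch.arange(seq_len) // tokens_per_frame  # frame index per token

# Block-diagonal mask: each token attends only within its own frame.
per_frame_mask = frame_ids[:, None] == frame_ids[None, :]

# Full-attention mask: every token attends to every token in every frame,
# which is what enables cross-frame modeling.
full_mask = torch.ones(seq_len, seq_len, dtype=torch.bool)

print(per_frame_mask.sum().item(), full_mask.sum().item())
```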

### Stage 3: Supervised Fine-Tuning (SFT)

- **Objective**: Enable the model to follow complex instructions for real-time streaming video dialogue and task processing.
- **Configuration**:
  - **Template**: Uses the `mllama` instruction template.
  - **Data**: High-quality video instruction-following datasets (e.g., real-time description, action recognition, video Q&A).
  - **Optimization**: Optimized for streaming inference to produce coherent textual responses with low latency.

## 🛠️ Key Technical Features

- **Native Streaming Architecture**: Supports continuous input and processing of video frames rather than discrete frame sampling.
- **Unified Position Encoding**: A shared synchronization mechanism for position encoding across both visual and textual modalities.
- **Efficient Pooling Strategy**: Employs `average` pooling with `stride=4` to balance computational efficiency and feature preservation.
- **Flash Attention 2**: Full support for FA2 acceleration to optimize memory usage during long-sequence training.
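The pooling strategy can be sketched as follows. The tensor shapes are illustrative assumptions, not the model's real dimensions; the point is that average pooling with `stride=4` cuts the visual token sequence length by 4x:

```python
import torch
import torch.nn.functional as F

# Illustrative shape: (batch, hidden_dim, num_visual_tokens),
# the layout expected by avg_pool1d.
visual_tokens = torch.randn(1, 1024, 64)

# Average pooling with kernel/stride 4 along the token dimension:
# 64 tokens are reduced to 16, trading resolution for efficiency
# while preserving mean feature content.
pooled = F.avg_pool1d(visual_tokens, kernel_size=4, stride=4)
print(pooled.shape)
```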

## 🏗️ Model Architecture

The architecture of MOSS-Video-Preview is designed for scalable, efficient processing of multimodal temporal data. For more detailed information, please refer to the official repository: [fnlp-vision/MOSS-Video-Preview](https://github.com/fnlp-vision/MOSS-Video-Preview)

## 📥 Usage

This repository provides inference entry points under `inference/`. For end-to-end usage examples and detailed instructions, please refer to the official repository:

- [fnlp-vision/MOSS-Video-Preview](https://github.com/fnlp-vision/MOSS-Video-Preview)

## ⚠️ Notes

- This is a **base** model checkpoint. Quality and latency characteristics (offline SFT, real-time streaming, etc.) depend on the specific fine-tuned checkpoints and inference pipeline.
- The Python source files in this directory are referenced via `auto_map` in `config.json`, so `trust_remote_code=True` is typically required when loading from this local folder.
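Given the `auto_map` note above, loading from this folder might look like the following sketch. The local path and the use of the generic `AutoModel`/`AutoTokenizer` classes are assumptions; check the official repository for the exact entry point and processor:

```python
from transformers import AutoModel, AutoTokenizer

# Hypothetical local path; point this at the directory containing config.json.
model_path = "./moss-video-preview-base"

# config.json's auto_map references Python files shipped in this folder,
# so trust_remote_code=True is typically required.
model = AutoModel.from_pretrained(model_path, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
```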