findcard12138 committed (verified)
Commit: b46bdc1 · Parent: 71d6374
Message: Upload README.md with huggingface_hub

Files changed (1): README.md (+59, −3)

README.md: the previous revision contained only the YAML front matter (`license: apache-2.0`); the new content follows.
# MOSS-Video-Preview-Base 🤗

`moss-video-preview-base` is the **base checkpoint after pretraining** in the MOSS-Video-Preview series. It provides the core **video-capable MLLama-style** multimodal capabilities (video/image + text) and serves as the foundation for downstream SFT and real-time optimized variants.

## 🚀 Training Stages

The training process for this model consists of three key stages:

### Stage 1: Vision-Language Alignment (PT1)

- **Objective**: Establish initial alignment between visual features and the language model, enabling basic visual understanding of video frames.
- **Configuration**:
  - **Frozen Parameters**: Language model (LLM) and vision tower.
  - **Trainable Parameters**: Vision projector.
- **Data**: Large-scale image-text pairs and short video clips.
- **Key Feature**: Introduces `mllama_add_video_position_encoding` to provide temporal position information for video frames.

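The PT1 freezing scheme described above can be sketched as follows. This is a hypothetical illustration, not code from the repository: the parameter-name prefixes (`language_model`, `vision_tower`, `vision_projector`) and the toy module are assumptions standing in for a real `torch.nn.Module`.

```python
# Hypothetical sketch of the PT1 freezing scheme: only the vision projector
# receives gradient updates. Names and the toy module are assumptions.

class _Param:
    def __init__(self):
        self.requires_grad = True

class _ToyModel:
    """Stand-in exposing the same named_parameters() interface as a torch module."""
    def __init__(self):
        self._params = {
            "language_model.layers.0.self_attn.q_proj.weight": _Param(),
            "vision_tower.patch_embed.weight": _Param(),
            "vision_projector.linear_1.weight": _Param(),
        }

    def named_parameters(self):
        return self._params.items()

def set_stage1_trainable(model):
    """Freeze the LLM and vision tower; leave only the vision projector trainable."""
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith("vision_projector")

model = _ToyModel()
set_stage1_trainable(model)
trainable = sorted(n for n, p in model.named_parameters() if p.requires_grad)
```

With a real checkpoint the same loop would run over the actual module's `named_parameters()`; only the projector weights would remain trainable.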
### Stage 2: Full Spatio-Temporal Pretraining (PT2)

- **Objective**: Enhance the model's understanding of long videos and complex temporal relationships.
- **Configuration**:
  - **Method**: Full-parameter fine-tuning.
  - **Trainable Parameters**: All modules (vision tower, projector, and LLM) are unfrozen.
- **Data**: Video data with longer durations (supporting 256+ frames).
- **Key Feature**: Uses `mllama_use_full_attn` to enable full attention, improving cross-frame modeling.

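As a rough configuration sketch, PT2 flips the Stage 1 freezing off and enables full attention. Only `mllama_use_full_attn` is named in this README; every other key below is an assumed, illustrative name, not a documented option.

```python
# Hypothetical PT2 configuration sketch. Only mllama_use_full_attn appears in
# this README; the remaining keys are assumptions for illustration.
pt2_config = {
    "freeze_language_model": False,   # all modules unfrozen in PT2
    "freeze_vision_tower": False,
    "freeze_vision_projector": False,
    "mllama_use_full_attn": True,     # full attention for cross-frame modeling
    "max_video_frames": 256,          # longer-duration video data
}
```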
### Stage 3: Supervised Fine-Tuning (SFT)

- **Objective**: Enable the model to follow complex instructions for real-time streaming video dialogue and task processing.
- **Configuration**:
  - **Template**: Uses the `mllama` instruction template.
- **Data**: High-quality video instruction-following datasets (e.g., real-time description, action recognition, video Q&A).
- **Optimization**: Optimized for streaming inference to produce coherent textual responses with low latency.

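A conversation-style layout commonly used for video instruction-tuning data might look like the following. This is an illustrative shape only: it is not a sample from the actual SFT datasets and not necessarily the exact `mllama` template.

```python
# Illustrative video instruction-tuning sample (shape and path are assumed;
# not taken from the actual datasets or the exact mllama template).
sft_sample = {
    "video": "example_clip.mp4",   # hypothetical path
    "conversations": [
        {"role": "user",
         "content": "<video>\nDescribe what is happening as it unfolds."},
        {"role": "assistant",
         "content": "A person enters the kitchen and begins chopping vegetables."},
    ],
}
```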
## 🛠️ Key Technical Features

- **Native Streaming Architecture**: Supports continuous input and processing of video frames rather than discrete frame sampling.
- **Unified Position Encoding**: A shared synchronization mechanism for position encoding across the visual and textual modalities.
- **Efficient Pooling Strategy**: Employs `average` pooling with `stride=4` to balance computational efficiency and feature preservation.
- **Flash Attention 2**: Full support for FA2 acceleration to optimize memory usage during long-sequence training.

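The `average` pooling with `stride=4` can be illustrated on a 1-D sequence. This is a minimal sketch assuming non-overlapping windows over scalar features; the real model pools higher-dimensional visual token embeddings, but the length reduction is the same.

```python
def average_pool_tokens(tokens, stride=4):
    """Non-overlapping average pooling over a 1-D feature sequence,
    reducing its length by a factor of `stride`."""
    return [
        sum(tokens[i:i + stride]) / stride
        for i in range(0, len(tokens) - stride + 1, stride)
    ]

# 16 visual "tokens" (scalars for illustration) -> 4 pooled features
pooled = average_pool_tokens(list(range(16)), stride=4)
print(pooled)  # [1.5, 5.5, 9.5, 13.5]
```

A stride of 4 thus cuts the visual token count to a quarter before the tokens reach the language model, which is where the efficiency/fidelity trade-off mentioned above comes from.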
## 🏗️ Model Architecture

The architecture of MOSS-Video-Preview is designed for scalability and efficiency in processing multimodal temporal data. For more detailed information, please refer to the official repository: [fnlp-vision/MOSS-Video-Preview](https://github.com/fnlp-vision/MOSS-Video-Preview)

## 🛠️ Usage

This repository provides inference entry points under `inference/`. For end-to-end usage examples and detailed instructions, please refer to the official repository:

- [fnlp-vision/MOSS-Video-Preview](https://github.com/fnlp-vision/MOSS-Video-Preview)

## ⚠️ Notes

- This is a **base** model directory. Quality and latency characteristics (offline SFT, real-time streaming, etc.) depend on the specific fine-tuned checkpoints and inference pipeline.
- The Python source files in this directory are referenced via `auto_map` in `config.json`, so `trust_remote_code=True` is typically required when loading from this local folder.
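The `trust_remote_code` requirement can be sketched as below. The local path, dtype choice, and `AutoModel` class are assumptions; check the `auto_map` entry in `config.json` for the actual classes this folder exposes.

```python
# Hedged loading sketch. Because config.json maps model classes to the Python
# files shipped in this folder via auto_map, loading must opt in to custom code.
load_kwargs = {
    "trust_remote_code": True,   # required: auto_map points at local .py files
    "torch_dtype": "bfloat16",   # assumption: a common dtype choice, not confirmed here
}

# Requires the checkpoint files and the `transformers` package (path illustrative):
# from transformers import AutoModel
# model = AutoModel.from_pretrained("./moss-video-preview-base", **load_kwargs)
```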