findcard12138 committed
Commit de6119f · verified · 1 parent: b737ed8

Upload folder using huggingface_hub

Files changed (3):
  1. .gitattributes +1 -0
  2. README.md +161 -32
  3. assets/model_structure.png +3 -0
.gitattributes CHANGED
@@ -34,3 +34,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
  tokenizer.json filter=lfs diff=lfs merge=lfs -text
+ assets/model_structure.png filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,48 +1,177 @@
  # MOSS-Video-Preview-SFT 🤗

- MOSS-Video-Preview-SFT is a streaming video understanding model developed through two-stage pretraining and Supervised Fine-Tuning (SFT). Based on the Llama-3.2-Vision architecture, it achieves efficient understanding of streaming video by introducing native video processing capabilities and unified spatio-temporal position encoding.
-
- ## 🚀 Training Stages
-
- The training process for this model consists of three key stages:
-
- ### Stage 1: Vision-Language Alignment (PT1)
- - **Objective**: Establish initial alignment between visual features and the language model, enabling basic visual understanding of video frames.
- - **Configuration**:
-   - **Frozen Parameters**: Language Model (LLM) and Vision Tower.
-   - **Trainable Parameters**: Vision Projector.
-   - **Data**: Large-scale image-text pairs and short video clips.
- - **Key Feature**: Introduces `mllama_add_video_position_encoding` to provide temporal position information for video frames.
-
- ### Stage 2: Full Spatio-Temporal Pretraining (PT2)
- - **Objective**: Enhance the model's understanding of long videos and complex temporal relationships.
- - **Configuration**:
-   - **Method**: Full-parameter fine-tuning.
-   - **Trainable Parameters**: All modules (Vision Tower, Projector, and LLM) are unfrozen.
-   - **Data**: Video data with longer durations (supporting 256+ frames).
- - **Key Feature**: Uses `mllama_use_full_attn` to enable full attention mechanisms, improving cross-frame modeling.
-
- ### Stage 3: Supervised Fine-Tuning (SFT)
- - **Objective**: Enable the model to follow complex instructions for real-time streaming video dialogue and task processing.
- - **Configuration**:
-   - **Template**: Uses the `mllama` instruction template.
-   - **Data**: High-quality video instruction-following datasets (e.g., real-time description, action recognition, video Q&A).
- - **Optimization**: Optimized for streaming inference to produce coherent textual responses with low latency.
-
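The Stage 1 freezing recipe above (freeze the LLM and vision tower, train only the projector) can be sketched as follows. This is a toy illustration: `ToyVLM` and its attribute names are hypothetical stand-ins, not the checkpoint's actual module names.

```python
import torch.nn as nn

class ToyVLM(nn.Module):
    """Minimal stand-in for a vision-language model (hypothetical layout)."""
    def __init__(self):
        super().__init__()
        self.vision_tower = nn.Linear(8, 8)    # stand-in vision encoder (frozen in PT1)
        self.projector = nn.Linear(8, 8)       # stand-in vision projector (trained in PT1)
        self.language_model = nn.Linear(8, 8)  # stand-in LLM (frozen in PT1)

def freeze_for_stage1(model: nn.Module) -> None:
    """Freeze everything except the projector, mirroring the PT1 recipe."""
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith("projector")

model = ToyVLM()
freeze_for_stage1(model)
print([n for n, p in model.named_parameters() if p.requires_grad])
# ['projector.weight', 'projector.bias']
```

In Stage 2 (PT2), the same loop would simply set `requires_grad = True` on every parameter.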
 
- ## 🛠️ Key Technical Features
-
- - **Native Streaming Architecture**: Supports continuous input and processing of video frames rather than discrete frame sampling.
- - **Unified Position Encoding**: Shared synchronization mechanism for position encoding across both visual and textual modalities.
- - **Efficient Pooling Strategy**: Employs `average` pooling with `stride=4` to balance computational efficiency and feature preservation.
- - **Flash Attention 2**: Full support for FA2 acceleration to optimize memory usage during long-sequence training.
-
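The stride-4 average pooling mentioned above can be illustrated with a toy example; the token count and feature shape here are made up for demonstration and are not the model's real dimensions.

```python
import torch
import torch.nn.functional as F

# 16 visual tokens with a 1-dim feature, shaped (batch, channels, sequence).
tokens = torch.arange(16, dtype=torch.float32).view(1, 1, 16)

# Average pooling with kernel_size=4 and stride=4 reduces the token count 4x.
pooled = F.avg_pool1d(tokens, kernel_size=4, stride=4)

print(pooled.squeeze().tolist())  # [1.5, 5.5, 9.5, 13.5]
```

Each output token is the mean of 4 consecutive input tokens, trading sequence length for per-token information density.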
- ## 🏗️ Model Architecture
-
- The architecture of MOSS-Video-Preview is designed for scalable, efficient processing of multimodal temporal data. For more detail, see the official repository: [fnlp-vision/MOSS-Video-Preview](https://github.com/fnlp-vision/MOSS-Video-Preview)
-
- ## 📥 Model Usage
-
- For detailed usage instructions, please refer to the official repository: [fnlp-vision/MOSS-Video-Preview](https://github.com/fnlp-vision/MOSS-Video-Preview)
+ ---
+ language:
+ - en
+ library_name: transformers
+ pipeline_tag: image-text-to-text
+ tags:
+ - multimodal
+ - video
+ - vision-language
+ - mllama
+ - streaming
+ - sft
+ ---
+
  # MOSS-Video-Preview-SFT 🤗
+
+ ## Introduction
+
+ We introduce **MOSS-Video-Preview-SFT**, the **offline supervised fine-tuned** checkpoint in the MOSS-Video-Preview series.
+
+ > [!Important]
+ > This is an **offline SFT** checkpoint (instruction-tuned). It is **not** the realtime-SFT streaming checkpoint.
+
+ This checkpoint is intended for:
+
+ - **Offline video/image understanding** with improved instruction following
+ - Serving as a strong starting point for further **realtime SFT** or domain adaptation
+
+ ### Model Architecture
+
+ MOSS-Video-Preview is built on a **Llama-3.2-Vision** multimodal backbone with native support for **video / image + text**:
+
+ <p align="center">
+   <img src="assets/model_structure.png" width="90%" alt="Model Architecture"/>
+ </p>
+
+ - **Multimodal projector + LLM**: maps visual features into the language model space for generation.
+ - **Unified spatio-temporal position encoding**: aligns video frame order and text tokens for long-context multimodal reasoning.
+
+ For architecture diagrams and full system details, see the top-level repository: [fnlp-vision/MOSS-Video-Preview](https://github.com/fnlp-vision/MOSS-Video-Preview).
+
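As a toy illustration of the unified spatio-temporal indexing idea (this is not the model's actual position-encoding scheme): every patch within a frame can share that frame's temporal index, with text tokens continuing the same counter.

```python
def unified_positions(num_frames: int, patches_per_frame: int, num_text_tokens: int) -> list[int]:
    """Toy unified temporal index: all patches in a frame share that frame's
    index, and text tokens continue the same sequence. Illustrative only."""
    vision = [f for f in range(num_frames) for _ in range(patches_per_frame)]
    text = list(range(num_frames, num_frames + num_text_tokens))
    return vision + text

# Two frames of three patches each, followed by four text tokens.
print(unified_positions(2, 3, 4))  # [0, 0, 0, 1, 1, 1, 2, 3, 4, 5]
```

The point of such a shared index is that frame order and text order live on one axis, so the model can reason about "when" across both modalities.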
+ ## 🚀 Quickstart
+
+ ### Offline video inference (recommended)
+
+ #### Video inference (Python)
+
+ ```python
+ import torch
+ from transformers import AutoModelForCausalLM, AutoProcessor
+
+ # Use a local path like "models/moss-video-sft",
+ # or a Hugging Face model id if published.
+ checkpoint = "models/moss-video-sft"
+ video_path = "data/example_video.mp4"
+ prompt = "Describe the video."
+
+ processor = AutoProcessor.from_pretrained(
+     checkpoint,
+     trust_remote_code=True,
+     frame_extract_num_threads=1,
+ )
+ model = AutoModelForCausalLM.from_pretrained(
+     checkpoint,
+     trust_remote_code=True,
+     device_map="auto",
+     torch_dtype=torch.bfloat16,
+     attn_implementation="flash_attention_2",
+ )
+
+ messages = [
+     {
+         "role": "user",
+         "content": [
+             {"type": "video"},
+             {"type": "text", "text": prompt},
+         ],
+     }
+ ]
+
+ input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
+ inputs = processor(
+     text=input_text,
+     videos=[video_path],
+     video_fps=1.0,
+     video_minlen=8,
+     video_maxlen=16,
+     add_special_tokens=False,
+     return_tensors="pt",
+ ).to(model.device)
+
+ with torch.no_grad():
+     output_ids = model.generate(**inputs, max_new_tokens=512, do_sample=False)
+
+ print(processor.decode(output_ids[0], skip_special_tokens=False))
+ ```
+
+ #### Image inference (Python)
+
+ ```python
+ import torch
+ from PIL import Image
+ from transformers import AutoModelForCausalLM, AutoProcessor
+
+ checkpoint = "models/moss-video-sft"
+ image_path = "data/example_image.jpg"
+ prompt = "Describe this image."
+
+ image = Image.open(image_path).convert("RGB")
+
+ processor = AutoProcessor.from_pretrained(checkpoint, trust_remote_code=True)
+ model = AutoModelForCausalLM.from_pretrained(
+     checkpoint,
+     trust_remote_code=True,
+     device_map="auto",
+     torch_dtype=torch.bfloat16,
+     attn_implementation="flash_attention_2",
+ )
+
+ messages = [
+     {
+         "role": "user",
+         "content": [
+             {"type": "image"},
+             {"type": "text", "text": prompt},
+         ],
+     }
+ ]
+
+ input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
+ inputs = processor(
+     text=input_text,
+     images=[image],
+     add_special_tokens=False,
+     return_tensors="pt",
+ ).to(model.device)
+
+ with torch.no_grad():
+     output_ids = model.generate(**inputs, max_new_tokens=256, do_sample=False)
+
+ print(processor.decode(output_ids[0], skip_special_tokens=False))
+ ```
+
+ ## Intended use
+
+ - **Offline instruction-following** for video/image understanding (the recommended default checkpoint for most users).
+ - **Finetuning starting point** if you plan to train your own realtime-SFT or domain-specific variant.
+
 
+ ## ⚠️ Limitations
+
+ - **Not realtime-SFT**: this checkpoint may not expose streaming generation APIs such as `real_time_generate()`.
+ - **Latency and throughput depend on decoding and hardware**: FlashAttention 2 with `bfloat16` on modern GPUs is recommended.
+
+ ## 🧩 Requirements
+
+ - **Python**: 3.10+
+ - **PyTorch**: 1.13.1+ (GPU strongly recommended)
+ - **Transformers**: must be loaded with `trust_remote_code=True` for this model family (due to `auto_map` custom code)
+ - **Optional (recommended)**: FlashAttention 2 (`attn_implementation="flash_attention_2"`)
+ - **Video decode**: the streaming demo imports OpenCV (`cv2`); the offline demo relies on the processor's video-loading backend
+
+ For full environment setup (including optional FlashAttention 2 extras), see the top-level repository `README.md`.
 
+ ## Citation
+
+ ```bibtex
+ @misc{moss_video_2026,
+   title        = {MOSS-Video-Preview: Towards Synchronized Streaming Video Understanding},
+   author       = {OpenMOSS Team},
+   year         = {2026},
+   publisher    = {GitHub},
+   journal      = {GitHub repository},
+   howpublished = {\url{https://github.com/OpenMOSS/MOSS-Video-Preview}}
+ }
+ ```
assets/model_structure.png ADDED

Git LFS Details

  • SHA256: 51d04cc34abd90cdc24e3198329a19efaba449e5a857e8f4d7a4544087be59dc
  • Pointer size: 131 Bytes
  • Size of remote file: 217 kB