findcard12138 committed
Commit de6119f · verified · 1 parent: b737ed8

Upload folder using huggingface_hub

Files changed (3):
  1. .gitattributes +1 -0
  2. README.md +161 -32
  3. assets/model_structure.png +3 -0
.gitattributes CHANGED
@@ -34,3 +34,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
  tokenizer.json filter=lfs diff=lfs merge=lfs -text
+ assets/model_structure.png filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,48 +1,177 @@
  # MOSS-Video-Preview-SFT 🤗

- MOSS-Video-Preview-SFT is a streaming video understanding model developed through two-stage pretraining and Supervised Fine-Tuning (SFT). Based on the Llama-3.2-Vision architecture, it achieves efficient understanding of streaming video by introducing native video processing capabilities and unified spatio-temporal position encoding.
-
- ## 🚀 Training Stages
-
- The training process for this model consists of three key stages:
-
- ### Stage 1: Vision-Language Alignment (PT1)
- - **Objective**: Establish initial alignment between visual features and the language model, enabling basic visual understanding of video frames.
- - **Configuration**:
-   - **Frozen Parameters**: Language Model (LLM) and Vision Tower.
-   - **Trainable Parameters**: Vision Projector.
-   - **Data**: Large-scale image-text pairs and short video clips.
- - **Key Feature**: Introduces `mllama_add_video_position_encoding` to provide temporal position information for video frames.
-
- ### Stage 2: Full Spatio-Temporal Pretraining (PT2)
- - **Objective**: Enhance the model's understanding of long videos and complex temporal relationships.
- - **Configuration**:
-   - **Method**: Full-parameter fine-tuning.
-   - **Trainable Parameters**: All modules (Vision Tower, Projector, and LLM) are unfrozen.
-   - **Data**: Video data with longer durations (supporting 256+ frames).
- - **Key Feature**: Uses `mllama_use_full_attn` to enable full attention mechanisms, improving cross-frame modeling.
-
- ### Stage 3: Supervised Fine-Tuning (SFT)
- - **Objective**: Enable the model to follow complex instructions for real-time streaming video dialogue and task processing.
- - **Configuration**:
-   - **Template**: Uses the `mllama` instruction template.
-   - **Data**: High-quality video instruction-following datasets (e.g., real-time description, action recognition, video Q&A).
- - **Optimization**: Optimized for streaming inference to produce coherent textual responses with low latency.
-
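The Stage 1 freezing recipe above (freeze the LLM and vision tower, train only the projector) can be sketched as follows. This is a toy illustration: `ToyVLM` and its attribute names are hypothetical stand-ins, not the checkpoint's actual module names.

```python
import torch.nn as nn

class ToyVLM(nn.Module):
    """Minimal stand-in for a vision-language model (hypothetical layout)."""
    def __init__(self):
        super().__init__()
        self.vision_tower = nn.Linear(8, 8)    # stand-in vision encoder (frozen in PT1)
        self.projector = nn.Linear(8, 8)       # stand-in vision projector (trained in PT1)
        self.language_model = nn.Linear(8, 8)  # stand-in LLM (frozen in PT1)

def freeze_for_stage1(model: nn.Module) -> None:
    """Freeze everything except the projector, mirroring the PT1 recipe."""
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith("projector")

model = ToyVLM()
freeze_for_stage1(model)
print([n for n, p in model.named_parameters() if p.requires_grad])
# ['projector.weight', 'projector.bias']
```

In Stage 2 (PT2), the same loop would simply set `requires_grad = True` on every parameter.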
 
- ## 🛠️ Key Technical Features
-
- - **Native Streaming Architecture**: Supports continuous input and processing of video frames rather than discrete frame sampling.
- - **Unified Position Encoding**: Shared synchronization mechanism for position encoding across both visual and textual modalities.
- - **Efficient Pooling Strategy**: Employs `average` pooling with `stride=4` to balance computational efficiency and feature preservation.
- - **Flash Attention 2**: Full support for FA2 acceleration to optimize memory usage during long-sequence training.
-
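The stride-4 average pooling mentioned above can be illustrated with a toy example; the token count and feature shape here are made up for demonstration and are not the model's real dimensions.

```python
import torch
import torch.nn.functional as F

# 16 visual tokens with a 1-dim feature, shaped (batch, channels, sequence).
tokens = torch.arange(16, dtype=torch.float32).view(1, 1, 16)

# Average pooling with kernel_size=4 and stride=4 reduces the token count 4x.
pooled = F.avg_pool1d(tokens, kernel_size=4, stride=4)

print(pooled.squeeze().tolist())  # [1.5, 5.5, 9.5, 13.5]
```

Each output token is the mean of 4 consecutive input tokens, trading sequence length for per-token information density.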
- ## 🏗️ Model Architecture
-
- The architecture of MOSS-Video-Preview is designed for scalable, efficient processing of multimodal temporal data. For more detail, see the official repository: [fnlp-vision/MOSS-Video-Preview](https://github.com/fnlp-vision/MOSS-Video-Preview)
-
- ## 📥 Model Usage
-
- For detailed usage instructions, please refer to the official repository: [fnlp-vision/MOSS-Video-Preview](https://github.com/fnlp-vision/MOSS-Video-Preview)
+ ---
+ language:
+ - en
+ library_name: transformers
+ pipeline_tag: image-text-to-text
+ tags:
+ - multimodal
+ - video
+ - vision-language
+ - mllama
+ - streaming
+ - sft
+ ---
+
  # MOSS-Video-Preview-SFT 🤗
+
+ ## Introduction
+
+ We introduce **MOSS-Video-Preview-SFT**, the **offline supervised fine-tuned** checkpoint in the MOSS-Video-Preview series.
+
+ > [!Important]
+ > This is an **offline SFT** checkpoint (instruction-tuned). It is **not** the realtime-SFT streaming checkpoint.
+
+ This checkpoint is intended for:
+
+ - **Offline video/image understanding** with improved instruction following
+ - Serving as a strong starting point for further **realtime SFT** or domain adaptation
+
+ ### Model Architecture
+
+ MOSS-Video-Preview is built on a **Llama-3.2-Vision** multimodal backbone with native support for **video / image + text**:
+
+ <p align="center">
+   <img src="assets/model_structure.png" width="90%" alt="Model Architecture"/>
+ </p>
+
+ - **Multimodal projector + LLM**: maps visual features into the language model space for generation.
+ - **Unified spatio-temporal position encoding**: aligns video frame order and text tokens for long-context multimodal reasoning.
+
+ For architecture diagrams and full system details, see the top-level repository: [fnlp-vision/MOSS-Video-Preview](https://github.com/fnlp-vision/MOSS-Video-Preview).
+
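As a toy illustration of the unified spatio-temporal indexing idea (this is not the model's actual position-encoding scheme): every patch within a frame can share that frame's temporal index, with text tokens continuing the same counter.

```python
def unified_positions(num_frames: int, patches_per_frame: int, num_text_tokens: int) -> list[int]:
    """Toy unified temporal index: all patches in a frame share that frame's
    index, and text tokens continue the same sequence. Illustrative only."""
    vision = [f for f in range(num_frames) for _ in range(patches_per_frame)]
    text = list(range(num_frames, num_frames + num_text_tokens))
    return vision + text

# Two frames of three patches each, followed by four text tokens.
print(unified_positions(2, 3, 4))  # [0, 0, 0, 1, 1, 1, 2, 3, 4, 5]
```

The point of such a shared index is that frame order and text order live on one axis, so the model can reason about "when" across both modalities.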
+ ## 🚀 Quickstart
+
+ ### Offline video inference (recommended)
+
+ #### Video inference (Python)
+
+ ```python
+ import torch
+ from transformers import AutoModelForCausalLM, AutoProcessor
+
+ # Use a local path like "models/moss-video-sft",
+ # or a Hugging Face model id if published.
+ checkpoint = "models/moss-video-sft"
+ video_path = "data/example_video.mp4"
+ prompt = "Describe the video."
+
+ processor = AutoProcessor.from_pretrained(
+     checkpoint,
+     trust_remote_code=True,
+     frame_extract_num_threads=1,
+ )
+ model = AutoModelForCausalLM.from_pretrained(
+     checkpoint,
+     trust_remote_code=True,
+     device_map="auto",
+     torch_dtype=torch.bfloat16,
+     attn_implementation="flash_attention_2",
+ )
+
+ messages = [
+     {
+         "role": "user",
+         "content": [
+             {"type": "video"},
+             {"type": "text", "text": prompt},
+         ],
+     }
+ ]
+
+ input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
+ inputs = processor(
+     text=input_text,
+     videos=[video_path],
+     video_fps=1.0,
+     video_minlen=8,
+     video_maxlen=16,
+     add_special_tokens=False,
+     return_tensors="pt",
+ ).to(model.device)
+
+ with torch.no_grad():
+     output_ids = model.generate(**inputs, max_new_tokens=512, do_sample=False)
+
+ print(processor.decode(output_ids[0], skip_special_tokens=False))
+ ```
+
+ #### Image inference (Python)
+
+ ```python
+ import torch
+ from PIL import Image
+ from transformers import AutoModelForCausalLM, AutoProcessor
+
+ checkpoint = "models/moss-video-sft"
+ image_path = "data/example_image.jpg"
+ prompt = "Describe this image."
+
+ image = Image.open(image_path).convert("RGB")
+
+ processor = AutoProcessor.from_pretrained(checkpoint, trust_remote_code=True)
+ model = AutoModelForCausalLM.from_pretrained(
+     checkpoint,
+     trust_remote_code=True,
+     device_map="auto",
+     torch_dtype=torch.bfloat16,
+     attn_implementation="flash_attention_2",
+ )
+
+ messages = [
+     {
+         "role": "user",
+         "content": [
+             {"type": "image"},
+             {"type": "text", "text": prompt},
+         ],
+     }
+ ]
+
+ input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
+ inputs = processor(
+     text=input_text,
+     images=[image],
+     add_special_tokens=False,
+     return_tensors="pt",
+ ).to(model.device)
+
+ with torch.no_grad():
+     output_ids = model.generate(**inputs, max_new_tokens=256, do_sample=False)
+
+ print(processor.decode(output_ids[0], skip_special_tokens=False))
+ ```
+
+ ## Intended use
+
+ - **Offline instruction-following** for video/image understanding (the recommended default checkpoint for most users).
+ - **Finetuning starting point** if you plan to train your own realtime-SFT or domain-specific variant.
+
 
+ ## ⚠️ Limitations
+
+ - **Not realtime-SFT**: this checkpoint may not expose streaming generation APIs such as `real_time_generate()`.
+ - **Latency and throughput depend on decoding and hardware**: FlashAttention 2 with `bfloat16` on modern GPUs is recommended.
+
+ ## 🧩 Requirements
+
+ - **Python**: 3.10+
+ - **PyTorch**: 1.13.1+ (GPU strongly recommended)
+ - **Transformers**: must be loaded with `trust_remote_code=True` for this model family (due to `auto_map` custom code)
+ - **Optional (recommended)**: FlashAttention 2 (`attn_implementation="flash_attention_2"`)
+ - **Video decode**: the streaming demo imports OpenCV (`cv2`); the offline demo relies on the processor's video-loading backend
+
+ For full environment setup (including optional FlashAttention 2 extras), see the top-level repository `README.md`.
 
+ ## Citation
+
+ ```bibtex
+ @misc{moss_video_2026,
+   title        = {MOSS-Video-Preview: Towards Synchronized Streaming Video Understanding},
+   author       = {OpenMOSS Team},
+   year         = {2026},
+   publisher    = {GitHub},
+   journal      = {GitHub repository},
+   howpublished = {\url{https://github.com/OpenMOSS/MOSS-Video-Preview}}
+ }
+ ```
assets/model_structure.png ADDED

Git LFS Details

  • SHA256: 51d04cc34abd90cdc24e3198329a19efaba449e5a857e8f4d7a4544087be59dc
  • Pointer size: 131 Bytes
  • Size of remote file: 217 kB