# MOSS-Video-Preview-SFT

## Introduction
We introduce MOSS-Video-Preview-SFT, the offline supervised fine-tuned checkpoint in the MOSS-Video-Preview series.
Note: this is the offline, instruction-tuned SFT checkpoint, not the Real-Time SFT streaming checkpoint.
This checkpoint is intended for:
- Offline video/image understanding with improved instruction following
- Serving as a strong starting point for further Real-Time SFT or domain adaptation
## Model Architecture
MOSS-Video-Preview is built on a Llama-3.2-Vision backbone, featuring a Pioneering Image-Video Unified Cross-Attention Architecture:
- Native Unified Design: Unlike traditional projection methods, our architecture provides native, unified support for both image and video understanding, ensuring seamless temporal consistency.
- Deep Multimodal Fusion: Leveraging specialized Cross-Attention mechanisms to achieve high-fidelity alignment between visual temporal features and linguistic context.
- Unified Spatio-Temporal Encoding: Aligns video frame sequences and text tokens for robust, long-context multimodal reasoning.
For architecture diagrams and full system details, see the top-level repository: fnlp-vision/MOSS-Video-Preview.
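To make the cross-attention idea concrete, here is a minimal, dependency-free sketch of single-head cross-attention in which text queries attend over visual tokens. This is illustrative only: the real model uses multi-head attention with learned projection matrices, and the dimensions below are made up.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cross_attention(text_queries, visual_keys, visual_values):
    """Single-head cross-attention: each text query attends over visual tokens.

    text_queries: list of d-dim vectors (one per text token)
    visual_keys / visual_values: lists of d-dim vectors (one per visual token)
    Returns one attended d-dim vector per text query.
    """
    d = len(visual_keys[0])
    out = []
    for q in text_queries:
        # Scaled dot-product scores of this query against every visual key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in visual_keys]
        weights = softmax(scores)
        # Weighted sum of visual values -> one fused vector per text token.
        attended = [sum(w * v[i] for w, v in zip(weights, visual_values))
                    for i in range(d)]
        out.append(attended)
    return out
```

The fused vectors are what let linguistic context condition directly on visual temporal features, rather than on a single projected embedding.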
## 🌟 Key Highlights
- 🧩 Native Cross-Attention Base: A novel approach that decouples visual perception and linguistic generation for seamless real-time video understanding.
- 🔄 Dynamic Interaction Support: While this SFT version is for offline use, the underlying architecture is designed for "Silence-Speak" switching and real-time interruption.
- ⚡ High-Efficiency Inference: Optimized for Flash Attention 2 on both CUDA and NPU, ensuring low-latency processing even for long video streams.
## 🚀 Quickstart
### Video inference (Python)

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

# Use the Hugging Face model id (or load from a local folder with the same name).
checkpoint = "fnlp-vision/moss-video-preview-sft"
video_path = "data/example_video.mp4"
prompt = "Describe the video."

processor = AutoProcessor.from_pretrained(
    checkpoint,
    trust_remote_code=True,
    frame_extract_num_threads=1,
)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "video"},
            {"type": "text", "text": prompt},
        ],
    }
]
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(
    text=input_text,
    videos=[video_path],
    video_fps=1.0,
    video_minlen=8,
    video_maxlen=16,
    add_special_tokens=False,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=512, do_sample=False)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```
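The exact frame-extraction logic lives in the model's remote processor code, so the semantics of `video_fps`, `video_minlen`, and `video_maxlen` above are its to define. As a rough mental model, the following hypothetical helper (not part of the released API) shows one plausible interpretation: sample at the requested rate, then clamp the frame count to the min/max window.

```python
def sample_frame_indices(total_frames, native_fps, video_fps=1.0,
                         video_minlen=8, video_maxlen=16):
    """Pick frame indices at roughly `video_fps` frames per second,
    clamped to [video_minlen, video_maxlen] frames overall.

    Hypothetical helper illustrating the parameter semantics; the model's
    remote processor code is authoritative.
    """
    duration = total_frames / native_fps          # clip length in seconds
    target = int(round(duration * video_fps))     # frames at the desired rate
    target = max(video_minlen, min(video_maxlen, target))
    target = min(target, total_frames)            # cannot exceed what exists
    # Spread `target` indices uniformly across the clip.
    step = total_frames / target
    return [min(total_frames - 1, int(i * step)) for i in range(target)]
```

Under this reading, a 10-second clip at 1.0 fps yields 10 frames, while a 2-second clip is padded up to the `video_minlen` floor of 8.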
### Image inference (Python)

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

checkpoint = "fnlp-vision/moss-video-preview-sft"
image_path = "data/example_image.jpg"
prompt = "Describe this image."

image = Image.open(image_path).convert("RGB")
processor = AutoProcessor.from_pretrained(checkpoint, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": prompt},
        ],
    }
]
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(
    text=input_text,
    images=[image],
    add_special_tokens=False,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```
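For causal LMs, `model.generate` conventionally returns the prompt tokens followed by the newly generated ones, so decoding `output_ids[0]` directly may echo the prompt. If you only want the model's answer, a small model-agnostic helper works (sketch; assumes the standard prompt-prefixed output layout):

```python
def strip_prompt(generated_ids, prompt_length):
    """Drop the leading prompt tokens from one generated id sequence.

    generated_ids: token ids returned by `generate` for one sample
    prompt_length: number of input tokens, e.g. inputs["input_ids"].shape[1]
    """
    return list(generated_ids)[prompt_length:]

# Usage with the snippets above (assumption: prompt-prefixed output):
# new_ids = strip_prompt(output_ids[0], inputs["input_ids"].shape[1])
# print(processor.decode(new_ids, skip_special_tokens=True))
```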
## ✅ Intended use
- Offline instruction-following for video/image understanding (recommended default checkpoint for most users).
- Finetuning starting point if you plan to train your own Real-Time SFT or domain-specific variant.
## ⚠️ Limitations & Future Outlook
- Offline SFT Only: This specific checkpoint is optimized for offline instruction-following. For real-time streaming and dynamic interruption, please refer to our Real-Time SFT variant.
- Performance Benchmarking: While our real-time architecture is a step forward, a performance gap remains compared to top-tier models such as Qwen2.5-VL. Closing this gap is our primary focus for future iterations.
- Distributed Training & Scaling: The current version is an architectural validation. Future releases will integrate the Megatron-LM framework for large-scale pre-training using 3D parallelism.
- Data Diversity: Ongoing work is focused on expanding the scale and diversity of our training datasets to improve generalizability across more complex scenarios.
## 🧩 Requirements
- Python: 3.10+
- PyTorch: 1.13.1+ (GPU strongly recommended)
  - Tested setup: Python 3.12.4 + PyTorch 2.4.0 (CUDA 12.1) + DeepSpeed 0.16.1
  - CPU-only: PyTorch 2.4.0
- Transformers: required with `trust_remote_code=True` for this model family (due to `auto_map` custom code)
- Optional (recommended): FlashAttention 2 (`attn_implementation="flash_attention_2"`)
- Video decode: the streaming demo imports OpenCV (`cv2`); the offline demo relies on the processor's video loading backend
For full environment setup (including optional FlashAttention2 extras), see the top-level repository README.md.
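Since FlashAttention 2 is optional, loading can fall back to Transformers' built-in `sdpa` implementation when the `flash_attn` package is not installed. A small selector (sketch; `sdpa` and `eager` are the standard Transformers `attn_implementation` values):

```python
import importlib.util

def pick_attn_implementation(preferred="flash_attention_2"):
    """Return `preferred` if the flash_attn package is importable,
    otherwise fall back to PyTorch's scaled-dot-product attention."""
    if (preferred == "flash_attention_2"
            and importlib.util.find_spec("flash_attn") is None):
        return "sdpa"
    return preferred

# model = AutoModelForCausalLM.from_pretrained(
#     checkpoint,
#     trust_remote_code=True,
#     attn_implementation=pick_attn_implementation(),
# )
```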
## 🌟 Our Mission & Community Invitation
MOSS-Video-Preview aims to fill the gap in cross-attention-based foundation models for video understanding.
We warmly welcome experts in Representation Learning and Model Efficiency to explore, experiment, and innovate on top of our architecture. Let's push the boundaries of video intelligence and advance the open-source community together!
## Citation

```bibtex
@misc{moss_video_2026,
  title = {{MOSS-Video-Preview: Next-Generation Real-Time Video Understanding}},
  author = {OpenMOSS Team},
  year = {2026},
  howpublished = {\url{https://github.com/fnlp-vision/MOSS-Video-Preview}},
  note = {GitHub repository}
}
```