# MOSS-Video-Preview-SFT

## Introduction
We introduce MOSS-Video-Preview-SFT, the offline supervised fine-tuned checkpoint in the MOSS-Video-Preview series.
Note: this is the offline, instruction-tuned SFT checkpoint, not the Real-Time SFT streaming checkpoint.
This checkpoint is intended for:
- Offline video/image understanding with improved instruction following
- Serving as a strong starting point for further Real-Time SFT or domain adaptation
## Model Architecture
MOSS-Video-Preview is built on a Llama-3.2-Vision backbone, featuring a Pioneering Image-Video Unified Cross-Attention Architecture:
- Native Unified Design: Unlike traditional projection methods, our architecture provides native, unified support for both image and video understanding, ensuring seamless temporal consistency.
- Deep Multimodal Fusion: Leveraging specialized Cross-Attention mechanisms to achieve high-fidelity alignment between visual temporal features and linguistic context.
- Unified Spatio-Temporal Encoding: Aligns video frame sequences and text tokens for robust, long-context multimodal reasoning.
For architecture diagrams and full system details, see the top-level repository: fnlp-vision/MOSS-Video-Preview.
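To make the cross-attention idea concrete, here is a minimal, dependency-free sketch of single-head cross-attention in which text queries attend over visual tokens. This is illustrative only: the real model uses multi-head attention with learned projection matrices, and the dimensions below are made up.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cross_attention(text_queries, visual_keys, visual_values):
    """Single-head cross-attention: each text query attends over visual tokens.

    text_queries: list of d-dim vectors (one per text token)
    visual_keys / visual_values: lists of d-dim vectors (one per visual token)
    Returns one attended d-dim vector per text query.
    """
    d = len(visual_keys[0])
    out = []
    for q in text_queries:
        # Scaled dot-product scores of this query against every visual key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in visual_keys]
        weights = softmax(scores)
        # Weighted sum of visual values -> one fused vector per text token.
        attended = [sum(w * v[i] for w, v in zip(weights, visual_values))
                    for i in range(d)]
        out.append(attended)
    return out
```

The fused vectors are what let linguistic context condition directly on visual temporal features, rather than on a single projected embedding.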
## 🌟 Key Highlights
- 🧩 Native Cross-Attention Base: A novel approach that decouples visual perception and linguistic generation for seamless real-time video understanding.
- 🔄 Dynamic Interaction Support: While this SFT version is for offline use, the underlying architecture is designed for "Silence-Speak" switching and real-time interruption.
- ⚡ High-Efficiency Inference: Optimized for Flash Attention 2 on both CUDA and NPU, ensuring low-latency processing even for long video streams.
## 🚀 Quickstart
### Video inference (Python)

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

# Use the Hugging Face model id (or load from a local folder with the same name).
checkpoint = "fnlp-vision/moss-video-preview-sft"
video_path = "data/example_video.mp4"
prompt = "Describe the video."

processor = AutoProcessor.from_pretrained(
    checkpoint,
    trust_remote_code=True,
    frame_extract_num_threads=1,
)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "video"},
            {"type": "text", "text": prompt},
        ],
    }
]
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(
    text=input_text,
    videos=[video_path],
    video_fps=1.0,
    video_minlen=8,
    video_maxlen=16,
    add_special_tokens=False,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=512, do_sample=False)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```
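The exact frame-extraction logic lives in the model's remote processor code, so the semantics of `video_fps`, `video_minlen`, and `video_maxlen` above are its to define. As a rough mental model, the following hypothetical helper (not part of the released API) shows one plausible interpretation: sample at the requested rate, then clamp the frame count to the min/max window.

```python
def sample_frame_indices(total_frames, native_fps, video_fps=1.0,
                         video_minlen=8, video_maxlen=16):
    """Pick frame indices at roughly `video_fps` frames per second,
    clamped to [video_minlen, video_maxlen] frames overall.

    Hypothetical helper illustrating the parameter semantics; the model's
    remote processor code is authoritative.
    """
    duration = total_frames / native_fps          # clip length in seconds
    target = int(round(duration * video_fps))     # frames at the desired rate
    target = max(video_minlen, min(video_maxlen, target))
    target = min(target, total_frames)            # cannot exceed what exists
    # Spread `target` indices uniformly across the clip.
    step = total_frames / target
    return [min(total_frames - 1, int(i * step)) for i in range(target)]
```

Under this reading, a 10-second clip at 1.0 fps yields 10 frames, while a 2-second clip is padded up to the `video_minlen` floor of 8.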
### Image inference (Python)

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

checkpoint = "fnlp-vision/moss-video-preview-sft"
image_path = "data/example_image.jpg"
prompt = "Describe this image."

image = Image.open(image_path).convert("RGB")
processor = AutoProcessor.from_pretrained(checkpoint, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": prompt},
        ],
    }
]
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(
    text=input_text,
    images=[image],
    add_special_tokens=False,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```
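For causal LMs, `model.generate` conventionally returns the prompt tokens followed by the newly generated ones, so decoding `output_ids[0]` directly may echo the prompt. If you only want the model's answer, a small model-agnostic helper works (sketch; assumes the standard prompt-prefixed output layout):

```python
def strip_prompt(generated_ids, prompt_length):
    """Drop the leading prompt tokens from one generated id sequence.

    generated_ids: token ids returned by `generate` for one sample
    prompt_length: number of input tokens, e.g. inputs["input_ids"].shape[1]
    """
    return list(generated_ids)[prompt_length:]

# Usage with the snippets above (assumption: prompt-prefixed output):
# new_ids = strip_prompt(output_ids[0], inputs["input_ids"].shape[1])
# print(processor.decode(new_ids, skip_special_tokens=True))
```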
## ✅ Intended use
- Offline instruction-following for video/image understanding (recommended default checkpoint for most users).
- Finetuning starting point if you plan to train your own Real-Time SFT or domain-specific variant.
## ⚠️ Limitations & Future Outlook
- Offline SFT Only: This specific checkpoint is optimized for offline instruction-following. For real-time streaming and dynamic interruption, please refer to our Real-Time SFT variant.
- Performance Benchmarking: While our real-time architecture is a step forward, a performance gap remains compared to top-tier models such as Qwen2.5-VL. Closing this gap is our primary focus for future iterations.
- Distributed Training & Scaling: The current version is an architectural validation. Future releases will integrate the Megatron-LM framework for large-scale pre-training using 3D parallelism.
- Data Diversity: Ongoing work is focused on expanding the scale and diversity of our training datasets to improve generalizability across more complex scenarios.
## 🧩 Requirements
- Python: 3.10+
- PyTorch: 1.13.1+ (GPU strongly recommended)
  - Tested setup: Python 3.12.4 + PyTorch 2.4.0 (CUDA 12.1) + DeepSpeed 0.16.1
  - CPU-only: PyTorch 2.4.0
- Transformers: required with `trust_remote_code=True` for this model family (due to `auto_map` custom code)
- Optional (recommended): FlashAttention 2 (`attn_implementation="flash_attention_2"`)
- Video decode: the streaming demo imports OpenCV (`cv2`); the offline demo relies on the processor's video loading backend
For full environment setup (including optional FlashAttention2 extras), see the top-level repository README.md.
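Since FlashAttention 2 is optional, loading can fall back to Transformers' built-in `sdpa` implementation when the `flash_attn` package is not installed. A small selector (sketch; `sdpa` and `eager` are the standard Transformers `attn_implementation` values):

```python
import importlib.util

def pick_attn_implementation(preferred="flash_attention_2"):
    """Return `preferred` if the flash_attn package is importable,
    otherwise fall back to PyTorch's scaled-dot-product attention."""
    if (preferred == "flash_attention_2"
            and importlib.util.find_spec("flash_attn") is None):
        return "sdpa"
    return preferred

# model = AutoModelForCausalLM.from_pretrained(
#     checkpoint,
#     trust_remote_code=True,
#     attn_implementation=pick_attn_implementation(),
# )
```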
## 🌟 Our Mission & Community Invitation
MOSS-Video-Preview aims to fill the gap in cross-attention-based foundation models for video understanding.
We warmly welcome experts in Representation Learning and Model Efficiency to explore, experiment, and innovate on top of our architecture. Let's push the boundaries of video intelligence and advance the open-source community together!
## Citation

```bibtex
@misc{moss_video_2026,
  title = {{MOSS-Video-Preview: Next-Generation Real-Time Video Understanding}},
  author = {OpenMOSS Team},
  year = {2026},
  howpublished = {\url{https://github.com/fnlp-vision/MOSS-Video-Preview}},
  note = {GitHub repository}
}
```