# MOSS-Video-Preview-Real-Time-SFT

## Introduction

We introduce MOSS-Video-Preview-Real-Time-SFT, a specialized model derived from MOSS-Video-Preview-SFT through an additional round of Real-Time Supervised Fine-Tuning (Real-Time SFT). This checkpoint is optimized for low-latency, high-frequency real-time video understanding.
This checkpoint is intended for:
- Real-time video understanding with true "see-and-say" capabilities.
- Low-latency interactive applications where Time to First Token (TTFT) is critical.
- Continuous video monitoring and instant action feedback.
## Model Architecture

- Native Unified Design: Unlike conventional offline video models, our architecture supports native frame-by-frame video injection, keeping the visual context up to date with the generation process.
- Dual-Duplex Interaction: Specifically tuned for "Silence-Speak" switching. The model can be interrupted and self-correct its responses in real-time as the video scene evolves.
- Unified Spatio-Temporal Encoding: Features optimized gated positional embeddings and Cross-Attention KV Cache, allowing the model to maintain robust temporal context over extended streams.
For architecture diagrams and full system details, see the top-level repository: fnlp-vision/MOSS-Video-Preview.
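As a rough illustration of the gated cross-attention idea behind the components above, here is a minimal NumPy sketch assuming a Flamingo-style tanh-gated residual. The function name, shapes, and gating scheme are illustrative assumptions and do not correspond to the repository's actual modules.

```python
# Illustrative sketch (NOT the model's actual implementation): text hidden
# states attend to visual hidden states, and a learned scalar gate scales
# the residual so that gate=0 reduces the block to the text-only pathway.
import numpy as np


def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def gated_cross_attention(text_h, visual_h, gate):
    """One tanh-gated cross-attention step (hypothetical helper)."""
    d_k = text_h.shape[-1]
    scores = text_h @ visual_h.T / np.sqrt(d_k)   # (T_text, T_vis)
    attended = softmax(scores) @ visual_h          # (T_text, d)
    return text_h + np.tanh(gate) * attended       # gate=0 -> exact identity


rng = np.random.default_rng(0)
text = rng.standard_normal((4, 8))
vision = rng.standard_normal((6, 8))

# With gate=0 the block leaves the text stream untouched:
assert np.allclose(gated_cross_attention(text, vision, gate=0.0), text)
out = gated_cross_attention(text, vision, gate=1.0)
assert out.shape == (4, 8)
```

The zero-initialized gate is a common trick for bolting cross-attention onto a pretrained language model without disturbing it at the start of training.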
## 🌊 Streaming Inference Mechanism
The core advantage of this model is its Asynchronous Streaming Capability, enabling true "See-and-Say" video intelligence.
- Asynchronous Single-Frame Streaming: Video frames are injected at a stable frequency. The input pipeline is non-blocking and decoupled from text generation, ensuring continuous perception.
- Persistent State Maintenance: Leveraging Cross-Attention KV Cache and temporal positional encoding, the model maintains long-range contextual dependencies across continuous frames.
- Instantaneous Streaming Response: Built on the optimized MllamaVideoModel, it performs autoregressive generation alongside the visual stream, achieving ultra-low Time to First Token (TTFT).
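The decoupling described above can be sketched with nothing but the standard library: a producer thread injects frames into a queue at a fixed frequency while the consumer drains them independently, so perception never blocks on generation. This only illustrates the queue pattern; in the real pipeline the consumer side lives inside `real_time_generate()`.

```python
# Minimal stdlib sketch of non-blocking frame injection decoupled from the
# consumer. Frame payloads are plain ints here; the real pipeline carries
# PIL images.
import queue
import threading
import time

frame_q: "queue.Queue" = queue.Queue()


def inject_frames(n: int, fps: float = 20.0) -> None:
    """Producer: push `n` frames at roughly `fps` Hz, then a sentinel."""
    for i in range(n):
        frame_q.put(i)        # never waits on the consumer
        time.sleep(1.0 / fps)
    frame_q.put(None)         # sentinel: stream ended


consumed = []
producer = threading.Thread(target=inject_frames, args=(5,), daemon=True)
producer.start()

# Consumer: drain frames as they arrive, independent of the producer's pace.
while True:
    frame = frame_q.get()
    if frame is None:
        break
    consumed.append(frame)
producer.join()
assert consumed == [0, 1, 2, 3, 4]
```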
## 🌟 Key Highlights

- 🧩 Decoupled Cross-Attention: A novel approach that decouples visual perception from linguistic generation for seamless real-time video understanding.
- 🔄 Millisecond-Level Interaction: Supports real-time interruption and dynamic response adjustment as the environment changes.
- ⚡ Hardware-Optimized Performance: Fully supports Flash Attention 2 and is compatible with CUDA/NPU platforms, optimized for long-context video stream processing.
## 🚀 Quickstart

### Video streaming inference (Recommended for Real-Time SFT)

This mode uses the `real_time_generate()` API for low-latency streaming.
```python
import queue
import threading
import time

import cv2
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor


def feed(video: str, q: "queue.Queue", fps: float = 1.0) -> None:
    """Read frames from `video` and inject them into `q` at roughly `fps` Hz."""
    cap = cv2.VideoCapture(video)
    step = max(1, round((cap.get(cv2.CAP_PROP_FPS) or 25) / fps))
    i = 0
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        if i % step == 0:
            q.put(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
            time.sleep(1 / fps)
        i += 1
    cap.release()


checkpoint = "fnlp-vision/moss-video-preview-realtime-sft"
video_path = "data/example_video.mp4"
prompt = "Describe the video."

processor = AutoProcessor.from_pretrained(checkpoint, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(checkpoint, trust_remote_code=True, device_map="auto")

image_queue, prompt_queue, token_queue = queue.Queue(), queue.Queue(), queue.Queue()

# Producer: inject video frames asynchronously, decoupled from generation.
threading.Thread(target=feed, args=(video_path, image_queue), daemon=True).start()
time.sleep(1)
prompt_queue.put(prompt)

# Consumer: the model reads frames and prompts, streaming tokens into token_queue.
threading.Thread(
    target=lambda: model.real_time_generate(image_queue, prompt_queue, token_queue, processor),
    daemon=True,
).start()

END = {"[DONE]", "[ERROR]", "<|round_end|>"}
BANNER = "\n" + "-" * 30 + " [Silence / Observing] " + "-" * 30

# Print streamed tokens. One token is buffered in `pending` so that control
# tokens (round/silence markers) can suppress it before it is displayed.
pending = None
silent = False
last = time.time()
got = False
while True:
    try:
        tok = token_queue.get(timeout=0.1)
    except queue.Empty:
        if pending:
            print(pending, end="", flush=True)
            pending = None
        if got and time.time() - last > 5:  # no tokens for 5 s: stop
            break
        continue
    got, last = True, time.time()
    if tok == "<|round_start|>":
        pending = None
        continue
    if tok in END:
        if pending:
            print(pending, end="", flush=True)
        break
    if tok == "<|silence|>":
        # The model chose to keep observing rather than speak.
        if not silent:
            if pending:
                print(pending, end="", flush=True)
                pending = None
            print(BANNER, flush=True)
            silent = True
        continue
    silent = False
    if pending:
        print(pending, end="", flush=True)
    pending = tok

if hasattr(model, "stop_real_time_generate"):
    model.stop_real_time_generate()
```
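To check the low-latency claim on your own hardware, TTFT can be measured directly against the token queue. Below is a self-contained sketch: the stub thread stands in for the model, and `ttft` is a hypothetical helper, not part of the repository.

```python
# Hedged sketch: measure Time to First Token (TTFT) as the delay between
# submitting a request and the first token appearing in the queue.
import queue
import threading
import time


def ttft(token_queue: "queue.Queue", start: float) -> float:
    """Block until the first token arrives; return elapsed seconds."""
    token_queue.get()
    return time.time() - start


tok_q: "queue.Queue" = queue.Queue()
start = time.time()
# Stub "model": emits its first token after roughly 50 ms.
threading.Thread(
    target=lambda: (time.sleep(0.05), tok_q.put("Hello")), daemon=True
).start()
latency = ttft(tok_q, start)
assert 0.0 < latency < 2.0  # first token observed within a generous bound
```

In a real run you would record `start` when the prompt is put on `prompt_queue` and call `ttft(token_queue, start)` on the queue that `real_time_generate()` fills.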
### Video offline inference
```python
import os
import queue
import threading

import torch
from transformers import AutoModelForCausalLM, AutoProcessor

checkpoint = "fnlp-vision/moss-video-preview-realtime-sft"
video_path = "data/example_video.mp4"
prompt = "Describe the video."

# Sampling and video-loading parameters.
max_new_tokens = 1024
temperature = 1.0
top_k = 50
top_p = 1.0
repetition_penalty = 1.0
video_fps = 1.0
video_minlen = 8
video_maxlen = 256


def load_model(checkpoint: str):
    processor = AutoProcessor.from_pretrained(
        checkpoint, trust_remote_code=True, frame_extract_num_threads=1
    )
    model = AutoModelForCausalLM.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        device_map="auto",
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
    )
    return model, processor


if not checkpoint:
    raise ValueError("Missing `checkpoint`.")
if not video_path:
    raise ValueError("Missing `video_path`.")
if not os.path.isfile(video_path):
    raise FileNotFoundError(f"Video not found: {video_path}")

model, processor = load_model(checkpoint)

new_queries: "queue.Queue[dict]" = queue.Queue()
output_text_queue: "queue.Queue[str]" = queue.Queue()
new_queries.put(
    {
        "prompt": f"\n{prompt}",
        "images": [],
        "videos": [video_path],
        "media_kwargs": {
            "video_fps": video_fps,
            "video_minlen": video_minlen,
            "video_maxlen": video_maxlen,
        },
        "thinking_mode": "no_thinking",
        "system_prompt_type": "video",
        "generate_kwargs": {
            "temperature": temperature,
            "top_k": top_k,
            "top_p": top_p,
            "max_new_tokens": max_new_tokens,
            "repetition_penalty": repetition_penalty,
        },
        "stop_offline_generate": False,
    }
)
new_queries.put({"stop_offline_generate": True})


def drain_output():
    """Print streamed tokens until the end-of-round marker arrives."""
    while True:
        tok = output_text_queue.get()
        if tok == "<|round_end|>":
            break
        print(tok, end="", flush=True)


t = threading.Thread(target=drain_output, daemon=True)
t.start()
with torch.no_grad():
    model.offline_generate(processor, new_queries, output_text_queue, vision_chunked_length=64)
t.join(timeout=5.0)
```
### Image offline inference
```python
import os
import queue
import threading

import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

checkpoint = "fnlp-vision/moss-video-preview-realtime-sft"
image_path = "data/example_image.jpg"
prompt = "Describe this image."

if not os.path.isfile(image_path):
    raise FileNotFoundError(image_path)

processor = AutoProcessor.from_pretrained(
    checkpoint, trust_remote_code=True, frame_extract_num_threads=1
)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

new_q, out_q = queue.Queue(), queue.Queue()
new_q.put(
    {
        "prompt": f"\n{prompt}",
        "images": [Image.open(image_path).convert("RGB")],
        "videos": [],
        "system_prompt_type": "text_image",
        "thinking_mode": "no_thinking",
        "generate_kwargs": {
            "temperature": 1.0,
            "top_k": 50,
            "top_p": 1.0,
            "max_new_tokens": 256,
            "repetition_penalty": 1.0,
        },
        "stop_offline_generate": False,
    }
)
new_q.put({"stop_offline_generate": True})


def drain_output():
    """Print streamed tokens until the end-of-round marker arrives."""
    for tok in iter(out_q.get, "<|round_end|>"):
        print(tok, end="", flush=True)


threading.Thread(target=drain_output, daemon=True).start()
with torch.no_grad():
    model.offline_generate(processor, new_q, out_q, vision_chunked_length=64)
```
## ✅ Intended use
- Real-time "See-and-Say": Instant description and Q&A for live video streams.
- Low-latency Monitoring: Detecting events or actions in real-time with minimal delay.
- Interactive Multimodal Agents: Building responsive AI assistants that can see and interact.
## ⚠️ Limitations & Future Outlook
- High-End Hardware Recommended: For the best real-time experience (lowest latency), modern GPUs (e.g., A100/H100/H200) with FlashAttention 2 are strongly recommended.
- Performance Benchmarking: While leading in real-time interaction, a performance gap still exists in general benchmarks compared to models like Qwen2.5-VL. Continuous optimization is our primary focus.
- Scalable Distributed Training: We are migrating our training pipeline to the Megatron-LM framework, utilizing 3D parallelism to support even larger-scale pre-training and fine-tuning for future versions.
- Open-Source Commitment: The complete training codebase and experimental configurations will be released in the next major update.
## 🧩 Requirements

- Python: 3.10+
- PyTorch: 1.13.1+ (GPU strongly recommended)
- Tested setup: Python 3.12.4 + PyTorch 2.4.0 (CUDA 12.1) + DeepSpeed 0.16.1
- CPU-only: PyTorch 2.4.0
- Transformers: required; load with `trust_remote_code=True`
- FlashAttention 2: strongly recommended for low-latency inference
- OpenCV: required for video frame extraction in the streaming demos
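Before running the demos, it may help to confirm that these dependencies are importable. This is a convenience sketch, not part of the official repo; `check_deps` is a hypothetical helper.

```python
# Quick dependency check: map each package name to whether it can be
# imported in the current environment, without actually importing it.
import importlib.util


def check_deps(names):
    """Return {package_name: available?} using importlib's module finder."""
    return {n: importlib.util.find_spec(n) is not None for n in names}


status = check_deps(["torch", "transformers", "cv2", "flash_attn"])
for name, ok in status.items():
    print(f"{name:12s} {'OK' if ok else 'MISSING'}")
```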
## 🌟 Our Mission & Community Invitation

This work fills a gap in cross-attention-based foundation models for video understanding.
We warmly welcome experts in Representation Learning and Model Efficiency to explore, experiment, and innovate on top of our architecture. Let's push the boundaries of video intelligence and advance the open-source community together!
## Citation

```bibtex
@misc{moss_video_2026,
  title        = {{MOSS-Video-Preview: Next-Generation Real-Time Video Understanding}},
  author       = {OpenMOSS Team},
  year         = {2026},
  howpublished = {\url{https://github.com/fnlp-vision/MOSS-Video-Preview}},
  note         = {GitHub repository}
}
```