# MOSS-Video-Preview-Real-Time-SFT

## Introduction

We introduce MOSS-Video-Preview-Real-Time-SFT, a specialized model derived from MOSS-Video-Preview-SFT through an additional round of Real-Time Supervised Fine-Tuning (Real-Time SFT). This checkpoint is optimized for low-latency, high-frequency real-time video understanding.
This checkpoint is intended for:
- Real-time video understanding with true "see-and-say" capabilities.
- Low-latency interactive applications where Time to First Token (TTFT) is critical.
- Continuous video monitoring and instant action feedback.
## Model Architecture

- Native Unified Design: Unlike conventional offline video models, our architecture supports native frame-by-frame video injection, keeping the visual context up to date with the generation process.
- Dual-Duplex Interaction: Specifically tuned for "Silence-Speak" switching. The model can be interrupted and self-correct its responses in real-time as the video scene evolves.
- Unified Spatio-Temporal Encoding: Features optimized gated positional embeddings and Cross-Attention KV Cache, allowing the model to maintain robust temporal context over extended streams.
For architecture diagrams and full system details, see the top-level repository: fnlp-vision/MOSS-Video-Preview.
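As a rough illustration of the gated cross-attention idea behind the components above, here is a minimal NumPy sketch assuming a Flamingo-style tanh-gated residual. The function name, shapes, and gating scheme are illustrative assumptions and do not correspond to the repository's actual modules.

```python
# Illustrative sketch (NOT the model's actual implementation): text hidden
# states attend to visual hidden states, and a learned scalar gate scales
# the residual so that gate=0 reduces the block to the text-only pathway.
import numpy as np


def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def gated_cross_attention(text_h, visual_h, gate):
    """One tanh-gated cross-attention step (hypothetical helper)."""
    d_k = text_h.shape[-1]
    scores = text_h @ visual_h.T / np.sqrt(d_k)   # (T_text, T_vis)
    attended = softmax(scores) @ visual_h          # (T_text, d)
    return text_h + np.tanh(gate) * attended       # gate=0 -> exact identity


rng = np.random.default_rng(0)
text = rng.standard_normal((4, 8))
vision = rng.standard_normal((6, 8))

# With gate=0 the block leaves the text stream untouched:
assert np.allclose(gated_cross_attention(text, vision, gate=0.0), text)
out = gated_cross_attention(text, vision, gate=1.0)
assert out.shape == (4, 8)
```

The zero-initialized gate is a common trick for bolting cross-attention onto a pretrained language model without disturbing it at the start of training.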
## 🌊 Streaming Inference Mechanism
The core advantage of this model is its Asynchronous Streaming Capability, enabling true "See-and-Say" video intelligence.
- Asynchronous Single-Frame Streaming: Video frames are injected at a stable frequency. The input pipeline is non-blocking and decoupled from text generation, ensuring continuous perception.
- Persistent State Maintenance: Leveraging Cross-Attention KV Cache and temporal positional encoding, the model maintains long-range contextual dependencies across continuous frames.
- Instantaneous Streaming Response: Built on the optimized MllamaVideoModel, it performs autoregressive generation alongside the visual stream, achieving ultra-low Time to First Token (TTFT).
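The decoupling described above can be sketched with nothing but the standard library: a producer thread injects frames into a queue at a fixed frequency while the consumer drains them independently, so perception never blocks on generation. This only illustrates the queue pattern; in the real pipeline the consumer side lives inside `real_time_generate()`.

```python
# Minimal stdlib sketch of non-blocking frame injection decoupled from the
# consumer. Frame payloads are plain ints here; the real pipeline carries
# PIL images.
import queue
import threading
import time

frame_q: "queue.Queue" = queue.Queue()


def inject_frames(n: int, fps: float = 20.0) -> None:
    """Producer: push `n` frames at roughly `fps` Hz, then a sentinel."""
    for i in range(n):
        frame_q.put(i)        # never waits on the consumer
        time.sleep(1.0 / fps)
    frame_q.put(None)         # sentinel: stream ended


consumed = []
producer = threading.Thread(target=inject_frames, args=(5,), daemon=True)
producer.start()

# Consumer: drain frames as they arrive, independent of the producer's pace.
while True:
    frame = frame_q.get()
    if frame is None:
        break
    consumed.append(frame)
producer.join()
assert consumed == [0, 1, 2, 3, 4]
```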
## 🌟 Key Highlights

- 🧩 Decoupled Cross-Attention: A novel approach that decouples visual perception from linguistic generation for seamless real-time video understanding.
- 🔄 Millisecond-Level Interaction: Supports real-time interruption and dynamic response adjustment as the environment changes.
- ⚡ Hardware-Optimized Performance: Fully supports Flash Attention 2 and is compatible with CUDA/NPU platforms, optimized for long-context video stream processing.
## 🚀 Quickstart

### Video streaming inference (Recommended for Real-Time SFT)

This mode uses the `real_time_generate()` API for low-latency streaming.
```python
import queue
import threading
import time

import cv2
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor


def feed(video: str, q: "queue.Queue", fps: float = 1.0) -> None:
    """Read frames from `video` and inject them into `q` at roughly `fps` Hz."""
    cap = cv2.VideoCapture(video)
    step = max(1, round((cap.get(cv2.CAP_PROP_FPS) or 25) / fps))
    i = 0
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        if i % step == 0:
            q.put(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
            time.sleep(1 / fps)
        i += 1
    cap.release()


checkpoint = "fnlp-vision/moss-video-preview-realtime-sft"
video_path = "data/example_video.mp4"
prompt = "Describe the video."

processor = AutoProcessor.from_pretrained(checkpoint, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(checkpoint, trust_remote_code=True, device_map="auto")

image_queue, prompt_queue, token_queue = queue.Queue(), queue.Queue(), queue.Queue()

# Producer: inject video frames asynchronously, decoupled from generation.
threading.Thread(target=feed, args=(video_path, image_queue), daemon=True).start()
time.sleep(1)
prompt_queue.put(prompt)

# Consumer: the model reads frames and prompts, streaming tokens into token_queue.
threading.Thread(
    target=lambda: model.real_time_generate(image_queue, prompt_queue, token_queue, processor),
    daemon=True,
).start()

END = {"[DONE]", "[ERROR]", "<|round_end|>"}
BANNER = "\n" + "-" * 30 + " [Silence / Observing] " + "-" * 30

# Print streamed tokens. One token is buffered in `pending` so that control
# tokens (round/silence markers) can suppress it before it is displayed.
pending = None
silent = False
last = time.time()
got = False
while True:
    try:
        tok = token_queue.get(timeout=0.1)
    except queue.Empty:
        if pending:
            print(pending, end="", flush=True)
            pending = None
        if got and time.time() - last > 5:  # no tokens for 5 s: stop
            break
        continue
    got, last = True, time.time()
    if tok == "<|round_start|>":
        pending = None
        continue
    if tok in END:
        if pending:
            print(pending, end="", flush=True)
        break
    if tok == "<|silence|>":
        # The model chose to keep observing rather than speak.
        if not silent:
            if pending:
                print(pending, end="", flush=True)
                pending = None
            print(BANNER, flush=True)
            silent = True
        continue
    silent = False
    if pending:
        print(pending, end="", flush=True)
    pending = tok

if hasattr(model, "stop_real_time_generate"):
    model.stop_real_time_generate()
```
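To check the low-latency claim on your own hardware, TTFT can be measured directly against the token queue. Below is a self-contained sketch: the stub thread stands in for the model, and `ttft` is a hypothetical helper, not part of the repository.

```python
# Hedged sketch: measure Time to First Token (TTFT) as the delay between
# submitting a request and the first token appearing in the queue.
import queue
import threading
import time


def ttft(token_queue: "queue.Queue", start: float) -> float:
    """Block until the first token arrives; return elapsed seconds."""
    token_queue.get()
    return time.time() - start


tok_q: "queue.Queue" = queue.Queue()
start = time.time()
# Stub "model": emits its first token after roughly 50 ms.
threading.Thread(
    target=lambda: (time.sleep(0.05), tok_q.put("Hello")), daemon=True
).start()
latency = ttft(tok_q, start)
assert 0.0 < latency < 2.0  # first token observed within a generous bound
```

In a real run you would record `start` when the prompt is put on `prompt_queue` and call `ttft(token_queue, start)` on the queue that `real_time_generate()` fills.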
### Video offline inference
```python
import os
import queue
import threading

import torch
from transformers import AutoModelForCausalLM, AutoProcessor

checkpoint = "fnlp-vision/moss-video-preview-realtime-sft"
video_path = "data/example_video.mp4"
prompt = "Describe the video."

# Sampling and video-loading parameters.
max_new_tokens = 1024
temperature = 1.0
top_k = 50
top_p = 1.0
repetition_penalty = 1.0
video_fps = 1.0
video_minlen = 8
video_maxlen = 256


def load_model(checkpoint: str):
    processor = AutoProcessor.from_pretrained(
        checkpoint, trust_remote_code=True, frame_extract_num_threads=1
    )
    model = AutoModelForCausalLM.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        device_map="auto",
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
    )
    return model, processor


if not checkpoint:
    raise ValueError("Missing `checkpoint`.")
if not video_path:
    raise ValueError("Missing `video_path`.")
if not os.path.isfile(video_path):
    raise FileNotFoundError(f"Video not found: {video_path}")

model, processor = load_model(checkpoint)

new_queries: "queue.Queue[dict]" = queue.Queue()
output_text_queue: "queue.Queue[str]" = queue.Queue()
new_queries.put(
    {
        "prompt": f"\n{prompt}",
        "images": [],
        "videos": [video_path],
        "media_kwargs": {
            "video_fps": video_fps,
            "video_minlen": video_minlen,
            "video_maxlen": video_maxlen,
        },
        "thinking_mode": "no_thinking",
        "system_prompt_type": "video",
        "generate_kwargs": {
            "temperature": temperature,
            "top_k": top_k,
            "top_p": top_p,
            "max_new_tokens": max_new_tokens,
            "repetition_penalty": repetition_penalty,
        },
        "stop_offline_generate": False,
    }
)
new_queries.put({"stop_offline_generate": True})


def drain_output():
    """Print streamed tokens until the end-of-round marker arrives."""
    while True:
        tok = output_text_queue.get()
        if tok == "<|round_end|>":
            break
        print(tok, end="", flush=True)


t = threading.Thread(target=drain_output, daemon=True)
t.start()
with torch.no_grad():
    model.offline_generate(processor, new_queries, output_text_queue, vision_chunked_length=64)
t.join(timeout=5.0)
```
### Image offline inference
```python
import os
import queue
import threading

import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

checkpoint = "fnlp-vision/moss-video-preview-realtime-sft"
image_path = "data/example_image.jpg"
prompt = "Describe this image."

if not os.path.isfile(image_path):
    raise FileNotFoundError(image_path)

processor = AutoProcessor.from_pretrained(
    checkpoint, trust_remote_code=True, frame_extract_num_threads=1
)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

new_q, out_q = queue.Queue(), queue.Queue()
new_q.put(
    {
        "prompt": f"\n{prompt}",
        "images": [Image.open(image_path).convert("RGB")],
        "videos": [],
        "system_prompt_type": "text_image",
        "thinking_mode": "no_thinking",
        "generate_kwargs": {
            "temperature": 1.0,
            "top_k": 50,
            "top_p": 1.0,
            "max_new_tokens": 256,
            "repetition_penalty": 1.0,
        },
        "stop_offline_generate": False,
    }
)
new_q.put({"stop_offline_generate": True})


def drain_output():
    """Print streamed tokens until the end-of-round marker arrives."""
    for tok in iter(out_q.get, "<|round_end|>"):
        print(tok, end="", flush=True)


threading.Thread(target=drain_output, daemon=True).start()
with torch.no_grad():
    model.offline_generate(processor, new_q, out_q, vision_chunked_length=64)
```
## ✅ Intended use
- Real-time "See-and-Say": Instant description and Q&A for live video streams.
- Low-latency Monitoring: Detecting events or actions in real-time with minimal delay.
- Interactive Multimodal Agents: Building responsive AI assistants that can see and interact.
## ⚠️ Limitations & Future Outlook
- High-End Hardware Recommended: For the best real-time experience (lowest latency), modern GPUs (e.g., A100/H100/H200) with FlashAttention 2 are strongly recommended.
- Performance Benchmarking: While leading in real-time interaction, a performance gap still exists in general benchmarks compared to models like Qwen2.5-VL. Continuous optimization is our primary focus.
- Scalable Distributed Training: We are migrating our training pipeline to the Megatron-LM framework, utilizing 3D parallelism to support even larger-scale pre-training and fine-tuning for future versions.
- Open-Source Commitment: The complete training codebase and experimental configurations will be released in the next major update.
## 🧩 Requirements

- Python: 3.10+
- PyTorch: 1.13.1+ (GPU strongly recommended)
- Tested setup: Python 3.12.4 + PyTorch 2.4.0 (CUDA 12.1) + DeepSpeed 0.16.1
- CPU-only: PyTorch 2.4.0
- Transformers: required; load with `trust_remote_code=True`
- FlashAttention 2: strongly recommended for low-latency inference
- OpenCV: required for video frame extraction in the streaming demos
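Before running the demos, it may help to confirm that these dependencies are importable. This is a convenience sketch, not part of the official repo; `check_deps` is a hypothetical helper.

```python
# Quick dependency check: map each package name to whether it can be
# imported in the current environment, without actually importing it.
import importlib.util


def check_deps(names):
    """Return {package_name: available?} using importlib's module finder."""
    return {n: importlib.util.find_spec(n) is not None for n in names}


status = check_deps(["torch", "transformers", "cv2", "flash_attn"])
for name, ok in status.items():
    print(f"{name:12s} {'OK' if ok else 'MISSING'}")
```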
## 🌟 Our Mission & Community Invitation

This work fills a gap in cross-attention-based foundation models for video understanding.
We warmly welcome experts in Representation Learning and Model Efficiency to explore, experiment, and innovate on top of our architecture. Let's push the boundaries of video intelligence and advance the open-source community together!
## Citation

```bibtex
@misc{moss_video_2026,
  title        = {{MOSS-Video-Preview: Next-Generation Real-Time Video Understanding}},
  author       = {OpenMOSS Team},
  year         = {2026},
  howpublished = {\url{https://github.com/fnlp-vision/MOSS-Video-Preview}},
  note         = {GitHub repository}
}
```