MOSS-Video-Preview-Base
Introduction
We introduce MOSS-Video-Preview-Base, the pretrained foundation checkpoint in the MOSS-Video-Preview series.
This is a pretrained model checkpoint without supervised instruction tuning (no offline SFT / no Real-Time SFT).
This repo contains the pretrained weights, which are intended to serve as the starting point for two downstream training stages:
- Offline SFT: instruction-following and reasoning on full video segments
- Real-Time SFT: low-latency streaming video understanding and response
🌟 Key Highlights
- 🧩 First Cross-Attention Base: A unique foundation model architecture designed for native video-language understanding, moving beyond simple feature concatenation.
- 🔄 Streaming-Ready Backbone: The underlying architecture is natively designed to support "Silence-Speak" switching and real-time interruption (requires subsequent Real-Time SFT).
- ⚡ Extreme Efficiency: Optimized for Flash Attention 2 and compatible with NPU/CUDA platforms, providing a high-throughput starting point for long-video research.
Model Architecture
MOSS-Video-Preview-Base is the foundational checkpoint of the series, featuring a Pioneering Image-Video Unified Cross-Attention Architecture:
- Native Unified Design: Unlike traditional projection-based models, this architecture provides native, unified support for both image and video streams, ensuring seamless temporal consistency and visual-language decoupling.
- Cross-Modal Projector: Powered by the proprietary `VideoMllamaTextCrossAttention` mechanism, it achieves high-efficiency semantic alignment between temporal visual features and linguistic context.
- Unified Spatio-Temporal Encoding: Aligns video frame sequences with text tokens, providing a robust backbone for long-context multimodal reasoning.
For architecture diagrams and full system details, see the top-level repository: fnlp-vision/MOSS-Video-Preview.
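To make the cross-attention design concrete, here is a minimal, illustrative sketch of text-to-video cross-attention in PyTorch. This is an assumption-laden toy module, not the model's actual `VideoMllamaTextCrossAttention` implementation (which lives in this repo's custom code); the class name, dimensions, and residual layout below are all illustrative.

```python
import torch
import torch.nn as nn

class TextVideoCrossAttention(nn.Module):
    """Toy sketch: text hidden states attend (as queries) over video frame features."""

    def __init__(self, hidden_dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(hidden_dim)

    def forward(self, text_states: torch.Tensor, video_feats: torch.Tensor) -> torch.Tensor:
        # Queries come from the language stream; keys/values from the visual stream.
        attended, _ = self.attn(query=text_states, key=video_feats, value=video_feats)
        # Residual connection keeps the language stream intact
        # (the "visual-language decoupling" idea described above).
        return self.norm(text_states + attended)

batch, text_len, video_tokens, dim = 2, 16, 64, 256
text_states = torch.randn(batch, text_len, dim)
video_feats = torch.randn(batch, video_tokens, dim)  # e.g. frames x patches, flattened
fused = TextVideoCrossAttention(dim)(text_states, video_feats)
print(fused.shape)  # torch.Size([2, 16, 256])
```

Note that the output keeps the text sequence length: the video stream conditions the language stream without being concatenated into it, which is the key contrast with projection/concatenation-based designs.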
🚀 Quickstart
Video inference
```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

checkpoint = "fnlp-vision/moss-video-preview-base"
video_path = "data/example_video.mp4"
prompt = ""  # Base model: leave the prompt empty to run a pure completion task.

processor = AutoProcessor.from_pretrained(
    checkpoint,
    trust_remote_code=True,
    frame_extract_num_threads=1,
)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "video"},
            {"type": "text", "text": prompt},
        ],
    }
]
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(
    text=input_text,
    videos=[video_path],
    video_fps=1.0,
    video_minlen=8,
    video_maxlen=16,
    add_special_tokens=False,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=512, do_sample=False)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```
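The `video_fps`, `video_minlen`, and `video_maxlen` arguments control how many frames are sampled from the clip. The sketch below shows one plausible reading of that logic (sample at the target fps, clamp the count, space indices evenly); the actual behavior is defined by this repo's custom processor code, and the function name here is hypothetical.

```python
def sample_frame_indices(total_frames: int, native_fps: float,
                         target_fps: float = 1.0, minlen: int = 8,
                         maxlen: int = 16) -> list[int]:
    """Illustrative guess at fps-bounded frame sampling (not the real processor code)."""
    duration_s = total_frames / native_fps
    # Frame count at the target sampling rate, clamped to [minlen, maxlen].
    n = int(round(duration_s * target_fps))
    n = max(minlen, min(maxlen, n))
    # Evenly spaced indices across the clip.
    step = total_frames / n
    return [min(total_frames - 1, int(i * step)) for i in range(n)]

# A 10 s clip at 30 fps sampled at 1 fps -> 10 evenly spaced frames.
print(sample_frame_indices(300, 30.0))  # [0, 30, 60, 90, 120, 150, 180, 210, 240, 270]
# A 100 s clip would want 100 frames, but is clamped to video_maxlen=16.
print(len(sample_frame_indices(3000, 30.0)))  # 16
```

The practical takeaway: longer videos do not grow the visual token budget without bound, since `video_maxlen` caps the number of sampled frames.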
Image inference
```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

checkpoint = "fnlp-vision/moss-video-preview-base"
image_path = "data/example_image.jpg"
prompt = ""  # Base model: leave the prompt empty to run a pure completion task.

image = Image.open(image_path).convert("RGB")
processor = AutoProcessor.from_pretrained(
    checkpoint,
    trust_remote_code=True,
)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": prompt},
        ],
    }
]
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(
    text=input_text,
    images=[image],
    add_special_tokens=False,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```
✅ Intended Use
- Research Foundation: An ideal starting point for researchers focusing on Representation Learning or Model Efficiency in video understanding.
- SFT Starting Point: The recommended backbone for training your own Offline SFT or Real-Time Streaming variants.
- Architecture Exploration: Test new multimodal alignment techniques, temporal encodings, or domain-specific adaptation.
⚠️ Limitations & Future Outlook
- Base Model Nature: This checkpoint is pretrained only and has not undergone instruction tuning. It may generate repetitive text or fail to follow complex instructions without further SFT.
- Performance Benchmarking: While leading in real-time architectural innovation, a performance gap still exists compared to top-tier models like Qwen2.5-VL. Closing this gap is the core focus of our ongoing iterations.
- Scalable Distributed Training: The current training pipeline is optimized for architectural validation. We are migrating to the Megatron-LM framework to leverage 3D parallelism (Tensor, Pipeline, and Data Parallelism) for larger-scale pre-training.
- Open-Source Commitment: In the next major release, we will officially open-source the complete training codebase (integrated with Megatron-LM) and more diverse datasets to the community.
🧩 Requirements
- Python: 3.10+
- PyTorch: 1.13.1+ (GPU strongly recommended)
- Tested setup: Python 3.12.4 + PyTorch 2.4.0 (CUDA 12.1) + DeepSpeed 0.16.1
- CPU-only: PyTorch 2.4.0
- Transformers: required with `trust_remote_code=True` for this model family (due to `auto_map` custom code)
- Optional (recommended): FlashAttention 2 (`attn_implementation="flash_attention_2"`)
- Video decode:
  - streaming demo imports OpenCV (`cv2`)
  - offline demo relies on the processor's video loading backend
For full environment setup (including optional FlashAttention2 extras), see the top-level repository README.md.
⚠️ Notes
- This is a base model directory. Quality/latency characteristics (offline SFT, real-time streaming, etc.) depend on the specific fine-tuned checkpoints and inference pipeline.
- The Python source files in this directory are referenced via `auto_map` in `config.json`.
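For orientation, an `auto_map` block in `config.json` usually looks like the following (the module and class names shown here are illustrative placeholders, not this repo's actual file names):

```json
{
  "auto_map": {
    "AutoModelForCausalLM": "modeling_moss_video.MossVideoForCausalLM",
    "AutoProcessor": "processing_moss_video.MossVideoProcessor"
  }
}
```

This mapping is what lets `AutoModelForCausalLM.from_pretrained(..., trust_remote_code=True)` locate and load the custom classes shipped alongside the weights.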
🌟 Our Mission & Community Invitation
We have filled the gap in cross-attention-based foundation models for video understanding.
We warmly welcome experts in Representation Learning and Model Efficiency to explore, experiment, and innovate on top of our architecture. Let's push the boundaries of video intelligence and advance the open-source community together!
Citation
```bibtex
@misc{moss_video_2026,
  title        = {{MOSS-Video-Preview: Next-Generation Real-Time Video Understanding}},
  author       = {OpenMOSS Team},
  year         = {2026},
  howpublished = {\url{https://github.com/fnlp-vision/MOSS-Video-Preview}},
  note         = {GitHub repository}
}
```