MOSS-Video-Preview-Base
Introduction
We introduce MOSS-Video-Preview-Base, the pretrained foundation checkpoint in the MOSS-Video-Preview series.
This is a pretrained model checkpoint without supervised instruction tuning (no offline SFT / no Real-Time SFT).
This repo contains the pretrained weights, which are intended to serve as the starting point for two downstream training stages:
- Offline SFT: instruction-following and reasoning on full video segments
- Real-Time SFT: low-latency streaming video understanding and response
🌟 Key Highlights
- 🧩 First Cross-Attention Base: A unique foundation model architecture designed for native video-language understanding, moving beyond simple feature concatenation.
- 🔄 Streaming-Ready Backbone: The underlying architecture is natively designed to support "Silence-Speak" switching and real-time interruption (requires subsequent Real-Time SFT).
- ⚡ Extreme Efficiency: Optimized for Flash Attention 2 and compatible with NPU/CUDA platforms, providing a high-throughput starting point for long-video research.
Model Architecture
MOSS-Video-Preview-Base is the foundational checkpoint of the series, featuring a Pioneering Image-Video Unified Cross-Attention Architecture:
- Native Unified Design: Unlike traditional projection-based models, this architecture provides native, unified support for both image and video streams, ensuring seamless temporal consistency and visual-language decoupling.
- Cross-Modal Projector: Powered by the proprietary `VideoMllamaTextCrossAttention` mechanism, it achieves high-efficiency semantic alignment between temporal visual features and linguistic context.
- Unified Spatio-Temporal Encoding: Aligns video frame sequences with text tokens, providing a robust backbone for long-context multimodal reasoning.
For architecture diagrams and full system details, see the top-level repository: fnlp-vision/MOSS-Video-Preview.
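To make the cross-attention design concrete, here is a minimal, illustrative sketch of text-to-video cross-attention in PyTorch. This is an assumption-laden toy module, not the model's actual `VideoMllamaTextCrossAttention` implementation (which lives in this repo's custom code); the class name, dimensions, and residual layout below are all illustrative.

```python
import torch
import torch.nn as nn

class TextVideoCrossAttention(nn.Module):
    """Toy sketch: text hidden states attend (as queries) over video frame features."""

    def __init__(self, hidden_dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(hidden_dim)

    def forward(self, text_states: torch.Tensor, video_feats: torch.Tensor) -> torch.Tensor:
        # Queries come from the language stream; keys/values from the visual stream.
        attended, _ = self.attn(query=text_states, key=video_feats, value=video_feats)
        # Residual connection keeps the language stream intact
        # (the "visual-language decoupling" idea described above).
        return self.norm(text_states + attended)

batch, text_len, video_tokens, dim = 2, 16, 64, 256
text_states = torch.randn(batch, text_len, dim)
video_feats = torch.randn(batch, video_tokens, dim)  # e.g. frames x patches, flattened
fused = TextVideoCrossAttention(dim)(text_states, video_feats)
print(fused.shape)  # torch.Size([2, 16, 256])
```

Note that the output keeps the text sequence length: the video stream conditions the language stream without being concatenated into it, which is the key contrast with projection/concatenation-based designs.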
🚀 Quickstart
Video inference
```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

checkpoint = "fnlp-vision/moss-video-preview-base"
video_path = "data/example_video.mp4"
prompt = ""  # Base model: leave the prompt empty to run a pure completion task.

processor = AutoProcessor.from_pretrained(
    checkpoint,
    trust_remote_code=True,
    frame_extract_num_threads=1,
)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "video"},
            {"type": "text", "text": prompt},
        ],
    }
]
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(
    text=input_text,
    videos=[video_path],
    video_fps=1.0,
    video_minlen=8,
    video_maxlen=16,
    add_special_tokens=False,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=512, do_sample=False)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```
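The `video_fps`, `video_minlen`, and `video_maxlen` arguments control how many frames are sampled from the clip. The sketch below shows one plausible reading of that logic (sample at the target fps, clamp the count, space indices evenly); the actual behavior is defined by this repo's custom processor code, and the function name here is hypothetical.

```python
def sample_frame_indices(total_frames: int, native_fps: float,
                         target_fps: float = 1.0, minlen: int = 8,
                         maxlen: int = 16) -> list[int]:
    """Illustrative guess at fps-bounded frame sampling (not the real processor code)."""
    duration_s = total_frames / native_fps
    # Frame count at the target sampling rate, clamped to [minlen, maxlen].
    n = int(round(duration_s * target_fps))
    n = max(minlen, min(maxlen, n))
    # Evenly spaced indices across the clip.
    step = total_frames / n
    return [min(total_frames - 1, int(i * step)) for i in range(n)]

# A 10 s clip at 30 fps sampled at 1 fps -> 10 evenly spaced frames.
print(sample_frame_indices(300, 30.0))  # [0, 30, 60, 90, 120, 150, 180, 210, 240, 270]
# A 100 s clip would want 100 frames, but is clamped to video_maxlen=16.
print(len(sample_frame_indices(3000, 30.0)))  # 16
```

The practical takeaway: longer videos do not grow the visual token budget without bound, since `video_maxlen` caps the number of sampled frames.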
Image inference
```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

checkpoint = "fnlp-vision/moss-video-preview-base"
image_path = "data/example_image.jpg"
prompt = ""  # Base model: leave the prompt empty to run a pure completion task.

image = Image.open(image_path).convert("RGB")
processor = AutoProcessor.from_pretrained(
    checkpoint,
    trust_remote_code=True,
)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": prompt},
        ],
    }
]
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(
    text=input_text,
    images=[image],
    add_special_tokens=False,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```
✅ Intended Use
- Research Foundation: An ideal starting point for researchers focusing on Representation Learning or Model Efficiency in video understanding.
- SFT Starting Point: The recommended backbone for training your own Offline SFT or Real-Time Streaming variants.
- Architecture Exploration: Test new multimodal alignment techniques, temporal encodings, or domain-specific adaptation.
⚠️ Limitations & Future Outlook
- Base Model Nature: This checkpoint is pretrained only and has not undergone instruction tuning. It may generate repetitive text or fail to follow complex instructions without further SFT.
- Performance Benchmarking: While leading in real-time architectural innovation, a performance gap still exists compared to top-tier models like Qwen2.5-VL. Closing this gap is the core focus of our ongoing iterations.
- Scalable Distributed Training: The current training pipeline is optimized for architectural validation. We are migrating to the Megatron-LM framework to leverage 3D parallelism (Tensor, Pipeline, and Data Parallelism) for larger-scale pre-training.
- Open-Source Commitment: In the next major release, we will officially open-source the complete training codebase (integrated with Megatron-LM) and more diverse datasets to the community.
🧩 Requirements
- Python: 3.10+
- PyTorch: 1.13.1+ (GPU strongly recommended)
- Tested setup: Python 3.12.4 + PyTorch 2.4.0 (CUDA 12.1) + DeepSpeed 0.16.1
- CPU-only: PyTorch 2.4.0
- Transformers: required with `trust_remote_code=True` for this model family (due to `auto_map` custom code)
- Optional (recommended): FlashAttention 2 (`attn_implementation="flash_attention_2"`)
- Video decode:
  - streaming demo imports OpenCV (`cv2`)
  - offline demo relies on the processor's video loading backend
For full environment setup (including optional FlashAttention2 extras), see the top-level repository README.md.
⚠️ Notes
- This is a base model directory. Quality/latency characteristics (offline SFT, real-time streaming, etc.) depend on the specific fine-tuned checkpoints and inference pipeline.
- The Python source files in this directory are referenced via `auto_map` in `config.json`.
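For orientation, an `auto_map` block in `config.json` usually looks like the following (the module and class names shown here are illustrative placeholders, not this repo's actual file names):

```json
{
  "auto_map": {
    "AutoModelForCausalLM": "modeling_moss_video.MossVideoForCausalLM",
    "AutoProcessor": "processing_moss_video.MossVideoProcessor"
  }
}
```

This mapping is what lets `AutoModelForCausalLM.from_pretrained(..., trust_remote_code=True)` locate and load the custom classes shipped alongside the weights.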
🌟 Our Mission & Community Invitation
We have filled the gap in cross-attention-based foundation models for video understanding.
We warmly welcome experts in Representation Learning and Model Efficiency to explore, experiment, and innovate on top of our architecture. Let's push the boundaries of video intelligence and advance the open-source community together!
Citation
```bibtex
@misc{moss_video_2026,
  title        = {{MOSS-Video-Preview: Next-Generation Real-Time Video Understanding}},
  author       = {OpenMOSS Team},
  year         = {2026},
  howpublished = {\url{https://github.com/fnlp-vision/MOSS-Video-Preview}},
  note         = {GitHub repository}
}
```