---
language:
- en
library_name: transformers
pipeline_tag: video-text-to-text
license: apache-2.0
model_type: video_mllama
tags:
- multimodal
- video
- vision-language
- mllama
- video-text-to-text
---
# MOSS-Video-Preview-Base
## Introduction
We introduce **MOSS-Video-Preview-Base**, the pretrained foundation checkpoint in the MOSS-Video-Preview series.
> [!Important]
> This is a **pretrained** model checkpoint **without** supervised instruction tuning (no offline SFT / no Real-Time SFT).
This repo contains the **pretrained weights**, intended as the starting point for two downstream training stages:
- **Offline SFT**: instruction-following and reasoning on full video segments
- **Real-Time SFT**: low-latency streaming video understanding and response
## 🌟 Key Highlights
- **🧩 First Cross-Attention Base**: A unique foundation model architecture designed for native video-language understanding, moving beyond simple feature concatenation.
- **🔄 Streaming-Ready Backbone**: The underlying architecture is natively designed to support "Silence-Speak" switching and real-time interruption (requires subsequent Real-Time SFT).
- **⚡ Extreme Efficiency**: Optimized for **Flash Attention 2** and compatible with **NPU/CUDA** platforms, providing a high-throughput starting point for long-video research.
## Model Architecture
**MOSS-Video-Preview-Base** is the foundational checkpoint of the series, featuring a **Pioneering Image-Video Unified Cross-Attention Architecture**:
- **Native Unified Design**: Unlike traditional projection-based models, this architecture provides native, unified support for both image and video streams, ensuring seamless temporal consistency and visual-language decoupling.
- **Cross-Modal Projector**: Powered by the proprietary `VideoMllamaTextCrossAttention` mechanism, it achieves high-efficiency semantic alignment between temporal visual features and linguistic context.
- **Unified Spatio-Temporal Encoding**: Aligns video frame sequences with text tokens, providing a robust backbone for long-context multimodal reasoning.
For architecture diagrams and full system details, see the top-level repository: [fnlp-vision/MOSS-Video-Preview](https://github.com/fnlp-vision/MOSS-Video-Preview).
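The `VideoMllamaTextCrossAttention` module itself is custom code shipped with the checkpoint, so its exact implementation lives in this repo's source files. As a rough, framework-free sketch of the underlying idea, the following shows single-head cross-attention in which text tokens act as queries over video features. This is illustrative only: the real module uses learned Q/K/V projections, multiple heads, and additional machinery, none of which appear here.

```python
import numpy as np

def cross_attention(text_hidden, video_feats):
    """Illustrative single-head cross-attention: text tokens (queries)
    attend over video features (keys/values). Identity projections are
    used in place of the model's learned Q/K/V weights."""
    d = text_hidden.shape[-1]
    q, k, v = text_hidden, video_feats, video_feats
    scores = q @ k.T / np.sqrt(d)                    # (n_text, n_video)
    # numerically stable softmax over the video tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                               # (n_text, d)

rng = np.random.default_rng(0)
text = rng.normal(size=(4, 8))    # 4 text tokens, hidden dim 8
video = rng.normal(size=(16, 8))  # 16 video patch/frame features, dim 8
out = cross_attention(text, video)
print(out.shape)  # (4, 8): one video-conditioned vector per text token
```

The key property is that the output keeps the text sequence length while mixing in temporal visual context, which is what lets the language backbone stay decoupled from the vision stream.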
## 🚀 Quickstart
### Video inference
```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

checkpoint = "fnlp-vision/moss-video-preview-base"
video_path = "data/example_video.mp4"
prompt = ""  # For the base model, leave the prompt empty to run a completion task.

processor = AutoProcessor.from_pretrained(
    checkpoint,
    trust_remote_code=True,
    frame_extract_num_threads=1,
)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "video"},
            {"type": "text", "text": prompt},
        ],
    }
]
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(
    text=input_text,
    videos=[video_path],
    video_fps=1.0,
    video_minlen=8,
    video_maxlen=16,
    add_special_tokens=False,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=512, do_sample=False)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```
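The exact semantics of `video_fps`, `video_minlen`, and `video_maxlen` are defined by the repo's custom processor code; a plausible reading is that frames are sampled at the target rate and the count is clamped into `[video_minlen, video_maxlen]`. The hypothetical helper below sketches that assumed behavior only, to show why short clips are padded up and long clips capped:

```python
def planned_num_frames(duration_s, fps=1.0, minlen=8, maxlen=16):
    """Hypothetical sketch of how video_fps/video_minlen/video_maxlen
    could bound the number of sampled frames (assumed semantics; the
    actual logic lives in the repo's custom processor)."""
    n = int(duration_s * fps)           # frames at the target sampling rate
    return max(minlen, min(maxlen, n))  # clamp into [minlen, maxlen]

print(planned_num_frames(5))    # 8: short clip padded up to minlen
print(planned_num_frames(12))   # 12: within range, kept as-is
print(planned_num_frames(120))  # 16: long clip capped at maxlen
```

Consult the processor source in this repo for the authoritative behavior before tuning these values.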
### Image inference
```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

checkpoint = "fnlp-vision/moss-video-preview-base"
image_path = "data/example_image.jpg"
prompt = ""  # For the base model, leave the prompt empty to run a completion task.

image = Image.open(image_path).convert("RGB")
processor = AutoProcessor.from_pretrained(
    checkpoint,
    trust_remote_code=True,
)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": prompt},
        ],
    }
]
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(
    text=input_text,
    images=[image],
    add_special_tokens=False,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```
## ✅ Intended Use
- **Research Foundation**: An ideal starting point for researchers focusing on **Representation Learning** or **Model Efficiency** in video understanding.
- **SFT Starting Point**: The recommended backbone for training your own **Offline SFT** or **Real-Time Streaming** variants.
- **Architecture Exploration**: Test new multimodal alignment techniques, temporal encodings, or domain-specific adaptation.
## ⚠️ Limitations & Future Outlook
- **Base Model Nature**: This checkpoint is **pretrained only** and has not undergone instruction tuning. It may generate repetitive text or fail to follow complex instructions without further SFT.
- **Performance Benchmarking**: While the model leads in real-time architectural innovation, a performance gap remains compared with top-tier models such as **Qwen2.5-VL**. Closing this gap is the core focus of our ongoing iterations.
- **Scalable Distributed Training**: The current training pipeline is optimized for architectural validation. We are migrating to the **Megatron-LM framework** to leverage **3D parallelism (Tensor, Pipeline, and Data Parallelism)** for larger-scale pre-training.
- **Open-Source Commitment**: In the next major release, we will officially open-source the **complete training codebase (integrated with Megatron-LM)** and more diverse datasets to the community.
## 🧩 Requirements
- **Python**: 3.10+
- **PyTorch**: 1.13.1+ (GPU strongly recommended)
- **Tested setup**: Python 3.12.4 + PyTorch 2.4.0 (CUDA 12.1) + DeepSpeed 0.16.1
- **CPU-only**: PyTorch 2.4.0
- **Transformers**: this model family must be loaded with `trust_remote_code=True`, since the repo ships custom code via `auto_map`
- **Optional (recommended)**: FlashAttention 2 (`attn_implementation="flash_attention_2"`)
- **Video decode**:
  - the streaming demo imports OpenCV (`cv2`)
  - the offline demo relies on the processor's video-loading backend
For full environment setup (including optional FlashAttention2 extras), see the top-level repository `README.md`.
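Since FlashAttention 2 is optional, a common pattern is to select the attention backend at runtime and fall back to PyTorch's built-in SDPA kernels when the `flash-attn` package is not installed. A minimal sketch (the helper name is ours, not part of this repo):

```python
def pick_attn_implementation():
    """Return "flash_attention_2" if the flash-attn package is importable,
    otherwise fall back to PyTorch's built-in SDPA backend."""
    try:
        import flash_attn  # noqa: F401  (presence check only)
        return "flash_attention_2"
    except ImportError:
        return "sdpa"

attn_impl = pick_attn_implementation()
print(attn_impl)
```

Pass the result as `attn_implementation=attn_impl` to `from_pretrained`, as in the Quickstart examples above.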
## ⚠️ Notes
- This is a **base** model directory. Quality/latency characteristics (offline SFT, real-time streaming, etc.) depend on the specific fine-tuned checkpoints and inference pipeline.
- The Python source files in this directory are referenced via `auto_map` in `config.json`.
> [!IMPORTANT]
> ### 🌟 Our Mission & Community Invitation
> **We have filled the gap in cross-attention-based foundation models for video understanding.**
>
> We warmly welcome experts in **Representation Learning** and **Model Efficiency** to explore, experiment, and innovate on top of our architecture. Let's push the boundaries of video intelligence and advance the open-source community together!
## Citation
```bibtex
@misc{moss_video_2026,
  title        = {{MOSS-Video-Preview: Next-Generation Real-Time Video Understanding}},
  author       = {{OpenMOSS Team}},
  year         = {2026},
  howpublished = {\url{https://github.com/fnlp-vision/MOSS-Video-Preview}},
  note         = {GitHub repository}
}
```