---
language:
- en
library_name: transformers
pipeline_tag: video-text-to-text
license: apache-2.0
model_type: video_mllama
tags:
- multimodal
- video
- vision-language
- mllama
- video-text-to-text
---

# MOSS-Video-Preview-Base

## Introduction

We introduce **MOSS-Video-Preview-Base**, the pretrained foundation checkpoint in the MOSS-Video-Preview series.

> [!IMPORTANT]
> This is a **pretrained** model checkpoint **without** supervised instruction tuning (no offline SFT / no Real-Time SFT).

This repo contains the **pretrained weights**, intended to serve as the starting point for downstream:

- **Offline SFT**: instruction following and reasoning on full video segments
- **Real-Time SFT**: low-latency streaming video understanding and response

## 🌟 Key Highlights

- **🧩 First Cross-Attention Base**: A foundation-model architecture designed for native video-language understanding, moving beyond simple feature concatenation.
- **🔄 Streaming-Ready Backbone**: The underlying architecture is natively designed to support "Silence-Speak" switching and real-time interruption (requires subsequent Real-Time SFT).
- **⚡ Extreme Efficiency**: Optimized for **Flash Attention 2** and compatible with **NPU/CUDA** platforms, providing a high-throughput starting point for long-video research.

#### Model Architecture

**MOSS-Video-Preview-Base** is the foundational checkpoint of the series, featuring a **Pioneering Image-Video Unified Cross-Attention Architecture**:


- **Native Unified Design**: Unlike traditional projection-based models, this architecture provides native, unified support for both image and video streams, ensuring seamless temporal consistency and visual-language decoupling.
- **Cross-Modal Projector**: Powered by the proprietary `VideoMllamaTextCrossAttention` mechanism, it achieves high-efficiency semantic alignment between temporal visual features and linguistic context.
- **Unified Spatio-Temporal Encoding**: Aligns video frame sequences with text tokens, providing a robust backbone for long-context multimodal reasoning.

For architecture diagrams and full system details, see the top-level repository: [fnlp-vision/MOSS-Video-Preview](https://github.com/fnlp-vision/MOSS-Video-Preview).

## 🚀 Quickstart
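Before the full pipeline examples below, the cross-attention alignment described in the architecture section can be illustrated with a minimal single-head sketch. This is a pure-NumPy illustration with made-up shapes and weights; it is **not** the actual `VideoMllamaTextCrossAttention` implementation, only the general idea of text tokens attending over frame-level visual features:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(text_hidden, video_features, Wq, Wk, Wv):
    """Text tokens (queries) attend over video features (keys/values)."""
    q = text_hidden @ Wq        # (T_text, d)
    k = video_features @ Wk     # (T_video, d)
    v = video_features @ Wv     # (T_video, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (T_text, T_video)
    weights = softmax(scores, axis=-1)        # each text token's attention over frames
    return weights @ v          # (T_text, d): video-conditioned text states

rng = np.random.default_rng(0)
d = 64
text = rng.normal(size=(10, d))    # 10 text-token hidden states (hypothetical)
video = rng.normal(size=(32, d))   # 32 frame-level visual features (hypothetical)
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.05 for _ in range(3))
out = cross_attention(text, video, Wq, Wk, Wv)
print(out.shape)  # (10, 64)
```

The design choice this sketches is why the card contrasts cross-attention with "simple feature concatenation": visual tokens never enter the language sequence itself, so the text context length is independent of video length.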
### Video inference

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

checkpoint = "fnlp-vision/moss-video-preview-base"
video_path = "data/example_video.mp4"
prompt = ""  # For the base model, leave the prompt empty to perform a completion task.

processor = AutoProcessor.from_pretrained(
    checkpoint,
    trust_remote_code=True,
    frame_extract_num_threads=1,
)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "video"},
            {"type": "text", "text": prompt},
        ],
    }
]
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(
    text=input_text,
    videos=[video_path],
    video_fps=1.0,
    video_minlen=8,
    video_maxlen=16,
    add_special_tokens=False,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=512, do_sample=False)

print(processor.decode(output_ids[0], skip_special_tokens=True))
```
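The `video_fps`, `video_minlen`, and `video_maxlen` arguments above control how many frames are sampled from the clip. The helper below is a hypothetical sketch of what such clamping logic plausibly looks like (uniform sampling at the target rate, clamped to a frame-count range); the processor's actual implementation may differ:

```python
def plan_frame_indices(total_frames, native_fps, video_fps=1.0,
                       video_minlen=8, video_maxlen=16):
    """Pick frame indices at ~video_fps, clamped to [video_minlen, video_maxlen].

    Illustrative helper (not part of the model repo).
    """
    duration_s = total_frames / native_fps
    n = round(duration_s * video_fps)             # frames at the target sampling rate
    n = max(video_minlen, min(video_maxlen, n))   # clamp to [minlen, maxlen]
    n = min(n, total_frames)                      # never request more frames than exist
    # Spread n indices uniformly across the clip.
    step = total_frames / n
    return [min(total_frames - 1, int(i * step)) for i in range(n)]

# A 60 s clip at 30 fps: 60 frames at 1 fps exceeds video_maxlen, so clamp to 16.
indices = plan_frame_indices(total_frames=1800, native_fps=30.0)
print(len(indices))  # 16
```

Under these assumptions, very short clips are padded up to `video_minlen` frames and long clips are capped at `video_maxlen`, which bounds the visual sequence length the model must attend over.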
### Image inference

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

checkpoint = "fnlp-vision/moss-video-preview-base"
image_path = "data/example_image.jpg"
prompt = ""  # For the base model, leave the prompt empty to perform a completion task.

image = Image.open(image_path).convert("RGB")

processor = AutoProcessor.from_pretrained(
    checkpoint,
    trust_remote_code=True,
)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": prompt},
        ],
    }
]
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(
    text=input_text,
    images=[image],
    add_special_tokens=False,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=256, do_sample=False)

print(processor.decode(output_ids[0], skip_special_tokens=True))
```
## ✅ Intended Use

- **Research Foundation**: An ideal starting point for researchers focusing on **Representation Learning** or **Model Efficiency** in video understanding.
- **SFT Starting Point**: The recommended backbone for training your own **Offline SFT** or **Real-Time Streaming** variants.
- **Architecture Exploration**: Test new multimodal alignment techniques, temporal encodings, or domain-specific adaptation.

## ⚠️ Limitations & Future Outlook

- **Base Model Nature**: This checkpoint is **pretrained only** and has not undergone instruction tuning. It may generate repetitive text or fail to follow complex instructions without further SFT.
- **Performance Benchmarking**: While leading in real-time architectural innovation, a performance gap still exists compared to top-tier models such as **Qwen2.5-VL**. Closing this gap is the core focus of our ongoing iterations.
- **Scalable Distributed Training**: The current training pipeline is optimized for architectural validation. We are migrating to the **Megatron-LM framework** to leverage **3D parallelism (Tensor, Pipeline, and Data Parallelism)** for larger-scale pre-training.
- **Open-Source Commitment**: In the next major release, we will officially open-source the **complete training codebase (integrated with Megatron-LM)** and more diverse datasets to the community.

## 🧩 Requirements

- **Python**: 3.10+
- **PyTorch**: 1.13.1+ (GPU strongly recommended)
  - **Tested setup**: Python 3.12.4 + PyTorch 2.4.0 (CUDA 12.1) + DeepSpeed 0.16.1
  - **CPU-only**: PyTorch 2.4.0
- **Transformers**: must be loaded with `trust_remote_code=True` for this model family (due to `auto_map` custom code)
- **Optional (recommended)**: FlashAttention 2 (`attn_implementation="flash_attention_2"`)
- **Video decode**:
  - the streaming demo imports OpenCV (`cv2`)
  - the offline demo relies on the processor's video-loading backend

For full environment setup (including optional FlashAttention 2 extras), see the top-level repository `README.md`.
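Since FlashAttention 2 is optional, a script can pick an attention backend at load time. The helper below is an illustrative sketch (not part of the model repo), assuming the usual `flash_attn` package name and falling back to Transformers' built-in `"sdpa"` implementation:

```python
import importlib.util

def pick_attn_implementation():
    """Return an attn_implementation string for from_pretrained().

    Illustrative helper: uses FlashAttention 2 when the flash_attn
    package is importable, otherwise falls back to PyTorch SDPA.
    """
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"

impl = pick_attn_implementation()
print(impl)
```

The result can then be passed as `attn_implementation=impl` to `AutoModelForCausalLM.from_pretrained(...)` in the quickstart examples, so the same script runs on machines with and without FlashAttention 2 installed.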
## ⚠️ Notes

- This is a **base** model directory. Quality/latency characteristics (offline SFT, real-time streaming, etc.) depend on the specific fine-tuned checkpoints and inference pipeline.
- The Python source files in this directory are referenced via `auto_map` in `config.json`.

> [!IMPORTANT]
> ### 🌟 Our Mission & Community Invitation
> **We have filled the gap in cross-attention-based foundation models for video understanding.**
>
> We warmly welcome experts in **Representation Learning** and **Model Efficiency** to explore, experiment, and innovate on top of our architecture. Let's push the boundaries of video intelligence and advance the open-source community together!

## Citation

```bibtex
@misc{moss_video_2026,
  title        = {{MOSS-Video-Preview: Next-Generation Real-Time Video Understanding}},
  author       = {OpenMOSS Team},
  year         = {2026},
  howpublished = {\url{https://github.com/fnlp-vision/MOSS-Video-Preview}},
  note         = {GitHub repository}
}
```