---
language:
- en
library_name: transformers
pipeline_tag: video-text-to-text
license: apache-2.0
model_type: video_mllama
tags:
- multimodal
- video
- vision-language
- mllama
- video-text-to-text
---

# MOSS-Video-Preview-Base

## Introduction

We introduce **MOSS-Video-Preview-Base**, the pretrained foundation checkpoint in the MOSS-Video-Preview series.

> [!Important]
> This is a **pretrained** model checkpoint **without** supervised instruction tuning (no offline SFT / no Real-Time SFT).

This repo contains the **pretrained weights** that are intended to serve as the starting point for downstream:

- **Offline SFT**: instruction-following and reasoning on full video segments
- **Real-Time SFT**: low-latency streaming video understanding and response

## 🌟 Key Highlights

- **🧩 First Cross-Attention Base**: A unique foundation model architecture designed for native video-language understanding, moving beyond simple feature concatenation.
- **🔄 Streaming-Ready Backbone**: The underlying architecture is natively designed to support "Silence-Speak" switching and real-time interruption (requires subsequent Real-Time SFT).
- **⚡ Extreme Efficiency**: Optimized for **Flash Attention 2** and compatible with **NPU/CUDA** platforms, providing a high-throughput starting point for long-video research.

## Model Architecture

**MOSS-Video-Preview-Base** is the foundational checkpoint of the series, featuring a **Pioneering Image-Video Unified Cross-Attention Architecture**:

<p align="center">
  <img src="assets/model_structure.png" width="90%" alt="Model Architecture"/>
</p>

- **Native Unified Design**: Unlike traditional projection-based models, this architecture provides native, unified support for both image and video streams, ensuring seamless temporal consistency and visual-language decoupling.
- **Cross-Modal Projector**: Powered by the proprietary `VideoMllamaTextCrossAttention` mechanism, it achieves high-efficiency semantic alignment between temporal visual features and linguistic context.
- **Unified Spatio-Temporal Encoding**: Aligns video frame sequences with text tokens, providing a robust backbone for long-context multimodal reasoning.

For architecture diagrams and full system details, see the top-level repository: [fnlp-vision/MOSS-Video-Preview](https://github.com/fnlp-vision/MOSS-Video-Preview).
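
The `VideoMllamaTextCrossAttention` mechanism itself is proprietary, but the general pattern of text hidden states attending to projected video features can be sketched as follows. All dimensions, layer names, and the residual/norm placement below are illustrative assumptions, not the actual implementation:

```python
import torch
import torch.nn as nn

class TextToVideoCrossAttention(nn.Module):
    """Illustrative sketch: text hidden states attend to video features.

    This is a generic cross-attention pattern, NOT the proprietary
    VideoMllamaTextCrossAttention; dimensions and layout are assumed.
    """

    def __init__(self, text_dim=1024, video_dim=768, num_heads=8):
        super().__init__()
        # Project video features into the text hidden dimension.
        self.video_proj = nn.Linear(video_dim, text_dim)
        self.attn = nn.MultiheadAttention(text_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(text_dim)

    def forward(self, text_states, video_features):
        # text_states: (batch, text_len, text_dim)
        # video_features: (batch, num_frames * patches, video_dim)
        kv = self.video_proj(video_features)
        attended, _ = self.attn(query=text_states, key=kv, value=kv)
        # Residual connection keeps the language stream intact.
        return self.norm(text_states + attended)
```

The key design point is that the language stream is the query side, so visual information is injected without disturbing the text token layout, which is what enables the visual-language decoupling described above.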

## 🚀 Quickstart

<details>
<summary><strong>Video inference</strong></summary>

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor
checkpoint = "fnlp-vision/moss-video-preview-base"
video_path = "data/example_video.mp4"
prompt = "" # For base model, prompt is set to empty to perform completion task.

processor = AutoProcessor.from_pretrained(
    checkpoint,
    trust_remote_code=True,
    frame_extract_num_threads=1,
)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "video"},
            {"type": "text", "text": prompt},
        ],
    }
]

input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(
    text=input_text,
    videos=[video_path],
    video_fps=1.0,
    video_minlen=8,
    video_maxlen=16,
    add_special_tokens=False,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=512, do_sample=False)

print(processor.decode(output_ids[0], skip_special_tokens=True))

```



</details>
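
The `video_fps`, `video_minlen`, and `video_maxlen` arguments in the video snippet suggest fps-based frame sampling clamped to a frame-count range. A minimal sketch of how those three parameters could interact, assuming a simple clamp (the processor's actual sampling logic may differ):

```python
def target_frame_count(duration_s, video_fps=1.0, video_minlen=8, video_maxlen=16):
    """Illustrative only: clamp fps-based sampling to [minlen, maxlen] frames.

    duration_s: video length in seconds (assumed input, not a processor API).
    """
    n = int(duration_s * video_fps)  # frames implied by the sampling rate
    return max(video_minlen, min(n, video_maxlen))
```

For example, under these defaults a 4-second clip would still be sampled at 8 frames (the floor), while an hour-long video would be capped at 16.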

<details>
<summary><strong>Image inference</strong></summary>

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor
checkpoint = "fnlp-vision/moss-video-preview-base"
image_path = "data/example_image.jpg"
prompt = "" # For base model, prompt is set to empty to perform completion task.

image = Image.open(image_path).convert("RGB")

processor = AutoProcessor.from_pretrained(
    checkpoint,
    trust_remote_code=True,
)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": prompt},
        ],
    }
]

input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(
    text=input_text,
    images=[image],
    add_special_tokens=False,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=256, do_sample=False)

print(processor.decode(output_ids[0], skip_special_tokens=True))
```

</details>

## ✅ Intended Use

- **Research Foundation**: An ideal starting point for researchers focusing on **Representation Learning** or **Model Efficiency** in video understanding.
- **SFT Starting Point**: The recommended backbone for training your own **Offline SFT** or **Real-Time Streaming** variants.
- **Architecture Exploration**: Test new multimodal alignment techniques, temporal encodings, or domain-specific adaptation.

## ⚠️ Limitations & Future Outlook

- **Base Model Nature**: This checkpoint is **pretrained only** and has not undergone instruction tuning. It may generate repetitive text or fail to follow complex instructions without further SFT.
- **Performance Benchmarking**: While this series leads in real-time architectural innovation, a performance gap still exists compared to top-tier models such as **Qwen2.5-VL**. Closing this gap is the core focus of our ongoing iterations.
- **Scalable Distributed Training**: The current training pipeline is optimized for architectural validation. We are migrating to the **Megatron-LM framework** to leverage **3D parallelism (Tensor, Pipeline, and Data Parallelism)** for larger-scale pre-training.
- **Open-Source Commitment**: In the next major release, we will officially open-source the **complete training codebase (integrated with Megatron-LM)** and more diverse datasets to the community.

## 🧩 Requirements

- **Python**: 3.10+
- **PyTorch**: 1.13.1+ (GPU strongly recommended)
- **Tested setup**: Python 3.12.4 + PyTorch 2.4.0 (CUDA 12.1) + DeepSpeed 0.16.1
- **CPU-only**: PyTorch 2.4.0
- **Transformers**: required with `trust_remote_code=True` for this model family (due to `auto_map` custom code)
- **Optional (recommended)**: FlashAttention 2 (`attn_implementation="flash_attention_2"`)
- **Video decode**:
  - streaming demo imports OpenCV (`cv2`)
  - offline demo relies on the processor's video loading backend

For full environment setup (including optional FlashAttention2 extras), see the top-level repository `README.md`.
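
The tested setup above can be approximated with a short install script. The package names are standard, but the pinned versions simply mirror the tested configuration, and the CUDA wheel index should match your driver; treat this as a sketch, not the official setup:

```shell
# Sketch of the tested environment (Python 3.12, CUDA 12.1); adjust to your platform.
pip install "torch==2.4.0" --index-url https://download.pytorch.org/whl/cu121
pip install transformers "deepspeed==0.16.1" opencv-python
# Optional: FlashAttention 2 (requires a CUDA build toolchain).
pip install flash-attn --no-build-isolation
```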




## ⚠️ Notes

- This is a **base** model directory. Quality/latency characteristics (offline SFT, real-time streaming, etc.) depend on the specific fine-tuned checkpoints and inference pipeline.
- The Python source files in this directory are referenced via `auto_map` in `config.json`.
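
For reference, an `auto_map` block in `config.json` typically looks like the following; the module and class names here are hypothetical placeholders, not this repository's actual file names:

```json
{
  "auto_map": {
    "AutoModelForCausalLM": "modeling_video_mllama.VideoMllamaForCausalLM",
    "AutoProcessor": "processing_video_mllama.VideoMllamaProcessor"
  }
}
```

This mapping is why `trust_remote_code=True` is required: `AutoModelForCausalLM` and `AutoProcessor` resolve to the custom classes shipped in this directory rather than to built-in `transformers` classes.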


> [!IMPORTANT]
> ### 🌟 Our Mission & Community Invitation
> **We have filled the gap in cross-attention-based foundation models for video understanding.** 
> 
> We warmly welcome experts in **Representation Learning** and **Model Efficiency** to explore, experiment, and innovate on top of our architecture. Let's push the boundaries of video intelligence and advance the open-source community together!


## Citation
```bibtex
@misc{moss_video_2026,
  title         = {{MOSS-Video-Preview: Next-Generation Real-Time Video Understanding}},
  author        = {OpenMOSS Team},
  year          = {2026},
  howpublished  = {\url{https://github.com/fnlp-vision/MOSS-Video-Preview}},
  note          = {GitHub repository}
}
```