language:
- en
library_name: transformers
pipeline_tag: video-text-to-text
license: apache-2.0
model_type: video_mllama
tags:
- multimodal
- video
- vision-language
- mllama
- video-text-to-text
---
# MOSS-Video-Preview-Base
## Introduction
We introduce **MOSS-Video-Preview-Base**, the pretrained foundation checkpoint in the MOSS-Video-Preview series.
> [!Important]
> This is a **pretrained** model checkpoint **without** supervised instruction tuning (no offline SFT / no Real-Time SFT).
This repo contains the **pretrained weights** intended to serve as the starting point for the downstream training stages:
- **Offline SFT**: instruction-following and reasoning on full video segments
- **Real-Time SFT**: low-latency streaming video understanding and response
## 🌟 Key Highlights
- **🧩 First Cross-Attention Base**: A unique foundation model architecture designed for native video-language understanding, moving beyond simple feature concatenation.
- **🔄 Streaming-Ready Backbone**: The underlying architecture is natively designed to support "Silence-Speak" switching and real-time interruption (requires subsequent Real-Time SFT).
- **⚡ Extreme Efficiency**: Optimized for **Flash Attention 2** and compatible with **NPU/CUDA** platforms, providing a high-throughput starting point for long-video research.
#### Model Architecture
**MOSS-Video-Preview-Base** is the foundational checkpoint of the series, featuring a **Pioneering Image-Video Unified Cross-Attention Architecture**:
<p align="center">
<img src="assets/model_structure.png" width="90%" alt="Model Architecture"/>
</p>
- **Native Unified Design**: Unlike traditional projection-based models, this architecture provides native, unified support for both image and video streams, ensuring seamless temporal consistency and visual-language decoupling.
- **Cross-Modal Projector**: Powered by the proprietary `VideoMllamaTextCrossAttention` mechanism, it achieves high-efficiency semantic alignment between temporal visual features and linguistic context.
- **Unified Spatio-Temporal Encoding**: Aligns video frame sequences with text tokens, providing a robust backbone for long-context multimodal reasoning.
For architecture diagrams and full system details, see the top-level repository: [fnlp-vision/MOSS-Video-Preview](https://github.com/fnlp-vision/MOSS-Video-Preview).
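The cross-attention code itself is not shown on this card, so the sketch below is a minimal, hypothetical illustration of the idea using `torch.nn.MultiheadAttention`: text hidden states form the queries and attend over per-frame visual features. The class name, dimensions, and residual/norm layout here are ours, not the model's.

```python
import torch
import torch.nn as nn

class TextVideoCrossAttention(nn.Module):
    """Illustrative text-to-video cross-attention block (not the shipped
    `VideoMllamaTextCrossAttention` code): text hidden states are the
    queries; per-frame visual features are the keys and values."""

    def __init__(self, hidden_dim: int = 64, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(hidden_dim)

    def forward(self, text_states: torch.Tensor, video_features: torch.Tensor) -> torch.Tensor:
        # text_states: (batch, text_len, dim); video_features: (batch, frames * patches, dim)
        attended, _ = self.attn(query=text_states, key=video_features, value=video_features)
        # Residual connection keeps the language stream intact when the visual signal is weak.
        return self.norm(text_states + attended)

layer = TextVideoCrossAttention()
text = torch.randn(1, 10, 64)       # 10 text tokens
video = torch.randn(1, 16 * 4, 64)  # 16 frames x 4 patch tokens each
out = layer(text, video)
print(out.shape)  # torch.Size([1, 10, 64]) -- same shape as the text stream
```

Note the output keeps the text stream's shape: visual context is injected into the language tokens rather than concatenated alongside them, which is the distinction the "beyond simple feature concatenation" highlight refers to.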
## 🚀 Quickstart
<details>
<summary><strong>Video inference</strong></summary>
```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

checkpoint = "fnlp-vision/moss-video-preview-base"
video_path = "data/example_video.mp4"
prompt = ""  # For the base model, leave the prompt empty to perform a completion task.

processor = AutoProcessor.from_pretrained(
    checkpoint,
    trust_remote_code=True,
    frame_extract_num_threads=1,
)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

# The video appears as a placeholder entry here; the file itself is passed to the processor below.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video"},
            {"type": "text", "text": prompt},
        ],
    }
]
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(
    text=input_text,
    videos=[video_path],
    video_fps=1.0,    # sample roughly one frame per second,
    video_minlen=8,   # clamped to between 8
    video_maxlen=16,  # and 16 frames
    add_special_tokens=False,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=512, do_sample=False)  # greedy decoding
print(processor.decode(output_ids[0], skip_special_tokens=True))
```
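Since `generate` on a causal LM returns the prompt tokens followed by the continuation, the decoded string includes the chat-template tokens of the (empty) prompt. A small helper, assuming `inputs` carries `input_ids` as in the snippet above, can keep only the newly generated tokens before decoding (the helper name is ours):

```python
import torch

def new_tokens_only(output_ids: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
    """Drop the echoed prompt: keep only tokens generated past the prompt length."""
    prompt_len = input_ids.shape[1]
    return output_ids[:, prompt_len:]

# Toy demonstration with dummy token ids:
prompt_ids = torch.tensor([[1, 2, 3]])
full_ids = torch.tensor([[1, 2, 3, 7, 8]])
print(new_tokens_only(full_ids, prompt_ids).tolist())  # [[7, 8]]
```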
</details>
<details>
<summary><strong>Image inference</strong></summary>
```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

checkpoint = "fnlp-vision/moss-video-preview-base"
image_path = "data/example_image.jpg"
prompt = ""  # For the base model, leave the prompt empty to perform a completion task.

image = Image.open(image_path).convert("RGB")

processor = AutoProcessor.from_pretrained(
    checkpoint,
    trust_remote_code=True,
)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": prompt},
        ],
    }
]
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(
    text=input_text,
    images=[image],
    add_special_tokens=False,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=256, do_sample=False)  # greedy decoding
print(processor.decode(output_ids[0], skip_special_tokens=True))
```
</details>
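In the video snippet, `video_fps`, `video_minlen`, and `video_maxlen` control frame sampling. The processor's actual logic lives in the model's custom code; the function below is only a plausible sketch of the intended behavior: sample at the target fps, then clamp the frame count to `[minlen, maxlen]`.

```python
def sample_frame_indices(total_frames: int, native_fps: float,
                         target_fps: float = 1.0, minlen: int = 8, maxlen: int = 16) -> list[int]:
    """Pick evenly spaced frame indices at roughly `target_fps`,
    clamped to between `minlen` and `maxlen` frames (illustrative sketch)."""
    duration = total_frames / native_fps
    n = int(round(duration * target_fps))
    n = max(minlen, min(maxlen, n))  # clamp the frame budget
    n = min(n, total_frames)         # never ask for more frames than exist
    step = total_frames / n
    return [min(int(i * step), total_frames - 1) for i in range(n)]

# A 10 s clip at 30 fps yields 10 frames; a 2 s clip is padded up to minlen.
print(len(sample_frame_indices(300, 30.0)))  # 10
print(len(sample_frame_indices(60, 30.0)))   # 8
```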
## ✅ Intended Use
- **Research Foundation**: An ideal starting point for researchers focusing on **Representation Learning** or **Model Efficiency** in video understanding.
- **SFT Starting Point**: The recommended backbone for training your own **Offline SFT** or **Real-Time Streaming** variants.
- **Architecture Exploration**: Test new multimodal alignment techniques, temporal encodings, or domain-specific adaptation.
## ⚠️ Limitations & Future Outlook
- **Base Model Nature**: This checkpoint is **pretrained only** and has not undergone instruction tuning. It may generate repetitive text or fail to follow complex instructions without further SFT.
- **Performance Benchmarking**: While the model leads in real-time architectural innovation, a performance gap still exists compared to top-tier models such as **Qwen2.5-VL**. Closing this gap is the core focus of our ongoing iterations.
- **Scalable Distributed Training**: The current training pipeline is optimized for architectural validation. We are migrating to the **Megatron-LM framework** to leverage **3D parallelism (Tensor, Pipeline, and Data Parallelism)** for larger-scale pre-training.
- **Open-Source Commitment**: In the next major release, we will officially open-source the **complete training codebase (integrated with Megatron-LM)** and more diverse datasets to the community.
## 🧩 Requirements
- **Python**: 3.10+
- **PyTorch**: 1.13.1+ (GPU strongly recommended)
- **Tested setup**: Python 3.12.4 + PyTorch 2.4.0 (CUDA 12.1) + DeepSpeed 0.16.1
- **CPU-only**: PyTorch 2.4.0
- **Transformers**: required with `trust_remote_code=True` for this model family (due to `auto_map` custom code)
- **Optional (recommended)**: FlashAttention 2 (`attn_implementation="flash_attention_2"`)
- **Video decode**:
- streaming demo imports OpenCV (`cv2`)
- offline demo relies on the processor's video loading backend
For full environment setup (including optional FlashAttention2 extras), see the top-level repository `README.md`.
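As a quick sanity check before running the snippets above, the sketch below reports whether each package this card mentions is importable (the package and import names are the standard PyPI ones; `flash_attn` and `cv2` are optional):

```python
import importlib.util
from importlib.metadata import PackageNotFoundError, version

def check_environment() -> dict:
    """Report whether each package the model card mentions is importable,
    plus its installed version when distribution metadata is available."""
    targets = [  # (distribution name on PyPI, import name)
        ("torch", "torch"),
        ("transformers", "transformers"),
        ("flash-attn", "flash_attn"),  # optional: FlashAttention 2
        ("opencv-python", "cv2"),      # optional: streaming demo video decode
    ]
    report = {}
    for dist, module in targets:
        try:
            ver = version(dist)
        except PackageNotFoundError:
            ver = None
        report[module] = {
            "installed": importlib.util.find_spec(module) is not None,
            "version": ver,
        }
    return report

for name, info in check_environment().items():
    print(f"{name}: installed={info['installed']} version={info['version']}")
```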
## ⚠️ Notes
- This is a **base** model directory. Quality/latency characteristics (offline SFT, real-time streaming, etc.) depend on the specific fine-tuned checkpoints and inference pipeline.
- The Python source files in this directory are referenced via `auto_map` in `config.json`.
> [!IMPORTANT]
> ### 🌟 Our Mission & Community Invitation
> **We have filled the gap in cross-attention-based foundation models for video understanding.**
>
> We warmly welcome experts in **Representation Learning** and **Model Efficiency** to explore, experiment, and innovate on top of our architecture. Let's push the boundaries of video intelligence and advance the open-source community together!
## Citation
```bibtex
@misc{moss_video_2026,
title = {{MOSS-Video-Preview: Next-Generation Real-Time Video Understanding}},
author = {OpenMOSS Team},
year = {2026},
howpublished = {\url{https://github.com/fnlp-vision/MOSS-Video-Preview}},
note = {GitHub repository}
}
```