---
title: MOSS-VL-Base-0408
date: 2026-04-08
category: Multimodal-LLM
status: Base
language:
- en
library_name: transformers
pipeline_tag: video-text-to-text
license: apache-2.0
tags:
- Base
- Video-Understanding
- Image-Understanding
- MOSS-VL
- OpenMOSS
- multimodal
- video
- vision-language
---
<p align="center">
<img src="assets/logo.png" width="320"/>
</p>
# MOSS-VL-Base-0408
## Introduction
MOSS-VL-Base-0408 is the foundation checkpoint of the MOSS-VL series, part of the OpenMOSS ecosystem dedicated to advancing visual understanding.
Built through four stages of multimodal pretraining only (no instruction tuning or alignment), this checkpoint serves as a high-capacity offline multimodal base model. It provides strong general-purpose visual-linguistic representations across image and video inputs, and is intended primarily as the base model for downstream supervised fine-tuning, alignment, and domain adaptation.
Specifically, the pretraining pipeline is structured into the following four progressive stages:
- Stage 1: Vision-language alignment
- Stage 2: Large-scale multimodal pretraining
- Stage 3: High-quality multimodal pretraining
- Stage 4: Annealing and long-context extension
### Highlights
- **Native Dynamic Resolution**: MOSS-VL-Base-0408 natively processes images and video frames at their original aspect ratios and resolutions. By preserving the raw spatial layout, it faithfully captures fine visual details across diverse formats, from high-resolution photographs and dense document scans to ultra-wide screenshots.
- **Native Interleaved Image & Video Inputs**: The model accepts arbitrary combinations of images and videos within a single sequence. Through a unified end-to-end pipeline, it seamlessly handles complex mixed-modality prompts, multi-image comparisons, and interleaved visual narratives without requiring modality-specific pre-processing.
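To see what native dynamic resolution implies for sequence length, the sketch below estimates the vision-token count of an arbitrarily sized image under the `patch_size=16` and `merge_size=2` values used in the Quickstart. The formula is an assumption for illustration, not the processor's exact logic.

```python
import math

def estimate_vision_tokens(height: int, width: int,
                           patch_size: int = 16, merge_size: int = 2) -> int:
    """Rough vision-token estimate for a natively sized image.

    Assumes the image is split into patch_size x patch_size patches and
    adjacent patches are merged in merge_size x merge_size groups, as the
    Quickstart parameters suggest; the real preprocessing may differ.
    """
    unit = patch_size * merge_size  # effective pixels per merged-token side
    tokens_h = math.ceil(height / unit)
    tokens_w = math.ceil(width / unit)
    return tokens_h * tokens_w

# A 1024x768 photo kept at native resolution:
print(estimate_vision_tokens(768, 1024))  # 24 * 32 = 768 tokens
```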
## Model Architecture
**MOSS-VL-Base-0408** adopts a cross-attention-based architecture that decouples visual encoding from cognitive reasoning. Natively supporting interleaved modalities, it provides a multimodal backbone for image and video understanding.
<p align="center">
<img src="assets/structure.png" alt="MOSS-VL Architecture" width="90%"/>
</p>
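The decoupling described above can be pictured as text hidden states querying visual features through cross-attention. The toy module below sketches only that wiring; the dimensions, head count, and normalization choice are invented for illustration and are not MOSS-VL's actual configuration.

```python
import torch
import torch.nn as nn

class ToyCrossAttentionBlock(nn.Module):
    """Schematic only: text hidden states attend to visual features via
    cross-attention, keeping visual encoding decoupled from the language
    model. Sizes and wiring are illustrative, not MOSS-VL's real ones."""

    def __init__(self, d_model: int = 64, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, text_h: torch.Tensor, vision_feats: torch.Tensor):
        # Queries come from the text stream; keys/values from the visual stream.
        out, _ = self.attn(query=text_h, key=vision_feats, value=vision_feats)
        return self.norm(text_h + out)  # residual + norm

block = ToyCrossAttentionBlock()
text_h = torch.randn(1, 8, 64)      # 8 text tokens
vision = torch.randn(1, 120, 64)    # 120 visual features
print(block(text_h, vision).shape)  # torch.Size([1, 8, 64])
```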
## Absolute Timestamps
To help the model perceive the pacing and duration of events, **MOSS-VL-Base-0408** injects absolute timestamps alongside sampled video frames, giving the reasoning process an explicit temporal reference even at the pretrained base stage.
<p align="center">
<img src="assets/timestamp_input.svg" alt="Timestamped Sequence Input Illustration" width="90%"/>
</p>
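As a rough picture of timestamp injection, the snippet below pairs each sampled frame with an absolute `mm:ss` stamp derived from the sampling rate. The string format and `<frame_i>` placeholders are hypothetical; the real token layout is handled internally by the processor.

```python
def build_timestamped_sequence(num_frames: int, fps: float = 1.0):
    """Illustrative only: pair each sampled frame with its absolute
    timestamp from video start; the actual format used by MOSS-VL's
    processor may differ."""
    entries = []
    for i in range(num_frames):
        t = i / fps  # seconds from the start of the video
        stamp = f"{int(t) // 60:02d}:{int(t) % 60:02d}"
        entries.append((stamp, f"<frame_{i}>"))
    return entries

print(build_timestamped_sequence(3, fps=1.0))
# [('00:00', '<frame_0>'), ('00:01', '<frame_1>'), ('00:02', '<frame_2>')]
```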
## Cross-attention RoPE (XRoPE)
MOSS-VL utilizes Cross-attention Rotary Position Embedding (XRoPE), tailored to its cross-attention-based vision-language architecture. This mechanism maps text tokens and visual features into a unified 3D coordinate space defined by Time (t), Height (h), and Width (w), improving spatial-temporal grounding during multimodal reasoning.
<p align="center">
<img src="assets/3d-rope.png" alt="MOSS-VL XRoPE Architecture Illustration" width="80%"/>
</p>
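The unified (t, h, w) space can be illustrated with a toy position-assignment routine: text tokens advance all three axes in lockstep, while visual tokens take their grid coordinates offset by the running position. This mirrors the general idea of multimodal rotary indexing but is a simplified sketch, not MOSS-VL's exact XRoPE scheme.

```python
def assign_3d_positions(segments):
    """Toy illustration of a unified (t, h, w) coordinate space.

    `segments` is a list of ("text", n_tokens) or ("visual", (T, H, W))
    entries. Text tokens advance all three axes together; visual tokens
    take their grid coordinates, offset by the current position. A
    simplified sketch, not the model's actual XRoPE indexing.
    """
    coords, pos = [], 0
    for kind, spec in segments:
        if kind == "text":
            for _ in range(spec):
                coords.append((pos, pos, pos))
                pos += 1
        else:  # a visual grid of shape (T, H, W)
            T, H, W = spec
            for t in range(T):
                for h in range(H):
                    for w in range(W):
                        coords.append((pos + t, pos + h, pos + w))
            pos += max(T, H, W)
    return coords

# 2 text tokens, a 2x2x2 visual grid, then 1 more text token:
coords = assign_3d_positions([("text", 2), ("visual", (2, 2, 2)), ("text", 1)])
print(coords[0], coords[-1])  # (0, 0, 0) (4, 4, 4)
```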
## Quickstart
### Installation
```bash
conda create -n moss_vl python=3.12 pip -y
conda activate moss_vl
pip install -i https://pypi.org/simple --no-build-isolation -r requirements.txt
```
### Run Inference
<details>
<summary><strong>Single-image offline inference (Python)</strong></summary>
<br>
```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

checkpoint = "path/to/checkpoint"
image_path = "data/example_image.jpg"

def load_model(checkpoint: str):
    processor = AutoProcessor.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        frame_extract_num_threads=1,
    )
    model = AutoModelForCausalLM.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        device_map="auto",
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
    )
    return model, processor

model, processor = load_model(checkpoint)

text = model.offline_image_generate(
    processor,
    prompt="",
    image=image_path,
    shortest_edge=4096,
    longest_edge=16777216,
    multi_image_max_pixels=201326592,
    patch_size=16,
    temporal_patch_size=1,
    merge_size=2,
    image_mean=[0.5, 0.5, 0.5],
    image_std=[0.5, 0.5, 0.5],
    max_new_tokens=256,
    temperature=1.0,
    top_k=50,
    top_p=1.0,
    repetition_penalty=1.0,
    do_sample=False,
    vision_chunked_length=64,
    use_template=False,
)
print(text)
```
</details>
<details>
<summary><strong>Single-video offline inference (Python)</strong></summary>
<br>
```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

checkpoint = "path/to/checkpoint"
video_path = "data/example_video.mp4"

def load_model(checkpoint: str):
    processor = AutoProcessor.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        frame_extract_num_threads=1,
    )
    model = AutoModelForCausalLM.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        device_map="auto",
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
    )
    return model, processor

model, processor = load_model(checkpoint)

text = model.offline_video_generate(
    processor,
    prompt="",
    video=video_path,
    shortest_edge=4096,
    longest_edge=16777216,
    video_max_pixels=201326592,
    patch_size=16,
    temporal_patch_size=1,
    merge_size=2,
    video_fps=1.0,
    min_frames=1,
    max_frames=256,
    num_extract_threads=4,
    image_mean=[0.5, 0.5, 0.5],
    image_std=[0.5, 0.5, 0.5],
    max_new_tokens=256,
    temperature=1.0,
    top_k=50,
    top_p=1.0,
    repetition_penalty=1.0,
    do_sample=False,
    vision_chunked_length=64,
    use_template=False,
)
print(text)
```
</details>
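The `video_fps`, `min_frames`, and `max_frames` arguments above jointly determine how many frames enter the model. Assuming uniform sampling at `video_fps` clamped to those bounds (the processor's exact rounding may differ), the frame budget works out as:

```python
def num_sampled_frames(duration_s: float, video_fps: float = 1.0,
                       min_frames: int = 1, max_frames: int = 256) -> int:
    """Estimate the number of frames sampled from a video, assuming
    uniform sampling at `video_fps` clamped to [min_frames, max_frames];
    the processor's exact rounding may differ."""
    n = round(duration_s * video_fps)
    return max(min_frames, min(max_frames, n))

print(num_sampled_frames(90.0))    # 90 frames for a 90-second clip
print(num_sampled_frames(3600.0))  # a 1-hour video is capped at max_frames = 256
```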
<details>
<summary><strong>Batched offline inference (Python)</strong></summary>
<br>
```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

checkpoint = "path/to/checkpoint"

shared_generate_kwargs = {
    "temperature": 1.0,
    "top_k": 50,
    "top_p": 1.0,
    "max_new_tokens": 256,
    "repetition_penalty": 1.0,
    "do_sample": False,
}
shared_video_media_kwargs = {
    "min_pixels": 4096,
    "max_pixels": 16777216,
    "video_max_pixels": 201326592,
    "video_fps": 1.0,
    "min_frames": 1,
    "max_frames": 256,
}

def load_model(checkpoint: str):
    processor = AutoProcessor.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        frame_extract_num_threads=1,
    )
    model = AutoModelForCausalLM.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        device_map="auto",
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
    )
    return model, processor

model, processor = load_model(checkpoint)

queries = [
    {
        "images": ["data/sample_a.jpg"],
        "generate_kwargs": dict(shared_generate_kwargs),
    },
    {
        "videos": ["data/sample_b.mp4"],
        "media_kwargs": dict(shared_video_media_kwargs),
        "generate_kwargs": dict(shared_generate_kwargs),
    },
]

with torch.no_grad():
    result = model.offline_batch_generate(
        processor,
        queries,
        session_states=None,
        vision_chunked_length=64,
    )
texts = [item["text"] for item in result["results"]]
```
</details>
## Limitations and Future Work
MOSS-VL-Base-0408 is a pretrained base checkpoint, and we are actively improving several core capabilities for future iterations:
- **Stronger OCR, Especially for Long Documents**: We plan to further improve text recognition, document parsing, and long-document understanding. A key focus is near-lossless information extraction from extremely long, structurally complex inputs, such as accurately parsing text, tables, and mathematical layouts from multi-page academic papers (dozens of pages) or dense PDF reports without degrading context or structural integrity.
- **Expanded Extremely Long Video Understanding**: We aim to significantly extend the model's capacity for comprehending extremely long videos spanning several hours to dozens of hours. This includes advancing temporal reasoning and cross-frame event tracking for continuous analysis of full-length movies, lengthy meetings, or extended surveillance streams, enabling robust retrieval and understanding over ultra-long visual contexts.
> [!NOTE]
> We expect future releases to continue strengthening the base model itself while also enabling stronger downstream aligned variants built on top of it.
## Citation
```bibtex
@misc{moss_vl_2026,
title = {{MOSS-VL Technical Report}},
author = {OpenMOSS Team},
year = {2026},
howpublished = {\url{https://github.com/fnlp-vision/MOSS-VL}},
note = {GitHub repository}
}
```
|