---
title: MOSS-VL-Instruct-0408
date: 2026-04-08
category: Multimodal-LLM
status: SFT
language:
- en
library_name: transformers
pipeline_tag: video-text-to-text
license: apache-2.0
base_model: fnlp-vision/MOSS-VL-Base-0408
tags:
- SFT
- Video-Understanding
- Image-Understanding
- MOSS-VL
- OpenMOSS
- multimodal
- video
- vision-language
---
<p align="center">
<img src="assets/logo.png" width="320"/>
</p>
# MOSS-VL-Instruct-0408
## Introduction
MOSS-VL-Instruct-0408 is the instruction-tuned checkpoint of the MOSS-VL series, part of the OpenMOSS ecosystem dedicated to advancing visual understanding.
Built on top of MOSS-VL-Base-0408 through supervised fine-tuning (SFT), this checkpoint is designed as a high-performance offline multimodal engine. It delivers strong, well-rounded performance across the full spectrum of vision-language tasks, including image understanding, OCR, document parsing, visual reasoning, and instruction following, and is particularly strong at video understanding, from long-form comprehension to fine-grained temporal reasoning and action recognition.
### Highlights
- **Outstanding Video Understanding**: A core strength of MOSS-VL. The model excels at long-form video comprehension, temporal reasoning, action recognition, and second-level event localization, delivering top-tier results on benchmarks such as VideoMME and MLVU.
- **Strong General Multimodal Perception**: Robust image understanding, fine-grained object recognition, OCR, and document parsing.
- **Reliable Instruction Following**: Substantially improved alignment with user intent through supervised fine-tuning on diverse multimodal instruction data.
---
## Model Architecture
**MOSS-VL-Instruct-0408** adopts a cross-attention-based architecture that decouples visual encoding from cognitive reasoning. This design drives latency down to the **millisecond level**, enabling instantaneous responses to dynamic video streams. Natively supporting **interleaved modalities**, it processes complex sequences of images and videos within a unified pipeline, eliminating the need for heavy pre-processing.
<p align="center">
<img src="assets/structure.png" alt="MOSS-VL Architecture" width="90%"/>
</p>
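To make the decoupling concrete, here is a minimal single-head cross-attention sketch in NumPy, in which text queries attend over precomputed vision features. All shapes, weights, and the helper name are illustrative only; the actual MOSS-VL layers are multi-head, learned, and interleaved with the language model.

```python
import numpy as np

def cross_attention(text_states, vision_states, d_k):
    """Single-head cross-attention: text queries attend to vision keys/values.
    A conceptual sketch only, not the MOSS-VL implementation."""
    rng = np.random.default_rng(0)
    d_text = text_states.shape[-1]
    d_vis = vision_states.shape[-1]
    # Hypothetical projection weights (random here, learned in a real model).
    W_q = rng.standard_normal((d_text, d_k)) / np.sqrt(d_text)
    W_k = rng.standard_normal((d_vis, d_k)) / np.sqrt(d_vis)
    W_v = rng.standard_normal((d_vis, d_k)) / np.sqrt(d_vis)

    Q = text_states @ W_q              # (n_text, d_k)
    K = vision_states @ W_k            # (n_vision, d_k)
    V = vision_states @ W_v            # (n_vision, d_k)

    scores = Q @ K.T / np.sqrt(d_k)    # (n_text, n_vision)
    # Row-wise softmax: each text token distributes attention over vision tokens.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                 # (n_text, d_k)

text = np.ones((4, 32))      # 4 text tokens
vision = np.ones((77, 32))   # 77 vision tokens (e.g. patches of one frame)
out = cross_attention(text, vision, d_k=16)
print(out.shape)  # (4, 16)
```

Because the vision features are consumed through cross-attention rather than concatenated into the text sequence, the language model's context length stays independent of how many frames are encoded.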
## Absolute Timestamps
To ensure the model accurately perceives the pacing and duration of events, **MOSS-VL-Instruct-0408** injects **absolute timestamps** alongside each sampled frame, grounding the reasoning process in a **precise temporal reference**.
<p align="center">
<img src="assets/timestamp_input.svg" alt="Timestamped Sequence Input Illustration" width="90%"/>
</p>
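Conceptually, timestamp injection pairs each sampled frame with its absolute time in the clip. The sketch below is a hypothetical illustration of this idea; the function name, tag format, and sampling logic are invented for the example and do not match the processor's internals.

```python
def sample_frames_with_timestamps(duration_s, fps=1.0, max_frames=256):
    """Sample frames at a fixed fps and pair each with its absolute timestamp.
    Illustrative only; the real MOSS-VL processor may sample differently."""
    n = min(int(duration_s * fps) or 1, max_frames)
    times = [i / fps for i in range(n)]
    # Each frame is annotated with its absolute time so the model can reason
    # about pacing and duration rather than only frame order.
    return [f"[t={t:.1f}s] <frame_{i}>" for i, t in enumerate(times)]

print(sample_frames_with_timestamps(5.0, fps=1.0))
```

With absolute times attached, questions such as "how long does the action last?" reduce to arithmetic over grounded timestamps instead of guesses from frame counts.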
## Cross-attention RoPE (XRoPE)
MOSS-VL utilizes Cross-attention Rotary Position Embedding (XRoPE), tailored to its cross-attention-based vision-language architecture. This mechanism maps text tokens and video patches into a unified 3D coordinate space defined by Time (t), Height (h), and Width (w).
<p align="center">
<img src="assets/3d-rope.png" alt="MOSS-VL XRoPE Architecture Illustration" width="80%"/>
</p>
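The unified (t, h, w) coordinate space can be sketched as follows. This is a simplified layout of position indices before any rotary embedding is applied; the actual coordinate assignment in XRoPE may differ, and the function name is invented for illustration.

```python
import numpy as np

def build_3d_positions(n_text, n_frames, grid_h, grid_w):
    """Assign each token a (t, h, w) coordinate, in the spirit of XRoPE.
    A simplified sketch; MOSS-VL's actual scheme may differ."""
    # Text tokens advance along the temporal axis only.
    text_pos = [(i, 0, 0) for i in range(n_text)]
    # Each video patch gets its frame index plus its spatial grid location.
    vision_pos = [
        (t, h, w)
        for t in range(n_frames)
        for h in range(grid_h)
        for w in range(grid_w)
    ]
    return np.array(text_pos), np.array(vision_pos)

text_pos, vision_pos = build_3d_positions(n_text=3, n_frames=2, grid_h=2, grid_w=2)
print(vision_pos.shape)  # (8, 3)
```

Splitting the rotary dimensions across t, h, and w lets attention distinguish "same place, later frame" from "same frame, different place", which a single 1D position index cannot express.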
## Model Performance
We conducted a comprehensive evaluation of **MOSS-VL-Instruct-0408** across four key dimensions: Multimodal Perception, Multimodal Reasoning, Document/OCR, and Video Understanding. The results demonstrate that MOSS-VL achieves outstanding performance, particularly excelling in **general multimodal perception** and **complex video analysis**.
### Key Highlights
* **Leading Video Intelligence**: MOSS-VL achieves a score of **65.8** in Video Understanding, significantly outperforming Qwen3-VL (+2 pts). It shows exceptional temporal consistency and action recognition capabilities across benchmarks like `VideoMME`, `MLVU`, `EgoSchema`, and `VSI-bench` (where it outperforms **Qwen3-VL-8B-Instruct** by **8.3 points**).
* **Outstanding Multimodal Perception**: MOSS-VL delivers excellent general image-text understanding, shining in fine-grained object recognition and spatial reasoning on benchmarks like `BLINK` and `MMBench`.
* **Robust Multimodal Reasoning**: MOSS-VL demonstrates solid logical inference, staying highly competitive with the latest Qwen series on challenging reasoning suites.
* **Reliable Document Understanding**: While the model is primarily optimized for general perception, MOSS-VL still delivers **83.9** on OCR and document analysis, ensuring dependable extraction of text and structured information.
<p align="center">
<img src="assets/MOSS-VL-benchmark.png" alt="MOSS-VL Benchmark Results" width="100%"/>
</p>
## Quickstart
### Installation
```bash
conda create -n moss_vl python=3.12 pip -y
conda activate moss_vl
pip install -i https://pypi.org/simple --no-build-isolation -r requirements.txt
```
### Run Inference
<details>
<summary><strong>Single-image offline inference (Python)</strong></summary>
<br>
```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor
checkpoint = "path/to/checkpoint"
image_path = "data/example_image.jpg"
prompt = "Describe this image."
def load_model(checkpoint: str):
    processor = AutoProcessor.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        frame_extract_num_threads=1,
    )
    model = AutoModelForCausalLM.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        device_map="auto",
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
    )
    return model, processor

model, processor = load_model(checkpoint)
text = model.offline_image_generate(
    processor,
    prompt=prompt,
    image=image_path,
    shortest_edge=4096,
    longest_edge=16777216,
    multi_image_max_pixels=201326592,
    patch_size=16,
    temporal_patch_size=1,
    merge_size=2,
    image_mean=[0.5, 0.5, 0.5],
    image_std=[0.5, 0.5, 0.5],
    max_new_tokens=256,
    temperature=1.0,
    top_k=50,
    top_p=1.0,
    repetition_penalty=1.0,
    do_sample=False,
    vision_chunked_length=64,
)
print(text)
```
</details>
<details>
<summary><strong>Single-video offline inference (Python)</strong></summary>
<br>
```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor
checkpoint = "path/to/checkpoint"
video_path = "data/example_video.mp4"
prompt = "Describe this video."
def load_model(checkpoint: str):
    processor = AutoProcessor.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        frame_extract_num_threads=1,
    )
    model = AutoModelForCausalLM.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        device_map="auto",
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
    )
    return model, processor

model, processor = load_model(checkpoint)
text = model.offline_video_generate(
    processor,
    prompt=prompt,
    video=video_path,
    shortest_edge=4096,
    longest_edge=16777216,
    video_max_pixels=201326592,
    patch_size=16,
    temporal_patch_size=1,
    merge_size=2,
    video_fps=1.0,
    min_frames=1,
    max_frames=256,
    num_extract_threads=4,
    image_mean=[0.5, 0.5, 0.5],
    image_std=[0.5, 0.5, 0.5],
    max_new_tokens=256,
    temperature=1.0,
    top_k=50,
    top_p=1.0,
    repetition_penalty=1.0,
    do_sample=False,
    vision_chunked_length=64,
)
print(text)
```
</details>
<details>
<summary><strong>Batched offline inference (Python)</strong></summary>
<br>
```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor
checkpoint = "path/to/checkpoint"
processor = AutoProcessor.from_pretrained(
checkpoint,
trust_remote_code=True,
frame_extract_num_threads=1,
)
model = AutoModelForCausalLM.from_pretrained(
checkpoint,
trust_remote_code=True,
device_map="auto",
torch_dtype=torch.bfloat16,
attn_implementation="flash_attention_2",
)
queries = [
    {
        "prompt": "Describe sample A.",
        "images": [],
        "videos": ["data/sample_a.mp4"],
        "media_kwargs": {"video_fps": 1.0, "min_frames": 8, "max_frames": 256},
        "generate_kwargs": {
            "temperature": 1.0,
            "top_k": 50,
            "top_p": 1.0,
            "max_new_tokens": 256,
            "repetition_penalty": 1.0,
            "do_sample": False,
        },
    },
    {
        "prompt": "Describe sample B.",
        "images": [],
        "videos": ["data/sample_b.mp4"],
        "media_kwargs": {"video_fps": 1.0, "min_frames": 8, "max_frames": 256},
        "generate_kwargs": {
            "temperature": 1.0,
            "top_k": 50,
            "top_p": 1.0,
            "max_new_tokens": 256,
            "repetition_penalty": 1.0,
            "do_sample": False,
        },
    },
]

with torch.no_grad():
    result = model.offline_batch_generate(processor, queries, vision_chunked_length=64)
texts = [item["text"] for item in result["results"]]
```
</details>
## Limitations and Future Work
MOSS-VL-Instruct-0408 represents an early milestone in the MOSS-VL roadmap, and we're actively working on several directions to push it further:
- **Math & Code Reasoning**: While the current checkpoint already exhibits solid general reasoning, we plan to substantially strengthen its mathematical and code reasoning capabilities, especially in multimodal contexts.
- **RL Post-Training**: We are working on a reinforcement learning post-training stage to further align the model with human preferences and to unlock stronger multi-step reasoning behaviors on top of the SFT foundation.
> [!NOTE]
> We welcome community feedback and contributions on any of these directions.
## Citation
```bibtex
@misc{moss_vl_2026,
title = {{MOSS-VL Technical Report}},
author = {OpenMOSS Team},
year = {2026},
howpublished = {\url{https://github.com/OpenMOSS/MOSS-VL}},
note = {GitHub repository}
}
```
|