---
title: MOSS-VL-Instruct-0408
date: 2026-04-08
category: Multimodal-LLM
status: SFT
language:
- en
library_name: transformers
pipeline_tag: video-text-to-text
license: apache-2.0
base_model: fnlp-vision/MOSS-VL-Base-0408
tags:
- SFT
- Video-Understanding
- Image-Understanding
- MOSS-VL
- OpenMOSS
- multimodal
- video
- vision-language
---

<p align="center">
   <img src="assets/logo.png" width="320"/>
</p>

# MOSS-VL-Instruct-0408

## 📌 Introduction

MOSS-VL-Instruct-0408 is the instruction-tuned checkpoint of the MOSS-VL series, part of the OpenMOSS ecosystem dedicated to advancing visual understanding.

Built on top of MOSS-VL-Base-0408 through supervised fine-tuning (SFT), this checkpoint is designed as a high-performance offline multimodal engine. It delivers strong, well-rounded performance across the full spectrum of vision-language tasks, including image understanding, OCR, document parsing, visual reasoning, and instruction following, and is particularly outstanding at video understanding, from long-form comprehension to fine-grained temporal reasoning and action recognition.

### ✨ Highlights

- 🎬 **Outstanding Video Understanding**: A core strength of MOSS-VL. The model excels at long-form video comprehension, temporal reasoning, action recognition, and second-level event localization, delivering top-tier results on benchmarks such as VideoMME and MLVU.
- 🖼️ **Strong General Multimodal Perception**: Robust image understanding, fine-grained object recognition, OCR, and document parsing.
- 💬 **Reliable Instruction Following**: Substantially improved alignment with user intent through supervised fine-tuning on diverse multimodal instruction data.


---

## ๐Ÿ— Model Architecture

**MOSS-VL-Instruct-0408** adopts a cross-attention-based architecture that decouples visual encoding from cognitive reasoning. This design drives latency down to the **millisecond level**, enabling instantaneous responses to dynamic video streams. Natively supporting **interleaved modalities**, it processes complex sequences of images and videos within a unified pipeline, eliminating the need for heavy pre-processing.

<p align="center">
    <img src="assets/structure.png" alt="MOSS-VL Architecture" width="90%"/>
</p>
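
To make the decoupling concrete, here is a minimal, illustrative PyTorch sketch; it is not the actual MOSS-VL implementation, and `CrossAttentionBlock`, the dimensions, and the tensor shapes are placeholders. The point is that vision features are encoded once and cached, while text hidden states attend to them through cross-attention, so decoding never re-runs the vision encoder.

```python
import torch
import torch.nn as nn


class CrossAttentionBlock(nn.Module):
    """Text hidden states attend to pre-computed vision features (illustrative)."""

    def __init__(self, d_model: int = 1024, n_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, text_hidden: torch.Tensor, vision_feats: torch.Tensor) -> torch.Tensor:
        # Queries come from text; keys/values come from cached vision features,
        # so the vision encoder runs once per frame, not once per decode step.
        attn_out, _ = self.cross_attn(text_hidden, vision_feats, vision_feats)
        return self.norm(text_hidden + attn_out)


vision_feats = torch.randn(1, 256, 1024)  # 256 patch features, encoded once
text_hidden = torch.randn(1, 32, 1024)    # hidden states for 32 text tokens
fused = CrossAttentionBlock()(text_hidden, vision_feats)
print(fused.shape)  # torch.Size([1, 32, 1024])
```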

## 🧩 Absolute Timestamps

To ensure the model accurately perceives the pacing and duration of events, **MOSS-VL-Instruct-0408** injects **absolute timestamps** alongside each sampled frame, grounding the reasoning process in a **precise temporal reference**.

<p align="center">
    <img src="assets/timestamp_input.svg" alt="Timestamped Sequence Input Illustration" width="90%"/>
</p>
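
The snippet below is a hypothetical illustration of this idea; the actual frame token format is internal to the MOSS-VL processor, and `sample_with_timestamps` is a placeholder helper. Frames sampled at a fixed fps are each tagged with their absolute position in the clip, making pacing and duration explicit in the input.

```python
def sample_with_timestamps(duration_s: float, fps: float, max_frames: int):
    """Return absolute timestamps (in seconds) for frames sampled at `fps`."""
    interval = 1.0 / fps
    timestamps, t = [], 0.0
    while t < duration_s and len(timestamps) < max_frames:
        timestamps.append(round(t, 2))
        t += interval
    return timestamps


# A 95 s clip at 1 fps yields frames stamped 0.0 s, 1.0 s, ..., 94.0 s, so the
# model can reason about when events occur, not just in which frame they occur.
for ts in sample_with_timestamps(duration_s=95.0, fps=1.0, max_frames=256)[:3]:
    print(f"<frame t={ts:.1f}s> ...")  # placeholder frame marker, not the real token
```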

## 🧬 Cross-attention RoPE (XRoPE)

MOSS-VL utilizes Cross-attention Rotary Position Embedding (XRoPE), tailored to its cross-attention-based vision-language architecture. This mechanism maps text tokens and video patches into a unified 3D coordinate space defined by Time (t), Height (h), and Width (w).

<p align="center">
    <img src="assets/3d-rope.png" alt="MOSS-VL mRoPE Architecture Illustration" width="80%"/>
</p>
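
As a rough sketch of what unified (t, h, w) indexing can look like (our illustration; the real XRoPE scheme, including any scaling or offsets, may differ): visual tokens take their spatio-temporal grid coordinates, while text tokens advance with the same index on all three axes, as in ordinary 1-D RoPE.

```python
def visual_position_ids(T: int, H: int, W: int):
    # Each visual token gets its (t, h, w) grid coordinate.
    return [(t, h, w) for t in range(T) for h in range(H) for w in range(W)]


def text_position_ids(start: int, length: int):
    # Text behaves like ordinary 1-D RoPE: the same index on every axis.
    return [(p, p, p) for p in range(start, start + length)]


vis = visual_position_ids(T=2, H=2, W=2)    # 8 patch positions
txt = text_position_ids(start=2, length=3)  # text continues past the video span
print(vis[:4])  # [(0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1)]
print(txt)      # [(2, 2, 2), (3, 3, 3), (4, 4, 4)]
```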

## 📊 Model Performance

We conducted a comprehensive evaluation of **MOSS-VL-Instruct-0408** across four key dimensions: Multimodal Perception, Multimodal Reasoning, Document/OCR, and Video Understanding. The results demonstrate that MOSS-VL achieves outstanding performance, particularly excelling in **general multimodal perception** and **complex video analysis**.

### 🌟 Key Highlights

*   **🚀 Leading Video Intelligence**: MOSS-VL achieves a score of **65.8** in Video Understanding, outperforming Qwen3-VL by 2 points. It shows exceptional temporal consistency and action recognition capabilities across benchmarks like `VideoMME`, `MLVU`, `EgoSchema`, and `VSI-bench` (where it outperforms **Qwen3-VL-8B-Instruct** by **8.3 points**).
*   **👁️ Outstanding Multimodal Perception**: MOSS-VL delivers excellent general image-text understanding, shining in fine-grained object recognition and spatial reasoning on benchmarks like `BLINK` and `MMBench`.
*   **🧠 Robust Multimodal Reasoning**: MOSS-VL demonstrates solid logical inference, staying highly competitive with the latest Qwen series on challenging reasoning suites.
*   **📄 Reliable Document Understanding**: While the model is primarily optimized for general perception, MOSS-VL still delivers **83.9** on OCR and document analysis, ensuring dependable extraction of text and structured information.


<p align="center">
    <img src="assets/MOSS-VL-benchmark.png" alt="MOSS-VL Benchmark Results" width="100%"/>
</p>

## 🚀 Quickstart
### 🛠️ Installation

```bash
conda create -n moss_vl python=3.12 pip -y
conda activate moss_vl
pip install -i https://pypi.org/simple --no-build-isolation -r requirements.txt
```
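
As a quick sanity check (our suggestion, not part of the official setup), you can confirm the environment provides CUDA, bfloat16, and flash-attn before loading the model, since the snippets below request `torch.bfloat16` and `attn_implementation="flash_attention_2"`:

```python
import torch

assert torch.cuda.is_available(), "MOSS-VL inference expects a CUDA GPU"
print("bf16 supported:", torch.cuda.is_bf16_supported())
try:
    import flash_attn  # noqa: F401
    print("flash-attn is installed")
except ImportError:
    print("flash-attn missing: install it or drop attn_implementation")
```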

### ๐Ÿƒ Run Inference


<details>
<summary><strong>Single-image offline inference (Python)</strong></summary>

<br>

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

checkpoint = "path/to/checkpoint"
image_path = "data/example_image.jpg"
prompt = "Describe this image."


def load_model(checkpoint: str):
    processor = AutoProcessor.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        frame_extract_num_threads=1,
    )
    model = AutoModelForCausalLM.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        device_map="auto",
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
    )
    return model, processor


model, processor = load_model(checkpoint)

text = model.offline_image_generate(
    processor,
    prompt=prompt,
    image=image_path,
    # vision pre-processing: resolution bounds and patch/merge settings
    shortest_edge=4096,
    longest_edge=16777216,
    multi_image_max_pixels=201326592,
    patch_size=16,
    temporal_patch_size=1,
    merge_size=2,
    image_mean=[0.5, 0.5, 0.5],
    image_std=[0.5, 0.5, 0.5],
    # decoding settings (greedy here, since do_sample=False)
    max_new_tokens=256,
    temperature=1.0,
    top_k=50,
    top_p=1.0,
    repetition_penalty=1.0,
    do_sample=False,
    vision_chunked_length=64,
)

print(text)
```

</details>

<details>
<summary><strong>Single-video offline inference (Python)</strong></summary>

<br>

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

checkpoint = "path/to/checkpoint"
video_path = "data/example_video.mp4"
prompt = "Describe this video."


def load_model(checkpoint: str):
    processor = AutoProcessor.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        frame_extract_num_threads=1,
    )
    model = AutoModelForCausalLM.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        device_map="auto",
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
    )
    return model, processor


model, processor = load_model(checkpoint)

text = model.offline_video_generate(
    processor,
    prompt=prompt,
    video=video_path,
    # vision pre-processing: resolution bounds and patch/merge settings
    shortest_edge=4096,
    longest_edge=16777216,
    video_max_pixels=201326592,
    patch_size=16,
    temporal_patch_size=1,
    merge_size=2,
    # frame sampling: 1 frame per second, between 1 and 256 frames in total
    video_fps=1.0,
    min_frames=1,
    max_frames=256,
    num_extract_threads=4,
    image_mean=[0.5, 0.5, 0.5],
    image_std=[0.5, 0.5, 0.5],
    # decoding settings (greedy here, since do_sample=False)
    max_new_tokens=256,
    temperature=1.0,
    top_k=50,
    top_p=1.0,
    repetition_penalty=1.0,
    do_sample=False,
    vision_chunked_length=64,
)

print(text)
```

</details>

<details>
<summary><strong>Batched offline inference (Python)</strong></summary>

<br>

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

checkpoint = "path/to/checkpoint"
processor = AutoProcessor.from_pretrained(
    checkpoint,
    trust_remote_code=True,
    frame_extract_num_threads=1,
)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

queries = [
    {
        "prompt": "Describe sample A.",
        "images": [],
        "videos": ["data/sample_a.mp4"],
        "media_kwargs": {"video_fps": 1.0, "min_frames": 8, "max_frames": 256},
        "generate_kwargs": {
            "temperature": 1.0,
            "top_k": 50,
            "top_p": 1.0,
            "max_new_tokens": 256,
            "repetition_penalty": 1.0,
            "do_sample": False,
        },
    },
    {
        "prompt": "Describe sample B.",
        "images": [],
        "videos": ["data/sample_b.mp4"],
        "media_kwargs": {"video_fps": 1.0, "min_frames": 8, "max_frames": 256},
        "generate_kwargs": {
            "temperature": 1.0,
            "top_k": 50,
            "top_p": 1.0,
            "max_new_tokens": 256,
            "repetition_penalty": 1.0,
            "do_sample": False,
        },
    },
]

# Run both queries in a single batched call; results come back in input order.
with torch.no_grad():
    result = model.offline_batch_generate(processor, queries, vision_chunked_length=64)

texts = [item["text"] for item in result["results"]]
print(texts)
```

</details>

## 🚧 Limitations and Future Work

MOSS-VL-Instruct-0408 represents an early milestone in the MOSS-VL roadmap, and we're actively working on several directions to push it further:

- 🧮 **Math & Code Reasoning**: While the current checkpoint already exhibits strong general reasoning, we plan to substantially strengthen its mathematical and code reasoning capabilities, especially in multimodal contexts.
- 🎯 **RL Post-Training**: We are working on a reinforcement learning post-training stage to further align the model with human preferences and to unlock stronger multi-step reasoning behaviors on top of the SFT foundation.


> [!NOTE]
> We welcome community feedback and contributions on any of these directions.



## 📜 Citation
```bibtex
@misc{moss_vl_2026,
  title         = {{MOSS-VL Technical Report}},
  author        = {OpenMOSS Team},
  year          = {2026},
  howpublished  = {\url{https://github.com/OpenMOSS/MOSS-VL}},
  note          = {GitHub repository}
}
```