---
title: MOSS-VL-Instruct-0408
date: 2026-04-08
category: Multimodal-LLM
status: SFT
language:
- en
library_name: transformers
pipeline_tag: video-text-to-text
license: apache-2.0
base_model: fnlp-vision/MOSS-VL-Base-0408
tags:
- SFT
- Video-Understanding
- Image-Understanding
- MOSS-VL
- OpenMOSS
- multimodal
- video
- vision-language
---
<p align="center">
<img src="assets/logo.png" width="320"/>
</p>
# MOSS-VL-Instruct-0408
## 📌 Introduction
MOSS-VL-Instruct-0408 is the instruction-tuned checkpoint of the MOSS-VL series, part of the OpenMOSS ecosystem dedicated to advancing visual understanding.
Built on top of MOSS-VL-Base-0408 through supervised fine-tuning (SFT), this checkpoint is designed as a high-performance offline multimodal engine. It delivers strong, well-rounded performance across the full spectrum of vision-language tasks — including image understanding, OCR, document parsing, visual reasoning, and instruction following — and is particularly outstanding at video understanding, from long-form comprehension to fine-grained temporal reasoning and action recognition.
### ✨ Highlights
- 🎬 **Outstanding Video Understanding** — A core strength of MOSS-VL. The model excels at long-form video comprehension, temporal reasoning, action recognition, and second-level event localization, delivering top-tier results on benchmarks such as VideoMME and MLVU.
- 🖼️ **Strong General Multimodal Perception** — Robust image understanding, fine-grained object recognition, OCR, and document parsing.
- 💬 **Reliable Instruction Following** — Substantially improved alignment with user intent through supervised fine-tuning on diverse multimodal instruction data.
---
## 🏗 Model Architecture
**MOSS-VL-Instruct-0408** adopts a cross-attention-based architecture that decouples visual encoding from cognitive reasoning. This design drives latency down to the **millisecond level**, enabling instantaneous responses to dynamic video streams. Natively supporting **interleaved modalities**, it processes complex sequences of images and videos within a unified pipeline — eliminating the need for heavy pre-processing.
<p align="center">
<img src="assets/structure.png" alt="MOSS-VL Architecture" width="90%"/>
</p>
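The decoupling described above can be sketched in miniature. This is an illustrative toy example, not the released implementation: vision features are encoded once into key/value vectors, and each text query attends over that cache, so the language decoder never re-encodes the video stream. All function names here are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cross_attention(text_queries, vision_keys, vision_values):
    """Each text query attends over pre-computed vision key/value vectors."""
    outputs = []
    for q in text_queries:
        # scaled dot-product scores against the cached vision keys
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(len(q))
                  for k in vision_keys]
        weights = softmax(scores)
        # weighted sum of the cached vision values
        out = [sum(w * v[d] for w, v in zip(weights, vision_values))
               for d in range(len(vision_values[0]))]
        outputs.append(out)
    return outputs

# one text query attending over two cached vision tokens
out = cross_attention([[1.0, 0.0]],
                      [[1.0, 0.0], [0.0, 1.0]],
                      [[2.0, 0.0], [0.0, 2.0]])
```

Because the vision keys/values are computed once per stream, each new text query costs only one attention pass over the cache, which is what makes low-latency responses to live video feasible.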
## 🧩 Absolute Timestamps
To ensure the model accurately perceives the pacing and duration of events, **MOSS-VL-Instruct-0408** injects **absolute timestamps** alongside each sampled frame, grounding the reasoning process in a **precise temporal reference**.
<p align="center">
<img src="assets/timestamp_input.svg" alt="Timestamped Sequence Input Illustration" width="90%"/>
</p>
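The injection scheme can be sketched as follows. This is a simplified illustration under assumed conventions (the token formats `<t=...>` and `<frame_i>` are hypothetical placeholders, not the checkpoint's actual special tokens): each sampled frame is preceded by its absolute wall-clock timestamp, derived from the sampling rate.

```python
def build_timestamped_sequence(num_frames: int, fps: float) -> list[str]:
    """Interleave absolute timestamps with frame placeholders.

    Sampling at a fixed fps maps frame index i to wall-clock time i / fps,
    so the model sees exactly when each frame occurred.
    """
    sequence = []
    for i in range(num_frames):
        t = i / fps  # absolute time of this sampled frame, in seconds
        sequence.append(f"<t={t:.1f}s>")
        sequence.append(f"<frame_{i}>")
    return sequence

# four frames sampled at 2 fps land at 0.0s, 0.5s, 1.0s, 1.5s
tokens = build_timestamped_sequence(num_frames=4, fps=2.0)
```

Grounding each frame in absolute time is what lets the model answer pacing and duration questions ("how long does the action last?") rather than reasoning only over frame order.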
## 🧬 Cross-attention RoPE (XRoPE)
MOSS-VL utilizes Cross-attention Rotary Position Embedding (XRoPE), tailored to its cross-attention based vision–language architecture. This mechanism maps text tokens and video patches into a unified 3D coordinate space defined by Time (t), Height (h), and Width (w).
<p align="center">
<img src="assets/3d-rope.png" alt="MOSS-VL mRoPE Architecture Illustration" width="80%"/>
</p>
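A minimal sketch of the position assignment, following the general multimodal-RoPE convention (an assumption; the released XRoPE implementation may differ): text tokens advance all three axes together, so for pure text the scheme degenerates to ordinary 1D RoPE, while each video patch is indexed by where it sits in the temporal/spatial grid.

```python
def assign_3d_positions(num_text_tokens: int, video_grid: tuple) -> list:
    """Assign each token a (t, h, w) position.

    video_grid = (T, H, W) patches. Text tokens use the same index on every
    axis; video patches continue from the text offset along each axis.
    """
    positions = []
    # text tokens: identical index on all three axes (reduces to 1D RoPE)
    for i in range(num_text_tokens):
        positions.append((i, i, i))
    offset = num_text_tokens
    T, H, W = video_grid
    # video patches: the position encodes where the patch sits in time and space
    for t in range(T):
        for h in range(H):
            for w in range(W):
                positions.append((offset + t, offset + h, offset + w))
    return positions

# two text tokens followed by a 2x2x2 grid of video patches
pos = assign_3d_positions(num_text_tokens=2, video_grid=(2, 2, 2))
```

Separating the three axes lets relative distances in time stay distinguishable from distances in space, which matters for the fine-grained temporal reasoning highlighted above.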
## 📊 Model Performance
We conducted a comprehensive evaluation of **MOSS-VL-Instruct-0408** across four key dimensions: Multimodal Perception, Multimodal Reasoning, Document/OCR, and Video Understanding. The results demonstrate that MOSS-VL achieves outstanding performance, particularly excelling in **general multimodal perception** and **complex video analysis**.
### 🌟 Key Highlights
* **🚀 Leading Video Intelligence**: MOSS-VL achieves a score of **65.8** on Video Understanding, outperforming Qwen3-VL by about 2 points. It shows exceptional temporal consistency and action recognition capabilities across benchmarks like `VideoMME`, `MLVU`, `EgoSchema`, and `VSI-bench` (where it outperforms **Qwen3-VL-8B-Instruct** by **8.3 points**).
* **👁️ Outstanding Multimodal Perception**: MOSS-VL delivers excellent general image-text understanding, shining in fine-grained object recognition and spatial reasoning on benchmarks like `BLINK` and `MMBench`.
* **🧠 Robust Multimodal Reasoning**: MOSS-VL demonstrates solid logical inference, staying highly competitive with the latest Qwen series on challenging reasoning suites.
* **📄 Reliable Document Understanding**: While the model is primarily optimized for general perception, MOSS-VL still delivers **83.9** on OCR and document analysis, ensuring dependable extraction of text and structured information.
<p align="center">
<img src="assets/MOSS-VL-benchmark.png" alt="MOSS-VL Benchmark Results" width="100%"/>
</p>
## 🚀 Quickstart
### 🛠️ Installation
```bash
conda create -n moss_vl python=3.12 pip -y
conda activate moss_vl
pip install -i https://pypi.org/simple --no-build-isolation -r requirements.txt
```
### 🏃 Run Inference
<details>
<summary><strong>Single-image offline inference (Python)</strong></summary>
<br>
```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

checkpoint = "path/to/checkpoint"
image_path = "data/example_image.jpg"
prompt = "Describe this image."


def load_model(checkpoint: str):
    # The processor and model code ship with the checkpoint, so
    # trust_remote_code is required.
    processor = AutoProcessor.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        frame_extract_num_threads=1,
    )
    model = AutoModelForCausalLM.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        device_map="auto",
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
    )
    return model, processor


model, processor = load_model(checkpoint)

text = model.offline_image_generate(
    processor,
    prompt=prompt,
    image=image_path,
    shortest_edge=4096,
    longest_edge=16777216,
    multi_image_max_pixels=201326592,
    patch_size=16,
    temporal_patch_size=1,
    merge_size=2,
    image_mean=[0.5, 0.5, 0.5],
    image_std=[0.5, 0.5, 0.5],
    max_new_tokens=256,
    temperature=1.0,
    top_k=50,
    top_p=1.0,
    repetition_penalty=1.0,
    do_sample=False,
    vision_chunked_length=64,
)
print(text)
```
</details>
<details>
<summary><strong>Single-video offline inference (Python)</strong></summary>
<br>
```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

checkpoint = "path/to/checkpoint"
video_path = "data/example_video.mp4"
prompt = "Describe this video."


def load_model(checkpoint: str):
    # The processor and model code ship with the checkpoint, so
    # trust_remote_code is required.
    processor = AutoProcessor.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        frame_extract_num_threads=1,
    )
    model = AutoModelForCausalLM.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        device_map="auto",
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
    )
    return model, processor


model, processor = load_model(checkpoint)

text = model.offline_video_generate(
    processor,
    prompt=prompt,
    video=video_path,
    shortest_edge=4096,
    longest_edge=16777216,
    video_max_pixels=201326592,
    patch_size=16,
    temporal_patch_size=1,
    merge_size=2,
    video_fps=1.0,
    min_frames=1,
    max_frames=256,
    num_extract_threads=4,
    image_mean=[0.5, 0.5, 0.5],
    image_std=[0.5, 0.5, 0.5],
    max_new_tokens=256,
    temperature=1.0,
    top_k=50,
    top_p=1.0,
    repetition_penalty=1.0,
    do_sample=False,
    vision_chunked_length=64,
)
print(text)
```
</details>
<details>
<summary><strong>Batched offline inference (Python)</strong></summary>
<br>
```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

checkpoint = "path/to/checkpoint"

processor = AutoProcessor.from_pretrained(
    checkpoint,
    trust_remote_code=True,
    frame_extract_num_threads=1,
)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

# Each query carries its own media inputs plus per-query sampling settings.
queries = [
    {
        "prompt": "Describe sample A.",
        "images": [],
        "videos": ["data/sample_a.mp4"],
        "media_kwargs": {"video_fps": 1.0, "min_frames": 8, "max_frames": 256},
        "generate_kwargs": {
            "temperature": 1.0,
            "top_k": 50,
            "top_p": 1.0,
            "max_new_tokens": 256,
            "repetition_penalty": 1.0,
            "do_sample": False,
        },
    },
    {
        "prompt": "Describe sample B.",
        "images": [],
        "videos": ["data/sample_b.mp4"],
        "media_kwargs": {"video_fps": 1.0, "min_frames": 8, "max_frames": 256},
        "generate_kwargs": {
            "temperature": 1.0,
            "top_k": 50,
            "top_p": 1.0,
            "max_new_tokens": 256,
            "repetition_penalty": 1.0,
            "do_sample": False,
        },
    },
]

with torch.no_grad():
    result = model.offline_batch_generate(processor, queries, vision_chunked_length=64)

texts = [item["text"] for item in result["results"]]
```
</details>
## 🚧 Limitations and Future Work
MOSS-VL-Instruct-0408 represents an early milestone in the MOSS-VL roadmap, and we're actively working on several directions to push it further:
- 🧮 **Math & Code Reasoning** — While the current checkpoint already exhibits strong general reasoning, we plan to substantially strengthen its mathematical and code reasoning capabilities, especially in multimodal contexts.
- 🎯 **RL Post-Training** — We are working on a reinforcement learning post-training stage to further align the model with human preferences and to unlock stronger multi-step reasoning behaviors on top of the SFT foundation.
> [!NOTE]
> We welcome community feedback and contributions on any of these directions.
## 📜 Citation
```bibtex
@misc{moss_vl_2026,
  title        = {{MOSS-VL Technical Report}},
  author       = {OpenMOSS Team},
  year         = {2026},
  howpublished = {\url{https://github.com/OpenMOSS/MOSS-VL}},
  note         = {GitHub repository}
}
```