---
title: MOSS-VL-Instruct-0408
date: 2026-04-08
category: Multimodal-LLM
status: SFT
language:
- en
library_name: transformers
pipeline_tag: video-text-to-text
license: apache-2.0
base_model: fnlp-vision/MOSS-VL-Base-0408
tags:
- SFT
- Video-Understanding
- Image-Understanding
- MOSS-VL
- OpenMOSS
- multimodal
- video
- vision-language
---
<p align="center">
<img src="assets/logo.png" width="320"/>
</p>
# MOSS-VL-Instruct-0408
## 📌 Introduction
MOSS-VL-Instruct-0408 is the instruction-tuned checkpoint of the MOSS-VL series, part of the OpenMOSS ecosystem dedicated to advancing visual understanding.
Built on top of MOSS-VL-Base-0408 through supervised fine-tuning (SFT), this checkpoint is designed as a high-performance offline multimodal engine. It delivers strong, well-rounded performance across the full spectrum of vision-language tasks — including image understanding, OCR, document parsing, visual reasoning, and instruction following — and is particularly outstanding at video understanding, from long-form comprehension to fine-grained temporal reasoning and action recognition.
### ✨ Highlights
- 🎬 **Outstanding Video Understanding** — A core strength of MOSS-VL. The model excels at long-form video comprehension, temporal reasoning, action recognition, and second-level event localization, delivering top-tier results on benchmarks such as VideoMME and MLVU.
- 🖼️ **Strong General Multimodal Perception** — Robust image understanding, fine-grained object recognition, OCR, and document parsing.
- 💬 **Reliable Instruction Following** — Substantially improved alignment with user intent through supervised fine-tuning on diverse multimodal instruction data.
---
## 🏗 Model Architecture
**MOSS-VL-Instruct-0408** adopts a cross-attention-based architecture that decouples visual encoding from cognitive reasoning. This design drives latency down to the **millisecond level**, enabling instantaneous responses to dynamic video streams. Natively supporting **interleaved modalities**, it processes complex sequences of images and videos within a unified pipeline — eliminating the need for heavy pre-processing.
<p align="center">
<img src="assets/structure.png" alt="MOSS-VL Architecture" width="90%"/>
</p>
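The decoupling described above can be sketched in miniature. This is an illustrative toy example, not the released implementation: vision features are encoded once into key/value vectors, and each text query attends over that cache, so the language decoder never re-encodes the video stream. All function names here are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cross_attention(text_queries, vision_keys, vision_values):
    """Each text query attends over pre-computed vision key/value vectors."""
    outputs = []
    for q in text_queries:
        # scaled dot-product scores against the cached vision keys
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(len(q))
                  for k in vision_keys]
        weights = softmax(scores)
        # weighted sum of the cached vision values
        out = [sum(w * v[d] for w, v in zip(weights, vision_values))
               for d in range(len(vision_values[0]))]
        outputs.append(out)
    return outputs

# one text query attending over two cached vision tokens
out = cross_attention([[1.0, 0.0]],
                      [[1.0, 0.0], [0.0, 1.0]],
                      [[2.0, 0.0], [0.0, 2.0]])
```

Because the vision keys/values are computed once per stream, each new text query costs only one attention pass over the cache, which is what makes low-latency responses to live video feasible.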
## 🧩 Absolute Timestamps
To ensure the model accurately perceives the pacing and duration of events, **MOSS-VL-Instruct-0408** injects **absolute timestamps** alongside each sampled frame, grounding the reasoning process in a **precise temporal reference**.
<p align="center">
<img src="assets/timestamp_input.svg" alt="Timestamped Sequence Input Illustration" width="90%"/>
</p>
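The injection scheme can be sketched as follows. This is a simplified illustration under assumed conventions (the token formats `<t=...>` and `<frame_i>` are hypothetical placeholders, not the checkpoint's actual special tokens): each sampled frame is preceded by its absolute wall-clock timestamp, derived from the sampling rate.

```python
def build_timestamped_sequence(num_frames: int, fps: float) -> list[str]:
    """Interleave absolute timestamps with frame placeholders.

    Sampling at a fixed fps maps frame index i to wall-clock time i / fps,
    so the model sees exactly when each frame occurred.
    """
    sequence = []
    for i in range(num_frames):
        t = i / fps  # absolute time of this sampled frame, in seconds
        sequence.append(f"<t={t:.1f}s>")
        sequence.append(f"<frame_{i}>")
    return sequence

# four frames sampled at 2 fps land at 0.0s, 0.5s, 1.0s, 1.5s
tokens = build_timestamped_sequence(num_frames=4, fps=2.0)
```

Grounding each frame in absolute time is what lets the model answer pacing and duration questions ("how long does the action last?") rather than reasoning only over frame order.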
## 🧬 Cross-attention RoPE (XRoPE)
MOSS-VL utilizes Cross-attention Rotary Position Embedding (XRoPE), tailored to its cross-attention based vision–language architecture. This mechanism maps text tokens and video patches into a unified 3D coordinate space defined by Time (t), Height (h), and Width (w).
<p align="center">
<img src="assets/3d-rope.png" alt="MOSS-VL mRoPE Architecture Illustration" width="80%"/>
</p>
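A minimal sketch of the position assignment, following the general multimodal-RoPE convention (an assumption; the released XRoPE implementation may differ): text tokens advance all three axes together, so for pure text the scheme degenerates to ordinary 1D RoPE, while each video patch is indexed by where it sits in the temporal/spatial grid.

```python
def assign_3d_positions(num_text_tokens: int, video_grid: tuple) -> list:
    """Assign each token a (t, h, w) position.

    video_grid = (T, H, W) patches. Text tokens use the same index on every
    axis; video patches continue from the text offset along each axis.
    """
    positions = []
    # text tokens: identical index on all three axes (reduces to 1D RoPE)
    for i in range(num_text_tokens):
        positions.append((i, i, i))
    offset = num_text_tokens
    T, H, W = video_grid
    # video patches: the position encodes where the patch sits in time and space
    for t in range(T):
        for h in range(H):
            for w in range(W):
                positions.append((offset + t, offset + h, offset + w))
    return positions

# two text tokens followed by a 2x2x2 grid of video patches
pos = assign_3d_positions(num_text_tokens=2, video_grid=(2, 2, 2))
```

Separating the three axes lets relative distances in time stay distinguishable from distances in space, which matters for the fine-grained temporal reasoning highlighted above.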
## 📊 Model Performance
We conducted a comprehensive evaluation of **MOSS-VL-Instruct-0408** across four key dimensions: Multimodal Perception, Multimodal Reasoning, Document/OCR, and Video Understanding. The results demonstrate that MOSS-VL achieves outstanding performance, particularly excelling in **general multimodal perception** and **complex video analysis**.
### 🌟 Key Highlights
* **🚀 Leading Video Intelligence**: MOSS-VL achieves a score of **65.8** on Video Understanding, outperforming Qwen3-VL by about 2 points. It shows exceptional temporal consistency and action recognition capabilities across benchmarks like `VideoMME`, `MLVU`, `EgoSchema`, and `VSI-bench` (where it outperforms **Qwen3-VL-8B-Instruct** by **8.3 points**).
* **👁️ Outstanding Multimodal Perception**: MOSS-VL delivers excellent general image-text understanding, shining in fine-grained object recognition and spatial reasoning on benchmarks like `BLINK` and `MMBench`.
* **🧠 Robust Multimodal Reasoning**: MOSS-VL demonstrates solid logical inference, staying highly competitive with the latest Qwen series on challenging reasoning suites.
* **📄 Reliable Document Understanding**: While the model is primarily optimized for general perception, MOSS-VL still delivers **83.9** on OCR and document analysis, ensuring dependable extraction of text and structured information.
<p align="center">
<img src="assets/MOSS-VL-benchmark.png" alt="MOSS-VL Benchmark Results" width="100%"/>
</p>
## 🚀 Quickstart
### 🛠️ Installation
```bash
conda create -n moss_vl python=3.12 pip -y
conda activate moss_vl
pip install -i https://pypi.org/simple --no-build-isolation -r requirements.txt
```
### 🏃 Run Inference
<details>
<summary><strong>Single-image offline inference (Python)</strong></summary>
<br>
```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

checkpoint = "path/to/checkpoint"
image_path = "data/example_image.jpg"
prompt = "Describe this image."


def load_model(checkpoint: str):
    # The processor and model code ship with the checkpoint, so
    # trust_remote_code is required.
    processor = AutoProcessor.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        frame_extract_num_threads=1,
    )
    model = AutoModelForCausalLM.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        device_map="auto",
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
    )
    return model, processor


model, processor = load_model(checkpoint)

text = model.offline_image_generate(
    processor,
    prompt=prompt,
    image=image_path,
    shortest_edge=4096,
    longest_edge=16777216,
    multi_image_max_pixels=201326592,
    patch_size=16,
    temporal_patch_size=1,
    merge_size=2,
    image_mean=[0.5, 0.5, 0.5],
    image_std=[0.5, 0.5, 0.5],
    max_new_tokens=256,
    temperature=1.0,
    top_k=50,
    top_p=1.0,
    repetition_penalty=1.0,
    do_sample=False,
    vision_chunked_length=64,
)
print(text)
```
</details>
<details>
<summary><strong>Single-video offline inference (Python)</strong></summary>
<br>
```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

checkpoint = "path/to/checkpoint"
video_path = "data/example_video.mp4"
prompt = "Describe this video."


def load_model(checkpoint: str):
    # The processor and model code ship with the checkpoint, so
    # trust_remote_code is required.
    processor = AutoProcessor.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        frame_extract_num_threads=1,
    )
    model = AutoModelForCausalLM.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        device_map="auto",
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
    )
    return model, processor


model, processor = load_model(checkpoint)

text = model.offline_video_generate(
    processor,
    prompt=prompt,
    video=video_path,
    shortest_edge=4096,
    longest_edge=16777216,
    video_max_pixels=201326592,
    patch_size=16,
    temporal_patch_size=1,
    merge_size=2,
    video_fps=1.0,
    min_frames=1,
    max_frames=256,
    num_extract_threads=4,
    image_mean=[0.5, 0.5, 0.5],
    image_std=[0.5, 0.5, 0.5],
    max_new_tokens=256,
    temperature=1.0,
    top_k=50,
    top_p=1.0,
    repetition_penalty=1.0,
    do_sample=False,
    vision_chunked_length=64,
)
print(text)
```
</details>
<details>
<summary><strong>Batched offline inference (Python)</strong></summary>
<br>
```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

checkpoint = "path/to/checkpoint"

processor = AutoProcessor.from_pretrained(
    checkpoint,
    trust_remote_code=True,
    frame_extract_num_threads=1,
)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

# Each query carries its own media inputs plus per-query sampling settings.
queries = [
    {
        "prompt": "Describe sample A.",
        "images": [],
        "videos": ["data/sample_a.mp4"],
        "media_kwargs": {"video_fps": 1.0, "min_frames": 8, "max_frames": 256},
        "generate_kwargs": {
            "temperature": 1.0,
            "top_k": 50,
            "top_p": 1.0,
            "max_new_tokens": 256,
            "repetition_penalty": 1.0,
            "do_sample": False,
        },
    },
    {
        "prompt": "Describe sample B.",
        "images": [],
        "videos": ["data/sample_b.mp4"],
        "media_kwargs": {"video_fps": 1.0, "min_frames": 8, "max_frames": 256},
        "generate_kwargs": {
            "temperature": 1.0,
            "top_k": 50,
            "top_p": 1.0,
            "max_new_tokens": 256,
            "repetition_penalty": 1.0,
            "do_sample": False,
        },
    },
]

with torch.no_grad():
    result = model.offline_batch_generate(processor, queries, vision_chunked_length=64)

texts = [item["text"] for item in result["results"]]
```
</details>
## 🚧 Limitations and Future Work
MOSS-VL-Instruct-0408 represents an early milestone in the MOSS-VL roadmap, and we're actively working on several directions to push it further:
- 🧮 **Math & Code Reasoning** — While the current checkpoint already exhibits strong general reasoning, we plan to substantially strengthen its mathematical and code reasoning capabilities, especially in multimodal contexts.
- 🎯 **RL Post-Training** — We are working on a reinforcement learning post-training stage to further align the model with human preferences and to unlock stronger multi-step reasoning behaviors on top of the SFT foundation.
> [!NOTE]
> We welcome community feedback and contributions on any of these directions.
## 📜 Citation
```bibtex
@misc{moss_vl_2026,
  title        = {{MOSS-VL Technical Report}},
  author       = {OpenMOSS Team},
  year         = {2026},
  howpublished = {\url{https://github.com/OpenMOSS/MOSS-VL}},
  note         = {GitHub repository}
}
```