---
title: MOSS-VL-Instruct-0408
date: 2026-04-08
category: Multimodal-LLM
status: SFT
language:
- en
library_name: transformers
pipeline_tag: video-text-to-text
license: apache-2.0
base_model: fnlp-vision/MOSS-VL-Base-0408
tags:
- SFT
- Video-Understanding
- Image-Understanding
- MOSS-VL
- OpenMOSS
- multimodal
- video
- vision-language
---

# MOSS-VL-Instruct-0408

## 📌 Introduction

MOSS-VL-Instruct-0408 is the instruction-tuned checkpoint of the MOSS-VL series, part of the OpenMOSS ecosystem dedicated to advancing visual understanding. Built on top of MOSS-VL-Base-0408 through supervised fine-tuning (SFT), this checkpoint is designed as a high-performance offline multimodal engine. It delivers strong, well-rounded performance across the full spectrum of vision-language tasks — including image understanding, OCR, document parsing, visual reasoning, and instruction following — and is especially strong at video understanding, from long-form comprehension to fine-grained temporal reasoning and action recognition.

### ✨ Highlights

- 🎬 **Outstanding Video Understanding** — A core strength of MOSS-VL. The model excels at long-form video comprehension, temporal reasoning, action recognition, and second-level event localization, delivering top-tier results on benchmarks such as VideoMME and MLVU.
- 🖼️ **Strong General Multimodal Perception** — Robust image understanding, fine-grained object recognition, OCR, and document parsing.
- 💬 **Reliable Instruction Following** — Substantially improved alignment with user intent through supervised fine-tuning on diverse multimodal instruction data.

---

## 🏗 Model Architecture

**MOSS-VL-Instruct-0408** adopts a cross-attention-based architecture that decouples visual encoding from cognitive reasoning. This design drives latency down to the **millisecond level**, enabling instantaneous responses to dynamic video streams. Natively supporting **interleaved modalities**, it processes complex sequences of images and videos within a unified pipeline, eliminating the need for heavy pre-processing.

MOSS-VL Architecture

## 🧩 Absolute Timestamps

To ensure the model accurately perceives the pacing and duration of events, **MOSS-VL-Instruct-0408** injects **absolute timestamps** alongside each sampled frame, grounding the reasoning process in a **precise temporal reference**.

Timestamped Sequence Input Illustration
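As a rough illustration of this scheme, the absolute timestamps for uniformly sampled frames can be computed and rendered as text tags. The helper names and the `<t s>` tag format below are my own assumptions for the sketch; the model card does not specify the exact tag syntax.

```python
def sample_timestamps(duration_s: float, fps: float) -> list[float]:
    """Absolute timestamps (in seconds) for frames sampled uniformly at `fps`."""
    n = int(duration_s * fps) + 1  # include the frame at t = 0
    return [round(i / fps, 2) for i in range(n)]


def timestamp_tag(ts: float) -> str:
    """Render one timestamp as a text tag injected alongside its frame (hypothetical format)."""
    return f"<{ts:.1f}s>"


# A 4-second clip sampled at 1 fps yields five timestamped frames.
tags = [timestamp_tag(t) for t in sample_timestamps(4.0, 1.0)]
```

Grounding each frame in wall-clock seconds (rather than a bare frame index) is what lets the model reason about pacing and event duration directly.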

## 🧬 Cross-attention RoPE (XRoPE)

MOSS-VL utilizes Cross-attention Rotary Position Embedding (XRoPE), tailored to its cross-attention-based vision–language architecture. This mechanism maps text tokens and video patches into a unified 3D coordinate space defined by Time (t), Height (h), and Width (w).

MOSS-VL mRoPE Architecture Illustration
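A minimal sketch of how such a unified (t, h, w) index might be assigned. The assumptions here follow mRoPE-style conventions (text tokens advance all three axes together; video patches take their frame index as t and their grid location as h, w); the actual XRoPE layout may differ.

```python
def text_positions(start: int, n_tokens: int) -> list[tuple[int, int, int]]:
    # Text tokens degenerate to ordinary 1-D RoPE: t = h = w = position.
    return [(start + i, start + i, start + i) for i in range(n_tokens)]


def video_positions(start: int, n_frames: int, grid_h: int, grid_w: int) -> list[tuple[int, int, int]]:
    # Each patch keeps its frame index on the time axis and its
    # grid coordinates on the spatial axes.
    return [
        (start + t, start + h, start + w)
        for t in range(n_frames)
        for h in range(grid_h)
        for w in range(grid_w)
    ]


# 2 frames of a 2x2 patch grid -> 8 patches, each with a distinct (t, h, w) id.
pos = video_positions(0, 2, 2, 2)
```

Splitting the rotary dimensions across three axes lets the same position-embedding machinery express "later in time" and "further right in the frame" as distinct relations.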

## 📊 Model Performance

We conducted a comprehensive evaluation of **MOSS-VL-Instruct-0408** across four key dimensions: Multimodal Perception, Multimodal Reasoning, Document/OCR, and Video Understanding. The results demonstrate that MOSS-VL achieves outstanding performance, particularly excelling in **general multimodal perception** and **complex video analysis**.

### 🌟 Key Highlights

* **🚀 Leading Video Intelligence**: MOSS-VL achieves a score of **65.8** in Video Understanding, outperforming Qwen3-VL by about 2 points. It shows exceptional temporal consistency and action recognition capabilities across benchmarks like `VideoMME`, `MLVU`, `EgoSchema`, and `VSI-bench` (where it outperforms **Qwen3-VL-8B-Instruct** by **8.3 points**).
* **👁️ Outstanding Multimodal Perception**: MOSS-VL delivers excellent general image-text understanding, shining in fine-grained object recognition and spatial reasoning on benchmarks like `BLINK` and `MMBench`.
* **🧠 Robust Multimodal Reasoning**: MOSS-VL demonstrates solid logical inference, staying highly competitive with the latest Qwen series on challenging reasoning suites.
* **📄 Reliable Document Understanding**: While the model is primarily optimized for general perception, MOSS-VL still delivers **83.9** on OCR and document analysis, ensuring dependable extraction of text and structured information.

MOSS-VL Benchmark Results

## 🚀 Quickstart

### 🛠️ Installation

```bash
conda create -n moss_vl python=3.12 pip -y
conda activate moss_vl
pip install -i https://pypi.org/simple --no-build-isolation -r requirements.txt
```

### 🏃 Run Inference
Single-image offline inference (Python)
```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

checkpoint = "path/to/checkpoint"
image_path = "data/example_image.jpg"
prompt = "Describe this image."


def load_model(checkpoint: str):
    processor = AutoProcessor.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        frame_extract_num_threads=1,
    )
    model = AutoModelForCausalLM.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        device_map="auto",
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
    )
    return model, processor


model, processor = load_model(checkpoint)

text = model.offline_image_generate(
    processor,
    prompt=prompt,
    image=image_path,
    shortest_edge=4096,
    longest_edge=16777216,
    multi_image_max_pixels=201326592,
    patch_size=16,
    temporal_patch_size=1,
    merge_size=2,
    image_mean=[0.5, 0.5, 0.5],
    image_std=[0.5, 0.5, 0.5],
    max_new_tokens=256,
    temperature=1.0,
    top_k=50,
    top_p=1.0,
    repetition_penalty=1.0,
    do_sample=False,
    vision_chunked_length=64,
)
print(text)
```
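The `patch_size=16` and `merge_size=2` settings above imply a simple visual-token budget: the encoder produces one patch per 16×16 pixels, and each 2×2 group of patches is merged into a single token. A back-of-the-envelope helper (my own arithmetic for illustration, not part of the model's API):

```python
def visual_token_count(height: int, width: int, patch_size: int = 16, merge_size: int = 2) -> int:
    """Approximate visual tokens for one image under the patching/merging
    configuration shown in the inference example above."""
    patches = (height // patch_size) * (width // patch_size)
    return patches // (merge_size ** 2)


# A 448x448 image -> 28x28 patches -> 196 merged tokens.
n_tokens = visual_token_count(448, 448)
```

This kind of estimate is useful when sizing inputs against a context budget before calling the model.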
Single-video offline inference (Python)
```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

checkpoint = "path/to/checkpoint"
video_path = "data/example_video.mp4"
prompt = "Describe this video."


def load_model(checkpoint: str):
    processor = AutoProcessor.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        frame_extract_num_threads=1,
    )
    model = AutoModelForCausalLM.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        device_map="auto",
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
    )
    return model, processor


model, processor = load_model(checkpoint)

text = model.offline_video_generate(
    processor,
    prompt=prompt,
    video=video_path,
    shortest_edge=4096,
    longest_edge=16777216,
    video_max_pixels=201326592,
    patch_size=16,
    temporal_patch_size=1,
    merge_size=2,
    video_fps=1.0,
    min_frames=1,
    max_frames=256,
    num_extract_threads=4,
    image_mean=[0.5, 0.5, 0.5],
    image_std=[0.5, 0.5, 0.5],
    max_new_tokens=256,
    temperature=1.0,
    top_k=50,
    top_p=1.0,
    repetition_penalty=1.0,
    do_sample=False,
    vision_chunked_length=64,
)
print(text)
```
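The frame-sampling arguments above (`video_fps`, `min_frames`, `max_frames`) suggest a simple budget: sample uniformly at `video_fps` and clamp the resulting count to `[min_frames, max_frames]`. The sketch below is an assumption about the processor's behavior, not its verified implementation:

```python
def planned_frame_count(duration_s: float, video_fps: float = 1.0,
                        min_frames: int = 1, max_frames: int = 256) -> int:
    """Estimate how many frames a clip yields: duration * fps, clamped to bounds."""
    raw = max(1, round(duration_s * video_fps))
    return max(min_frames, min(max_frames, raw))


# A 10-minute video at 1 fps hits the 256-frame cap;
# a half-second clip is floored at min_frames.
long_clip = planned_frame_count(600.0)
short_clip = planned_frame_count(0.5)
```

Estimating the frame count up front helps predict latency and memory use before dispatching long videos.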
Batched offline inference (Python)
```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

checkpoint = "path/to/checkpoint"

processor = AutoProcessor.from_pretrained(
    checkpoint,
    trust_remote_code=True,
    frame_extract_num_threads=1,
)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

queries = [
    {
        "prompt": "Describe sample A.",
        "images": [],
        "videos": ["data/sample_a.mp4"],
        "media_kwargs": {"video_fps": 1.0, "min_frames": 8, "max_frames": 256},
        "generate_kwargs": {
            "temperature": 1.0,
            "top_k": 50,
            "top_p": 1.0,
            "max_new_tokens": 256,
            "repetition_penalty": 1.0,
            "do_sample": False,
        },
    },
    {
        "prompt": "Describe sample B.",
        "images": [],
        "videos": ["data/sample_b.mp4"],
        "media_kwargs": {"video_fps": 1.0, "min_frames": 8, "max_frames": 256},
        "generate_kwargs": {
            "temperature": 1.0,
            "top_k": 50,
            "top_p": 1.0,
            "max_new_tokens": 256,
            "repetition_penalty": 1.0,
            "do_sample": False,
        },
    },
]

with torch.no_grad():
    result = model.offline_batch_generate(processor, queries, vision_chunked_length=64)
texts = [item["text"] for item in result["results"]]
```
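Since every entry in `queries` repeats the same schema, a small factory function keeps batched requests consistent. This is a convenience sketch: the dict keys mirror the batched example above, while the function name and defaults are my own.

```python
def make_video_query(prompt: str, video_path: str, *, video_fps: float = 1.0,
                     min_frames: int = 8, max_frames: int = 256,
                     max_new_tokens: int = 256) -> dict:
    """Build one query dict in the schema used by the batched example above."""
    return {
        "prompt": prompt,
        "images": [],
        "videos": [video_path],
        "media_kwargs": {
            "video_fps": video_fps,
            "min_frames": min_frames,
            "max_frames": max_frames,
        },
        "generate_kwargs": {
            "temperature": 1.0,
            "top_k": 50,
            "top_p": 1.0,
            "max_new_tokens": max_new_tokens,
            "repetition_penalty": 1.0,
            "do_sample": False,
        },
    }


queries = [
    make_video_query("Describe sample A.", "data/sample_a.mp4"),
    make_video_query("Describe sample B.", "data/sample_b.mp4"),
]
```

Centralizing the schema in one place makes it harder for per-request settings to drift silently across a large batch.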
## 🚧 Limitations and Future Work

MOSS-VL-Instruct-0408 represents an early milestone in the MOSS-VL roadmap, and we are actively working on several directions to push it further:

- 🧮 **Math & Code Reasoning** — While the current checkpoint already exhibits strong general reasoning, we plan to substantially strengthen its mathematical and code reasoning capabilities, especially in multimodal contexts.
- 🎯 **RL Post-Training** — We are working on a reinforcement learning post-training stage to further align the model with human preferences and to unlock stronger multi-step reasoning behaviors on top of the SFT foundation.

> [!NOTE]
> We welcome community feedback and contributions on any of these directions.

## 📜 Citation

```bibtex
@misc{moss_vl_2026,
  title        = {{MOSS-VL Technical Report}},
  author       = {OpenMOSS Team},
  year         = {2026},
  howpublished = {\url{https://github.com/OpenMOSS/MOSS-VL}},
  note         = {GitHub repository}
}
```