---
title: MOSS-VL-Instruct-0408
date: 2026-04-08
category: Multimodal-LLM
status: SFT
language:
- en
library_name: transformers
pipeline_tag: video-text-to-text
license: apache-2.0
base_model: fnlp-vision/MOSS-VL-Base-0408
tags:
- SFT
- Video-Understanding
- Image-Understanding
- MOSS-VL
- OpenMOSS
- multimodal
- video
- vision-language
---

<p align="center">
  <img src="assets/logo.png" width="320"/>
</p>
|
|
| # MOSS-VL-Instruct-0408 |
|
|
| ## 📌 Introduction |
|
|
| MOSS-VL-Instruct-0408 is the instruction-tuned checkpoint of the MOSS-VL series, part of the OpenMOSS ecosystem dedicated to advancing visual understanding. |
|
|
| Built on top of MOSS-VL-Base-0408 through supervised fine-tuning (SFT), this checkpoint is designed as a high-performance offline multimodal engine. It delivers strong, well-rounded performance across the full spectrum of vision-language tasks — including image understanding, OCR, document parsing, visual reasoning, and instruction following — and is particularly outstanding at video understanding, from long-form comprehension to fine-grained temporal reasoning and action recognition. |
|
|
| ### ✨ Highlights |
|
|
| - 🎬 **Outstanding Video Understanding** — A core strength of MOSS-VL. The model excels at long-form video comprehension, temporal reasoning, action recognition, and second-level event localization, delivering top-tier results on benchmarks such as VideoMME, and MLVU. |
| - 🖼️ **Strong General Multimodal Perception** — Robust image understanding, fine-grained object recognition, OCR, and document parsing. |
| - 💬 **Reliable Instruction Following** — Substantially improved alignment with user intent through supervised fine-tuning on diverse multimodal instruction data. |
|
|
|
|
| --- |
|
|
| ## 🏗 Model Architecture |
|
|
| **MOSS-VL-Instruct-0408** adopts a cross-attention-based architecture that decouples visual encoding from cognitive reasoning. This design drives latency down to the **millisecond level**, enabling instantaneous responses to dynamic video streams. Natively supporting **interleaved modalities**, it processes complex sequences of images and videos within a unified pipeline — eliminating the need for heavy pre-processing. |
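The decoupling can be illustrated with a minimal cross-attention sketch (illustrative only; the names, dimensions, and residual wiring here are assumptions, not the actual MOSS-VL code): text tokens act as queries over pre-encoded vision features, so newly arriving frames only extend the cached key/value side.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # hidden size (illustrative)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_attend(text_states, vision_feats, w_q, w_k, w_v):
    """Text tokens (queries) attend to cached vision features (keys/values)."""
    q = text_states @ w_q
    k = vision_feats @ w_k
    v = vision_feats @ w_v
    weights = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    return text_states + weights @ v  # residual update of the text stream

text = rng.normal(size=(4, d))     # 4 text tokens
vision = rng.normal(size=(16, d))  # 16 pre-encoded vision patches
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

out = cross_attend(text, vision, w_q, w_k, w_v)
print(out.shape)  # (4, 8)
```

Streaming a new frame only appends rows to `vision`; the text pathway and its weights are untouched, which is what lets visual encoding run ahead of, and independently from, language decoding.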
|
|
| <p align="center"> |
| <img src="assets/structure.png" alt="MOSS-VL Architecture" width="90%"/> |
</p>

## 🧩 Absolute Timestamps
|
|
| To ensure the model accurately perceives the pacing and duration of events, **MOSS-VL-Instruct-0408** injects **absolute timestamps** alongside each sampled frame, grounding the reasoning process in a **precise temporal reference**. |
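A minimal sketch of what such timestamp injection can look like (the helper names and the `<…s>` marker format are hypothetical, for illustration only, not the real preprocessing API):

```python
# Hypothetical sketch: pair each uniformly sampled frame with its absolute
# time in seconds, then interleave timestamp markers with frame slots.
def sample_frame_timestamps(duration_s: float, fps: float = 1.0, max_frames: int = 256):
    """Absolute timestamps (seconds) of uniformly sampled frames."""
    n = max(1, min(int(duration_s * fps), max_frames))
    step = duration_s / n
    return [round(i * step, 2) for i in range(n)]

def interleave_with_timestamps(timestamps):
    """Interleave timestamp markers with placeholders for frame tokens."""
    seq = []
    for t in timestamps:
        seq.append(f"<{t:.1f}s>")  # absolute timestamp marker
        seq.append("<frame>")      # stands in for the frame's visual tokens
    return seq

ts = sample_frame_timestamps(duration_s=4.0, fps=1.0)
print(interleave_with_timestamps(ts))
# ['<0.0s>', '<frame>', '<1.0s>', '<frame>', '<2.0s>', '<frame>', '<3.0s>', '<frame>']
```

Because each marker carries wall-clock time rather than a frame index, the model can reason about durations and pacing even when the sampling rate changes.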
|
|
| <p align="center"> |
| <img src="assets/timestamp_input.svg" alt="Timestamped Sequence Input Illustration" width="90%"/> |
</p>

## 🧬 Cross-attention RoPE (XRoPE)
|
|
MOSS-VL uses Cross-attention Rotary Position Embedding (XRoPE), tailored to its cross-attention-based vision–language architecture. This mechanism maps text tokens and video patches into a unified 3D coordinate space defined by Time (t), Height (h), and Width (w).
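The coordinate assignment can be sketched as follows (an illustrative reading of the (t, h, w) scheme; the exact conventions, including how text tokens are placed, are assumptions rather than the actual XRoPE implementation):

```python
# Illustrative (t, h, w) position assignment for a 3D RoPE scheme.
# Conventions here are assumed for illustration, not taken from MOSS-VL.
def video_patch_positions(num_frames: int, grid_h: int, grid_w: int):
    """(t, h, w) coordinates for every patch in a frame grid."""
    return [
        (t, h, w)
        for t in range(num_frames)
        for h in range(grid_h)
        for w in range(grid_w)
    ]

def text_token_positions(num_tokens: int, t_offset: int):
    """One common convention: text tokens advance all three axes together."""
    return [(t_offset + i,) * 3 for i in range(num_tokens)]

patches = video_patch_positions(num_frames=2, grid_h=2, grid_w=2)
print(patches[:3])  # [(0, 0, 0), (0, 0, 1), (0, 1, 0)]
```

Patches in the same frame share t while differing in (h, w), so the rotary phases can separate spatial layout from temporal order within a single embedding.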
|
|
| <p align="center"> |
  <img src="assets/3d-rope.png" alt="MOSS-VL XRoPE Architecture Illustration" width="80%"/>
</p>

## 📊 Model Performance
|
|
| We conducted a comprehensive evaluation of **MOSS-VL-Instruct-0408** across four key dimensions: Multimodal Perception, Multimodal Reasoning, Document/OCR, and Video Understanding. The results demonstrate that MOSS-VL achieves outstanding performance, particularly excelling in **general multimodal perception** and **complex video analysis**. |
|
|
| ### 🌟 Key Highlights |
|
|
| * **🚀 Leading Video Intelligence**: MOSS-VL achieves a score of **65.8** in Video Understanding, significantly outperforming Qwen3-VL (+2pts). It shows exceptional temporal consistency and action recognition capabilities across benchmarks like `VideoMME`, `MLVU`, `EgoSchema`, and `VSI-bench` (where it outperforms **Qwen3-VL-8B-Instruct** by **8.3 points**). |
| * **👁️ Outstanding Multimodal Perception**: MOSS-VL delivers excellent general image-text understanding, shining in fine-grained object recognition and spatial reasoning on benchmarks like `BLINK` and `MMBench`. |
| * **🧠 Robust Multimodal Reasoning**: MOSS-VL demonstrates solid logical inference, staying highly competitive with the latest Qwen series on challenging reasoning suites. |
| * **📄 Reliable Document Understanding**: While the model is primarily optimized for general perception, MOSS-VL still delivers **83.9** on OCR and document analysis, ensuring dependable extraction of text and structured information. |
|
|
|
|
| <p align="center"> |
| <img src="assets/MOSS-VL-benchmark.png" alt="MOSS-VL Benchmark Results" width="100%"/> |
</p>

## 🚀 Quickstart
| ### 🛠️ Installation |
|
|
| ```bash |
| conda create -n moss_vl python=3.12 pip -y |
| conda activate moss_vl |
| pip install -i https://pypi.org/simple --no-build-isolation -r requirements.txt |
| ``` |
|
|
| ### 🏃 Run Inference |
|
|
|
|
| <details> |
| <summary><strong>Single-image offline inference (Python)</strong></summary> |
|
|
| <br> |
|
|
```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

checkpoint = "path/to/checkpoint"
image_path = "data/example_image.jpg"
prompt = "Describe this image."


def load_model(checkpoint: str):
    # The processor and model ship with the checkpoint's custom code,
    # so trust_remote_code=True is required.
    processor = AutoProcessor.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        frame_extract_num_threads=1,
    )
    model = AutoModelForCausalLM.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        device_map="auto",
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
    )
    return model, processor


model, processor = load_model(checkpoint)

text = model.offline_image_generate(
    processor,
    prompt=prompt,
    image=image_path,
    shortest_edge=4096,
    longest_edge=16777216,
    multi_image_max_pixels=201326592,
    patch_size=16,
    temporal_patch_size=1,
    merge_size=2,
    image_mean=[0.5, 0.5, 0.5],
    image_std=[0.5, 0.5, 0.5],
    max_new_tokens=256,
    temperature=1.0,
    top_k=50,
    top_p=1.0,
    repetition_penalty=1.0,
    do_sample=False,
    vision_chunked_length=64,
)

print(text)
```
|
|
| </details> |
|
|
| <details> |
| <summary><strong>Single-video offline inference (Python)</strong></summary> |
|
|
| <br> |
|
|
```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

checkpoint = "path/to/checkpoint"
video_path = "data/example_video.mp4"
prompt = "Describe this video."


def load_model(checkpoint: str):
    processor = AutoProcessor.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        frame_extract_num_threads=1,
    )
    model = AutoModelForCausalLM.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        device_map="auto",
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
    )
    return model, processor


model, processor = load_model(checkpoint)

text = model.offline_video_generate(
    processor,
    prompt=prompt,
    video=video_path,
    shortest_edge=4096,
    longest_edge=16777216,
    video_max_pixels=201326592,
    patch_size=16,
    temporal_patch_size=1,
    merge_size=2,
    video_fps=1.0,
    min_frames=1,
    max_frames=256,
    num_extract_threads=4,
    image_mean=[0.5, 0.5, 0.5],
    image_std=[0.5, 0.5, 0.5],
    max_new_tokens=256,
    temperature=1.0,
    top_k=50,
    top_p=1.0,
    repetition_penalty=1.0,
    do_sample=False,
    vision_chunked_length=64,
)

print(text)
```
|
|
| </details> |
|
|
| <details> |
| <summary><strong>Batched offline inference (Python)</strong></summary> |
|
|
| <br> |
|
|
```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

checkpoint = "path/to/checkpoint"
processor = AutoProcessor.from_pretrained(
    checkpoint,
    trust_remote_code=True,
    frame_extract_num_threads=1,
)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

queries = [
    {
        "prompt": "Describe sample A.",
        "images": [],
        "videos": ["data/sample_a.mp4"],
        "media_kwargs": {"video_fps": 1.0, "min_frames": 8, "max_frames": 256},
        "generate_kwargs": {
            "temperature": 1.0,
            "top_k": 50,
            "top_p": 1.0,
            "max_new_tokens": 256,
            "repetition_penalty": 1.0,
            "do_sample": False,
        },
    },
    {
        "prompt": "Describe sample B.",
        "images": [],
        "videos": ["data/sample_b.mp4"],
        "media_kwargs": {"video_fps": 1.0, "min_frames": 8, "max_frames": 256},
        "generate_kwargs": {
            "temperature": 1.0,
            "top_k": 50,
            "top_p": 1.0,
            "max_new_tokens": 256,
            "repetition_penalty": 1.0,
            "do_sample": False,
        },
    },
]

with torch.no_grad():
    result = model.offline_batch_generate(processor, queries, vision_chunked_length=64)

texts = [item["text"] for item in result["results"]]
```
|
|
| </details> |
|
|
| ## 🚧 Limitations and Future Work |
|
|
| MOSS-VL-Instruct-0408 represents an early milestone in the MOSS-VL roadmap, and we're actively working on several directions to push it further: |
|
|
| - 🧮 **Math & Code Reasoning** — While the current checkpoint already exhibits great general reasoning, we plan to substantially strengthen its mathematical reasoning and code reasoning capabilities, especially in multimodal contexts. |
| - 🎯 **RL Post-Training** — We are working on a reinforcement learning post-training stage to further align the model with human preferences and to unlock stronger multi-step reasoning behaviors on top of the SFT foundation. |
|
|
|
|
| > [!NOTE] |
| > We welcome community feedback and contributions on any of these directions. |
|
|
|
|
|
|
| ## 📜 Citation |
| ```bibtex |
| @misc{moss_vl_2026, |
| title = {{MOSS-VL Technical Report}}, |
| author = {OpenMOSS Team}, |
| year = {2026}, |
| howpublished = {\url{https://github.com/OpenMOSS/MOSS-VL}}, |
| note = {GitHub repository} |
| } |
| ``` |
|
|