---
title: MOSS-VL-Base-0408
date: 2026-04-08
category: Multimodal-LLM
status: Base
language:
- en
library_name: transformers
pipeline_tag: video-text-to-text
license: apache-2.0
tags:
- Base
- Video-Understanding
- Image-Understanding
- MOSS-VL
- OpenMOSS
- multimodal
- video
- vision-language
---

# MOSS-VL-Base-0408 ## πŸ“Œ Introduction MOSS-VL-Base-0408 is the foundation checkpoint of the MOSS-VL series, part of the OpenMOSS ecosystem dedicated to advancing visual understanding. Built through four stages of multimodal pretraining only, this checkpoint serves as a high-capacity offline multimodal base model. It provides strong general-purpose visual-linguistic representations across image and video inputs, and is intended primarily as the base model for downstream supervised fine-tuning, alignment, and domain adaptation. Specifically, the pretraining pipeline is structured into the following four progressive stages: - Stage 1: Vision-language alignment - Stage 2: Large-scale multimodal pretraining - Stage 3: High-quality multimodal pretraining - Stage 4: Annealing and long-context extension ### ✨ Highlights - πŸ“ **Native Dynamic Resolution** MOSS-VL-Base-0408 natively processes images and video frames at their original aspect ratios and resolutions. By preserving the raw spatial layout, it faithfully captures fine visual details across diverse formatsβ€”from high-resolution photographs and dense document scans to ultra-wide screenshots. - 🎞️ **Native Interleaved Image & Video Inputs** The model accepts arbitrary combinations of images and videos within a single sequence. Through a unified end-to-end pipeline, it seamlessly handles complex mixed-modality prompts, multi-image comparisons, and interleaved visual narratives without requiring modality-specific pre-processing. ## πŸ— Model Architecture **MOSS-VL-Base-0408** adopts a cross-attention-based architecture that decouples visual encoding from cognitive reasoning. Natively supporting interleaved modalities, it provides a multimodal backbone for image and video understanding.

MOSS-VL Architecture
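The native dynamic resolution described above implies an aspect-ratio-preserving patching step. The following is a minimal sketch of how such a patch grid could be computed, using the `patch_size=16` and `merge_size=2` values that appear in the Quickstart examples; the rounding policy and the function itself (`patch_grid`) are illustrative assumptions, not the model's actual implementation.

```python
import math

# Hypothetical illustration of aspect-ratio-preserving patching.
# patch_size and merge_size mirror the Quickstart values (16 and 2);
# the round-up-to-tile policy is an assumption for illustration.

def patch_grid(height: int, width: int, patch_size: int = 16, merge_size: int = 2):
    """Return the (rows, cols) patch grid for an image at its native
    aspect ratio. Each dimension is rounded up to a multiple of
    patch_size * merge_size so merged patches tile the image exactly."""
    unit = patch_size * merge_size
    rows = math.ceil(height / unit) * merge_size
    cols = math.ceil(width / unit) * merge_size
    return rows, cols

# A 1080x1920 frame keeps its 9:16 layout instead of being squashed
# into a fixed square grid.
rows, cols = patch_grid(1080, 1920)
print(rows, cols, rows * cols)
```

Because the grid follows the input's own dimensions, an ultra-wide screenshot produces many more columns than rows, preserving the spatial layout the highlight refers to.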

## 🧩 Absolute Timestamps

To help the model perceive the pacing and duration of events, **MOSS-VL-Base-0408** injects absolute timestamps alongside sampled video frames, giving the reasoning process an explicit temporal reference even at the pretrained base stage.

Timestamped Sequence Input Illustration
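The timestamp injection above can be sketched as pairing each sampled frame with the absolute time it was taken from. The `<ts=...>` marker format and the `timestamped_frames` helper below are illustrative assumptions; the `video_fps=1.0` and `max_frames=256` defaults mirror the Quickstart parameters.

```python
# Hypothetical sketch of absolute-timestamp injection: each frame
# sampled at a fixed fps is paired with its absolute time, so the
# input sequence carries an explicit temporal reference.
# The "<ts=...>" marker format is an assumption for illustration,
# not the model's actual token format.

def timestamped_frames(duration_s: float, sample_fps: float = 1.0, max_frames: int = 256):
    """Sample frame times at a fixed fps (capped at max_frames) and
    return (timestamp_marker, time_in_seconds) pairs."""
    n = min(int(duration_s * sample_fps) + 1, max_frames)
    times = [i / sample_fps for i in range(n)]
    return [(f"<ts={t:.1f}s>", t) for t in times]

# A 3-second clip at 1 fps yields frames at 0.0s, 1.0s, 2.0s, 3.0s.
for marker, _ in timestamped_frames(3.0):
    print(marker)
```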

## 🧬 Cross-attention RoPE (XRoPE)

MOSS-VL utilizes Cross-attention Rotary Position Embedding (XRoPE), tailored to its cross-attention-based vision-language architecture. This mechanism maps text tokens and visual features into a unified 3D coordinate space defined by Time (t), Height (h), and Width (w), improving spatial-temporal grounding during multimodal reasoning.

MOSS-VL XRoPE Architecture Illustration
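The unified (t, h, w) coordinate space can be illustrated with a small position-assignment sketch. The exact index layout below is an assumption in the spirit of XRoPE, not the model's actual scheme: text tokens advance all three coordinates together, while each visual patch keeps the (frame, row, col) of its location.

```python
# Minimal sketch of a unified 3D (t, h, w) position scheme in the
# spirit of XRoPE. The index layout is an assumption for illustration.

def text_positions(start: int, num_tokens: int):
    # A text token at step i gets the degenerate coordinate (i, i, i),
    # so 1D RoPE behavior is recovered for pure text.
    return [(start + i,) * 3 for i in range(num_tokens)]

def visual_positions(t_offset: int, num_frames: int, rows: int, cols: int):
    # A patch at (frame f, row r, col c) keeps its spatial-temporal
    # index, giving attention an explicit notion of where and when.
    return [
        (t_offset + f, r, c)
        for f in range(num_frames)
        for r in range(rows)
        for c in range(cols)
    ]

print(text_positions(0, 2))          # [(0, 0, 0), (1, 1, 1)]
print(visual_positions(2, 1, 2, 2))  # [(2, 0, 0), (2, 0, 1), (2, 1, 0), (2, 1, 1)]
```

Keeping text on the diagonal while patches occupy the full 3D grid is what lets a single rotary scheme cover both modalities without separate position tables.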

## πŸš€ Quickstart ### πŸ› οΈ Installation ```bash conda create -n moss_vl python=3.12 pip -y conda activate moss_vl pip install -i https://pypi.org/simple --no-build-isolation -r requirements.txt ``` ### πŸƒ Run Inference
**Single-image offline inference (Python)**
```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

checkpoint = "path/to/checkpoint"
image_path = "data/example_image.jpg"


def load_model(checkpoint: str):
    processor = AutoProcessor.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        frame_extract_num_threads=1,
    )
    model = AutoModelForCausalLM.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        device_map="auto",
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
    )
    return model, processor


model, processor = load_model(checkpoint)

text = model.offline_image_generate(
    processor,
    prompt="",
    image=image_path,
    shortest_edge=4096,
    longest_edge=16777216,
    multi_image_max_pixels=201326592,
    patch_size=16,
    temporal_patch_size=1,
    merge_size=2,
    image_mean=[0.5, 0.5, 0.5],
    image_std=[0.5, 0.5, 0.5],
    max_new_tokens=256,
    temperature=1.0,
    top_k=50,
    top_p=1.0,
    repetition_penalty=1.0,
    do_sample=False,
    vision_chunked_length=64,
    use_template=False,
)
print(text)
```
**Single-video offline inference (Python)**
```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

checkpoint = "path/to/checkpoint"
video_path = "data/example_video.mp4"


def load_model(checkpoint: str):
    processor = AutoProcessor.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        frame_extract_num_threads=1,
    )
    model = AutoModelForCausalLM.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        device_map="auto",
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
    )
    return model, processor


model, processor = load_model(checkpoint)

text = model.offline_video_generate(
    processor,
    prompt="",
    video=video_path,
    shortest_edge=4096,
    longest_edge=16777216,
    video_max_pixels=201326592,
    patch_size=16,
    temporal_patch_size=1,
    merge_size=2,
    video_fps=1.0,
    min_frames=1,
    max_frames=256,
    num_extract_threads=4,
    image_mean=[0.5, 0.5, 0.5],
    image_std=[0.5, 0.5, 0.5],
    max_new_tokens=256,
    temperature=1.0,
    top_k=50,
    top_p=1.0,
    repetition_penalty=1.0,
    do_sample=False,
    vision_chunked_length=64,
    use_template=False,
)
print(text)
```
**Batched offline inference (Python)**
```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

checkpoint = "path/to/checkpoint"

shared_generate_kwargs = {
    "temperature": 1.0,
    "top_k": 50,
    "top_p": 1.0,
    "max_new_tokens": 256,
    "repetition_penalty": 1.0,
    "do_sample": False,
}

shared_video_media_kwargs = {
    "min_pixels": 4096,
    "max_pixels": 16777216,
    "video_max_pixels": 201326592,
    "video_fps": 1.0,
    "min_frames": 1,
    "max_frames": 256,
}


def load_model(checkpoint: str):
    processor = AutoProcessor.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        frame_extract_num_threads=1,
    )
    model = AutoModelForCausalLM.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        device_map="auto",
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
    )
    return model, processor


model, processor = load_model(checkpoint)

queries = [
    {
        "images": ["data/sample_a.jpg"],
        "generate_kwargs": dict(shared_generate_kwargs),
    },
    {
        "videos": ["data/sample_b.mp4"],
        "media_kwargs": dict(shared_video_media_kwargs),
        "generate_kwargs": dict(shared_generate_kwargs),
    },
]

with torch.no_grad():
    result = model.offline_batch_generate(
        processor,
        queries,
        session_states=None,
        vision_chunked_length=64,
    )

texts = [item["text"] for item in result["results"]]
```
## 🚧 Limitations and Future Work

MOSS-VL-Base-0408 is a pretrained base checkpoint, and we are actively improving several core capabilities for future iterations:

- 📄 **Stronger OCR, Especially for Long Documents** — We plan to further improve text recognition, document parsing, and long-document understanding. A key focus is achieving near-lossless information extraction and understanding for extremely long and structurally complex inputs, such as accurately parsing texts, tables, and mathematical layouts from multi-page academic papers (dozens of pages) or dense PDF reports without degrading context or structural integrity.
- 🎬 **Expanded Extremely Long Video Understanding** — We aim to significantly extend the model's capacity for comprehending extremely long videos spanning several hours to dozens of hours. This includes advancing temporal reasoning and cross-frame event tracking for continuous analysis of full-length movies, lengthy meetings, or extended surveillance streams, enabling robust retrieval and understanding over ultra-long visual contexts.

> [!NOTE]
> We expect future releases to continue strengthening the base model itself while also enabling stronger downstream aligned variants built on top of it.

## 📜 Citation

```bibtex
@misc{moss_vl_2026,
  title        = {{MOSS-VL Technical Report}},
  author       = {OpenMOSS Team},
  year         = {2026},
  howpublished = {\url{https://github.com/fnlp-vision/MOSS-VL}},
  note         = {GitHub repository}
}
```