---
title: MOSS-VL-Base-0408
date: 2026-04-08
category: Multimodal-LLM
status: Base
language:
- en
library_name: transformers
pipeline_tag: video-text-to-text
license: apache-2.0
tags:
- Base
- Video-Understanding
- Image-Understanding
- MOSS-VL
- OpenMOSS
- multimodal
- video
- vision-language
---

# MOSS-VL-Base-0408 ## πŸ“Œ Introduction MOSS-VL-Base-0408 is the foundation checkpoint of the MOSS-VL series, part of the OpenMOSS ecosystem dedicated to advancing visual understanding. Built through four stages of multimodal pretraining only, this checkpoint serves as a high-capacity offline multimodal base model. It provides strong general-purpose visual-linguistic representations across image and video inputs, and is intended primarily as the base model for downstream supervised fine-tuning, alignment, and domain adaptation. Specifically, the pretraining pipeline is structured into the following four progressive stages: - Stage 1: Vision-language alignment - Stage 2: Large-scale multimodal pretraining - Stage 3: High-quality multimodal pretraining - Stage 4: Annealing and long-context extension ### ✨ Highlights - πŸ“ **Native Dynamic Resolution** MOSS-VL-Base-0408 natively processes images and video frames at their original aspect ratios and resolutions. By preserving the raw spatial layout, it faithfully captures fine visual details across diverse formatsβ€”from high-resolution photographs and dense document scans to ultra-wide screenshots. - 🎞️ **Native Interleaved Image & Video Inputs** The model accepts arbitrary combinations of images and videos within a single sequence. Through a unified end-to-end pipeline, it seamlessly handles complex mixed-modality prompts, multi-image comparisons, and interleaved visual narratives without requiring modality-specific pre-processing. ## πŸ— Model Architecture **MOSS-VL-Base-0408** adopts a cross-attention-based architecture that decouples visual encoding from cognitive reasoning. Natively supporting interleaved modalities, it provides a multimodal backbone for image and video understanding.

MOSS-VL Architecture
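The native dynamic resolution described above implies an aspect-ratio-preserving patching step. The following is a minimal sketch of how such a patch grid could be computed, using the `patch_size=16` and `merge_size=2` values that appear in the Quickstart examples; the rounding policy and the function itself (`patch_grid`) are illustrative assumptions, not the model's actual implementation.

```python
import math

# Hypothetical illustration of aspect-ratio-preserving patching.
# patch_size and merge_size mirror the Quickstart values (16 and 2);
# the round-up-to-tile policy is an assumption for illustration.

def patch_grid(height: int, width: int, patch_size: int = 16, merge_size: int = 2):
    """Return the (rows, cols) patch grid for an image at its native
    aspect ratio. Each dimension is rounded up to a multiple of
    patch_size * merge_size so merged patches tile the image exactly."""
    unit = patch_size * merge_size
    rows = math.ceil(height / unit) * merge_size
    cols = math.ceil(width / unit) * merge_size
    return rows, cols

# A 1080x1920 frame keeps its 9:16 layout instead of being squashed
# into a fixed square grid.
rows, cols = patch_grid(1080, 1920)
print(rows, cols, rows * cols)
```

Because the grid follows the input's own dimensions, an ultra-wide screenshot produces many more columns than rows, preserving the spatial layout the highlight refers to.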

## 🧩 Absolute Timestamps

To help the model perceive the pacing and duration of events, **MOSS-VL-Base-0408** injects absolute timestamps alongside sampled video frames, giving the reasoning process an explicit temporal reference even at the pretrained base stage.

Timestamped Sequence Input Illustration
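The timestamp injection above can be sketched as pairing each sampled frame with the absolute time it was taken from. The `<ts=...>` marker format and the `timestamped_frames` helper below are illustrative assumptions; the `video_fps=1.0` and `max_frames=256` defaults mirror the Quickstart parameters.

```python
# Hypothetical sketch of absolute-timestamp injection: each frame
# sampled at a fixed fps is paired with its absolute time, so the
# input sequence carries an explicit temporal reference.
# The "<ts=...>" marker format is an assumption for illustration,
# not the model's actual token format.

def timestamped_frames(duration_s: float, sample_fps: float = 1.0, max_frames: int = 256):
    """Sample frame times at a fixed fps (capped at max_frames) and
    return (timestamp_marker, time_in_seconds) pairs."""
    n = min(int(duration_s * sample_fps) + 1, max_frames)
    times = [i / sample_fps for i in range(n)]
    return [(f"<ts={t:.1f}s>", t) for t in times]

# A 3-second clip at 1 fps yields frames at 0.0s, 1.0s, 2.0s, 3.0s.
for marker, _ in timestamped_frames(3.0):
    print(marker)
```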

## 🧬 Cross-attention RoPE (XRoPE)

MOSS-VL utilizes Cross-attention Rotary Position Embedding (XRoPE), tailored to its cross-attention-based vision-language architecture. This mechanism maps text tokens and visual features into a unified 3D coordinate space defined by Time (t), Height (h), and Width (w), improving spatial-temporal grounding during multimodal reasoning.

MOSS-VL XRoPE Architecture Illustration
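The unified (t, h, w) coordinate space can be illustrated with a small position-assignment sketch. The exact index layout below is an assumption in the spirit of XRoPE, not the model's actual scheme: text tokens advance all three coordinates together, while each visual patch keeps the (frame, row, col) of its location.

```python
# Minimal sketch of a unified 3D (t, h, w) position scheme in the
# spirit of XRoPE. The index layout is an assumption for illustration.

def text_positions(start: int, num_tokens: int):
    # A text token at step i gets the degenerate coordinate (i, i, i),
    # so 1D RoPE behavior is recovered for pure text.
    return [(start + i,) * 3 for i in range(num_tokens)]

def visual_positions(t_offset: int, num_frames: int, rows: int, cols: int):
    # A patch at (frame f, row r, col c) keeps its spatial-temporal
    # index, giving attention an explicit notion of where and when.
    return [
        (t_offset + f, r, c)
        for f in range(num_frames)
        for r in range(rows)
        for c in range(cols)
    ]

print(text_positions(0, 2))          # [(0, 0, 0), (1, 1, 1)]
print(visual_positions(2, 1, 2, 2))  # [(2, 0, 0), (2, 0, 1), (2, 1, 0), (2, 1, 1)]
```

Keeping text on the diagonal while patches occupy the full 3D grid is what lets a single rotary scheme cover both modalities without separate position tables.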

## πŸš€ Quickstart ### πŸ› οΈ Installation ```bash conda create -n moss_vl python=3.12 pip -y conda activate moss_vl pip install -i https://pypi.org/simple --no-build-isolation -r requirements.txt ``` ### πŸƒ Run Inference
**Single-image offline inference (Python)**
```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

checkpoint = "path/to/checkpoint"
image_path = "data/example_image.jpg"


def load_model(checkpoint: str):
    processor = AutoProcessor.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        frame_extract_num_threads=1,
    )
    model = AutoModelForCausalLM.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        device_map="auto",
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
    )
    return model, processor


model, processor = load_model(checkpoint)

text = model.offline_image_generate(
    processor,
    prompt="",
    image=image_path,
    shortest_edge=4096,
    longest_edge=16777216,
    multi_image_max_pixels=201326592,
    patch_size=16,
    temporal_patch_size=1,
    merge_size=2,
    image_mean=[0.5, 0.5, 0.5],
    image_std=[0.5, 0.5, 0.5],
    max_new_tokens=256,
    temperature=1.0,
    top_k=50,
    top_p=1.0,
    repetition_penalty=1.0,
    do_sample=False,
    vision_chunked_length=64,
    use_template=False,
)
print(text)
```
**Single-video offline inference (Python)**
```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

checkpoint = "path/to/checkpoint"
video_path = "data/example_video.mp4"


def load_model(checkpoint: str):
    processor = AutoProcessor.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        frame_extract_num_threads=1,
    )
    model = AutoModelForCausalLM.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        device_map="auto",
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
    )
    return model, processor


model, processor = load_model(checkpoint)

text = model.offline_video_generate(
    processor,
    prompt="",
    video=video_path,
    shortest_edge=4096,
    longest_edge=16777216,
    video_max_pixels=201326592,
    patch_size=16,
    temporal_patch_size=1,
    merge_size=2,
    video_fps=1.0,
    min_frames=1,
    max_frames=256,
    num_extract_threads=4,
    image_mean=[0.5, 0.5, 0.5],
    image_std=[0.5, 0.5, 0.5],
    max_new_tokens=256,
    temperature=1.0,
    top_k=50,
    top_p=1.0,
    repetition_penalty=1.0,
    do_sample=False,
    vision_chunked_length=64,
    use_template=False,
)
print(text)
```
**Batched offline inference (Python)**
```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

checkpoint = "path/to/checkpoint"

shared_generate_kwargs = {
    "temperature": 1.0,
    "top_k": 50,
    "top_p": 1.0,
    "max_new_tokens": 256,
    "repetition_penalty": 1.0,
    "do_sample": False,
}

shared_video_media_kwargs = {
    "min_pixels": 4096,
    "max_pixels": 16777216,
    "video_max_pixels": 201326592,
    "video_fps": 1.0,
    "min_frames": 1,
    "max_frames": 256,
}


def load_model(checkpoint: str):
    processor = AutoProcessor.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        frame_extract_num_threads=1,
    )
    model = AutoModelForCausalLM.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        device_map="auto",
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
    )
    return model, processor


model, processor = load_model(checkpoint)

queries = [
    {
        "images": ["data/sample_a.jpg"],
        "generate_kwargs": dict(shared_generate_kwargs),
    },
    {
        "videos": ["data/sample_b.mp4"],
        "media_kwargs": dict(shared_video_media_kwargs),
        "generate_kwargs": dict(shared_generate_kwargs),
    },
]

with torch.no_grad():
    result = model.offline_batch_generate(
        processor,
        queries,
        session_states=None,
        vision_chunked_length=64,
    )

texts = [item["text"] for item in result["results"]]
```
## 🚧 Limitations and Future Work

MOSS-VL-Base-0408 is a pretrained base checkpoint, and we are actively improving several core capabilities for future iterations:

- 📄 **Stronger OCR, Especially for Long Documents** — We plan to further improve text recognition, document parsing, and long-document understanding. A key focus is achieving near-lossless information extraction and understanding for extremely long and structurally complex inputs, such as accurately parsing texts, tables, and mathematical layouts from multi-page academic papers (dozens of pages) or dense PDF reports without degrading context or structural integrity.
- 🎬 **Expanded Extremely Long Video Understanding** — We aim to significantly extend the model's capacity for comprehending extremely long videos spanning several hours to dozens of hours. This includes advancing temporal reasoning and cross-frame event tracking for continuous analysis of full-length movies, lengthy meetings, or extended surveillance streams, enabling robust retrieval and understanding over ultra-long visual contexts.

> [!NOTE]
> We expect future releases to continue strengthening the base model itself while also enabling stronger downstream aligned variants built on top of it.

## 📜 Citation

```bibtex
@misc{moss_vl_2026,
  title        = {{MOSS-VL Technical Report}},
  author       = {OpenMOSS Team},
  year         = {2026},
  howpublished = {\url{https://github.com/fnlp-vision/MOSS-VL}},
  note         = {GitHub repository}
}
```