---
title: MOSS-VL-Base-0408
date: 2026-04-08
category: Multimodal-LLM
status: Base
language:
- en
library_name: transformers
pipeline_tag: video-text-to-text
license: apache-2.0
tags:
- Base
- Video-Understanding
- Image-Understanding
- MOSS-VL
- OpenMOSS
- multimodal
- video
- vision-language
---
<p align="center">
<img src="assets/logo.png" width="320"/>
</p>
# MOSS-VL-Base-0408
## Introduction
MOSS-VL-Base-0408 is the foundation checkpoint of the MOSS-VL series, part of the OpenMOSS ecosystem dedicated to advancing visual understanding.
Built through four stages of multimodal pretraining only (no instruction tuning or alignment), this checkpoint serves as a high-capacity offline multimodal base model. It provides strong general-purpose visual-linguistic representations across image and video inputs, and is intended primarily as the base model for downstream supervised fine-tuning, alignment, and domain adaptation.
Specifically, the pretraining pipeline is structured into the following four progressive stages:
- Stage 1: Vision-language alignment
- Stage 2: Large-scale multimodal pretraining
- Stage 3: High-quality multimodal pretraining
- Stage 4: Annealing and long-context extension
### Highlights
- **Native Dynamic Resolution**: MOSS-VL-Base-0408 natively processes images and video frames at their original aspect ratios and resolutions. By preserving the raw spatial layout, it faithfully captures fine visual details across diverse formats, from high-resolution photographs and dense document scans to ultra-wide screenshots.
- **Native Interleaved Image & Video Inputs**: The model accepts arbitrary combinations of images and videos within a single sequence. Through a unified end-to-end pipeline, it seamlessly handles complex mixed-modality prompts, multi-image comparisons, and interleaved visual narratives without requiring modality-specific pre-processing.
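To see what native dynamic resolution implies for sequence length, the sketch below estimates the vision-token count of an arbitrarily sized image under the `patch_size=16` and `merge_size=2` values used in the Quickstart. The formula is an assumption for illustration, not the processor's exact logic.

```python
import math

def estimate_vision_tokens(height: int, width: int,
                           patch_size: int = 16, merge_size: int = 2) -> int:
    """Rough vision-token estimate for a natively sized image.

    Assumes the image is split into patch_size x patch_size patches and
    adjacent patches are merged in merge_size x merge_size groups, as the
    Quickstart parameters suggest; the real preprocessing may differ.
    """
    unit = patch_size * merge_size  # effective pixels per merged-token side
    tokens_h = math.ceil(height / unit)
    tokens_w = math.ceil(width / unit)
    return tokens_h * tokens_w

# A 1024x768 photo kept at native resolution:
print(estimate_vision_tokens(768, 1024))  # 24 * 32 = 768 tokens
```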
## Model Architecture
**MOSS-VL-Base-0408** adopts a cross-attention-based architecture that decouples visual encoding from cognitive reasoning. Natively supporting interleaved modalities, it provides a multimodal backbone for image and video understanding.
<p align="center">
<img src="assets/structure.png" alt="MOSS-VL Architecture" width="90%"/>
</p>
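The decoupling described above can be pictured as text hidden states querying visual features through cross-attention. The toy module below sketches only that wiring; the dimensions, head count, and normalization choice are invented for illustration and are not MOSS-VL's actual configuration.

```python
import torch
import torch.nn as nn

class ToyCrossAttentionBlock(nn.Module):
    """Schematic only: text hidden states attend to visual features via
    cross-attention, keeping visual encoding decoupled from the language
    model. Sizes and wiring are illustrative, not MOSS-VL's real ones."""

    def __init__(self, d_model: int = 64, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, text_h: torch.Tensor, vision_feats: torch.Tensor):
        # Queries come from the text stream; keys/values from the visual stream.
        out, _ = self.attn(query=text_h, key=vision_feats, value=vision_feats)
        return self.norm(text_h + out)  # residual + norm

block = ToyCrossAttentionBlock()
text_h = torch.randn(1, 8, 64)      # 8 text tokens
vision = torch.randn(1, 120, 64)    # 120 visual features
print(block(text_h, vision).shape)  # torch.Size([1, 8, 64])
```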
## Absolute Timestamps
To help the model perceive the pacing and duration of events, **MOSS-VL-Base-0408** injects absolute timestamps alongside sampled video frames, giving the reasoning process an explicit temporal reference even at the pretrained base stage.
<p align="center">
<img src="assets/timestamp_input.svg" alt="Timestamped Sequence Input Illustration" width="90%"/>
</p>
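As a rough picture of timestamp injection, the snippet below pairs each sampled frame with an absolute `mm:ss` stamp derived from the sampling rate. The string format and `<frame_i>` placeholders are hypothetical; the real token layout is handled internally by the processor.

```python
def build_timestamped_sequence(num_frames: int, fps: float = 1.0):
    """Illustrative only: pair each sampled frame with its absolute
    timestamp from video start; the actual format used by MOSS-VL's
    processor may differ."""
    entries = []
    for i in range(num_frames):
        t = i / fps  # seconds from the start of the video
        stamp = f"{int(t) // 60:02d}:{int(t) % 60:02d}"
        entries.append((stamp, f"<frame_{i}>"))
    return entries

print(build_timestamped_sequence(3, fps=1.0))
# [('00:00', '<frame_0>'), ('00:01', '<frame_1>'), ('00:02', '<frame_2>')]
```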
## Cross-attention RoPE (XRoPE)
MOSS-VL utilizes Cross-attention Rotary Position Embedding (XRoPE), tailored to its cross-attention-based vision-language architecture. This mechanism maps text tokens and visual features into a unified 3D coordinate space defined by Time (t), Height (h), and Width (w), improving spatial-temporal grounding during multimodal reasoning.
<p align="center">
<img src="assets/3d-rope.png" alt="MOSS-VL XRoPE Architecture Illustration" width="80%"/>
</p>
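The unified (t, h, w) space can be illustrated with a toy position-assignment routine: text tokens advance all three axes in lockstep, while visual tokens take their grid coordinates offset by the running position. This mirrors the general idea of multimodal rotary indexing but is a simplified sketch, not MOSS-VL's exact XRoPE scheme.

```python
def assign_3d_positions(segments):
    """Toy illustration of a unified (t, h, w) coordinate space.

    `segments` is a list of ("text", n_tokens) or ("visual", (T, H, W))
    entries. Text tokens advance all three axes together; visual tokens
    take their grid coordinates, offset by the current position. A
    simplified sketch, not the model's actual XRoPE indexing.
    """
    coords, pos = [], 0
    for kind, spec in segments:
        if kind == "text":
            for _ in range(spec):
                coords.append((pos, pos, pos))
                pos += 1
        else:  # a visual grid of shape (T, H, W)
            T, H, W = spec
            for t in range(T):
                for h in range(H):
                    for w in range(W):
                        coords.append((pos + t, pos + h, pos + w))
            pos += max(T, H, W)
    return coords

# 2 text tokens, a 2x2x2 visual grid, then 1 more text token:
coords = assign_3d_positions([("text", 2), ("visual", (2, 2, 2)), ("text", 1)])
print(coords[0], coords[-1])  # (0, 0, 0) (4, 4, 4)
```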
## Quickstart
### Installation
```bash
conda create -n moss_vl python=3.12 pip -y
conda activate moss_vl
pip install -i https://pypi.org/simple --no-build-isolation -r requirements.txt
```
### Run Inference
<details>
<summary><strong>Single-image offline inference (Python)</strong></summary>
<br>
```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

checkpoint = "path/to/checkpoint"
image_path = "data/example_image.jpg"

def load_model(checkpoint: str):
    processor = AutoProcessor.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        frame_extract_num_threads=1,
    )
    model = AutoModelForCausalLM.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        device_map="auto",
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
    )
    return model, processor

model, processor = load_model(checkpoint)

text = model.offline_image_generate(
    processor,
    prompt="",
    image=image_path,
    shortest_edge=4096,
    longest_edge=16777216,
    multi_image_max_pixels=201326592,
    patch_size=16,
    temporal_patch_size=1,
    merge_size=2,
    image_mean=[0.5, 0.5, 0.5],
    image_std=[0.5, 0.5, 0.5],
    max_new_tokens=256,
    temperature=1.0,
    top_k=50,
    top_p=1.0,
    repetition_penalty=1.0,
    do_sample=False,
    vision_chunked_length=64,
    use_template=False,
)
print(text)
```
</details>
<details>
<summary><strong>Single-video offline inference (Python)</strong></summary>
<br>
```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

checkpoint = "path/to/checkpoint"
video_path = "data/example_video.mp4"

def load_model(checkpoint: str):
    processor = AutoProcessor.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        frame_extract_num_threads=1,
    )
    model = AutoModelForCausalLM.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        device_map="auto",
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
    )
    return model, processor

model, processor = load_model(checkpoint)

text = model.offline_video_generate(
    processor,
    prompt="",
    video=video_path,
    shortest_edge=4096,
    longest_edge=16777216,
    video_max_pixels=201326592,
    patch_size=16,
    temporal_patch_size=1,
    merge_size=2,
    video_fps=1.0,
    min_frames=1,
    max_frames=256,
    num_extract_threads=4,
    image_mean=[0.5, 0.5, 0.5],
    image_std=[0.5, 0.5, 0.5],
    max_new_tokens=256,
    temperature=1.0,
    top_k=50,
    top_p=1.0,
    repetition_penalty=1.0,
    do_sample=False,
    vision_chunked_length=64,
    use_template=False,
)
print(text)
```
</details>
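The `video_fps`, `min_frames`, and `max_frames` arguments above jointly determine how many frames enter the model. Assuming uniform sampling at `video_fps` clamped to those bounds (the processor's exact rounding may differ), the frame budget works out as:

```python
def num_sampled_frames(duration_s: float, video_fps: float = 1.0,
                       min_frames: int = 1, max_frames: int = 256) -> int:
    """Estimate the number of frames sampled from a video, assuming
    uniform sampling at `video_fps` clamped to [min_frames, max_frames];
    the processor's exact rounding may differ."""
    n = round(duration_s * video_fps)
    return max(min_frames, min(max_frames, n))

print(num_sampled_frames(90.0))    # 90 frames for a 90-second clip
print(num_sampled_frames(3600.0))  # a 1-hour video is capped at max_frames = 256
```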
<details>
<summary><strong>Batched offline inference (Python)</strong></summary>
<br>
```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

checkpoint = "path/to/checkpoint"

shared_generate_kwargs = {
    "temperature": 1.0,
    "top_k": 50,
    "top_p": 1.0,
    "max_new_tokens": 256,
    "repetition_penalty": 1.0,
    "do_sample": False,
}
shared_video_media_kwargs = {
    "min_pixels": 4096,
    "max_pixels": 16777216,
    "video_max_pixels": 201326592,
    "video_fps": 1.0,
    "min_frames": 1,
    "max_frames": 256,
}

def load_model(checkpoint: str):
    processor = AutoProcessor.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        frame_extract_num_threads=1,
    )
    model = AutoModelForCausalLM.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        device_map="auto",
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
    )
    return model, processor

model, processor = load_model(checkpoint)

queries = [
    {
        "images": ["data/sample_a.jpg"],
        "generate_kwargs": dict(shared_generate_kwargs),
    },
    {
        "videos": ["data/sample_b.mp4"],
        "media_kwargs": dict(shared_video_media_kwargs),
        "generate_kwargs": dict(shared_generate_kwargs),
    },
]

with torch.no_grad():
    result = model.offline_batch_generate(
        processor,
        queries,
        session_states=None,
        vision_chunked_length=64,
    )
texts = [item["text"] for item in result["results"]]
```
</details>
## Limitations and Future Work
MOSS-VL-Base-0408 is a pretrained base checkpoint, and we are actively improving several core capabilities for future iterations:
- **Stronger OCR, Especially for Long Documents**: We plan to further improve text recognition, document parsing, and long-document understanding. A key focus is near-lossless information extraction from extremely long, structurally complex inputs, such as accurately parsing text, tables, and mathematical layouts from multi-page academic papers (dozens of pages) or dense PDF reports without degrading context or structural integrity.
- **Expanded Extremely Long Video Understanding**: We aim to significantly extend the model's capacity for comprehending extremely long videos spanning several hours to dozens of hours. This includes advancing temporal reasoning and cross-frame event tracking for continuous analysis of full-length movies, lengthy meetings, or extended surveillance streams, enabling robust retrieval and understanding over ultra-long visual contexts.
> [!NOTE]
> We expect future releases to continue strengthening the base model itself while also enabling stronger downstream aligned variants built on top of it.
## Citation
```bibtex
@misc{moss_vl_2026,
title = {{MOSS-VL Technical Report}},
author = {OpenMOSS Team},
year = {2026},
howpublished = {\url{https://github.com/fnlp-vision/MOSS-VL}},
note = {GitHub repository}
}
```
|