---
title: MOSS-VL-Base-0408
date: 2026-04-08
category: Multimodal-LLM
status: Base
language:
- en
library_name: transformers
pipeline_tag: video-text-to-text
license: apache-2.0
tags:
- Base
- Video-Understanding
- Image-Understanding
- MOSS-VL
- OpenMOSS
- multimodal
- video
- vision-language
---

<p align="center">
<img src="assets/logo.png" width="320"/>
</p>
| # MOSS-VL-Base-0408 |
|
|
| ## ๐ Introduction |
|
|
| MOSS-VL-Base-0408 is the foundation checkpoint of the MOSS-VL series, part of the OpenMOSS ecosystem dedicated to advancing visual understanding. |
|
|
Built through four progressive stages of multimodal pretraining, with no instruction tuning or alignment applied, this checkpoint serves as a high-capacity multimodal base model for offline inference. It provides strong general-purpose vision-language representations across image and video inputs and is intended primarily as a starting point for downstream supervised fine-tuning, alignment, and domain adaptation.
|
|
| Specifically, the pretraining pipeline is structured into the following four progressive stages: |
|
|
| - Stage 1: Vision-language alignment |
| - Stage 2: Large-scale multimodal pretraining |
| - Stage 3: High-quality multimodal pretraining |
| - Stage 4: Annealing and long-context extension |
|
|
### ✨ Highlights
|
|
| - ๐ **Native Dynamic Resolution** MOSS-VL-Base-0408 natively processes images and video frames at their original aspect ratios and resolutions. By preserving the raw spatial layout, it faithfully captures fine visual details across diverse formatsโfrom high-resolution photographs and dense document scans to ultra-wide screenshots. |
| - ๐๏ธ **Native Interleaved Image & Video Inputs** The model accepts arbitrary combinations of images and videos within a single sequence. Through a unified end-to-end pipeline, it seamlessly handles complex mixed-modality prompts, multi-image comparisons, and interleaved visual narratives without requiring modality-specific pre-processing. |
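As a rough illustration of how native dynamic resolution affects sequence length, the sketch below estimates vision-token counts from the patching parameters exposed in the Quickstart (`patch_size=16`, `merge_size=2`). The rounding scheme here is an assumption for illustration; the actual MOSS-VL processor may resize and round differently.

```python
import math


def estimate_vision_tokens(height, width, patch_size=16, merge_size=2):
    """Estimate vision tokens for an image kept at its native aspect ratio.

    Dimensions are rounded up to a multiple of patch_size * merge_size so
    that 2x2 patch groups merge evenly; this rounding is an illustrative
    assumption, not the checkpoint's exact preprocessing.
    """
    unit = patch_size * merge_size          # 32 px per merged-token edge
    h = math.ceil(height / unit) * unit
    w = math.ceil(width / unit) * unit
    patches = (h // patch_size) * (w // patch_size)
    return patches // (merge_size ** 2)     # spatial merge of 2x2 patches


# A standard photo and an ultra-wide screenshot keep their own aspect
# ratios and simply yield different token counts.
print(estimate_vision_tokens(1024, 768))   # 768
print(estimate_vision_tokens(3440, 256))   # 864
```

Because no fixed square resize is applied, a dense ultra-wide screenshot can legitimately cost more tokens than a larger-area photo would after cropping.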
|
|
|
|
| ## ๐ Model Architecture |
|
|
| **MOSS-VL-Base-0408** adopts a cross-attention-based architecture that decouples visual encoding from cognitive reasoning. Natively supporting interleaved modalities, it provides a multimodal backbone for image and video understanding. |
|
|
| <p align="center"> |
| <img src="assets/structure.png" alt="MOSS-VL Architecture" width="90%"/> |
| </p> |
## Absolute Timestamps
|
|
| To help the model perceive the pacing and duration of events, **MOSS-VL-Base-0408** injects absolute timestamps alongside sampled video frames, giving the reasoning process an explicit temporal reference even at the pretrained base stage. |
|
|
| <p align="center"> |
| <img src="assets/timestamp_input.svg" alt="Timestamped Sequence Input Illustration" width="90%"/> |
| </p> |
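The idea can be sketched in a few lines: sample frames at a fixed rate and pair each one with its absolute time. The `<ts>MM:SS</ts>` marker below is a hypothetical format for illustration only; the defaults (`video_fps=1.0`, `max_frames=256`) mirror the Quickstart parameters, and the model's real timestamp encoding lives inside the processor.

```python
def timestamped_frames(duration_s, fps=1.0, max_frames=256):
    """Pair each sampled frame with an absolute timestamp string.

    Illustrative only: "<ts>MM:SS</ts>" is a placeholder marker, not the
    actual MOSS-VL timestamp token format.
    """
    n = min(int(duration_s * fps), max_frames)
    frames = []
    for i in range(n):
        t = i / fps                      # absolute time of frame i, seconds
        m, s = divmod(int(t), 60)
        frames.append((t, f"<ts>{m:02d}:{s:02d}</ts>"))
    return frames


frames = timestamped_frames(duration_s=95.0, fps=1.0)
print(len(frames))        # 95 frames at 1 fps
print(frames[0][1])       # <ts>00:00</ts>
print(frames[-1][1])      # <ts>01:34</ts>
```

Pairing each frame with wall-clock time (rather than a bare frame index) is what lets the model reason about pacing and duration even when the sampling rate changes.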
## Cross-attention RoPE (XRoPE)
|
|
| MOSS-VL utilizes Cross-attention Rotary Position Embedding (XRoPE), tailored to its cross-attention-based vision-language architecture. This mechanism maps text tokens and visual features into a unified 3D coordinate space defined by Time (t), Height (h), and Width (w), improving spatial-temporal grounding during multimodal reasoning. |
|
|
| <p align="center"> |
<img src="assets/3d-rope.png" alt="MOSS-VL XRoPE Architecture Illustration" width="80%"/>
| </p> |
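To make the unified coordinate space concrete, the toy sketch below assigns (t, h, w) indices in an mRoPE-style fashion: text tokens advance all three axes in lockstep, while visual tokens index their frame for the time axis and their grid cell for the spatial axes. The offset convention and axis ordering are assumptions for illustration, not the released XRoPE implementation.

```python
def xrope_positions(num_text_tokens, frames, grid_h, grid_w):
    """Assign (t, h, w) coordinates to a text prefix followed by visual tokens.

    Toy convention (an assumption, not the released scheme): text tokens
    take diagonal positions (i, i, i), so they behave like ordinary 1D
    RoPE; visual tokens start after the text and use frame index for t
    and grid cell for (h, w).
    """
    positions = [(i, i, i) for i in range(num_text_tokens)]
    base = num_text_tokens                   # visual tokens follow the text
    for t in range(frames):
        for h in range(grid_h):
            for w in range(grid_w):
                positions.append((base + t, base + h, base + w))
    return positions


pos = xrope_positions(num_text_tokens=4, frames=2, grid_h=2, grid_w=2)
print(pos[3])    # last text token: (3, 3, 3)
print(pos[4])    # first visual token of frame 0: (4, 4, 4)
print(pos[-1])   # last visual token of frame 1: (5, 5, 5)
```

Two visual tokens in the same frame share a t coordinate but differ in (h, w), so the rotary phase difference between them encodes spatial offset rather than sequence distance.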
|
|
| ## ๐ Quickstart |
| ### ๐ ๏ธ Installation |
|
|
| ```bash |
| conda create -n moss_vl python=3.12 pip -y |
| conda activate moss_vl |
| pip install -i https://pypi.org/simple --no-build-isolation -r requirements.txt |
| ``` |
|
|
| ### ๐ Run Inference |
|
|
| <details> |
| <summary><strong>Single-image offline inference (Python)</strong></summary> |
|
|
| <br> |
|
|
```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

checkpoint = "path/to/checkpoint"
image_path = "data/example_image.jpg"


def load_model(checkpoint: str):
    # Both the processor and the model ship custom code with the
    # checkpoint, so trust_remote_code=True is required.
    processor = AutoProcessor.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        frame_extract_num_threads=1,
    )
    model = AutoModelForCausalLM.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        device_map="auto",
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
    )
    return model, processor


model, processor = load_model(checkpoint)

# Greedy decoding (do_sample=False) over a single native-resolution image.
text = model.offline_image_generate(
    processor,
    prompt="",
    image=image_path,
    shortest_edge=4096,
    longest_edge=16777216,
    multi_image_max_pixels=201326592,
    patch_size=16,
    temporal_patch_size=1,
    merge_size=2,
    image_mean=[0.5, 0.5, 0.5],
    image_std=[0.5, 0.5, 0.5],
    max_new_tokens=256,
    temperature=1.0,
    top_k=50,
    top_p=1.0,
    repetition_penalty=1.0,
    do_sample=False,
    vision_chunked_length=64,
    use_template=False,
)

print(text)
```
|
|
| </details> |
|
|
| <details> |
| <summary><strong>Single-video offline inference (Python)</strong></summary> |
|
|
| <br> |
|
|
```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

checkpoint = "path/to/checkpoint"
video_path = "data/example_video.mp4"


def load_model(checkpoint: str):
    # Both the processor and the model ship custom code with the
    # checkpoint, so trust_remote_code=True is required.
    processor = AutoProcessor.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        frame_extract_num_threads=1,
    )
    model = AutoModelForCausalLM.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        device_map="auto",
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
    )
    return model, processor


model, processor = load_model(checkpoint)

# Frames are sampled at video_fps=1.0 and capped at max_frames=256;
# greedy decoding is enabled via do_sample=False.
text = model.offline_video_generate(
    processor,
    prompt="",
    video=video_path,
    shortest_edge=4096,
    longest_edge=16777216,
    video_max_pixels=201326592,
    patch_size=16,
    temporal_patch_size=1,
    merge_size=2,
    video_fps=1.0,
    min_frames=1,
    max_frames=256,
    num_extract_threads=4,
    image_mean=[0.5, 0.5, 0.5],
    image_std=[0.5, 0.5, 0.5],
    max_new_tokens=256,
    temperature=1.0,
    top_k=50,
    top_p=1.0,
    repetition_penalty=1.0,
    do_sample=False,
    vision_chunked_length=64,
    use_template=False,
)

print(text)
```
|
|
| </details> |
|
|
| <details> |
| <summary><strong>Batched offline inference (Python)</strong></summary> |
|
|
| <br> |
|
|
```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

checkpoint = "path/to/checkpoint"
shared_generate_kwargs = {
    "temperature": 1.0,
    "top_k": 50,
    "top_p": 1.0,
    "max_new_tokens": 256,
    "repetition_penalty": 1.0,
    "do_sample": False,
}
shared_video_media_kwargs = {
    "min_pixels": 4096,
    "max_pixels": 16777216,
    "video_max_pixels": 201326592,
    "video_fps": 1.0,
    "min_frames": 1,
    "max_frames": 256,
}


def load_model(checkpoint: str):
    # Both the processor and the model ship custom code with the
    # checkpoint, so trust_remote_code=True is required.
    processor = AutoProcessor.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        frame_extract_num_threads=1,
    )
    model = AutoModelForCausalLM.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        device_map="auto",
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
    )
    return model, processor


model, processor = load_model(checkpoint)

# Each query carries its own media and generation kwargs; dict() copies
# keep the shared defaults from being mutated per query.
queries = [
    {
        "images": ["data/sample_a.jpg"],
        "generate_kwargs": dict(shared_generate_kwargs),
    },
    {
        "videos": ["data/sample_b.mp4"],
        "media_kwargs": dict(shared_video_media_kwargs),
        "generate_kwargs": dict(shared_generate_kwargs),
    },
]

with torch.no_grad():
    result = model.offline_batch_generate(
        processor,
        queries,
        session_states=None,
        vision_chunked_length=64,
    )

texts = [item["text"] for item in result["results"]]
```
|
|
| </details> |
|
|
## Limitations and Future Work
|
|
| MOSS-VL-Base-0408 is a pretrained base checkpoint, and we are actively improving several core capabilities for future iterations: |
|
|
| - ๐ **Stronger OCR, Especially for Long Documents** โ We plan to further improve text recognition, document parsing, and long-document understanding. A key focus is achieving near-lossless information extraction and understanding for extremely long and structurally complex inputs, such as accurately parsing texts, tables, and mathematical layouts from multi-page academic papers (dozens of pages) or dense PDF reports without degrading context or structural integrity. |
| - ๐ฌ **Expanded Extremely Long Video Understanding** โ We aim to significantly extend the model's capacity for comprehending extremely long videos spanning several hours to dozens of hours. This includes advancing temporal reasoning and cross-frame event tracking for continuous analysis of full-length movies, lengthy meetings, or extended surveillance streams, enabling robust retrieval and understanding over ultra-long visual contexts. |
|
|
| > [!NOTE] |
| > We expect future releases to continue strengthening the base model itself while also enabling stronger downstream aligned variants built on top of it. |
|
|
| ## ๐ Citation |
| ```bibtex |
| @misc{moss_vl_2026, |
| title = {{MOSS-VL Technical Report}}, |
| author = {OpenMOSS Team}, |
| year = {2026}, |
| howpublished = {\url{https://github.com/fnlp-vision/MOSS-VL}}, |
| note = {GitHub repository} |
| } |
| ``` |
|
|