---
title: MOSS-VL-Base-0408
date: 2026-04-08
category: Multimodal-LLM
status: Base
language:
- en
library_name: transformers
pipeline_tag: video-text-to-text
license: apache-2.0
tags:
- Base
- Video-Understanding
- Image-Understanding
- MOSS-VL
- OpenMOSS
- multimodal
- video
- vision-language
---

<p align="center">
<img src="assets/logo.png" width="320"/>
</p>
| # MOSS-VL-Base-0408 |
|
|
| ## ๐ Introduction |
|
|
| MOSS-VL-Base-0408 is the foundation checkpoint of the MOSS-VL series, part of the OpenMOSS ecosystem dedicated to advancing visual understanding. |
|
|
Built through four progressive stages of multimodal pretraining, with no instruction tuning or alignment applied, this checkpoint serves as a high-capacity multimodal base model for offline inference. It provides strong general-purpose vision-language representations across image and video inputs and is intended primarily as a starting point for downstream supervised fine-tuning, alignment, and domain adaptation.
|
|
| Specifically, the pretraining pipeline is structured into the following four progressive stages: |
|
|
| - Stage 1: Vision-language alignment |
| - Stage 2: Large-scale multimodal pretraining |
| - Stage 3: High-quality multimodal pretraining |
| - Stage 4: Annealing and long-context extension |
|
|
### ✨ Highlights
|
|
| - ๐ **Native Dynamic Resolution** MOSS-VL-Base-0408 natively processes images and video frames at their original aspect ratios and resolutions. By preserving the raw spatial layout, it faithfully captures fine visual details across diverse formatsโfrom high-resolution photographs and dense document scans to ultra-wide screenshots. |
| - ๐๏ธ **Native Interleaved Image & Video Inputs** The model accepts arbitrary combinations of images and videos within a single sequence. Through a unified end-to-end pipeline, it seamlessly handles complex mixed-modality prompts, multi-image comparisons, and interleaved visual narratives without requiring modality-specific pre-processing. |
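As a rough illustration of how native dynamic resolution affects sequence length, the sketch below estimates vision-token counts from the patching parameters exposed in the Quickstart (`patch_size=16`, `merge_size=2`). The rounding scheme here is an assumption for illustration; the actual MOSS-VL processor may resize and round differently.

```python
import math


def estimate_vision_tokens(height, width, patch_size=16, merge_size=2):
    """Estimate vision tokens for an image kept at its native aspect ratio.

    Dimensions are rounded up to a multiple of patch_size * merge_size so
    that 2x2 patch groups merge evenly; this rounding is an illustrative
    assumption, not the checkpoint's exact preprocessing.
    """
    unit = patch_size * merge_size          # 32 px per merged-token edge
    h = math.ceil(height / unit) * unit
    w = math.ceil(width / unit) * unit
    patches = (h // patch_size) * (w // patch_size)
    return patches // (merge_size ** 2)     # spatial merge of 2x2 patches


# A standard photo and an ultra-wide screenshot keep their own aspect
# ratios and simply yield different token counts.
print(estimate_vision_tokens(1024, 768))   # 768
print(estimate_vision_tokens(3440, 256))   # 864
```

Because no fixed square resize is applied, a dense ultra-wide screenshot can legitimately cost more tokens than a larger-area photo would after cropping.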
|
|
|
|
| ## ๐ Model Architecture |
|
|
| **MOSS-VL-Base-0408** adopts a cross-attention-based architecture that decouples visual encoding from cognitive reasoning. Natively supporting interleaved modalities, it provides a multimodal backbone for image and video understanding. |
|
|
| <p align="center"> |
| <img src="assets/structure.png" alt="MOSS-VL Architecture" width="90%"/> |
| </p> |
## Absolute Timestamps
|
|
| To help the model perceive the pacing and duration of events, **MOSS-VL-Base-0408** injects absolute timestamps alongside sampled video frames, giving the reasoning process an explicit temporal reference even at the pretrained base stage. |
|
|
| <p align="center"> |
| <img src="assets/timestamp_input.svg" alt="Timestamped Sequence Input Illustration" width="90%"/> |
| </p> |
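The idea can be sketched in a few lines: sample frames at a fixed rate and pair each one with its absolute time. The `<ts>MM:SS</ts>` marker below is a hypothetical format for illustration only; the defaults (`video_fps=1.0`, `max_frames=256`) mirror the Quickstart parameters, and the model's real timestamp encoding lives inside the processor.

```python
def timestamped_frames(duration_s, fps=1.0, max_frames=256):
    """Pair each sampled frame with an absolute timestamp string.

    Illustrative only: "<ts>MM:SS</ts>" is a placeholder marker, not the
    actual MOSS-VL timestamp token format.
    """
    n = min(int(duration_s * fps), max_frames)
    frames = []
    for i in range(n):
        t = i / fps                      # absolute time of frame i, seconds
        m, s = divmod(int(t), 60)
        frames.append((t, f"<ts>{m:02d}:{s:02d}</ts>"))
    return frames


frames = timestamped_frames(duration_s=95.0, fps=1.0)
print(len(frames))        # 95 frames at 1 fps
print(frames[0][1])       # <ts>00:00</ts>
print(frames[-1][1])      # <ts>01:34</ts>
```

Pairing each frame with wall-clock time (rather than a bare frame index) is what lets the model reason about pacing and duration even when the sampling rate changes.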
## Cross-attention RoPE (XRoPE)
|
|
| MOSS-VL utilizes Cross-attention Rotary Position Embedding (XRoPE), tailored to its cross-attention-based vision-language architecture. This mechanism maps text tokens and visual features into a unified 3D coordinate space defined by Time (t), Height (h), and Width (w), improving spatial-temporal grounding during multimodal reasoning. |
|
|
| <p align="center"> |
<img src="assets/3d-rope.png" alt="MOSS-VL XRoPE Architecture Illustration" width="80%"/>
| </p> |
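To make the unified coordinate space concrete, the toy sketch below assigns (t, h, w) indices in an mRoPE-style fashion: text tokens advance all three axes in lockstep, while visual tokens index their frame for the time axis and their grid cell for the spatial axes. The offset convention and axis ordering are assumptions for illustration, not the released XRoPE implementation.

```python
def xrope_positions(num_text_tokens, frames, grid_h, grid_w):
    """Assign (t, h, w) coordinates to a text prefix followed by visual tokens.

    Toy convention (an assumption, not the released scheme): text tokens
    take diagonal positions (i, i, i), so they behave like ordinary 1D
    RoPE; visual tokens start after the text and use frame index for t
    and grid cell for (h, w).
    """
    positions = [(i, i, i) for i in range(num_text_tokens)]
    base = num_text_tokens                   # visual tokens follow the text
    for t in range(frames):
        for h in range(grid_h):
            for w in range(grid_w):
                positions.append((base + t, base + h, base + w))
    return positions


pos = xrope_positions(num_text_tokens=4, frames=2, grid_h=2, grid_w=2)
print(pos[3])    # last text token: (3, 3, 3)
print(pos[4])    # first visual token of frame 0: (4, 4, 4)
print(pos[-1])   # last visual token of frame 1: (5, 5, 5)
```

Two visual tokens in the same frame share a t coordinate but differ in (h, w), so the rotary phase difference between them encodes spatial offset rather than sequence distance.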
|
|
| ## ๐ Quickstart |
| ### ๐ ๏ธ Installation |
|
|
| ```bash |
| conda create -n moss_vl python=3.12 pip -y |
| conda activate moss_vl |
| pip install -i https://pypi.org/simple --no-build-isolation -r requirements.txt |
| ``` |
|
|
| ### ๐ Run Inference |
|
|
| <details> |
| <summary><strong>Single-image offline inference (Python)</strong></summary> |
|
|
| <br> |
|
|
```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

checkpoint = "path/to/checkpoint"
image_path = "data/example_image.jpg"


def load_model(checkpoint: str):
    # Both the processor and the model ship custom code with the
    # checkpoint, so trust_remote_code=True is required.
    processor = AutoProcessor.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        frame_extract_num_threads=1,
    )
    model = AutoModelForCausalLM.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        device_map="auto",
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
    )
    return model, processor


model, processor = load_model(checkpoint)

# Greedy decoding (do_sample=False) over a single native-resolution image.
text = model.offline_image_generate(
    processor,
    prompt="",
    image=image_path,
    shortest_edge=4096,
    longest_edge=16777216,
    multi_image_max_pixels=201326592,
    patch_size=16,
    temporal_patch_size=1,
    merge_size=2,
    image_mean=[0.5, 0.5, 0.5],
    image_std=[0.5, 0.5, 0.5],
    max_new_tokens=256,
    temperature=1.0,
    top_k=50,
    top_p=1.0,
    repetition_penalty=1.0,
    do_sample=False,
    vision_chunked_length=64,
    use_template=False,
)

print(text)
```
|
|
| </details> |
|
|
| <details> |
| <summary><strong>Single-video offline inference (Python)</strong></summary> |
|
|
| <br> |
|
|
```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

checkpoint = "path/to/checkpoint"
video_path = "data/example_video.mp4"


def load_model(checkpoint: str):
    # Both the processor and the model ship custom code with the
    # checkpoint, so trust_remote_code=True is required.
    processor = AutoProcessor.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        frame_extract_num_threads=1,
    )
    model = AutoModelForCausalLM.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        device_map="auto",
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
    )
    return model, processor


model, processor = load_model(checkpoint)

# Frames are sampled at video_fps=1.0 and capped at max_frames=256;
# greedy decoding is enabled via do_sample=False.
text = model.offline_video_generate(
    processor,
    prompt="",
    video=video_path,
    shortest_edge=4096,
    longest_edge=16777216,
    video_max_pixels=201326592,
    patch_size=16,
    temporal_patch_size=1,
    merge_size=2,
    video_fps=1.0,
    min_frames=1,
    max_frames=256,
    num_extract_threads=4,
    image_mean=[0.5, 0.5, 0.5],
    image_std=[0.5, 0.5, 0.5],
    max_new_tokens=256,
    temperature=1.0,
    top_k=50,
    top_p=1.0,
    repetition_penalty=1.0,
    do_sample=False,
    vision_chunked_length=64,
    use_template=False,
)

print(text)
```
|
|
| </details> |
|
|
| <details> |
| <summary><strong>Batched offline inference (Python)</strong></summary> |
|
|
| <br> |
|
|
```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

checkpoint = "path/to/checkpoint"
shared_generate_kwargs = {
    "temperature": 1.0,
    "top_k": 50,
    "top_p": 1.0,
    "max_new_tokens": 256,
    "repetition_penalty": 1.0,
    "do_sample": False,
}
shared_video_media_kwargs = {
    "min_pixels": 4096,
    "max_pixels": 16777216,
    "video_max_pixels": 201326592,
    "video_fps": 1.0,
    "min_frames": 1,
    "max_frames": 256,
}


def load_model(checkpoint: str):
    # Both the processor and the model ship custom code with the
    # checkpoint, so trust_remote_code=True is required.
    processor = AutoProcessor.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        frame_extract_num_threads=1,
    )
    model = AutoModelForCausalLM.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        device_map="auto",
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
    )
    return model, processor


model, processor = load_model(checkpoint)

# Each query carries its own media and generation kwargs; dict() copies
# keep the shared defaults from being mutated per query.
queries = [
    {
        "images": ["data/sample_a.jpg"],
        "generate_kwargs": dict(shared_generate_kwargs),
    },
    {
        "videos": ["data/sample_b.mp4"],
        "media_kwargs": dict(shared_video_media_kwargs),
        "generate_kwargs": dict(shared_generate_kwargs),
    },
]

with torch.no_grad():
    result = model.offline_batch_generate(
        processor,
        queries,
        session_states=None,
        vision_chunked_length=64,
    )

texts = [item["text"] for item in result["results"]]
```
|
|
| </details> |
|
|
## Limitations and Future Work
|
|
| MOSS-VL-Base-0408 is a pretrained base checkpoint, and we are actively improving several core capabilities for future iterations: |
|
|
| - ๐ **Stronger OCR, Especially for Long Documents** โ We plan to further improve text recognition, document parsing, and long-document understanding. A key focus is achieving near-lossless information extraction and understanding for extremely long and structurally complex inputs, such as accurately parsing texts, tables, and mathematical layouts from multi-page academic papers (dozens of pages) or dense PDF reports without degrading context or structural integrity. |
| - ๐ฌ **Expanded Extremely Long Video Understanding** โ We aim to significantly extend the model's capacity for comprehending extremely long videos spanning several hours to dozens of hours. This includes advancing temporal reasoning and cross-frame event tracking for continuous analysis of full-length movies, lengthy meetings, or extended surveillance streams, enabling robust retrieval and understanding over ultra-long visual contexts. |
|
|
| > [!NOTE] |
| > We expect future releases to continue strengthening the base model itself while also enabling stronger downstream aligned variants built on top of it. |
|
|
| ## ๐ Citation |
| ```bibtex |
| @misc{moss_vl_2026, |
| title = {{MOSS-VL Technical Report}}, |
| author = {OpenMOSS Team}, |
| year = {2026}, |
| howpublished = {\url{https://github.com/fnlp-vision/MOSS-VL}}, |
| note = {GitHub repository} |
| } |
| ``` |
|
|