| title: MOSS-VL-Base-0408 | |
| date: 2026-04-08 | |
| category: Multimodal-LLM | |
| status: Base | |
| language: | |
| - en | |
| library_name: transformers | |
| pipeline_tag: video-text-to-text | |
| license: apache-2.0 | |
| tags: | |
| - Base | |
| - Video-Understanding | |
| - Image-Understanding | |
| - MOSS-VL | |
| - OpenMOSS | |
| - multimodal | |
| - video | |
| - vision-language | |
| <p align="center"> | |
| <img src="assets/logo.png" width="320"/> | |
| </p> | |
| # MOSS-VL-Base-0408 | |
| ## Introduction | |
| MOSS-VL-Base-0408 is the foundation checkpoint of the MOSS-VL series, part of the OpenMOSS ecosystem dedicated to advancing visual understanding. | |
| Built through four stages of multimodal pretraining alone, this checkpoint serves as a high-capacity offline multimodal base model. It provides strong general-purpose visual-linguistic representations across image and video inputs and is intended primarily as a starting point for downstream supervised fine-tuning, alignment, and domain adaptation. | |
| Specifically, the pretraining pipeline is structured into the following four progressive stages: | |
| - Stage 1: Vision-language alignment | |
| - Stage 2: Large-scale multimodal pretraining | |
| - Stage 3: High-quality multimodal pretraining | |
| - Stage 4: Annealing and long-context extension | |
| ### Highlights | |
| - **Native Dynamic Resolution**: MOSS-VL-Base-0408 natively processes images and video frames at their original aspect ratios and resolutions. By preserving the raw spatial layout, it faithfully captures fine visual details across diverse formats, from high-resolution photographs and dense document scans to ultra-wide screenshots. | |
| - **Native Interleaved Image & Video Inputs**: The model accepts arbitrary combinations of images and videos within a single sequence. Through a unified end-to-end pipeline, it handles complex mixed-modality prompts, multi-image comparisons, and interleaved visual narratives without requiring modality-specific pre-processing. | |
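To make the native-resolution idea concrete, the sketch below estimates how many visual tokens an image might produce. It assumes a scheme that this card does not confirm: each side is rounded to a multiple of `patch_size * merge_size`, the total pixel count is clamped to a `[min_pixels, max_pixels]` budget while keeping the aspect ratio, and each `merge_size x merge_size` group of patches becomes one token. The function name `estimate_visual_tokens` is hypothetical; the defaults mirror the Quickstart arguments (`patch_size=16`, `merge_size=2`, pixel budgets 4096 and 16777216).

```python
import math


def estimate_visual_tokens(height, width, patch_size=16, merge_size=2,
                           min_pixels=4096, max_pixels=16777216):
    """Return (resized_h, resized_w, num_tokens) under the assumed scheme."""
    factor = patch_size * merge_size  # side lengths become multiples of this
    # Round each side to the nearest multiple of `factor` (at least one unit).
    h = max(factor, round(height / factor) * factor)
    w = max(factor, round(width / factor) * factor)
    if h * w > max_pixels:
        # Shrink both sides by the same ratio so the aspect ratio is preserved.
        scale = math.sqrt(height * width / max_pixels)
        h = max(factor, math.floor(height / scale / factor) * factor)
        w = max(factor, math.floor(width / scale / factor) * factor)
    elif h * w < min_pixels:
        # Enlarge both sides by the same ratio to reach the minimum budget.
        scale = math.sqrt(min_pixels / (height * width))
        h = math.ceil(height * scale / factor) * factor
        w = math.ceil(width * scale / factor) * factor
    # One token per merged group of patches (assumed).
    tokens = (h // factor) * (w // factor)
    return h, w, tokens
```

Under these assumptions a 1080 x 1920 frame is kept at 1088 x 1920 and yields 2040 tokens, while an 8192 x 8192 image is scaled down to 4096 x 4096 and yields 16384 tokens. The real processor may round or clamp differently.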
| ## Model Architecture | |
| **MOSS-VL-Base-0408** adopts a cross-attention-based architecture that decouples visual encoding from cognitive reasoning. Natively supporting interleaved modalities, it provides a multimodal backbone for image and video understanding. | |
| <p align="center"> | |
| <img src="assets/structure.png" alt="MOSS-VL Architecture" width="90%"/> | |
| </p> | |
| ## Absolute Timestamps | |
| To help the model perceive the pacing and duration of events, **MOSS-VL-Base-0408** injects absolute timestamps alongside sampled video frames, giving the reasoning process an explicit temporal reference even at the pretrained base stage. | |
| <p align="center"> | |
| <img src="assets/timestamp_input.svg" alt="Timestamped Sequence Input Illustration" width="90%"/> | |
| </p> | |
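The timestamp injection described above can be sketched roughly as follows: sample frames at a fixed rate, then pair each sampled frame with its absolute time in seconds. The helper names, the uniform sampling rule, and the `<... seconds>` and `<frame>` markers are all assumptions made for illustration; the actual sampling logic and timestamp format are defined by the MOSS-VL processor, not by this snippet. The defaults mirror the Quickstart values (`video_fps=1.0`, `min_frames=1`, `max_frames=256`).

```python
def sample_frame_timestamps(duration_s, video_fps=1.0, min_frames=1, max_frames=256):
    """Return absolute timestamps (seconds) of uniformly sampled frames."""
    # Target one frame every 1/video_fps seconds, clamped to the frame budget.
    num_frames = int(duration_s * video_fps)
    num_frames = max(min_frames, min(max_frames, num_frames))
    # Spread the frames evenly over the clip; each frame sits at the start of its slot.
    step = duration_s / num_frames
    return [round(i * step, 3) for i in range(num_frames)]


def interleave_with_timestamps(timestamps):
    """Build a text-side view of the sequence: a time marker before each frame slot."""
    # "<frame>" stands in for the visual features of one sampled frame (hypothetical).
    return [f"<{t:.1f} seconds><frame>" for t in timestamps]
```

For a 5-second clip this gives frames at 0, 1, 2, 3, and 4 seconds. A 10-minute clip would exceed the 256-frame budget, so the frames are spaced further apart while each still carries its true time offset, which is what lets the model reason about pacing and duration.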
| ## Cross-attention RoPE (XRoPE) | |
| MOSS-VL utilizes Cross-attention Rotary Position Embedding (XRoPE), tailored to its cross-attention-based vision-language architecture. This mechanism maps text tokens and visual features into a unified 3D coordinate space defined by Time (t), Height (h), and Width (w), improving spatial-temporal grounding during multimodal reasoning. | |
| <p align="center"> | |
| <img src="assets/3d-rope.png" alt="MOSS-VL mRoPE Architecture Illustration" width="80%"/> | |
| </p> | |
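The sketch below illustrates what a unified (t, h, w) coordinate space can look like. It assumes, in the style of other multimodal RoPE variants, that each text token takes the same running index on all three axes, and that the patches of one frame share a temporal index while enumerating their grid row and column. Whether XRoPE assigns coordinates in exactly this way is not stated here, and the sketch ignores how cross-attention routes queries and keys. All function names are hypothetical.

```python
def text_positions(start, num_tokens):
    """Text tokens: the same running index on the t, h, and w axes (assumed)."""
    return [(start + i, start + i, start + i) for i in range(num_tokens)]


def frame_positions(t_index, grid_h, grid_w):
    """Visual tokens of one frame: shared t, with (h, w) from the patch grid."""
    return [(t_index, h, w) for h in range(grid_h) for w in range(grid_w)]


def build_positions(segments):
    """Assign (t, h, w) to an interleaved sequence.

    `segments` is a list of ("text", num_tokens) or ("frame", grid_h, grid_w).
    """
    positions, cursor = [], 0
    for seg in segments:
        if seg[0] == "text":
            positions += text_positions(cursor, seg[1])
            cursor += seg[1]
        else:
            _, grid_h, grid_w = seg
            positions += frame_positions(cursor, grid_h, grid_w)
            cursor += 1  # a frame advances the running index by one step (assumed)
    return positions
```

For example, `build_positions([("text", 2), ("frame", 2, 2), ("text", 1)])` places the two text tokens at `(0, 0, 0)` and `(1, 1, 1)`, the four patches at `t = 2` with `(h, w)` running over the 2 x 2 grid, and the final text token at `(3, 3, 3)`.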
| ## Quickstart | |
| ### Installation | |
| ```bash | |
| conda create -n moss_vl python=3.12 pip -y | |
| conda activate moss_vl | |
| pip install -i https://pypi.org/simple --no-build-isolation -r requirements.txt | |
| ``` | |
| ### Run Inference | |
| <details> | |
| <summary><strong>Single-image offline inference (Python)</strong></summary> | |
| <br> | |
| ```python | |
| import torch | |
| from transformers import AutoModelForCausalLM, AutoProcessor | |
| checkpoint = "path/to/checkpoint" | |
| image_path = "data/example_image.jpg" | |
| def load_model(checkpoint: str): | |
|     # trust_remote_code=True loads the custom MOSS-VL model and processor code shipped with the checkpoint. | |
|     processor = AutoProcessor.from_pretrained( | |
|         checkpoint, | |
|         trust_remote_code=True, | |
|         frame_extract_num_threads=1, | |
|     ) | |
|     model = AutoModelForCausalLM.from_pretrained( | |
|         checkpoint, | |
|         trust_remote_code=True, | |
|         device_map="auto", | |
|         torch_dtype=torch.bfloat16, | |
|         attn_implementation="flash_attention_2", | |
|     ) | |
|     return model, processor | |
| model, processor = load_model(checkpoint) | |
| text = model.offline_image_generate( | |
|     processor, | |
|     prompt="", | |
|     image=image_path, | |
|     # Image resizing and patching | |
|     shortest_edge=4096, | |
|     longest_edge=16777216, | |
|     multi_image_max_pixels=201326592, | |
|     patch_size=16, | |
|     temporal_patch_size=1, | |
|     merge_size=2, | |
|     # Pixel normalization | |
|     image_mean=[0.5, 0.5, 0.5], | |
|     image_std=[0.5, 0.5, 0.5], | |
|     # Generation settings | |
|     max_new_tokens=256, | |
|     temperature=1.0, | |
|     top_k=50, | |
|     top_p=1.0, | |
|     repetition_penalty=1.0, | |
|     do_sample=False, | |
|     vision_chunked_length=64, | |
|     # Pretrained base model: no chat template is applied. | |
|     use_template=False, | |
| ) | |
| print(text) | |
| ``` | |
| </details> | |
| <details> | |
| <summary><strong>Single-video offline inference (Python)</strong></summary> | |
| <br> | |
| ```python | |
| import torch | |
| from transformers import AutoModelForCausalLM, AutoProcessor | |
| checkpoint = "path/to/checkpoint" | |
| video_path = "data/example_video.mp4" | |
| def load_model(checkpoint: str): | |
|     # trust_remote_code=True loads the custom MOSS-VL model and processor code shipped with the checkpoint. | |
|     processor = AutoProcessor.from_pretrained( | |
|         checkpoint, | |
|         trust_remote_code=True, | |
|         frame_extract_num_threads=1, | |
|     ) | |
|     model = AutoModelForCausalLM.from_pretrained( | |
|         checkpoint, | |
|         trust_remote_code=True, | |
|         device_map="auto", | |
|         torch_dtype=torch.bfloat16, | |
|         attn_implementation="flash_attention_2", | |
|     ) | |
|     return model, processor | |
| model, processor = load_model(checkpoint) | |
| text = model.offline_video_generate( | |
|     processor, | |
|     prompt="", | |
|     video=video_path, | |
|     # Frame resizing and patching | |
|     shortest_edge=4096, | |
|     longest_edge=16777216, | |
|     video_max_pixels=201326592, | |
|     patch_size=16, | |
|     temporal_patch_size=1, | |
|     merge_size=2, | |
|     # Frame sampling | |
|     video_fps=1.0, | |
|     min_frames=1, | |
|     max_frames=256, | |
|     num_extract_threads=4, | |
|     # Pixel normalization | |
|     image_mean=[0.5, 0.5, 0.5], | |
|     image_std=[0.5, 0.5, 0.5], | |
|     # Generation settings | |
|     max_new_tokens=256, | |
|     temperature=1.0, | |
|     top_k=50, | |
|     top_p=1.0, | |
|     repetition_penalty=1.0, | |
|     do_sample=False, | |
|     vision_chunked_length=64, | |
|     # Pretrained base model: no chat template is applied. | |
|     use_template=False, | |
| ) | |
| print(text) | |
| ``` | |
| </details> | |
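The numeric arguments in the two examples above are easier to read as powers of two. The batched example passes the same two bounds as `min_pixels` and `max_pixels`, which suggests they are pixel-count budgets rather than edge lengths. The sketch below writes that arithmetic out and, assuming that `video_max_pixels` caps the total pixel count summed over all sampled frames (this card does not spell out the exact semantics), derives the per-frame budget when all `max_frames` are used. `per_frame_pixel_budget` is a hypothetical helper, not part of the MOSS-VL API.

```python
def per_frame_pixel_budget(video_max_pixels=201326592, num_frames=256,
                           longest_edge=16777216):
    """Pixels available to each frame if the video budget is shared evenly (assumed)."""
    shared = video_max_pixels // num_frames
    # A single frame is also assumed never to exceed the per-image ceiling.
    return min(shared, longest_edge)


# The Quickstart constants, written out:
assert 4096 == 64 * 64                # lower pixel bound, e.g. a 64 x 64 image
assert 16777216 == 4096 * 4096        # upper pixel bound, e.g. a 4096 x 4096 image
assert 201326592 == 12 * 16777216     # video budget = 12 full-size frames

budget = per_frame_pixel_budget()     # 786432 pixels, e.g. 1024 x 768 per frame
```

Read this way, a video that uses all 256 frames leaves roughly a 1024 x 768 budget per frame, while clips of 12 frames or fewer could keep every frame at the full per-image ceiling.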
| <details> | |
| <summary><strong>Batched offline inference (Python)</strong></summary> | |
| <br> | |
| ```python | |
| import torch | |
| from transformers import AutoModelForCausalLM, AutoProcessor | |
| checkpoint = "path/to/checkpoint" | |
| # Generation settings shared by every query in the batch. | |
| shared_generate_kwargs = { | |
|     "temperature": 1.0, | |
|     "top_k": 50, | |
|     "top_p": 1.0, | |
|     "max_new_tokens": 256, | |
|     "repetition_penalty": 1.0, | |
|     "do_sample": False, | |
| } | |
| # Pixel budgets and frame sampling applied to video queries. | |
| shared_video_media_kwargs = { | |
|     "min_pixels": 4096, | |
|     "max_pixels": 16777216, | |
|     "video_max_pixels": 201326592, | |
|     "video_fps": 1.0, | |
|     "min_frames": 1, | |
|     "max_frames": 256, | |
| } | |
| def load_model(checkpoint: str): | |
|     # trust_remote_code=True loads the custom MOSS-VL model and processor code shipped with the checkpoint. | |
|     processor = AutoProcessor.from_pretrained( | |
|         checkpoint, | |
|         trust_remote_code=True, | |
|         frame_extract_num_threads=1, | |
|     ) | |
|     model = AutoModelForCausalLM.from_pretrained( | |
|         checkpoint, | |
|         trust_remote_code=True, | |
|         device_map="auto", | |
|         torch_dtype=torch.bfloat16, | |
|         attn_implementation="flash_attention_2", | |
|     ) | |
|     return model, processor | |
| model, processor = load_model(checkpoint) | |
| # Each query carries its own media and a private copy of the shared settings. | |
| queries = [ | |
|     { | |
|         "images": ["data/sample_a.jpg"], | |
|         "generate_kwargs": dict(shared_generate_kwargs), | |
|     }, | |
|     { | |
|         "videos": ["data/sample_b.mp4"], | |
|         "media_kwargs": dict(shared_video_media_kwargs), | |
|         "generate_kwargs": dict(shared_generate_kwargs), | |
|     }, | |
| ] | |
| # Inference only, so gradient tracking is disabled. | |
| with torch.no_grad(): | |
|     result = model.offline_batch_generate( | |
|         processor, | |
|         queries, | |
|         session_states=None, | |
|         vision_chunked_length=64, | |
|     ) | |
| # One generated string per query, in input order. | |
| texts = [item["text"] for item in result["results"]] | |
| ``` | |
| </details> | |
| ## Limitations and Future Work | |
| MOSS-VL-Base-0408 is a pretrained base checkpoint, and we are actively improving several core capabilities for future iterations: | |
| - **Stronger OCR, Especially for Long Documents**: We plan to further improve text recognition, document parsing, and long-document understanding. A key focus is near-lossless information extraction from extremely long and structurally complex inputs, such as accurately parsing text, tables, and mathematical layouts from multi-page academic papers (dozens of pages) or dense PDF reports without losing context or structural integrity. | |
| - **Expanded Extremely Long Video Understanding**: We aim to significantly extend the model's capacity for comprehending extremely long videos, from several hours to dozens of hours. This includes advancing temporal reasoning and cross-frame event tracking for continuous analysis of full-length movies, lengthy meetings, or extended surveillance streams, enabling robust retrieval and understanding over ultra-long visual contexts. | |
| > [!NOTE] | |
| > We expect future releases to continue strengthening the base model itself while also enabling stronger downstream aligned variants built on top of it. | |
| ## Citation | |
| ```bibtex | |
| @misc{moss_vl_2026, | |
| title = {{MOSS-VL Technical Report}}, | |
| author = {OpenMOSS Team}, | |
| year = {2026}, | |
| howpublished = {\url{https://github.com/fnlp-vision/MOSS-VL}}, | |
| note = {GitHub repository} | |
| } | |
| ``` | |