---
title: MOSS-VL
emoji: 🌱
colorFrom: green
colorTo: blue
sdk: gradio
sdk_version: 5.50.0
python_version: "3.10"
app_file: app.py
pinned: false
short_description: 'MOSS-VL: Toward Advanced Video Understanding'
license: apache-2.0
models:
- OpenMOSS-Team/MOSS-VL-Instruct-0408
tags:
- vision-language
- multimodal
- image-understanding
- video-understanding
---

# MOSS-VL-Instruct-0408 Demo

An interactive demo for **MOSS-VL-Instruct-0408**, an 11B-parameter instruction-tuned vision-language model developed by the [OpenMOSS Team](https://github.com/OpenMOSS). Built on MOSS-VL-Base-0408 through supervised fine-tuning, it serves as a high-performance offline multimodal engine with particular strength in **video understanding**.

## Highlights

- **Outstanding Video Understanding** — Long-form video comprehension, temporal reasoning, action recognition, and second-level event localization. Top-tier results on VideoMME and MLVU, with +8.3 pts on VSI-bench over Qwen3-VL-8B-Instruct.
- **Strong General Multimodal Perception** — Robust image understanding, fine-grained object recognition, OCR, and document parsing (83.9 on document/OCR benchmarks).
- **Reliable Instruction Following** — Enhanced alignment with user intent through supervised fine-tuning on diverse multimodal instruction data.
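Second-level event localization requires knowing when each sampled frame occurs in the source video. The sketch below shows one simple way to pair uniformly sampled frames with absolute timestamps; the function name and sampling scheme are illustrative assumptions, not the model's actual preprocessing.

```python
# Hypothetical sketch: uniformly sample frame indices from a video and
# pair each with its absolute timestamp in seconds. Illustrative only;
# not the model's actual frame-sampling pipeline.

def sample_frames_with_timestamps(total_frames, fps, num_samples):
    """Pick evenly spaced frame indices and compute their timestamps."""
    if num_samples >= total_frames:
        indices = list(range(total_frames))
    else:
        step = total_frames / num_samples
        indices = [int(i * step) for i in range(num_samples)]
    # Absolute timestamp (seconds) of each sampled frame.
    return [(idx, idx / fps) for idx in indices]

# Example: a 10-second clip at 30 fps, sampling 5 frames.
samples = sample_frames_with_timestamps(total_frames=300, fps=30, num_samples=5)
# → [(0, 0.0), (60, 2.0), (120, 4.0), (180, 6.0), (240, 8.0)]
```

Attaching the timestamp to each frame, rather than only its ordinal position, is what lets a model answer "when" questions in seconds rather than frame counts.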
## Architecture

MOSS-VL adopts a **cross-attention-based architecture** that decouples visual encoding from cognitive reasoning:

- Millisecond-level latency for instantaneous responses
- Natively supports **interleaved modalities** — processes complex sequences of images and videos within a unified pipeline
- **Absolute Timestamps** injected alongside each sampled frame for precise temporal perception
- **Cross-attention RoPE (XRoPE)** — maps text tokens and video patches into a unified 3D coordinate space (time, height, width)

## Capabilities

- **Image Understanding**: scene description, object recognition, visual reasoning
- **Video Understanding**: temporal reasoning, action recognition, key event localization
- **OCR & Document Parsing**: text extraction and structured document parsing
- **Visual Question Answering**: open-ended questions about any image or video

## Usage

1. Upload an **image** or **video** using the input panel, or pick one of the example prompts on the welcome screen
2. Enter your question or prompt in the text box
3. (Optional) Adjust generation parameters in the sidebar's **Generation Settings**
4. Press **Enter** or click **Send** to get the model's response

> **Note**: The model weights (~22 GB) may take a few minutes to load on first use (cold start). Subsequent requests will be faster.

## Model Details

- **Model**: [OpenMOSS-Team/MOSS-VL-Instruct-0408](https://huggingface.co/OpenMOSS-Team/MOSS-VL-Instruct-0408)
- **Parameters**: 11B (BF16)
- **Base Model**: MOSS-VL-Base-0408
- **License**: Apache 2.0

## Citation

```bibtex
@misc{moss_vl_2026,
  title = {{MOSS-VL Technical Report}},
  author = {OpenMOSS Team},
  year = {2026},
  howpublished = {\url{https://github.com/OpenMOSS/MOSS-VL}},
  note = {GitHub repository}
}
```
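## Programmatic Use (Sketch)

The interactive usage steps map onto the chat-style message format that many transformers vision-language models accept. The exact schema MOSS-VL expects is an assumption here; the helper and field names below are illustrative, so consult the model card for the authoritative input format.

```python
# Hypothetical sketch of an interleaved video + text user turn, following
# the chat-message convention common to transformers vision-language
# models. The exact schema MOSS-VL expects is an assumption.

def build_video_prompt(video_path, question):
    """Assemble a single user turn mixing a video and a text question."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "video", "video": video_path},
                {"type": "text", "text": question},
            ],
        }
    ]

messages = build_video_prompt("clip.mp4", "When does the goal happen?")
```

Because the content field is a list, the same structure extends naturally to the interleaved image-and-video sequences the architecture supports.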