---
title: MOSS-VL
emoji: π±
colorFrom: green
colorTo: blue
sdk: gradio
sdk_version: 5.50.0
python_version: "3.10"
app_file: app.py
pinned: false
short_description: 'MOSS-VL: Toward Advanced Video Understanding'
license: apache-2.0
models:
  - OpenMOSS-Team/MOSS-VL-Instruct-0408
tags:
  - vision-language
  - multimodal
  - image-understanding
  - video-understanding
---
# MOSS-VL-Instruct-0408 Demo

An interactive demo for **MOSS-VL-Instruct-0408**, an 11B-parameter instruction-tuned vision-language model developed by the [OpenMOSS Team](https://github.com/OpenMOSS). Built on MOSS-VL-Base-0408 through supervised fine-tuning, it serves as a high-performance offline multimodal engine with particular strength in **video understanding**.
## Highlights

- **Outstanding Video Understanding** – Long-form video comprehension, temporal reasoning, action recognition, and second-level event localization. Top-tier results on VideoMME and MLVU, with +8.3 pts on VSI-bench over Qwen3-VL-8B-Instruct.
- **Strong General Multimodal Perception** – Robust image understanding, fine-grained object recognition, OCR, and document parsing (83.9 on document/OCR benchmarks).
- **Reliable Instruction Following** – Enhanced alignment with user intent through supervised fine-tuning on diverse multimodal instruction data.
## Architecture

MOSS-VL adopts a **cross-attention-based architecture** that decouples visual encoding from cognitive reasoning:

- Millisecond-level latency for instantaneous responses
- Natively supports **interleaved modalities** – processes complex sequences of images and videos within a unified pipeline
- **Absolute Timestamps** injected alongside each sampled frame for precise temporal perception
- **Cross-attention RoPE (XRoPE)** – maps text tokens and video patches into a unified 3D coordinate space (time, height, width)
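The timestamp-injection idea above can be sketched in a few lines. This is a minimal illustration, not the model's actual preprocessing: the sampling rate and the `MM:SS.s` timestamp format are assumptions.

```python
def sample_frames_with_timestamps(duration_s: float, video_fps: float,
                                  sample_fps: float = 1.0) -> list[tuple[int, str]]:
    """Pick frame indices at `sample_fps` and pair each with an absolute
    timestamp string, so the model can localize events in wall-clock time.

    Returns (frame_index, "MM:SS.s") pairs. The exact timestamp format
    the model expects is an assumption here.
    """
    pairs = []
    t = 0.0
    while t < duration_s:
        frame_idx = int(round(t * video_fps))        # nearest source frame
        minutes, seconds = divmod(t, 60.0)
        pairs.append((frame_idx, f"{int(minutes):02d}:{seconds:04.1f}"))
        t += 1.0 / sample_fps                        # advance by sampling period
    return pairs

# e.g. a 3-second clip at 30 fps, sampled at 1 frame per second
print(sample_frames_with_timestamps(3.0, 30.0))
# → [(0, '00:00.0'), (30, '00:01.0'), (60, '00:02.0')]
```

Pairing each sampled frame with its absolute time (rather than only a frame index) is what makes second-level event localization possible downstream.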
## Capabilities

- **Image Understanding**: scene description, object recognition, visual reasoning
- **Video Understanding**: temporal reasoning, action recognition, key event localization
- **OCR & Document Parsing**: text extraction and structured document parsing
- **Visual Question Answering**: open-ended questions about any image or video
## Usage

1. Upload an **image** or **video** using the input panel, or pick one of the example prompts on the welcome screen
2. Enter your question or prompt in the text box
3. (Optional) Adjust generation parameters in the sidebar's **Generation Settings**
4. Press **Enter** or click **Send** to get the model's response

> **Note**: The model weights (~22 GB) may take a few minutes to load on first use (cold start). Subsequent requests will be faster.
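The upload-and-ask flow above amounts to sending the model one multimodal chat turn. A minimal sketch of what such a payload could look like follows; the message schema and parameter names are assumptions modeled on common vision-language chat templates, not necessarily what `app.py` uses.

```python
def build_request(prompt: str, media_path: str, media_type: str = "image",
                  max_new_tokens: int = 512, temperature: float = 0.7) -> dict:
    """Assemble a chat-style request: one user turn carrying the media
    plus the text prompt, with generation settings alongside.

    The schema and parameter names here are illustrative assumptions,
    not the demo's actual API.
    """
    if media_type not in ("image", "video"):
        raise ValueError(f"unsupported media type: {media_type}")
    return {
        "messages": [{
            "role": "user",
            "content": [
                {"type": media_type, "path": media_path},  # step 1: the upload
                {"type": "text", "text": prompt},          # step 2: the question
            ],
        }],
        # step 3: the sidebar's generation settings
        "generation": {"max_new_tokens": max_new_tokens,
                       "temperature": temperature},
    }

req = build_request("What happens at 00:12?", "clip.mp4", media_type="video")
```

Interleaving the media item and the text inside one `content` list mirrors the interleaved-modality pipeline described in the Architecture section.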
## Model Details

- **Model**: [OpenMOSS-Team/MOSS-VL-Instruct-0408](https://huggingface.co/OpenMOSS-Team/MOSS-VL-Instruct-0408)
- **Parameters**: 11B (BF16)
- **Base Model**: MOSS-VL-Base-0408
- **License**: Apache 2.0
## Citation

```bibtex
@misc{moss_vl_2026,
  title        = {{MOSS-VL Technical Report}},
  author       = {OpenMOSS Team},
  year         = {2026},
  howpublished = {\url{https://github.com/OpenMOSS/MOSS-VL}},
  note         = {GitHub repository}
}
```