---
title: MOSS-VL
emoji: π±
colorFrom: green
colorTo: blue
sdk: gradio
sdk_version: 5.50.0
python_version: "3.10"
app_file: app.py
pinned: false
short_description: 'MOSS-VL: Toward Advanced Video Understanding'
license: apache-2.0
models:
  - OpenMOSS-Team/MOSS-VL-Instruct-0408
tags:
  - vision-language
  - multimodal
  - image-understanding
  - video-understanding
---
# MOSS-VL-Instruct-0408 Demo

An interactive demo for **MOSS-VL-Instruct-0408**, an 11B-parameter instruction-tuned vision-language model developed by the [OpenMOSS Team](https://github.com/OpenMOSS). Built on MOSS-VL-Base-0408 through supervised fine-tuning, it serves as a high-performance offline multimodal engine with particular strength in **video understanding**.
## Highlights

- **Outstanding Video Understanding** – Long-form video comprehension, temporal reasoning, action recognition, and second-level event localization. Top-tier results on VideoMME and MLVU, with +8.3 pts on VSI-bench over Qwen3-VL-8B-Instruct.
- **Strong General Multimodal Perception** – Robust image understanding, fine-grained object recognition, OCR, and document parsing (83.9 on document/OCR benchmarks).
- **Reliable Instruction Following** – Enhanced alignment with user intent through supervised fine-tuning on diverse multimodal instruction data.
## Architecture

MOSS-VL adopts a **cross-attention-based architecture** that decouples visual encoding from cognitive reasoning:

- Millisecond-level latency for instantaneous responses
- Natively supports **interleaved modalities** – processes complex sequences of images and videos within a unified pipeline
- **Absolute Timestamps** injected alongside each sampled frame for precise temporal perception
- **Cross-attention RoPE (XRoPE)** – maps text tokens and video patches into a unified 3D coordinate space (time, height, width)
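The timestamp-injection idea above can be sketched in a few lines. This is a minimal illustration, not the model's actual preprocessing: the sampling rate and the `MM:SS.s` timestamp format are assumptions.

```python
def sample_frames_with_timestamps(duration_s: float, video_fps: float,
                                  sample_fps: float = 1.0) -> list[tuple[int, str]]:
    """Pick frame indices at `sample_fps` and pair each with an absolute
    timestamp string, so the model can localize events in wall-clock time.

    Returns (frame_index, "MM:SS.s") pairs. The exact timestamp format
    the model expects is an assumption here.
    """
    pairs = []
    t = 0.0
    while t < duration_s:
        frame_idx = int(round(t * video_fps))        # nearest source frame
        minutes, seconds = divmod(t, 60.0)
        pairs.append((frame_idx, f"{int(minutes):02d}:{seconds:04.1f}"))
        t += 1.0 / sample_fps                        # advance by sampling period
    return pairs

# e.g. a 3-second clip at 30 fps, sampled at 1 frame per second
print(sample_frames_with_timestamps(3.0, 30.0))
# → [(0, '00:00.0'), (30, '00:01.0'), (60, '00:02.0')]
```

Pairing each sampled frame with its absolute time (rather than only a frame index) is what makes second-level event localization possible downstream.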
## Capabilities

- **Image Understanding**: scene description, object recognition, visual reasoning
- **Video Understanding**: temporal reasoning, action recognition, key event localization
- **OCR & Document Parsing**: text extraction and structured document parsing
- **Visual Question Answering**: open-ended questions about any image or video
## Usage

1. Upload an **image** or **video** using the input panel, or pick one of the example prompts on the welcome screen
2. Enter your question or prompt in the text box
3. (Optional) Adjust generation parameters in the sidebar's **Generation Settings**
4. Press **Enter** or click **Send** to get the model's response

> **Note**: The model weights (~22 GB) may take a few minutes to load on first use (cold start). Subsequent requests will be faster.
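The upload-and-ask flow above amounts to sending the model one multimodal chat turn. A minimal sketch of what such a payload could look like follows; the message schema and parameter names are assumptions modeled on common vision-language chat templates, not necessarily what `app.py` uses.

```python
def build_request(prompt: str, media_path: str, media_type: str = "image",
                  max_new_tokens: int = 512, temperature: float = 0.7) -> dict:
    """Assemble a chat-style request: one user turn carrying the media
    plus the text prompt, with generation settings alongside.

    The schema and parameter names here are illustrative assumptions,
    not the demo's actual API.
    """
    if media_type not in ("image", "video"):
        raise ValueError(f"unsupported media type: {media_type}")
    return {
        "messages": [{
            "role": "user",
            "content": [
                {"type": media_type, "path": media_path},  # step 1: the upload
                {"type": "text", "text": prompt},          # step 2: the question
            ],
        }],
        # step 3: the sidebar's generation settings
        "generation": {"max_new_tokens": max_new_tokens,
                       "temperature": temperature},
    }

req = build_request("What happens at 00:12?", "clip.mp4", media_type="video")
```

Interleaving the media item and the text inside one `content` list mirrors the interleaved-modality pipeline described in the Architecture section.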
## Model Details

- **Model**: [OpenMOSS-Team/MOSS-VL-Instruct-0408](https://huggingface.co/OpenMOSS-Team/MOSS-VL-Instruct-0408)
- **Parameters**: 11B (BF16)
- **Base Model**: MOSS-VL-Base-0408
- **License**: Apache 2.0
## Citation

```bibtex
@misc{moss_vl_2026,
  title        = {{MOSS-VL Technical Report}},
  author       = {OpenMOSS Team},
  year         = {2026},
  howpublished = {\url{https://github.com/OpenMOSS/MOSS-VL}},
  note         = {GitHub repository}
}
```