Upload folder using huggingface_hub

- .gitattributes +2 -0
- README.md +45 -55
- assets/MOSS-VL-benchmark.png +3 -0
- assets/benchmark_table.png +3 -0
.gitattributes
CHANGED

````diff
@@ -39,3 +39,5 @@ assets/logo.png filter=lfs diff=lfs merge=lfs -text
 assets/structure.png filter=lfs diff=lfs merge=lfs -text
 tokenizer.json filter=lfs diff=lfs merge=lfs -text
 assets/MOSS-VL-Benchmark.png filter=lfs diff=lfs merge=lfs -text
+assets/MOSS-VL-benchmark.png filter=lfs diff=lfs merge=lfs -text
+assets/benchmark_table.png filter=lfs diff=lfs merge=lfs -text
````
README.md
CHANGED

````diff
@@ -1,6 +1,6 @@
 ---
 title: MOSS-VL-Instruct-0408
-date:
 category: Multimodal-LLM
 status: SFT
 language:
@@ -8,8 +8,7 @@ language:
 library_name: transformers
 pipeline_tag: video-text-to-text
 license: apache-2.0
-base_model:
-- fnlp-vision/MOSS-VL-Base-0408
 tags:
 - SFT
 - Video-Understanding
````
````diff
@@ -29,51 +28,45 @@ tags:
 
 ## Introduction
 
-
-> This is an **SFT** checkpoint (instruction-tuned). It is **NOT** the Real-Time SFT streaming checkpoint.
-
-This model is designed as a high-performance offline engine for multimodal tasks, bridging the gap between static image understanding and dynamic real-time interaction.
 
-###
-
-
-
-| Feature | Status | Description |
-| :--- | :---: | :--- |
-| **Model Loading** | ✅ | Standard HF loading with `trust_remote_code=True` |
-| **Image Understanding** | ✅ | Single/Multi-image input support |
-| **Video Understanding** | ✅ | Native video frame sequence processing |
-| **Mixed Inference** | ✅ | Interleaved image and video inputs |
-| **Offline Generation** | ✅ | Optimized `offline_generate` & `offline_batch_generate` |
-| **Benchmarks/Metrics** | ⏳ | Coming in future updates |
 
 ---
 
 ## Model Architecture
 
-**MOSS-VL-Instruct-0408** adopts a [...]
 <p align="center">
   <img src="assets/structure.png" alt="MOSS-VL Architecture" width="90%"/>
 </p>
 
-## [...]
 
-At the model-family level, MOSS-VL uses timestamp-aware multimodal prompting for video understanding. This design gives sampled frames explicit temporal anchors, which helps the model reason about order, duration, and event localization more robustly.
 <p align="center">
   <img src="assets/timestamp_input.svg" alt="Timestamped Sequence Input Illustration" width="90%"/>
 </p>
 
-## [...]
 
-MOSS-VL uses multimodal rotary position encoding to align text tokens and visual features in a shared spatial-temporal coordinate system. At a high level, this improves video-text grounding and helps preserve temporal structure during multimodal reasoning.
 <p align="center">
   <img src="assets/3d-rope.png" alt="MOSS-VL mRoPE Architecture Illustration" width="80%"/>
 </p>
````
````diff
@@ -81,9 +74,20 @@ MOSS-VL uses multimodal rotary position encoding to align text tokens and visual
 
 ## Model Performance
 
-We [...]
 <p align="center">
-  <img src="assets/ [...]
 </p>
 
````
````diff
@@ -294,29 +298,9 @@ with torch.no_grad():
 
 </details>
 
-## Intended Use
-
-- offline image understanding
-- offline video understanding
-- multimodal prompt experiments for release validation
-- checkpoint-level inference integration and debugging
 
 ## Requirements
 
-Core validated inference dependencies:
-
-- `python==3.12.13`
-- `torch==2.8.0+cu128`
-- `torchvision==0.23.0+cu128`
-- `transformers==4.57.1`
-- `accelerate==1.12.0`
-- `flash_attn==2.8.1`
-- `torchcodec==0.7.0`
-- `numpy==2.4.3`
-- `pillow==12.1.1`
-- `joblib==1.5.2`
-- `einops==0.8.2`
-
 Installation commands:
 
 ```bash
````
````diff
@@ -333,12 +317,18 @@ Validated setup notes:
 
 ## Limitations and Future Work
 
-- [...]
-
-- [...]
-- [...]
-- [...]
-- [...]
 
 
 ## Citation
````
````diff
@@ -350,4 +340,4 @@
   howpublished = {\url{https://github.com/fnlp-vision/MOSS-VL}},
   note = {GitHub repository}
 }
-```
````
Updated README.md:

---
title: MOSS-VL-Instruct-0408
date: 2026-04-08
category: Multimodal-LLM
status: SFT
language:
[...]
library_name: transformers
pipeline_tag: video-text-to-text
license: apache-2.0
base_model: fnlp-vision/MOSS-VL-Base-0408
tags:
- SFT
- Video-Understanding
[...]
## Introduction

MOSS-VL-Instruct-0408 is the instruction-tuned checkpoint of the MOSS-VL series, part of the OpenMOSS ecosystem dedicated to advancing open multimodal foundation models.

Built on top of MOSS-VL-Base-0408 through supervised fine-tuning (SFT), this checkpoint is designed as a high-performance offline multimodal engine. It delivers strong, well-rounded performance across the full spectrum of vision-language tasks, including image understanding, OCR, document parsing, visual reasoning, and instruction following, and is particularly outstanding at video understanding, from long-form comprehension to fine-grained temporal reasoning and action recognition.
### Highlights

- **Outstanding Video Understanding**: A core strength of MOSS-VL. The model excels at long-form video comprehension, temporal reasoning, action recognition, and second-level event localization, delivering top-tier results on benchmarks such as VideoMME, MLVU, and EgoSchema.
- **Strong General Multimodal Perception**: Robust image understanding, fine-grained object recognition, OCR, and document parsing.
- **Reliable Instruction Following**: Substantially improved alignment with user intent through supervised fine-tuning on diverse multimodal instruction data.
### Note on Variants

> [!IMPORTANT]
> **This is the offline instruction-tuned checkpoint.** It is not the streaming variant. If you are looking for low-latency, real-time interactive video understanding, please refer to the upcoming **MOSS-VL-RealTime** release.
---

## Model Architecture

**MOSS-VL-Instruct-0408** adopts a cross-attention-based architecture that decouples visual encoding from cognitive reasoning. This design drives latency down to the **millisecond level**, enabling instantaneous responses to dynamic video streams. Natively supporting **interleaved modalities**, it processes complex sequences of images and videos within a unified pipeline, eliminating the need for heavy pre-processing.

<p align="center">
  <img src="assets/structure.png" alt="MOSS-VL Architecture" width="90%"/>
</p>
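To make the decoupled design concrete, here is a minimal, self-contained sketch of single-head cross-attention in plain Python, where visual features (encoded once and cached) are queried by text states. This is an illustrative toy under stated assumptions, not MOSS-VL's actual implementation:

```python
import math

def cross_attention(text_q, vis_k, vis_v):
    """Toy single-head cross-attention: each text query attends over
    visual keys and mixes the corresponding visual values. In a
    decoupled design, vis_k / vis_v are computed once per video and reused."""
    d = len(text_q[0])
    out = []
    for q in text_q:
        # Scaled dot-product scores against every visual key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in vis_k]
        # Numerically stable softmax over the visual tokens.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        # Weighted sum of visual values.
        out.append([sum(w * v[j] for w, v in zip(weights, vis_v))
                    for j in range(len(vis_v[0]))])
    return out

# A text query aligned with the first visual token attends to it most.
mixed = cross_attention(text_q=[[1.0, 0.0]],
                        vis_k=[[1.0, 0.0], [0.0, 1.0]],
                        vis_v=[[1.0, 0.0], [0.0, 1.0]])
```

Because the visual keys and values are independent of the text, re-asking a new question about the same video only re-runs the cheap attention step, which is where the latency benefit comes from.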
## Absolute Timestamps

To ensure the model accurately perceives the pacing and duration of events, **MOSS-VL-Instruct-0408** injects **absolute timestamps** alongside each sampled frame, grounding the reasoning process in a **precise temporal reference**.

<p align="center">
  <img src="assets/timestamp_input.svg" alt="Timestamped Sequence Input Illustration" width="90%"/>
</p>
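As a rough illustration of timestamp injection, the sketch below interleaves absolute timestamps with frame placeholders. The `[Ns]` / `<frame_i>` tag format and the helper name are assumptions for illustration only, not the model's actual chat template:

```python
def build_timestamped_prompt(duration_s: float, num_frames: int, question: str) -> str:
    """Uniformly sample frames and prefix each placeholder with its
    absolute timestamp, so the model sees explicit temporal anchors.
    The [Ns] / <frame_i> format here is hypothetical."""
    step = duration_s / num_frames
    parts = []
    for i in range(num_frames):
        t = i * step  # absolute time (seconds) of the sampled frame
        parts.append(f"[{t:.1f}s] <frame_{i}>")
    parts.append(question)
    return "\n".join(parts)

prompt = build_timestamped_prompt(60.0, 4, "When does the person stand up?")
# For a 60 s clip with 4 frames, anchors land at 0.0s, 15.0s, 30.0s, 45.0s.
```

With explicit anchors like these, questions about order, duration, or second-level event localization can be answered against absolute time rather than frame indices.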
## Cross-attention RoPE (XRoPE)

MOSS-VL utilizes Cross-attention Rotary Position Embedding (XRoPE), tailored to its cross-attention-based vision-language architecture. This mechanism maps text tokens and video patches into a unified 3D coordinate space defined by Time (t), Height (h), and Width (w).

<p align="center">
  <img src="assets/3d-rope.png" alt="MOSS-VL mRoPE Architecture Illustration" width="80%"/>
</p>
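In the spirit of such 3D position schemes, the sketch below assigns (t, h, w) indices to a text prefix followed by a grid of video patches. The indexing convention is an assumption for illustration, not the real XRoPE implementation:

```python
def position_ids_3d(num_text_tokens: int, frames: int, grid_h: int, grid_w: int):
    """Assign (t, h, w) position indices. Text tokens advance all three
    axes together (so RoPE reduces to the ordinary 1D case); each video
    patch is indexed by its frame and spatial grid coordinates, with the
    whole grid offset past the text prefix. Convention is hypothetical."""
    ids = [(i, i, i) for i in range(num_text_tokens)]
    base = num_text_tokens
    for t in range(frames):
        for h in range(grid_h):
            for w in range(grid_w):
                ids.append((base + t, base + h, base + w))
    return ids

ids = position_ids_3d(num_text_tokens=2, frames=2, grid_h=2, grid_w=2)
# Text tokens: (0,0,0), (1,1,1); first patch of frame 0: (2,2,2).
```

Keeping the three axes synchronized on text while splitting them across time and space for video is what lets one rotary scheme serve both modalities in a shared coordinate system.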
## Model Performance

We conducted a comprehensive evaluation of **MOSS-VL-Instruct-0408** across four key dimensions: Multimodal Perception, Document/OCR, Multimodal Reasoning, and Video Understanding. The results demonstrate that MOSS-VL achieves outstanding performance, particularly excelling in **general multimodal perception** and **complex video analysis**.

### Key Highlights

* **Leading Video Intelligence**: MOSS-VL achieves a score of **65.8** in Video Understanding, significantly outperforming Qwen3-VL (+2 pts). It shows exceptional temporal consistency and action recognition capabilities across benchmarks like `VideoMME`, `MLVU`, `EgoSchema`, and `VSI-bench` (where it outperforms **Qwen3-VL-8B-Instruct** by **8.3 points**).
* **Outstanding Multimodal Perception**: With a score of **75.1**, MOSS-VL delivers excellent general image-text understanding, shining in fine-grained object recognition and spatial reasoning on benchmarks like `BLINK` and `MMBench`.
* **Robust Multimodal Reasoning**: Achieving **64.3**, MOSS-VL demonstrates solid logical inference, staying highly competitive with the latest Qwen series on challenging reasoning suites such as `CVBench` and `VisuLogic`.
* **Reliable Document Understanding**: While the model is primarily optimized for general perception and video, MOSS-VL still delivers **83.9** on OCR and document analysis, ensuring dependable extraction of text and structured information.
<p align="center">
  <img src="assets/benchmark_table.png" alt="MOSS-VL Benchmark Table" width="100%"/>
</p>

<p align="center">
  <img src="assets/MOSS-VL-benchmark.png" alt="MOSS-VL Benchmark Results" width="100%"/>
</p>
[...]

</details>
## Requirements

[...]

Installation commands:

```bash
[...]
```
## Limitations and Future Work

MOSS-VL-Instruct-0408 represents an early milestone in the MOSS-VL roadmap, and we are actively working on several directions to push it further:

- **Math & Code Reasoning**: While the current checkpoint already exhibits solid general reasoning, we plan to substantially strengthen its mathematical reasoning and code understanding capabilities, especially in multimodal contexts (e.g., reasoning over diagrams, charts, screenshots, and code-bearing visual inputs).
- **Real-Time Streaming Variant**: The upcoming **MOSS-VL-RealTime** checkpoint will extend MOSS-VL to low-latency, streaming video understanding, enabling interactive applications such as live video chat, real-time event detection, and online assistants, complementing this offline checkpoint.
- **RL Post-Training**: We are working on a reinforcement learning post-training stage to further align the model with human preferences and to unlock stronger multi-step reasoning behaviors on top of the SFT foundation.
- **Longer Context for Hour-Scale Video**: Continuing to push context scaling so the model can comfortably handle hour-scale and multi-hour videos with consistent temporal grounding.
- **Audio Modality Integration**: Bringing audio understanding into the pipeline, so MOSS-VL can jointly reason over the visual and acoustic streams of a video: speech, ambient sound, music, and their interaction with on-screen events.
- **Parameter Scaling**: Releasing additional model sizes across the MOSS-VL series to cover a wider range of compute budgets and deployment scenarios.

> [!NOTE]
> We welcome community feedback and contributions on any of these directions.
## Citation

```bibtex
[...]
  howpublished = {\url{https://github.com/fnlp-vision/MOSS-VL}},
  note = {GitHub repository}
}
```
assets/MOSS-VL-benchmark.png ADDED (Git LFS)

assets/benchmark_table.png ADDED (Git LFS)