MOSS-VL-Instruct-0408 is the instruction-tuned checkpoint of the MOSS-VL series.
Built on top of MOSS-VL-Base-0408 through supervised fine-tuning (SFT), this checkpoint is designed as a high-performance offline multimodal engine. It delivers strong, well-rounded performance across the full spectrum of vision-language tasks (image understanding, OCR, document parsing, visual reasoning, and instruction following) and is particularly outstanding at video understanding, from long-form comprehension to fine-grained temporal reasoning and action recognition.

### Highlights

- **Outstanding Video Understanding**: A core strength of MOSS-VL. The model excels at long-form video comprehension, temporal reasoning, action recognition, and second-level event localization, delivering top-tier results on benchmarks such as VideoMME, MLVU, and EgoSchema.
- **Strong General Multimodal Perception**: Robust image understanding, fine-grained object recognition, OCR, and document parsing.
- **Reliable Instruction Following**: Substantially improved alignment with user intent through supervised fine-tuning on diverse multimodal instruction data.
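Second-level event localization implies sampling a video densely enough that every second is represented by at least one query timestamp. A stdlib-only sketch of such a sampler (a hypothetical helper for illustration, not part of the MOSS-VL API):

```python
def second_level_timestamps(duration_s: float, fps: float = 1.0) -> list[float]:
    """Uniformly sample query timestamps (in seconds) across a clip.

    Hypothetical helper for illustration only -- not part of the MOSS-VL API.
    Second-level localization needs at least one query per second, i.e. fps >= 1.0.
    """
    if duration_s <= 0:
        return []
    step = 1.0 / fps
    n = int(duration_s * fps)
    return [round(i * step, 3) for i in range(n)]

# A 5-second clip sampled at 1 fps yields five per-second timestamps.
print(second_level_timestamps(5.0))  # [0.0, 1.0, 2.0, 3.0, 4.0]
```

Raising `fps` trades compute for finer temporal granularity; the model's frame budget caps how dense the sampling can be in practice.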

### Note on Variants

> [!IMPORTANT]
> **This is the offline instruction-tuned checkpoint.** It is not the streaming variant. If you are looking for low-latency, real-time interactive video understanding, please refer to the upcoming **MOSS-VL-RealTime** release.

MOSS-VL utilizes Cross-attention Rotary Position Embedding (XRoPE).

We conducted a comprehensive evaluation of **MOSS-VL-Instruct-0408** across four key dimensions: Multimodal Perception, Document/OCR, Multimodal Reasoning, and Video Understanding. The results demonstrate that MOSS-VL achieves outstanding performance, particularly excelling in **general multimodal perception** and **complex video analysis**.

### Key Highlights

* **Leading Video Intelligence**: MOSS-VL achieves a score of **65.8** in Video Understanding, significantly outperforming Qwen3-VL (+2 pts). It shows exceptional temporal consistency and action recognition capabilities across benchmarks like `VideoMME`, `MLVU`, `EgoSchema`, and `VSI-bench` (where it outperforms **Qwen3-VL-8B-Instruct** by **8.3 points**).
* **Outstanding Multimodal Perception**: With a score of **75.1**, MOSS-VL delivers excellent general image-text understanding, shining in fine-grained object recognition and spatial reasoning on benchmarks like `BLINK` and `MMBench`.
## Quickstart

<details>
<summary><strong>Single-image offline inference (Python)</strong></summary>
</details>

## Requirements

Installation commands:

```bash
conda create -n moss_vl python=3.12 pip -y
conda activate moss_vl
pip install -i https://pypi.org/simple --no-build-isolation -r requirements.txt
```

Validated setup notes:

- CUDA runtime used for validation: `12.8`
- Inference loading uses `trust_remote_code=True` and `attn_implementation="flash_attention_2"`
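The two flags in the setup notes appear in the loading call. A minimal sketch, assuming a standard Transformers workflow (the repo id and the auto class are assumptions; check the Quickstart above for the exact names):

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

# Assumed repo id for illustration; substitute the actual Hub path.
MODEL_ID = "MOSS-VL-Instruct-0408"

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,                   # the checkpoint ships custom modeling code
    attn_implementation="flash_attention_2",  # validated attention backend (CUDA 12.8)
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
```

FlashAttention-2 requires a compatible GPU and the `flash-attn` wheel installed via `requirements.txt`; dropping the `attn_implementation` argument falls back to the default attention kernel.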

## Limitations and Future Work

MOSS-VL-Instruct-0408 represents an early milestone in the MOSS-VL roadmap, and we're actively working on several directions to push it further: