We conducted a comprehensive evaluation of **MOSS-VL-Instruct-0408** across four capability areas:

* **Leading Video Intelligence**: MOSS-VL achieves a score of **65.8** in Video Understanding, significantly outperforming Qwen3-VL (+2 pts). It shows exceptional temporal consistency and action recognition capabilities across benchmarks like `VideoMME`, `MLVU`, `EgoSchema`, and `VSI-bench` (where it outperforms **Qwen3-VL-8B-Instruct** by **8.3 points**).
* **Outstanding Multimodal Perception**: MOSS-VL delivers excellent general image-text understanding, shining in fine-grained object recognition and spatial reasoning on benchmarks like `BLINK` and `MMBench`.
* **Robust Multimodal Reasoning**: MOSS-VL demonstrates solid logical inference, staying highly competitive with the latest Qwen series on challenging reasoning suites.
* **Reliable Document Understanding**: While the model is primarily optimized for general perception, MOSS-VL still delivers **83.9** on OCR and document analysis, ensuring dependable extraction of text and structured information.
</p>
## Quickstart

### Installation

```bash
conda create -n moss_vl python=3.12 pip -y
```
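The full usage example later in the Quickstart (elided from this excerpt) collects generated outputs via `texts = [item["text"] for item in result["results"]]`. As a minimal sketch of the result shape that line consumes, assuming only the `results` / `text` fields it references (the `score` field below is an illustrative assumption, not the actual API):

```python
# Hypothetical result object shaped like the one the Quickstart's
# `texts = [item["text"] for item in result["results"]]` line consumes.
# Only "results" and "text" come from the README; "score" is illustrative.
result = {
    "results": [
        {"text": "A cat sitting on a windowsill.", "score": 0.91},
        {"text": "Two people walking along a beach.", "score": 0.87},
    ]
}

# Collect just the generated text from each item, as in the Quickstart.
texts = [item["text"] for item in result["results"]]
print(texts)
```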
MOSS-VL-Instruct-0408 represents an early milestone in the MOSS-VL roadmap, and we're actively working on several directions to push it further:

- **Math & Code Reasoning**: While the current checkpoint already exhibits great general reasoning, we plan to substantially strengthen its mathematical reasoning and code reasoning capabilities, especially in multimodal contexts.
- **RL Post-Training**: We are working on a reinforcement learning post-training stage to further align the model with human preferences and to unlock stronger multi-step reasoning behaviors on top of the SFT foundation.