findcard12138 committed
Commit cc88616 · verified · 1 Parent(s): 56de5c4

Upload folder using huggingface_hub

Files changed (1): README.md +3 -6
README.md CHANGED
@@ -73,7 +73,7 @@ We conducted a comprehensive evaluation of **MOSS-VL-Instruct-0408** across four
 
  * **🚀 Leading Video Intelligence**: MOSS-VL achieves a score of **65.8** in Video Understanding, significantly outperforming Qwen3-VL (+2 pts). It shows exceptional temporal consistency and action recognition capabilities across benchmarks like `VideoMME`, `MLVU`, `EgoSchema`, and `VSI-bench` (where it outperforms **Qwen3-VL-8B-Instruct** by **8.3 points**).
  * **👁️ Outstanding Multimodal Perception**: MOSS-VL delivers excellent general image-text understanding, shining in fine-grained object recognition and spatial reasoning on benchmarks like `BLINK` and `MMBench`.
- * **🧠 Robust Multimodal Reasoning**: MOSS-VL demonstrates solid logical inference, staying highly competitive with the latest Qwen series on challenging reasoning suites such as `VisuLogic`.
+ * **🧠 Robust Multimodal Reasoning**: MOSS-VL demonstrates solid logical inference, staying highly competitive with the latest Qwen series on challenging reasoning suites.
  * **📄 Reliable Document Understanding**: While the model is primarily optimized for general perception, MOSS-VL still delivers **83.9** on OCR and document analysis, ensuring dependable extraction of text and structured information.
 
 
@@ -82,9 +82,7 @@ We conducted a comprehensive evaluation of **MOSS-VL-Instruct-0408** across four
  </p>
 
  ## 🚀 Quickstart
- ### 🛠️ Requirements
-
- Installation:
+ ### 🛠️ Installation
 
  ```bash
  conda create -n moss_vl python=3.12 pip -y
@@ -281,8 +279,7 @@ texts = [item["text"] for item in result["results"]]
 
  MOSS-VL-Instruct-0408 represents an early milestone in the MOSS-VL roadmap, and we're actively working on several directions to push it further:
 
- - 🧮 **Math & Code Reasoning** — While the current checkpoint already exhibits solid general reasoning, we plan to substantially strengthen its mathematical reasoning and code reasoning capabilities, especially in multimodal contexts.
- - ⚡ **Real-Time Streaming Variant** — The upcoming **MOSS-VL-RealTime** will extend MOSS-VL to low-latency, streaming video understanding, enabling interactive applications such as live video chat, real-time event detection, and online assistants.
+ - 🧮 **Math & Code Reasoning** — While the current checkpoint already exhibits great general reasoning, we plan to substantially strengthen its mathematical reasoning and code reasoning capabilities, especially in multimodal contexts.
  - 🎯 **RL Post-Training** — We are working on a reinforcement learning post-training stage to further align the model with human preferences and to unlock stronger multi-step reasoning behaviors on top of the SFT foundation.
 