Upload folder using huggingface_hub

Files changed:
- .gitattributes +1 -0
- README.md +2 -2
- assets/MOSS-VL-benchmark.png +2 -2
- assets/radar.png +3 -0
.gitattributes CHANGED

```diff
@@ -41,3 +41,4 @@ tokenizer.json filter=lfs diff=lfs merge=lfs -text
 assets/MOSS-VL-Benchmark.png filter=lfs diff=lfs merge=lfs -text
 assets/MOSS-VL-benchmark.png filter=lfs diff=lfs merge=lfs -text
 assets/benchmark_table.png filter=lfs diff=lfs merge=lfs -text
+assets/radar.png filter=lfs diff=lfs merge=lfs -text
```
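For context, each such `.gitattributes` line tells Git to route files matching the leading pattern through the LFS filter (`filter=lfs diff=lfs merge=lfs -text`). A minimal sketch of how a path can be checked against these rules — a hypothetical helper using simplified `fnmatch`-style matching, not the full gitattributes pattern semantics:

```python
from fnmatch import fnmatch

def lfs_tracked(path: str, gitattributes: str) -> bool:
    """Return True if `path` matches an LFS rule in the given .gitattributes text.

    Simplified: real gitattributes matching has extra rules (directory-relative
    patterns, attribute unsetting) that this sketch ignores.
    """
    for line in gitattributes.splitlines():
        parts = line.split()
        # parts[0] is the glob pattern; the rest are attributes.
        if len(parts) >= 2 and "filter=lfs" in parts[1:]:
            if fnmatch(path, parts[0]):
                return True
    return False

rules = (
    "assets/radar.png filter=lfs diff=lfs merge=lfs -text\n"
    "*.bin filter=lfs diff=lfs merge=lfs -text\n"
)
print(lfs_tracked("assets/radar.png", rules))  # True
print(lfs_tracked("README.md", rules))         # False
```

In practice these lines are generated by `git lfs track "<pattern>"` rather than written by hand.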
README.md CHANGED

```diff
@@ -73,7 +73,7 @@ We conducted a comprehensive evaluation of **MOSS-VL-Instruct-0408** across four
 
 * **Leading Video Intelligence**: MOSS-VL achieves a score of **65.8** in Video Understanding, significantly outperforming Qwen3-VL (+2pts). It shows exceptional temporal consistency and action recognition capabilities across benchmarks like `VideoMME`, `MLVU`, `EgoSchema`, and `VSI-bench` (where it outperforms **Qwen3-VL-8B-Instruct** by **8.3 points**).
 * **Outstanding Multimodal Perception**: MOSS-VL delivers excellent general image-text understanding, shining in fine-grained object recognition and spatial reasoning on benchmarks like `BLINK` and `MMBench`.
-* **Robust Multimodal Reasoning**: MOSS-VL demonstrates solid logical inference, staying highly competitive with the latest Qwen series on challenging reasoning suites such as `
+* **Robust Multimodal Reasoning**: MOSS-VL demonstrates solid logical inference, staying highly competitive with the latest Qwen series on challenging reasoning suites such as `VisuLogic`.
 * **Reliable Document Understanding**: While the model is primarily optimized for general perception and video, MOSS-VL still delivers **83.9** on OCR and document analysis, ensuring dependable extraction of text and structured information.
 
 
@@ -294,7 +294,7 @@ MOSS-VL-Instruct-0408 represents an early milestone in the MOSS-VL roadmap, and
 ## Citation
 ```bibtex
 @misc{moss_vl_2026,
-  title = {MOSS-VL Technical Report},
+  title = {{MOSS-VL Technical Report}},
   author = {OpenMOSS Team},
   year = {2026},
   howpublished = {\url{https://github.com/OpenMOSS/MOSS-VL}},
```
|
assets/MOSS-VL-benchmark.png CHANGED (binary, stored with Git LFS)

assets/radar.png ADDED (binary, stored with Git LFS)