findcard12138 committed
Commit 292cf17 · verified · 1 Parent(s): 45d5886

Upload folder using huggingface_hub

Files changed (1): README.md (+23 -18)
README.md CHANGED
@@ -32,13 +32,13 @@ MOSS-VL-Instruct-0408 is the instruction-tuned checkpoint of the MOSS-VL series,
 
 Built on top of MOSS-VL-Base-0408 through supervised fine-tuning (SFT), this checkpoint is designed as a high-performance offline multimodal engine. It delivers strong, well-rounded performance across the full spectrum of vision-language tasks — including image understanding, OCR, document parsing, visual reasoning, and instruction following — and is particularly outstanding at video understanding, from long-form comprehension to fine-grained temporal reasoning and action recognition.
 
-### Highlights
+### ✨ Highlights
 
 - 🎬 **Outstanding Video Understanding** — A core strength of MOSS-VL. The model excels at long-form video comprehension, temporal reasoning, action recognition, and second-level event localization, delivering top-tier results on benchmarks such as VideoMME, MLVU, and EgoSchema.
 - 🖼️ **Strong General Multimodal Perception** — Robust image understanding, fine-grained object recognition, OCR, and document parsing.
 - 💬 **Reliable Instruction Following** — Substantially improved alignment with user intent through supervised fine-tuning on diverse multimodal instruction data.
 
-### Note on Variants
+### 📝 Note on Variants
 
 > [!IMPORTANT]
 > **This is the offline instruction-tuned checkpoint.** It is not the streaming variant. If you are looking for low-latency, real-time interactive video understanding, please refer to the upcoming **MOSS-VL-RealTime** release.
@@ -76,7 +76,7 @@ MOSS-VL utilizes Cross-attention Rotary Position Embedding (XRoPE), tailored to
 
 We conducted a comprehensive evaluation of **MOSS-VL-Instruct-0408** across four key dimensions: Multimodal Perception, Document/OCR, Multimodal Reasoning, and Video Understanding. The results demonstrate that MOSS-VL achieves outstanding performance, particularly excelling in **general multimodal perception** and **complex video analysis**.
 
-### Key Highlights
+### 🌟 Key Highlights
 
 * **🚀 Leading Video Intelligence**: MOSS-VL achieves a score of **65.8** in Video Understanding, significantly outperforming Qwen3-VL (+2pts). It shows exceptional temporal consistency and action recognition capabilities across benchmarks like `VideoMME`, `MLVU`, `EgoSchema`, and `VSI-bench` (where it outperforms **Qwen3-VL-8B-Instruct** by **8.3 points**).
 * **👁️ Outstanding Multimodal Perception**: With a score of **75.1**, MOSS-VL delivers excellent general image-text understanding, shining in fine-grained object recognition and spatial reasoning on benchmarks like `BLINK` and `MMBench`.
@@ -93,6 +93,25 @@ We conducted a comprehensive evaluation of **MOSS-VL-Instruct-0408** across four
 
 ## 🚀 Quickstart
 
+
+### 🛠️ Requirements
+
+Installation commands:
+
+```bash
+conda create -n moss_vl python=3.12 pip -y
+conda activate moss_vl
+pip install -i https://pypi.org/simple --no-build-isolation -r requirements.txt
+```
+
+Validated setup notes:
+
+- CUDA runtime used for validation: `12.8`
+- Inference loading uses `trust_remote_code=True` and `attn_implementation="flash_attention_2"`
+
+### 🏃 Run Inference
+
+
 <details>
 <summary><strong>Single-image offline inference (Python)</strong></summary>
 
@@ -276,23 +295,9 @@ texts = [item["text"] for item in result["results"]]
 </details>
 
 
-## Requirements
-
-Installation commands:
-
-```bash
-conda create -n moss_vl python=3.12 pip -y
-conda activate moss_vl
-pip install -i https://pypi.org/simple --no-build-isolation -r requirements.txt
-```
-
-Validated setup notes:
-
-- CUDA runtime used for validation: `12.8`
-- Inference loading uses `trust_remote_code=True` and `attn_implementation="flash_attention_2"`
 
 
-## Limitations and Future Work
+## 🚧 Limitations and Future Work
 
 MOSS-VL-Instruct-0408 represents an early milestone in the MOSS-VL roadmap, and we're actively working on several directions to push it further:
 