findcard12138 committed on
Commit c88a91c · verified · 1 Parent(s): 41a98a2

Upload folder using huggingface_hub
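
The commit message above suggests the files were pushed with the `huggingface_hub` Python client. A minimal sketch of such an upload follows. The folder path and repository id are hypothetical placeholders, and the ignore patterns are illustrative assumptions rather than anything this commit is known to have used (whether the ` copy` file added here was intended is not known).

```python
import fnmatch
from pathlib import Path


def build_upload_kwargs(folder: str, repo_id: str) -> dict:
    """Collect keyword arguments for HfApi.upload_folder."""
    return {
        "folder_path": str(Path(folder)),
        "repo_id": repo_id,  # hypothetical placeholder, not the real repo
        "repo_type": "model",
        "commit_message": "Upload folder using huggingface_hub",
        # Illustrative assumption: skip OS metadata and "<name> copy.<ext>" duplicates.
        "ignore_patterns": [".DS_Store", "* copy.*"],
    }


def upload(folder: str, repo_id: str) -> None:
    """Push the folder to the Hub (requires network access and a token)."""
    # Imported lazily so build_upload_kwargs works without the library installed.
    from huggingface_hub import HfApi

    HfApi().upload_folder(**build_upload_kwargs(folder, repo_id))


kwargs = build_upload_kwargs("MOSS-VL", "your-org/your-model")
# The patterns are fnmatch-style globs, so a duplicated asset would be skipped.
skipped = fnmatch.fnmatch("assets/timestamp_input copy.svg", "* copy.*")
```

Calling `upload("MOSS-VL", "your-org/your-model")` would create a single commit containing every file not matched by `ignore_patterns`.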

README.md CHANGED
@@ -72,7 +72,7 @@ MOSS-VL utilizes Cross-attention Rotary Position Embedding (XRoPE), tailored to
 
  ## 📊 Model Performance
 
- We conducted a comprehensive evaluation of **MOSS-VL-Instruct-0408** across four key dimensions: Multimodal Perception, Document/OCR, Multimodal Reasoning, and Video Understanding. The results demonstrate that MOSS-VL achieves outstanding performance, particularly excelling in **general multimodal perception** and **complex video analysis**.
+ We conducted a comprehensive evaluation of **MOSS-VL-Instruct-0408** across four key dimensions: Multimodal Perception, Multimodal Reasoning, Document/OCR, and Video Understanding. The results demonstrate that MOSS-VL achieves outstanding performance, particularly excelling in **general multimodal perception** and **complex video analysis**.
 
  ### 🌟 Key Highlights
 
@@ -99,6 +99,11 @@ conda activate moss_vl
  pip install -i https://pypi.org/simple --no-build-isolation -r requirements.txt
  ```
 
+ Validated setup notes:
+
+ - CUDA runtime used for validation: `12.8`
+ - Inference loading uses `trust_remote_code=True` and `attn_implementation="flash_attention_2"`
+
  ### 🏃 Run Inference
 
 
@@ -288,7 +293,7 @@ texts = [item["text"] for item in result["results"]]
 
  MOSS-VL-Instruct-0408 represents an early milestone in the MOSS-VL roadmap, and we're actively working on several directions to push it further:
 
- - 🧮 **Math & Code Reasoning** - While the current checkpoint already exhibits solid general reasoning, we plan to substantially strengthen its mathematical reasoning and code understanding capabilities, especially in multimodal contexts (e.g., reasoning over diagrams, charts, screenshots, and code-bearing visual inputs).
+ - 🧮 **Math & Code Reasoning** - While the current checkpoint already exhibits solid general reasoning, we plan to substantially strengthen its mathematical reasoning and code reasoning capabilities, especially in multimodal contexts.
  - ⚡ **Real-Time Streaming Variant** - The upcoming **MOSS-VL-RealTime** checkpoint will extend MOSS-VL to low-latency, streaming video understanding, enabling interactive applications such as live video chat, real-time event detection, and online assistants - complementing this offline checkpoint.
  - 🎯 **RL Post-Training** - We are working on a reinforcement learning post-training stage to further align the model with human preferences and to unlock stronger multi-step reasoning behaviors on top of the SFT foundation.
  - ⏳ **Longer Context for Hour-Scale Video** - Continuing to push context scaling so the model can comfortably handle hour-scale and multi-hour videos with consistent temporal grounding.
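
The setup notes added to the README in this commit name two loading options, `trust_remote_code=True` and `attn_implementation="flash_attention_2"`. The sketch below shows one way those options might be passed to a `transformers` loader. The model class (`AutoModelForCausalLM`), the `bfloat16` dtype, and the fallback to `"sdpa"` when the `flash_attn` package is missing are all assumptions made for illustration. The README's own inference section remains the authoritative recipe, and the repository's remote code may not support every attention backend.

```python
import importlib.util


def build_load_kwargs(prefer_flash: bool = True) -> dict:
    """Keyword arguments for a transformers `from_pretrained` call."""
    # FlashAttention-2 needs the separate `flash_attn` package to be installed.
    has_flash = importlib.util.find_spec("flash_attn") is not None
    # Assumption: fall back to PyTorch SDPA attention when flash_attn is absent.
    attn = "flash_attention_2" if (prefer_flash and has_flash) else "sdpa"
    return {
        "trust_remote_code": True,  # the repo ships custom modeling code
        "attn_implementation": attn,
    }


def load_model(model_id: str):
    """Load a checkpoint; needs torch, transformers, and typically a GPU."""
    # Imported lazily so build_load_kwargs can be inspected without these packages.
    import torch
    from transformers import AutoModelForCausalLM  # assumed class, not confirmed

    return AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,  # FlashAttention-2 supports only fp16/bf16
        **build_load_kwargs(),
    )


load_kwargs = build_load_kwargs(prefer_flash=False)
```

Keeping the keyword arguments in a small helper makes it easy to confirm which attention backend will be requested before paying the cost of loading weights.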
assets/3d-rope.png CHANGED

Git LFS Details

  • SHA256: 8800079720aa8bd81b3e0e03e272343d46945ee80df9f2772cea2b8f26e65dd8
  • Pointer size: 131 Bytes
  • Size of remote file: 194 kB

Git LFS Details

  • SHA256: aa84af011196536d73dbcc255aa267179aa433ea697e2e95334cbd41481d4575
  • Pointer size: 131 Bytes
  • Size of remote file: 208 kB
assets/benchmark_table.png CHANGED

Git LFS Details

  • SHA256: 166de71650e926c3b61a60ff7dbd1f69a17b1ab6516dd35678263f486d383a38
  • Pointer size: 131 Bytes
  • Size of remote file: 189 kB

Git LFS Details

  • SHA256: 7d13ed0087f9e764f8c84dc3c45cdf79c35c8ac6cef54ddc3ed6f690268ebae2
  • Pointer size: 131 Bytes
  • Size of remote file: 368 kB
assets/timestamp_input copy.svg ADDED
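
The "Pointer size: 131 Bytes" figure in the Git LFS details above can be reproduced from the Git LFS pointer file layout: a version line, an `oid sha256:` line, and a `size` line. The sketch below rebuilds a pointer for the first `assets/3d-rope.png` entry. The exact byte count of the remote file is not given (only the rounded "194 kB"), so `194_000` is a hypothetical stand-in; any six-digit size yields the same pointer length.

```python
def lfs_pointer(oid_sha256: str, size_bytes: int) -> str:
    """Render a Git LFS pointer file body for the given object id and size."""
    return (
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{oid_sha256}\n"
        f"size {size_bytes}\n"
    )


# SHA256 copied from the first Git LFS details block for assets/3d-rope.png.
oid = "8800079720aa8bd81b3e0e03e272343d46945ee80df9f2772cea2b8f26e65dd8"
# Hypothetical exact size: the page only reports the rounded value "194 kB".
pointer = lfs_pointer(oid, 194_000)
pointer_size = len(pointer.encode("utf-8"))
print(pointer_size)  # 43 + 76 + 12 bytes for the three lines -> 131
```

This is why all four LFS entries listed here report the same 131-byte pointer even though the underlying images differ in size.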