Upload folder using huggingface_hub

Files changed (4)

README.md +7 -2
assets/3d-rope.png +2 -2
assets/benchmark_table.png +2 -2
assets/timestamp_input copy.svg +78 -0

README.md CHANGED

@@ -72,7 +72,7 @@ MOSS-VL utilizes Cross-attention Rotary Position Embedding (XRoPE), tailored to
 ## 📊 Model Performance
-We conducted a comprehensive evaluation of **MOSS-VL-Instruct-0408** across four key dimensions: Multimodal Perception, Document/OCR, Multimodal Reasoning, and Video Understanding. The results demonstrate that MOSS-VL achieves outstanding performance, particularly excelling in **general multimodal perception** and **complex video analysis**.
+We conducted a comprehensive evaluation of **MOSS-VL-Instruct-0408** across four key dimensions: Multimodal Perception, Multimodal Reasoning, Document/OCR, and Video Understanding. The results demonstrate that MOSS-VL achieves outstanding performance, particularly excelling in **general multimodal perception** and **complex video analysis**.
 ### 🌟 Key Highlights
@@ -99,6 +99,11 @@ conda activate moss_vl
 pip install -i https://pypi.org/simple --no-build-isolation -r requirements.txt
 ```
+Validated setup notes:
+- CUDA runtime used for validation: `12.8`
+- Inference loading uses `trust_remote_code=True` and `attn_implementation="flash_attention_2"`
 ### 🏃 Run Inference
@@ -288,7 +293,7 @@ texts = [item["text"] for item in result["results"]]
 MOSS-VL-Instruct-0408 represents an early milestone in the MOSS-VL roadmap, and we're actively working on several directions to push it further:
-- 🧮 **Math & Code Reasoning** — While the current checkpoint already exhibits solid general reasoning, we plan to substantially strengthen its mathematical reasoning and code understanding capabilities, especially in multimodal contexts (e.g., reasoning over diagrams, charts, screenshots, and code-bearing visual inputs).
+- 🧮 **Math & Code Reasoning** — While the current checkpoint already exhibits solid general reasoning, we plan to substantially strengthen its mathematical reasoning and code reasoning capabilities, especially in multimodal contexts.
 - ⚡ **Real-Time Streaming Variant** — The upcoming **MOSS-VL-RealTime** checkpoint will extend MOSS-VL to low-latency, streaming video understanding, enabling interactive applications such as live video chat, real-time event detection, and online assistants — complementing this offline checkpoint.
 - 🎯 **RL Post-Training** — We are working on a reinforcement learning post-training stage to further align the model with human preferences and to unlock stronger multi-step reasoning behaviors on top of the SFT foundation.
 - ⏳ **Longer Context for Hour-Scale Video** — Continuing to push context scaling so the model can comfortably handle hour-scale and multi-hour videos with consistent temporal grounding.
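
The validated setup notes added to README.md in this commit name two loading options. A minimal sketch of how they might be passed to `from_pretrained` is shown here. The helper names `validated_load_kwargs` and `load_model`, the `model_id` argument, the choice of `AutoModelForCausalLM`, and the `bfloat16` dtype are assumptions for illustration and are not taken from the commit; the README's own "Run Inference" section remains the reference for the actual loading code.

```python
from typing import Any, Dict


def validated_load_kwargs() -> Dict[str, Any]:
    """Return the two loading options named in the validated setup notes."""
    return {
        # The checkpoint ships custom modeling code, so remote code must be trusted.
        "trust_remote_code": True,
        # Requires the flash-attn package and a supported GPU.
        "attn_implementation": "flash_attention_2",
    }


def load_model(model_id: str):
    """Hypothetical loader; model_id is a placeholder supplied by the caller."""
    # Imported lazily so the helper above can be inspected without a GPU stack.
    import torch
    from transformers import AutoModelForCausalLM  # assumed auto class

    return AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,  # assumption: FlashAttention-2 needs fp16 or bf16
        **validated_load_kwargs(),
    )
```

Keeping the options in one helper makes it easy to compare them against the notes if the validated configuration changes in a later commit.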

assets/3d-rope.png CHANGED

Git LFS Details

SHA256: 8800079720aa8bd81b3e0e03e272343d46945ee80df9f2772cea2b8f26e65dd8
Pointer size: 131 Bytes
Size of remote file: 194 kB

Git LFS Details

SHA256: aa84af011196536d73dbcc255aa267179aa433ea697e2e95334cbd41481d4575
Pointer size: 131 Bytes
Size of remote file: 208 kB

assets/benchmark_table.png CHANGED

Git LFS Details

SHA256: 166de71650e926c3b61a60ff7dbd1f69a17b1ab6516dd35678263f486d383a38
Pointer size: 131 Bytes
Size of remote file: 189 kB

Git LFS Details

SHA256: 7d13ed0087f9e764f8c84dc3c45cdf79c35c8ac6cef54ddc3ed6f690268ebae2
Pointer size: 131 Bytes
Size of remote file: 368 kB

assets/timestamp_input copy.svg ADDED