Upload folder using huggingface_hub
- README.md +7 -2
- assets/3d-rope.png +2 -2
- assets/benchmark_table.png +2 -2
- assets/timestamp_input copy.svg +78 -0
README.md
CHANGED
@@ -72,7 +72,7 @@ MOSS-VL utilizes Cross-attention Rotary Position Embedding (XRoPE), tailored to
 
 ## Model Performance
 
-We conducted a comprehensive evaluation of **MOSS-VL-Instruct-0408** across four key dimensions: Multimodal Perception,
+We conducted a comprehensive evaluation of **MOSS-VL-Instruct-0408** across four key dimensions: Multimodal Perception, Multimodal Reasoning, Document/OCR, and Video Understanding. The results demonstrate that MOSS-VL achieves outstanding performance, particularly excelling in **general multimodal perception** and **complex video analysis**.
 
 ### Key Highlights
 
@@ -99,6 +99,11 @@ conda activate moss_vl
 pip install -i https://pypi.org/simple --no-build-isolation -r requirements.txt
 ```
 
+Validated setup notes:
+
+- CUDA runtime used for validation: `12.8`
+- Inference loading uses `trust_remote_code=True` and `attn_implementation="flash_attention_2"`
+
 ### Run Inference
 
 
@@ -288,7 +293,7 @@ texts = [item["text"] for item in result["results"]]
 
 MOSS-VL-Instruct-0408 represents an early milestone in the MOSS-VL roadmap, and we're actively working on several directions to push it further:
 
-- 🧮 **Math & Code Reasoning**: While the current checkpoint already exhibits solid general reasoning, we plan to substantially strengthen its mathematical reasoning and code
+- 🧮 **Math & Code Reasoning**: While the current checkpoint already exhibits solid general reasoning, we plan to substantially strengthen its mathematical reasoning and code reasoning capabilities, especially in multimodal contexts.
 - ⚡ **Real-Time Streaming Variant**: The upcoming **MOSS-VL-RealTime** checkpoint will extend MOSS-VL to low-latency, streaming video understanding, enabling interactive applications such as live video chat, real-time event detection, and online assistants, complementing this offline checkpoint.
 - 🎯 **RL Post-Training**: We are working on a reinforcement learning post-training stage to further align the model with human preferences and to unlock stronger multi-step reasoning behaviors on top of the SFT foundation.
 - ⏳ **Longer Context for Hour-Scale Video**: Continuing to push context scaling so the model can comfortably handle hour-scale and multi-hour videos with consistent temporal grounding.
assets/3d-rope.png
CHANGED
Git LFS Details
assets/benchmark_table.png
CHANGED
Git LFS Details
assets/timestamp_input copy.svg
ADDED
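The validated setup notes added to README.md in this commit name two loading options, `trust_remote_code=True` and `attn_implementation="flash_attention_2"`. The following is a minimal sketch of how those options might be passed to a Hugging Face `from_pretrained` call, not the repository's own inference code. The helper names `build_load_kwargs` and `load_model`, the use of `AutoModelForCausalLM`, and the choice to omit `attn_implementation` when the `flash_attn` package is missing are all assumptions made for illustration; the README may use a different model class.

```python
import importlib.util


def build_load_kwargs(flash_attn_available=None):
    """Build from_pretrained keyword arguments from the validated setup notes.

    Hypothetical helper, not part of the MOSS-VL repository.
    """
    if flash_attn_available is None:
        # Check whether the flash_attn package is importable in this environment.
        flash_attn_available = importlib.util.find_spec("flash_attn") is not None

    # trust_remote_code=True is stated in the setup notes.
    kwargs = {"trust_remote_code": True}
    if flash_attn_available:
        # Value taken from the setup notes.
        kwargs["attn_implementation"] = "flash_attention_2"
    # Assumption: without flash_attn, leave the key out so the library
    # chooses its default attention implementation.
    return kwargs


def load_model(model_id):
    """Load a checkpoint with the options above.

    Assumption: AutoModelForCausalLM is used here only as a stand-in;
    the actual model class for MOSS-VL may differ.
    """
    from transformers import AutoModelForCausalLM  # imported lazily

    return AutoModelForCausalLM.from_pretrained(model_id, **build_load_kwargs())


if __name__ == "__main__":
    print(build_load_kwargs())
```

Keeping the keyword arguments in a separate function makes it possible to inspect the loading configuration on a machine without a GPU or without `transformers` installed, since the heavy import only happens inside `load_model`.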