Upload folder using huggingface_hub
README.md CHANGED
@@ -294,7 +294,7 @@ texts = [item["text"] for item in result["results"]]
MOSS-VL-Instruct-0408 represents an early milestone in the MOSS-VL roadmap, and we're actively working on several directions to push it further:

- 🧮 **Math & Code Reasoning** – While the current checkpoint already exhibits solid general reasoning, we plan to substantially strengthen its mathematical and code reasoning capabilities, especially in multimodal contexts.

- ⚡ **Real-Time Streaming Variant** – The upcoming **MOSS-VL-RealTime** will extend MOSS-VL to low-latency, streaming video understanding, enabling interactive applications such as live video chat, real-time event detection, and online assistants.

- 🎯 **RL Post-Training** – We are working on a reinforcement learning post-training stage to further align the model with human preferences and to unlock stronger multi-step reasoning behaviors on top of the SFT foundation.

- ⏳ **Longer Context for Hour-Scale Video** – Continuing to push context scaling so the model can comfortably handle hour-scale and multi-hour videos with consistent temporal grounding.

- 🔊 **Audio Modality Integration** – Bringing audio understanding into the pipeline, so MOSS-VL can jointly reason over the visual and acoustic streams of a video – speech, ambient sound, music, and their interaction with on-screen events.