Upload folder using huggingface_hub
README.md CHANGED
@@ -294,7 +294,7 @@ texts = [item["text"] for item in result["results"]]
MOSS-VL-Instruct-0408 represents an early milestone in the MOSS-VL roadmap, and we're actively working on several directions to push it further:

- 🧮 **Math & Code Reasoning** – While the current checkpoint already exhibits solid general reasoning, we plan to substantially strengthen its mathematical and code reasoning capabilities, especially in multimodal contexts.

- ⚡ **Real-Time Streaming Variant** – The upcoming **MOSS-VL-RealTime** will extend MOSS-VL to low-latency, streaming video understanding, enabling interactive applications such as live video chat, real-time event detection, and online assistants.

- 🎯 **RL Post-Training** – We are working on a reinforcement learning post-training stage to further align the model with human preferences and to unlock stronger multi-step reasoning behaviors on top of the SFT foundation.

- ⏳ **Longer Context for Hour-Scale Video** – Continuing to push context scaling so the model can comfortably handle hour-scale and multi-hour videos with consistent temporal grounding.

- 🔊 **Audio Modality Integration** – Bringing audio understanding into the pipeline, so MOSS-VL can jointly reason over the visual and acoustic streams of a video – speech, ambient sound, music, and their interaction with on-screen events.