OpenMOSS-Team/MOSS-VL-Instruct-0408


findcard12138 committed on Apr 8

Commit e29cd8d · verified · 1 Parent(s): 03be5cf

Upload folder using huggingface_hub

Files changed (1)

README.md +1 -3

@@ -296,9 +296,7 @@ MOSS-VL-Instruct-0408 represents an early milestone in the MOSS-VL roadmap, and
 - 🧮 **Math & Code Reasoning** — While the current checkpoint already exhibits solid general reasoning, we plan to substantially strengthen its mathematical reasoning and code reasoning capabilities, especially in multimodal contexts.
 - ⚡ **Real-Time Streaming Variant** — The upcoming **MOSS-VL-RealTime** will extend MOSS-VL to low-latency, streaming video understanding, enabling interactive applications such as live video chat, real-time event detection, and online assistants.
 - 🎯 **RL Post-Training** — We are working on a reinforcement learning post-training stage to further align the model with human preferences and to unlock stronger multi-step reasoning behaviors on top of the SFT foundation.
-- ⏳ **Longer Context for Hour-Scale Video** — Continuing to push context scaling so the model can comfortably handle hour-scale and multi-hour videos with consistent temporal grounding.
-- 🔊 **Audio Modality Integration** — Bringing audio understanding into the pipeline, so MOSS-VL can jointly reason over the visual and acoustic streams of a video — speech, ambient sound, music, and their interaction with on-screen events.
-- 📐 **Parameter Scaling** — Releasing additional model sizes across the MOSS-VL series to cover a wider range of compute budgets and deployment scenarios.
 > [!NOTE]
 > We welcome community feedback and contributions on any of these directions.
