findcard12138 committed on
Commit fd2f3e8 · verified · 1 Parent(s): c88a91c

Upload folder using huggingface_hub

Files changed (1)
  1. README.md +1 -1
README.md CHANGED
@@ -294,7 +294,7 @@ texts = [item["text"] for item in result["results"]]
  MOSS-VL-Instruct-0408 represents an early milestone in the MOSS-VL roadmap, and we're actively working on several directions to push it further:
 
  - 🧮 **Math & Code Reasoning** — While the current checkpoint already exhibits solid general reasoning, we plan to substantially strengthen its mathematical reasoning and code reasoning capabilities, especially in multimodal contexts.
- - ⚡ **Real-Time Streaming Variant** — The upcoming **MOSS-VL-RealTime** checkpoint will extend MOSS-VL to low-latency, streaming video understanding, enabling interactive applications such as live video chat, real-time event detection, and online assistants — complementing this offline checkpoint.
+ - ⚡ **Real-Time Streaming Variant** — The upcoming **MOSS-VL-RealTime** will extend MOSS-VL to low-latency, streaming video understanding, enabling interactive applications such as live video chat, real-time event detection, and online assistants.
  - 🎯 **RL Post-Training** — We are working on a reinforcement learning post-training stage to further align the model with human preferences and to unlock stronger multi-step reasoning behaviors on top of the SFT foundation.
  - ⏳ **Longer Context for Hour-Scale Video** — Continuing to push context scaling so the model can comfortably handle hour-scale and multi-hour videos with consistent temporal grounding.
  - 🔊 **Audio Modality Integration** — Bringing audio understanding into the pipeline, so MOSS-VL can jointly reason over the visual and acoustic streams of a video — speech, ambient sound, music, and their interaction with on-screen events.