VST-3B

Video Streaming Thinking: VideoLLMs Can Watch and Think Simultaneously

📄 Paper | 🌐 Project Page | 💻 Code | 🤗 Training Data

This is the 3B variant of Video Streaming Thinking (VST), a new paradigm for streaming video understanding that interleaves active reasoning with continuous video consumption, enabling amortized test-time scaling with real-time responsiveness.
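The interleaving described above can be pictured with a toy loop. This is only a minimal sketch of the general idea, not the authors' implementation or the model's API: `StreamingThinker`, `toy_think`, and `run` are hypothetical names introduced for illustration, and clips are stood in for by plain strings. It shows reasoning cost being paid per clip while the stream plays, so that a later question is answered from already accumulated thoughts.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List


@dataclass
class StreamingThinker:
    """Toy illustration: think while clips arrive, answer from cached thoughts."""

    # Hypothetical callback: (previous thoughts, new clip) -> new thought.
    think: Callable[[List[str], str], str]
    thoughts: List[str] = field(default_factory=list)

    def watch(self, clip: str) -> None:
        # Reasoning happens here, once per clip, while the stream is still playing.
        self.thoughts.append(self.think(self.thoughts, clip))

    def answer(self, question: str) -> str:
        # At query time only the accumulated thoughts are consulted.
        return f"{question} -> " + "; ".join(self.thoughts)


def toy_think(previous: List[str], clip: str) -> str:
    # Stand-in for a model call; it just numbers and echoes the clip description.
    return f"[{len(previous) + 1}] saw {clip}"


def run(stream: Iterable[str], question: str) -> str:
    thinker = StreamingThinker(think=toy_think)
    for clip in stream:
        thinker.watch(clip)
    return thinker.answer(question)


print(run(["a door opens", "a person enters"], "What happened?"))
```

In this sketch the work done in `watch` is spread across the stream, which is one way to read "amortized" in the description above; `answer` itself does no per-clip reasoning.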

Performance

| Model | OVO-Bench | StreamingBench | VideoMME | LongVideoBench | VideoHolmes |
| --- | --- | --- | --- | --- | --- |
| VST-3B | 56.2 | 75.5 | 59.5 | 54.1 | 36.1 |

Citation

@article{guan2026videostreamingthinking,
      title={Video Streaming Thinking: VideoLLMs Can Watch and Think Simultaneously},
      author={Yiran Guan and Liang Yin and Dingkang Liang and Jianzhong Ju and Zhenbo Luo and Jian Luan and Yuliang Liu and Xiang Bai},
      journal={arXiv preprint arXiv:2603.12262},
      year={2026},
}

Model size: 4B params (Safetensors); tensor type: BF16