VST-3B

Video Streaming Thinking: VideoLLMs Can Watch and Think Simultaneously

📄 Paper | 🌐 Project Page | 💻 Code | 🤗 Training Data

This is the 3B variant of Video Streaming Thinking (VST), a new paradigm for streaming video understanding that interleaves active reasoning with continuous video consumption, enabling amortized test-time scaling with real-time responsiveness.
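The interleaving idea can be illustrated with a toy loop. This is a minimal sketch, not the VST implementation: `StreamingThinker`, `toy_think`, and `toy_answer` are hypothetical stand-ins, and the "clips" are plain strings rather than video frames. It only shows the control flow in which a short reasoning step runs as each clip arrives, so that work is spread over the stream instead of being deferred until a question is asked.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class StreamingThinker:
    """Toy illustration only: keeps a running list of 'thoughts' produced while clips arrive."""
    think: Callable[[List[str], str], str]   # hypothetical: (memory, clip) -> thought
    answer: Callable[[List[str], str], str]  # hypothetical: (memory, question) -> answer
    memory: List[str] = field(default_factory=list)

    def watch(self, clip: str) -> None:
        # One reasoning step per incoming clip; the result is appended to memory.
        self.memory.append(self.think(self.memory, clip))

    def ask(self, question: str) -> str:
        # In this toy, answering reads only the accumulated thoughts.
        return self.answer(self.memory, question)


def toy_think(memory: List[str], clip: str) -> str:
    # Stand-in for a model call; it just records what was seen.
    return f"step {len(memory) + 1}: saw {clip}"


def toy_answer(memory: List[str], question: str) -> str:
    latest = memory[-1] if memory else "none"
    return f"{question} -> based on {len(memory)} thoughts; latest: {latest}"


thinker = StreamingThinker(think=toy_think, answer=toy_answer)
for clip in ["clip_a", "clip_b", "clip_c"]:
    thinker.watch(clip)

result = thinker.ask("What happened last?")
print(result)
```

In a real setup the two callables would wrap model calls; how VST actually schedules and stores its intermediate reasoning is described in the paper, not in this sketch.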

Performance

Model	OVO-Bench	StreamingBench	VideoMME	LongVideoBench	VideoHolmes
VST-3B	56.2	75.5	59.5	54.1	36.1

Citation

@article{guan2026videostreamingthinking,
      title={Video Streaming Thinking: VideoLLMs Can Watch and Think Simultaneously},
      author={Yiran Guan and Liang Yin and Dingkang Liang and Jianzhong Ju and Zhenbo Luo and Jian Luan and Yuliang Liu and Xiang Bai},
      journal={arXiv preprint arXiv:2603.12262},
      year={2026},
}

Model size: 4B params
Tensor type: BF16
Format: Safetensors

Base model: Qwen/Qwen2.5-VL-3B-Instruct (Catalan258/VST-3B is a fine-tune of it)
