VST-32B

Video Streaming Thinking: VideoLLMs Can Watch and Think Simultaneously

📄 Paper | 🌐 Project Page | 💻 Code | 🤗 Training Data

This is the 32B variant of Video Streaming Thinking (VST), a new paradigm for streaming video understanding that interleaves active reasoning with continuous video consumption, enabling amortized test-time scaling with real-time responsiveness.
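
As a rough illustration of what "interleaving reasoning with video consumption" can mean, the toy sketch below runs a reasoning step after every incoming clip and answers a later question from the accumulated notes alone. This is not the VST implementation or its API: `StreamState`, `stream_and_think`, `answer`, and `toy_think` are hypothetical names, and the "model" is a stand-in string function. The point is only that reasoning cost is spread across the stream instead of being paid in full at query time.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List


@dataclass
class StreamState:
    """Running memory of intermediate thoughts (hypothetical structure)."""
    thoughts: List[str] = field(default_factory=list)


def stream_and_think(
    clips: Iterable[str],
    think: Callable[[str, List[str]], str],
) -> StreamState:
    """Interleave one reasoning step with each incoming clip."""
    state = StreamState()
    for clip in clips:
        # Reason about the newest clip given the earlier thoughts,
        # then store the result before the next clip arrives.
        state.thoughts.append(think(clip, state.thoughts))
    return state


def answer(question: str, state: StreamState) -> str:
    """Answer from the accumulated thoughts only, keeping query-time work small."""
    return f"{question} -> " + " | ".join(state.thoughts)


def toy_think(clip: str, history: List[str]) -> str:
    # Stand-in for a model call (assumption): numbers each note by its position.
    return f"note {len(history) + 1}: saw {clip}"


state = stream_and_think(["clip_a", "clip_b", "clip_c"], toy_think)
print(answer("What happened?", state))
```

In a real system `think` would be a call into the model and the clips would be frame batches; both are left abstract here on purpose.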

Performance

Model	OVO-Bench	StreamingBench	VideoMME	LongVideoBench	VideoHolmes
VST-32B	63.5	80.7	67.2	60.7	45.1

Citation

@article{guan2026videostreamingthinking,
      title={Video Streaming Thinking: VideoLLMs Can Watch and Think Simultaneously},
      author={Yiran Guan and Liang Yin and Dingkang Liang and Jianzhong Ju and Zhenbo Luo and Jian Luan and Yuliang Liu and Xiang Bai},
      journal={arXiv preprint arXiv:2603.12262},
      year={2026},
}

Model size: 33B params
Tensor type: BF16 (Safetensors)

Base model: Qwen/Qwen2.5-VL-32B-Instruct (Catalan258/VST-32B is a fine-tune of it)
