---
library_name: transformers
license: apache-2.0
base_model: Qwen/Qwen3-VL-2B-Instruct
tags:
- video-understanding
- streaming
- proactive
- activation-model
- masked-diffusion
- multimodal
- plug-and-play
language:
- en
pipeline_tag: video-classification
model-index:
- name: STRIDE-2B
  results:
  - task:
      type: video-classification
      name: Proactive Streaming Activation
    dataset:
      type: custom
      name: OVO-Bench
    metrics:
    - type: accuracy
      value: 59.07
      name: Overall (w/ Qwen3-VL-8B)
  - task:
      type: video-classification
      name: Proactive Streaming Activation
    dataset:
      type: custom
      name: StreamingBench
    metrics:
    - type: accuracy
      value: 59.29
      name: Overall (w/ Qwen3-VL-8B)
  - task:
      type: video-classification
      name: Temporal Grounding
    dataset:
      type: custom
      name: ET-Bench
    metrics:
    - type: f1
      value: 62.8
      name: TVG F1
    - type: f1
      value: 10.7
      name: EPM F1
    - type: f1
      value: 24.6
      name: TAL F1
    - type: f1
      value: 36.5
      name: DVC F1
    - type: f1
      value: 28.5
      name: SLC F1
---

# STRIDE-2B

**STRIDE** (**S**tructured **T**emporal **R**efinement with **I**terative **DE**noising) is a lightweight proactive activation model for streaming video understanding. It decides **when** a downstream Video-LLM should respond during a live video stream — without waiting for explicit user queries.


> **Paper**: *STRIDE: When to Speak Meets Sequence Denoising for Streaming Video Understanding*
>
> Junho Kim\*, Hosu Lee\*, James M. Rehg, Minsu Kim, Yong Man Ro
>
> UIUC, KAIST, Google DeepMind

## What is STRIDE?

Existing streaming Video-LLMs are **reactive** — they only respond when a user explicitly asks a question. STRIDE makes them **proactive** by adding a lightweight front-end that continuously monitors incoming frames and predicts coherent activation spans indicating *when* to trigger a response.

The key insight is that activation in streaming video is not a point-wise binary decision ("should I respond *now*?"), but a **span-structured** sequence modeling problem — the model must capture consistent onset (0 → 1), persistence (1 → 1), and offset (1 → 0) transitions. STRIDE achieves this through **masked diffusion** over a temporal activation window, jointly predicting and iteratively refining activation signals across the window.

### Two-Stage Architecture

```
Video Stream
     │
     ▼
[STRIDE Activation Model]   ← this model (2B)
     │
     │ trigger (only if active)
     ▼
[Downstream Video-LLM]      ← frozen, any off-the-shelf
     │
     ▼
  Response
```

- **Stage 1 — Activation (STRIDE):** Monitors the stream at 1 FPS, maintains a sliding activation window, and iteratively denoises binary activation labels via masked diffusion.
- **Stage 2 — Response (Downstream LLM):** When triggered, the frozen downstream Video-LLM receives the accumulated frame cache and generates a response.

STRIDE is fully **plug-and-play** — compatible with any off-the-shelf Video-LLM.
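To make the masked-diffusion idea concrete, here is a minimal, self-contained sketch of confidence-based iterative unmasking over one activation window. It is illustrative only: `predict` and `toy_predict` are hypothetical stand-ins for the STRIDE backbone, and the unmasking schedule shown here is one common choice for masked diffusion, not necessarily the exact one used in the paper.

```python
# Minimal sketch of masked-diffusion decoding over a temporal activation
# window. All masked positions are predicted jointly each step, and the
# most confident ones are committed first (an assumed schedule).
import numpy as np

MASK = -1  # marker for a still-masked window position

def denoise_window(predict, window_len=8, num_steps=4):
    """Iteratively unmask binary activation labels for one window.

    `predict(labels)` is a hypothetical model call returning per-position
    probabilities of label 1 (shape: [window_len]); positions equal to
    MASK in `labels` are still undecided.
    """
    labels = np.full(window_len, MASK, dtype=int)
    per_step = max(1, window_len // num_steps)
    for _ in range(num_steps):
        masked = np.flatnonzero(labels == MASK)
        if masked.size == 0:
            break
        p1 = predict(labels)                 # P(label = 1) per position
        conf = np.maximum(p1, 1.0 - p1)      # confidence of the argmax label
        # Commit the most confident masked positions this step.
        chosen = masked[np.argsort(-conf[masked])][:per_step]
        labels[chosen] = (p1[chosen] > 0.5).astype(int)
    return labels

# Toy stand-in predictor: activation is most likely mid-window.
def toy_predict(labels):
    t = np.arange(labels.size)
    return np.exp(-((t - labels.size / 2) ** 2) / 4.0)
```

Running `denoise_window(toy_predict)` yields a contiguous activation span in the middle of the window, illustrating how joint refinement favors coherent onset/persistence/offset structure over independent per-frame decisions.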
## Results

### OVO-Bench (Online Video Understanding)

| Method | Real-Time Perception | Backward Tracing | Forward Active Responding | Overall |
|---|:---:|:---:|:---:|:---:|
| Flash-VStream-7B | 28.37 | 27.38 | 45.09 | 33.61 |
| Dispider | 54.55 | 36.06 | 34.72 | 41.78 |
| TimeChat-Online-7B | 58.60 | 42.00 | 36.40 | 45.60 |
| QueryStream-7B | 61.40 | 42.10 | 39.03 | 47.51 |
| StreamAgent-7B | 61.30 | 41.70 | 45.40 | 49.40 |
| **STRIDE** + Gemma3-4B | 60.93 | 34.87 | 55.73 | 50.51 |
| **STRIDE** + InternVL3-8B | 67.72 | 45.23 | 58.00 | 56.98 |
| **STRIDE** + Qwen3-VL-8B | 69.68 | 47.83 | 59.70 | **59.07** |

### StreamingBench (Streaming Comprehension)

| Method | Real-Time Visual | Omni-Source | Contextual | Overall |
|---|:---:|:---:|:---:|:---:|
| Flash-VStream-7B | 23.23 | 26.00 | 24.12 | 24.04 |
| VideoLLM-Online-8B | 35.99 | 28.45 | 26.55 | 32.48 |
| Dispider | 67.63 | 35.66 | 33.61 | 53.12 |
| StreamAgent-7B | 74.31 | 36.26 | 34.62 | 57.02 |
| **STRIDE** + Gemma3-4B | 60.00 | 36.80 | 38.80 | 50.14 |
| **STRIDE** + InternVL3-8B | 72.45 | 39.20 | 38.80 | 57.58 |
| **STRIDE** + Qwen3-VL-8B | 74.24 | 41.30 | 39.90 | **59.29** |

### ET-Bench (Temporal Grounding, Activation-Only)

| Model | Params | TVG | EPM | TAL | DVC | SLC | Avg |
|---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| *Temporal-Localization Specialized* | | | | | | | |
| VTimeLLM | 7B | 7.6 | 1.9 | 18.2 | 12.4 | 8.7 | 9.8 |
| TimeChat | 7B | 26.2 | 3.9 | 10.1 | 16.6 | 5.6 | 12.5 |
| VTG-LLM | 7B | 15.9 | 3.7 | 14.4 | **40.2** | 20.8 | 19.0 |
| LITA | 13B | 22.2 | 4.6 | 18.0 | 39.7 | 21.0 | 21.1 |
| ETChat | 5B | 38.6 | 10.2 | **30.8** | 38.4 | 24.4 | 28.5 |
| *Streaming Baselines* | | | | | | | |
| VideoLLM-Online | 8B | 13.2 | 3.8 | 9.1 | 24.0 | 9.9 | 12.0 |
| Dispider | 9B | 36.1 | **15.5** | 27.3 | 33.8 | 18.8 | 26.3 |
| StreamBridge | 8B | 34.3 | – | 24.3 | 38.3 | 22.6 | – |
| *Ours* | | | | | | | |
| **STRIDE** | **2B** | **62.8** | 10.7 | 24.6 | 36.5 | **28.5** | **32.6** |

STRIDE achieves the best overall average with only 2B parameters, outperforming 7B–13B temporal-localization-specialized models and streaming baselines.

## Usage

For the full streaming inference pipeline and evaluation scripts, please refer to the [STRIDE GitHub repository](https://github.com/interlive-team/STRIDE).

## Training

- **Architecture:** `Qwen3VLForProactiveMDM` (Qwen3-VL backbone with a temporal activation head)
- **Base model:** [Qwen/Qwen3-VL-2B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-2B-Instruct)
- **Training data:** Temporal activation annotations curated from eight publicly available video understanding datasets (ActivityNet-Captions, LITA, YouCook2, ET-Instruct, Charades, CharadesEgo, DiDeMo, Grounded-VideoLLM)

## Model Variants

| Model | Params | Description |
|---|---|---|
| [**STRIDE-2B**](https://huggingface.co/interlive/STRIDE-2B) (this model) | 2B | Default activation model |
| STRIDE-4B | 4B | Scaled variant with improved accuracy |

## Citation

```bibtex
@article{kim2026stride,
  title={STRIDE: When to Speak Meets Sequence Denoising for Streaming Video Understanding},
  author={Kim, Junho and Lee, Hosu and Rehg, James M. and Kim, Minsu and Ro, Yong Man},
  journal={arXiv preprint arXiv:2603.27593},
  year={2026}
}
```

## License

This model is released under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0).
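As a closing illustration of the plug-and-play design described above, here is a minimal sketch of the two-stage trigger loop: STRIDE gates a frozen downstream Video-LLM, which only sees the accumulated frame cache when an activation fires. `activation_model` and `video_llm` are hypothetical callables standing in for the real components; consult the GitHub repository for the actual interfaces.

```python
# Illustrative two-stage streaming loop (not the repository's real API):
# Stage 1 decides *when* to speak; Stage 2 generates the response.
def stream_loop(frames, activation_model, video_llm):
    """Consume frames in order (e.g., sampled at 1 FPS); respond only on activation."""
    cache = []       # accumulated frame cache handed to the downstream model
    responses = []
    for t, frame in enumerate(frames):
        cache.append(frame)
        if activation_model(cache):              # Stage 1: "speak now?"
            responses.append((t, video_llm(cache)))  # Stage 2: frozen Video-LLM
    return responses
```

Because the downstream model is only invoked on activation, any off-the-shelf Video-LLM can be swapped in without retraining the activation front-end.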