---
library_name: transformers
license: apache-2.0
base_model: Qwen/Qwen3-VL-2B-Instruct
tags:
- video-understanding
- streaming
- proactive
- activation-model
- masked-diffusion
- multimodal
- plug-and-play
language:
- en
pipeline_tag: video-classification
model-index:
- name: STRIDE-2B
  results:
  - task:
      type: video-classification
      name: Proactive Streaming Activation
    dataset:
      type: custom
      name: OVO-Bench
    metrics:
    - type: accuracy
      value: 59.07
      name: Overall (w/ Qwen3-VL-8B)
  - task:
      type: video-classification
      name: Proactive Streaming Activation
    dataset:
      type: custom
      name: StreamingBench
    metrics:
    - type: accuracy
      value: 59.29
      name: Overall (w/ Qwen3-VL-8B)
  - task:
      type: video-classification
      name: Temporal Grounding
    dataset:
      type: custom
      name: ET-Bench
    metrics:
    - type: f1
      value: 62.8
      name: TVG F1
    - type: f1
      value: 10.7
      name: EPM F1
    - type: f1
      value: 24.6
      name: TAL F1
    - type: f1
      value: 36.5
      name: DVC F1
    - type: f1
      value: 28.5
      name: SLC F1
---

# STRIDE-2B

**STRIDE** (**S**tructured **T**emporal **R**efinement with **I**terative **DE**noising) is a lightweight proactive activation model for streaming video understanding. It decides **when** a downstream Video-LLM should respond during a live video stream — without waiting for explicit user queries.


> **Paper**: *STRIDE: When to Speak Meets Sequence Denoising for Streaming Video Understanding*
>
> Junho Kim\*, Hosu Lee\*, James M. Rehg, Minsu Kim, Yong Man Ro
>
> UIUC, KAIST, Google DeepMind

## What is STRIDE?

Existing streaming Video-LLMs are **reactive** — they only respond when a user explicitly asks a question. STRIDE makes them **proactive** by adding a lightweight front-end that continuously monitors incoming frames and predicts coherent activation spans indicating *when* to trigger a response.

The key insight is that activation in streaming video is not a point-wise binary decision ("should I respond *now*?"), but a **span-structured** sequence modeling problem — the model must capture consistent onset (0 → 1), persistence (1 → 1), and offset (1 → 0) transitions. STRIDE achieves this through **masked diffusion** over a temporal activation window, jointly predicting and iteratively refining activation signals across the window.

### Two-Stage Architecture

```
Video Stream
     │
     ▼
[STRIDE Activation Model]   ← this model (2B)
     │
     │ trigger (only if active)
     ▼
[Downstream Video-LLM]      ← frozen, any off-the-shelf
     │
     ▼
  Response
```

- **Stage 1 — Activation (STRIDE):** Monitors the stream at 1 FPS, maintains a sliding activation window, and iteratively denoises binary activation labels via masked diffusion.
- **Stage 2 — Response (Downstream LLM):** When triggered, the frozen downstream Video-LLM receives the accumulated frame cache and generates a response.

STRIDE is fully **plug-and-play** — compatible with any off-the-shelf Video-LLM.
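To make the masked-diffusion idea concrete, here is a minimal, self-contained sketch of confidence-based iterative unmasking over one activation window. It is illustrative only: `predict` and `toy_predict` are hypothetical stand-ins for the STRIDE backbone, and the unmasking schedule shown here is one common choice for masked diffusion, not necessarily the exact one used in the paper.

```python
# Minimal sketch of masked-diffusion decoding over a temporal activation
# window. All masked positions are predicted jointly each step, and the
# most confident ones are committed first (an assumed schedule).
import numpy as np

MASK = -1  # marker for a still-masked window position

def denoise_window(predict, window_len=8, num_steps=4):
    """Iteratively unmask binary activation labels for one window.

    `predict(labels)` is a hypothetical model call returning per-position
    probabilities of label 1 (shape: [window_len]); positions equal to
    MASK in `labels` are still undecided.
    """
    labels = np.full(window_len, MASK, dtype=int)
    per_step = max(1, window_len // num_steps)
    for _ in range(num_steps):
        masked = np.flatnonzero(labels == MASK)
        if masked.size == 0:
            break
        p1 = predict(labels)                 # P(label = 1) per position
        conf = np.maximum(p1, 1.0 - p1)      # confidence of the argmax label
        # Commit the most confident masked positions this step.
        chosen = masked[np.argsort(-conf[masked])][:per_step]
        labels[chosen] = (p1[chosen] > 0.5).astype(int)
    return labels

# Toy stand-in predictor: activation is most likely mid-window.
def toy_predict(labels):
    t = np.arange(labels.size)
    return np.exp(-((t - labels.size / 2) ** 2) / 4.0)
```

Running `denoise_window(toy_predict)` yields a contiguous activation span in the middle of the window, illustrating how joint refinement favors coherent onset/persistence/offset structure over independent per-frame decisions.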
## Results

### OVO-Bench (Online Video Understanding)

| Method | Real-Time Perception | Backward Tracing | Forward Active Responding | Overall |
|---|:---:|:---:|:---:|:---:|
| Flash-VStream-7B | 28.37 | 27.38 | 45.09 | 33.61 |
| Dispider | 54.55 | 36.06 | 34.72 | 41.78 |
| TimeChat-Online-7B | 58.60 | 42.00 | 36.40 | 45.60 |
| QueryStream-7B | 61.40 | 42.10 | 39.03 | 47.51 |
| StreamAgent-7B | 61.30 | 41.70 | 45.40 | 49.40 |
| **STRIDE** + Gemma3-4B | 60.93 | 34.87 | 55.73 | 50.51 |
| **STRIDE** + InternVL3-8B | 67.72 | 45.23 | 58.00 | 56.98 |
| **STRIDE** + Qwen3-VL-8B | 69.68 | 47.83 | 59.70 | **59.07** |

### StreamingBench (Streaming Comprehension)

| Method | Real-Time Visual | Omni-Source | Contextual | Overall |
|---|:---:|:---:|:---:|:---:|
| Flash-VStream-7B | 23.23 | 26.00 | 24.12 | 24.04 |
| VideoLLM-Online-8B | 35.99 | 28.45 | 26.55 | 32.48 |
| Dispider | 67.63 | 35.66 | 33.61 | 53.12 |
| StreamAgent-7B | 74.31 | 36.26 | 34.62 | 57.02 |
| **STRIDE** + Gemma3-4B | 60.00 | 36.80 | 38.80 | 50.14 |
| **STRIDE** + InternVL3-8B | 72.45 | 39.20 | 38.80 | 57.58 |
| **STRIDE** + Qwen3-VL-8B | 74.24 | 41.30 | 39.90 | **59.29** |

### ET-Bench (Temporal Grounding, Activation-Only)

| Model | Params | TVG | EPM | TAL | DVC | SLC | Avg |
|---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| *Temporal-Localization Specialized* | | | | | | | |
| VTimeLLM | 7B | 7.6 | 1.9 | 18.2 | 12.4 | 8.7 | 9.8 |
| TimeChat | 7B | 26.2 | 3.9 | 10.1 | 16.6 | 5.6 | 12.5 |
| VTG-LLM | 7B | 15.9 | 3.7 | 14.4 | **40.2** | 20.8 | 19.0 |
| LITA | 13B | 22.2 | 4.6 | 18.0 | 39.7 | 21.0 | 21.1 |
| ETChat | 5B | 38.6 | 10.2 | **30.8** | 38.4 | 24.4 | 28.5 |
| *Streaming Baselines* | | | | | | | |
| VideoLLM-Online | 8B | 13.2 | 3.8 | 9.1 | 24.0 | 9.9 | 12.0 |
| Dispider | 9B | 36.1 | **15.5** | 27.3 | 33.8 | 18.8 | 26.3 |
| StreamBridge | 8B | 34.3 | – | 24.3 | 38.3 | 22.6 | – |
| *Ours* | | | | | | | |
| **STRIDE** | **2B** | **62.8** | 10.7 | 24.6 | 36.5 | **28.5** | **32.6** |

STRIDE achieves the best overall average with only 2B parameters, outperforming 7B–13B temporal-localization-specialized models and streaming baselines.

## Usage

For the full streaming inference pipeline and evaluation scripts, please refer to the [STRIDE GitHub repository](https://github.com/interlive-team/STRIDE).

## Training

- **Architecture:** `Qwen3VLForProactiveMDM` (Qwen3-VL backbone with a temporal activation head)
- **Base model:** [Qwen/Qwen3-VL-2B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-2B-Instruct)
- **Training data:** Temporal activation annotations curated from eight publicly available video understanding datasets (ActivityNet-Captions, LITA, YouCook2, ET-Instruct, Charades, CharadesEgo, DiDeMo, Grounded-VideoLLM)

## Model Variants

| Model | Params | Description |
|---|---|---|
| [**STRIDE-2B**](https://huggingface.co/interlive/STRIDE-2B) (this model) | 2B | Default activation model |
| STRIDE-4B | 4B | Scaled variant with improved accuracy |

## Citation

```bibtex
@article{kim2026stride,
  title={STRIDE: When to Speak Meets Sequence Denoising for Streaming Video Understanding},
  author={Kim, Junho and Lee, Hosu and Rehg, James M. and Kim, Minsu and Ro, Yong Man},
  journal={arXiv preprint arXiv:2603.27593},
  year={2026}
}
```

## License

This model is released under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0).
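As a closing illustration of the plug-and-play design described above, here is a minimal sketch of the two-stage trigger loop: STRIDE gates a frozen downstream Video-LLM, which only sees the accumulated frame cache when an activation fires. `activation_model` and `video_llm` are hypothetical callables standing in for the real components; consult the GitHub repository for the actual interfaces.

```python
# Illustrative two-stage streaming loop (not the repository's real API):
# Stage 1 decides *when* to speak; Stage 2 generates the response.
def stream_loop(frames, activation_model, video_llm):
    """Consume frames in order (e.g., sampled at 1 FPS); respond only on activation."""
    cache = []       # accumulated frame cache handed to the downstream model
    responses = []
    for t, frame in enumerate(frames):
        cache.append(frame)
        if activation_model(cache):              # Stage 1: "speak now?"
            responses.append((t, video_llm(cache)))  # Stage 2: frozen Video-LLM
    return responses
```

Because the downstream model is only invoked on activation, any off-the-shelf Video-LLM can be swapped in without retraining the activation front-end.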