---
license: apache-2.0
language:
- en
tags:
- video
- video-question-answering
- multimodal
- vision-language
- qwen3-vl
- inference-time
- frame-selection
- clip
base_model: Qwen/Qwen3-VL-2B-Instruct
pipeline_tag: video-text-to-text
library_name: transformers
---

# DW-KhotTaeVL-2B-QueryFrames

**Built on [Qwen3-VL-2B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-2B-Instruct) (Apache 2.0).**

A query-aware frame selection wrapper around stock Qwen3-VL-2B-Instruct
for video multiple-choice / decision-style question answering. **No model
weights are modified**: the release ships a CLIP-ViT-L/14-driven frame
selector plus an optional task-type-aware uniform-fallback policy that
run in front of the stock model.

On the Video-MME mini 300 Q subset at an 8-frame budget, this recovers
**~44 % of the 8-frame → 64-frame stock baseline gap in MCQ mode, and
~56 % in task-aware MCQ mode**, with zero training, zero parameter
changes, and about 0.4 s of overhead per question.

## Scope

This release evaluates query-aware frame selection in a video
multiple-choice / decision-style QA setting. The selector may use
both the question text and the answer options as its CLIP query.
This is appropriate for Video-MME-style MCQ benchmarks and for
operational triage workflows where the system chooses among
predefined actions or alert categories (e.g. *normal passage /
restricted-zone entry / staff activity / false alarm*). It should
**not** be read as an open-ended video-understanding benchmark claim.

## Motivation

This work started from CCTV / video-security R&D, where only a small
number of frames can be sent to a VLM under latency and compute
constraints. The released artifact is a general-purpose query-aware
frame selector for video MCQ / decision-style video QA, not a
product-specific CCTV model.

## TL;DR

| Method | Trainable params | Accuracy on Video-MME mini 300 Q | Δ vs stock |
|---|---:|---:|---:|
| Stock Qwen3-VL-2B (uniform 8 f) | 0 | 57.0 % | 0 |
| **QueryFrames, MCQ mode** (8 f, no task_type) | 0 | **64.3 %** | **+7.3 pp** |
| **QueryFrames, task-aware MCQ mode** (8 f, task_type from dataset) | 0 | **66.3 %** | **+9.3 pp** |
| Stock Qwen3-VL-2B (uniform 64 f), ceiling | 0 | 73.7 % | +16.7 pp |

**12 of 12 task buckets non-negative; 9 strongly positive (≥ 5 pp);
0 regressions** in task-aware MCQ mode (task_type from Video-MME dataset).
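
The "share of the 8-frame to 64-frame gap recovered" figures can be re-derived from the four accuracies in the table. The snippet below is only a worked arithmetic check; the function name `gap_recovered` is illustrative and not part of the shipped code.

```python
# Accuracies (in %) copied from the TL;DR table.
stock_8f = 57.0     # stock model, uniform 8 frames
stock_64f = 73.7    # stock model, uniform 64 frames (ceiling)
mcq_mode = 64.3     # QueryFrames, no task_type
task_aware = 66.3   # QueryFrames, task_type supplied


def gap_recovered(acc, low=stock_8f, high=stock_64f):
    """Fraction of the low -> high accuracy gap that `acc` closes."""
    return (acc - low) / (high - low)


print(f"MCQ mode:        {gap_recovered(mcq_mode):.0%}")    # 44%
print(f"Task-aware mode: {gap_recovered(task_aware):.0%}")  # 56%
```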

> **Scope note.** This method targets short-clip, low-frame-budget
> video QA. The 300 Q numbers above are inside that design envelope.
> On the full 2700 Q split, overall Δ is **+0.22 pp**; see
> [Scope on the full Video-MME mini (2700 Q)](#scope-on-the-full-video-mme-mini-2700-q) below.

## Why it works

Stock Qwen3-VL-2B at 8 frames lags itself at 64 frames by ~17 pp.
The gap is *by definition* a frame-coverage problem (same model, same
prompt, only frame budget changes). The bottleneck is **which 8
frames you give the model**, not the model itself.

DW-KhotTaeVL-2B-QueryFrames picks the 8 frames *that match the
question* via CLIP-ViT-L/14 cosine similarity. For two task types
where 64-frame stock does *not* outperform 8-frame stock (Object
Reasoning and Temporal Reasoning per the Video-MME taxonomy), the
hybrid policy reverts to uniform sampling: frame coverage is not
the bottleneck for those questions, and CLIP scoring can mis-pick.

## Pipeline

```
For each (video, question, options[A,B,C,D]):
    1. Sample 32 uniformly-spaced candidate frames.
    2. Encode question text with CLIP-ViT-L/14 → 768-d text vector.
    3. Encode candidate frames → 768-d image vectors.
    4. Cosine similarity → pick top-8 (or uniform-8 if task is
       Object Reasoning / Temporal Reasoning, when task_type is given).
    5. Sort selected 8 frames by original temporal index.
    6. Pass 8 frames + MCQ to stock Qwen3-VL-2B-Instruct.
    7. Extract letter from output.
```
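
Steps 4 and 5 can be sketched in a few lines of NumPy. This is a minimal illustration, not the code in `dw_queryframes.py`: the function name `select_frames`, the random stand-in embeddings, and the exact spacing used for the uniform fallback are all assumptions. In the real pipeline the vectors would come from CLIP-ViT-L/14.

```python
import numpy as np


def select_frames(frame_embs, text_emb, k=8, use_uniform=False):
    """Return k candidate-frame indices in temporal order (hypothetical helper)."""
    n = frame_embs.shape[0]
    if use_uniform:
        # Uniform fallback: k evenly spaced candidates (assumed spacing rule).
        return np.linspace(0, n - 1, k).round().astype(int)
    # Cosine similarity is the dot product of L2-normalised vectors.
    f = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb)
    scores = f @ t
    top_k = np.argsort(scores)[-k:]   # indices of the k highest scores
    return np.sort(top_k)             # step 5: restore temporal order


rng = np.random.default_rng(0)
frame_embs = rng.normal(size=(32, 768))  # stand-in for 32 CLIP image vectors
text_emb = rng.normal(size=768)          # stand-in for the CLIP text vector
idx = select_frames(frame_embs, text_emb)
print(idx)
```

Sorting the selected indices matters because the VLM receives the frames as a short video; feeding them in similarity order would scramble the timeline.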

## Usage

### Install dependencies

```bash
pip install torch transformers pillow decord huggingface_hub
```

### Minimal example

```python
from dw_queryframes import QueryFrames

fv = QueryFrames(device="auto")  # auto-resolves to cuda / mps / cpu

result = fv.answer_mcq(
    video_path="cooking.mp4",
    question="What does the chef do after pouring the oil into the pot?",
    options=[
        "Chops fresh green herbs",
        "Pours broth into the pot",
        "Stirs the oil in the pot",
        "Adds salt to the pot",
    ],
    task_type=None,  # or e.g. "Action Recognition" for task-aware MCQ mode
)
print(result["pred"])              # e.g. 'B'
print(result["frames_used"])       # 'query_aware' or 'uniform_fallback'
print(result["latency_clip_s"])    # ~0.4 s
print(result["latency_gen_s"])     # ~3 s on Apple M4 MPS
```

### Two operating modes

| Mode | Input | Use | Acc 300 Q |
|---|---|---|---:|
| **MCQ mode** (no task_type) | video + question + answer options | Video-MCQ / decision-style QA without task taxonomy | **64.3 %** |
| **Task-aware MCQ mode** | + `task_type` string | benchmark or controlled workflows where task taxonomy is supplied | **66.3 %** |

Pass any of the Video-MME task labels (e.g. `"Action Recognition"`,
`"Object Reasoning"`, `"Counting Problem"`) to `task_type`. Two values
trigger the uniform-fallback path: `"Object Reasoning"` and
`"Temporal Reasoning"`. All other task strings (or `None`) use the
query-aware path.

> **MCQ mode without task_type (64.3 %, +7.3 pp)** is the default
> reported setting: it uses only the video, question, and answer
> options, with no task taxonomy.
>
> **Task-aware MCQ mode (66.3 %, +9.3 pp)** uses the `task_type`
> label supplied by Video-MME to route Object Reasoning and Temporal
> Reasoning questions to uniform sampling. This is a benchmark /
> controlled-workflow setting and is reported separately from default
> MCQ mode.
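
The routing rule described above is small enough to write out. The sketch below assumes the policy is a plain set-membership check; the names `UNIFORM_FALLBACK_TASKS` and `frame_policy` are hypothetical, while the two return strings mirror the `frames_used` values shown in the minimal example.

```python
# Task labels that revert to uniform sampling, per the text above.
UNIFORM_FALLBACK_TASKS = {"Object Reasoning", "Temporal Reasoning"}


def frame_policy(task_type=None):
    """Pick the frame-selection path for one question (illustrative only)."""
    if task_type in UNIFORM_FALLBACK_TASKS:
        return "uniform_fallback"
    # None and every other task label use CLIP-based selection.
    return "query_aware"


for label in (None, "Action Recognition", "Object Reasoning"):
    print(label, "->", frame_policy(label))
```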

## Per-task accuracy on Video-MME mini 300 Q

| Task | n | Stock 8 f | QueryFrames | Δ |
|---|---:|---:|---:|---:|
| Action Reasoning      |  9 | 0.444 | 0.667 | **+0.222** ⭐ |
| Action Recognition    | 45 | 0.489 | 0.644 | **+0.156** ⭐ |
| Attribute Perception  | 37 | 0.730 | 0.811 | **+0.081** ⭐ |
| Counting Problem      | 34 | 0.265 | 0.353 | **+0.088** ⭐ |
| Information Synopsis  | 30 | 0.800 | 0.800 |  +0.000  |
| OCR Problems          | 23 | 0.391 | 0.609 | **+0.217** ⭐ |
| Object Reasoning      | 36 | 0.722 | 0.722 |  +0.000  |
| Object Recognition    | 51 | 0.588 | 0.667 | **+0.078** ⭐ |
| Spatial Perception    | 10 | 0.600 | 0.700 | **+0.100** ⭐ |
| Spatial Reasoning     |  9 | 0.778 | 1.000 | **+0.222** ⭐ |
| Temporal Perception   |  8 | 0.625 | 0.750 | **+0.125** ⭐ |
| Temporal Reasoning    |  8 | 0.250 | 0.250 |  +0.000  |

(Task-aware MCQ mode shown; task_type provided by Video-MME dataset.
⭐ = Δ ≥ 5 pp.)
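
As a consistency check, the per-task rows can be rolled back up into the overall accuracies. The snippet only re-does arithmetic on the numbers in the table; it assumes each per-task accuracy is a rounded ratio of correct answers to `n`, so `round(n * acc)` recovers the count.

```python
# (task, n, stock 8 f accuracy, QueryFrames accuracy), copied from the table.
rows = [
    ("Action Reasoning", 9, 0.444, 0.667),
    ("Action Recognition", 45, 0.489, 0.644),
    ("Attribute Perception", 37, 0.730, 0.811),
    ("Counting Problem", 34, 0.265, 0.353),
    ("Information Synopsis", 30, 0.800, 0.800),
    ("OCR Problems", 23, 0.391, 0.609),
    ("Object Reasoning", 36, 0.722, 0.722),
    ("Object Recognition", 51, 0.588, 0.667),
    ("Spatial Perception", 10, 0.600, 0.700),
    ("Spatial Reasoning", 9, 0.778, 1.000),
    ("Temporal Perception", 8, 0.625, 0.750),
    ("Temporal Reasoning", 8, 0.250, 0.250),
]

total = sum(n for _, n, _, _ in rows)
stock_correct = sum(round(n * s) for _, n, s, _ in rows)
qf_correct = sum(round(n * q) for _, n, _, q in rows)

print(total, stock_correct, qf_correct)                           # 300 171 199
print(f"{stock_correct / total:.4f} {qf_correct / total:.4f}")    # 0.5700 0.6633
```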

## What this is NOT

- It is **not** a fine-tuned model. Qwen3-VL-2B-Instruct weights are
  unchanged. You can verify with the standard Hugging Face model
  hash check.
- It is **not** a leaderboard submission claim. The numbers above are
  on the publicly-available Video-MME mini split (300 Q, filtered to
  videos available locally via the standard mini chunks).
- It is **not** a replacement for fine-tuning when you have abundant
  domain data. For domain-shifted deployments (e.g. surveillance
  video), training-based adaptation may be required.

## Hardware

Runs on:

| Device | Notes |
|---|---|
| Apple M4 Max / M3 Pro (MPS, ≥ 32 GB RAM) | tested; ~3-4 s/q at 8 frames |
| NVIDIA A100 / H100 (CUDA) | works; faster |
| CPU (BF16-capable) | works but slow |

VRAM / unified memory needed: ~6-8 GB at 262 144 max-pixels with
8 frames. Lower `max_pixels` (e.g. to 153 600) if memory-constrained.

## Reproducibility

All numbers in this card are reproducible from a fresh clone of this
repo, using the [official Video-MME parquet](https://huggingface.co/datasets/lmms-lab/Video-MME)
(filtered to its `videos_chunked_01.zip` mini split).

The shipped scripts (`eval_videomme.py` and `build_hybrid.py`) are
**self-contained**: they have no external project dependencies beyond
the local `dw_queryframes.py` module and standard Python /
Hugging Face / PyTorch packages.

### Three-command reproduction recipe

```bash
# Install deps
pip install torch transformers pillow decord huggingface_hub pandas pyarrow

# 1. Reproduce stock-uniform-8f baseline (writes stock_uniform_300q.json)
python eval_videomme.py --mode stock-uniform --n-questions 300 \
    --out-json stock_uniform_300q.json

# 2. Reproduce MCQ mode (no task_type) (writes wild_300q.json)
python eval_videomme.py --mode wild --n-questions 300 \
    --out-json wild_300q.json

# 3. Combine into task-aware MCQ mode via the hybrid policy
python build_hybrid.py \
    --wild-json wild_300q.json \
    --stock-uniform-json stock_uniform_300q.json \
    --out-json hybrid_300q.json
```

Expected results at 300 Q (greedy decoding, `do_sample=False`,
`max_pixels=262144`):

| Output | Accuracy | Δ vs stock |
|---|---:|---:|
| `stock_uniform_300q.json` | 0.5700 | n/a |
| `wild_300q.json` (MCQ mode) | 0.6433 | +7.3 pp |
| `hybrid_300q.json` (task-aware MCQ mode) | 0.6633 | +9.3 pp |

This artifact is **fully deterministic** at greedy decoding:
re-running on the same 300 questions reproduces the same 199 / 300 = 66.3 %
in task-aware MCQ mode.

> **Caveat: sample size and split.** The 300 Q numbers above are on
> the `videos_chunked_01.zip` mini subset, which happens to be mostly
> short clips. For full-split numbers on Video-MME mini 2700 Q
> (balanced short / medium / long), see
> [Scope on the full Video-MME mini (2700 Q)](#scope-on-the-full-video-mme-mini-2700-q)
> below. This release is not a leaderboard submission.

## Scope on the full Video-MME mini (2700 Q)

After the 300 Q release, the eval was extended to the full 2700 Q
split (MCQ mode without `task_type`). Stock 53.11 %, QueryFrames
53.33 %, **Δ +0.22 pp**.

This method targets short-clip, low-frame-budget video QA. The
2700 Q split is balanced across short / medium / long-form clips;
averaging across that range dilutes the gain to roughly neutral.

## Acknowledgements / Related Work

This project builds on Qwen3-VL-2B-Instruct and uses a simple
CLIP-based query-aware frame selection policy at inference time.

Query-aware and adaptive frame selection for Video-LLMs is an active
research direction. This release is an independent, simple CLIP-based
inference-time implementation focused on small-model video MCQ /
decision-style video QA under tight frame budgets.

## License

| Component | License | Source |
|---|---|---|
| This wrapper code | Apache 2.0 | this repo |
| Base model (Qwen3-VL-2B-Instruct) | Apache 2.0 | https://huggingface.co/Qwen/Qwen3-VL-2B-Instruct |
| Frame scorer (CLIP-ViT-L/14) | MIT | https://huggingface.co/openai/clip-vit-large-patch14 |
| Eval data (Video-MME mini) | as published by lmms-lab | https://huggingface.co/datasets/lmms-lab/Video-MME |

When using or citing this work, please credit the base model:

> Built on Qwen3-VL-2B-Instruct (Apache 2.0).
> Frame selector: CLIP-ViT-L/14 (Radford et al. 2021, OpenAI, MIT).

## Citation

```bibtex
@misc{dw-khottaevl-2b-queryframes-2026,
  author = {Deaw},
  title  = {DW-KhotTaeVL-2B-QueryFrames: Query-Aware Frame Selection
            for Video MCQ on Qwen3-VL-2B-Instruct},
  year   = {2026},
  publisher = {Hugging Face},
  url    = {https://huggingface.co/commandeaw/DW-KhotTaeVL-2B-QueryFrames}
}

@misc{qwen3vl2025,
  title  = {Qwen3-VL: Multilingual Vision-Language Models},
  author = {Qwen Team},
  year   = {2025},
}

@inproceedings{radford2021clip,
  title  = {Learning Transferable Visual Models From Natural Language Supervision},
  author = {Radford, Alec and Kim, Jong Wook and others},
  booktitle = {ICML},
  year   = {2021},
}

@misc{videomme2024,
  title  = {Video-MME: The First-Ever Comprehensive Evaluation Benchmark
            of Multi-modal LLMs in Video Analysis},
  author = {Fu, Chaoyou and others},
  year   = {2024},
}
```

## Author

**Deaw** ([@commandeaw](https://huggingface.co/commandeaw)), independent
ML practitioner. Personal research release.

Issues / questions: open an issue on the model repo.