---
license: apache-2.0
language:
- en
tags:
- video
- video-question-answering
- multimodal
- vision-language
- qwen3-vl
- inference-time
- frame-selection
- clip
base_model: Qwen/Qwen3-VL-2B-Instruct
pipeline_tag: video-text-to-text
library_name: transformers
---
# DW-KhotTaeVL-2B-QueryFrames
**Built on [Qwen3-VL-2B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-2B-Instruct) (Apache 2.0).**
A query-aware frame selection wrapper around stock Qwen3-VL-2B-Instruct
for video multiple-choice / decision-style question answering. **No model
weights are modified**: this method ships a CLIP-ViT-L/14-driven frame
selector plus an optional task-type-aware uniform-fallback policy as a
wrapper around the stock model.
On Video-MME mini at an 8-frame budget, this recovers **~44 % of the
8-frame → 64-frame stock baseline gap in MCQ mode, and ~56 % in
task-aware MCQ mode**, with zero training, zero parameter changes, and
roughly 0.4 s of added overhead per question.
## Scope
This release evaluates query-aware frame selection in a video
multiple-choice / decision-style QA setting. The selector may use
both the question text and the answer options as its CLIP query.
This is appropriate for Video-MME-style MCQ benchmarks and for
operational triage workflows where the system chooses among
predefined actions or alert categories (e.g. *normal passage /
restricted-zone entry / staff activity / false alarm*). It should
**not** be read as an open-ended video-understanding benchmark claim.
## Motivation
This work started from CCTV / video-security R&D, where only a small
number of frames can be sent to a VLM under latency and compute
constraints. The released artifact is a general-purpose query-aware
frame selector for video MCQ / decision-style video QA, not a
product-specific CCTV model.
## TL;DR
| Method | Trainable params | Video-MME mini 300 Q (8 frames) | Δ vs stock |
|---|---:|---:|---:|
| Stock Qwen3-VL-2B (uniform 8 f) | 0 | 57.0 % | 0 |
| **QueryFrames, MCQ mode** (no task_type) | 0 | **64.3 %** | **+7.3 pp** |
| **QueryFrames, task-aware MCQ mode** (task_type from dataset) | 0 | **66.3 %** | **+9.3 pp** |
| Stock Qwen3-VL-2B (uniform 64 f), ceiling | 0 | 73.7 % | +16.7 pp |
**12 of 12 task buckets non-negative; 9 strongly positive (≥ 5 pp);
0 regressions** in task-aware MCQ mode (task_type from Video-MME dataset).
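The gap-recovery percentages quoted for this card can be checked directly from the table. The snippet below is a small arithmetic sketch; the helper name `gap_recovered` is illustrative and not part of the shipped code.

```python
def gap_recovered(method_acc: float, low_acc: float, high_acc: float) -> float:
    """Fraction of the low-to-high baseline gap that a method closes."""
    gap = high_acc - low_acc
    if gap <= 0:
        raise ValueError("high_acc must exceed low_acc")
    return (method_acc - low_acc) / gap


# Accuracies (in %) copied from the TL;DR table.
STOCK_8F = 57.0
STOCK_64F = 73.7
MCQ_MODE = 64.3
TASK_AWARE = 66.3

mcq_recovery = gap_recovered(MCQ_MODE, STOCK_8F, STOCK_64F)
task_aware_recovery = gap_recovered(TASK_AWARE, STOCK_8F, STOCK_64F)

print(f"MCQ mode:            {mcq_recovery:.1%}")         # 43.7%
print(f"Task-aware MCQ mode: {task_aware_recovery:.1%}")  # 55.7%
```

Both values round to the ~44 % and ~56 % figures stated in the introduction.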
> **Scope note.** This method targets short-clip, low-frame-budget
> video QA. The 300 Q numbers above are inside that design envelope.
> On the full 2700 Q split, overall Δ is **+0.22 pp**; see
> [Scope on the full Video-MME mini (2700 Q)](#scope-on-the-full-video-mme-mini-2700-q) below.
## Why it works
Stock Qwen3-VL-2B at 8 frames trails its own 64-frame result by ~17 pp.
The gap is *by definition* a frame-coverage problem (same model, same
prompt, only frame budget changes). The bottleneck is **which 8
frames you give the model**, not the model itself.
DW-KhotTaeVL-2B-QueryFrames picks the 8 frames *that match the
question* via CLIP-ViT-L/14 cosine similarity. For two task types
where 64-frame stock does *not* outperform 8-frame stock (Object
Reasoning and Temporal Reasoning per the Video-MME taxonomy), the
hybrid policy reverts to uniform sampling: frame coverage is not
the bottleneck for those questions, and CLIP scoring can mis-pick.
## Pipeline
```
For each (video, question, options[A,B,C,D]):
  1. Sample 32 uniformly-spaced candidate frames.
  2. Encode question text with CLIP-ViT-L/14 → 768-d text vector.
  3. Encode candidate frames → 768-d image vectors.
  4. Cosine similarity → pick top-8 (or uniform-8 if task is
     Object Reasoning / Temporal Reasoning, when task_type is given).
  5. Sort selected 8 frames by original temporal index.
  6. Pass 8 frames + MCQ to stock Qwen3-VL-2B-Instruct.
  7. Extract letter from output.
```
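Steps 1, 4, and 5 of the pipeline can be sketched in plain Python. This is an illustrative sketch, not the code in `dw_queryframes.py`: the function names are hypothetical, the embeddings are 2-d stand-ins for the 768-d CLIP vectors, and the segment-midpoint rule for uniform sampling is an assumption.

```python
import math
from typing import List, Sequence


def uniform_indices(num_frames: int, k: int) -> List[int]:
    """Return k indices spread evenly over range(num_frames) (segment midpoints)."""
    k = min(k, num_frames)
    if k <= 0:
        return []
    return [int((i + 0.5) * num_frames / k) for i in range(k)]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_frames(text_vec, frame_vecs, candidate_idx, k=8, use_uniform=False) -> List[int]:
    """Pick k candidate frames and return their video indices in temporal order."""
    if use_uniform:
        # Fallback path: ignore scores and take evenly spaced candidates.
        positions = uniform_indices(len(candidate_idx), k)
    else:
        # Query-aware path: rank candidates by similarity to the text vector.
        scores = [cosine(text_vec, vec) for vec in frame_vecs]
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        positions = ranked[:k]
    # Step 5: restore temporal order before passing frames to the VLM.
    return sorted(candidate_idx[p] for p in positions)


# Toy run: 32 candidates from a 320-frame video; three candidates match the query.
candidates = uniform_indices(320, 32)
query = [1.0, 0.0]
relevant = {3, 10, 20}
frames = [[1.0, 0.0] if p in relevant else [0.0, 1.0] for p in range(32)]
picked = select_frames(query, frames, candidates, k=3)
print(picked)  # [35, 105, 205]
```

In a real run the text and frame vectors would come from a CLIP-ViT-L/14 encoder; only the ranking and re-sorting logic is shown here.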
## Usage
### Install dependencies
```bash
pip install torch transformers pillow decord huggingface_hub
```
### Minimal example
```python
from dw_queryframes import QueryFrames
fv = QueryFrames(device="auto") # auto-resolves to cuda / mps / cpu
result = fv.answer_mcq(
video_path="cooking.mp4",
question="What does the chef do after pouring the oil into the pot?",
options=[
"Chops fresh green herbs",
"Pours broth into the pot",
"Stirs the oil in the pot",
"Adds salt to the pot",
],
task_type=None, # or e.g. "Action Recognition" for task-aware MCQ mode
)
print(result["pred"]) # e.g. 'B'
print(result["frames_used"]) # 'query_aware' or 'uniform_fallback'
print(result["latency_clip_s"]) # ~0.4 s
print(result["latency_gen_s"]) # ~3 s on Apple M4 MPS
```
### Two operating modes
| Mode | Input | Use case | Accuracy (300 Q) |
|---|---|---|---:|
| **MCQ mode** (no task_type) | video + question + answer options | Video-MCQ / decision-style QA without task taxonomy | **64.3 %** |
| **Task-aware MCQ mode** | + `task_type` string | benchmark or controlled workflows where task taxonomy is supplied | **66.3 %** |
Pass any of the Video-MME task labels (e.g. `"Action Recognition"`,
`"Object Reasoning"`, `"Counting Problem"`) to `task_type`. Two values
trigger the uniform-fallback path: `"Object Reasoning"` and
`"Temporal Reasoning"`. All other task strings (or `None`) use the
query-aware path.
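The routing rule described above amounts to a set-membership test. A minimal sketch, assuming exact, case-sensitive string matching (the function and constant names are hypothetical; the return values mirror the `frames_used` strings from the minimal example):

```python
from typing import Optional

# The two Video-MME task labels that trigger uniform sampling.
UNIFORM_FALLBACK_TASKS = frozenset({"Object Reasoning", "Temporal Reasoning"})


def choose_frame_policy(task_type: Optional[str]) -> str:
    """Return which frame-selection path a question would take."""
    if task_type in UNIFORM_FALLBACK_TASKS:
        return "uniform_fallback"
    # None and every other task label use CLIP-based selection.
    return "query_aware"


for label in (None, "Action Recognition", "Object Reasoning", "Temporal Reasoning"):
    print(label, "->", choose_frame_policy(label))
```

Because `None` is never in the fallback set, plain MCQ mode always takes the query-aware path.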
> **MCQ mode without task_type (64.3 %, +7.3 pp)** is the default
> reported setting: it uses only the video, question, and answer
> options, with no task taxonomy.
>
> **Task-aware MCQ mode (66.3 %, +9.3 pp)** uses the `task_type`
> label supplied by Video-MME to route Object Reasoning and Temporal
> Reasoning questions to uniform sampling. This is a benchmark /
> controlled-workflow setting and is reported separately from default
> MCQ mode.
## Per-task accuracy on Video-MME mini 300 Q
| Task | n | Stock 8 f | QueryFrames | Δ |
|---|---:|---:|---:|---:|
| Action Reasoning | 9 | 0.444 | 0.667 | **+0.222** |
| Action Recognition | 45 | 0.489 | 0.644 | **+0.156** |
| Attribute Perception | 37 | 0.730 | 0.811 | **+0.081** |
| Counting Problem | 34 | 0.265 | 0.353 | **+0.088** |
| Information Synopsis | 30 | 0.800 | 0.800 | +0.000 |
| OCR Problems | 23 | 0.391 | 0.609 | **+0.217** |
| Object Reasoning | 36 | 0.722 | 0.722 | +0.000 |
| Object Recognition | 51 | 0.588 | 0.667 | **+0.078** |
| Spatial Perception | 10 | 0.600 | 0.700 | **+0.100** |
| Spatial Reasoning | 9 | 0.778 | 1.000 | **+0.222** |
| Temporal Perception | 8 | 0.625 | 0.750 | **+0.125** |
| Temporal Reasoning | 8 | 0.250 | 0.250 | +0.000 |
(Task-aware MCQ mode shown; task_type provided by the Video-MME dataset.
Bold = Δ ≥ 5 pp.)
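As a consistency check, the per-task rows can be aggregated back into the headline numbers. The sketch below assumes each accuracy is a rounded ratio of integer correct counts, so `round(n * acc)` recovers the count.

```python
# (task, n, stock 8 f accuracy, QueryFrames accuracy), copied from the table above.
PER_TASK = [
    ("Action Reasoning", 9, 0.444, 0.667),
    ("Action Recognition", 45, 0.489, 0.644),
    ("Attribute Perception", 37, 0.730, 0.811),
    ("Counting Problem", 34, 0.265, 0.353),
    ("Information Synopsis", 30, 0.800, 0.800),
    ("OCR Problems", 23, 0.391, 0.609),
    ("Object Reasoning", 36, 0.722, 0.722),
    ("Object Recognition", 51, 0.588, 0.667),
    ("Spatial Perception", 10, 0.600, 0.700),
    ("Spatial Reasoning", 9, 0.778, 1.000),
    ("Temporal Perception", 8, 0.625, 0.750),
    ("Temporal Reasoning", 8, 0.250, 0.250),
]

total_questions = sum(n for _, n, _, _ in PER_TASK)
stock_correct = sum(round(n * stock) for _, n, stock, _ in PER_TASK)
queryframes_correct = sum(round(n * qf) for _, n, _, qf in PER_TASK)
regressions = [task for task, _, stock, qf in PER_TASK if qf < stock]

print(total_questions)                                      # 300
print(stock_correct, stock_correct / total_questions)       # 171 0.57
print(queryframes_correct, round(queryframes_correct / total_questions, 4))  # 199 0.6633
print(regressions)                                          # []
```

The totals match the 57.0 % stock baseline and the 199 / 300 = 66.3 % task-aware result reported elsewhere in this card.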
## What this is NOT
- It is **not** a fine-tuned model. Qwen3-VL-2B-Instruct weights are
unchanged. You can verify with the standard Hugging Face model
hash check.
- It is **not** a leaderboard submission claim. The numbers above are
on the publicly-available Video-MME mini split (300 Q, filtered to
videos available locally via the standard mini chunks).
- It is **not** a replacement for fine-tuning when you have abundant
domain data. For domain-shifted deployments (e.g. surveillance
video), training-based adaptation may be required.
## Hardware
Runs on:
| Device | Notes |
|---|---|
| Apple M4 Max / M3 Pro (MPS, ≥ 32 GB RAM) | tested; ~3-4 s per question at 8 frames |
| NVIDIA A100 / H100 (CUDA) | works; faster |
| CPU (BF16-capable) | works but slow |
VRAM / unified memory needed: ~6-8 GB at 262 144 max-pixels with
8 frames. Lower `max_pixels` (e.g. to 153 600) if memory-constrained.
## Reproducibility
All numbers in this card are reproducible from a fresh clone of this
repo, using the [official Video-MME parquet](https://huggingface.co/datasets/lmms-lab/Video-MME)
(filtered to its `videos_chunked_01.zip` mini split).
The shipped scripts (`eval_videomme.py` and `build_hybrid.py`) are
**self-contained**: they have no external project dependencies beyond
the local `dw_queryframes.py` module and standard Python /
Hugging Face / PyTorch packages.
### Three-command reproduction recipe
```bash
# Install deps
pip install torch transformers pillow decord huggingface_hub pandas pyarrow
# 1. Reproduce stock-uniform-8f baseline (writes stock_uniform_300q.json)
python eval_videomme.py --mode stock-uniform --n-questions 300 \
--out-json stock_uniform_300q.json
# 2. Reproduce MCQ mode (no task_type) (writes wild_300q.json)
python eval_videomme.py --mode wild --n-questions 300 \
--out-json wild_300q.json
# 3. Combine into task-aware MCQ mode via the hybrid policy
python build_hybrid.py \
--wild-json wild_300q.json \
--stock-uniform-json stock_uniform_300q.json \
--out-json hybrid_300q.json
```
Expected results at 300 Q (greedy decoding, `do_sample=False`,
`max_pixels=262144`):
| Output | Accuracy | Δ vs stock |
|---|---:|---:|
| `stock_uniform_300q.json` | 0.5700 | baseline |
| `wild_300q.json` (MCQ mode) | 0.6433 | +7.3 pp |
| `hybrid_300q.json` (task-aware MCQ mode) | 0.6633 | +9.3 pp |
This artifact is **fully deterministic** at greedy decoding:
re-running on the same 300 questions reproduces the same 199 / 300 = 66.3 %
in task-aware MCQ mode.
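The hybrid step in command 3 can be pictured as a per-question merge of the two result files. The sketch below is an assumption about how such a merge could work, not the contents of `build_hybrid.py`; the record fields (`question_id`, `task_type`, `pred`, `answer`) are hypothetical and the toy records are made up for illustration.

```python
from typing import Dict, List

UNIFORM_FALLBACK_TASKS = {"Object Reasoning", "Temporal Reasoning"}


def merge_hybrid(wild: List[Dict], stock: List[Dict]) -> List[Dict]:
    """Take the stock-uniform record for fallback tasks, the query-aware record otherwise."""
    stock_by_id = {rec["question_id"]: rec for rec in stock}
    merged = []
    for rec in wild:
        if rec["task_type"] in UNIFORM_FALLBACK_TASKS:
            merged.append(stock_by_id[rec["question_id"]])
        else:
            merged.append(rec)
    return merged


def accuracy(records: List[Dict]) -> float:
    """Share of records whose predicted letter equals the reference answer."""
    return sum(rec["pred"] == rec["answer"] for rec in records) / len(records)


# Toy records (hypothetical schema and values).
wild_results = [
    {"question_id": "q1", "task_type": "OCR Problems", "pred": "B", "answer": "B"},
    {"question_id": "q2", "task_type": "Object Reasoning", "pred": "A", "answer": "C"},
]
stock_results = [
    {"question_id": "q1", "task_type": "OCR Problems", "pred": "D", "answer": "B"},
    {"question_id": "q2", "task_type": "Object Reasoning", "pred": "C", "answer": "C"},
]

hybrid_results = merge_hybrid(wild_results, stock_results)
print(accuracy(wild_results), accuracy(stock_results), accuracy(hybrid_results))  # 0.5 0.5 1.0
```

In this toy case each single-policy run gets one question right, while the merged run gets both, which is the effect the hybrid policy aims for on the two fallback task types.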
> **Caveat: sample size and split.** The 300 Q numbers above are on
> the `videos_chunked_01.zip` mini subset, which happens to be mostly
> short clips. For full-split numbers on Video-MME mini 2700 Q
> (balanced short / medium / long), see
> [Scope on the full Video-MME mini (2700 Q)](#scope-on-the-full-video-mme-mini-2700-q)
> below. This release is not a leaderboard submission.
## Scope on the full Video-MME mini (2700 Q)
After the 300 Q release, the eval was extended to the full 2700 Q
split (MCQ mode without `task_type`). Stock 53.11 %, QueryFrames
53.33 %, **Δ +0.22 pp**.
This method targets short-clip, low-frame-budget video QA. The
2700 Q split is balanced across short / medium / long-form clips;
averaging across that range dilutes the gain to roughly neutral.
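To put the +0.22 pp figure in concrete terms, the percentages can be converted back into question counts. This is a back-of-the-envelope sketch that assumes the reported percentages are rounded from integer correct counts over 2700 questions; the counts themselves are inferred, not reported.

```python
N_QUESTIONS = 2700


def implied_correct(accuracy_pct: float, n: int) -> int:
    """Nearest integer count of correct answers implied by a percentage."""
    return round(accuracy_pct / 100 * n)


stock_count = implied_correct(53.11, N_QUESTIONS)
queryframes_count = implied_correct(53.33, N_QUESTIONS)

print(stock_count, queryframes_count)   # 1434 1440
print(queryframes_count - stock_count)  # 6
```

Under that assumption, the full-split difference corresponds to about 6 questions out of 2700.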
## Acknowledgements / Related Work
This project builds on Qwen3-VL-2B-Instruct and uses a simple
CLIP-based query-aware frame selection policy at inference time.
Query-aware and adaptive frame selection for Video-LLMs is an active
research direction. This release is an independent, simple CLIP-based
inference-time implementation focused on small-model video MCQ /
decision-style video QA under tight frame budgets.
## License
| Component | License | Source |
|---|---|---|
| This wrapper code | Apache 2.0 | this repo |
| Base model (Qwen3-VL-2B-Instruct) | Apache 2.0 | https://huggingface.co/Qwen/Qwen3-VL-2B-Instruct |
| Frame scorer (CLIP-ViT-L/14) | MIT | https://huggingface.co/openai/clip-vit-large-patch14 |
| Eval data (Video-MME mini) | as published by lmms-lab | https://huggingface.co/datasets/lmms-lab/Video-MME |
When using or citing this work, please credit the base model:
> Built on Qwen3-VL-2B-Instruct (Apache 2.0).
> Frame selector: CLIP-ViT-L/14 (Radford et al. 2021, OpenAI, MIT).
## Citation
```bibtex
@misc{dw-khottaevl-2b-queryframes-2026,
author = {Deaw},
title = {DW-KhotTaeVL-2B-QueryFrames: Query-Aware Frame Selection
for Video MCQ on Qwen3-VL-2B-Instruct},
year = {2026},
publisher = {Hugging Face},
url = {https://huggingface.co/commandeaw/DW-KhotTaeVL-2B-QueryFrames}
}
@misc{qwen3vl2025,
title = {Qwen3-VL: Multilingual Vision-Language Models},
author = {Qwen Team},
year = {2025},
}
@inproceedings{radford2021clip,
title = {Learning Transferable Visual Models From Natural Language Supervision},
author = {Radford, Alec and Kim, Jong Wook and others},
booktitle = {ICML},
year = {2021},
}
@misc{videomme2024,
title = {Video-MME: The First-Ever Comprehensive Evaluation Benchmark
of Multi-modal LLMs in Video Analysis},
author = {Fu, Chaoyou and others},
year = {2024},
}
```
## Author
**Deaw** ([@commandeaw](https://huggingface.co/commandeaw)), independent
ML practitioner. Personal research release.
Issues / questions: open an issue on the model repo.