Wording hotfix: scope clarity + MCQ-mode terminology
Following audit, retire 'wild mode' / 'in-the-wild deployment' wording
in favor of:
- 'MCQ mode (no task_type)' for the default 64.3% setting
- 'Task-aware MCQ mode' for the 66.3% setting (uses dataset task taxonomy)
Add explicit Scope section: this release evaluates query-aware frame
selection for video MCQ / decision-style QA. The selector may use the
question + answer options as its CLIP query. This is appropriate for
Video-MME-style benchmarks and operational triage workflows where the
system chooses among predefined actions/alert categories. Not an
open-ended video understanding claim.
Add Motivation note: started from CCTV/video-security R&D under tight
frame-budget constraints. The artifact is general-purpose, not a
product-specific CCTV model.
Method, numbers (64.3% / 66.3%), and Qwen3-VL attribution unchanged.
|
@@ -21,27 +21,46 @@ library_name: transformers
|
|
| 21 |
**Built on [Qwen3-VL-2B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-2B-Instruct) (Apache 2.0).**
|
| 22 |
|
| 23 |
A query-aware frame selection wrapper around stock Qwen3-VL-2B-Instruct
|
| 24 |
-
for video multiple-choice question answering. **No model
|
| 25 |
-
modified**: this method ships a CLIP-ViT-L/14-driven frame
|
| 26 |
-
plus an optional task-type-aware uniform-fallback policy as a
|
| 27 |
wrapper around the stock model.
|
| 28 |
|
| 29 |
On Video-MME mini at 8-frame budget, this recovers **~44 % of the
|
| 30 |
-
8-frame → 64-frame stock baseline gap in
|
| 31 |
-
|
| 32 |
~+0.4 s overhead per question.
|
| 33 |
|
| 34 |
## TL;DR
|
| 35 |
|
| 36 |
| Method | trainable params | Video-MME mini 300 Q (8 frames) | Δ vs stock |
|
| 37 |
|---|---:|---:|---:|
|
| 38 |
| Stock Qwen3-VL-2B (uniform 8 f) | 0 | 57.0 % | 0 |
|
| 39 |
-
| **
|
| 40 |
-
| **
|
| 41 |
| Stock Qwen3-VL-2B (uniform 64 f), ceiling | 0 | 73.7 % | +16.7 pp |
|
| 42 |
|
| 43 |
**12 of 12 task buckets non-negative; 8 strongly positive (≥ 5 pp);
|
| 44 |
-
0 regressions** in
|
| 45 |
|
| 46 |
## Why it works
|
| 47 |
|
|
@@ -95,7 +114,7 @@ result = fv.answer_mcq(
|
|
| 95 |
"Stirs the oil in the pot",
|
| 96 |
"Adds salt to the pot",
|
| 97 |
],
|
| 98 |
-
task_type=None, # or e.g. "Action Recognition" for
|
| 99 |
)
|
| 100 |
print(result["pred"]) # e.g. 'B'
|
| 101 |
print(result["frames_used"]) # 'query_aware' or 'uniform_fallback'
|
|
@@ -105,10 +124,10 @@ print(result["latency_gen_s"]) # ~3 s on Apple M4 MPS
|
|
| 105 |
|
| 106 |
### Two operating modes
|
| 107 |
|
| 108 |
-
| Mode |
|
| 109 |
|---|---|---|---:|
|
| 110 |
-
| **
|
| 111 |
-
| **
|
| 112 |
|
| 113 |
Pass any of the Video-MME task labels (e.g. `"Action Recognition"`,
|
| 114 |
`"Object Reasoning"`, `"Counting Problem"`) to `task_type`. Two values
|
|
@@ -116,11 +135,15 @@ trigger the uniform-fallback path: `"Object Reasoning"` and
|
|
| 116 |
`"Temporal Reasoning"`. All other task strings (or `None`) use the
|
| 117 |
query-aware path.
|
| 118 |
|
| 119 |
-
> **
|
| 120 |
-
>
|
| 121 |
-
>
|
| 122 |
-
>
|
| 123 |
-
> +
|
| 124 |
|
| 125 |
## Per-task accuracy on Video-MME mini 300 Q
|
| 126 |
|
|
@@ -139,7 +162,7 @@ query-aware path.
|
|
| 139 |
| Temporal Perception | 8 | 0.625 | 0.750 | **+0.125** ★ |
|
| 140 |
| Temporal Reasoning | 8 | 0.250 | 0.250 | +0.000 |
|
| 141 |
|
| 142 |
-
(
|
| 143 |
★ = Δ ≥ 5 pp.)
|
| 144 |
|
| 145 |
## What this is NOT
|
|
@@ -188,11 +211,11 @@ pip install torch transformers pillow decord huggingface_hub pandas pyarrow
|
|
| 188 |
python eval_videomme.py --mode stock-uniform --n-questions 300 \
|
| 189 |
--out-json stock_uniform_300q.json
|
| 190 |
|
| 191 |
-
# 2. Reproduce
|
| 192 |
python eval_videomme.py --mode wild --n-questions 300 \
|
| 193 |
--out-json wild_300q.json
|
| 194 |
|
| 195 |
-
# 3. Combine into
|
| 196 |
python build_hybrid.py \
|
| 197 |
--wild-json wild_300q.json \
|
| 198 |
--stock-uniform-json stock_uniform_300q.json \
|
|
@@ -205,12 +228,12 @@ Expected results at 300 Q (greedy decoding, `do_sample=False`,
|
|
| 205 |
| Output | Accuracy | Δ vs stock |
|
| 206 |
|---|---:|---:|
|
| 207 |
| `stock_uniform_300q.json` | 0.5700 | n/a |
|
| 208 |
-
| `wild_300q.json` (
|
| 209 |
-
| `hybrid_300q.json` (
|
| 210 |
|
| 211 |
This artifact is **fully deterministic** at greedy decoding:
|
| 212 |
re-running on the same 300 questions reproduces the same 199 / 300 = 66.3 %
|
| 213 |
-
in
|
| 214 |
|
| 215 |
> **Caveat: sample size and split.** All numbers above are on the
|
| 216 |
> Video-MME *mini* split (the 300 questions whose videos ship in
|
|
@@ -224,9 +247,9 @@ This project builds on Qwen3-VL-2B-Instruct and uses a simple
|
|
| 224 |
CLIP-based query-aware frame selection policy at inference time.
|
| 225 |
|
| 226 |
Query-aware and adaptive frame selection for Video-LLMs is an active
|
| 227 |
-
research direction.
|
| 228 |
-
|
| 229 |
-
video QA
|
| 230 |
|
| 231 |
## License
|
| 232 |
|
|
|
|
| 21 |
**Built on [Qwen3-VL-2B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-2B-Instruct) (Apache 2.0).**
|
| 22 |
|
| 23 |
A query-aware frame selection wrapper around stock Qwen3-VL-2B-Instruct
|
| 24 |
+
for video multiple-choice / decision-style question answering. **No model
|
| 25 |
+
weights are modified**: this method ships a CLIP-ViT-L/14-driven frame
|
| 26 |
+
selector plus an optional task-type-aware uniform-fallback policy as a
|
| 27 |
wrapper around the stock model.
|
| 28 |
|
| 29 |
On Video-MME mini at 8-frame budget, this recovers **~44 % of the
|
| 30 |
+
8-frame → 64-frame stock baseline gap in MCQ mode, and ~56 % in
|
| 31 |
+
task-aware MCQ mode**, with zero training, zero parameter changes, and
|
| 32 |
~+0.4 s overhead per question.
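The gap-recovery percentages follow directly from the reported accuracies. The short check below redoes that arithmetic; the helper name `gap_recovered` is introduced here for illustration only and is not part of the released code.

```python
# Accuracies (percent) as reported in this model card's TL;DR table.
STOCK_8F = 57.0     # stock Qwen3-VL-2B, uniform 8 frames
STOCK_64F = 73.7    # stock Qwen3-VL-2B, uniform 64 frames (ceiling)
MCQ_MODE = 64.3     # QueryFrames, no task_type
TASK_AWARE = 66.3   # QueryFrames, task_type supplied


def gap_recovered(acc, low=STOCK_8F, high=STOCK_64F):
    """Fraction of the low-to-high accuracy gap that `acc` closes."""
    return (acc - low) / (high - low)


print(f"MCQ mode:            {gap_recovered(MCQ_MODE):.1%}")
print(f"Task-aware MCQ mode: {gap_recovered(TASK_AWARE):.1%}")
```

With these inputs the two ratios are 7.3 / 16.7 and 9.3 / 16.7, which round to the ~44 % and ~56 % quoted above.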
|
| 33 |
|
| 34 |
+
## Scope
|
| 35 |
+
|
| 36 |
+
This release evaluates query-aware frame selection in a video
|
| 37 |
+
multiple-choice / decision-style QA setting. The selector may use
|
| 38 |
+
both the question text and the answer options as its CLIP query.
|
| 39 |
+
This is appropriate for Video-MME-style MCQ benchmarks and for
|
| 40 |
+
operational triage workflows where the system chooses among
|
| 41 |
+
predefined actions or alert categories (e.g. *normal passage /
|
| 42 |
+
restricted-zone entry / staff activity / false alarm*). It should
|
| 43 |
+
**not** be read as an open-ended video-understanding benchmark claim.
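As a rough illustration of what query-aware selection means here, the sketch below ranks frames by cosine similarity to a query built from the question and answer options, then keeps the top frames in temporal order. It is a minimal sketch under assumptions: the embeddings are taken as already computed by a CLIP image/text encoder, the names `build_query`, `cosine`, and `select_frames` are hypothetical, and the released selector may score or break ties differently.

```python
import math


def build_query(question, options):
    """Concatenate question and answer options into one text query (assumed format)."""
    return question + " " + " ".join(options)


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_frames(frame_embs, query_emb, budget=8):
    """Return indices of the `budget` frames most similar to the query,
    sorted back into temporal order before being passed to the VLM."""
    scores = [cosine(f, query_emb) for f in frame_embs]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


# Toy example with 3-dimensional stand-in embeddings.
frames = [[0, 1, 0], [1, 0, 0], [0, 0, 1], [2, 0.1, 0], [0.5, 0.5, 0]]
query = [1, 0, 0]
print(select_frames(frames, query, budget=2))
```

Sorting the kept indices preserves the video's chronology, which matters for questions about event order.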
|
| 44 |
+
|
| 45 |
+
## Motivation
|
| 46 |
+
|
| 47 |
+
This work started from CCTV / video-security R&D, where only a small
|
| 48 |
+
number of frames can be sent to a VLM under latency and compute
|
| 49 |
+
constraints. The released artifact is a general-purpose query-aware
|
| 50 |
+
frame selector for video MCQ / decision-style video QA, not a
|
| 51 |
+
product-specific CCTV model.
|
| 52 |
+
|
| 53 |
## TL;DR
|
| 54 |
|
| 55 |
| Method | trainable params | Video-MME mini 300 Q (8 frames) | Δ vs stock |
|
| 56 |
|---|---:|---:|---:|
|
| 57 |
| Stock Qwen3-VL-2B (uniform 8 f) | 0 | 57.0 % | 0 |
|
| 58 |
+
| **QueryFrames: MCQ mode** (no task_type) | 0 | **64.3 %** | **+7.3 pp** |
|
| 59 |
+
| **QueryFrames: Task-aware MCQ mode** (task_type from dataset) | 0 | **66.3 %** | **+9.3 pp** |
|
| 60 |
| Stock Qwen3-VL-2B (uniform 64 f), ceiling | 0 | 73.7 % | +16.7 pp |
|
| 61 |
|
| 62 |
**12 of 12 task buckets non-negative; 8 strongly positive (≥ 5 pp);
|
| 63 |
+
0 regressions** in task-aware MCQ mode (task_type from Video-MME dataset).
|
| 64 |
|
| 65 |
## Why it works
|
| 66 |
|
|
|
|
| 114 |
"Stirs the oil in the pot",
|
| 115 |
"Adds salt to the pot",
|
| 116 |
],
|
| 117 |
+
task_type=None, # or e.g. "Action Recognition" for task-aware MCQ mode
|
| 118 |
)
|
| 119 |
print(result["pred"]) # e.g. 'B'
|
| 120 |
print(result["frames_used"]) # 'query_aware' or 'uniform_fallback'
|
|
|
|
| 124 |
|
| 125 |
### Two operating modes
|
| 126 |
|
| 127 |
+
| Mode | Input | Use | Acc 300 Q |
|
| 128 |
|---|---|---|---:|
|
| 129 |
+
| **MCQ mode** (no task_type) | video + question + answer options | Video-MCQ / decision-style QA without task taxonomy | **64.3 %** |
|
| 130 |
+
| **Task-aware MCQ mode** | + `task_type` string | benchmark or controlled workflows where task taxonomy is supplied | **66.3 %** |
|
| 131 |
|
| 132 |
Pass any of the Video-MME task labels (e.g. `"Action Recognition"`,
|
| 133 |
`"Object Reasoning"`, `"Counting Problem"`) to `task_type`. Two values
|
|
|
|
| 135 |
`"Temporal Reasoning"`. All other task strings (or `None`) use the
|
| 136 |
query-aware path.
|
| 137 |
|
| 138 |
+
> **MCQ mode without task_type (64.3 %, +7.3 pp)** is the default
|
| 139 |
+
> reported setting: it uses only the video, question, and answer
|
| 140 |
+
> options, with no task taxonomy.
|
| 141 |
+
>
|
| 142 |
+
> **Task-aware MCQ mode (66.3 %, +9.3 pp)** uses the `task_type`
|
| 143 |
+
> label supplied by Video-MME to route Object Reasoning and Temporal
|
| 144 |
+
> Reasoning questions to uniform sampling. This is a benchmark /
|
| 145 |
+
> controlled-workflow setting and is reported separately from default
|
| 146 |
+
> MCQ mode.
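The routing rule described above can be sketched in a few lines. The two fallback labels and the `'query_aware'` / `'uniform_fallback'` strings come from this card; the function names and the midpoint-based uniform sampler are assumptions made for illustration, not the shipped implementation.

```python
# Task labels that this card says are routed to uniform sampling.
UNIFORM_FALLBACK_TASKS = {"Object Reasoning", "Temporal Reasoning"}


def choose_path(task_type=None):
    """Pick the frame-selection path for a question.

    With task_type=None (MCQ mode) the query-aware path is always used.
    """
    if task_type in UNIFORM_FALLBACK_TASKS:
        return "uniform_fallback"
    return "query_aware"


def uniform_indices(n_frames, budget=8):
    """Evenly spaced frame indices: the midpoint of each of `budget` equal
    segments (one common convention, assumed here)."""
    budget = min(budget, n_frames)
    return [int((i + 0.5) * n_frames / budget) for i in range(budget)]


print(choose_path(None))                  # query_aware
print(choose_path("Action Recognition"))  # query_aware
print(choose_path("Object Reasoning"))    # uniform_fallback
print(uniform_indices(64, 8))
```

Because the fallback only fires when a matching label is supplied, omitting `task_type` reproduces default MCQ mode exactly.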
|
| 147 |
|
| 148 |
## Per-task accuracy on Video-MME mini 300 Q
|
| 149 |
|
|
|
|
| 162 |
| Temporal Perception | 8 | 0.625 | 0.750 | **+0.125** ★ |
|
| 163 |
| Temporal Reasoning | 8 | 0.250 | 0.250 | +0.000 |
|
| 164 |
|
| 165 |
+
(Task-aware MCQ mode shown; task_type provided by Video-MME dataset.
|
| 166 |
★ = Δ ≥ 5 pp.)
|
| 167 |
|
| 168 |
## What this is NOT
|
|
|
|
| 211 |
python eval_videomme.py --mode stock-uniform --n-questions 300 \
|
| 212 |
--out-json stock_uniform_300q.json
|
| 213 |
|
| 214 |
+
# 2. Reproduce MCQ mode (no task_type) (writes wild_300q.json)
|
| 215 |
python eval_videomme.py --mode wild --n-questions 300 \
|
| 216 |
--out-json wild_300q.json
|
| 217 |
|
| 218 |
+
# 3. Combine into task-aware MCQ mode via the hybrid policy
|
| 219 |
python build_hybrid.py \
|
| 220 |
--wild-json wild_300q.json \
|
| 221 |
--stock-uniform-json stock_uniform_300q.json \
|
|
|
|
| 228 |
| Output | Accuracy | Δ vs stock |
|
| 229 |
|---|---:|---:|
|
| 230 |
| `stock_uniform_300q.json` | 0.5700 | n/a |
|
| 231 |
+
| `wild_300q.json` (MCQ mode) | 0.6433 | +7.3 pp |
|
| 232 |
+
| `hybrid_300q.json` (task-aware MCQ mode) | 0.6633 | +9.3 pp |
|
| 233 |
|
| 234 |
This artifact is **fully deterministic** at greedy decoding:
|
| 235 |
re-running on the same 300 questions reproduces the same 199 / 300 = 66.3 %
|
| 236 |
+
in task-aware MCQ mode.
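Conceptually, the hybrid step takes the stock-uniform result for questions in the two fallback task types and the query-aware result for everything else. The sketch below shows that merge on in-memory records. The record fields (`question_id`, `task_type`, `correct`) and the function names are hypothetical, since the actual JSON schema written by `eval_videomme.py` and consumed by `build_hybrid.py` is not shown here.

```python
# Task labels routed to the stock-uniform result (from this card).
FALLBACK_TASKS = {"Object Reasoning", "Temporal Reasoning"}


def build_hybrid(wild_records, stock_records):
    """Merge per-question results: stock-uniform for fallback tasks,
    query-aware ("wild") for all other tasks."""
    stock_by_id = {r["question_id"]: r for r in stock_records}
    hybrid = []
    for rec in wild_records:
        if rec["task_type"] in FALLBACK_TASKS:
            hybrid.append(stock_by_id[rec["question_id"]])
        else:
            hybrid.append(rec)
    return hybrid


def accuracy(records):
    """Share of records marked correct."""
    return sum(1 for r in records if r["correct"]) / len(records)


# Toy data: three questions, not real Video-MME results.
wild = [
    {"question_id": 1, "task_type": "Action Recognition", "correct": True},
    {"question_id": 2, "task_type": "Object Reasoning", "correct": False},
    {"question_id": 3, "task_type": "Counting Problem", "correct": False},
]
stock = [
    {"question_id": 1, "task_type": "Action Recognition", "correct": False},
    {"question_id": 2, "task_type": "Object Reasoning", "correct": True},
    {"question_id": 3, "task_type": "Counting Problem", "correct": False},
]
hybrid = build_hybrid(wild, stock)
print(accuracy(wild), accuracy(stock), accuracy(hybrid))
```

Since both input runs use greedy decoding, a merge of this kind adds no randomness of its own.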
|
| 237 |
|
| 238 |
> **Caveat: sample size and split.** All numbers above are on the
|
| 239 |
> Video-MME *mini* split (the 300 questions whose videos ship in
|
|
|
|
| 247 |
CLIP-based query-aware frame selection policy at inference time.
|
| 248 |
|
| 249 |
Query-aware and adaptive frame selection for Video-LLMs is an active
|
| 250 |
+
research direction. This release is an independent, simple CLIP-based
|
| 251 |
+
inference-time implementation focused on small-model video MCQ /
|
| 252 |
+
decision-style video QA under tight frame budgets.
|
| 253 |
|
| 254 |
## License
|
| 255 |
|