commandeaw commited on
Commit
84c8a9d
·
verified ·
1 Parent(s): 9bf418b

Initial release: DW-KhotTaeVL-2B-QueryFrames v1.0

Browse files

Query-aware frame selection wrapper for Qwen3-VL-2B-Instruct.
Wild mode: 64.3% on Video-MME mini 300Q (+7.3pp vs stock 57.0%).
Benchmark mode: 66.3% (+9.3pp), 12/12 task buckets non-negative.
Zero trainable parameters, no model weights modified.

Built on Qwen/Qwen3-VL-2B-Instruct (Apache 2.0).
Frame scorer: openai/clip-vit-large-patch14 (MIT).
Author: Deaw (HF: @commandeaw ).

Files changed (7) hide show
  1. LICENSE +17 -0
  2. NOTICE +39 -0
  3. README.md +272 -0
  4. build_hybrid.py +160 -0
  5. dw_queryframes.py +223 -0
  6. eval_videomme.py +233 -0
  7. example_usage.py +59 -0
LICENSE ADDED
@@ -0,0 +1,17 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ Apache License
2
+ Version 2.0, January 2004
3
+ http://www.apache.org/licenses/
4
+
5
+ Copyright 2026 Deaw (HF: @commandeaw)
6
+
7
+ Licensed under the Apache License, Version 2.0 (the "License");
8
+ you may not use this file except in compliance with the License.
9
+ You may obtain a copy of the License at
10
+
11
+ http://www.apache.org/licenses/LICENSE-2.0
12
+
13
+ Unless required by applicable law or agreed to in writing, software
14
+ distributed under the License is distributed on an "AS IS" BASIS,
15
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
16
+ See the License for the specific language governing permissions and
17
+ limitations under the License.
NOTICE ADDED
@@ -0,0 +1,39 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ DW-KhotTaeVL-2B-QueryFrames
2
+ ============================
3
+
4
+ Copyright 2026 Deaw (HF: @commandeaw)
5
+
6
+ This product is released by Deaw under the Apache License,
7
+ Version 2.0. Personal research project, not affiliated with any
8
+ commercial entity.
9
+
10
+ ----
11
+
12
+ This product builds on the following third-party components:
13
+
14
+ 1. Qwen3-VL-2B-Instruct
15
+ Copyright Alibaba Cloud / Qwen Team
16
+ Licensed under the Apache License, Version 2.0
17
+ https://huggingface.co/Qwen/Qwen3-VL-2B-Instruct
18
+
19
+ Per the Apache 2.0 license, the base model weights are reused
20
+ without modification by this derivative. Always credit the base
21
+ model when using DW-KhotTaeVL-2B-QueryFrames.
22
+
23
+ 2. CLIP-ViT-Large-Patch14
24
+ Copyright OpenAI
25
+ Licensed under the MIT License
26
+ https://huggingface.co/openai/clip-vit-large-patch14
27
+
28
+ Used as a query-aware frame scorer.
29
+
30
+ 3. Video-MME (evaluation only — not redistributed)
31
+ Copyright the original authors (Fu et al. 2024)
32
+ See: https://huggingface.co/datasets/lmms-lab/Video-MME
33
+
34
+ ----
35
+
36
+ NO WARRANTY
37
+
38
+ This software is provided "AS IS" without warranty of any kind.
39
+ See LICENSE for full terms.
README.md ADDED
@@ -0,0 +1,272 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ---
2
+ license: apache-2.0
3
+ language:
4
+ - en
5
+ tags:
6
+ - video
7
+ - video-question-answering
8
+ - multimodal
9
+ - vision-language
10
+ - qwen3-vl
11
+ - inference-time
12
+ - frame-selection
13
+ - clip
14
+ base_model: Qwen/Qwen3-VL-2B-Instruct
15
+ pipeline_tag: video-text-to-text
16
+ library_name: transformers
17
+ ---
18
+
19
+ # DW-KhotTaeVL-2B-QueryFrames
20
+
21
+ **Built on [Qwen3-VL-2B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-2B-Instruct) (Apache 2.0).**
22
+
23
+ A query-aware frame selection wrapper around stock Qwen3-VL-2B-Instruct
24
+ for video multiple-choice question answering. **No model weights are
25
+ modified** — this method ships a CLIP-ViT-L/14-driven frame selector
26
+ plus an optional task-type-aware uniform-fallback policy as a
27
+ wrapper around the stock model.
28
+
29
+ On Video-MME mini at 8-frame budget, this recovers **56 % of the
30
+ 8-frame → 64-frame stock baseline gap with zero training, zero
31
+ parameter changes, and ~+0.4 s overhead per question**.
32
+
33
+ ## TL;DR
34
+
35
+ | Method | trainable params | Video-MME mini 300 Q (8 frames) | Δ vs stock |
36
+ |---|---:|---:|---:|
37
+ | Stock Qwen3-VL-2B (uniform 8 f) | 0 | 57.0 % | 0 |
38
+ | **DW-KhotTaeVL-QueryFrames — wild mode** (no task_type) | 0 | **64.3 %** | **+7.3 pp** |
39
+ | **DW-KhotTaeVL-QueryFrames — benchmark mode** (task_type provided by dataset) | 0 | **66.3 %** | **+9.3 pp** |
40
+ | Stock Qwen3-VL-2B (uniform 64 f) — ceiling | 0 | 73.7 % | +16.7 pp |
41
+
42
+ **12 of 12 task buckets non-negative; 8 strongly positive (≥ 5 pp);
43
+ 0 regressions** in benchmark mode (task_type from Video-MME dataset).
44
+
45
+ ## Why it works
46
+
47
+ Stock Qwen3-VL-2B at 8 frames lags itself at 64 frames by ~17 pp.
48
+ The gap is *by definition* a frame-coverage problem (same model, same
49
+ prompt, only frame budget changes). The bottleneck is **which 8
50
+ frames you give the model**, not the model itself.
51
+
52
+ DW-KhotTaeVL-2B-QueryFrames picks the 8 frames *that match the
53
+ question* via CLIP-ViT-L/14 cosine similarity. For two task types
54
+ where 64-frame stock does *not* outperform 8-frame stock (Object
55
+ Reasoning and Temporal Reasoning per the Video-MME taxonomy), the
56
+ hybrid policy reverts to uniform sampling — frame coverage is not
57
+ the bottleneck for those questions, and CLIP scoring can mis-pick.
58
+
59
+ ## Pipeline
60
+
61
+ ```
62
+ For each (video, question, options[A,B,C,D]):
63
+ 1. Sample 32 uniformly-spaced candidate frames.
64
+ 2. Encode question text with CLIP-ViT-L/14 → 768-d text vector.
65
+ 3. Encode candidate frames → 768-d image vectors.
66
+ 4. Cosine similarity → pick top-8 (or uniform-8 if task is
67
+ Object Reasoning / Temporal Reasoning, when task_type is given).
68
+ 5. Sort selected 8 frames by original temporal index.
69
+ 6. Pass 8 frames + MCQ to stock Qwen3-VL-2B-Instruct.
70
+ 7. Extract letter from output.
71
+ ```
72
+
73
+ ## Usage
74
+
75
+ ### Install dependencies
76
+
77
+ ```bash
78
+ pip install torch transformers pillow decord huggingface_hub
79
+ ```
80
+
81
+ ### Minimal example
82
+
83
+ ```python
84
+ from dw_queryframes import QueryFrames
85
+
86
+ fv = QueryFrames(device="auto") # auto-resolves to cuda / mps / cpu
87
+
88
+ result = fv.answer_mcq(
89
+ video_path="cooking.mp4",
90
+ question="What does the chef do after pouring the oil into the pot?",
91
+ options=[
92
+ "Chops fresh green herbs",
93
+ "Pours broth into the pot",
94
+ "Stirs the oil in the pot",
95
+ "Adds salt to the pot",
96
+ ],
97
+ task_type=None, # or e.g. "Action Recognition" for benchmark mode
98
+ )
99
+ print(result["pred"]) # e.g. 'B'
100
+ print(result["frames_used"]) # 'query_aware' or 'uniform_fallback'
101
+ print(result["latency_clip_s"]) # ~0.4 s
102
+ print(result["latency_gen_s"]) # ~3 s on Apple M4 MPS
103
+ ```
104
+
105
+ ### Two operating modes
106
+
107
+ | Mode | What you pass | When to use | Acc 300 Q |
108
+ |---|---|---|---:|
109
+ | **Wild** | question + options | in-the-wild deployment with unknown task taxonomy | **64.3 %** |
110
+ | **Benchmark** | + `task_type` string | benchmark eval where the dataset itself supplies the task taxonomy (Video-MME, etc.) | **66.3 %** |
111
+
112
+ Pass any of the Video-MME task labels (e.g. `"Action Recognition"`,
113
+ `"Object Reasoning"`, `"Counting Problem"`) to `task_type`. Two values
114
+ trigger the uniform-fallback path: `"Object Reasoning"` and
115
+ `"Temporal Reasoning"`. All other task strings (or `None`) use the
116
+ query-aware path.
117
+
118
+ > **Note on benchmark mode:** the +9.3 pp / 66.3 % number is a
119
+ > *benchmark setting* — it relies on the dataset (Video-MME) supplying
120
+ > the per-question task type as part of the standard input. It is
121
+ > not achievable in deployment without that label. Wild mode (64.3 %,
122
+ > +7.3 pp) is the in-the-wild figure when no task taxonomy is given.
123
+
124
+ ## Per-task accuracy on Video-MME mini 300 Q
125
+
126
+ | Task | n | Stock 8 f | QueryFrames | Δ |
127
+ |---|---:|---:|---:|---:|
128
+ | Action Reasoning | 9 | 0.444 | 0.667 | **+0.222** ⭐ |
129
+ | Action Recognition | 45 | 0.489 | 0.644 | **+0.156** ⭐ |
130
+ | Attribute Perception | 37 | 0.730 | 0.811 | **+0.081** ⭐ |
131
+ | Counting Problem | 34 | 0.265 | 0.353 | **+0.088** ⭐ |
132
+ | Information Synopsis | 30 | 0.800 | 0.800 | +0.000 |
133
+ | OCR Problems | 23 | 0.391 | 0.609 | **+0.217** ⭐ |
134
+ | Object Reasoning | 36 | 0.722 | 0.722 | +0.000 |
135
+ | Object Recognition | 51 | 0.588 | 0.667 | **+0.078** ⭐ |
136
+ | Spatial Perception | 10 | 0.600 | 0.700 | **+0.100** ⭐ |
137
+ | Spatial Reasoning | 9 | 0.778 | 1.000 | **+0.222** ⭐ |
138
+ | Temporal Perception | 8 | 0.625 | 0.750 | **+0.125** ⭐ |
139
+ | Temporal Reasoning | 8 | 0.250 | 0.250 | +0.000 |
140
+
141
+ (Benchmark mode shown — task_type provided by Video-MME dataset.
142
+ ⭐ = Δ ≥ 5 pp.)
143
+
144
+ ## What this is NOT
145
+
146
+ - It is **not** a fine-tuned model. Qwen3-VL-2B-Instruct weights are
147
+ unchanged. You can verify with the standard Hugging Face model
148
+ hash check.
149
+ - It is **not** a leaderboard submission claim. The numbers above are
150
+ on the publicly-available Video-MME mini split (300 Q, filtered to
151
+ videos available locally via the standard mini chunks).
152
+ - It is **not** a replacement for fine-tuning when you have abundant
153
+ domain data. For domain-shifted deployments (e.g. surveillance
154
+ video), training-based adaptation may be required.
155
+
156
+ ## Hardware
157
+
158
+ Runs on:
159
+
160
+ | Device | Notes |
161
+ |---|---|
162
+ | Apple M4 Max / M3 Pro (MPS, ≥ 32 GB RAM) | tested; ~3-4 s/q at 8 frames |
163
+ | NVIDIA A100 / H100 (CUDA) | works; faster |
164
+ | CPU (BF16-capable) | works but slow |
165
+
166
+ VRAM / unified memory needed: ~6-8 GB at 262 144 max-pixels with
167
+ 8 frames. Lower `max_pixels` (e.g. to 153 600) if memory-constrained.
168
+
169
+ ## Reproducibility
170
+
171
+ All numbers in this card are reproducible from a fresh clone of this
172
+ repo, using the [official Video-MME parquet](https://huggingface.co/datasets/lmms-lab/Video-MME)
173
+ (filtered to its `videos_chunked_01.zip` mini split).
174
+
175
+ The shipped scripts (`eval_videomme.py` and `build_hybrid.py`) are
176
+ **self-contained** — they have no external project dependencies beyond
177
+ the local `dw_queryframes.py` module and standard Python /
178
+ Hugging Face / PyTorch packages.
179
+
180
+ ### Three-command reproduction recipe
181
+
182
+ ```bash
183
+ # Install deps
184
+ pip install torch transformers pillow decord huggingface_hub pandas pyarrow
185
+
186
+ # 1. Reproduce stock-uniform-8f baseline (writes stock_uniform_300q.json)
187
+ python eval_videomme.py --mode stock-uniform --n-questions 300 \
188
+ --out-json stock_uniform_300q.json
189
+
190
+ # 2. Reproduce wild-mode QA frames (writes wild_300q.json)
191
+ python eval_videomme.py --mode wild --n-questions 300 \
192
+ --out-json wild_300q.json
193
+
194
+ # 3. Combine into benchmark mode via the hybrid policy
195
+ python build_hybrid.py \
196
+ --wild-json wild_300q.json \
197
+ --stock-uniform-json stock_uniform_300q.json \
198
+ --out-json hybrid_300q.json
199
+ ```
200
+
201
+ Expected results at 300 Q (greedy decoding, `do_sample=False`,
202
+ `max_pixels=262144`):
203
+
204
+ | Output | Accuracy | Δ vs stock |
205
+ |---|---:|---:|
206
+ | `stock_uniform_300q.json` | 0.5700 | — |
207
+ | `wild_300q.json` (wild mode) | 0.6433 | +7.3 pp |
208
+ | `hybrid_300q.json` (benchmark mode) | 0.6633 | +9.3 pp |
209
+
210
+ This artifact is **fully deterministic** at greedy decoding —
211
+ re-running on the same 300 questions reproduces the same 199 / 300 = 66.3 %
212
+ in benchmark mode.
213
+
214
+ > **Caveat — sample size and split.** All numbers above are on the
215
+ > Video-MME *mini* split (the 300 questions whose videos ship in
216
+ > `videos_chunked_01.zip`). They are **not** the full 2700-question
217
+ > Video-MME benchmark and are not a leaderboard submission. A full-
218
+ > benchmark eval is on the future-work list.
219
+
220
+ ## License
221
+
222
+ | Component | License | Source |
223
+ |---|---|---|
224
+ | This wrapper code | Apache 2.0 | this repo |
225
+ | Base model (Qwen3-VL-2B-Instruct) | Apache 2.0 | https://huggingface.co/Qwen/Qwen3-VL-2B-Instruct |
226
+ | Frame scorer (CLIP-ViT-L/14) | MIT | https://huggingface.co/openai/clip-vit-large-patch14 |
227
+ | Eval data (Video-MME mini) | as published by lmms-lab | https://huggingface.co/datasets/lmms-lab/Video-MME |
228
+
229
+ When using or citing this work, please credit the base model:
230
+
231
+ > Built on Qwen3-VL-2B-Instruct (Apache 2.0).
232
+ > Frame selector: CLIP-ViT-L/14 (Radford et al. 2021, OpenAI, MIT).
233
+
234
+ ## Citation
235
+
236
+ ```bibtex
237
+ @misc{dw-khottaevl-2b-queryframes-2026,
238
+ author = {Deaw},
239
+ title = {DW-KhotTaeVL-2B-QueryFrames: Query-Aware Frame Selection
240
+ for Video MCQ on Qwen3-VL-2B-Instruct},
241
+ year = {2026},
242
+ publisher = {Hugging Face},
243
+ url = {https://huggingface.co/commandeaw/DW-KhotTaeVL-2B-QueryFrames}
244
+ }
245
+
246
+ @misc{qwen3vl2025,
247
+ title = {Qwen3-VL: Multilingual Vision-Language Models},
248
+ author = {Qwen Team},
249
+ year = {2025},
250
+ }
251
+
252
+ @inproceedings{radford2021clip,
253
+ title = {Learning Transferable Visual Models From Natural Language Supervision},
254
+ author = {Radford, Alec and Kim, Jong Wook and others},
255
+ booktitle = {ICML},
256
+ year = {2021},
257
+ }
258
+
259
+ @misc{videomme2024,
260
+ title = {Video-MME: The First-Ever Comprehensive Evaluation Benchmark
261
+ of Multi-modal LLMs in Video Analysis},
262
+ author = {Fu, Chaoyou and others},
263
+ year = {2024},
264
+ }
265
+ ```
266
+
267
+ ## Author
268
+
269
+ **Deaw** ([@commandeaw](https://huggingface.co/commandeaw)) — independent
270
+ ML practitioner. Personal research release.
271
+
272
+ Issues / questions: open an issue on the model repo.
build_hybrid.py ADDED
@@ -0,0 +1,160 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """Standalone benchmark-mode hybrid policy builder.
2
+
3
+ Combines two eval JSONs (wild-mode QA and stock-uniform-8f) by selecting,
4
+ per question, whichever prediction the policy says to use:
5
+
6
+ - If task_type ∈ {Object Reasoning, Temporal Reasoning} → take stock-uniform pred
7
+ (these are tasks where Video-MME 64f stock does NOT outperform 8f stock,
8
+ so query-aware frame selection cannot help).
9
+ - Else → take wild-mode (query-aware) pred.
10
+
11
+ This is a pure post-hoc combination of two prediction sets — it runs no
12
+ inference, takes no GPU. The output JSON has the same shape as the
13
+ eval JSONs, with an added ``policy_source`` field per result row.
14
+
15
+ Usage::
16
+
17
+ python eval_videomme.py --mode wild --n-questions 300 \\
18
+ --out-json wild_300q.json
19
+ python eval_videomme.py --mode stock-uniform --n-questions 300 \\
20
+ --out-json stock_uniform_300q.json
21
+ python build_hybrid.py \\
22
+ --wild-json wild_300q.json \\
23
+ --stock-uniform-json stock_uniform_300q.json \\
24
+ --out-json hybrid_300q.json
25
+ """
26
+ from __future__ import annotations
27
+
28
+ import argparse
29
+ import json
30
+ from collections import defaultdict
31
+ from pathlib import Path
32
+
33
+
34
+ # Tasks where Video-MME stock-64f does NOT outperform stock-8f on the
35
+ # 300Q mini split (measured: Object Reasoning Δ -0.083, Temporal
36
+ # Reasoning Δ +0.000). For these tasks frame coverage is not the
37
+ # bottleneck, so the hybrid policy reverts to uniform sampling.
38
+ NO_FRAME_GAIN_TASKS = frozenset({"Object Reasoning", "Temporal Reasoning"})
39
+
40
+
41
+ def load_eval(path: str | Path) -> tuple[dict, list[dict]]:
42
+ """Read a Video-MME eval JSON. Returns (summary, results)."""
43
+ d = json.loads(Path(path).read_text())
44
+ return d.get("summary", {}), d.get("results", [])
45
+
46
+
47
+ def main() -> int:
48
+ ap = argparse.ArgumentParser()
49
+ ap.add_argument("--wild-json", required=True,
50
+ help="path to wild-mode eval JSON (QA frames). "
51
+ "Produced by `eval_videomme.py --mode wild`.")
52
+ ap.add_argument("--stock-uniform-json", required=True,
53
+ help="path to stock-uniform-8f eval JSON. "
54
+ "Produced by `eval_videomme.py --mode stock-uniform`.")
55
+ ap.add_argument("--out-json", required=True,
56
+ help="output hybrid JSON path")
57
+ args = ap.parse_args()
58
+
59
+ wild_summary, wild_results = load_eval(args.wild_json)
60
+ stk_summary, stk_results = load_eval(args.stock_uniform_json)
61
+
62
+ wild_by = {r["index"]: r for r in wild_results}
63
+ stk_by = {r["index"]: r for r in stk_results}
64
+ common = sorted(set(wild_by) & set(stk_by))
65
+
66
+ if not common:
67
+ raise SystemExit(
68
+ "[hybrid] no overlapping question indices between the two "
69
+ "eval JSONs — make sure both runs used the same n_questions "
70
+ "and chunks.")
71
+
72
+ if len(common) != len(wild_by) or len(common) != len(stk_by):
73
+ print(f"[hybrid] WARN: wild={len(wild_by)} stock-uniform={len(stk_by)} "
74
+ f"overlap={len(common)}; computing on overlap only.")
75
+
76
+ hybrid_results = []
77
+ src_count = {"query_aware": 0, "uniform_fallback": 0}
78
+ for i in common:
79
+ w, s = wild_by[i], stk_by[i]
80
+ task = w.get("task_type", "")
81
+ use_uniform = task in NO_FRAME_GAIN_TASKS
82
+ chosen = s if use_uniform else w
83
+ src_count["uniform_fallback" if use_uniform else "query_aware"] += 1
84
+ hybrid_results.append({
85
+ "index": i,
86
+ "videoID": w.get("videoID"),
87
+ "task_type": task,
88
+ "gold": w.get("gold"),
89
+ "pred": chosen.get("pred"),
90
+ "correct": chosen.get("correct"),
91
+ "policy_source": ("uniform_fallback" if use_uniform else "query_aware"),
92
+ })
93
+
94
+ n = len(hybrid_results)
95
+ correct = sum(1 for r in hybrid_results if r["correct"])
96
+ acc = correct / n if n else 0.0
97
+ qa_acc = sum(1 for i in common if wild_by[i]["correct"]) / len(common)
98
+ sk_acc = sum(1 for i in common if stk_by[i]["correct"]) / len(common)
99
+
100
+ summary = {
101
+ "tag": "benchmark_mode_hybrid",
102
+ "policy": ("uniform-fallback for tasks where stock-64f does not "
103
+ "exceed stock-8f (Object Reasoning, Temporal Reasoning); "
104
+ "query-aware otherwise"),
105
+ "no_frame_gain_tasks": sorted(NO_FRAME_GAIN_TASKS),
106
+ "n_questions": n,
107
+ "accuracy": round(acc, 4),
108
+ "wild_accuracy": round(qa_acc, 4),
109
+ "stock_uniform_accuracy": round(sk_acc, 4),
110
+ "delta_hybrid_vs_stock_uniform": round(acc - sk_acc, 4),
111
+ "delta_hybrid_vs_wild": round(acc - qa_acc, 4),
112
+ "policy_source_counts": src_count,
113
+ }
114
+
115
+ out_path = Path(args.out_json)
116
+ out_path.parent.mkdir(parents=True, exist_ok=True)
117
+ out_path.write_text(json.dumps(
118
+ {"summary": summary, "results": hybrid_results},
119
+ indent=2, ensure_ascii=False))
120
+ print(f"[hybrid] wrote {out_path}")
121
+ print(f"[hybrid] hybrid acc = {acc:.4f} "
122
+ f"(wild {qa_acc:.4f}, stock-uniform {sk_acc:.4f})")
123
+ print(f"[hybrid] Δ vs stock = {acc-sk_acc:+.4f} "
124
+ f"Δ vs wild = {acc-qa_acc:+.4f}")
125
+ print(f"[hybrid] policy: query_aware={src_count['query_aware']} "
126
+ f"uniform_fallback={src_count['uniform_fallback']}")
127
+
128
+ # Per-task breakdown for transparency.
129
+ by_task = defaultdict(lambda: [0, 0])
130
+ by_task_w = defaultdict(lambda: [0, 0])
131
+ by_task_s = defaultdict(lambda: [0, 0])
132
+ for r in hybrid_results:
133
+ t = r["task_type"]
134
+ by_task[t][1] += 1
135
+ by_task[t][0] += int(r["correct"])
136
+ for r in wild_results:
137
+ t = r.get("task_type", "")
138
+ by_task_w[t][1] += 1
139
+ by_task_w[t][0] += int(r["correct"])
140
+ for r in stk_results:
141
+ t = r.get("task_type", "")
142
+ by_task_s[t][1] += 1
143
+ by_task_s[t][0] += int(r["correct"])
144
+
145
+ print(f"\n=== per-task (n / stock-uniform / wild / hybrid / Δ_hyb_vs_stock) ===")
146
+ for t in sorted(by_task):
147
+ n_t = by_task[t][1]
148
+ s_acc = by_task_s[t][0]/by_task_s[t][1] if by_task_s[t][1] else 0
149
+ w_acc = by_task_w[t][0]/by_task_w[t][1] if by_task_w[t][1] else 0
150
+ h_acc = by_task[t][0]/n_t if n_t else 0
151
+ d = h_acc - s_acc
152
+ flag = " ⭐" if d >= 0.05 else (" ⚠️" if d <= -0.05 else "")
153
+ print(f" {t:<25s} n={n_t:>3d} s={s_acc:.3f} w={w_acc:.3f} "
154
+ f"h={h_acc:.3f} Δ_hyb_vs_s={d:+.3f}{flag}")
155
+ return 0
156
+
157
+
158
+ if __name__ == "__main__":
159
+ import sys
160
+ sys.exit(main())
dw_queryframes.py ADDED
@@ -0,0 +1,223 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """DW-KhotTaeVL-2B-QueryFrames — query-aware frame selection for video MCQ.
2
+
3
+ Single-file inference module. Wraps stock Qwen3-VL-2B-Instruct with a
4
+ CLIP-ViT-L/14 query-aware frame selector and an optional task-type-aware
5
+ uniform-fallback policy.
6
+
7
+ Usage::
8
+
9
+ from dw_queryframes import QueryFrames
10
+ fv = QueryFrames(device="mps")
11
+ answer = fv.answer_mcq(
12
+ video_path="cooking.mp4",
13
+ question="What does the chef do after pouring the oil?",
14
+ options=["Stirs the oil", "Adds salt", "Pours broth", "Chops herbs"],
15
+ task_type=None, # or "Action Recognition" etc. for hybrid mode
16
+ )
17
+
18
+ License: Apache 2.0 (this code)
19
+ Copyright 2026 Deaw (HF: @commandeaw)
20
+ Base model: Qwen3-VL-2B-Instruct (Apache 2.0)
21
+ Frame scorer: openai/clip-vit-large-patch14 (MIT)
22
+
23
+ Always credit Qwen3-VL-Instruct as the base when using this work.
24
+ """
25
+ from __future__ import annotations
26
+
27
+ import re
28
+ import os
29
+ from pathlib import Path
30
+ from typing import Optional
31
+
32
+ import torch
33
+ import torch.nn.functional as F
34
+ from PIL import Image
35
+
36
+
37
+ # Tasks where stock-64f does NOT outperform stock-8f on Video-MME mini
38
+ # (measured: Object Reasoning Δ -0.083, Temporal Reasoning Δ +0.000).
39
+ # For these tasks, frame-coverage is not the bottleneck; uniform sampling
40
+ # is at least as good as query-aware. The hybrid policy uses uniform
41
+ # selection for these task types when a label is provided.
42
+ NO_FRAME_GAIN_TASKS = frozenset({"Object Reasoning", "Temporal Reasoning"})
43
+
44
+
45
+ PROMPT_TEMPLATE = (
46
+ "Select the best answer based on the video.\n\n"
47
+ "Question: {question}\n"
48
+ "Options:\n{options}\n"
49
+ "Answer with only the letter."
50
+ )
51
+
52
+ LETTER_RE = re.compile(r"\b([ABCD])\b", re.IGNORECASE)
53
+ ANSWER_LINE_RE = re.compile(r"Answer:\s*([ABCD])\b", re.IGNORECASE)
54
+
55
+
56
+ class QueryFrames:
57
+ """Query-aware frame selection over stock Qwen3-VL-2B-Instruct."""
58
+
59
+ def __init__(
60
+ self,
61
+ base_model: str = "Qwen/Qwen3-VL-2B-Instruct",
62
+ clip_model: str = "openai/clip-vit-large-patch14",
63
+ device: str = "auto",
64
+ max_pixels: int = 262_144,
65
+ max_new_tokens: int = 8,
66
+ n_frames: int = 8,
67
+ n_candidates: int = 32,
68
+ ):
69
+ os.environ.setdefault("PYTORCH_ENABLE_MPS_FALLBACK", "1")
70
+ self.device = self._resolve_device(device)
71
+ self.n_frames = n_frames
72
+ self.n_candidates = n_candidates
73
+ self.max_new_tokens = max_new_tokens
74
+
75
+ from transformers import (
76
+ AutoProcessor, Qwen3VLForConditionalGeneration,
77
+ CLIPModel, CLIPProcessor,
78
+ )
79
+ self.qwen_processor = AutoProcessor.from_pretrained(base_model, max_pixels=max_pixels)
80
+ self.qwen_model = Qwen3VLForConditionalGeneration.from_pretrained(
81
+ base_model, dtype=torch.bfloat16,
82
+ ).to(self.device).eval()
83
+ self.clip_model = CLIPModel.from_pretrained(
84
+ clip_model, torch_dtype=torch.float32,
85
+ ).to(self.device).eval()
86
+ self.clip_processor = CLIPProcessor.from_pretrained(clip_model)
87
+
88
+ @staticmethod
89
+ def _resolve_device(device: str) -> str:
90
+ if device == "auto":
91
+ if torch.backends.mps.is_available():
92
+ return "mps"
93
+ if torch.cuda.is_available():
94
+ return "cuda"
95
+ return "cpu"
96
+ return device
97
+
98
+ def sample_uniform_candidates(self, video_path: str | Path) -> list[Image.Image]:
99
+ """Sample ``n_candidates`` uniformly-spaced frames as PIL images."""
100
+ import decord
101
+ vid = decord.VideoReader(str(video_path))
102
+ total = len(vid)
103
+ step = total / (self.n_candidates + 1)
104
+ indices = [int((i + 1) * step) for i in range(self.n_candidates)]
105
+ return [Image.fromarray(vid[i].asnumpy()) for i in indices]
106
+
107
+ def select_frames(
108
+ self,
109
+ candidates: list[Image.Image],
110
+ question: str,
111
+ ) -> list[Image.Image]:
112
+ """Return ``n_frames`` images: top-K by CLIP similarity to question,
113
+ sorted by original temporal index (preserving sequence)."""
114
+ inputs = self.clip_processor(
115
+ text=[question], images=candidates,
116
+ return_tensors="pt", padding=True, truncation=True,
117
+ )
118
+ inputs = {k: v.to(self.device) for k, v in inputs.items()}
119
+ with torch.inference_mode():
120
+ text_emb = self.clip_model.get_text_features(
121
+ input_ids=inputs["input_ids"],
122
+ attention_mask=inputs["attention_mask"],
123
+ )
124
+ image_embs = self.clip_model.get_image_features(
125
+ pixel_values=inputs["pixel_values"]
126
+ )
127
+ text_emb = F.normalize(text_emb, dim=-1)
128
+ image_embs = F.normalize(image_embs, dim=-1)
129
+ sims = (text_emb @ image_embs.T).squeeze(0).float().cpu()
130
+ topk = sims.topk(self.n_frames).indices.tolist()
131
+ topk_sorted = sorted(topk)
132
+ return [candidates[i] for i in topk_sorted]
133
+
134
+ def select_uniform(self, candidates: list[Image.Image]) -> list[Image.Image]:
135
+ """Return ``n_frames`` images sampled uniformly from candidates."""
136
+ step = len(candidates) / self.n_frames
137
+ idx = [int((k + 0.5) * step) for k in range(self.n_frames)]
138
+ idx = [min(i, len(candidates) - 1) for i in idx]
139
+ return [candidates[i] for i in idx]
140
+
141
+ def answer_mcq(
142
+ self,
143
+ video_path: str | Path,
144
+ question: str,
145
+ options: list[str],
146
+ task_type: Optional[str] = None,
147
+ ) -> dict:
148
+ """Answer one MCQ question on a video.
149
+
150
+ Args:
151
+ video_path: path to .mp4 (or any decord-readable video)
152
+ question: string question (no options)
153
+ options: list of 4 option strings (will be lettered A-D)
154
+ task_type: optional task category. If provided and matches
155
+ a known no-frame-gain task, falls back to
156
+ uniform sampling for collision-safe behavior.
157
+
158
+ Returns:
159
+ dict with keys: pred (letter), raw (model output),
160
+ frames_used ("query_aware" | "uniform_fallback"),
161
+ n_candidates, latency_clip_s, latency_gen_s.
162
+ """
163
+ import time
164
+ candidates = self.sample_uniform_candidates(video_path)
165
+
166
+ # Decide policy.
167
+ use_uniform = task_type in NO_FRAME_GAIN_TASKS
168
+ t1 = time.time()
169
+ if use_uniform:
170
+ frames = self.select_uniform(candidates)
171
+ else:
172
+ frames = self.select_frames(candidates, question)
173
+ clip_dt = time.time() - t1
174
+
175
+ # Build Qwen prompt and run inference.
176
+ opts_text = "\n".join(f"{chr(65+i)}. {str(o).strip()}"
177
+ for i, o in enumerate(options))
178
+ prompt = PROMPT_TEMPLATE.format(question=question, options=opts_text)
179
+ messages = [{"role": "user", "content":
180
+ [{"type": "image"} for _ in frames]
181
+ + [{"type": "text", "text": prompt}]}]
182
+ text_in = self.qwen_processor.apply_chat_template(
183
+ messages, tokenize=False, add_generation_prompt=True,
184
+ )
185
+ inputs = self.qwen_processor(
186
+ text=[text_in], images=frames,
187
+ return_tensors="pt", padding=True,
188
+ )
189
+ inputs = {k: v.to(self.device) for k, v in inputs.items()}
190
+ t2 = time.time()
191
+ with torch.inference_mode():
192
+ out_ids = self.qwen_model.generate(
193
+ **inputs,
194
+ max_new_tokens=self.max_new_tokens,
195
+ do_sample=False,
196
+ temperature=1.0,
197
+ )
198
+ gen_dt = time.time() - t2
199
+ new_tokens = out_ids[0, inputs["input_ids"].shape[1]:]
200
+ raw = self.qwen_processor.tokenizer.decode(
201
+ new_tokens, skip_special_tokens=True,
202
+ )
203
+ pred = self._extract_letter(raw)
204
+ return {
205
+ "pred": pred,
206
+ "raw": raw,
207
+ "frames_used": "uniform_fallback" if use_uniform else "query_aware",
208
+ "n_candidates": self.n_candidates,
209
+ "latency_clip_s": round(clip_dt, 3),
210
+ "latency_gen_s": round(gen_dt, 3),
211
+ }
212
+
213
+ @staticmethod
214
+ def _extract_letter(text: str) -> Optional[str]:
215
+ s = text or ""
216
+ m = ANSWER_LINE_RE.search(s)
217
+ if m:
218
+ return m.group(1).upper()
219
+ m = LETTER_RE.search(s)
220
+ return m.group(1).upper() if m else None
221
+
222
+
223
+ __all__ = ["QueryFrames", "NO_FRAME_GAIN_TASKS"]
eval_videomme.py ADDED
@@ -0,0 +1,233 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """Standalone Video-MME mini eval for DW-KhotTaeVL-2B-QueryFrames.
2
+
3
+ This script reproduces the wild-mode QA-frame numbers reported in the
4
+ model card. It is fully self-contained — only depends on the
5
+ `dw_queryframes.py` module shipped in this same directory plus
6
+ publicly-available datasets / models from Hugging Face.
7
+
8
+ Usage::
9
+
10
+ pip install torch transformers pillow decord huggingface_hub pandas pyarrow
11
+
12
+ # Wild mode (query-aware frame selection)
13
+ python eval_videomme.py --mode wild --n-questions 50
14
+
15
+ # Stock baseline (uniform 8 frames; matches the stock numbers
16
+ # in the model card)
17
+ python eval_videomme.py --mode stock-uniform --n-questions 50
18
+
19
+ For benchmark-mode evaluation (uses Video-MME's own task_type label
20
+ to pick uniform-fallback for Object/Temporal Reasoning), run both
21
+ modes above then combine via ``build_hybrid.py``.
22
+
23
+ Outputs JSON with ``summary`` + ``results`` keys.
24
+ """
25
+ from __future__ import annotations
26
+
27
+ import argparse
28
+ import json
29
+ import os
30
+ import re
31
+ import sys
32
+ import time
33
+ import zipfile
34
+ from pathlib import Path
35
+
36
+ import pandas as pd
37
+ from huggingface_hub import hf_hub_download
38
+ from PIL import Image
39
+
40
+
41
+ # ---------------------------------------------------------------------------
42
+ # Public Video-MME mini assets (lmms-lab/Video-MME on Hugging Face).
43
+ # ---------------------------------------------------------------------------
44
+ REPO_ID = "lmms-lab/Video-MME"
45
+ REPO_TYPE = "dataset"
46
+ DEFAULT_CHUNKS = ["videos_chunked_01.zip"]
47
+ PARQUET_NAME = "videomme/test-00000-of-00001.parquet"
48
+
49
+ # Cache lives next to this script so a fresh ``git clone`` of the HF
50
+ # repo can reproduce results without touching the user's home directory.
51
+ CACHE_DIR = Path(__file__).resolve().parent / "cache" / "videomme_mini"
52
+ CACHE_DIR.mkdir(parents=True, exist_ok=True)
53
+
54
+ PROMPT_TEMPLATE = (
55
+ "This is a representative frame from a video.\n"
56
+ "Select the best answer based on the video.\n\n"
57
+ "Question: {question}\n"
58
+ "Options:\n{options}\n"
59
+ "Answer with only the letter."
60
+ )
61
+
62
+ ANSWER_RE = re.compile(r"\b([ABCD])\b", re.IGNORECASE)
63
+ ALPTD_ANSWER_RE = re.compile(r"Answer:\s*([ABCD])\b", re.IGNORECASE)
64
+
65
+
66
+ # ---------------------------------------------------------------------------
67
+ # Asset management — fetch + unzip into CACHE_DIR.
68
+ # ---------------------------------------------------------------------------
69
+ def download_assets(chunks: list[str]) -> tuple[Path, list[Path]]:
70
+ print(f"[eval] ensuring {PARQUET_NAME} ...")
71
+ pq_path = Path(hf_hub_download(
72
+ repo_id=REPO_ID, repo_type=REPO_TYPE, filename=PARQUET_NAME,
73
+ cache_dir=str(CACHE_DIR / "hf"),
74
+ ))
75
+ zip_paths: list[Path] = []
76
+ for name in chunks:
77
+ zp = Path(hf_hub_download(
78
+ repo_id=REPO_ID, repo_type=REPO_TYPE, filename=name,
79
+ cache_dir=str(CACHE_DIR / "hf"),
80
+ ))
81
+ zip_paths.append(zp)
82
+ return pq_path, zip_paths
83
+
84
+
85
+ def unzip_chunks(zip_paths: list[Path]) -> Path:
86
+ video_dir = CACHE_DIR / "video"
87
+ video_dir.mkdir(parents=True, exist_ok=True)
88
+ for zp in zip_paths:
89
+ existing = {p.stem for p in video_dir.glob("*.mp4")}
90
+ with zipfile.ZipFile(zp, "r") as zf:
91
+ to_extract = [
92
+ m for m in zf.namelist()
93
+ if m.endswith(".mp4") and Path(m).stem not in existing
94
+ ]
95
+ if to_extract:
96
+ print(f"[eval] extracting {len(to_extract)} mp4s from {zp.name}")
97
+ for m in to_extract:
98
+ with zf.open(m) as src, open(video_dir / Path(m).name, "wb") as dst:
99
+ dst.write(src.read())
100
+ return video_dir
101
+
102
+
103
+ def load_questions(pq_path: Path, video_dir: Path, limit: int) -> pd.DataFrame:
104
+ df = pd.read_parquet(pq_path)
105
+ ids = {p.stem for p in video_dir.glob("*.mp4")}
106
+ df = df[df["videoID"].isin(ids)].reset_index(drop=True)
107
+ if limit > 0 and len(df) > limit:
108
+ df = df.iloc[:limit].copy()
109
+ print(f"[eval] using {len(df)} questions")
110
+ return df
111
+
112
+
113
+ def format_options(options) -> str:
114
+ return "\n".join(str(o).strip() for o in options)
115
+
116
+
117
+ def extract_letter(text: str) -> str | None:
118
+ s = text or ""
119
+ m = ALPTD_ANSWER_RE.search(s)
120
+ if m:
121
+ return m.group(1).upper()
122
+ m = ANSWER_RE.search(s)
123
+ return m.group(1).upper() if m else None
124
+
125
+
126
+ # ---------------------------------------------------------------------------
127
+ # Frame selection lives in the local QueryFrames module.
128
+ # ---------------------------------------------------------------------------
129
+ sys.path.insert(0, str(Path(__file__).resolve().parent))
130
+ from dw_queryframes import QueryFrames # noqa: E402
131
+
132
+
133
+ def main() -> int:
134
+ ap = argparse.ArgumentParser()
135
+ ap.add_argument("--base", default="Qwen/Qwen3-VL-2B-Instruct")
136
+ ap.add_argument("--clip-model", default="openai/clip-vit-large-patch14")
137
+ ap.add_argument("--mode", choices=["wild", "stock-uniform"],
138
+ default="wild",
139
+ help="'wild' = query-aware (top-K of N candidates); "
140
+ "'stock-uniform' = stock baseline (uniform 8 frames)")
141
+ ap.add_argument("--tag", default="")
142
+ ap.add_argument("--n-questions", type=int, default=50)
143
+ ap.add_argument("--n-frames", type=int, default=8)
144
+ ap.add_argument("--n-candidates", type=int, default=32)
145
+ ap.add_argument("--max-pixels", type=int, default=262144)
146
+ ap.add_argument("--max-new-tokens", type=int, default=8)
147
+ ap.add_argument("--out-json", default=None,
148
+ help="output JSON path (auto-named if omitted)")
149
+ ap.add_argument("--chunks", nargs="+", default=DEFAULT_CHUNKS)
150
+ args = ap.parse_args()
151
+
152
+ pq_path, zip_paths = download_assets(args.chunks)
153
+ video_dir = unzip_chunks(zip_paths)
154
+ df = load_questions(pq_path, video_dir, args.n_questions)
155
+
156
+ os.environ.setdefault("PYTORCH_ENABLE_MPS_FALLBACK", "1")
157
+
158
+ fv = QueryFrames(
159
+ base_model=args.base,
160
+ clip_model=args.clip_model,
161
+ device="auto",
162
+ max_pixels=args.max_pixels,
163
+ max_new_tokens=args.max_new_tokens,
164
+ n_frames=args.n_frames,
165
+ n_candidates=args.n_candidates,
166
+ )
167
+
168
+ results = []
169
+ correct = 0
170
+ t0 = time.time()
171
+ for i, row in df.iterrows():
172
+ video_path = video_dir / f"{row['videoID']}.mp4"
173
+
174
+ # Wild mode = query-aware (task_type=None lets QA path run).
175
+ # Stock-uniform = pass a known no-frame-gain task name to force
176
+ # the uniform-fallback path (matches stock 8f
177
+ # baseline behavior).
178
+ forced_uniform = (args.mode == "stock-uniform")
179
+ out = fv.answer_mcq(
180
+ video_path=video_path,
181
+ question=row["question"],
182
+ options=list(row["options"]),
183
+ task_type=("Object Reasoning" if forced_uniform else None),
184
+ )
185
+ gold = row["answer"].strip().upper()
186
+ ok = out["pred"] == gold
187
+ correct += int(ok)
188
+ results.append({
189
+ "index": int(i),
190
+ "videoID": row["videoID"],
191
+ "task_type": row.get("task_type", ""),
192
+ "gold": gold,
193
+ "pred": out["pred"],
194
+ "raw": out["raw"][:200],
195
+ "frames_used": out["frames_used"],
196
+ "latency_clip_s": out["latency_clip_s"],
197
+ "latency_gen_s": out["latency_gen_s"],
198
+ "correct": ok,
199
+ })
200
+ run = correct / (i + 1)
201
+ print(f"[eval] [{i+1}/{len(df)}] gold={gold} pred={out['pred']} "
202
+ f"acc_so_far={run:.3f} clip={out['latency_clip_s']}s "
203
+ f"gen={out['latency_gen_s']}s", flush=True)
204
+
205
+ n = len(results)
206
+ acc = correct / n if n else 0.0
207
+ summary = {
208
+ "model_base": args.base,
209
+ "clip_model": args.clip_model,
210
+ "mode": args.mode,
211
+ "tag": args.tag,
212
+ "n_questions": n,
213
+ "n_frames": args.n_frames,
214
+ "n_candidates": args.n_candidates,
215
+ "max_pixels": args.max_pixels,
216
+ "max_new_tokens": args.max_new_tokens,
217
+ "accuracy": round(acc, 4),
218
+ "wall_time_s": round(time.time() - t0, 1),
219
+ }
220
+
221
+ out_path = args.out_json
222
+ if out_path is None:
223
+ tag = (args.tag or args.mode)
224
+ out_path = str(CACHE_DIR.parent / f"eval_{tag}_{n}q.json")
225
+ Path(out_path).parent.mkdir(parents=True, exist_ok=True)
226
+ Path(out_path).write_text(json.dumps(
227
+ {"summary": summary, "results": results}, indent=2))
228
+ print(f"\n[eval] mode={args.mode} acc={acc:.4f} ({correct}/{n}) saved {out_path}")
229
+ return 0
230
+
231
+
232
+ if __name__ == "__main__":
233
+ sys.exit(main())
example_usage.py ADDED
@@ -0,0 +1,59 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """Example: run DW-KhotTaeVL-2B-QueryFrames on a single video MCQ.
2
+
3
+ Requirements::
4
+
5
+ pip install torch transformers pillow decord huggingface_hub
6
+
7
+ This script loads the QueryFrames wrapper, samples 32 candidate frames
8
+ from the input video, picks the 8 most relevant to the question via
9
+ CLIP-ViT-L/14, and answers via stock Qwen3-VL-2B-Instruct.
10
+ """
11
+ from dw_queryframes import QueryFrames
12
+
13
+
14
+ def main() -> None:
15
+ fv = QueryFrames(
16
+ base_model="Qwen/Qwen3-VL-2B-Instruct",
17
+ clip_model="openai/clip-vit-large-patch14",
18
+ device="auto",
19
+ n_frames=8,
20
+ n_candidates=32,
21
+ )
22
+
23
+ # Wild-mode example (no task taxonomy known).
24
+ result = fv.answer_mcq(
25
+ video_path="example.mp4",
26
+ question="What does the chef do after pouring the oil into the pot?",
27
+ options=[
28
+ "Chops fresh green herbs",
29
+ "Pours broth into the pot",
30
+ "Stirs the oil in the pot",
31
+ "Adds salt to the pot",
32
+ ],
33
+ )
34
+ print("[wild mode]")
35
+ print(f" pred : {result['pred']}")
36
+ print(f" raw output : {result['raw']!r}")
37
+ print(f" frames used : {result['frames_used']}")
38
+ print(f" CLIP latency : {result['latency_clip_s']} s")
39
+ print(f" GEN latency : {result['latency_gen_s']} s")
40
+
41
+ # Task-aware example (when task taxonomy is provided, e.g. Video-MME).
42
+ result2 = fv.answer_mcq(
43
+ video_path="example.mp4",
44
+ question="What is happening to the cabbage in the frying pan?",
45
+ options=[
46
+ "It is being stirred",
47
+ "It is being chopped",
48
+ "It is being served",
49
+ "It is being washed",
50
+ ],
51
+ task_type="Object Reasoning", # → uniform-fallback path
52
+ )
53
+ print("\n[task-aware mode]")
54
+ print(f" pred : {result2['pred']}")
55
+ print(f" frames used : {result2['frames_used']}") # 'uniform_fallback'
56
+
57
+
58
+ if __name__ == "__main__":
59
+ main()