Wording hotfix: scope clarity + MCQ-mode terminology

Following audit, retire 'wild mode' / 'in-the-wild deployment' wording
in favor of:
- 'MCQ mode (no task_type)' for the default 64.3% setting
- 'Task-aware MCQ mode' for the 66.3% setting (uses dataset task taxonomy)

Add explicit Scope section: this release evaluates query-aware frame
selection for video MCQ / decision-style QA. The selector may use the
question + answer options as its CLIP query. This is appropriate for
Video-MME-style benchmarks and operational triage workflows where the
system chooses among predefined actions/alert categories. Not an
open-ended video understanding claim.

Add Motivation note: started from CCTV/video-security R&D under tight
frame-budget constraints. The artifact is general-purpose, not a
product-specific CCTV model.

Method, numbers (64.3% / 66.3%), and Qwen3-VL attribution unchanged.

Files changed (1)

README.md +49 -26

@@ -21,27 +21,46 @@ library_name: transformers
 **Built on [Qwen3-VL-2B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-2B-Instruct) (Apache 2.0).**
 A query-aware frame selection wrapper around stock Qwen3-VL-2B-Instruct
-for video multiple-choice question answering. **No model weights are
-modified** — this method ships a CLIP-ViT-L/14-driven frame selector
-plus an optional task-type-aware uniform-fallback policy as a
 wrapper around the stock model.
 On Video-MME mini at 8-frame budget, this recovers **~44 % of the
-8-frame → 64-frame stock baseline gap in wild mode, and ~56 % in
-benchmark mode**, with zero training, zero parameter changes, and
 ~+0.4 s overhead per question.
 ## TL;DR
 | Method | trainable params | Video-MME mini 300 Q (8 frames) | Δ vs stock |
 |---|---:|---:|---:|
 | Stock Qwen3-VL-2B (uniform 8 f) | 0 | 57.0 % | 0 |
-| **DW-KhotTaeVL-QueryFrames — wild mode** (no task_type) | 0 | **64.3 %** | **+7.3 pp** |
-| **DW-KhotTaeVL-QueryFrames — benchmark mode** (task_type provided by dataset) | 0 | **66.3 %** | **+9.3 pp** |
 | Stock Qwen3-VL-2B (uniform 64 f) — ceiling | 0 | 73.7 % | +16.7 pp |
 **12 of 12 task buckets non-negative; 8 strongly positive (≥ 5 pp);
-0 regressions** in benchmark mode (task_type from Video-MME dataset).
 ## Why it works
@@ -95,7 +114,7 @@ result = fv.answer_mcq(
         "Stirs the oil in the pot",
         "Adds salt to the pot",
     ],
-    task_type=None,  # or e.g. "Action Recognition" for benchmark mode
 )
 print(result["pred"])              # e.g. 'B'
 print(result["frames_used"])       # 'query_aware' or 'uniform_fallback'
@@ -105,10 +124,10 @@ print(result["latency_gen_s"])     # ~3 s on Apple M4 MPS
 ### Two operating modes
-| Mode | What you pass | When to use | Acc 300 Q |
 |---|---|---|---:|
-| **Wild** | question + options | in-the-wild deployment with unknown task taxonomy | **64.3 %** |
-| **Benchmark** | + `task_type` string | benchmark eval where the dataset itself supplies the task taxonomy (Video-MME, etc.) | **66.3 %** |
 Pass any of the Video-MME task labels (e.g. `"Action Recognition"`,
 `"Object Reasoning"`, `"Counting Problem"`) to `task_type`. Two values
@@ -116,11 +135,15 @@ trigger the uniform-fallback path: `"Object Reasoning"` and
 `"Temporal Reasoning"`. All other task strings (or `None`) use the
 query-aware path.
-> **Note on benchmark mode:** the +9.3 pp / 66.3 % number is a
-> *benchmark setting* — it relies on the dataset (Video-MME) supplying
-> the per-question task type as part of the standard input. It is
-> not achievable in deployment without that label. Wild mode (64.3 %,
-> +7.3 pp) is the in-the-wild figure when no task taxonomy is given.
 ## Per-task accuracy on Video-MME mini 300 Q
@@ -139,7 +162,7 @@ query-aware path.
 | Temporal Perception   |  8 | 0.625 | 0.750 | **+0.125** ⭐ |
 | Temporal Reasoning    |  8 | 0.250 | 0.250 |  +0.000  |
-(Benchmark mode shown — task_type provided by Video-MME dataset.
 ⭐ = Δ ≥ 5 pp.)
 ## What this is NOT
@@ -188,11 +211,11 @@ pip install torch transformers pillow decord huggingface_hub pandas pyarrow
 python eval_videomme.py --mode stock-uniform --n-questions 300 \
     --out-json stock_uniform_300q.json
-# 2. Reproduce wild-mode QA frames (writes wild_300q.json)
 python eval_videomme.py --mode wild --n-questions 300 \
     --out-json wild_300q.json
-# 3. Combine into benchmark mode via the hybrid policy
 python build_hybrid.py \
     --wild-json wild_300q.json \
     --stock-uniform-json stock_uniform_300q.json \
@@ -205,12 +228,12 @@ Expected results at 300 Q (greedy decoding, `do_sample=False`,
 | Output | Accuracy | Δ vs stock |
 |---|---:|---:|
 | `stock_uniform_300q.json` | 0.5700 | — |
-| `wild_300q.json` (wild mode) | 0.6433 | +7.3 pp |
-| `hybrid_300q.json` (benchmark mode) | 0.6633 | +9.3 pp |
 This artifact is **fully deterministic** at greedy decoding —
 re-running on the same 300 questions reproduces the same 199 / 300 = 66.3 %
-in benchmark mode.
 > **Caveat — sample size and split.** All numbers above are on the
 > Video-MME *mini* split (the 300 questions whose videos ship in
@@ -224,9 +247,9 @@ This project builds on Qwen3-VL-2B-Instruct and uses a simple
 CLIP-based query-aware frame selection policy at inference time.
 Query-aware and adaptive frame selection for Video-LLMs is an active
-research direction. DW-KhotTaeVL-2B-QueryFrames is an independent
-engineering implementation focused on small-model, low-frame-budget
-video QA and CCTV-style deployment constraints.
 ## License

 **Built on [Qwen3-VL-2B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-2B-Instruct) (Apache 2.0).**
 A query-aware frame selection wrapper around stock Qwen3-VL-2B-Instruct
+for video multiple-choice / decision-style question answering. **No model
+weights are modified** — this method ships a CLIP-ViT-L/14-driven frame
+selector plus an optional task-type-aware uniform-fallback policy as a
 wrapper around the stock model.
 On Video-MME mini at 8-frame budget, this recovers **~44 % of the
+8-frame → 64-frame stock baseline gap in MCQ mode, and ~56 % in
+task-aware MCQ mode**, with zero training, zero parameter changes, and
 ~+0.4 s overhead per question.
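As a quick check on the recovery figures above, the share of the 8-frame to 64-frame gap closed by each mode can be recomputed from the accuracies reported in this README (57.0 %, 64.3 %, 66.3 %, 73.7 %). The helper name `gap_recovered` is illustrative only and is not part of the released code.

```python
def gap_recovered(acc: float, floor: float, ceiling: float) -> float:
    """Fraction of the floor-to-ceiling accuracy gap closed by `acc`."""
    return (acc - floor) / (ceiling - floor)


STOCK_8F = 57.0   # stock Qwen3-VL-2B, uniform 8 frames
STOCK_64F = 73.7  # stock Qwen3-VL-2B, uniform 64 frames (ceiling)

for label, acc in [("MCQ mode", 64.3), ("Task-aware MCQ mode", 66.3)]:
    share = gap_recovered(acc, STOCK_8F, STOCK_64F)
    print(f"{label}: {share:.1%} of the gap")
```

The two shares come out at roughly 43.7 % and 55.7 %, which round to the ~44 % and ~56 % quoted above.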
+## Scope
+This release evaluates query-aware frame selection in a video
+multiple-choice / decision-style QA setting. The selector may use
+both the question text and the answer options as its CLIP query.
+This is appropriate for Video-MME-style MCQ benchmarks and for
+operational triage workflows where the system chooses among
+predefined actions or alert categories (e.g. *normal passage /
+restricted-zone entry / staff activity / false alarm*). It should
+**not** be read as an open-ended video-understanding benchmark claim.
+## Motivation
+This work started from CCTV / video-security R&D, where only a small
+number of frames can be sent to a VLM under latency and compute
+constraints. The released artifact is a general-purpose query-aware
+frame selector for video MCQ / decision-style video QA — not a
+product-specific CCTV model.
 ## TL;DR
 | Method | trainable params | Video-MME mini 300 Q (8 frames) | Δ vs stock |
 |---|---:|---:|---:|
 | Stock Qwen3-VL-2B (uniform 8 f) | 0 | 57.0 % | 0 |
+| **QueryFrames — MCQ mode** (no task_type) | 0 | **64.3 %** | **+7.3 pp** |
+| **QueryFrames — Task-aware MCQ mode** (task_type from dataset) | 0 | **66.3 %** | **+9.3 pp** |
 | Stock Qwen3-VL-2B (uniform 64 f) — ceiling | 0 | 73.7 % | +16.7 pp |
 **12 of 12 task buckets non-negative; 8 strongly positive (≥ 5 pp);
+0 regressions** in task-aware MCQ mode (task_type from Video-MME dataset).
 ## Why it works
         "Stirs the oil in the pot",
         "Adds salt to the pot",
     ],
+    task_type=None,  # or e.g. "Action Recognition" for task-aware MCQ mode
 )
 print(result["pred"])              # e.g. 'B'
 print(result["frames_used"])       # 'query_aware' or 'uniform_fallback'
 ### Two operating modes
+| Mode | Input | Use | Acc 300 Q |
 |---|---|---|---:|
+| **MCQ mode** (no task_type) | video + question + answer options | Video-MCQ / decision-style QA without task taxonomy | **64.3 %** |
+| **Task-aware MCQ mode** | + `task_type` string | benchmark or controlled workflows where task taxonomy is supplied | **66.3 %** |
 Pass any of the Video-MME task labels (e.g. `"Action Recognition"`,
 `"Object Reasoning"`, `"Counting Problem"`) to `task_type`. Two values
 `"Temporal Reasoning"`. All other task strings (or `None`) use the
 query-aware path.
+> **MCQ mode without task_type (64.3 %, +7.3 pp)** is the default
+> reported setting: it uses only the video, question, and answer
+> options, with no task taxonomy.
+>
+> **Task-aware MCQ mode (66.3 %, +9.3 pp)** uses the `task_type`
+> label supplied by Video-MME to route Object Reasoning and Temporal
+> Reasoning questions to uniform sampling. This is a benchmark /
+> controlled-workflow setting and is reported separately from default
+> MCQ mode.
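The routing rule described above can be sketched as a small function. This is a minimal illustration, not the released implementation: the names `UNIFORM_FALLBACK_TASKS` and `choose_frame_policy` are hypothetical, and only the two task labels and the `'query_aware'` / `'uniform_fallback'` strings are taken from this README.

```python
from typing import Optional

# Task labels that this README says are routed to uniform sampling.
UNIFORM_FALLBACK_TASKS = frozenset({"Object Reasoning", "Temporal Reasoning"})


def choose_frame_policy(task_type: Optional[str] = None) -> str:
    """Return which frame-selection path a question would take."""
    if task_type in UNIFORM_FALLBACK_TASKS:
        return "uniform_fallback"
    # None (default MCQ mode) and every other task label use the CLIP-driven selector.
    return "query_aware"
```

With `task_type=None` the function always returns `'query_aware'`, which corresponds to default MCQ mode; the two modes differ only in whether the label is available to trigger the fallback.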
 ## Per-task accuracy on Video-MME mini 300 Q
 | Temporal Perception   |  8 | 0.625 | 0.750 | **+0.125** ⭐ |
 | Temporal Reasoning    |  8 | 0.250 | 0.250 |  +0.000  |
+(Task-aware MCQ mode shown — task_type provided by Video-MME dataset.
 ⭐ = Δ ≥ 5 pp.)
 ## What this is NOT
 python eval_videomme.py --mode stock-uniform --n-questions 300 \
     --out-json stock_uniform_300q.json
+# 2. Reproduce MCQ mode (no task_type); writes wild_300q.json
 python eval_videomme.py --mode wild --n-questions 300 \
     --out-json wild_300q.json
+# 3. Combine into task-aware MCQ mode via the hybrid policy
 python build_hybrid.py \
     --wild-json wild_300q.json \
     --stock-uniform-json stock_uniform_300q.json \
 | Output | Accuracy | Δ vs stock |
 |---|---:|---:|
 | `stock_uniform_300q.json` | 0.5700 | — |
+| `wild_300q.json` (MCQ mode) | 0.6433 | +7.3 pp |
+| `hybrid_300q.json` (task-aware MCQ mode) | 0.6633 | +9.3 pp |
 This artifact is **fully deterministic** at greedy decoding —
 re-running on the same 300 questions reproduces the same 199 / 300 = 66.3 %
+in task-aware MCQ mode.
 > **Caveat — sample size and split.** All numbers above are on the
 > Video-MME *mini* split (the 300 questions whose videos ship in
 CLIP-based query-aware frame selection policy at inference time.
 Query-aware and adaptive frame selection for Video-LLMs is an active
+research direction. This release is an independent, simple CLIP-based
+inference-time implementation focused on small-model video MCQ /
+decision-style video QA under tight frame budgets.
 ## License