commandeaw committed on
Commit c04d819 · verified · 1 Parent(s): 5e31798

Wording hotfix: scope clarity + MCQ-mode terminology


Following audit, retire 'wild mode' / 'in-the-wild deployment' wording
in favor of:
- 'MCQ mode (no task_type)' for the default 64.3% setting
- 'Task-aware MCQ mode' for the 66.3% setting (uses dataset task taxonomy)

Add explicit Scope section: this release evaluates query-aware frame
selection for video MCQ / decision-style QA. The selector may use the
question + answer options as its CLIP query. This is appropriate for
Video-MME-style benchmarks and operational triage workflows where the
system chooses among predefined actions/alert categories. Not an
open-ended video understanding claim.

Add Motivation note: started from CCTV/video-security R&D under tight
frame-budget constraints. The artifact is general-purpose, not a
product-specific CCTV model.

Method, numbers (64.3% / 66.3%), and Qwen3-VL attribution unchanged.
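The README's headline figures (~44 % and ~56 % of the 8-frame to 64-frame gap) can be re-derived from the accuracies this commit leaves unchanged. A minimal sketch of that arithmetic follows; the function name `gap_recovered` is illustrative and not part of the release.

```python
def gap_recovered(acc: float, floor: float, ceiling: float) -> float:
    """Fraction of the floor-to-ceiling accuracy gap closed by `acc`."""
    return (acc - floor) / (ceiling - floor)


# Accuracies (in %) taken from the README's TL;DR table.
STOCK_8F = 57.0    # stock Qwen3-VL-2B, uniform 8 frames
STOCK_64F = 73.7   # stock Qwen3-VL-2B, uniform 64 frames (ceiling)

mcq = gap_recovered(64.3, STOCK_8F, STOCK_64F)         # MCQ mode (no task_type)
task_aware = gap_recovered(66.3, STOCK_8F, STOCK_64F)  # task-aware MCQ mode

print(f"MCQ mode: {mcq:.1%}, task-aware MCQ mode: {task_aware:.1%}")
```

Both values round to the ~44 % and ~56 % quoted in the README.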

Files changed (1)
  1. README.md +49 -26
README.md CHANGED
@@ -21,27 +21,46 @@ library_name: transformers
21
  **Built on [Qwen3-VL-2B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-2B-Instruct) (Apache 2.0).**
22
 
23
  A query-aware frame selection wrapper around stock Qwen3-VL-2B-Instruct
24
- for video multiple-choice question answering. **No model weights are
25
- modified**: this method ships a CLIP-ViT-L/14-driven frame selector
26
- plus an optional task-type-aware uniform-fallback policy as a
27
  wrapper around the stock model.
28
 
29
  On Video-MME mini at 8-frame budget, this recovers **~44 % of the
30
- 8-frame → 64-frame stock baseline gap in wild mode, and ~56 % in
31
- benchmark mode**, with zero training, zero parameter changes, and
32
  ~+0.4 s overhead per question.
33
 
34
  ## TL;DR
35
 
36
  | Method | trainable params | Video-MME mini 300 Q (8 frames) | Δ vs stock |
37
  |---|---:|---:|---:|
38
  | Stock Qwen3-VL-2B (uniform 8 f) | 0 | 57.0 % | 0 |
39
- | **DW-KhotTaeVL-QueryFrames - wild mode** (no task_type) | 0 | **64.3 %** | **+7.3 pp** |
40
- | **DW-KhotTaeVL-QueryFrames - benchmark mode** (task_type provided by dataset) | 0 | **66.3 %** | **+9.3 pp** |
41
  | Stock Qwen3-VL-2B (uniform 64 f) - ceiling | 0 | 73.7 % | +16.7 pp |
42
 
43
  **12 of 12 task buckets non-negative; 8 strongly positive (≥ 5 pp);
44
- 0 regressions** in benchmark mode (task_type from Video-MME dataset).
45
 
46
  ## Why it works
47
 
@@ -95,7 +114,7 @@ result = fv.answer_mcq(
95
  "Stirs the oil in the pot",
96
  "Adds salt to the pot",
97
  ],
98
- task_type=None, # or e.g. "Action Recognition" for benchmark mode
99
  )
100
  print(result["pred"]) # e.g. 'B'
101
  print(result["frames_used"]) # 'query_aware' or 'uniform_fallback'
@@ -105,10 +124,10 @@ print(result["latency_gen_s"]) # ~3 s on Apple M4 MPS
105
 
106
  ### Two operating modes
107
 
108
- | Mode | What you pass | When to use | Acc 300 Q |
109
  |---|---|---|---:|
110
- | **Wild** | question + options | in-the-wild deployment with unknown task taxonomy | **64.3 %** |
111
- | **Benchmark** | + `task_type` string | benchmark eval where the dataset itself supplies the task taxonomy (Video-MME, etc.) | **66.3 %** |
112
 
113
  Pass any of the Video-MME task labels (e.g. `"Action Recognition"`,
114
  `"Object Reasoning"`, `"Counting Problem"`) to `task_type`. Two values
@@ -116,11 +135,15 @@ trigger the uniform-fallback path: `"Object Reasoning"` and
116
  `"Temporal Reasoning"`. All other task strings (or `None`) use the
117
  query-aware path.
118
 
119
- > **Note on benchmark mode:** the +9.3 pp / 66.3 % number is a
120
- > *benchmark setting*: it relies on the dataset (Video-MME) supplying
121
- > the per-question task type as part of the standard input. It is
122
- > not achievable in deployment without that label. Wild mode (64.3 %,
123
- > +7.3 pp) is the in-the-wild figure when no task taxonomy is given.
124
 
125
  ## Per-task accuracy on Video-MME mini 300 Q
126
 
@@ -139,7 +162,7 @@ query-aware path.
139
  | Temporal Perception | 8 | 0.625 | 0.750 | **+0.125** ⭐ |
140
  | Temporal Reasoning | 8 | 0.250 | 0.250 | +0.000 |
141
 
142
- (Benchmark mode shown; task_type provided by Video-MME dataset.
143
  ⭐ = Δ ≥ 5 pp.)
144
 
145
  ## What this is NOT
@@ -188,11 +211,11 @@ pip install torch transformers pillow decord huggingface_hub pandas pyarrow
188
  python eval_videomme.py --mode stock-uniform --n-questions 300 \
189
  --out-json stock_uniform_300q.json
190
 
191
- # 2. Reproduce wild-mode QA frames (writes wild_300q.json)
192
  python eval_videomme.py --mode wild --n-questions 300 \
193
  --out-json wild_300q.json
194
 
195
- # 3. Combine into benchmark mode via the hybrid policy
196
  python build_hybrid.py \
197
  --wild-json wild_300q.json \
198
  --stock-uniform-json stock_uniform_300q.json \
@@ -205,12 +228,12 @@ Expected results at 300 Q (greedy decoding, `do_sample=False`,
205
  | Output | Accuracy | Δ vs stock |
206
  |---|---:|---:|
207
  | `stock_uniform_300q.json` | 0.5700 | - |
208
- | `wild_300q.json` (wild mode) | 0.6433 | +7.3 pp |
209
- | `hybrid_300q.json` (benchmark mode) | 0.6633 | +9.3 pp |
210
 
211
  This artifact is **fully deterministic** at greedy decoding:
212
  re-running on the same 300 questions reproduces the same 199 / 300 = 66.3 %
213
- in benchmark mode.
214
 
215
  > **Caveat: sample size and split.** All numbers above are on the
216
  > Video-MME *mini* split (the 300 questions whose videos ship in
@@ -224,9 +247,9 @@ This project builds on Qwen3-VL-2B-Instruct and uses a simple
224
  CLIP-based query-aware frame selection policy at inference time.
225
 
226
  Query-aware and adaptive frame selection for Video-LLMs is an active
227
- research direction. DW-KhotTaeVL-2B-QueryFrames is an independent
228
- engineering implementation focused on small-model, low-frame-budget
229
- video QA and CCTV-style deployment constraints.
230
 
231
  ## License
232
 
 
21
  **Built on [Qwen3-VL-2B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-2B-Instruct) (Apache 2.0).**
22
 
23
  A query-aware frame selection wrapper around stock Qwen3-VL-2B-Instruct
24
+ for video multiple-choice / decision-style question answering. **No model
25
+ weights are modified**: this method ships a CLIP-ViT-L/14-driven frame
26
+ selector plus an optional task-type-aware uniform-fallback policy as a
27
  wrapper around the stock model.
28
 
29
  On Video-MME mini at 8-frame budget, this recovers **~44 % of the
30
+ 8-frame → 64-frame stock baseline gap in MCQ mode, and ~56 % in
31
+ task-aware MCQ mode**, with zero training, zero parameter changes, and
32
  ~+0.4 s overhead per question.
33
 
34
+ ## Scope
35
+
36
+ This release evaluates query-aware frame selection in a video
37
+ multiple-choice / decision-style QA setting. The selector may use
38
+ both the question text and the answer options as its CLIP query.
39
+ This is appropriate for Video-MME-style MCQ benchmarks and for
40
+ operational triage workflows where the system chooses among
41
+ predefined actions or alert categories (e.g. *normal passage /
42
+ restricted-zone entry / staff activity / false alarm*). It should
43
+ **not** be read as an open-ended video-understanding benchmark claim.
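To illustrate what "question + answer options as the CLIP query" could look like in practice, here is a minimal sketch. It assumes frame and query embeddings have already been computed elsewhere, and it ranks frames by cosine similarity and keeps the top `budget`. The query format, the top-k rule, and the names `build_query` and `select_frames` are assumptions made for illustration; the release's actual selection logic is not shown in this diff.

```python
import math
from typing import List, Sequence


def build_query(question: str, options: Sequence[str]) -> str:
    """Join the question and answer options into one text query (assumed format)."""
    return question + " " + " ".join(options)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_frames(frame_embs: Sequence[Sequence[float]],
                  query_emb: Sequence[float],
                  budget: int = 8) -> List[int]:
    """Indices of the `budget` frames most similar to the query, in temporal order."""
    ranked = sorted(range(len(frame_embs)),
                    key=lambda i: cosine(frame_embs[i], query_emb),
                    reverse=True)
    return sorted(ranked[:budget])


# Toy 2-D embeddings standing in for real CLIP features.
frames = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
query = [1.0, 0.0]
picked = select_frames(frames, query, budget=2)
```

Returning the indices in temporal order keeps the selected frames in their original sequence when they are passed to the VLM.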
44
+
45
+ ## Motivation
46
+
47
+ This work started from CCTV / video-security R&D, where only a small
48
+ number of frames can be sent to a VLM under latency and compute
49
+ constraints. The released artifact is a general-purpose query-aware
50
+ frame selector for video MCQ / decision-style video QA, not a
51
+ product-specific CCTV model.
52
+
53
  ## TL;DR
54
 
55
  | Method | trainable params | Video-MME mini 300 Q (8 frames) | Δ vs stock |
56
  |---|---:|---:|---:|
57
  | Stock Qwen3-VL-2B (uniform 8 f) | 0 | 57.0 % | 0 |
58
+ | **QueryFrames - MCQ mode** (no task_type) | 0 | **64.3 %** | **+7.3 pp** |
59
+ | **QueryFrames - Task-aware MCQ mode** (task_type from dataset) | 0 | **66.3 %** | **+9.3 pp** |
60
  | Stock Qwen3-VL-2B (uniform 64 f) - ceiling | 0 | 73.7 % | +16.7 pp |
61
 
62
  **12 of 12 task buckets non-negative; 8 strongly positive (≥ 5 pp);
63
+ 0 regressions** in task-aware MCQ mode (task_type from Video-MME dataset).
64
 
65
  ## Why it works
66
 
 
114
  "Stirs the oil in the pot",
115
  "Adds salt to the pot",
116
  ],
117
+ task_type=None, # or e.g. "Action Recognition" for task-aware MCQ mode
118
  )
119
  print(result["pred"]) # e.g. 'B'
120
  print(result["frames_used"]) # 'query_aware' or 'uniform_fallback'
 
124
 
125
  ### Two operating modes
126
 
127
+ | Mode | Input | Use | Acc 300 Q |
128
  |---|---|---|---:|
129
+ | **MCQ mode** (no task_type) | video + question + answer options | Video-MCQ / decision-style QA without task taxonomy | **64.3 %** |
130
+ | **Task-aware MCQ mode** | + `task_type` string | benchmark or controlled workflows where task taxonomy is supplied | **66.3 %** |
131
 
132
  Pass any of the Video-MME task labels (e.g. `"Action Recognition"`,
133
  `"Object Reasoning"`, `"Counting Problem"`) to `task_type`. Two values
 
135
  `"Temporal Reasoning"`. All other task strings (or `None`) use the
136
  query-aware path.
137
 
138
+ > **MCQ mode without task_type (64.3 %, +7.3 pp)** is the default
139
+ > reported setting: it uses only the video, question, and answer
140
+ > options, with no task taxonomy.
141
+ >
142
+ > **Task-aware MCQ mode (66.3 %, +9.3 pp)** uses the `task_type`
143
+ > label supplied by Video-MME to route Object Reasoning and Temporal
144
+ > Reasoning questions to uniform sampling. This is a benchmark /
145
+ > controlled-workflow setting and is reported separately from default
146
+ > MCQ mode.
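The routing rule described above is small enough to sketch directly. The two fallback labels and the `'query_aware'` / `'uniform_fallback'` strings come from the README; the function name `frames_path` and the constant name are hypothetical, and the packaged implementation may be organised differently.

```python
from typing import Optional

# The two Video-MME task labels the README says trigger uniform sampling.
UNIFORM_FALLBACK_TASKS = {"Object Reasoning", "Temporal Reasoning"}


def frames_path(task_type: Optional[str]) -> str:
    """Return which frame-sampling path a question would take.

    None (MCQ mode, no task_type) and every other label use the
    query-aware selector.
    """
    if task_type in UNIFORM_FALLBACK_TASKS:
        return "uniform_fallback"
    return "query_aware"


for label in (None, "Action Recognition", "Object Reasoning", "Temporal Reasoning"):
    print(label, "->", frames_path(label))
```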
147
 
148
  ## Per-task accuracy on Video-MME mini 300 Q
149
 
 
162
  | Temporal Perception | 8 | 0.625 | 0.750 | **+0.125** ⭐ |
163
  | Temporal Reasoning | 8 | 0.250 | 0.250 | +0.000 |
164
 
165
+ (Task-aware MCQ mode shown; task_type provided by Video-MME dataset.
166
  ⭐ = Δ ≥ 5 pp.)
167
 
168
  ## What this is NOT
 
211
  python eval_videomme.py --mode stock-uniform --n-questions 300 \
212
  --out-json stock_uniform_300q.json
213
 
214
+ # 2. Reproduce MCQ mode (no task_type); writes wild_300q.json
215
  python eval_videomme.py --mode wild --n-questions 300 \
216
  --out-json wild_300q.json
217
 
218
+ # 3. Combine into task-aware MCQ mode via the hybrid policy
219
  python build_hybrid.py \
220
  --wild-json wild_300q.json \
221
  --stock-uniform-json stock_uniform_300q.json \
 
228
  | Output | Accuracy | Δ vs stock |
229
  |---|---:|---:|
230
  | `stock_uniform_300q.json` | 0.5700 | - |
231
+ | `wild_300q.json` (MCQ mode) | 0.6433 | +7.3 pp |
232
+ | `hybrid_300q.json` (task-aware MCQ mode) | 0.6633 | +9.3 pp |
233
 
234
  This artifact is **fully deterministic** at greedy decoding:
235
  re-running on the same 300 questions reproduces the same 199 / 300 = 66.3 %
236
+ in task-aware MCQ mode.
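As a rough picture of how the task-aware result could be assembled from the two per-question result files, the sketch below takes the stock-uniform prediction for the two fallback task types and the query-aware prediction otherwise. The record fields (`id`, `task_type`, `pred`, `answer`) and both function names are hypothetical; the real JSON schema and the internals of `build_hybrid.py` are not shown in this diff.

```python
from typing import Dict, List

FALLBACK_TASKS = {"Object Reasoning", "Temporal Reasoning"}


def combine(wild: List[Dict], stock: List[Dict]) -> List[Dict]:
    """Pick, per question, the stock-uniform record for fallback tasks, else the query-aware one."""
    stock_by_id = {rec["id"]: rec for rec in stock}
    return [stock_by_id[rec["id"]] if rec["task_type"] in FALLBACK_TASKS else rec
            for rec in wild]


def accuracy(records: List[Dict]) -> float:
    """Share of records whose prediction matches the reference answer."""
    return sum(rec["pred"] == rec["answer"] for rec in records) / len(records)


# Toy records, not real benchmark outputs.
wild = [
    {"id": 1, "task_type": "Action Recognition", "pred": "B", "answer": "B"},
    {"id": 2, "task_type": "Object Reasoning", "pred": "A", "answer": "C"},
]
stock = [
    {"id": 1, "task_type": "Action Recognition", "pred": "D", "answer": "B"},
    {"id": 2, "task_type": "Object Reasoning", "pred": "C", "answer": "C"},
]
hybrid = combine(wild, stock)
print(accuracy(wild), accuracy(stock), accuracy(hybrid))

# The README's reported figure: 199 correct out of 300.
print(round(199 / 300, 4))
```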
237
 
238
  > **Caveat: sample size and split.** All numbers above are on the
239
  > Video-MME *mini* split (the 300 questions whose videos ship in
 
247
  CLIP-based query-aware frame selection policy at inference time.
248
 
249
  Query-aware and adaptive frame selection for Video-LLMs is an active
250
+ research direction. This release is an independent, simple CLIP-based
251
+ inference-time implementation focused on small-model video MCQ /
252
+ decision-style video QA under tight frame budgets.
253
 
254
  ## License
255