commandeaw committed on
Commit d0f5738 · verified · 1 Parent(s): 50d1d87

Add scope section: full Video-MME mini 2700Q result (+0.22 pp)


Following the 300Q release, the eval was extended to the full 2700Q split, where the overall Δ is +0.22 pp. The README adds: (1) a Scope note callout under TL;DR, (2) an updated Caveat in the Reproducibility section, which points to (3) a new 'Scope on the full Video-MME mini (2700Q)' section. The original 300Q numbers are unchanged and remain reproducible by recipe; this addition characterizes the design envelope (short-clip, low-frame-budget) on the full balanced split.
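For scale, the reported percentages can be converted back into approximate question counts. The sketch below is illustrative only: the helper names are hypothetical, and the correct-answer counts for the 2700Q run are inferred from the rounded percentages in this commit (53.11 % and 53.33 %), not reported by it. Only the 199 / 300 figure is stated directly in the README.

```python
def accuracy_pct(correct: int, total: int, ndigits: int = 2) -> float:
    """Accuracy as a percentage, rounded the way the README reports it."""
    return round(100 * correct / total, ndigits)


def implied_correct(pct: float, total: int) -> int:
    """Nearest whole number of correct answers consistent with a rounded accuracy.

    This is an inference from a rounded percentage, not a reported count.
    """
    return round(pct / 100 * total)


# Reported in the README: 199 of 300 correct on the 300Q subset.
small_split_pct = accuracy_pct(199, 300, ndigits=1)

# Reported percentages on the 2700Q split.
TOTAL_Q = 2700
STOCK_PCT = 53.11
QUERYFRAMES_PCT = 53.33

# Inferred counts (assumption: each percentage is correct / 2700, rounded to 2 places).
stock_correct = implied_correct(STOCK_PCT, TOTAL_Q)
queryframes_correct = implied_correct(QUERYFRAMES_PCT, TOTAL_Q)

# Delta in percentage points, and the number of questions it corresponds to.
delta_pp = round(QUERYFRAMES_PCT - STOCK_PCT, 2)
extra_questions = queryframes_correct - stock_correct

print(f"300Q accuracy: {small_split_pct} %")
print(f"2700Q delta: +{delta_pp} pp (about {extra_questions} questions of {TOTAL_Q})")
```

Under that assumption, +0.22 pp on 2700 questions amounts to roughly six additional correct answers, which is why the commit describes the full-split result as close to neutral.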

Files changed (1)
  1. README.md +21 -5
README.md CHANGED
@@ -62,6 +62,11 @@ product-specific CCTV model.
  **12 of 12 task buckets non-negative; 8 strongly positive (≥ 5 pp);
  0 regressions** in task-aware MCQ mode (task_type from Video-MME dataset).
 
+ > **Scope note.** This method targets short-clip, low-frame-budget
+ > video QA. The 300 Q numbers above are inside that design envelope.
+ > On the full 2700 Q split, overall Δ is **+0.22 pp** — see
+ > [Scope on the full Video-MME mini (2700 Q)](#scope-on-the-full-video-mme-mini-2700-q) below.
+
  ## Why it works
 
  Stock Qwen3-VL-2B at 8 frames lags itself at 64 frames by ~17 pp.
@@ -235,11 +240,22 @@ This artifact is **fully deterministic** at greedy decoding —
  re-running on the same 300 questions reproduces the same 199 / 300 = 66.3 %
  in task-aware MCQ mode.
 
- > **Caveat — sample size and split.** All numbers above are on the
- > Video-MME *mini* split (the 300 questions whose videos ship in
- > `videos_chunked_01.zip`). They are **not** the full 2700-question
- > Video-MME benchmark and are not a leaderboard submission. A full-
- > benchmark eval is on the future-work list.
+ > **Caveat — sample size and split.** The 300 Q numbers above are on
+ > the `videos_chunked_01.zip` mini subset, which happens to be mostly
+ > short clips. For full-split numbers on Video-MME mini 2700 Q
+ > (balanced short / medium / long), see
+ > [Scope on the full Video-MME mini (2700 Q)](#scope-on-the-full-video-mme-mini-2700-q)
+ > below. This release is not a leaderboard submission.
+
+ ## Scope on the full Video-MME mini (2700 Q)
+
+ After the 300 Q release, the eval was extended to the full 2700 Q
+ split (MCQ mode without `task_type`). Stock 53.11 %, QueryFrames
+ 53.33 %, **Δ +0.22 pp**.
+
+ This method targets short-clip, low-frame-budget video QA. The
+ 2700 Q split is balanced across short / medium / long-form clips;
+ averaging across that range dilutes the gain to roughly neutral.
 
  ## Acknowledgements / Related Work
 