Video-Text-to-Text
Transformers
English
video
video-question-answering
multimodal
vision-language
qwen3-vl
inference-time
frame-selection
clip
Add scope section: full Video-MME mini 2700Q result (+0.22 pp)
Following the 300Q release, the eval was extended to the full 2700Q split; overall Δ is +0.22 pp. The README adds: (1) a Scope note callout under TL;DR, (2) an updated Caveat in the Reproducibility section that points to (3) a new 'Scope on the full Video-MME mini (2700Q)' section. The original 300Q numbers are unchanged and remain reproducible by recipe; this addition characterizes the design envelope (short-clip, low-frame-budget) on the full balanced split.
README.md
CHANGED
@@ -62,6 +62,11 @@ product-specific CCTV model.
 **12 of 12 task buckets non-negative; 8 strongly positive (≥ 5 pp);
 0 regressions** in task-aware MCQ mode (task_type from Video-MME dataset).
 
+> **Scope note.** This method targets short-clip, low-frame-budget
+> video QA. The 300 Q numbers above are inside that design envelope.
+> On the full 2700 Q split, overall Δ is **+0.22 pp** — see
+> [Scope on the full Video-MME mini (2700 Q)](#scope-on-the-full-video-mme-mini-2700-q) below.
+
 ## Why it works
 
 Stock Qwen3-VL-2B at 8 frames lags itself at 64 frames by ~17 pp.
@@ -235,11 +240,22 @@ This artifact is **fully deterministic** at greedy decoding —
 re-running on the same 300 questions reproduces the same 199 / 300 = 66.3 %
 in task-aware MCQ mode.
 
-> **Caveat — sample size and split.**
->
->
->
->
+> **Caveat — sample size and split.** The 300 Q numbers above are on
+> the `videos_chunked_01.zip` mini subset, which happens to be mostly
+> short clips. For full-split numbers on Video-MME mini 2700 Q
+> (balanced short / medium / long), see
+> [Scope on the full Video-MME mini (2700 Q)](#scope-on-the-full-video-mme-mini-2700-q)
+> below. This release is not a leaderboard submission.
+
+## Scope on the full Video-MME mini (2700 Q)
+
+After the 300 Q release, the eval was extended to the full 2700 Q
+split (MCQ mode without `task_type`). Stock 53.11 %, QueryFrames
+53.33 %, **Δ +0.22 pp**.
+
+This method targets short-clip, low-frame-budget video QA. The
+2700 Q split is balanced across short / medium / long-form clips;
+averaging across that range dilutes the gain to roughly neutral.
 
 ## Acknowledgements / Related Work
 
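The headline numbers in this diff are simple ratios, so they can be sanity-checked with a few lines of Python. The sketch below is not part of the commit. The helper names `accuracy_pct` and `count_from_pct` are hypothetical, and the correct-answer counts for the 2700 Q split are inferred from the rounded percentages in the README (53.11 % and 53.33 %), not reported by the author.

```python
def accuracy_pct(correct: int, total: int) -> float:
    """Accuracy as a percentage of questions answered correctly."""
    return 100.0 * correct / total


def count_from_pct(pct: float, total: int) -> int:
    """Nearest whole number of correct answers matching a rounded percentage.

    Assumption: the reported percentage was rounded from correct / total.
    """
    return round(pct * total / 100.0)


# 300 Q subset: the README states 199 / 300 directly.
subset_acc = accuracy_pct(199, 300)

# Full 2700 Q split: counts are inferred from the rounded percentages.
TOTAL_FULL = 2700
stock_correct = count_from_pct(53.11, TOTAL_FULL)
queryframes_correct = count_from_pct(53.33, TOTAL_FULL)

# Delta in percentage points, plus the same gap expressed in questions.
delta_pp = (accuracy_pct(queryframes_correct, TOTAL_FULL)
            - accuracy_pct(stock_correct, TOTAL_FULL))
delta_questions = queryframes_correct - stock_correct

print(f"300 Q accuracy: {subset_acc:.1f} %")
print(f"2700 Q delta: {delta_pp:+.2f} pp ({delta_questions} questions)")
```

Under those assumptions, the +0.22 pp gap on the full split corresponds to a difference of 6 questions out of 2700, which is consistent with the README's description of the result as roughly neutral.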