Initial release: DW-KhotTaeVL-2B-QueryFrames v1.0
Query-aware frame selection wrapper for Qwen3-VL-2B-Instruct.
Wild mode: 64.3% on Video-MME mini 300Q (+7.3pp vs stock 57.0%).
Benchmark mode: 66.3% (+9.3pp), 12/12 task buckets non-negative.
Zero trainable parameters, no model weights modified.
Built on Qwen/Qwen3-VL-2B-Instruct (Apache 2.0).
Frame scorer: openai/clip-vit-large-patch14 (MIT).
Author: Deaw (HF: @commandeaw ).
- LICENSE +17 -0
- NOTICE +39 -0
- README.md +272 -0
- build_hybrid.py +160 -0
- dw_queryframes.py +223 -0
- eval_videomme.py +233 -0
- example_usage.py +59 -0
LICENSE
ADDED
@@ -0,0 +1,17 @@
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/

Copyright 2026 Deaw (HF: @commandeaw)

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

NOTICE
ADDED
@@ -0,0 +1,39 @@
DW-KhotTaeVL-2B-QueryFrames
============================

Copyright 2026 Deaw (HF: @commandeaw)

This product is released by Deaw under the Apache License,
Version 2.0. Personal research project, not affiliated with any
commercial entity.

----

This product builds on the following third-party components:

1. Qwen3-VL-2B-Instruct
   Copyright Alibaba Cloud / Qwen Team
   Licensed under the Apache License, Version 2.0
   https://huggingface.co/Qwen/Qwen3-VL-2B-Instruct

   Per the Apache 2.0 license, the base model weights are reused
   without modification by this derivative. Always credit the base
   model when using DW-KhotTaeVL-2B-QueryFrames.

2. CLIP-ViT-Large-Patch14
   Copyright OpenAI
   Licensed under the MIT License
   https://huggingface.co/openai/clip-vit-large-patch14

   Used as a query-aware frame scorer.

3. Video-MME (evaluation only, not redistributed)
   Copyright the original authors (Fu et al. 2024)
   See: https://huggingface.co/datasets/lmms-lab/Video-MME

----

NO WARRANTY

This software is provided "AS IS" without warranty of any kind.
See LICENSE for full terms.

README.md
ADDED
@@ -0,0 +1,272 @@
---
license: apache-2.0
language:
- en
tags:
- video
- video-question-answering
- multimodal
- vision-language
- qwen3-vl
- inference-time
- frame-selection
- clip
base_model: Qwen/Qwen3-VL-2B-Instruct
pipeline_tag: video-text-to-text
library_name: transformers
---

# DW-KhotTaeVL-2B-QueryFrames

**Built on [Qwen3-VL-2B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-2B-Instruct) (Apache 2.0).**

A query-aware frame selection wrapper around stock Qwen3-VL-2B-Instruct
for video multiple-choice question answering. **No model weights are
modified**: the release adds a CLIP-ViT-L/14-driven frame selector and
an optional task-type-aware uniform-fallback policy in front of the
stock model.

On Video-MME mini at an 8-frame budget, benchmark mode recovers **56 % of
the 8-frame → 64-frame stock baseline gap (9.3 of 16.7 pp) with zero
training, zero parameter changes, and roughly 0.4 s of overhead per
question**.

## TL;DR

| Method | Trainable params | Video-MME mini 300 Q (8 frames unless noted) | Δ vs stock |
|---|---:|---:|---:|
| Stock Qwen3-VL-2B (uniform 8 f) | 0 | 57.0 % | 0 |
| **DW-KhotTaeVL-QueryFrames, wild mode** (no task_type) | 0 | **64.3 %** | **+7.3 pp** |
| **DW-KhotTaeVL-QueryFrames, benchmark mode** (task_type provided by dataset) | 0 | **66.3 %** | **+9.3 pp** |
| Stock Qwen3-VL-2B (uniform 64 f), ceiling | 0 | 73.7 % | +16.7 pp |

**12 of 12 task buckets non-negative; 9 strongly positive (≥ 5 pp);
0 regressions** in benchmark mode (task_type from Video-MME dataset).

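As a quick check on the headline figures, the share of the 8-frame to 64-frame gap closed by each mode can be computed directly from the table. This is a minimal sketch; the helper name `gap_recovered` is an illustrative assumption and is not part of the shipped scripts.

```python
def gap_recovered(stock_8f: float, method: float, stock_64f: float) -> float:
    """Fraction of the stock 8f -> 64f accuracy gap closed by `method`."""
    gap = stock_64f - stock_8f
    if gap <= 0:
        raise ValueError("64-frame accuracy must exceed 8-frame accuracy")
    return (method - stock_8f) / gap


# Accuracies in percent, copied from the TL;DR table.
STOCK_8F, WILD, BENCHMARK, STOCK_64F = 57.0, 64.3, 66.3, 73.7

wild_share = gap_recovered(STOCK_8F, WILD, STOCK_64F)
benchmark_share = gap_recovered(STOCK_8F, BENCHMARK, STOCK_64F)
print(f"wild: {wild_share:.0%}, benchmark: {benchmark_share:.0%}")
```

With the table's numbers this gives roughly 44 % for wild mode and 56 % for benchmark mode.
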
## Why it works

Stock Qwen3-VL-2B at 8 frames trails the same model at 64 frames by
16.7 pp. Because the model and prompt are identical and only the frame
budget changes, the gap is a frame-coverage problem: the bottleneck is
**which 8 frames you give the model**, not the model itself.

DW-KhotTaeVL-2B-QueryFrames picks the 8 frames *that match the
question* via CLIP-ViT-L/14 cosine similarity. For two task types
where 64-frame stock does *not* outperform 8-frame stock (Object
Reasoning and Temporal Reasoning per the Video-MME taxonomy), the
hybrid policy reverts to uniform sampling: frame coverage is not
the bottleneck for those questions, and CLIP scoring can mis-pick.

## Pipeline

```
For each (video, question, options[A,B,C,D]):
  1. Sample 32 uniformly-spaced candidate frames.
  2. Encode question text with CLIP-ViT-L/14 → 768-d text vector.
  3. Encode candidate frames → 768-d image vectors.
  4. Cosine similarity → pick top-8 (or uniform-8 if task is
     Object Reasoning / Temporal Reasoning, when task_type is given).
  5. Sort selected 8 frames by original temporal index.
  6. Pass 8 frames + MCQ to stock Qwen3-VL-2B-Instruct.
  7. Extract letter from output.
```

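Steps 4 and 5 of the pipeline can be sketched without any model, using plain lists in place of CLIP embeddings. The function names `cosine` and `select_top_k` and the 2-d toy vectors are assumptions made for illustration; the shipped module does the same ranking on 768-d CLIP features with PyTorch.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_top_k(query_vec, frame_vecs, k):
    """Indices of the k frames most similar to the query, in temporal order."""
    scores = [cosine(query_vec, f) for f in frame_vecs]
    ranked = sorted(range(len(frame_vecs)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])  # re-sort so the frames keep their original sequence


# Toy 2-d "embeddings": frames 1 and 3 point in nearly the query's direction.
query = [1.0, 0.0]
frames = [[0.0, 1.0], [1.0, 0.1], [0.5, 0.5], [1.0, 0.0], [-1.0, 0.0]]
picked = select_top_k(query, frames, k=2)
print(picked)
```

Re-sorting by index after ranking keeps the selected frames in chronological order, so the model still sees a coherent sequence.
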
## Usage

### Install dependencies

```bash
pip install torch transformers pillow decord huggingface_hub
```

### Minimal example

```python
from dw_queryframes import QueryFrames

fv = QueryFrames(device="auto")   # auto-resolves to mps, then cuda, then cpu

result = fv.answer_mcq(
    video_path="cooking.mp4",
    question="What does the chef do after pouring the oil into the pot?",
    options=[
        "Chops fresh green herbs",
        "Pours broth into the pot",
        "Stirs the oil in the pot",
        "Adds salt to the pot",
    ],
    task_type=None,   # or e.g. "Action Recognition" for benchmark mode
)
print(result["pred"])            # e.g. 'B'
print(result["frames_used"])     # 'query_aware' or 'uniform_fallback'
print(result["latency_clip_s"])  # ~0.4 s
print(result["latency_gen_s"])   # ~3 s on Apple M4 MPS
```

### Two operating modes

| Mode | What you pass | When to use | Accuracy (300 Q) |
|---|---|---|---:|
| **Wild** | question + options | in-the-wild deployment with unknown task taxonomy | **64.3 %** |
| **Benchmark** | question + options + `task_type` string | benchmark eval where the dataset itself supplies the task taxonomy (Video-MME, etc.) | **66.3 %** |

Pass any of the Video-MME task labels (e.g. `"Action Recognition"`,
`"Object Reasoning"`, `"Counting Problem"`) to `task_type`. Two values
trigger the uniform-fallback path: `"Object Reasoning"` and
`"Temporal Reasoning"`. All other task strings (or `None`) use the
query-aware path.

> **Note on benchmark mode:** the +9.3 pp / 66.3 % number is a
> *benchmark setting*: it relies on the dataset (Video-MME) supplying
> the per-question task type as part of the standard input. It is
> not achievable in deployment without that label. Wild mode (64.3 %,
> +7.3 pp) is the in-the-wild figure when no task taxonomy is given.

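The routing rule described above fits in a few lines. This is a sketch of the decision only: the function name `frame_policy` is an assumption, while the two task labels and the two return strings are the ones used in this card.

```python
from typing import Optional

# Task labels that trigger uniform sampling when supplied via task_type.
UNIFORM_FALLBACK_TASKS = frozenset({"Object Reasoning", "Temporal Reasoning"})


def frame_policy(task_type: Optional[str]) -> str:
    """Return which frame-selection path a question would take."""
    if task_type in UNIFORM_FALLBACK_TASKS:
        return "uniform_fallback"
    # Any other label, or no label at all (wild mode), uses CLIP ranking.
    return "query_aware"


for label in (None, "Action Recognition", "Object Reasoning", "Temporal Reasoning"):
    print(label, "->", frame_policy(label))
```
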
## Per-task accuracy on Video-MME mini 300 Q

| Task | n | Stock 8 f | QueryFrames | Δ |
|---|---:|---:|---:|---:|
| Action Reasoning | 9 | 0.444 | 0.667 | **+0.222** ⭐ |
| Action Recognition | 45 | 0.489 | 0.644 | **+0.156** ⭐ |
| Attribute Perception | 37 | 0.730 | 0.811 | **+0.081** ⭐ |
| Counting Problem | 34 | 0.265 | 0.353 | **+0.088** ⭐ |
| Information Synopsis | 30 | 0.800 | 0.800 | +0.000 |
| OCR Problems | 23 | 0.391 | 0.609 | **+0.217** ⭐ |
| Object Reasoning | 36 | 0.722 | 0.722 | +0.000 |
| Object Recognition | 51 | 0.588 | 0.667 | **+0.078** ⭐ |
| Spatial Perception | 10 | 0.600 | 0.700 | **+0.100** ⭐ |
| Spatial Reasoning | 9 | 0.778 | 1.000 | **+0.222** ⭐ |
| Temporal Perception | 8 | 0.625 | 0.750 | **+0.125** ⭐ |
| Temporal Reasoning | 8 | 0.250 | 0.250 | +0.000 |

(Benchmark mode shown; `task_type` provided by the Video-MME dataset.
⭐ = Δ ≥ 5 pp.)

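The per-task rows can be cross-checked against the overall accuracies by converting each rounded fraction back to a count. The sketch below assumes that rounding `n * accuracy` recovers the exact number of correct answers, which holds for three-decimal fractions at these bucket sizes; the variable names are illustrative.

```python
# (task, n, stock 8f accuracy, QueryFrames accuracy), copied from the table above.
ROWS = [
    ("Action Reasoning", 9, 0.444, 0.667),
    ("Action Recognition", 45, 0.489, 0.644),
    ("Attribute Perception", 37, 0.730, 0.811),
    ("Counting Problem", 34, 0.265, 0.353),
    ("Information Synopsis", 30, 0.800, 0.800),
    ("OCR Problems", 23, 0.391, 0.609),
    ("Object Reasoning", 36, 0.722, 0.722),
    ("Object Recognition", 51, 0.588, 0.667),
    ("Spatial Perception", 10, 0.600, 0.700),
    ("Spatial Reasoning", 9, 0.778, 1.000),
    ("Temporal Perception", 8, 0.625, 0.750),
    ("Temporal Reasoning", 8, 0.250, 0.250),
]

total = sum(n for _, n, _, _ in ROWS)
stock_correct = sum(round(n * s) for _, n, s, _ in ROWS)
qf_correct = sum(round(n * q) for _, n, _, q in ROWS)

print(f"questions: {total}")
print(f"stock 8f:    {stock_correct}/{total} = {stock_correct / total:.4f}")
print(f"QueryFrames: {qf_correct}/{total} = {qf_correct / total:.4f}")
```

The bucket sizes sum to 300, and the recovered totals agree with the 57.0 % and 66.3 % overall figures.
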
## What this is NOT

- It is **not** a fine-tuned model. Qwen3-VL-2B-Instruct weights are
  unchanged. You can verify this with the standard Hugging Face model
  hash check.
- It is **not** a leaderboard submission claim. The numbers above are
  on the publicly-available Video-MME mini split (300 Q, filtered to
  videos available locally via the standard mini chunks).
- It is **not** a replacement for fine-tuning when you have abundant
  domain data. For domain-shifted deployments (e.g. surveillance
  video), training-based adaptation may be required.

## Hardware

Runs on:

| Device | Notes |
|---|---|
| Apple M4 Max / M3 Pro (MPS, ≥ 32 GB RAM) | tested; ~3-4 s/q at 8 frames |
| NVIDIA A100 / H100 (CUDA) | works; faster |
| CPU (BF16-capable) | works but slow |

VRAM / unified memory needed: ~6-8 GB at `max_pixels=262144` with
8 frames. Lower `max_pixels` (e.g. to 153 600) if memory-constrained.

## Reproducibility

All numbers in this card are reproducible from a fresh clone of this
repo, using the [official Video-MME parquet](https://huggingface.co/datasets/lmms-lab/Video-MME)
(filtered to its `videos_chunked_01.zip` mini split).

The shipped scripts (`eval_videomme.py` and `build_hybrid.py`) are
**self-contained**: they have no external project dependencies beyond
the local `dw_queryframes.py` module and standard Python /
Hugging Face / PyTorch packages.

### Three-command reproduction recipe

```bash
# Install deps
pip install torch transformers pillow decord huggingface_hub pandas pyarrow

# 1. Reproduce stock-uniform-8f baseline (writes stock_uniform_300q.json)
python eval_videomme.py --mode stock-uniform --n-questions 300 \
    --out-json stock_uniform_300q.json

# 2. Reproduce wild-mode query-aware frames (writes wild_300q.json)
python eval_videomme.py --mode wild --n-questions 300 \
    --out-json wild_300q.json

# 3. Combine into benchmark mode via the hybrid policy
python build_hybrid.py \
    --wild-json wild_300q.json \
    --stock-uniform-json stock_uniform_300q.json \
    --out-json hybrid_300q.json
```

Expected results at 300 Q (greedy decoding, `do_sample=False`,
`max_pixels=262144`):

| Output | Accuracy | Δ vs stock |
|---|---:|---:|
| `stock_uniform_300q.json` | 0.5700 | 0 |
| `wild_300q.json` (wild mode) | 0.6433 | +7.3 pp |
| `hybrid_300q.json` (benchmark mode) | 0.6633 | +9.3 pp |

With greedy decoding the pipeline is **deterministic**: re-running on
the same 300 questions reproduces the same 199 / 300 = 66.3 % in
benchmark mode.

> **Caveat on sample size and split.** All numbers above are on the
> Video-MME *mini* split (the 300 questions whose videos ship in
> `videos_chunked_01.zip`). They are **not** the full 2700-question
> Video-MME benchmark and are not a leaderboard submission. A
> full-benchmark eval is on the future-work list.

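To confirm a run against the expected table, the accuracy can be recomputed from the per-question rows instead of trusting the stored summary. This sketch assumes the eval JSON layout that `build_hybrid.py` reads (a top-level `results` list whose rows carry a boolean `correct`); the helper names `summarize` and `summarize_file` are hypothetical.

```python
import json


def summarize(results):
    """Recompute question count, correct count and accuracy from result rows."""
    n = len(results)
    correct = sum(1 for row in results if row.get("correct"))
    return {
        "n_questions": n,
        "correct": correct,
        "accuracy": round(correct / n, 4) if n else 0.0,
    }


def summarize_file(path):
    """Load an eval JSON from disk and summarize its `results` list."""
    with open(path, encoding="utf-8") as fh:
        data = json.load(fh)
    return summarize(data.get("results", []))


# Toy rows (made up for illustration, not real eval output).
demo = [
    {"index": 0, "correct": True},
    {"index": 1, "correct": False},
    {"index": 2, "correct": True},
]
print(summarize(demo))
```

Running `summarize_file("hybrid_300q.json")` after the recipe should then report 199 correct out of 300 if the run matches the card.
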
## License

| Component | License | Source |
|---|---|---|
| This wrapper code | Apache 2.0 | this repo |
| Base model (Qwen3-VL-2B-Instruct) | Apache 2.0 | https://huggingface.co/Qwen/Qwen3-VL-2B-Instruct |
| Frame scorer (CLIP-ViT-L/14) | MIT | https://huggingface.co/openai/clip-vit-large-patch14 |
| Eval data (Video-MME mini) | as published by lmms-lab | https://huggingface.co/datasets/lmms-lab/Video-MME |

When using or citing this work, please credit the base model:

> Built on Qwen3-VL-2B-Instruct (Apache 2.0).
> Frame selector: CLIP-ViT-L/14 (Radford et al. 2021, OpenAI, MIT).

## Citation

```bibtex
@misc{dw-khottaevl-2b-queryframes-2026,
  author = {Deaw},
  title = {DW-KhotTaeVL-2B-QueryFrames: Query-Aware Frame Selection
           for Video MCQ on Qwen3-VL-2B-Instruct},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/commandeaw/DW-KhotTaeVL-2B-QueryFrames}
}

@misc{qwen3vl2025,
  title = {Qwen3-VL: Multilingual Vision-Language Models},
  author = {Qwen Team},
  year = {2025},
}

@inproceedings{radford2021clip,
  title = {Learning Transferable Visual Models From Natural Language Supervision},
  author = {Radford, Alec and Kim, Jong Wook and others},
  booktitle = {ICML},
  year = {2021},
}

@misc{videomme2024,
  title = {Video-MME: The First-Ever Comprehensive Evaluation Benchmark
           of Multi-modal LLMs in Video Analysis},
  author = {Fu, Chaoyou and others},
  year = {2024},
}
```

## Author

**Deaw** ([@commandeaw](https://huggingface.co/commandeaw)), independent
ML practitioner. Personal research release.

Issues / questions: open an issue on the model repo.

build_hybrid.py
ADDED
@@ -0,0 +1,160 @@
"""Standalone benchmark-mode hybrid policy builder.

Combines two eval JSONs (wild-mode query-aware and stock-uniform-8f) by
selecting, per question, whichever prediction the policy says to use:

- If task_type ∈ {Object Reasoning, Temporal Reasoning} → take stock-uniform pred
  (these are tasks where Video-MME 64f stock does NOT outperform 8f stock,
  so query-aware frame selection cannot help).
- Else → take wild-mode (query-aware) pred.

This is a pure post-hoc combination of two prediction sets: it runs no
inference and needs no GPU. The output JSON has the same shape as the
eval JSONs, with an added ``policy_source`` field per result row.

Usage::

    python eval_videomme.py --mode wild --n-questions 300 \\
        --out-json wild_300q.json
    python eval_videomme.py --mode stock-uniform --n-questions 300 \\
        --out-json stock_uniform_300q.json
    python build_hybrid.py \\
        --wild-json wild_300q.json \\
        --stock-uniform-json stock_uniform_300q.json \\
        --out-json hybrid_300q.json
"""
from __future__ import annotations

import argparse
import json
from collections import defaultdict
from pathlib import Path


# Tasks where Video-MME stock-64f does NOT outperform stock-8f on the
# 300Q mini split (measured: Object Reasoning Δ -0.083, Temporal
# Reasoning Δ +0.000). For these tasks frame coverage is not the
# bottleneck, so the hybrid policy reverts to uniform sampling.
NO_FRAME_GAIN_TASKS = frozenset({"Object Reasoning", "Temporal Reasoning"})


def load_eval(path: str | Path) -> tuple[dict, list[dict]]:
    """Read a Video-MME eval JSON. Returns (summary, results)."""
    d = json.loads(Path(path).read_text())
    return d.get("summary", {}), d.get("results", [])


def main() -> int:
    ap = argparse.ArgumentParser()
    ap.add_argument("--wild-json", required=True,
                    help="path to wild-mode eval JSON (query-aware frames). "
                         "Produced by `eval_videomme.py --mode wild`.")
    ap.add_argument("--stock-uniform-json", required=True,
                    help="path to stock-uniform-8f eval JSON. "
                         "Produced by `eval_videomme.py --mode stock-uniform`.")
    ap.add_argument("--out-json", required=True,
                    help="output hybrid JSON path")
    args = ap.parse_args()

    wild_summary, wild_results = load_eval(args.wild_json)
    stk_summary, stk_results = load_eval(args.stock_uniform_json)

    # Index both runs by question index and keep only the shared questions.
    wild_by = {r["index"]: r for r in wild_results}
    stk_by = {r["index"]: r for r in stk_results}
    common = sorted(set(wild_by) & set(stk_by))

    if not common:
        raise SystemExit(
            "[hybrid] no overlapping question indices between the two "
            "eval JSONs; make sure both runs used the same n_questions "
            "and chunks.")

    if len(common) != len(wild_by) or len(common) != len(stk_by):
        print(f"[hybrid] WARN: wild={len(wild_by)} stock-uniform={len(stk_by)} "
              f"overlap={len(common)}; computing on overlap only.")

    hybrid_results = []
    src_count = {"query_aware": 0, "uniform_fallback": 0}
    for i in common:
        w, s = wild_by[i], stk_by[i]
        task = w.get("task_type", "")
        use_uniform = task in NO_FRAME_GAIN_TASKS
        chosen = s if use_uniform else w
        src_count["uniform_fallback" if use_uniform else "query_aware"] += 1
        hybrid_results.append({
            "index": i,
            "videoID": w.get("videoID"),
            "task_type": task,
            "gold": w.get("gold"),
            "pred": chosen.get("pred"),
            "correct": chosen.get("correct"),
            "policy_source": ("uniform_fallback" if use_uniform else "query_aware"),
        })

    n = len(hybrid_results)
    correct = sum(1 for r in hybrid_results if r["correct"])
    acc = correct / n if n else 0.0
    qa_acc = sum(1 for i in common if wild_by[i]["correct"]) / len(common)
    sk_acc = sum(1 for i in common if stk_by[i]["correct"]) / len(common)

    summary = {
        "tag": "benchmark_mode_hybrid",
        "policy": ("uniform-fallback for tasks where stock-64f does not "
                   "exceed stock-8f (Object Reasoning, Temporal Reasoning); "
                   "query-aware otherwise"),
        "no_frame_gain_tasks": sorted(NO_FRAME_GAIN_TASKS),
        "n_questions": n,
        "accuracy": round(acc, 4),
        "wild_accuracy": round(qa_acc, 4),
        "stock_uniform_accuracy": round(sk_acc, 4),
        "delta_hybrid_vs_stock_uniform": round(acc - sk_acc, 4),
        "delta_hybrid_vs_wild": round(acc - qa_acc, 4),
        "policy_source_counts": src_count,
    }

    out_path = Path(args.out_json)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(json.dumps(
        {"summary": summary, "results": hybrid_results},
        indent=2, ensure_ascii=False))
    print(f"[hybrid] wrote {out_path}")
    print(f"[hybrid] hybrid acc = {acc:.4f}  "
          f"(wild {qa_acc:.4f}, stock-uniform {sk_acc:.4f})")
    print(f"[hybrid] Δ vs stock = {acc-sk_acc:+.4f}   "
          f"Δ vs wild = {acc-qa_acc:+.4f}")
    print(f"[hybrid] policy: query_aware={src_count['query_aware']}  "
          f"uniform_fallback={src_count['uniform_fallback']}")

    # Per-task breakdown for transparency. Each entry is [correct, total].
    by_task = defaultdict(lambda: [0, 0])
    by_task_w = defaultdict(lambda: [0, 0])
    by_task_s = defaultdict(lambda: [0, 0])
    for r in hybrid_results:
        t = r["task_type"]
        by_task[t][1] += 1
        by_task[t][0] += int(r["correct"])
    # Tally wild and stock on the overlapping questions only, so the
    # per-task columns stay comparable with the hybrid column.
    for i in common:
        w, s = wild_by[i], stk_by[i]
        by_task_w[w.get("task_type", "")][1] += 1
        by_task_w[w.get("task_type", "")][0] += int(w["correct"])
        by_task_s[s.get("task_type", "")][1] += 1
        by_task_s[s.get("task_type", "")][0] += int(s["correct"])

    print("\n=== per-task (n / stock-uniform / wild / hybrid / Δ_hyb_vs_stock) ===")
    for t in sorted(by_task):
        n_t = by_task[t][1]
        s_acc = by_task_s[t][0]/by_task_s[t][1] if by_task_s[t][1] else 0
        w_acc = by_task_w[t][0]/by_task_w[t][1] if by_task_w[t][1] else 0
        h_acc = by_task[t][0]/n_t if n_t else 0
        d = h_acc - s_acc
        flag = " ⭐" if d >= 0.05 else (" ⚠️" if d <= -0.05 else "")
        print(f"  {t:<25s} n={n_t:>3d}  s={s_acc:.3f}  w={w_acc:.3f}  "
              f"h={h_acc:.3f}  Δ_hyb_vs_s={d:+.3f}{flag}")
    return 0


if __name__ == "__main__":
    import sys
    sys.exit(main())

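The effect of the merge is easiest to see on a handful of rows. The sketch below re-implements only the selection step on made-up data; `merge_rows` is a hypothetical helper, and the toy rows are invented for illustration rather than taken from any eval output.

```python
NO_FRAME_GAIN_TASKS = frozenset({"Object Reasoning", "Temporal Reasoning"})


def merge_rows(wild_rows, stock_rows):
    """Pick one prediction per shared question index, following the hybrid policy."""
    wild_by = {r["index"]: r for r in wild_rows}
    stock_by = {r["index"]: r for r in stock_rows}
    merged = []
    for i in sorted(set(wild_by) & set(stock_by)):  # overlap only
        use_uniform = wild_by[i].get("task_type", "") in NO_FRAME_GAIN_TASKS
        chosen = stock_by[i] if use_uniform else wild_by[i]
        merged.append({
            "index": i,
            "pred": chosen["pred"],
            "correct": chosen["correct"],
            "policy_source": "uniform_fallback" if use_uniform else "query_aware",
        })
    return merged


# Toy rows: index 2 exists only in the wild run, so it is dropped.
wild = [
    {"index": 0, "task_type": "Counting Problem", "pred": "B", "correct": True},
    {"index": 1, "task_type": "Object Reasoning", "pred": "A", "correct": False},
    {"index": 2, "task_type": "OCR Problems", "pred": "D", "correct": True},
]
stock = [
    {"index": 0, "task_type": "Counting Problem", "pred": "C", "correct": False},
    {"index": 1, "task_type": "Object Reasoning", "pred": "C", "correct": True},
]
hybrid = merge_rows(wild, stock)
for row in hybrid:
    print(row)
```

Question 0 keeps the wild-mode answer, while question 1 (an Object Reasoning item) takes the stock-uniform answer.
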
dw_queryframes.py
ADDED
|
@@ -0,0 +1,223 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""DW-KhotTaeVL-2B-QueryFrames — query-aware frame selection for video MCQ.
|
| 2 |
+
|
| 3 |
+
Single-file inference module. Wraps stock Qwen3-VL-2B-Instruct with a
|
| 4 |
+
CLIP-ViT-L/14 query-aware frame selector and an optional task-type-aware
|
| 5 |
+
uniform-fallback policy.
|
| 6 |
+
|
| 7 |
+
Usage::
|
| 8 |
+
|
| 9 |
+
from dw_queryframes import QueryFrames
|
| 10 |
+
fv = QueryFrames(device="mps")
|
| 11 |
+
answer = fv.answer_mcq(
|
| 12 |
+
video_path="cooking.mp4",
|
| 13 |
+
question="What does the chef do after pouring the oil?",
|
| 14 |
+
options=["Stirs the oil", "Adds salt", "Pours broth", "Chops herbs"],
|
| 15 |
+
task_type=None, # or "Action Recognition" etc. for hybrid mode
|
| 16 |
+
)
|
| 17 |
+
|
| 18 |
+
License: Apache 2.0 (this code)
|
| 19 |
+
Copyright 2026 Deaw (HF: @commandeaw)
|
| 20 |
+
Base model: Qwen3-VL-2B-Instruct (Apache 2.0)
|
| 21 |
+
Frame scorer: openai/clip-vit-large-patch14 (MIT)
|
| 22 |
+
|
| 23 |
+
Always credit Qwen3-VL-Instruct as the base when using this work.
|
| 24 |
+
"""
|
| 25 |
+
from __future__ import annotations
|
| 26 |
+
|
| 27 |
+
import re
|
| 28 |
+
import os
|
| 29 |
+
from pathlib import Path
|
| 30 |
+
from typing import Optional
|
| 31 |
+
|
| 32 |
+
import torch
|
| 33 |
+
import torch.nn.functional as F
|
| 34 |
+
from PIL import Image
|
| 35 |
+
|
| 36 |
+
|
| 37 |
+
# Tasks where stock-64f does NOT outperform stock-8f on Video-MME mini
|
| 38 |
+
# (measured: Object Reasoning Δ -0.083, Temporal Reasoning Δ +0.000).
|
| 39 |
+
# For these tasks, frame-coverage is not the bottleneck; uniform sampling
|
| 40 |
+
# is at least as good as query-aware. The hybrid policy uses uniform
|
| 41 |
+
# selection for these task types when a label is provided.
|
| 42 |
+
NO_FRAME_GAIN_TASKS = frozenset({"Object Reasoning", "Temporal Reasoning"})
|
| 43 |
+
|
| 44 |
+
|
| 45 |
+
PROMPT_TEMPLATE = (
|
| 46 |
+
"Select the best answer based on the video.\n\n"
|
| 47 |
+
"Question: {question}\n"
|
| 48 |
+
"Options:\n{options}\n"
|
| 49 |
+
"Answer with only the letter."
|
| 50 |
+
)
|
| 51 |
+
|
| 52 |
+
LETTER_RE = re.compile(r"\b([ABCD])\b", re.IGNORECASE)
|
| 53 |
+
ANSWER_LINE_RE = re.compile(r"Answer:\s*([ABCD])\b", re.IGNORECASE)
|
| 54 |
+
|
| 55 |
+
|
| 56 |
+
class QueryFrames:
|
| 57 |
+
"""Query-aware frame selection over stock Qwen3-VL-2B-Instruct."""
|
| 58 |
+
|
| 59 |
+
def __init__(
|
| 60 |
+
self,
|
| 61 |
+
base_model: str = "Qwen/Qwen3-VL-2B-Instruct",
|
| 62 |
+
clip_model: str = "openai/clip-vit-large-patch14",
|
| 63 |
+
device: str = "auto",
|
| 64 |
+
max_pixels: int = 262_144,
|
| 65 |
+
max_new_tokens: int = 8,
|
| 66 |
+
n_frames: int = 8,
|
| 67 |
+
n_candidates: int = 32,
|
| 68 |
+
):
|
| 69 |
+
os.environ.setdefault("PYTORCH_ENABLE_MPS_FALLBACK", "1")
|
| 70 |
+
self.device = self._resolve_device(device)
|
| 71 |
+
self.n_frames = n_frames
|
| 72 |
+
self.n_candidates = n_candidates
|
| 73 |
+
self.max_new_tokens = max_new_tokens
|
| 74 |
+
|
| 75 |
+
from transformers import (
|
| 76 |
+
AutoProcessor, Qwen3VLForConditionalGeneration,
|
| 77 |
+
CLIPModel, CLIPProcessor,
|
| 78 |
+
)
|
| 79 |
+
self.qwen_processor = AutoProcessor.from_pretrained(base_model, max_pixels=max_pixels)
|
| 80 |
+
self.qwen_model = Qwen3VLForConditionalGeneration.from_pretrained(
|
| 81 |
+
base_model, dtype=torch.bfloat16,
|
| 82 |
+
).to(self.device).eval()
|
| 83 |
+
self.clip_model = CLIPModel.from_pretrained(
|
| 84 |
+
clip_model, torch_dtype=torch.float32,
|
| 85 |
+
).to(self.device).eval()
|
| 86 |
+
self.clip_processor = CLIPProcessor.from_pretrained(clip_model)
|
| 87 |
+
|

    @staticmethod
    def _resolve_device(device: str) -> str:
        if device == "auto":
            if torch.backends.mps.is_available():
                return "mps"
            if torch.cuda.is_available():
                return "cuda"
            return "cpu"
        return device

    def sample_uniform_candidates(self, video_path: str | Path) -> list[Image.Image]:
        """Sample ``n_candidates`` uniformly-spaced frames as PIL images."""
        import decord
        vid = decord.VideoReader(str(video_path))
        total = len(vid)
        step = total / (self.n_candidates + 1)
        indices = [int((i + 1) * step) for i in range(self.n_candidates)]
        return [Image.fromarray(vid[i].asnumpy()) for i in indices]

    def select_frames(
        self,
        candidates: list[Image.Image],
        question: str,
    ) -> list[Image.Image]:
        """Return ``n_frames`` images: top-K by CLIP similarity to question,
        sorted by original temporal index (preserving sequence)."""
        inputs = self.clip_processor(
            text=[question], images=candidates,
            return_tensors="pt", padding=True, truncation=True,
        )
        inputs = {k: v.to(self.device) for k, v in inputs.items()}
        with torch.inference_mode():
            text_emb = self.clip_model.get_text_features(
                input_ids=inputs["input_ids"],
                attention_mask=inputs["attention_mask"],
            )
            image_embs = self.clip_model.get_image_features(
                pixel_values=inputs["pixel_values"]
            )
            text_emb = F.normalize(text_emb, dim=-1)
            image_embs = F.normalize(image_embs, dim=-1)
        sims = (text_emb @ image_embs.T).squeeze(0).float().cpu()
        # topk raises if k exceeds the number of candidates, so clamp it.
        k = min(self.n_frames, len(candidates))
        topk = sims.topk(k).indices.tolist()
        topk_sorted = sorted(topk)
        return [candidates[i] for i in topk_sorted]

    def select_uniform(self, candidates: list[Image.Image]) -> list[Image.Image]:
        """Return ``n_frames`` images sampled uniformly from candidates."""
        step = len(candidates) / self.n_frames
        idx = [int((k + 0.5) * step) for k in range(self.n_frames)]
        idx = [min(i, len(candidates) - 1) for i in idx]
        return [candidates[i] for i in idx]

    def answer_mcq(
        self,
        video_path: str | Path,
        question: str,
        options: list[str],
        task_type: Optional[str] = None,
    ) -> dict:
        """Answer one MCQ question on a video.

        Args:
            video_path: path to .mp4 (or any decord-readable video)
            question: string question (no options)
            options: list of 4 option strings (will be lettered A-D)
            task_type: optional task category. If provided and matches
                a known no-frame-gain task, falls back to
                uniform sampling for collision-safe behavior.

        Returns:
            dict with keys: pred (letter), raw (model output),
            frames_used ("query_aware" | "uniform_fallback"),
            n_candidates, latency_clip_s, latency_gen_s.
        """
        import time
        candidates = self.sample_uniform_candidates(video_path)

        # Decide policy.
        use_uniform = task_type in NO_FRAME_GAIN_TASKS
        t1 = time.time()
        if use_uniform:
            frames = self.select_uniform(candidates)
        else:
            frames = self.select_frames(candidates, question)
        clip_dt = time.time() - t1

        # Build Qwen prompt and run inference.
        opts_text = "\n".join(f"{chr(65+i)}. {str(o).strip()}"
                              for i, o in enumerate(options))
        prompt = PROMPT_TEMPLATE.format(question=question, options=opts_text)
        messages = [{"role": "user", "content":
                     [{"type": "image"} for _ in frames]
                     + [{"type": "text", "text": prompt}]}]
        text_in = self.qwen_processor.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True,
        )
        inputs = self.qwen_processor(
            text=[text_in], images=frames,
            return_tensors="pt", padding=True,
        )
        inputs = {k: v.to(self.device) for k, v in inputs.items()}
        t2 = time.time()
        with torch.inference_mode():
            out_ids = self.qwen_model.generate(
                **inputs,
                max_new_tokens=self.max_new_tokens,
                do_sample=False,
                temperature=1.0,
            )
        gen_dt = time.time() - t2
        new_tokens = out_ids[0, inputs["input_ids"].shape[1]:]
        raw = self.qwen_processor.tokenizer.decode(
            new_tokens, skip_special_tokens=True,
        )
        pred = self._extract_letter(raw)
        return {
            "pred": pred,
            "raw": raw,
            "frames_used": "uniform_fallback" if use_uniform else "query_aware",
            "n_candidates": self.n_candidates,
            "latency_clip_s": round(clip_dt, 3),
            "latency_gen_s": round(gen_dt, 3),
        }

    @staticmethod
    def _extract_letter(text: str) -> Optional[str]:
        s = text or ""
        m = ANSWER_LINE_RE.search(s)
        if m:
            return m.group(1).upper()
        m = LETTER_RE.search(s)
        return m.group(1).upper() if m else None


__all__ = ["QueryFrames", "NO_FRAME_GAIN_TASKS"]
|
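The selection logic in `dw_queryframes.py` can be illustrated without loading any model. The following is a minimal sketch, assuming plain Python lists stand in for the CLIP embeddings; `candidate_indices`, `cosine`, and `top_k_in_temporal_order` are hypothetical helper names for illustration and are not part of the module. It mirrors the index arithmetic of `sample_uniform_candidates` and the "top-K by similarity, then sort by time" step of `select_frames`.

```python
import math


def candidate_indices(total_frames: int, n_candidates: int) -> list[int]:
    """Evenly spaced frame indices, using the same arithmetic as
    sample_uniform_candidates: step = total / (n + 1), index = int((i + 1) * step)."""
    step = total_frames / (n_candidates + 1)
    return [int((i + 1) * step) for i in range(n_candidates)]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_in_temporal_order(
    text_emb: list[float], frame_embs: list[list[float]], k: int
) -> list[int]:
    """Pick the k frames most similar to the text, then restore temporal order."""
    sims = [cosine(text_emb, f) for f in frame_embs]
    k = min(k, len(sims))  # never ask for more frames than exist
    ranked = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:k]
    return sorted(ranked)  # ascending index == original temporal order


if __name__ == "__main__":
    print(candidate_indices(330, 32)[:4])
    toy_frames = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
    print(top_k_in_temporal_order([1.0, 0.0], toy_frames, k=2))
```

Sorting the selected indices after ranking is what keeps the frames in chronological order for the language model, at the cost of discarding the similarity ranking itself.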
eval_videomme.py
ADDED
|
@@ -0,0 +1,233 @@
"""Standalone Video-MME mini eval for DW-KhotTaeVL-2B-QueryFrames.

This script reproduces the wild-mode (query-aware frame selection) numbers
reported in the model card. It is self-contained: it depends only on the
`dw_queryframes.py` module shipped in this same directory plus
publicly available datasets and models from Hugging Face.

Usage::

    pip install torch transformers pillow decord huggingface_hub pandas pyarrow

    # Wild mode (query-aware frame selection)
    python eval_videomme.py --mode wild --n-questions 50

    # Stock baseline (uniform 8 frames; matches the stock numbers
    # in the model card)
    python eval_videomme.py --mode stock-uniform --n-questions 50

For benchmark-mode evaluation (uses Video-MME's own task_type label
to pick uniform-fallback for Object/Temporal Reasoning), run both
modes above then combine via ``build_hybrid.py``.

Outputs JSON with ``summary`` + ``results`` keys.
"""
from __future__ import annotations

import argparse
import json
import os
import re
import sys
import time
import zipfile
from pathlib import Path

import pandas as pd
from huggingface_hub import hf_hub_download
from PIL import Image


# ---------------------------------------------------------------------------
# Public Video-MME mini assets (lmms-lab/Video-MME on Hugging Face).
# ---------------------------------------------------------------------------
REPO_ID = "lmms-lab/Video-MME"
REPO_TYPE = "dataset"
DEFAULT_CHUNKS = ["videos_chunked_01.zip"]
PARQUET_NAME = "videomme/test-00000-of-00001.parquet"

# Cache lives next to this script so a fresh ``git clone`` of the HF
# repo can reproduce results without touching the user's home directory.
CACHE_DIR = Path(__file__).resolve().parent / "cache" / "videomme_mini"
CACHE_DIR.mkdir(parents=True, exist_ok=True)

PROMPT_TEMPLATE = (
    "This is a representative frame from a video.\n"
    "Select the best answer based on the video.\n\n"
    "Question: {question}\n"
    "Options:\n{options}\n"
    "Answer with only the letter."
)

ANSWER_RE = re.compile(r"\b([ABCD])\b", re.IGNORECASE)
ALPTD_ANSWER_RE = re.compile(r"Answer:\s*([ABCD])\b", re.IGNORECASE)


# ---------------------------------------------------------------------------
# Asset management: fetch + unzip into CACHE_DIR.
# ---------------------------------------------------------------------------
def download_assets(chunks: list[str]) -> tuple[Path, list[Path]]:
    print(f"[eval] ensuring {PARQUET_NAME} ...")
    pq_path = Path(hf_hub_download(
        repo_id=REPO_ID, repo_type=REPO_TYPE, filename=PARQUET_NAME,
        cache_dir=str(CACHE_DIR / "hf"),
    ))
    zip_paths: list[Path] = []
    for name in chunks:
        zp = Path(hf_hub_download(
            repo_id=REPO_ID, repo_type=REPO_TYPE, filename=name,
            cache_dir=str(CACHE_DIR / "hf"),
        ))
        zip_paths.append(zp)
    return pq_path, zip_paths


def unzip_chunks(zip_paths: list[Path]) -> Path:
    video_dir = CACHE_DIR / "video"
    video_dir.mkdir(parents=True, exist_ok=True)
    for zp in zip_paths:
        existing = {p.stem for p in video_dir.glob("*.mp4")}
        with zipfile.ZipFile(zp, "r") as zf:
            to_extract = [
                m for m in zf.namelist()
                if m.endswith(".mp4") and Path(m).stem not in existing
            ]
            if to_extract:
                print(f"[eval] extracting {len(to_extract)} mp4s from {zp.name}")
            for m in to_extract:
                with zf.open(m) as src, open(video_dir / Path(m).name, "wb") as dst:
                    dst.write(src.read())
    return video_dir


def load_questions(pq_path: Path, video_dir: Path, limit: int) -> pd.DataFrame:
    df = pd.read_parquet(pq_path)
    ids = {p.stem for p in video_dir.glob("*.mp4")}
    df = df[df["videoID"].isin(ids)].reset_index(drop=True)
    if limit > 0 and len(df) > limit:
        df = df.iloc[:limit].copy()
    print(f"[eval] using {len(df)} questions")
    return df


def format_options(options) -> str:
    return "\n".join(str(o).strip() for o in options)


def extract_letter(text: str) -> str | None:
    s = text or ""
    m = ALPTD_ANSWER_RE.search(s)
    if m:
        return m.group(1).upper()
    m = ANSWER_RE.search(s)
    return m.group(1).upper() if m else None


# ---------------------------------------------------------------------------
# Frame selection lives in the local QueryFrames module.
# ---------------------------------------------------------------------------
sys.path.insert(0, str(Path(__file__).resolve().parent))
from dw_queryframes import QueryFrames  # noqa: E402


def main() -> int:
    ap = argparse.ArgumentParser()
    ap.add_argument("--base", default="Qwen/Qwen3-VL-2B-Instruct")
    ap.add_argument("--clip-model", default="openai/clip-vit-large-patch14")
    ap.add_argument("--mode", choices=["wild", "stock-uniform"],
                    default="wild",
                    help="'wild' = query-aware (top-K of N candidates); "
                         "'stock-uniform' = stock baseline (uniform 8 frames)")
    ap.add_argument("--tag", default="")
    ap.add_argument("--n-questions", type=int, default=50)
    ap.add_argument("--n-frames", type=int, default=8)
    ap.add_argument("--n-candidates", type=int, default=32)
    ap.add_argument("--max-pixels", type=int, default=262144)
    ap.add_argument("--max-new-tokens", type=int, default=8)
    ap.add_argument("--out-json", default=None,
                    help="output JSON path (auto-named if omitted)")
    ap.add_argument("--chunks", nargs="+", default=DEFAULT_CHUNKS)
    args = ap.parse_args()

    pq_path, zip_paths = download_assets(args.chunks)
    video_dir = unzip_chunks(zip_paths)
    df = load_questions(pq_path, video_dir, args.n_questions)

    os.environ.setdefault("PYTORCH_ENABLE_MPS_FALLBACK", "1")

    fv = QueryFrames(
        base_model=args.base,
        clip_model=args.clip_model,
        device="auto",
        max_pixels=args.max_pixels,
        max_new_tokens=args.max_new_tokens,
        n_frames=args.n_frames,
        n_candidates=args.n_candidates,
    )

    results = []
    correct = 0
    t0 = time.time()
    for i, row in df.iterrows():
        video_path = video_dir / f"{row['videoID']}.mp4"

        # Wild mode = query-aware (task_type=None lets QA path run).
        # Stock-uniform = pass a known no-frame-gain task name to force
        # the uniform-fallback path (matches stock 8f
        # baseline behavior).
        forced_uniform = (args.mode == "stock-uniform")
        out = fv.answer_mcq(
            video_path=video_path,
            question=row["question"],
            options=list(row["options"]),
            task_type=("Object Reasoning" if forced_uniform else None),
        )
        gold = row["answer"].strip().upper()
        ok = out["pred"] == gold
        correct += int(ok)
        results.append({
            "index": int(i),
            "videoID": row["videoID"],
            "task_type": row.get("task_type", ""),
            "gold": gold,
            "pred": out["pred"],
            "raw": out["raw"][:200],
            "frames_used": out["frames_used"],
            "latency_clip_s": out["latency_clip_s"],
            "latency_gen_s": out["latency_gen_s"],
            "correct": ok,
        })
        run = correct / (i + 1)
        print(f"[eval] [{i+1}/{len(df)}] gold={gold} pred={out['pred']} "
              f"acc_so_far={run:.3f} clip={out['latency_clip_s']}s "
              f"gen={out['latency_gen_s']}s", flush=True)

    n = len(results)
    acc = correct / n if n else 0.0
    summary = {
        "model_base": args.base,
        "clip_model": args.clip_model,
        "mode": args.mode,
        "tag": args.tag,
        "n_questions": n,
        "n_frames": args.n_frames,
        "n_candidates": args.n_candidates,
        "max_pixels": args.max_pixels,
        "max_new_tokens": args.max_new_tokens,
        "accuracy": round(acc, 4),
        "wall_time_s": round(time.time() - t0, 1),
    }

    out_path = args.out_json
    if out_path is None:
        tag = (args.tag or args.mode)
        out_path = str(CACHE_DIR.parent / f"eval_{tag}_{n}q.json")
    Path(out_path).parent.mkdir(parents=True, exist_ok=True)
    Path(out_path).write_text(json.dumps(
        {"summary": summary, "results": results}, indent=2))
    print(f"\n[eval] mode={args.mode} acc={acc:.4f} ({correct}/{n}) saved {out_path}")
    return 0


if __name__ == "__main__":
    sys.exit(main())
|
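The `eval_videomme.py` docstring says benchmark mode is obtained by running both modes and combining the two result files per task type. The following is a rough sketch of what such a merge could look like; it is not the shipped `build_hybrid.py`. It assumes both runs cover the same questions, that each entry carries the `index`, `task_type`, and `correct` fields written under the `results` key of the eval JSON, and that the fallback set is the two task names implied by the docstring. `ASSUMED_FALLBACK_TASKS` and `combine_hybrid` are hypothetical names.

```python
# Assumption: these are the task types that use the uniform fallback
# ("Object/Temporal Reasoning" in the eval docstring).
ASSUMED_FALLBACK_TASKS = frozenset({"Object Reasoning", "Temporal Reasoning"})


def combine_hybrid(
    wild_results: list[dict],
    uniform_results: list[dict],
    fallback_tasks: frozenset = ASSUMED_FALLBACK_TASKS,
) -> tuple[dict, list[dict]]:
    """Per question, keep the uniform-run row for fallback tasks and the
    wild-run row otherwise, then recompute accuracy over the merged rows."""
    uniform_by_index = {row["index"]: row for row in uniform_results}
    merged = []
    for wild_row in wild_results:
        if wild_row["task_type"] in fallback_tasks:
            merged.append(uniform_by_index[wild_row["index"]])
        else:
            merged.append(wild_row)
    n = len(merged)
    n_correct = sum(bool(row["correct"]) for row in merged)
    summary = {
        "n_questions": n,
        "accuracy": round(n_correct / n, 4) if n else 0.0,
    }
    return summary, merged


if __name__ == "__main__":
    wild = [
        {"index": 0, "task_type": "Object Reasoning", "correct": False},
        {"index": 1, "task_type": "Counting Problem", "correct": True},
    ]
    uniform = [
        {"index": 0, "task_type": "Object Reasoning", "correct": True},
        {"index": 1, "task_type": "Counting Problem", "correct": False},
    ]
    print(combine_hybrid(wild, uniform)[0])
```

In practice the two input lists would come from `json.load(...)["results"]` on the files written by the `wild` and `stock-uniform` runs.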
example_usage.py
ADDED
|
@@ -0,0 +1,59 @@
"""Example: run DW-KhotTaeVL-2B-QueryFrames on a single video MCQ.

Requirements::

    pip install torch transformers pillow decord huggingface_hub

This script loads the QueryFrames wrapper, samples 32 candidate frames
from the input video, picks the 8 most relevant to the question via
CLIP-ViT-L/14, and answers via stock Qwen3-VL-2B-Instruct.
"""
from dw_queryframes import QueryFrames


def main() -> None:
    fv = QueryFrames(
        base_model="Qwen/Qwen3-VL-2B-Instruct",
        clip_model="openai/clip-vit-large-patch14",
        device="auto",
        n_frames=8,
        n_candidates=32,
    )

    # Wild-mode example (no task taxonomy known).
    result = fv.answer_mcq(
        video_path="example.mp4",
        question="What does the chef do after pouring the oil into the pot?",
        options=[
            "Chops fresh green herbs",
            "Pours broth into the pot",
            "Stirs the oil in the pot",
            "Adds salt to the pot",
        ],
    )
    print("[wild mode]")
    print(f"  pred         : {result['pred']}")
    print(f"  raw output   : {result['raw']!r}")
    print(f"  frames used  : {result['frames_used']}")
    print(f"  CLIP latency : {result['latency_clip_s']} s")
    print(f"  GEN latency  : {result['latency_gen_s']} s")

    # Task-aware example (when task taxonomy is provided, e.g. Video-MME).
    result2 = fv.answer_mcq(
        video_path="example.mp4",
        question="What is happening to the cabbage in the frying pan?",
        options=[
            "It is being stirred",
            "It is being chopped",
            "It is being served",
            "It is being washed",
        ],
        task_type="Object Reasoning",  # → uniform-fallback path
    )
    print("\n[task-aware mode]")
    print(f"  pred         : {result2['pred']}")
    print(f"  frames used  : {result2['frames_used']}")  # 'uniform_fallback'


if __name__ == "__main__":
    main()
|
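The example prints both `pred` and `raw`, so it may help to see how one is derived from the other. The sketch below copies the two regular expressions and the two-step fallback from `_extract_letter` into a standalone function (the free-function name `extract_letter` is used here only for illustration) and runs it on a few made-up output strings. It also shows a caveat of the case-insensitive fallback: a standalone article "a" in a longer sentence is read as option A.

```python
import re
from typing import Optional

# Same patterns as in dw_queryframes.py.
ANSWER_LINE_RE = re.compile(r"Answer:\s*([ABCD])\b", re.IGNORECASE)
LETTER_RE = re.compile(r"\b([ABCD])\b", re.IGNORECASE)


def extract_letter(text: Optional[str]) -> Optional[str]:
    """Prefer an explicit 'Answer: X' pattern, else the first standalone A-D."""
    s = text or ""
    m = ANSWER_LINE_RE.search(s)
    if m:
        return m.group(1).upper()
    m = LETTER_RE.search(s)
    return m.group(1).upper() if m else None


if __name__ == "__main__":
    # Made-up strings standing in for the model's raw output.
    for raw in ["B", "Answer: c", "B. Pours broth", "", "It is a dog, so C"]:
        print(repr(raw), "->", extract_letter(raw))
```

With the default `max_new_tokens=8` and a prompt that asks for only the letter, long sentences are unlikely, but comparing `raw` against `pred` remains a cheap sanity check when the prompt or token budget is changed.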