---
license: apache-2.0
language:
- en
tags:
- video
- video-question-answering
- multimodal
- vision-language
- qwen3-vl
- inference-time
- frame-selection
- clip
base_model: Qwen/Qwen3-VL-2B-Instruct
pipeline_tag: video-text-to-text
library_name: transformers
---

# DW-KhotTaeVL-2B-QueryFrames

**Built on [Qwen3-VL-2B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-2B-Instruct) (Apache 2.0).**

A query-aware frame-selection wrapper around stock Qwen3-VL-2B-Instruct
for video multiple-choice / decision-style question answering. **No model
weights are modified**: the release ships a CLIP-ViT-L/14-driven frame
selector plus an optional task-type-aware uniform-fallback policy.

On Video-MME mini at an 8-frame budget, this recovers **~44 % of the
8-frame → 64-frame stock baseline gap in MCQ mode, and ~56 % in
task-aware MCQ mode**, with zero training, zero parameter changes, and
roughly 0.4 s of added latency per question.

## Scope

This release evaluates query-aware frame selection in a video
multiple-choice / decision-style QA setting. The selector may use
both the question text and the answer options as its CLIP query.
This is appropriate for Video-MME-style MCQ benchmarks and for
operational triage workflows where the system chooses among
predefined actions or alert categories (e.g. *normal passage /
restricted-zone entry / staff activity / false alarm*). It should
**not** be read as an open-ended video-understanding benchmark claim.

## Motivation

This work started from CCTV / video-security R&D, where only a small
number of frames can be sent to a VLM under latency and compute
constraints. The released artifact is a general-purpose query-aware
frame selector for video MCQ / decision-style video QA — not a
product-specific CCTV model.

## TL;DR

| Method | Trainable params | Video-MME mini 300 Q accuracy | Δ vs stock 8 f |
|---|---:|---:|---:|
| Stock Qwen3-VL-2B (uniform 8 f) | 0 | 57.0 % | 0 |
| **QueryFrames, MCQ mode** (8 f, no task_type) | 0 | **64.3 %** | **+7.3 pp** |
| **QueryFrames, task-aware MCQ mode** (8 f, task_type from dataset) | 0 | **66.3 %** | **+9.3 pp** |
| Stock Qwen3-VL-2B (uniform 64 f), ceiling | 0 | 73.7 % | +16.7 pp |

**12 of 12 task buckets non-negative; 9 strongly positive (≥ 5 pp);
0 regressions** in task-aware MCQ mode (task_type from Video-MME dataset).

> **Scope note.** This method targets short-clip, low-frame-budget
> video QA. The 300 Q numbers above are inside that design envelope.
> On the full 2700 Q split, overall Δ is **+0.22 pp** — see
> [Scope on the full Video-MME mini (2700 Q)](#scope-on-the-full-video-mme-mini-2700-q) below.

## Why it works

Stock Qwen3-VL-2B at 8 frames trails the same model at 64 frames by
~17 pp. Because the model and prompt are identical and only the frame
budget changes, the gap is attributable to frame coverage: the
bottleneck is **which 8 frames you give the model**, not the model itself.
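
The share of that gap recovered by each mode follows directly from the accuracies in the TL;DR table. A small worked calculation (the helper name `gap_recovered` is illustrative and not part of the released code):

```python
def gap_recovered(acc_method: float, acc_low: float, acc_high: float) -> float:
    """Fraction of the low-budget -> high-budget gap closed by a method."""
    return (acc_method - acc_low) / (acc_high - acc_low)

# Accuracies in percent, taken from the TL;DR table.
stock_8f, stock_64f = 57.0, 73.7

mcq = gap_recovered(64.3, stock_8f, stock_64f)
task_aware = gap_recovered(66.3, stock_8f, stock_64f)

print(f"MCQ mode:            {mcq:.1%}")         # 43.7%
print(f"Task-aware MCQ mode: {task_aware:.1%}")  # 55.7%
```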

DW-KhotTaeVL-2B-QueryFrames picks the 8 frames *that match the
question* via CLIP-ViT-L/14 cosine similarity. For two task types
where 64-frame stock does *not* outperform 8-frame stock (Object
Reasoning and Temporal Reasoning per the Video-MME taxonomy), the
hybrid policy reverts to uniform sampling — frame coverage is not
the bottleneck for those questions, and CLIP scoring can mis-pick.

## Pipeline

```
For each (video, question, options[A,B,C,D]):
    1. Sample 32 uniformly-spaced candidate frames.
    2. Encode the query text (question, optionally plus answer options)
       with CLIP-ViT-L/14 → 768-d text vector.
    3. Encode candidate frames → 768-d image vectors.
    4. Cosine similarity → pick top-8 (or uniform-8 if task is
       Object Reasoning / Temporal Reasoning, when task_type is given).
    5. Sort selected 8 frames by original temporal index.
    6. Pass 8 frames + MCQ to stock Qwen3-VL-2B-Instruct.
    7. Extract letter from output.
```
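
Steps 1, 4 and 5 can be sketched in plain Python. This is a minimal illustration, not the code shipped in `dw_queryframes.py`: the function names are hypothetical, the segment-midpoint spacing is an assumption (the card only says "uniformly-spaced"), and the toy 2-d vectors stand in for the 768-d CLIP embeddings.

```python
import math
from typing import List, Sequence

def uniform_indices(num_frames: int, k: int) -> List[int]:
    """Return k evenly spaced frame indices in [0, num_frames).

    Assumption: take the midpoint of each of k equal-width segments.
    """
    if num_frames <= 0 or k <= 0:
        return []
    k = min(k, num_frames)
    return [int((i + 0.5) * num_frames / k) for i in range(k)]

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def select_frames(text_vec, frame_vecs, frame_ids, k: int = 8) -> List[int]:
    """Pick the k candidates most similar to the text, in temporal order."""
    scores = [cosine(text_vec, v) for v in frame_vecs]
    # Rank candidate positions by similarity, highest first, keep the top k.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    # Step 5: restore the original temporal order of the chosen frames.
    return sorted(frame_ids[i] for i in top)

# Toy example: 3 candidate frames, keep the 2 best matches.
candidate_ids = [3, 7, 9]
candidate_vecs = [[1.0, 0.1], [0.0, 1.0], [1.0, 0.0]]
query_vec = [1.0, 0.0]
picked = select_frames(query_vec, candidate_vecs, candidate_ids, k=2)
print(picked)  # [3, 9]
```

Sorting by frame index after ranking matters because the ranking alone would hand the model frames in similarity order rather than in the order they occur in the video.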

## Usage

### Install dependencies

```bash
pip install torch transformers pillow decord huggingface_hub
```

### Minimal example

```python
from dw_queryframes import QueryFrames

fv = QueryFrames(device="auto")  # auto-resolves to cuda / mps / cpu

result = fv.answer_mcq(
    video_path="cooking.mp4",
    question="What does the chef do after pouring the oil into the pot?",
    options=[
        "Chops fresh green herbs",
        "Pours broth into the pot",
        "Stirs the oil in the pot",
        "Adds salt to the pot",
    ],
    task_type=None,  # or e.g. "Action Recognition" for task-aware MCQ mode
)
print(result["pred"])              # e.g. 'B'
print(result["frames_used"])       # 'query_aware' or 'uniform_fallback'
print(result["latency_clip_s"])    # ~0.4 s
print(result["latency_gen_s"])     # ~3 s on Apple M4 MPS
```

### Two operating modes

| Mode | Input | Use | Acc 300 Q |
|---|---|---|---:|
| **MCQ mode** (no task_type) | video + question + answer options | Video-MCQ / decision-style QA without task taxonomy | **64.3 %** |
| **Task-aware MCQ mode** | + `task_type` string | benchmark or controlled workflows where task taxonomy is supplied | **66.3 %** |

Pass any of the Video-MME task labels (e.g. `"Action Recognition"`,
`"Object Reasoning"`, `"Counting Problem"`) to `task_type`. Two values
trigger the uniform-fallback path: `"Object Reasoning"` and
`"Temporal Reasoning"`. All other task strings (or `None`) use the
query-aware path.
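
The routing rule described above amounts to a set-membership test. A minimal sketch, assuming exact string matching on the task label (the constant and function names are hypothetical; the return values mirror the `frames_used` strings from the minimal example):

```python
from typing import Optional

# The two Video-MME task labels that trigger uniform sampling.
UNIFORM_FALLBACK_TASKS = frozenset({"Object Reasoning", "Temporal Reasoning"})

def choose_frame_policy(task_type: Optional[str]) -> str:
    """Return which frame-selection path a question would take."""
    if task_type in UNIFORM_FALLBACK_TASKS:
        return "uniform_fallback"
    # Any other label, or None, uses CLIP-based query-aware selection.
    return "query_aware"

print(choose_frame_policy(None))                  # query_aware
print(choose_frame_policy("Counting Problem"))    # query_aware
print(choose_frame_policy("Temporal Reasoning"))  # uniform_fallback
```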

> **MCQ mode without task_type (64.3 %, +7.3 pp)** is the default
> reported setting: it uses only the video, question, and answer
> options, with no task taxonomy.
>
> **Task-aware MCQ mode (66.3 %, +9.3 pp)** uses the `task_type`
> label supplied by Video-MME to route Object Reasoning and Temporal
> Reasoning questions to uniform sampling. This is a benchmark /
> controlled-workflow setting and is reported separately from default
> MCQ mode.

## Per-task accuracy on Video-MME mini 300 Q

| Task | n | Stock 8 f | QueryFrames | Δ |
|---|---:|---:|---:|---:|
| Action Reasoning      |  9 | 0.444 | 0.667 | **+0.222** ⭐ |
| Action Recognition    | 45 | 0.489 | 0.644 | **+0.156** ⭐ |
| Attribute Perception  | 37 | 0.730 | 0.811 | **+0.081** ⭐ |
| Counting Problem      | 34 | 0.265 | 0.353 | **+0.088** ⭐ |
| Information Synopsis  | 30 | 0.800 | 0.800 |  +0.000  |
| OCR Problems          | 23 | 0.391 | 0.609 | **+0.217** ⭐ |
| Object Reasoning      | 36 | 0.722 | 0.722 |  +0.000  |
| Object Recognition    | 51 | 0.588 | 0.667 | **+0.078** ⭐ |
| Spatial Perception    | 10 | 0.600 | 0.700 | **+0.100** ⭐ |
| Spatial Reasoning     |  9 | 0.778 | 1.000 | **+0.222** ⭐ |
| Temporal Perception   |  8 | 0.625 | 0.750 | **+0.125** ⭐ |
| Temporal Reasoning    |  8 | 0.250 | 0.250 |  +0.000  |

(Task-aware MCQ mode shown — task_type provided by Video-MME dataset.
⭐ = Δ ≥ 5 pp.)
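
As a consistency check, the per-task rows can be re-aggregated into the headline numbers. The sketch below recovers per-task correct counts by rounding `n × accuracy` (an assumption that holds because each accuracy is a count divided by `n`, rounded to three decimals):

```python
# (task, n, stock 8 f accuracy, QueryFrames accuracy), copied from the table.
ROWS = [
    ("Action Reasoning", 9, 0.444, 0.667),
    ("Action Recognition", 45, 0.489, 0.644),
    ("Attribute Perception", 37, 0.730, 0.811),
    ("Counting Problem", 34, 0.265, 0.353),
    ("Information Synopsis", 30, 0.800, 0.800),
    ("OCR Problems", 23, 0.391, 0.609),
    ("Object Reasoning", 36, 0.722, 0.722),
    ("Object Recognition", 51, 0.588, 0.667),
    ("Spatial Perception", 10, 0.600, 0.700),
    ("Spatial Reasoning", 9, 0.778, 1.000),
    ("Temporal Perception", 8, 0.625, 0.750),
    ("Temporal Reasoning", 8, 0.250, 0.250),
]

total_n = sum(n for _task, n, _stock, _qf in ROWS)
stock_correct = sum(round(n * stock) for _task, n, stock, _qf in ROWS)
qf_correct = sum(round(n * qf) for _task, n, _stock, qf in ROWS)

print(total_n)                                        # 300
print(stock_correct, f"{stock_correct / total_n:.1%}")  # 171 57.0%
print(qf_correct, f"{qf_correct / total_n:.1%}")        # 199 66.3%
```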

## What this is NOT

- It is **not** a fine-tuned model. Qwen3-VL-2B-Instruct weights are
  unchanged. You can verify with the standard Hugging Face model
  hash check.
- It is **not** a leaderboard submission claim. The numbers above are
  on the publicly-available Video-MME mini split (300 Q, filtered to
  videos available locally via the standard mini chunks).
- It is **not** a replacement for fine-tuning when you have abundant
  domain data. For domain-shifted deployments (e.g. surveillance
  video), training-based adaptation may be required.
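
The hash check mentioned in the first bullet is not spelled out in this card. One way to do it, assuming you have a local copy of a weight file and the SHA-256 digest listed for that file on the upstream model repository's file page (the helper name and the file path in the comment are hypothetical):

```python
import hashlib

def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read in chunks so multi-GB weight files do not need to fit in RAM.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Usage (hypothetical path; paste the digest shown on the Hub file page):
# local = sha256_of_file("path/to/model.safetensors")
# assert local == "<sha256 from the upstream repository>"
```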

## Hardware

Runs on:

| Device | Notes |
|---|---|
| Apple M4 Max / M3 Pro (MPS, ≥ 32 GB RAM) | tested; ~3-4 s/q at 8 frames |
| NVIDIA A100 / H100 (CUDA) | works; faster |
| CPU (BF16-capable) | works but slow |

VRAM / unified memory needed: ~6-8 GB at `max_pixels=262144` with
8 frames. Lower `max_pixels` (e.g. to `153600`) if memory-constrained.

## Reproducibility

All numbers in this card are reproducible from a fresh clone of this
repo, using the [official Video-MME parquet](https://huggingface.co/datasets/lmms-lab/Video-MME)
(filtered to its `videos_chunked_01.zip` mini split).

The shipped scripts (`eval_videomme.py` and `build_hybrid.py`) are
**self-contained** — they have no external project dependencies beyond
the local `dw_queryframes.py` module and standard Python /
Hugging Face / PyTorch packages.

### Three-command reproduction recipe

```bash
# Install deps
pip install torch transformers pillow decord huggingface_hub pandas pyarrow

# 1. Reproduce stock-uniform-8f baseline (writes stock_uniform_300q.json)
python eval_videomme.py --mode stock-uniform --n-questions 300 \
    --out-json stock_uniform_300q.json

# 2. Reproduce MCQ mode without task_type (writes wild_300q.json)
python eval_videomme.py --mode wild --n-questions 300 \
    --out-json wild_300q.json

# 3. Combine into task-aware MCQ mode via the hybrid policy
python build_hybrid.py \
    --wild-json wild_300q.json \
    --stock-uniform-json stock_uniform_300q.json \
    --out-json hybrid_300q.json
```
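
Conceptually, step 3 is a per-question merge of the two result files. The sketch below is not the shipped `build_hybrid.py`. The record fields `qid`, `task_type` and `correct` are assumptions, since the JSON schema is not documented in this card, and the records are toy data rather than evaluation output.

```python
FALLBACK_TASKS = {"Object Reasoning", "Temporal Reasoning"}

def build_hybrid(wild_records, stock_records):
    """Keep the stock-uniform result for fallback tasks, else the wild one."""
    stock_by_qid = {r["qid"]: r for r in stock_records}
    return [
        stock_by_qid[r["qid"]] if r["task_type"] in FALLBACK_TASKS else r
        for r in wild_records
    ]

def accuracy(records):
    """Fraction of records marked correct."""
    return sum(r["correct"] for r in records) / len(records)

# Toy records only; real runs would load the two JSON files instead.
wild = [
    {"qid": "q1", "task_type": "OCR Problems", "correct": True},
    {"qid": "q2", "task_type": "Object Reasoning", "correct": False},
    {"qid": "q3", "task_type": "Temporal Reasoning", "correct": False},
    {"qid": "q4", "task_type": "Counting Problem", "correct": False},
]
stock = [
    {"qid": "q1", "task_type": "OCR Problems", "correct": False},
    {"qid": "q2", "task_type": "Object Reasoning", "correct": True},
    {"qid": "q3", "task_type": "Temporal Reasoning", "correct": False},
    {"qid": "q4", "task_type": "Counting Problem", "correct": True},
]

hybrid = build_hybrid(wild, stock)
print(accuracy(wild), accuracy(stock), accuracy(hybrid))  # 0.25 0.5 0.5
```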

Expected results at 300 Q (greedy decoding, `do_sample=False`,
`max_pixels=262144`):

| Output | Accuracy | Δ vs stock |
|---|---:|---:|
| `stock_uniform_300q.json` | 0.5700 | — |
| `wild_300q.json` (MCQ mode) | 0.6433 | +7.3 pp |
| `hybrid_300q.json` (task-aware MCQ mode) | 0.6633 | +9.3 pp |

This artifact is **fully deterministic** at greedy decoding —
re-running on the same 300 questions reproduces the same 199 / 300 = 66.3 %
in task-aware MCQ mode.

> **Caveat — sample size and split.** The 300 Q numbers above are on
> the `videos_chunked_01.zip` mini subset, which happens to be mostly
> short clips. For full-split numbers on Video-MME mini 2700 Q
> (balanced short / medium / long), see
> [Scope on the full Video-MME mini (2700 Q)](#scope-on-the-full-video-mme-mini-2700-q)
> below. This release is not a leaderboard submission.

## Scope on the full Video-MME mini (2700 Q)

After the 300 Q release, the eval was extended to the full 2700 Q
split (MCQ mode without `task_type`). Stock reaches 53.11 % and
QueryFrames 53.33 %, a difference of **+0.22 pp**.

This method targets short-clip, low-frame-budget video QA. The
2700 Q split is balanced across short / medium / long-form clips;
averaging across that range dilutes the gain to roughly neutral.

## Acknowledgements / Related Work

This project builds on Qwen3-VL-2B-Instruct and uses a simple
CLIP-based query-aware frame selection policy at inference time.

Query-aware and adaptive frame selection for Video-LLMs is an active
research direction. This release is an independent, simple CLIP-based
inference-time implementation focused on small-model video MCQ /
decision-style video QA under tight frame budgets.

## License

| Component | License | Source |
|---|---|---|
| This wrapper code | Apache 2.0 | this repo |
| Base model (Qwen3-VL-2B-Instruct) | Apache 2.0 | https://huggingface.co/Qwen/Qwen3-VL-2B-Instruct |
| Frame scorer (CLIP-ViT-L/14) | MIT | https://huggingface.co/openai/clip-vit-large-patch14 |
| Eval data (Video-MME mini) | as published by lmms-lab | https://huggingface.co/datasets/lmms-lab/Video-MME |

When using or citing this work, please credit the base model:

> Built on Qwen3-VL-2B-Instruct (Apache 2.0).
> Frame selector: CLIP-ViT-L/14 (Radford et al. 2021, OpenAI, MIT).

## Citation

```bibtex
@misc{dw-khottaevl-2b-queryframes-2026,
  author = {Deaw},
  title  = {DW-KhotTaeVL-2B-QueryFrames: Query-Aware Frame Selection
            for Video MCQ on Qwen3-VL-2B-Instruct},
  year   = {2026},
  publisher = {Hugging Face},
  url    = {https://huggingface.co/commandeaw/DW-KhotTaeVL-2B-QueryFrames}
}

@misc{qwen3vl2025,
  title  = {Qwen3-VL: Multilingual Vision-Language Models},
  author = {Qwen Team},
  year   = {2025},
}

@inproceedings{radford2021clip,
  title  = {Learning Transferable Visual Models From Natural Language Supervision},
  author = {Radford, Alec and Kim, Jong Wook and others},
  booktitle = {ICML},
  year   = {2021},
}

@misc{videomme2024,
  title  = {Video-MME: The First-Ever Comprehensive Evaluation Benchmark
            of Multi-modal LLMs in Video Analysis},
  author = {Fu, Chaoyou and others},
  year   = {2024},
}
```

## Author

**Deaw** ([@commandeaw](https://huggingface.co/commandeaw)) — independent
ML practitioner. Personal research release.

Issues / questions: open an issue on the model repo.