	---
	license: apache-2.0
	language:
	- en
	tags:
	- video
	- video-question-answering
	- multimodal
	- vision-language
	- qwen3-vl
	- inference-time
	- frame-selection
	- clip
	base_model: Qwen/Qwen3-VL-2B-Instruct
	pipeline_tag: video-text-to-text
	library_name: transformers
	---

	# DW-KhotTaeVL-2B-QueryFrames

	Built on [Qwen3-VL-2B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-2B-Instruct) (Apache 2.0).

	A query-aware frame-selection wrapper around stock Qwen3-VL-2B-Instruct
	for video multiple-choice / decision-style question answering. **No model
	weights are modified**: the release ships a CLIP-ViT-L/14-driven frame
	selector plus an optional task-type-aware uniform-fallback policy, both
	applied at inference time around the stock model.

	On the Video-MME mini 300 Q subset at an 8-frame budget, this recovers
	**~44 % of the 8-frame → 64-frame stock baseline gap in MCQ mode, and
	~56 % in task-aware MCQ mode**, with zero training, zero parameter
	changes, and roughly 0.4 s of added latency per question.

	## Scope

	This release evaluates query-aware frame selection in a video
	multiple-choice / decision-style QA setting. The selector may use
	both the question text and the answer options as its CLIP query.
	This is appropriate for Video-MME-style MCQ benchmarks and for
	operational triage workflows where the system chooses among
	predefined actions or alert categories (e.g. *normal passage /
	restricted-zone entry / staff activity / false alarm*). It should
	not be read as an open-ended video-understanding benchmark claim.

	## Motivation

	This work started from CCTV / video-security R&D, where only a small
	number of frames can be sent to a VLM under latency and compute
	constraints. The released artifact is a general-purpose query-aware
	frame selector for video MCQ / decision-style video QA — not a
	product-specific CCTV model.

	## TL;DR

	| Method | Trainable params | Video-MME mini 300 Q (8 frames) | Δ vs stock |
	|---|---:|---:|---:|
	| Stock Qwen3-VL-2B (uniform 8 f) | 0 | 57.0 % | 0 |
	| QueryFrames: MCQ mode (no task_type) | 0 | 64.3 % | +7.3 pp |
	| QueryFrames: Task-aware MCQ mode (task_type from dataset) | 0 | 66.3 % | +9.3 pp |
	| Stock Qwen3-VL-2B (uniform 64 f, ceiling) | 0 | 73.7 % | +16.7 pp |

	**12 of 12 task buckets non-negative; 9 strongly positive (≥ 5 pp);
	0 regressions** in task-aware MCQ mode (task_type from Video-MME dataset).

	> Scope note. This method targets short-clip, low-frame-budget
	> video QA. The 300 Q numbers above are inside that design envelope.
	> On the full 2700 Q split, overall Δ is +0.22 pp — see
	> [Scope on the full Video-MME mini (2700 Q)](#scope-on-the-full-video-mme-mini-2700-q) below.
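
The gap-recovery figures quoted in the introduction (~44 % and ~56 %) follow from the table above. The sketch below reproduces that arithmetic; `gap_recovered` is a hypothetical helper name used only for illustration and is not part of the shipped code.

```python
def gap_recovered(method_acc, low_budget_acc, high_budget_acc):
    """Share of the low-to-high frame-budget gap closed by a method."""
    gap = high_budget_acc - low_budget_acc
    if gap <= 0:
        raise ValueError("high-budget accuracy must exceed low-budget accuracy")
    return (method_acc - low_budget_acc) / gap


# Accuracies (%) copied from the table above (Video-MME mini 300 Q).
STOCK_8F, STOCK_64F = 57.0, 73.7
MODES = {"MCQ mode": 64.3, "Task-aware MCQ mode": 66.3}

for name, acc in MODES.items():
    print(f"{name}: {gap_recovered(acc, STOCK_8F, STOCK_64F):.1%} of the gap")
# MCQ mode: 43.7% of the gap
# Task-aware MCQ mode: 55.7% of the gap
```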

	## Why it works

	Stock Qwen3-VL-2B at 8 frames trails the same model at 64 frames by
	~17 pp. Because the model and prompt are identical and only the frame
	budget changes, that gap is attributable to frame coverage. The
	bottleneck is **which 8 frames you give the model**, not the model itself.

	DW-KhotTaeVL-2B-QueryFrames picks the 8 frames *that match the
	question* via CLIP-ViT-L/14 cosine similarity. For two task types
	where 64-frame stock does not outperform 8-frame stock (Object
	Reasoning and Temporal Reasoning per the Video-MME taxonomy), the
	hybrid policy reverts to uniform sampling — frame coverage is not
	the bottleneck for those questions, and CLIP scoring can mis-pick.

	## Pipeline

	```
	For each (video, question, options[A,B,C,D]):
	  1. Sample 32 uniformly-spaced candidate frames.
	  2. Encode the query text (question, optionally with the answer
	     options) with CLIP-ViT-L/14 → 768-d text vector.
	  3. Encode candidate frames → 768-d image vectors.
	  4. Cosine similarity → pick top-8 (or uniform-8 if task is
	     Object Reasoning / Temporal Reasoning, when task_type is given).
	  5. Sort selected 8 frames by original temporal index.
	  6. Pass 8 frames + MCQ to stock Qwen3-VL-2B-Instruct.
	  7. Extract letter from output.
	```
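
Steps 1, 4 and 5 can be sketched in plain Python as follows. This is an illustration only, and it assumes the CLIP text and image embeddings have already been computed elsewhere. The function names (`uniform_indices`, `select_top_k`) and the segment-midpoint spacing rule are assumptions for this example, so the shipped `dw_queryframes.py` may sample and rank differently.

```python
import math


def uniform_indices(num_frames, k):
    """Pick k evenly spaced frame indices (midpoint of k equal segments)."""
    if num_frames <= 0 or k <= 0:
        return []
    if k >= num_frames:
        return list(range(num_frames))
    return [int((i + 0.5) * num_frames / k) for i in range(k)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_top_k(text_vec, frame_vecs, k=8):
    """Rank candidate frames by similarity to the query, keep the top k,
    then return their positions in temporal (ascending index) order."""
    scores = [cosine(text_vec, frame) for frame in frame_vecs]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Toy 2-d embeddings standing in for the 768-d CLIP vectors.
query = [1.0, 0.0]
frames = [[0.5, 0.5], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0], [1.0, 0.0]]
print(uniform_indices(32, 8))            # [2, 6, 10, 14, 18, 22, 26, 30]
print(select_top_k(query, frames, k=2))  # [2, 4]
```

Re-sorting by index after ranking keeps the selected frames in their original temporal order, as step 5 requires.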

	## Usage

	### Install dependencies

	```bash
	pip install torch transformers pillow decord huggingface_hub
	```

	### Minimal example

	```python
	from dw_queryframes import QueryFrames

	fv = QueryFrames(device="auto")  # auto-resolves to cuda / mps / cpu

	result = fv.answer_mcq(
	    video_path="cooking.mp4",
	    question="What does the chef do after pouring the oil into the pot?",
	    options=[
	        "Chops fresh green herbs",
	        "Pours broth into the pot",
	        "Stirs the oil in the pot",
	        "Adds salt to the pot",
	    ],
	    task_type=None,  # or e.g. "Action Recognition" for task-aware MCQ mode
	)
	print(result["pred"])            # e.g. 'B'
	print(result["frames_used"])     # 'query_aware' or 'uniform_fallback'
	print(result["latency_clip_s"])  # ~0.4 s
	print(result["latency_gen_s"])   # ~3 s on Apple M4 MPS
	```

	### Two operating modes

	| Mode | Input | Use | Accuracy (300 Q) |
	|---|---|---|---:|
	| MCQ mode (no task_type) | video + question + answer options | Video-MCQ / decision-style QA without task taxonomy | 64.3 % |
	| Task-aware MCQ mode | + `task_type` string | benchmark or controlled workflows where task taxonomy is supplied | 66.3 % |

	Pass any of the Video-MME task labels (e.g. `"Action Recognition"`,
	`"Object Reasoning"`, `"Counting Problem"`) to `task_type`. Two values
	trigger the uniform-fallback path: `"Object Reasoning"` and
	`"Temporal Reasoning"`. All other task strings (or `None`) use the
	query-aware path.

	> MCQ mode without task_type (64.3 %, +7.3 pp) is the default
	> reported setting: it uses only the video, question, and answer
	> options, with no task taxonomy.
	>
	> Task-aware MCQ mode (66.3 %, +9.3 pp) uses the `task_type`
	> label supplied by Video-MME to route Object Reasoning and Temporal
	> Reasoning questions to uniform sampling. This is a benchmark /
	> controlled-workflow setting and is reported separately from default
	> MCQ mode.
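
The routing rule described in this section fits in a few lines. The sketch below is a hypothetical illustration rather than the shipped implementation. The return values mirror the `frames_used` strings shown in the minimal example, and exact, case-sensitive matching of the task label is an assumption.

```python
# Task labels that trigger uniform sampling, per the description above.
UNIFORM_FALLBACK_TASKS = frozenset({"Object Reasoning", "Temporal Reasoning"})


def choose_frame_policy(task_type=None):
    """Return which frame-selection path a question would take.

    Any label outside the fallback set, including None, uses the
    query-aware path. Exact string matching is an assumption here.
    """
    if task_type in UNIFORM_FALLBACK_TASKS:
        return "uniform_fallback"
    return "query_aware"


for label in (None, "Action Recognition", "Object Reasoning", "Temporal Reasoning"):
    print(f"{label!r:24} -> {choose_frame_policy(label)}")
```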

	## Per-task accuracy on Video-MME mini 300 Q

	| Task | n | Stock 8 f | QueryFrames | Δ |
	|---|---:|---:|---:|---:|
	| Action Reasoning | 9 | 0.444 | 0.667 | +0.222 ⭐ |
	| Action Recognition | 45 | 0.489 | 0.644 | +0.156 ⭐ |
	| Attribute Perception | 37 | 0.730 | 0.811 | +0.081 ⭐ |
	| Counting Problem | 34 | 0.265 | 0.353 | +0.088 ⭐ |
	| Information Synopsis | 30 | 0.800 | 0.800 | +0.000 |
	| OCR Problems | 23 | 0.391 | 0.609 | +0.217 ⭐ |
	| Object Reasoning | 36 | 0.722 | 0.722 | +0.000 |
	| Object Recognition | 51 | 0.588 | 0.667 | +0.078 ⭐ |
	| Spatial Perception | 10 | 0.600 | 0.700 | +0.100 ⭐ |
	| Spatial Reasoning | 9 | 0.778 | 1.000 | +0.222 ⭐ |
	| Temporal Perception | 8 | 0.625 | 0.750 | +0.125 ⭐ |
	| Temporal Reasoning | 8 | 0.250 | 0.250 | +0.000 |

	(Task-aware MCQ mode shown — task_type provided by Video-MME dataset.
	⭐ = Δ ≥ 5 pp.)
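
As a consistency check, the per-task rows can be aggregated back into the headline accuracies. The sketch below recovers correct-answer counts by rounding `n × accuracy`, which is an inference from the rounded table values rather than something read from the eval JSON files.

```python
# (task, n, stock 8 f accuracy, QueryFrames accuracy), copied from the table above.
ROWS = [
    ("Action Reasoning", 9, 0.444, 0.667),
    ("Action Recognition", 45, 0.489, 0.644),
    ("Attribute Perception", 37, 0.730, 0.811),
    ("Counting Problem", 34, 0.265, 0.353),
    ("Information Synopsis", 30, 0.800, 0.800),
    ("OCR Problems", 23, 0.391, 0.609),
    ("Object Reasoning", 36, 0.722, 0.722),
    ("Object Recognition", 51, 0.588, 0.667),
    ("Spatial Perception", 10, 0.600, 0.700),
    ("Spatial Reasoning", 9, 0.778, 1.000),
    ("Temporal Perception", 8, 0.625, 0.750),
    ("Temporal Reasoning", 8, 0.250, 0.250),
]


def correct_count(n, accuracy):
    """Recover the integer number of correct answers from a rounded accuracy."""
    return round(n * accuracy)


total = sum(n for _, n, _, _ in ROWS)
stock_correct = sum(correct_count(n, s) for _, n, s, _ in ROWS)
qf_correct = sum(correct_count(n, q) for _, n, _, q in ROWS)
strong_gains = sum(1 for _, _, s, q in ROWS if q - s >= 0.05)

print(f"stock:       {stock_correct}/{total} = {stock_correct / total:.1%}")
print(f"QueryFrames: {qf_correct}/{total} = {qf_correct / total:.1%}")
print(f"buckets with gain >= 5 pp: {strong_gains}")
# stock:       171/300 = 57.0%
# QueryFrames: 199/300 = 66.3%
# buckets with gain >= 5 pp: 9
```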

	## What this is NOT

	- It is not a fine-tuned model. Qwen3-VL-2B-Instruct weights are
	  unchanged. You can verify this by comparing weight-file hashes
	  against the upstream Hugging Face repository.
	- It is not a leaderboard submission claim. The numbers above are
	  on the publicly available Video-MME mini split (300 Q, filtered to
	  videos available locally via the standard mini chunks).
	- It is not a replacement for fine-tuning when you have abundant
	  domain data. For domain-shifted deployments (e.g. surveillance
	  video), training-based adaptation may be required.
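
For the first point, one way to make the weight check concrete is to hash a local weight file and compare it with a reference SHA-256 digest, for example the one listed for that file in the upstream repository. The sketch below uses only the Python standard library. The helper names are hypothetical, and the file path and reference digest are left for you to supply.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large weight files fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_reference(path, reference_hex):
    """True if the file's SHA-256 equals the reference digest (case-insensitive)."""
    return sha256_of_file(path) == reference_hex.strip().lower()
```

Calling `matches_reference(local_weight_path, upstream_digest)` on each weight file returns `True` only when the local bytes are identical to the reference.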

	## Hardware

	Runs on:

	| Device | Notes |
	|---|---|
	| Apple M4 Max / M3 Pro (MPS, ≥ 32 GB RAM) | tested; ~3-4 s/q at 8 frames |
	| NVIDIA A100 / H100 (CUDA) | works; faster |
	| CPU (BF16-capable) | works but slow |

	VRAM / unified memory needed: ~6-8 GB at `max_pixels=262144` with
	8 frames. Lower `max_pixels` (e.g. to 153600) if memory-constrained.

	## Reproducibility

	All numbers in this card are reproducible from a fresh clone of this
	repo, using the [official Video-MME parquet](https://huggingface.co/datasets/lmms-lab/Video-MME)
	(filtered to its `videos_chunked_01.zip` mini split).

	The shipped scripts (`eval_videomme.py` and `build_hybrid.py`) are
	self-contained — they have no external project dependencies beyond
	the local `dw_queryframes.py` module and standard Python /
	Hugging Face / PyTorch packages.

	### Three-command reproduction recipe

	```bash
	# Install deps
	pip install torch transformers pillow decord huggingface_hub pandas pyarrow

	# 1. Reproduce stock-uniform-8f baseline (writes stock_uniform_300q.json)
	python eval_videomme.py --mode stock-uniform --n-questions 300 \
	    --out-json stock_uniform_300q.json

	# 2. Reproduce MCQ mode (no task_type) (writes wild_300q.json)
	python eval_videomme.py --mode wild --n-questions 300 \
	    --out-json wild_300q.json

	# 3. Combine into task-aware MCQ mode via the hybrid policy
	python build_hybrid.py \
	    --wild-json wild_300q.json \
	    --stock-uniform-json stock_uniform_300q.json \
	    --out-json hybrid_300q.json
	```

	Expected results at 300 Q (greedy decoding, `do_sample=False`,
	`max_pixels=262144`):

	| Output | Accuracy | Δ vs stock |
	|---|---:|---:|
	| `stock_uniform_300q.json` | 0.5700 | 0 |
	| `wild_300q.json` (MCQ mode) | 0.6433 | +7.3 pp |
	| `hybrid_300q.json` (task-aware MCQ mode) | 0.6633 | +9.3 pp |

	This artifact is fully deterministic at greedy decoding —
	re-running on the same 300 questions reproduces the same 199 / 300 = 66.3 %
	in task-aware MCQ mode.

	> Caveat — sample size and split. The 300 Q numbers above are on
	> the `videos_chunked_01.zip` mini subset, which happens to be mostly
	> short clips. For full-split numbers on Video-MME mini 2700 Q
	> (balanced short / medium / long), see
	> [Scope on the full Video-MME mini (2700 Q)](#scope-on-the-full-video-mme-mini-2700-q)
	> below. This release is not a leaderboard submission.

	## Scope on the full Video-MME mini (2700 Q)

	After the 300 Q release, the eval was extended to the full 2700 Q
	split (MCQ mode without `task_type`). Stock scores 53.11 % and
	QueryFrames 53.33 %, a Δ of +0.22 pp.

	This method targets short-clip, low-frame-budget video QA. The
	2700 Q split is balanced across short / medium / long-form clips;
	averaging across that range dilutes the gain to roughly neutral.
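
To put +0.22 pp in absolute terms, the percentages can be converted back to question counts. This is a rough sketch that assumes the reported accuracies are rounded from whole-question counts over all 2700 questions. The counts are inferred here, not taken from the eval output.

```python
TOTAL_QUESTIONS = 2700
STOCK_ACC, QUERYFRAMES_ACC = 0.5311, 0.5333  # reported full-split accuracies

# Round to the nearest whole question, since accuracies come from integer counts.
stock_correct = round(STOCK_ACC * TOTAL_QUESTIONS)
queryframes_correct = round(QUERYFRAMES_ACC * TOTAL_QUESTIONS)
net_gain = queryframes_correct - stock_correct

print(stock_correct, queryframes_correct, net_gain)  # 1434 1440 6
```

Under that assumption the full-split difference is a net gain of about six questions out of 2700, consistent with the "roughly neutral" reading above.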

	## Acknowledgements / Related Work

	This project builds on Qwen3-VL-2B-Instruct and uses a simple
	CLIP-based query-aware frame selection policy at inference time.

	Query-aware and adaptive frame selection for Video-LLMs is an active
	research direction. This release is an independent, simple CLIP-based
	inference-time implementation focused on small-model video MCQ /
	decision-style video QA under tight frame budgets.

	## License

	| Component | License | Source |
	|---|---|---|
	| This wrapper code | Apache 2.0 | this repo |
	| Base model (Qwen3-VL-2B-Instruct) | Apache 2.0 | https://huggingface.co/Qwen/Qwen3-VL-2B-Instruct |
	| Frame scorer (CLIP-ViT-L/14) | MIT | https://huggingface.co/openai/clip-vit-large-patch14 |
	| Eval data (Video-MME mini) | as published by lmms-lab | https://huggingface.co/datasets/lmms-lab/Video-MME |

	When using or citing this work, please credit the base model and the frame selector:

	> Built on Qwen3-VL-2B-Instruct (Apache 2.0).
	> Frame selector: CLIP-ViT-L/14 (Radford et al. 2021, OpenAI, MIT).

	## Citation

	```bibtex
	@misc{dw-khottaevl-2b-queryframes-2026,
	  author    = {Deaw},
	  title     = {DW-KhotTaeVL-2B-QueryFrames: Query-Aware Frame Selection
	               for Video MCQ on Qwen3-VL-2B-Instruct},
	  year      = {2026},
	  publisher = {Hugging Face},
	  url       = {https://huggingface.co/commandeaw/DW-KhotTaeVL-2B-QueryFrames}
	}

	@misc{qwen3vl2025,
	  author = {Qwen Team},
	  title  = {Qwen3-VL: Multilingual Vision-Language Models},
	  year   = {2025}
	}

	@inproceedings{radford2021clip,
	  author    = {Radford, Alec and Kim, Jong Wook and others},
	  title     = {Learning Transferable Visual Models From Natural Language Supervision},
	  booktitle = {ICML},
	  year      = {2021}
	}

	@misc{videomme2024,
	  author = {Fu, Chaoyou and others},
	  title  = {Video-MME: The First-Ever Comprehensive Evaluation Benchmark
	            of Multi-modal LLMs in Video Analysis},
	  year   = {2024}
	}
	```

	## Author

	Deaw ([@commandeaw](https://huggingface.co/commandeaw)) — independent
	ML practitioner. Personal research release.

	Issues / questions: open an issue on the model repo.