---
license: apache-2.0
language:
- en
tags:
- video
- video-question-answering
- multimodal
- vision-language
- qwen3-vl
- inference-time
- frame-selection
- clip
base_model: Qwen/Qwen3-VL-2B-Instruct
pipeline_tag: video-text-to-text
library_name: transformers
---

# DW-KhotTaeVL-2B-QueryFrames

**Built on [Qwen3-VL-2B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-2B-Instruct) (Apache 2.0).**

A query-aware frame selection wrapper around stock Qwen3-VL-2B-Instruct
for video multiple-choice / decision-style question answering. **No model
weights are modified.** The release ships a CLIP-ViT-L/14-driven frame
selector and an optional task-type-aware uniform-fallback policy that
wrap the stock model.

On the Video-MME mini 300 Q subset at an 8-frame budget, this recovers
**~44 % of the 8-frame → 64-frame stock baseline gap in MCQ mode, and
~56 % in task-aware MCQ mode**, with zero training, zero parameter
changes, and roughly 0.4 s of added overhead per question.

## Scope

This release evaluates query-aware frame selection in a video
multiple-choice / decision-style QA setting. The selector may use
both the question text and the answer options as its CLIP query.
This is appropriate for Video-MME-style MCQ benchmarks and for
operational triage workflows where the system chooses among
predefined actions or alert categories (e.g. *normal passage /
restricted-zone entry / staff activity / false alarm*). It should
**not** be read as an open-ended video-understanding benchmark claim.

## Motivation

This work started from CCTV / video-security R&D, where only a small
number of frames can be sent to a VLM under latency and compute
constraints. The released artifact is a general-purpose query-aware
frame selector for video MCQ / decision-style video QA, not a
product-specific CCTV model.

## TL;DR

| Method | Trainable params | Accuracy (Video-MME mini 300 Q) | Δ vs stock 8 f |
|---|---:|---:|---:|
| Stock Qwen3-VL-2B (uniform 8 f) | 0 | 57.0 % | 0 |
| **QueryFrames, MCQ mode** (8 f, no task_type) | 0 | **64.3 %** | **+7.3 pp** |
| **QueryFrames, task-aware MCQ mode** (8 f, task_type from dataset) | 0 | **66.3 %** | **+9.3 pp** |
| Stock Qwen3-VL-2B (uniform 64 f, ceiling) | 0 | 73.7 % | +16.7 pp |

**12 of 12 task buckets non-negative; 9 strongly positive (≥ 5 pp);
0 regressions** in task-aware MCQ mode (task_type from Video-MME dataset).

> **Scope note.** This method targets short-clip, low-frame-budget
> video QA. The 300 Q numbers above are inside that design envelope.
> On the full 2700 Q split, overall Δ is **+0.22 pp**; see
> [Scope on the full Video-MME mini (2700 Q)](#scope-on-the-full-video-mme-mini-2700-q) below.
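
The gap-recovery percentages quoted in the introduction can be re-derived from the TL;DR table. The snippet below is only a worked check of that arithmetic; the function name `gap_recovered` is illustrative and not part of the released code.

```python
# Accuracies (%) copied from the TL;DR table.
stock_8f = 57.0
stock_64f = 73.7
mcq_mode = 64.3
task_aware = 66.3


def gap_recovered(acc: float, low: float = stock_8f, high: float = stock_64f) -> float:
    """Fraction of the low -> high accuracy gap that `acc` closes."""
    return (acc - low) / (high - low)


mcq_recovery = gap_recovered(mcq_mode)
task_aware_recovery = gap_recovered(task_aware)
print(f"MCQ mode: {mcq_recovery:.1%}")                    # 43.7%
print(f"Task-aware MCQ mode: {task_aware_recovery:.1%}")  # 55.7%
```

Rounded to whole percentages these are the ~44 % and ~56 % figures.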

## Why it works

Stock Qwen3-VL-2B at 8 frames trails the same model at 64 frames by
~17 pp. Because the model and prompt are identical and only the frame
budget changes, the gap is attributable to frame coverage: the
bottleneck is **which 8 frames you give the model**, not the model
itself.

DW-KhotTaeVL-2B-QueryFrames picks the 8 frames *that best match the
question* via CLIP-ViT-L/14 cosine similarity. For two task types
where 64-frame stock does *not* outperform 8-frame stock (Object
Reasoning and Temporal Reasoning per the Video-MME taxonomy), the
hybrid policy reverts to uniform sampling: frame coverage is not
the bottleneck for those questions, and CLIP scoring can mis-pick.

## Pipeline

```
For each (video, question, options[A,B,C,D]):
  1. Sample 32 uniformly-spaced candidate frames.
  2. Encode the query text (the question, optionally with the answer
     options) with CLIP-ViT-L/14 → 768-d text vector.
  3. Encode candidate frames → 768-d image vectors.
  4. Cosine similarity → pick top-8 (or uniform-8 if task is
     Object Reasoning / Temporal Reasoning, when task_type is given).
  5. Sort selected 8 frames by original temporal index.
  6. Pass 8 frames + MCQ to stock Qwen3-VL-2B-Instruct.
  7. Extract letter from output.
```
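
Steps 4 and 5 can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the code in `dw_queryframes.py`: the names `select_frames`, `uniform_indices`, and `FALLBACK_TASKS` are hypothetical, the 2-d vectors stand in for real 768-d CLIP embeddings, and taking the uniform fallback from the 32 candidates (rather than from the raw video) is an assumption.

```python
import math
from typing import List, Optional, Sequence

# Task labels that the card routes to uniform sampling.
FALLBACK_TASKS = {"Object Reasoning", "Temporal Reasoning"}


def uniform_indices(n_total: int, n_pick: int) -> List[int]:
    """Evenly spaced indices: the midpoint of each of n_pick equal segments."""
    if n_pick >= n_total:
        return list(range(n_total))
    step = n_total / n_pick
    return [int(step * (i + 0.5)) for i in range(n_pick)]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_frames(
    text_vec: Sequence[float],
    frame_vecs: Sequence[Sequence[float]],
    k: int = 8,
    task_type: Optional[str] = None,
) -> List[int]:
    """Return k candidate indices in temporal order (pipeline steps 4 and 5)."""
    if task_type in FALLBACK_TASKS:
        return uniform_indices(len(frame_vecs), k)
    scores = [cosine(text_vec, v) for v in frame_vecs]
    top_k = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top_k)  # restore temporal order


# Toy stand-ins for CLIP embeddings: 32 candidates, frames 10..17 match the query.
text_vec = [1.0, 0.0]
frame_vecs = [[1.0, 0.1] if 10 <= i < 18 else [0.0, 1.0] for i in range(32)]

query_aware = select_frames(text_vec, frame_vecs)
fallback = select_frames(text_vec, frame_vecs, task_type="Temporal Reasoning")
print(query_aware)  # [10, 11, 12, 13, 14, 15, 16, 17]
print(fallback)     # [2, 6, 10, 14, 18, 22, 26, 30]
```

Sorting the selected indices before returning them keeps the frames in temporal order, so the VLM sees events in the sequence they occurred even though selection was driven by similarity score.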

## Usage

### Install dependencies

```bash
pip install torch transformers pillow decord huggingface_hub
```

### Minimal example

```python
from dw_queryframes import QueryFrames

fv = QueryFrames(device="auto")  # auto-resolves to cuda / mps / cpu

result = fv.answer_mcq(
    video_path="cooking.mp4",
    question="What does the chef do after pouring the oil into the pot?",
    options=[
        "Chops fresh green herbs",
        "Pours broth into the pot",
        "Stirs the oil in the pot",
        "Adds salt to the pot",
    ],
    task_type=None,  # or e.g. "Action Recognition" for task-aware MCQ mode
)

print(result["pred"])            # e.g. 'B'
print(result["frames_used"])     # 'query_aware' or 'uniform_fallback'
print(result["latency_clip_s"])  # ~0.4 s
print(result["latency_gen_s"])   # ~3 s on Apple M4 MPS
```
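
For readers wondering how a question and its options become a prompt, and how `result["pred"]` could be read back from free-form model output (pipeline steps 6 and 7), here is one possible sketch. The prompt wording and both helper names (`build_mcq_prompt`, `extract_letter`) are assumptions for illustration; the card does not document the actual prompt or parser used inside `answer_mcq`.

```python
import re
from typing import List, Optional

LETTERS = "ABCDEFGH"


def build_mcq_prompt(question: str, options: List[str]) -> str:
    """Format a question and its options as a lettered multiple-choice prompt."""
    lines = [question]
    lines += [f"{LETTERS[i]}. {opt}" for i, opt in enumerate(options)]
    lines.append("Answer with the option's letter only.")
    return "\n".join(lines)


def extract_letter(output: str, n_options: int = 4) -> Optional[str]:
    """Return the first standalone option letter in the output, or None."""
    valid = LETTERS[:n_options]
    match = re.search(rf"\b([{valid}])\b", output.strip())
    return match.group(1) if match else None


prompt = build_mcq_prompt(
    "What does the chef do after pouring the oil into the pot?",
    [
        "Chops fresh green herbs",
        "Pours broth into the pot",
        "Stirs the oil in the pot",
        "Adds salt to the pot",
    ],
)
print(prompt)
print(extract_letter("B"))                   # B
print(extract_letter("The answer is (C)."))  # C
print(extract_letter("no idea"))             # None
```

A parser this simple will misread an output that begins with the article "A" (for example "A chef stirs the pot"), which is one reason to instruct the model to answer with the letter only.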

### Two operating modes

| Mode | Input | Use | Accuracy (300 Q) |
|---|---|---|---:|
| **MCQ mode** (no task_type) | video + question + answer options | Video-MCQ / decision-style QA without task taxonomy | **64.3 %** |
| **Task-aware MCQ mode** | + `task_type` string | benchmark or controlled workflows where task taxonomy is supplied | **66.3 %** |

Pass any of the Video-MME task labels (e.g. `"Action Recognition"`,
`"Object Reasoning"`, `"Counting Problem"`) to `task_type`. Two values
trigger the uniform-fallback path: `"Object Reasoning"` and
`"Temporal Reasoning"`. All other task strings (or `None`) use the
query-aware path.

> **MCQ mode without task_type (64.3 %, +7.3 pp)** is the default
> reported setting: it uses only the video, question, and answer
> options, with no task taxonomy.
>
> **Task-aware MCQ mode (66.3 %, +9.3 pp)** uses the `task_type`
> label supplied by Video-MME to route Object Reasoning and Temporal
> Reasoning questions to uniform sampling. This is a benchmark /
> controlled-workflow setting and is reported separately from default
> MCQ mode.

## Per-task accuracy on Video-MME mini 300 Q

| Task | n | Stock 8 f | QueryFrames | Δ |
|---|---:|---:|---:|---:|
| Action Reasoning | 9 | 0.444 | 0.667 | **+0.222** ⭐ |
| Action Recognition | 45 | 0.489 | 0.644 | **+0.156** ⭐ |
| Attribute Perception | 37 | 0.730 | 0.811 | **+0.081** ⭐ |
| Counting Problem | 34 | 0.265 | 0.353 | **+0.088** ⭐ |
| Information Synopsis | 30 | 0.800 | 0.800 | +0.000 |
| OCR Problems | 23 | 0.391 | 0.609 | **+0.217** ⭐ |
| Object Reasoning | 36 | 0.722 | 0.722 | +0.000 |
| Object Recognition | 51 | 0.588 | 0.667 | **+0.078** ⭐ |
| Spatial Perception | 10 | 0.600 | 0.700 | **+0.100** ⭐ |
| Spatial Reasoning | 9 | 0.778 | 1.000 | **+0.222** ⭐ |
| Temporal Perception | 8 | 0.625 | 0.750 | **+0.125** ⭐ |
| Temporal Reasoning | 8 | 0.250 | 0.250 | +0.000 |

(Task-aware MCQ mode shown; task_type provided by the Video-MME dataset.
⭐ = Δ ≥ 5 pp.)
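
As a consistency check, the per-task rows can be aggregated back into the headline numbers. The snippet below copies the table values and assumes each three-decimal accuracy is a rounded ratio of whole questions, so `round(n * accuracy)` recovers the per-task correct count.

```python
# (task, n, stock 8 f accuracy, QueryFrames accuracy) copied from the table above.
rows = [
    ("Action Reasoning", 9, 0.444, 0.667),
    ("Action Recognition", 45, 0.489, 0.644),
    ("Attribute Perception", 37, 0.730, 0.811),
    ("Counting Problem", 34, 0.265, 0.353),
    ("Information Synopsis", 30, 0.800, 0.800),
    ("OCR Problems", 23, 0.391, 0.609),
    ("Object Reasoning", 36, 0.722, 0.722),
    ("Object Recognition", 51, 0.588, 0.667),
    ("Spatial Perception", 10, 0.600, 0.700),
    ("Spatial Reasoning", 9, 0.778, 1.000),
    ("Temporal Perception", 8, 0.625, 0.750),
    ("Temporal Reasoning", 8, 0.250, 0.250),
]

total = sum(n for _, n, _, _ in rows)
stock_correct = sum(round(n * s) for _, n, s, _ in rows)
qf_correct = sum(round(n * q) for _, n, _, q in rows)
strong_gains = sum(1 for _, _, s, q in rows if q - s >= 0.05)
regressions = sum(1 for _, _, s, q in rows if q < s)

print(total)                                           # 300
print(stock_correct, f"{stock_correct / total:.1%}")   # 171 57.0%
print(qf_correct, f"{qf_correct / total:.1%}")         # 199 66.3%
print(strong_gains, regressions)                       # 9 0
```

The totals reproduce the 57.0 % stock baseline and the 199 / 300 = 66.3 % task-aware result.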

## What this is NOT

- It is **not** a fine-tuned model. Qwen3-VL-2B-Instruct weights are
  unchanged. You can verify this by checking the model file hashes
  against the upstream Qwen3-VL-2B-Instruct repository.
- It is **not** a leaderboard submission claim. The numbers above are
  on the publicly available Video-MME mini split (300 Q, filtered to
  videos available locally via the standard mini chunks).
- It is **not** a replacement for fine-tuning when you have abundant
  domain data. For domain-shifted deployments (e.g. surveillance
  video), training-based adaptation may be required.

## Hardware

Runs on:

| Device | Notes |
|---|---|
| Apple M4 Max / M3 Pro (MPS, ≥ 32 GB RAM) | tested; ~3-4 s per question at 8 frames |
| NVIDIA A100 / H100 (CUDA) | works; faster |
| CPU (BF16-capable) | works but slow |

VRAM / unified memory needed: ~6-8 GB at `max_pixels=262144` with
8 frames. Lower `max_pixels` (e.g. to 153600) if memory-constrained.

## Reproducibility

All numbers in this card are reproducible from a fresh clone of this
repo, using the [official Video-MME parquet](https://huggingface.co/datasets/lmms-lab/Video-MME)
(filtered to its `videos_chunked_01.zip` mini split).

The shipped scripts (`eval_videomme.py` and `build_hybrid.py`) are
**self-contained**: they have no external project dependencies beyond
the local `dw_queryframes.py` module and standard Python /
Hugging Face / PyTorch packages.

### Three-command reproduction recipe

```bash
# Install deps
pip install torch transformers pillow decord huggingface_hub pandas pyarrow

# 1. Reproduce stock-uniform-8f baseline (writes stock_uniform_300q.json)
python eval_videomme.py --mode stock-uniform --n-questions 300 \
    --out-json stock_uniform_300q.json

# 2. Reproduce MCQ mode (no task_type) (writes wild_300q.json)
python eval_videomme.py --mode wild --n-questions 300 \
    --out-json wild_300q.json

# 3. Combine into task-aware MCQ mode via the hybrid policy
python build_hybrid.py \
    --wild-json wild_300q.json \
    --stock-uniform-json stock_uniform_300q.json \
    --out-json hybrid_300q.json
```

Expected results at 300 Q (greedy decoding, `do_sample=False`,
`max_pixels=262144`):

| Output | Accuracy | Δ vs stock |
|---|---:|---:|
| `stock_uniform_300q.json` | 0.5700 | n/a |
| `wild_300q.json` (MCQ mode) | 0.6433 | +7.3 pp |
| `hybrid_300q.json` (task-aware MCQ mode) | 0.6633 | +9.3 pp |

Under greedy decoding the pipeline is **fully deterministic**:
re-running on the same 300 questions reproduces the same
199 / 300 = 66.3 % in task-aware MCQ mode.
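
The third command combines the two result files under the hybrid policy. The sketch below shows one way such a merge could work: take the stock-uniform record for the two fallback task types and the query-aware record otherwise. It is not the contents of `build_hybrid.py`. The record keys (`question_id`, `task_type`, `correct`), the function names, and the four toy records are all assumptions made for illustration, and the real JSON schema may differ.

```python
from typing import Dict, List

UNIFORM_FALLBACK_TASKS = {"Object Reasoning", "Temporal Reasoning"}


def build_hybrid(wild: List[Dict], stock: List[Dict]) -> List[Dict]:
    """Per question, keep the stock-uniform record for fallback tasks, else the wild one."""
    stock_by_id = {rec["question_id"]: rec for rec in stock}
    merged = []
    for rec in wild:
        use_stock = rec["task_type"] in UNIFORM_FALLBACK_TASKS
        source = stock_by_id[rec["question_id"]] if use_stock else rec
        policy = "uniform_fallback" if use_stock else "query_aware"
        merged.append({**source, "frames_used": policy})
    return merged


def accuracy(records: List[Dict]) -> float:
    """Share of records marked correct."""
    return sum(bool(r["correct"]) for r in records) / len(records)


# Made-up toy records (hypothetical schema), not real evaluation output.
tasks = ["Action Recognition", "Object Reasoning", "Temporal Reasoning", "OCR Problems"]
wild = [
    {"question_id": f"q{i}", "task_type": t, "correct": c}
    for i, (t, c) in enumerate(zip(tasks, [True, False, True, False]), start=1)
]
stock = [
    {"question_id": f"q{i}", "task_type": t, "correct": c}
    for i, (t, c) in enumerate(zip(tasks, [False, True, True, False]), start=1)
]

hybrid = build_hybrid(wild, stock)
print(accuracy(wild), accuracy(stock), accuracy(hybrid))  # 0.5 0.5 0.75
```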

> **Caveat: sample size and split.** The 300 Q numbers above are on
> the `videos_chunked_01.zip` mini subset, which happens to be mostly
> short clips. For full-split numbers on Video-MME mini 2700 Q
> (balanced short / medium / long), see
> [Scope on the full Video-MME mini (2700 Q)](#scope-on-the-full-video-mme-mini-2700-q)
> below. This release is not a leaderboard submission.

## Scope on the full Video-MME mini (2700 Q)

After the 300 Q release, the eval was extended to the full 2700 Q
split (MCQ mode without `task_type`). Stock scores 53.11 % and
QueryFrames 53.33 %, for **Δ +0.22 pp**.

This method targets short-clip, low-frame-budget video QA. The
2700 Q split is balanced across short / medium / long-form clips;
averaging across that range dilutes the gain to roughly neutral.
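
To put +0.22 pp in concrete terms, the percentages can be converted back to question counts. This assumes the two reported figures are rounded from whole-question counts out of 2700; the card does not state the raw counts.

```python
n_questions = 2700
stock_acc = 0.5311        # 53.11 %
queryframes_acc = 0.5333  # 53.33 %

stock_correct = round(stock_acc * n_questions)
qf_correct = round(queryframes_acc * n_questions)
net_gain = qf_correct - stock_correct

print(stock_correct, qf_correct, net_gain)  # 1434 1440 6
```

Under that assumption the full-split difference is a net gain of about 6 questions out of 2700, which is consistent with the "roughly neutral" reading above.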

## Acknowledgements / Related Work

This project builds on Qwen3-VL-2B-Instruct and uses a simple
CLIP-based query-aware frame selection policy at inference time.

Query-aware and adaptive frame selection for Video-LLMs is an active
research direction. This release is an independent, simple CLIP-based
inference-time implementation focused on small-model video MCQ /
decision-style video QA under tight frame budgets.

## License

| Component | License | Source |
|---|---|---|
| This wrapper code | Apache 2.0 | this repo |
| Base model (Qwen3-VL-2B-Instruct) | Apache 2.0 | https://huggingface.co/Qwen/Qwen3-VL-2B-Instruct |
| Frame scorer (CLIP-ViT-L/14) | MIT | https://huggingface.co/openai/clip-vit-large-patch14 |
| Eval data (Video-MME mini) | as published by lmms-lab | https://huggingface.co/datasets/lmms-lab/Video-MME |

When using or citing this work, please credit the base model:

> Built on Qwen3-VL-2B-Instruct (Apache 2.0).
> Frame selector: CLIP-ViT-L/14 (Radford et al. 2021, OpenAI, MIT).

## Citation

```bibtex
@misc{dw-khottaevl-2b-queryframes-2026,
  author    = {Deaw},
  title     = {DW-KhotTaeVL-2B-QueryFrames: Query-Aware Frame Selection
               for Video MCQ on Qwen3-VL-2B-Instruct},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/commandeaw/DW-KhotTaeVL-2B-QueryFrames}
}

@misc{qwen3vl2025,
  title  = {Qwen3-VL: Multilingual Vision-Language Models},
  author = {Qwen Team},
  year   = {2025},
}

@inproceedings{radford2021clip,
  title     = {Learning Transferable Visual Models From Natural Language Supervision},
  author    = {Radford, Alec and Kim, Jong Wook and others},
  booktitle = {ICML},
  year      = {2021},
}

@misc{videomme2024,
  title  = {Video-MME: The First-Ever Comprehensive Evaluation Benchmark
            of Multi-modal LLMs in Video Analysis},
  author = {Fu, Chaoyou and others},
  year   = {2024},
}
```

## Author

**Deaw** ([@commandeaw](https://huggingface.co/commandeaw)), independent
ML practitioner. Personal research release.

Issues / questions: open an issue on the model repo.