Video-Text-to-Text
Transformers
English
video
video-question-answering
multimodal
vision-language
qwen3-vl
inference-time
frame-selection
clip
Polish: distinguish wild-mode (~44%) vs benchmark-mode (~56%) gap recovery
Per auditor feedback: the previous '56% of the gap' line was the
benchmark-mode number; spell out both modes' recovery fractions to
avoid ambiguity for readers who skim the lede.
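
For readers who want to check a "fraction of the gap recovered" figure themselves, here is a minimal sketch of the arithmetic, assuming the usual definition: the method's gain over the 8-frame stock baseline divided by the gain of the 64-frame stock baseline over the same 8-frame baseline. The function name `gap_recovery` and the accuracies in the example are hypothetical and are not taken from this repository's results.

```python
def gap_recovery(acc_low: float, acc_high: float, acc_method: float) -> float:
    """Fraction of the low-budget -> high-budget accuracy gap closed by a method.

    acc_low:    stock accuracy at the small frame budget (e.g. 8 frames)
    acc_high:   stock accuracy at the large frame budget (e.g. 64 frames)
    acc_method: accuracy of the method at the small frame budget
    """
    gap = acc_high - acc_low
    if gap <= 0:
        raise ValueError("acc_high must exceed acc_low for the gap to be defined")
    return (acc_method - acc_low) / gap


# Hypothetical accuracies for illustration only (not reported results).
example = gap_recovery(acc_low=50.0, acc_high=60.0, acc_method=55.0)
print(f"{example:.0%}")  # 50%
```

Wild mode and benchmark mode would each be passed through the same formula with their own `acc_method`, which is why the two modes yield two different recovery fractions against one shared baseline gap.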
README.md
CHANGED
@@ -26,9 +26,10 @@ modified** — this method ships a CLIP-ViT-L/14-driven frame selector
 plus an optional task-type-aware uniform-fallback policy as a
 wrapper around the stock model.
 
-On Video-MME mini at 8-frame budget, this recovers **
-8-frame → 64-frame stock baseline gap
-
+On Video-MME mini at 8-frame budget, this recovers **~44 % of the
+8-frame → 64-frame stock baseline gap in wild mode, and ~56 % in
+benchmark mode**, with zero training, zero parameter changes, and
+~+0.4 s overhead per question.
 
 ## TL;DR
 
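
The README text in this diff says only that the frame selector is CLIP-ViT-L/14-driven and that an optional task-type-aware uniform-fallback policy exists. As a rough illustration of what query-driven frame selection under a fixed budget can look like, the sketch below ranks frames by cosine similarity to a query embedding and keeps the top `budget` frames, with a uniform-sampling fallback. Everything here is an assumption rather than the repository's actual implementation: the scoring rule, the names `select_frames` and `uniform_indices`, and the boolean `use_uniform_fallback` flag are all hypothetical. Plain Python lists stand in for real CLIP image and text embeddings.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def uniform_indices(num_frames: int, budget: int) -> List[int]:
    """Evenly spaced indices: the centre of each of `budget` equal segments."""
    budget = min(budget, num_frames)
    if budget <= 0:
        return []
    return [int((i + 0.5) * num_frames / budget) for i in range(budget)]


def select_frames(
    frame_embs: Sequence[Sequence[float]],
    query_emb: Sequence[float],
    budget: int = 8,
    use_uniform_fallback: bool = False,  # hypothetical switch for the fallback policy
) -> List[int]:
    """Return up to `budget` frame indices in temporal order."""
    if use_uniform_fallback:
        return uniform_indices(len(frame_embs), budget)
    scores = [cosine(f, query_emb) for f in frame_embs]
    # Keep the highest-scoring frames, then restore temporal order.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:budget]
    return sorted(top)


# Toy embeddings standing in for CLIP features (illustration only).
frames = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
print(select_frames(frames, [1.0, 0.0], budget=2))  # [0, 2]
```

In this sketch the selected indices are returned in temporal order so the downstream model still sees frames chronologically. How the real wrapper decides when to fall back to uniform sampling is not described in this diff.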