Video-Text-to-Text
Transformers
English
video
video-question-answering
multimodal
vision-language
qwen3-vl
inference-time
frame-selection
clip
Add Acknowledgements / Related Work section
Acknowledge that query-aware and adaptive frame selection for
Video-LLMs is an active research direction; clarify that this
release is an independent engineering implementation focused on
small-model, low-frame-budget video QA and CCTV-style deployment.
README.md
CHANGED
@@ -217,6 +217,16 @@ in benchmark mode.
 > Video-MME benchmark and are not a leaderboard submission. A full-
 > benchmark eval is on the future-work list.
 
+## Acknowledgements / Related Work
+
+This project builds on Qwen3-VL-2B-Instruct and uses a simple
+CLIP-based query-aware frame selection policy at inference time.
+
+Query-aware and adaptive frame selection for Video-LLMs is an active
+research direction. DW-KhotTaeVL-2B-QueryFrames is an independent
+engineering implementation focused on small-model, low-frame-budget
+video QA and CCTV-style deployment constraints.
+
 ## License
 
 | Component | License | Source |
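
For readers unfamiliar with the idea, the query-aware frame selection that the added section mentions can be illustrated with a minimal sketch. This is not the repository's implementation, which the diff does not show. It assumes each sampled frame and the text query have already been embedded into a shared space (for example by a CLIP image encoder and text encoder), and it assumes a plain top-k policy under a fixed frame budget. The names `cosine_similarity`, `select_frames`, and `budget` are hypothetical.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_frames(
    frame_embeddings: Sequence[Sequence[float]],
    query_embedding: Sequence[float],
    budget: int,
) -> List[int]:
    """Pick up to `budget` frame indices most similar to the query.

    Assumption: embeddings are precomputed (e.g. by CLIP); this function
    only ranks them. Indices are returned in temporal order so the
    downstream video model still sees frames in sequence.
    """
    if budget <= 0:
        return []
    scores = [cosine_similarity(f, query_embedding) for f in frame_embeddings]
    # Rank by descending similarity; ties go to the earlier frame.
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return sorted(ranked[:budget])


# Toy 2-D embeddings standing in for real CLIP features.
frames = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
query = [1.0, 0.0]
print(select_frames(frames, query, budget=2))  # [0, 2]
```

A top-k policy like this can concentrate the whole budget on near-duplicate frames from one moment of the video. Whether the released model adds temporal spreading or a uniform-sampling fallback is not stated in this change.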