---
title: TASKER Keyframe Extractor
emoji: 🔍
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: 6.15.1
app_file: app.py
short_description: VLM-guided tree-search keyframe extraction from videos
python_version: "3.12"
startup_duration_timeout: 30m
---
## TASKER Keyframe Extractor
This Space demonstrates **TASKER** (**Ta**sk-driven **a**nd **S**cene-aware **Ke**yframe sea**r**cher), a keyframe extraction algorithm from the ECCV 2026 paper [Bridging VideoQA and Video-Guided Agentic Tasks via Generalized Keyframe Extraction](https://arxiv.org/abs/2606.29445).
### How it works
TASKER reformulates keyframe extraction as a **generalized graph-search problem**:
1. The input video is segmented into a tree of segments.
2. A Vision-Language Model (Qwen2.5-VL-7B) judges which segments are likely to contain crucial actions that the current keyframes still miss.
3. The selected segments are expanded (split at visual change points).
4. Visual deduplication filters near-identical frames.
5. The search terminates when the VLM is confident enough (confidence ≥ 3) or a frame limit is reached.
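
The five steps above can be sketched as a small loop. This is only an illustration under stated assumptions, not the code in `app.py`: `Segment`, `is_near_duplicate`, `search_keyframes`, and the `toy_*` helpers are hypothetical names, the VLM is replaced by a callback that returns a segment index and a confidence, and frames are represented by short feature vectors rather than images.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Segment:
    start: int  # first frame index, inclusive
    end: int    # last frame index, exclusive

    def split(self, change_point: int) -> List["Segment"]:
        """Split at an interior change point; otherwise return the segment unchanged."""
        if not (self.start < change_point < self.end):
            return [self]
        return [Segment(self.start, change_point), Segment(change_point, self.end)]


def is_near_duplicate(a: Sequence[float], b: Sequence[float], tol: float = 0.05) -> bool:
    """Hypothetical dedup test: mean absolute feature difference below `tol`."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a) < tol


def search_keyframes(
    n_frames: int,
    features: Sequence[Sequence[float]],
    pick_segment: Callable[[List[Segment], List[int]], Tuple[int, int]],
    find_change_point: Callable[[Segment], int],
    max_frames: int = 8,
    min_confidence: int = 3,
) -> List[int]:
    frontier = [Segment(0, n_frames)]  # step 1: start from the whole video
    keyframes: List[int] = []
    while frontier and len(keyframes) < max_frames:
        # Step 2: the scorer (a stand-in for the VLM) picks a segment.
        index, confidence = pick_segment(frontier, keyframes)
        if confidence >= min_confidence:  # step 5: confident enough, stop
            break
        segment = frontier.pop(index)
        # Step 3: expand the chosen segment at a visual change point.
        children = segment.split(find_change_point(segment))
        for child in children:
            candidate = child.start
            # Step 4: skip frames that look like an already selected keyframe.
            duplicate = any(
                is_near_duplicate(features[candidate], features[k]) for k in keyframes
            )
            if not duplicate and len(keyframes) < max_frames:
                keyframes.append(candidate)
            # Only re-queue children that can still be split further.
            if len(children) > 1 and child.end - child.start > 1:
                frontier.append(child)
    return sorted(keyframes)


# Toy stand-ins: pick the longest segment, report the keyframe count as confidence.
def toy_pick(frontier: List[Segment], keyframes: List[int]) -> Tuple[int, int]:
    longest = max(range(len(frontier)), key=lambda i: frontier[i].end - frontier[i].start)
    return longest, len(keyframes)


def toy_change_point(segment: Segment) -> int:
    return (segment.start + segment.end) // 2  # midpoint instead of a real detector


toy_features = [[0.0]] * 4 + [[1.0]] * 4  # two visually distinct "scenes"
demo_keyframes = search_keyframes(8, toy_features, toy_pick, toy_change_point)
```

In this toy run the video has two scenes, so deduplication leaves one keyframe per scene (`demo_keyframes` is `[0, 4]`) and the loop ends when no splittable segment remains. In the actual Space, the confidence would come from the VLM rather than from a keyframe count.
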
Four search strategies are available:
- **A\*** (default): balances goal-relevance and visual state changes
- **BFS**: broad exploration, can select multiple segments per step
- **GBFS**: greedy best-first, focuses on goal-critical actions
- **Dijkstra**: focuses on maximum visual state transitions
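
One way to picture the difference between the strategies is as different priority functions over the same candidate segments. The sketch below is an assumption made for illustration, not the scoring used by TASKER: `Candidate`, `priority`, `select`, the two scores, the additive A\* combination, and the BFS width are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    name: str
    relevance: float      # hypothetical goal-relevance score in [0, 1]
    visual_change: float  # hypothetical visual state-change score in [0, 1]


def priority(candidate: Candidate, strategy: str) -> float:
    """Score a candidate segment under one strategy (higher is expanded first)."""
    if strategy == "gbfs":      # greedy: goal relevance only
        return candidate.relevance
    if strategy == "dijkstra":  # visual state transitions only
        return candidate.visual_change
    if strategy == "astar":     # assumed here to be a plain sum of both signals
        return candidate.relevance + candidate.visual_change
    raise ValueError(f"unknown strategy: {strategy}")


def select(candidates: List[Candidate], strategy: str, bfs_width: int = 2) -> List[Candidate]:
    """Return the segment(s) to expand in this step."""
    if not candidates:
        return []
    if strategy == "bfs":
        # Broad exploration: take several segments in order, ignoring scores.
        return candidates[:bfs_width]
    return [max(candidates, key=lambda c: priority(c, strategy))]


candidates = [
    Candidate("intro", relevance=0.2, visual_change=0.9),
    Candidate("compose", relevance=0.9, visual_change=0.1),
    Candidate("attach", relevance=0.7, visual_change=0.6),
]
chosen = {s: [c.name for c in select(candidates, s)] for s in ("astar", "bfs", "gbfs", "dijkstra")}
```

With these made-up scores, GBFS picks `compose`, Dijkstra picks `intro`, A\* picks `attach` because it scores well on both signals, and BFS expands the first two segments in one step.
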
### Usage
1. Upload a video file
2. Enter a task query (e.g., "How to send an email with an attachment?")
3. Select a search strategy
4. Click "Extract Keyframes"
The Space returns a gallery of keyframes, each labeled with its timestamp and frame index.
### Model
This Space uses [Qwen/Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) as the VLM for segment evaluation and runs on ZeroGPU.