---
title: TASKER Keyframe Extractor
emoji: 🔍
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: 6.15.1
app_file: app.py
short_description: VLM-guided tree-search keyframe extraction from videos
python_version: "3.12"
startup_duration_timeout: 30m
---

## TASKER Keyframe Extractor

This Space demonstrates **TASKER** (**Ta**sk-driven **a**nd **S**cene-aware **Ke**yframe sea**r**cher), a keyframe extraction algorithm from the ECCV 2026 paper [Bridging VideoQA and Video-Guided Agentic Tasks via Generalized Keyframe Extraction](https://arxiv.org/abs/2606.29445).

### How it works

TASKER reformulates keyframe extraction as a **generalized graph-search problem**:

1. The input video is divided into a tree of segments.
2. A Vision-Language Model (Qwen2.5-VL-7B) evaluates which segments are likely to contain crucial missing actions.
3. The selected segments are expanded (split at visual change points).
4. Visual deduplication filters out near-identical frames.
5. The search terminates when the VLM is confident enough (confidence ≥ 3) or the frame limit is reached.

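
The five steps above can be pictured as a small search loop. The following is a minimal sketch under stated assumptions, not the Space's actual implementation: frames are reduced to one number each as a toy visual feature, and `Segment`, `split_at_change`, `dedupe`, `search_keyframes`, and `longest_first_stub` are hypothetical names. The `evaluate` callback stands in for the VLM and returns which segment to expand together with a confidence score.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Segment:
    start: int  # first frame index (inclusive)
    end: int    # last frame index (exclusive)


def split_at_change(seg: Segment, frames: Sequence[float]) -> List[Segment]:
    """Split a segment at its largest frame-to-frame difference (toy change point)."""
    if seg.end - seg.start < 2:
        return []  # a single frame cannot be split further
    cut = max(range(seg.start + 1, seg.end),
              key=lambda i: abs(frames[i] - frames[i - 1]))
    return [Segment(seg.start, cut), Segment(cut, seg.end)]


def dedupe(indices: Sequence[int], frames: Sequence[float], tol: float) -> List[int]:
    """Keep a frame only if it differs by more than tol from every frame already kept."""
    kept: List[int] = []
    for i in sorted(set(indices)):
        if all(abs(frames[i] - frames[j]) > tol for j in kept):
            kept.append(i)
    return kept


# Hypothetical evaluator signature: (frontier, keyframes) -> (segment index, confidence)
Evaluator = Callable[[List[Segment], List[int]], Tuple[int, int]]


def search_keyframes(frames: Sequence[float], evaluate: Evaluator,
                     max_frames: int = 8, min_confidence: int = 3,
                     tol: float = 0.5) -> List[int]:
    frontier = [Segment(0, len(frames))]  # root of the segment tree
    keyframes: List[int] = []
    while frontier:
        pick, confidence = evaluate(frontier, keyframes)
        # Stop when the evaluator is confident enough or the frame limit is hit.
        if confidence >= min_confidence or len(keyframes) >= max_frames:
            break
        children = split_at_change(frontier.pop(pick), frames)
        frontier.extend(children)
        keyframes = dedupe(keyframes + [c.start for c in children], frames, tol)
    return keyframes[:max_frames]


def longest_first_stub(frontier: List[Segment], keyframes: List[int]) -> Tuple[int, int]:
    """Stand-in for the VLM: expand the longest segment, confident after 3 keyframes."""
    pick = max(range(len(frontier)),
               key=lambda i: frontier[i].end - frontier[i].start)
    return pick, (3 if len(keyframes) >= 3 else 1)


toy_frames = [0.0, 0.0, 0.0, 5.0, 5.0, 5.0, 9.0, 9.0]
print(search_keyframes(toy_frames, longest_first_stub))  # [0, 3, 6]
```

In this toy run the loop splits the video at its two largest visual changes and stops as soon as the stub reports a confidence of 3. In the real system, segment evaluation is done by the VLM rather than a length heuristic.
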
Four search strategies are available:

- **A\*** (default): balances goal-relevance and visual state changes
- **BFS**: broad exploration, can select multiple segments per step
- **GBFS**: greedy best-first, focuses on goal-critical actions
- **Dijkstra**: focuses on maximum visual state transitions

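
One way to read the four strategies is as different priority functions over the same candidate segments. The sketch below is an illustration only and assumes each candidate carries two hypothetical scores, `relevance` (goal-relevance) and `change` (visual state change). The scoring rules and the names `Candidate`, `score`, and `select` are assumptions borrowed from the textbook versions of these searches, not formulas taken from the paper.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    name: str
    relevance: float  # hypothetical goal-relevance score in [0, 1]
    change: float     # hypothetical visual-change score in [0, 1]


def score(c: Candidate, strategy: str) -> float:
    """Assumed priority of a candidate under a single-pick strategy."""
    if strategy == "astar":
        return c.relevance + c.change  # balance both signals
    if strategy == "gbfs":
        return c.relevance             # goal-relevance only
    if strategy == "dijkstra":
        return c.change                # visual state change only
    raise ValueError(f"unknown strategy: {strategy}")


def select(candidates: List[Candidate], strategy: str, bfs_width: int = 2) -> List[str]:
    """Return the names of the segments to expand in this step."""
    if strategy == "bfs":
        # BFS keeps frontier order and may take several segments per step.
        return [c.name for c in candidates[:bfs_width]]
    best = max(candidates, key=lambda c: score(c, strategy))
    return [best.name]


candidates = [
    Candidate("intro", 0.2, 0.1),
    Candidate("compose", 0.9, 0.3),
    Candidate("attach", 0.6, 0.8),
    Candidate("scene_cut", 0.1, 0.9),
]
for strategy in ("astar", "bfs", "gbfs", "dijkstra"):
    print(strategy, select(candidates, strategy))
```

With these made-up scores, A\* picks `attach`, GBFS picks `compose`, Dijkstra picks `scene_cut`, and BFS takes the first two segments in frontier order.
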
### Usage

1. Upload a video file.
2. Enter a task query (e.g., "How to send an email with an attachment?").
3. Select a search strategy.
4. Click "Extract Keyframes".

The app returns a gallery of keyframes, each labeled with its timestamp and frame index.

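
A frame index and a timestamp are related through the video's frame rate. As a small illustration, the helper below converts one to the other. The function name `frame_to_timestamp`, the `MM:SS.mmm` format, and the assumption of a constant frame rate are all hypothetical, and the Space may label its keyframes differently.

```python
def frame_to_timestamp(frame_index: int, fps: float) -> str:
    """Convert a frame index to an MM:SS.mmm label, assuming a constant frame rate."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    total_ms = round(frame_index / fps * 1000)
    minutes, rem = divmod(total_ms, 60_000)
    seconds, ms = divmod(rem, 1000)
    return f"{minutes:02d}:{seconds:02d}.{ms:03d}"


print(frame_to_timestamp(450, 30.0))   # 00:15.000
print(frame_to_timestamp(1831, 30.0))  # 01:01.033
```
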
### Model

This Space uses [Qwen/Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) as the VLM for segment evaluation, running on ZeroGPU.