---
title: TASKER Keyframe Extractor
emoji: 🔍
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: 6.15.1
app_file: app.py
short_description: VLM-guided tree-search keyframe extraction from videos
python_version: "3.12"
startup_duration_timeout: 30m
---

## TASKER Keyframe Extractor

This Space demonstrates **TASKER** (**T**ask-driven **a**nd **S**cene-aware **Ke**yframe sea**r**cher), a keyframe extraction algorithm from the ECCV 2026 paper [Bridging VideoQA and Video-Guided Agentic Tasks via Generalized Keyframe Extraction](https://arxiv.org/abs/2606.29445).

### How it works

TASKER reformulates keyframe extraction as a **generalized graph-search problem**:

1. The input video is split into segments organized as a search tree.
2. A Vision-Language Model (Qwen2.5-VL-7B) evaluates which segments are likely to contain crucial missing actions.
3. The selected segments are expanded by splitting them at visual change points.
4. Visual deduplication filters out near-identical frames.
5. The search terminates when the VLM's confidence reaches 3 or higher, or when the frame limit is reached.

Four search strategies are available:

- **A\*** (default): balances goal relevance and visual state changes
- **BFS**: breadth-first; explores broadly and can select multiple segments per step
- **GBFS**: greedy best-first; focuses on goal-critical actions
- **Dijkstra**: prioritizes the largest visual state transitions

### Usage

1. Upload a video file.
2. Enter a task query (e.g., "How to send an email with an attachment?").
3. Select a search strategy.
4. Click "Extract Keyframes".

The app returns a gallery of keyframes along with their timestamps and frame indices.

### Model

The Space uses [Qwen/Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) as the VLM for segment evaluation and runs on ZeroGPU.
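Steps 3 and 4 of *How it works*, splitting at visual change points and visual deduplication, both rely on some frame-difference measure. The sketch below assumes decoded frames are NumPy `uint8` arrays. It uses mean absolute pixel difference with a hypothetical `threshold` of 0.05. The Space's actual measure is not documented here, so this is a swapped-in technique for illustration only.

```python
import numpy as np


def frame_diff(a: np.ndarray, b: np.ndarray) -> float:
    """Mean absolute pixel difference, scaled to [0, 1] for uint8 frames."""
    return float(np.mean(np.abs(a.astype(np.float32) - b.astype(np.float32)))) / 255.0


def change_point(frames: list[np.ndarray], start: int, end: int) -> int:
    """Split index in (start, end): the frame with the largest jump from its predecessor."""
    if end - start < 2:
        raise ValueError("segment needs at least two frames to split")
    diffs = [frame_diff(frames[i - 1], frames[i]) for i in range(start + 1, end)]
    return start + 1 + int(np.argmax(diffs))


def deduplicate(frames: list[np.ndarray], indices: list[int], threshold: float = 0.05) -> list[int]:
    """Keep an index only if its frame differs from every kept frame by more than threshold."""
    kept: list[int] = []
    for i in indices:
        if all(frame_diff(frames[k], frames[i]) > threshold for k in kept):
            kept.append(i)
    return kept
```

For three dark frames followed by three bright ones, `change_point(frames, 0, 6)` returns 3, so the segment would split into `[0, 3)` and `[3, 6)`. `deduplicate(frames, list(range(6)))` returns `[0, 3]`. Comparing against every kept frame, rather than only the previous one, also removes a frame that reverts to an earlier look.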