metadata

title: TASKER Keyframe Extractor
emoji: 🔍
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: 6.15.1
app_file: app.py
short_description: VLM-guided tree-search keyframe extraction from videos
python_version: '3.12'
startup_duration_timeout: 30m

TASKER Keyframe Extractor

This Space demonstrates TASKER (Task-driven and Scene-aware Keyframe searcher), a keyframe extraction algorithm from the ECCV 2026 paper "Bridging VideoQA and Video-Guided Agentic Tasks via Generalized Keyframe Extraction".

How it works

TASKER reformulates keyframe extraction as a generalized graph-search problem:

1. The input video is segmented into a tree of segments.
2. A Vision-Language Model (Qwen2.5-VL-7B) evaluates which segments likely contain crucial missing actions.
3. The selected segments are expanded (split at visual change points).
4. Visual deduplication filters near-identical frames.
5. The search terminates when the VLM is confident enough (confidence ≥ 3) or a frame limit is reached.
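The loop described above can be sketched roughly as follows. This is a minimal illustration and not the Space's actual code: `Segment`, `split_at_change`, `is_duplicate`, and `search_keyframes` are hypothetical names, the VLM is replaced by a caller-supplied `evaluate` callback, and taking each child segment's first frame as its keyframe is an assumption made only to keep the example small.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Segment:
    start: int  # first frame index (inclusive)
    end: int    # last frame index (exclusive)


def split_at_change(seg: Segment, change_score: Callable[[int], float]) -> List[Segment]:
    """Split a segment at the interior frame with the largest visual change."""
    if seg.end - seg.start < 2:
        return [seg]  # too short to split
    cut = max(range(seg.start + 1, seg.end), key=change_score)
    return [Segment(seg.start, cut), Segment(cut, seg.end)]


def is_duplicate(frame: int, kept: Sequence[int],
                 distance: Callable[[int, int], float], tol: float) -> bool:
    """A frame is a near-duplicate if it is within `tol` of any kept frame."""
    return any(distance(frame, k) <= tol for k in kept)


def search_keyframes(
    num_frames: int,
    evaluate: Callable[[List[Segment], List[int]], Tuple[Segment, int]],
    change_score: Callable[[int], float],
    distance: Callable[[int, int], float],
    max_frames: int = 8,
    min_confidence: int = 3,
    dedup_tol: float = 0.0,
) -> List[int]:
    """Expand segments until the evaluator is confident or the frame limit is hit.

    `evaluate` stands in for the VLM (an assumption): it returns the segment
    to expand next and a confidence score for the keyframes found so far.
    """
    frontier = [Segment(0, num_frames)]  # root of the segment tree
    keyframes: List[int] = []
    while frontier and len(keyframes) < max_frames:
        chosen, confidence = evaluate(frontier, keyframes)
        if confidence >= min_confidence:
            break  # evaluator is confident enough
        frontier.remove(chosen)
        children = split_at_change(chosen, change_score)
        for child in children:
            # Assumption: each child's first frame is its candidate keyframe.
            if len(keyframes) < max_frames and not is_duplicate(
                child.start, keyframes, distance, dedup_tol
            ):
                keyframes.append(child.start)
        if len(children) > 1:
            frontier.extend(children)
    return sorted(keyframes)
```

In a real pipeline `evaluate` would prompt the VLM with frames from each candidate segment, and `distance` would compare image features rather than frame indices.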

Four search strategies are available:

- A* (default): balances goal-relevance and visual state changes
- BFS: broad exploration, can select multiple segments per step
- GBFS: greedy best-first, focuses on goal-critical actions
- Dijkstra: focuses on maximum visual state transitions
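One way to picture the difference between the four strategies is as different selection rules over the same two per-segment scores. The sketch below is an assumption based only on the descriptions above: `Candidate`, `select_segments`, the score ranges, the equal weighting in the A* case, and the BFS threshold are all hypothetical and not taken from the TASKER implementation.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    name: str
    relevance: float  # hypothetical goal-relevance score in [0, 1]
    change: float     # hypothetical visual state-change score in [0, 1]


def select_segments(candidates: List[Candidate], strategy: str = "astar",
                    bfs_threshold: float = 0.5) -> List[Candidate]:
    """Pick the segment(s) to expand next under a given strategy."""
    if not candidates:
        return []
    if strategy == "bfs":
        # Broad exploration: may return several segments in one step.
        return [c for c in candidates if c.relevance >= bfs_threshold]
    score = {
        "astar": lambda c: c.relevance + c.change,  # balance both signals
        "gbfs": lambda c: c.relevance,              # goal relevance only
        "dijkstra": lambda c: c.change,             # visual change only
    }
    if strategy not in score:
        raise ValueError(f"unknown strategy: {strategy!r}")
    return [max(candidates, key=score[strategy])]


candidates = [
    Candidate("intro", relevance=0.9, change=0.1),
    Candidate("compose", relevance=0.6, change=0.7),
    Candidate("transition", relevance=0.2, change=0.95),
]
for name in ("astar", "bfs", "gbfs", "dijkstra"):
    print(name, [c.name for c in select_segments(candidates, name)])
```

With these toy scores, each single-pick strategy chooses a different segment, while BFS returns every segment above the threshold.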

Usage

1. Upload a video file.
2. Enter a task query (e.g., "How to send an email with an attachment?").
3. Select a search strategy.
4. Click "Extract Keyframes".

The Space returns a gallery of keyframes, each labeled with its timestamp and frame index.
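A frame index and a timestamp are related through the video's frame rate. The helper below is a small sketch of that conversion, assuming a constant frame rate; `format_timestamp`, `caption`, and the caption layout are hypothetical and may differ from what the Space displays.

```python
def format_timestamp(frame_index: int, fps: float) -> str:
    """Convert a frame index to MM:SS.mmm, assuming a constant frame rate."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    total_ms = round(frame_index / fps * 1000)
    minutes, rest_ms = divmod(total_ms, 60_000)
    seconds, millis = divmod(rest_ms, 1000)
    return f"{minutes:02d}:{seconds:02d}.{millis:03d}"


def caption(frame_index: int, fps: float) -> str:
    """Build a gallery caption (hypothetical layout) for one keyframe."""
    return f"frame {frame_index} @ {format_timestamp(frame_index, fps)}"


print(caption(450, 30.0))  # frame 450 @ 00:15.000
```

Videos with a variable frame rate need per-frame presentation timestamps from the decoder instead of this division.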

Model

This Space uses Qwen/Qwen2.5-VL-7B-Instruct as the VLM for segment evaluation and runs on ZeroGPU.