	---
	title: TASKER Keyframe Extractor
	emoji: 🔍
	colorFrom: blue
	colorTo: indigo
	sdk: gradio
	sdk_version: 6.15.1
	app_file: app.py
	short_description: VLM-guided tree-search keyframe extraction from videos
	python_version: "3.12"
	startup_duration_timeout: 30m
	---

	## TASKER Keyframe Extractor

	This Space demonstrates TASKER (Task-driven and Scene-aware Keyframe searcher), a keyframe extraction algorithm from the ECCV 2026 paper [Bridging VideoQA and Video-Guided Agentic Tasks via Generalized Keyframe Extraction](https://arxiv.org/abs/2606.29445).

	### How it works

	TASKER reformulates keyframe extraction as a generalized graph-search problem:

	1. The input video is segmented into a tree of segments.
	2. A Vision-Language Model (Qwen2.5-VL-7B) evaluates which segments likely contain crucial missing actions.
	3. The selected segments are expanded (split at visual change points).
	4. Visual deduplication filters near-identical frames.
	5. The search terminates when the VLM is confident enough (confidence ≥ 3) or a frame limit is reached.
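
The five steps above can be sketched as a small search loop. This is a minimal illustration, not the Space's actual implementation: `Segment`, `split_at_change`, `search_keyframes`, and the callbacks `score_segment`, `confidence`, and `change_point` are hypothetical names, the VLM is replaced by plain Python callables, and visual deduplication is reduced to skipping a frame index that was already selected.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Segment:
    start: int  # first frame index (inclusive)
    end: int    # last frame index (exclusive)


def split_at_change(seg: Segment, change_point: Callable[[Segment], int]) -> List[Segment]:
    """Split a segment in two at a visual change point (midpoint as fallback)."""
    cut = change_point(seg)
    if not (seg.start < cut < seg.end):
        cut = (seg.start + seg.end) // 2
    return [Segment(seg.start, cut), Segment(cut, seg.end)]


def search_keyframes(
    num_frames: int,
    score_segment: Callable[[Segment], float],  # stand-in for the VLM's segment rating
    confidence: Callable[[List[int]], int],     # stand-in for the VLM's confidence
    change_point: Callable[[Segment], int],     # stand-in for change-point detection
    max_frames: int = 8,                        # assumed frame limit
    min_conf: int = 3,                          # threshold named in the README
    fanout: int = 4,                            # assumed number of initial segments
) -> List[int]:
    # Step 1: cut the video into coarse top-level segments.
    step = max(1, num_frames // fanout)
    frontier = [Segment(s, min(s + step, num_frames)) for s in range(0, num_frames, step)]
    keyframes: List[int] = []

    while frontier and len(keyframes) < max_frames:
        # Step 2: pick the segment the scorer rates highest.
        best = max(frontier, key=score_segment)
        frontier.remove(best)

        # Step 4 (simplified): keep one representative frame, skipping duplicates.
        frame = (best.start + best.end) // 2
        if frame not in keyframes:
            keyframes.append(frame)

        # Step 5: stop once the confidence threshold is reached.
        if confidence(keyframes) >= min_conf:
            break

        # Step 3: expand the chosen segment into two children.
        if best.end - best.start > 1:
            frontier.extend(split_at_change(best, change_point))

    return sorted(keyframes)
```

In the real pipeline the three callbacks would be backed by the VLM and a visual change detector; here they only make the control flow easy to follow.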

	Four search strategies are available:
	- A\* (default): balances goal-relevance and visual state changes
	- BFS: broad exploration, can select multiple segments per step
	- GBFS: greedy best-first, focuses on goal-critical actions
	- Dijkstra: focuses on maximum visual state transitions
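
One way to picture how the four strategies differ is as different ranking rules over the same two per-segment signals. The sketch below is an assumption made for illustration: the README does not give the actual cost functions, so the `priority` and `select` helpers, the simple sum used for A\*, and the `bfs_width` parameter are all hypothetical.

```python
from typing import Dict, List, Tuple


def priority(strategy: str, goal: float, change: float) -> float:
    """Rank a segment from its goal-relevance and visual-change scores (assumed formulas)."""
    if strategy == "astar":
        return goal + change   # balances both signals
    if strategy == "gbfs":
        return goal            # goal-relevance only
    if strategy == "dijkstra":
        return change          # visual state transitions only
    if strategy == "bfs":
        return goal            # same ranking as GBFS, but several segments are kept
    raise ValueError(f"unknown strategy: {strategy}")


def select(
    strategy: str,
    scores: Dict[str, Tuple[float, float]],  # segment name -> (goal, change)
    bfs_width: int = 2,                      # assumed number of segments BFS keeps per step
) -> List[str]:
    """Return the segment(s) to expand next under the given strategy."""
    ranked = sorted(scores, key=lambda name: priority(strategy, *scores[name]), reverse=True)
    return ranked[:bfs_width] if strategy == "bfs" else ranked[:1]


scores = {"a": (0.9, 0.1), "b": (0.5, 0.8), "c": (0.2, 0.9)}
for name in ("astar", "bfs", "gbfs", "dijkstra"):
    print(name, select(name, scores))
```

With these toy scores, each single-segment strategy picks a different segment, while BFS keeps the two highest-ranked ones.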

	### Usage

	1. Upload a video file
	2. Enter a task query (e.g., "How to send an email with an attachment?")
	3. Select a search strategy
	4. Click "Extract Keyframes"

	The app returns a gallery of keyframes, each labeled with its timestamp and frame index.

	### Model

	The Space uses [Qwen/Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) as the VLM for segment evaluation and runs on ZeroGPU.