metadata
title: TASKER Keyframe Extractor
emoji: 🔍
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: 6.15.1
app_file: app.py
short_description: VLM-guided tree-search keyframe extraction from videos
python_version: '3.12'
startup_duration_timeout: 30m

TASKER Keyframe Extractor

This Space demonstrates TASKER (Task-driven and Scene-aware Keyframe searcher), a keyframe extraction algorithm from the ECCV 2026 paper "Bridging VideoQA and Video-Guided Agentic Tasks via Generalized Keyframe Extraction".

How it works

TASKER reformulates keyframe extraction as a generalized graph-search problem:

  1. The input video is segmented into a tree of segments.
  2. A Vision-Language Model (Qwen2.5-VL-7B) evaluates which segments likely contain crucial missing actions.
  3. The selected segments are expanded (split at visual change points).
  4. Visual deduplication filters near-identical frames.
  5. The search terminates when the VLM is confident enough (confidence ≥ 3) or a frame limit is reached.
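
The five steps above can be sketched as a small loop. This is only an illustration under stated assumptions, not the Space's actual code: `Segment`, `split`, `search_keyframes`, and both stub functions are hypothetical names, the VLM call is replaced by a stub that returns a segment index and a confidence, splitting at the change point nearest the segment midpoint is an assumed rule, and deduplication compares frame indices instead of image content.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    start: int  # first frame index, inclusive
    end: int    # last frame index, exclusive

    @property
    def midpoint(self) -> int:
        return (self.start + self.end) // 2


def split(segment, change_points):
    """Split at the change point nearest the midpoint (an assumed rule)."""
    inside = [c for c in change_points if segment.start < c < segment.end]
    if not inside:
        return [segment]  # leaf: no change point left inside this segment
    cut = min(inside, key=lambda c: abs(c - segment.midpoint))
    return [Segment(segment.start, cut), Segment(cut, segment.end)]


def search_keyframes(num_frames, change_points, evaluate, is_duplicate,
                     min_confidence=3, max_frames=8):
    """evaluate(frontier, keyframes) -> (index of segment to expand, confidence)."""
    frontier = [Segment(0, num_frames)]  # step 1: root of the segment tree
    keyframes = []
    while frontier and len(keyframes) < max_frames:  # step 5: frame limit
        chosen, confidence = evaluate(frontier, keyframes)  # step 2: VLM stand-in
        if confidence >= min_confidence:  # step 5: confident enough
            break
        children = split(frontier.pop(chosen), change_points)  # step 3: expand
        if len(children) > 1:
            frontier.extend(children)
        for child in children:
            if len(keyframes) >= max_frames:
                break
            frame = child.midpoint
            # step 4: skip candidates that duplicate an already kept frame
            if not any(is_duplicate(frame, kept) for kept in keyframes):
                keyframes.append(frame)
    return sorted(keyframes)


def stub_evaluate(frontier, keyframes):
    # Stand-in for the VLM: pick the longest segment, and let
    # confidence grow with the number of keyframes collected so far.
    longest = max(range(len(frontier)),
                  key=lambda i: frontier[i].end - frontier[i].start)
    return longest, len(keyframes)


def stub_is_duplicate(a, b):
    # Stand-in for visual deduplication: frames closer than 5 indices apart.
    return abs(a - b) < 5


print(search_keyframes(100, [20, 50, 80], stub_evaluate, stub_is_duplicate))
```

In a real pipeline, `evaluate` would prompt the VLM with the task query and sampled frames, and `is_duplicate` would compare image features; both are left as injectable callables here so the control flow can be read on its own.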

Four search strategies are available:

  • A* (default): balances goal-relevance and visual state changes
  • BFS: broad exploration, can select multiple segments per step
  • GBFS: greedy best-first, focuses on goal-critical actions
  • Dijkstra: focuses on maximum visual state transitions
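
One way to read these four strategies is as different orderings over the same candidate segments. The sketch below is an assumption about how they could be mapped onto classic graph-search priorities, not the paper's definition: `goal_relevance` and `state_change` are hypothetical scores in [0, 1], A* sums them, GBFS uses relevance alone, Dijkstra uses state change alone, and BFS orders by tree depth. It does not model BFS selecting several segments in one step.

```python
def rank_segments(strategy, candidates):
    """Return candidate names in expansion order for a given strategy.

    Each candidate is a dict with the hypothetical keys
    'name', 'depth', 'goal_relevance', and 'state_change'.
    """
    keys = {
        # Higher combined score first (negated because sorted() is ascending).
        "A*": lambda c: -(c["goal_relevance"] + c["state_change"]),
        # Goal relevance only.
        "GBFS": lambda c: -c["goal_relevance"],
        # Visual state change only.
        "Dijkstra": lambda c: -c["state_change"],
        # Shallowest segments first; ties keep input order (stable sort).
        "BFS": lambda c: c["depth"],
    }
    if strategy not in keys:
        raise ValueError(f"unknown strategy: {strategy!r}")
    return [c["name"] for c in sorted(candidates, key=keys[strategy])]


candidates = [
    {"name": "A", "depth": 1, "goal_relevance": 0.9, "state_change": 0.1},
    {"name": "B", "depth": 2, "goal_relevance": 0.2, "state_change": 0.9},
    {"name": "C", "depth": 1, "goal_relevance": 0.6, "state_change": 0.6},
]

for strategy in ("A*", "BFS", "GBFS", "Dijkstra"):
    print(strategy, rank_segments(strategy, candidates))
```

With these made-up scores, A* expands the balanced segment C first, GBFS prefers the most goal-relevant segment A, and Dijkstra prefers B, the segment with the largest visual change.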

Usage

  1. Upload a video file
  2. Enter a task query (e.g., "How to send an email with an attachment?")
  3. Select a search strategy
  4. Click "Extract Keyframes"

The Space returns a gallery of keyframes, each labeled with its timestamp and frame index.
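
Timestamps and frame indices are related through the video's frame rate. The helper below is a minimal sketch of that conversion, assuming a constant frame rate; `format_timestamp` and `caption` are hypothetical names, and the caption format is illustrative, not necessarily the one the Space displays.

```python
def format_timestamp(frame_index: int, fps: float) -> str:
    """Convert a frame index to an MM:SS.mmm timestamp at a constant frame rate."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    total_ms = round(frame_index / fps * 1000)
    minutes, remainder_ms = divmod(total_ms, 60_000)
    seconds, millis = divmod(remainder_ms, 1000)
    return f"{minutes:02d}:{seconds:02d}.{millis:03d}"


def caption(frame_index: int, fps: float) -> str:
    """Build a gallery caption that carries both the index and the timestamp."""
    return f"frame {frame_index} @ {format_timestamp(frame_index, fps)}"


print(caption(450, 30.0))
```

For variable-frame-rate videos this conversion is only approximate; reading the presentation timestamp from the decoder would be more reliable.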

Model

The Space uses Qwen/Qwen2.5-VL-7B-Instruct as the VLM for segment evaluation and runs on ZeroGPU.