HORNet: Task-Guided Frame Selection for Video Question Answering

This repository contains the checkpoints for HORNet, introduced in the paper HORNet: Task-Guided Frame Selection for Video Question Answering with Vision-Language Models.

GitHub

HORNet is a reinforcement-learning-based frame-selection system for efficient video understanding. It uses Group Relative Policy Optimization (GRPO) to train a policy network that selects the most informative frames from a video, achieving up to 99% frame reduction and up to 93% faster VLM processing with competitive accuracy on video question answering tasks.

Model Description

HORNet learns to select a subset of frames from each video that maximizes downstream task performance. The system consists of:

  • Frame Selector: A lightweight policy network (<1M parameters) trained with GRPO that outputs frame selection probabilities.
  • Video Encoder: TimeSformer encoder for extracting spatio-temporal visual features.
  • VLM Integration: Designed to be paired with Vision-Language Models like Qwen3-VL.
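
For intuition on the "<1M parameters" claim for the frame selector, here is a back-of-the-envelope count for a hypothetical two-layer MLP head over the 768-dimensional TimeSformer features (768 matches `feat_dim` in the Quick Start below; the hidden width of 256 is an assumption for illustration, not the repository's actual architecture):

```python
# Hypothetical sketch: parameter count of a small per-frame MLP selection
# head. The real selector architecture is defined in the repository; this
# only illustrates that a 768 -> 256 -> 1 head fits well under 1M params.

def linear_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out

feat_dim, hidden = 768, 256  # hidden width is an assumption
head_params = linear_params(feat_dim, hidden) + linear_params(hidden, 1)
print(head_params)  # 197121 -- well under the stated 1M budget
```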

Key Features

  • Massive Efficiency Gains: Reduces frame count by up to 99% (from 3000+ frames to 8) and VLM processing time by 64-93% (see the tables below).
  • Competitive Accuracy: Maintains or slightly improves accuracy on short videos (MSVD: +1.7%), with optimized performance on temporal reasoning tasks.
  • GRPO Training: Group-based policy optimization for stable RL training, outperforming standard SFT and competitive with PPO.
  • Parameter Efficient: Only trains the frame selector while keeping the VLM frozen.
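
As a rough illustration of the GRPO idea (a minimal sketch, not the repository's training code): for each question, a group of candidate frame selections is sampled, each is scored with a task reward (e.g. QA correctness), and advantages are normalized within the group, so no learned value critic is needed:

```python
# Group-relative advantage computation, the core of GRPO. Rewards for a
# group of rollouts on the same input are standardized against the group
# mean and (population) standard deviation.
from statistics import mean, pstdev

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Advantage of each rollout relative to its own group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + 1e-8) for r in rewards]

# e.g. binary QA rewards for 4 candidate frame subsets of the same video:
# above-average subsets get positive advantage, below-average negative.
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
```

Because the baseline is the group's own mean reward, frame selections are reinforced only insofar as they beat the other candidates for the same question, which is what makes the training stable without a critic network.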

Available Checkpoints

| Checkpoint | Training Scheme | Use Case | Size |
|---|---|---|---|
| checkpoints/long/checkpoint-0.1500.pt | Long videos (256 frames) | Main model for general evaluation | 224 MB |
| checkpoints/long/checkpoint-0.250.pt | Long videos | Multiple-choice QA evaluation | 224 MB |
| checkpoints/short/checkpoint-0.550.pt | Short videos (32 frames) | MSVD-QA, VLM ablation | 224 MB |

Quick Start

The following snippet shows how to initialize the HORnet policy with a VLM. Note that this requires the VisionGRPOPolicy class from the official repository.

import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

# Load the VLM backbone (e.g., Qwen3-VL); the Auto class resolves the
# correct architecture for the checkpoint
qwen_model = AutoModelForImageTextToText.from_pretrained(
    "Qwen/Qwen3-VL-2B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
qwen_processor = AutoProcessor.from_pretrained("Qwen/Qwen3-VL-2B-Instruct")

# Load HORnet frame selector
from model import VisionGRPOPolicy

model = VisionGRPOPolicy(
    encoder_name=None,  # Uses TimeSformer
    feat_dim=768,
    action_dim=1,
    qwen_model=qwen_model,
    qwen_processor=qwen_processor
).to("cuda")

# Load checkpoint (strict=False: the checkpoint holds the trained
# frame-selector weights, not the frozen VLM)
checkpoint = torch.load("checkpoints/long/checkpoint-0.1500.pt", map_location="cpu")
model.load_state_dict(checkpoint, strict=False)
model.eval()
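
The inference API of `VisionGRPOPolicy` is defined in the official repository; the sketch below only assumes the policy yields one selection probability per frame, from which the top-k frames (k=8 for long videos, matching the numbers in the tables below) are kept in temporal order:

```python
# Hedged sketch of the selection step: given per-frame selection
# probabilities from the policy, keep the k most informative frames and
# return their indices in temporal order for the VLM to consume.

def select_top_k_frames(frame_probs: list[float], k: int = 8) -> list[int]:
    """Indices of the k highest-probability frames, in temporal order."""
    ranked = sorted(range(len(frame_probs)),
                    key=lambda i: frame_probs[i], reverse=True)
    return sorted(ranked[:k])

# e.g. a 10-frame clip scored by the policy
probs = [0.1, 0.9, 0.05, 0.8, 0.7, 0.2, 0.95, 0.3, 0.6, 0.4]
print(select_top_k_frames(probs, k=4))  # -> [1, 3, 4, 6]
```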

Performance

Open-Ended QA Results

| Dataset | Model | F1-Lev ↑ | Qwen Proc. (s) ↓ | Avg. Frames ↓ |
|---|---|---|---|---|
| MSVD | Qwen3-VL-2B (Baseline) | 0.3483 | 0.28 | 11.65 |
| MSVD | HORNet+Qwen3-VL-2B | 0.3543 (+1.7%) | 0.10 (↓64%) | 4.00 (↓66%) |
| MSRVTT | Qwen3-VL-2B (Baseline) | 0.3209 | 0.58 | 47.52 |
| MSRVTT | HORNet+Qwen3-VL-2B | 0.3029 (-5.6%) | 0.09 (↓84%) | 4.00 (↓92%) |

Multiple-Choice QA Results

| Dataset | Model | Accuracy (%) ↑ | Qwen Proc. (s) ↓ | Avg. Frames ↓ |
|---|---|---|---|---|
| VideoMME | Qwen3-VL-2B (Baseline) | 68.30 | 2.53 | 3066.73 |
| VideoMME | HORNet+Qwen3-VL-2B | 52.10 (-16.2 pts) | 0.18 (↓93%) | 8.00 (↓99%) |
| ActivityNetQA | Qwen3-VL-2B (Baseline) | 75.00 | 2.37 | 3152.49 |
| ActivityNetQA | HORNet+Qwen3-VL-2B | 68.80 (-6.2 pts) | 0.17 (↓93%) | 8.00 (↓99%) |
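
The bracketed reductions follow directly from the raw columns; a quick illustrative check (not from the repository) using the VideoMME processing time and the MSRVTT frame count from the tables above:

```python
# How the percentage reductions in the tables are derived from the raw
# before/after columns.

def reduction_pct(before: float, after: float) -> int:
    """Percent reduction, rounded to the nearest integer."""
    return round(100 * (before - after) / before)

print(reduction_pct(2.53, 0.18))   # 93: VideoMME Qwen processing time
print(reduction_pct(47.52, 4.00))  # 92: MSRVTT average frames
```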

Training Data

HORNet was trained on a combination of three video QA datasets, totaling 341,877 QA pairs across 17,350 videos:

  • MSVD-QA: Short videos (avg 9.6s)
  • MSRVTT-QA: Short videos (avg 15.5s)
  • NExT-QA: Long videos (avg 43.7s)

Citation

@article{bai2026hornet,
  title={HORNet: Task-Guided Frame Selection for Video Question Answering with Vision-Language Models},
  author={Bai, Xiangyu and Galoaa, Bishoy and Ostadabbas, Sarah},
  journal={arXiv preprint arXiv:2603.18850},
  year={2026}
}