HORNet: Task-Guided Frame Selection for Video Question Answering
This repository contains the checkpoints for HORNet, introduced in the paper HORNet: Task-Guided Frame Selection for Video Question Answering with Vision-Language Models.
HORNet is a reinforcement learning-based frame selection system for efficient video understanding. It uses Group Relative Policy Optimization (GRPO) to train a policy network that selects the most informative frames from videos, achieving up to 99% frame reduction and up to 93% faster processing with competitive accuracy on video question answering tasks.
Model Description
HORNet learns to select a subset of frames from each video that maximizes downstream task performance. The system consists of:
- Frame Selector: A lightweight policy network (<1M parameters) trained with GRPO that outputs frame selection probabilities.
- Video Encoder: TimeSformer encoder for extracting spatio-temporal visual features.
- VLM Integration: Designed to be paired with Vision-Language Models like Qwen3-VL.
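The selector head is small enough to sketch in a few lines. The class name, layer sizes, and sigmoid readout below are illustrative assumptions, not the released architecture; the sketch only shows the shape of the computation, mapping per-frame TimeSformer features (768-dim) to one selection probability per frame:

```python
import torch
import torch.nn as nn

class FrameSelectorHead(nn.Module):
    """Illustrative policy head: per-frame features -> selection probabilities."""

    def __init__(self, feat_dim: int = 768, hidden_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),  # one logit per frame
        )

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, num_frames, feat_dim)
        logits = self.net(frame_feats).squeeze(-1)  # (batch, num_frames)
        return torch.sigmoid(logits)                # per-frame selection probability

probs = FrameSelectorHead()(torch.randn(2, 32, 768))
print(probs.shape)  # torch.Size([2, 32])
```

A two-layer MLP of this size stays well under 1M parameters, which is what makes training the selector alone cheap while the VLM stays frozen.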
Key Features
- Massive Efficiency Gains: Reduces frames by up to 99% (from 3000+ to 8 frames) with 74-93% faster VLM processing.
- Competitive Accuracy: Maintains or slightly improves accuracy on short videos (MSVD: +1.7%), with optimized performance on temporal reasoning tasks.
- GRPO Training: Group-based policy optimization for stable RL training, outperforming standard SFT and competitive with PPO.
- Parameter Efficient: Only trains the frame selector while keeping the VLM frozen.
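The group-relative idea behind GRPO can be illustrated independently of the model: several candidate frame selections are sampled for the same question, each is scored by the downstream task reward, and each rollout's advantage is its reward normalized against the group's mean and standard deviation, with no learned value critic. The function below is a generic sketch of that normalization, not the repository's implementation:

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each rollout's reward against its sampling group's statistics."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four rollouts for one question: better frame selections get larger advantages,
# and the advantages sum to zero within the group.
print(group_relative_advantages([0.2, 0.5, 0.9, 0.4]))
```

Because the baseline is the group's own mean reward rather than a critic's estimate, this keeps training stable without adding trainable parameters beyond the policy itself.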
Available Checkpoints
| Checkpoint | Training Scheme | Use Case | Size |
|---|---|---|---|
| `checkpoints/long/checkpoint-0.1500.pt` | Long videos (256 frames) | Main model for general evaluation | 224MB |
| `checkpoints/long/checkpoint-0.250.pt` | Long videos | Multiple-choice QA evaluation | 224MB |
| `checkpoints/short/checkpoint-0.550.pt` | Short videos (32 frames) | MSVD-QA, VLM ablation | 224MB |
Quick Start
The following snippet shows how to initialize the HORNet policy with a VLM. Note that this requires the `VisionGRPOPolicy` class from the official repository.
```python
import torch
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor

# Load the VLM backbone (e.g., Qwen3-VL)
qwen_model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen3-VL-2B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
qwen_processor = AutoProcessor.from_pretrained("Qwen/Qwen3-VL-2B-Instruct")

# Load the HORNet frame selector (VisionGRPOPolicy comes from the official repo)
from model import VisionGRPOPolicy

model = VisionGRPOPolicy(
    encoder_name=None,  # defaults to the TimeSformer encoder
    feat_dim=768,
    action_dim=1,
    qwen_model=qwen_model,
    qwen_processor=qwen_processor,
).to("cuda")

# Load checkpoint weights
checkpoint = torch.load("checkpoints/long/checkpoint-0.1500.pt", map_location="cpu")
model.load_state_dict(checkpoint, strict=False)
model.eval()
```
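At inference time, the policy's per-frame probabilities are reduced to a small frame subset that is handed to the VLM. The exact selection API is defined by `VisionGRPOPolicy` in the official repository; the helper below is only a generic top-k sketch of that step, with all names and shapes chosen for illustration:

```python
import torch

def pick_top_k_frames(frames: torch.Tensor, probs: torch.Tensor, k: int = 8) -> torch.Tensor:
    """Keep the k highest-probability frames, preserving temporal order."""
    topk = torch.topk(probs, k=min(k, probs.numel())).indices
    keep = torch.sort(topk).values  # restore chronological order
    return frames[keep]

video = torch.randn(64, 3, 224, 224)  # toy stand-in for 64 decoded frames
probs = torch.rand(64)                # stand-in for the selector's output
subset = pick_top_k_frames(video, probs, k=8)
print(subset.shape)  # torch.Size([8, 3, 224, 224])
```

Re-sorting the top-k indices matters: the VLM should see the kept frames in their original temporal order, not ranked by score.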
Performance
Open-Ended QA Results
| Dataset | Model | F1-Lev ↑ | Qwen Proc. (s) ↓ | Avg. Frames ↓ |
|---|---|---|---|---|
| MSVD | Qwen3-VL-2B (Baseline) | 0.3483 | 0.28 | 11.65 |
| MSVD | HORNet+Qwen3-VL-2B | 0.3543 (+1.7%) | 0.10 (↓64%) | 4.00 (↓66%) |
| MSRVTT | Qwen3-VL-2B (Baseline) | 0.3209 | 0.58 | 47.52 |
| MSRVTT | HORNet+Qwen3-VL-2B | 0.3029 (-5.6%) | 0.09 (↓84%) | 4.00 (↓92%) |
Multiple-Choice QA Results
| Dataset | Model | Accuracy (%) ↑ | Qwen Proc. (s) ↓ | Avg. Frames ↓ |
|---|---|---|---|---|
| VideoMME | Qwen3-VL-2B (Baseline) | 68.30 | 2.53 | 3066.73 |
| VideoMME | HORNet+Qwen3-VL-2B | 52.10 (-16.2%) | 0.18 (↓93%) | 8.00 (↓99%) |
| ActivityNetQA | Qwen3-VL-2B (Baseline) | 75.00 | 2.37 | 3152.49 |
| ActivityNetQA | HORNet+Qwen3-VL-2B | 68.80 (-6.2%) | 0.17 (↓93%) | 8.00 (↓99%) |
Training Data
HORNet was trained on a combination of three video QA datasets, totaling 341,877 QA pairs across 17,350 videos:
- MSVD-QA: Short videos (avg 9.6s)
- MSRVTT-QA: Short videos (avg 15.5s)
- NExT-QA: Long videos (avg 43.7s)
Citation
```bibtex
@article{bai2026hornet,
  title={HORNet: Task-Guided Frame Selection for Video Question Answering with Vision-Language Models},
  author={Xiangyu Bai and Bishoy Galoaa and Sarah Ostadabbas},
  journal={arXiv preprint arXiv:2603.18850},
  year={2026}
}
```