HORNet: Task-Guided Frame Selection for Video Question Answering
This repository contains the checkpoints for HORNet, introduced in the paper HORNet: Task-Guided Frame Selection for Video Question Answering with Vision-Language Models.
HORNet is a reinforcement learning-based frame selection system for efficient video understanding. It uses Group Relative Policy Optimization (GRPO) to train a policy network that selects the most informative frames from videos, achieving up to 99% frame reduction and up to 93% faster processing with competitive accuracy on video question answering tasks.
Model Description
HORNet learns to select a subset of frames from each video that maximizes downstream task performance. The system consists of:
- Frame Selector: A lightweight policy network (<1M parameters) trained with GRPO that outputs frame selection probabilities.
- Video Encoder: TimeSformer encoder for extracting spatio-temporal visual features.
- VLM Integration: Designed to be paired with Vision-Language Models like Qwen3-VL.
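The selector head is small enough to sketch in a few lines. The class name, layer sizes, and sigmoid readout below are illustrative assumptions, not the released architecture; the sketch only shows the shape of the computation, mapping per-frame TimeSformer features (768-dim) to one selection probability per frame:

```python
import torch
import torch.nn as nn

class FrameSelectorHead(nn.Module):
    """Illustrative policy head: per-frame features -> selection probabilities."""

    def __init__(self, feat_dim: int = 768, hidden_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),  # one logit per frame
        )

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, num_frames, feat_dim)
        logits = self.net(frame_feats).squeeze(-1)  # (batch, num_frames)
        return torch.sigmoid(logits)                # per-frame selection probability

probs = FrameSelectorHead()(torch.randn(2, 32, 768))
print(probs.shape)  # torch.Size([2, 32])
```

A two-layer MLP of this size stays well under 1M parameters, which is what makes training the selector alone cheap while the VLM stays frozen.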
Key Features
- Massive Efficiency Gains: Reduces frames by up to 99% (from 3000+ to 8 frames) with 74-93% faster VLM processing.
- Competitive Accuracy: Maintains or slightly improves accuracy on short videos (MSVD: +1.7%), with optimized performance on temporal reasoning tasks.
- GRPO Training: Group-based policy optimization for stable RL training, outperforming standard SFT and competitive with PPO.
- Parameter Efficient: Only trains the frame selector while keeping the VLM frozen.
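The group-relative idea behind GRPO can be illustrated independently of the model: several candidate frame selections are sampled for the same question, each is scored by the downstream task reward, and each rollout's advantage is its reward normalized against the group's mean and standard deviation, with no learned value critic. The function below is a generic sketch of that normalization, not the repository's implementation:

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each rollout's reward against its sampling group's statistics."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four rollouts for one question: better frame selections get larger advantages,
# and the advantages sum to zero within the group.
print(group_relative_advantages([0.2, 0.5, 0.9, 0.4]))
```

Because the baseline is the group's own mean reward rather than a critic's estimate, this keeps training stable without adding trainable parameters beyond the policy itself.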
Available Checkpoints
| Checkpoint | Training Scheme | Use Case | Size |
|---|---|---|---|
| `checkpoints/long/checkpoint-0.1500.pt` | Long videos (256 frames) | Main model for general evaluation | 224MB |
| `checkpoints/long/checkpoint-0.250.pt` | Long videos | Multiple-choice QA evaluation | 224MB |
| `checkpoints/short/checkpoint-0.550.pt` | Short videos (32 frames) | MSVD-QA, VLM ablation | 224MB |
Quick Start
The following snippet shows how to initialize the HORNet policy with a VLM. Note that this requires the `VisionGRPOPolicy` class from the official repository.
```python
import torch
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor

# Load the VLM backbone (e.g., Qwen3-VL)
qwen_model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen3-VL-2B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
qwen_processor = AutoProcessor.from_pretrained("Qwen/Qwen3-VL-2B-Instruct")

# Load the HORNet frame selector (VisionGRPOPolicy comes from the official repo)
from model import VisionGRPOPolicy

model = VisionGRPOPolicy(
    encoder_name=None,  # defaults to the TimeSformer encoder
    feat_dim=768,
    action_dim=1,
    qwen_model=qwen_model,
    qwen_processor=qwen_processor,
).to("cuda")

# Load checkpoint weights
checkpoint = torch.load("checkpoints/long/checkpoint-0.1500.pt", map_location="cpu")
model.load_state_dict(checkpoint, strict=False)
model.eval()
```
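At inference time, the policy's per-frame probabilities are reduced to a small frame subset that is handed to the VLM. The exact selection API is defined by `VisionGRPOPolicy` in the official repository; the helper below is only a generic top-k sketch of that step, with all names and shapes chosen for illustration:

```python
import torch

def pick_top_k_frames(frames: torch.Tensor, probs: torch.Tensor, k: int = 8) -> torch.Tensor:
    """Keep the k highest-probability frames, preserving temporal order."""
    topk = torch.topk(probs, k=min(k, probs.numel())).indices
    keep = torch.sort(topk).values  # restore chronological order
    return frames[keep]

video = torch.randn(64, 3, 224, 224)  # toy stand-in for 64 decoded frames
probs = torch.rand(64)                # stand-in for the selector's output
subset = pick_top_k_frames(video, probs, k=8)
print(subset.shape)  # torch.Size([8, 3, 224, 224])
```

Re-sorting the top-k indices matters: the VLM should see the kept frames in their original temporal order, not ranked by score.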
Performance
Open-Ended QA Results
| Dataset | Model | F1-Lev ↑ | Qwen Proc. (s) ↓ | Avg. Frames ↓ |
|---|---|---|---|---|
| MSVD | Qwen3-VL-2B (Baseline) | 0.3483 | 0.28 | 11.65 |
| MSVD | HORNet+Qwen3-VL-2B | 0.3543 (+1.7%) | 0.10 (↓64%) | 4.00 (↓66%) |
| MSRVTT | Qwen3-VL-2B (Baseline) | 0.3209 | 0.58 | 47.52 |
| MSRVTT | HORNet+Qwen3-VL-2B | 0.3029 (-5.6%) | 0.09 (↓84%) | 4.00 (↓92%) |
Multiple-Choice QA Results
| Dataset | Model | Accuracy (%) ↑ | Qwen Proc. (s) ↓ | Avg. Frames ↓ |
|---|---|---|---|---|
| VideoMME | Qwen3-VL-2B (Baseline) | 68.30 | 2.53 | 3066.73 |
| VideoMME | HORNet+Qwen3-VL-2B | 52.10 (-16.2%) | 0.18 (↓93%) | 8.00 (↓99%) |
| ActivityNetQA | Qwen3-VL-2B (Baseline) | 75.00 | 2.37 | 3152.49 |
| ActivityNetQA | HORNet+Qwen3-VL-2B | 68.80 (-6.2%) | 0.17 (↓93%) | 8.00 (↓99%) |
Training Data
HORNet was trained on a combination of three video QA datasets, totaling 341,877 QA pairs across 17,350 videos:
- MSVD-QA: Short videos (avg 9.6s)
- MSRVTT-QA: Short videos (avg 15.5s)
- NExT-QA: Long videos (avg 43.7s)
Citation
```bibtex
@article{bai2026hornet,
  title={HORNet: Task-Guided Frame Selection for Video Question Answering with Vision-Language Models},
  author={Xiangyu Bai and Bishoy Galoaa and Sarah Ostadabbas},
  journal={arXiv preprint arXiv:2603.18850},
  year={2026}
}
```