VideoSearch-R1: Iterative Video Retrieval and Reasoning via Soft Query Refinement
Paper: 2607.00446
How to use VideoSearchR1/activitynet-stage2 with Transformers:
# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM
processor = AutoProcessor.from_pretrained("VideoSearchR1/activitynet-stage2")
model = AutoModelForMultimodalLM.from_pretrained("VideoSearchR1/activitynet-stage2")

This is the Stage 2 VideoSearch-R1 checkpoint trained for ActivityNet, presented in the paper VideoSearch-R1: Iterative Video Retrieval and Reasoning via Soft Query Refinement.
Stage 2 starts from the ActivityNet Stage 1 checkpoint and optimizes iterative retrieval and temporal grounding behavior with the VideoSearch-R1 training pipeline.
Use with the VideoSearch-R1 codebase:
bash scripts/data_construct/download_preextracted_data.bash activitynet
EVAL_GPUS=0 bash scripts/inference/inference.bash activitynet --checkpoint VideoSearchR1/activitynet-stage2
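The two shell commands above can also be driven from Python, which may be convenient inside a larger evaluation script. The following is a minimal sketch, assuming it is run from a checkout of the VideoSearch-R1 codebase so that the script paths resolve. The helper names (`build_download_command`, `build_inference_command`, `build_env`, `run_pipeline`) are hypothetical and not part of the codebase; only the script paths, the `--checkpoint` flag, and the `EVAL_GPUS` variable come from the commands shown above.

```python
import os
import subprocess

# Checkpoint identifier taken from the commands above.
REPO_ID = "VideoSearchR1/activitynet-stage2"


def build_download_command(dataset="activitynet"):
    """Argument list for the pre-extracted data download script."""
    return ["bash", "scripts/data_construct/download_preextracted_data.bash", dataset]


def build_inference_command(dataset="activitynet", checkpoint=REPO_ID):
    """Argument list for the inference script."""
    return ["bash", "scripts/inference/inference.bash", dataset,
            "--checkpoint", checkpoint]


def build_env(gpus="0", base_env=None):
    """Copy of the environment with EVAL_GPUS set, as in the shell command."""
    env = dict(os.environ if base_env is None else base_env)
    env["EVAL_GPUS"] = gpus
    return env


def run_pipeline(repo_root=".", gpus="0"):
    """Hypothetical helper: download the data, then run inference.

    repo_root is assumed to point at a local checkout of the
    VideoSearch-R1 codebase. check=True raises if a script fails.
    """
    subprocess.run(build_download_command(), cwd=repo_root, check=True)
    subprocess.run(build_inference_command(), cwd=repo_root,
                   env=build_env(gpus), check=True)
```

Calling `run_pipeline(repo_root="path/to/checkout", gpus="0")` would then mirror the two commands in order. Building the argument lists separately keeps them easy to inspect or log before anything is executed.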
@inproceedings{lee2026videosearchr1,
title = {VideoSearch-R1: Iterative Video Retrieval and Reasoning via Soft Query Refinement},
author = {Lee, Seohyun and Choi, Seoung and Ko, Dohwan and Kim, Jongha and Kim, Hyunwoo J.},
booktitle = {European Conference on Computer Vision (ECCV)},
year = {2026}
}
Base model: Qwen/Qwen3-VL-4B-Instruct