---
base_model: Qwen/Qwen3-VL-4B-Instruct
library_name: transformers
license: apache-2.0
pipeline_tag: video-text-to-text
tags:
- video-retrieval
- temporal-grounding
- videosearch-r1
---

# VideoSearch-R1 ActivityNet Stage 2

This is the Stage 2 VideoSearch-R1 checkpoint trained for ActivityNet, presented in the paper [VideoSearch-R1: Iterative Video Retrieval and Reasoning via Soft Query Refinement](https://huggingface.co/papers/2607.00446).

- **Project Page:** [https://mlvlab.github.io/VideoSearch-R1/](https://mlvlab.github.io/VideoSearch-R1/)
- **Repository:** [https://github.com/mlvlab/VideoSearch-R1](https://github.com/mlvlab/VideoSearch-R1)

Stage 2 starts from the ActivityNet Stage 1 checkpoint and optimizes iterative retrieval and temporal grounding behavior with the VideoSearch-R1 training pipeline.

## Usage

Use with the VideoSearch-R1 codebase:

```bash
bash scripts/data_construct/download_preextracted_data.bash activitynet
EVAL_GPUS=0 bash scripts/inference/inference.bash activitynet --checkpoint VideoSearchR1/activitynet-stage2
```

## Citation

```bibtex
@inproceedings{lee2026videosearchr1,
  title     = {VideoSearch-R1: Iterative Video Retrieval and Reasoning via Soft Query Refinement},
  author    = {Lee, Seohyun and Choi, Seoung and Ko, Dohwan and Kim, Jongha and Kim, Hyunwoo J.},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year      = {2026}
}
```