---
base_model: Qwen/Qwen3-VL-4B-Instruct
library_name: transformers
license: apache-2.0
pipeline_tag: video-text-to-text
tags:
- video-retrieval
- temporal-grounding
- videosearch-r1
---

# VideoSearch-R1 ActivityNet Stage 2

This is the Stage 2 VideoSearch-R1 checkpoint trained for ActivityNet, presented in the paper [VideoSearch-R1: Iterative Video Retrieval and Reasoning via Soft Query Refinement](https://huggingface.co/papers/2607.00446).

- **Project Page:** [https://mlvlab.github.io/VideoSearch-R1/](https://mlvlab.github.io/VideoSearch-R1/)
- **Repository:** [https://github.com/mlvlab/VideoSearch-R1](https://github.com/mlvlab/VideoSearch-R1)

Stage 2 starts from the ActivityNet Stage 1 checkpoint and optimizes iterative retrieval and temporal grounding behavior with the VideoSearch-R1 training pipeline.

## Usage

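Run this checkpoint through the [VideoSearch-R1 codebase](https://github.com/mlvlab/VideoSearch-R1). The script paths below are relative to the repository root: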

```bash
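# Download the pre-extracted ActivityNet data
bash scripts/data_construct/download_preextracted_data.bash activitynet
# Run inference with this checkpoint; EVAL_GPUS sets the GPU id(s) to use
EVAL_GPUS=0 bash scripts/inference/inference.bash activitynet --checkpoint VideoSearchR1/activitynet-stage2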
```
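
If the evaluation machine has limited network access, it may help to fetch the weights ahead of time. The sketch below assumes that the name passed to `--checkpoint` is also a Hugging Face Hub repository id, and that the `huggingface-cli` tool from the `huggingface_hub` package is installed. `LOCAL_DIR` and `fetch_checkpoint` are hypothetical names chosen for illustration, not part of the VideoSearch-R1 codebase.

```bash
# Assumption: the --checkpoint value doubles as a Hugging Face Hub repo id.
REPO_ID="VideoSearchR1/activitynet-stage2"
# Hypothetical target directory: checkpoints/<repo name without the owner prefix>.
LOCAL_DIR="checkpoints/${REPO_ID##*/}"

# Hypothetical helper: download every file of the repo into LOCAL_DIR.
fetch_checkpoint() {
  huggingface-cli download "$REPO_ID" --local-dir "$LOCAL_DIR"
}

echo "Call fetch_checkpoint to download $REPO_ID into $LOCAL_DIR"
```

Calling `fetch_checkpoint` afterwards performs the download. Whether `inference.bash` accepts a local directory in place of the repository id is not stated here, so check the repository before pointing `--checkpoint` at `LOCAL_DIR`.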

## Citation

```bibtex
@inproceedings{lee2026videosearchr1,
  title     = {VideoSearch-R1: Iterative Video Retrieval and Reasoning via Soft Query Refinement},
  author    = {Lee, Seohyun and Choi, Seoung and Ko, Dohwan and Kim, Jongha and Kim, Hyunwoo J.},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year      = {2026}
}
```