Video-Text-to-Text
Transformers
Safetensors
qwen3_vl
image-text-to-text
video-retrieval
temporal-grounding
videosearch-r1
Instructions to use VideoSearchR1/didemo-stage2 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use VideoSearchR1/didemo-stage2 with Transformers:
# Load model directly from transformers import AutoProcessor, AutoModelForMultimodalLM processor = AutoProcessor.from_pretrained("VideoSearchR1/didemo-stage2") model = AutoModelForMultimodalLM.from_pretrained("VideoSearchR1/didemo-stage2") - Notebooks
- Google Colab
- Kaggle
metadata
base_model: Qwen/Qwen3-VL-4B-Instruct
library_name: transformers
license: apache-2.0
pipeline_tag: video-text-to-text
tags:
- video-retrieval
- temporal-grounding
- videosearch-r1
VideoSearch-R1 DiDeMo Stage 2
This is the Stage 2 VideoSearch-R1 checkpoint trained for DiDeMo, presented in the paper VideoSearch-R1: Iterative Video Retrieval and Reasoning via Soft Query Refinement.
- Project Page: mlvlab.github.io/VideoSearch-R1
- Repository: GitHub - mlvlab/VideoSearch-R1
Usage
Use with the VideoSearch-R1 codebase:
bash scripts/data_construct/download_preextracted_data.bash didemo
EVAL_GPUS=0 bash scripts/inference/inference.bash didemo --checkpoint VideoSearchR1/didemo-stage2
Citation
@inproceedings{lee2026videosearchr1,
title = {VideoSearch-R1: Iterative Video Retrieval and Reasoning via Soft Query Refinement},
author = {Lee, Seohyun and Choi, Seoung and Ko, Dohwan and Kim, Jongha and Kim, Hyunwoo J.},
booktitle = {European Conference on Computer Vision (ECCV)},
year = {2026}
}