Enhance model card for VideoTG-R1 (#1)
by nielsr (HF Staff), opened

README.md CHANGED
---
license: mit
pipeline_tag: video-text-to-text
library_name: transformers
datasets:
- yeliudev/VideoMind-Dataset
tags:
- video-temporal-grounding
- multimodal-llm
- reinforcement-learning
- curriculum-learning
---

<h2 align="center">VideoTG-R1: Boosting Video Temporal Grounding via Curriculum Reinforcement Learning on Reflected Boundary Annotations</h2>

<p align="center">
  <a href="https://huggingface.co/papers/2510.23397" target="_blank"><img src="https://img.shields.io/badge/%F0%9F%93%96%20Paper-Hugging%20Face-ffbd45.svg"></a>
  <a href="https://github.com/ldong1111/VideoTG-R1" target="_blank"><img src="https://img.shields.io/badge/%F0%9F%92%BB%20Code-GitHub-brightgreen.svg"></a>
</p>

<p align="center">
Lu Dong<sup>1,3</sup>, Haiyu Zhang<sup>2,3</sup>, Han Lin<sup>4,3</sup>, Ziang Yan<sup>5,3</sup>, Xiangyu Zeng<sup>6,3</sup>, Hongjie Zhang<sup>3</sup>, Yifei Huang<sup>3</sup>, Yi Wang<sup>3</sup>, Zhen-Hua Ling<sup>1</sup>, Limin Wang<sup>6,3</sup>, Yali Wang<sup>7,3</sup><sup>†</sup>
</p>
<p align="center">
<sup>1</sup>University of Science and Technology of China
<sup>2</sup>Beihang University
<sup>3</sup>Shanghai Artificial Intelligence Laboratory<br>
<sup>4</sup>Shanghai Jiao Tong University
<sup>5</sup>Zhejiang University
<sup>6</sup>Nanjing University
<sup>7</sup>Chinese Academy of Sciences
</p>

<p align="center">† Corresponding author</p>

![teaser](asserts/teaser.png)

**VideoTG-R1** is a multi-agent system for data-efficient video temporal grounding. It contains three modules: 1) a Boundary Reflection Agent that filters the training data by identifying and discarding partially annotated samples; 2) a Difficulty Estimation Agent that estimates the difficulty of each sample via zero-shot evaluation; and 3) a curriculum RL strategy that dynamically masks the videos of hard-to-ground samples according to the training step, easing their training difficulty. VideoTG-R1 achieves state-of-the-art performance on Charades-STA and ActivityNet-Captions. Moreover, with only 10% of the training data, our method outperforms models trained on the full dataset under both GRPO and SFT paradigms.
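The curriculum masking idea above can be sketched in a few lines. This is a minimal illustration, not the released training code: the linear schedule, the starting mask ratio, and the function names are assumptions.

```python
def mask_ratio(step: int, total_steps: int, start: float = 0.75) -> float:
    """Fraction of frames outside the annotated segment to hide for a
    hard-to-ground sample; decays linearly to 0 as training progresses."""
    progress = min(max(step / max(total_steps, 1), 0.0), 1.0)
    return start * (1.0 - progress)


def visible_frames(num_frames: int, segment: tuple[int, int],
                   step: int, total_steps: int) -> list[int]:
    """Always keep the annotated [lo, hi] segment visible; reveal a
    growing share of the remaining frames as the schedule relaxes."""
    lo, hi = segment
    inside = list(range(lo, hi + 1))
    outside = [i for i in range(num_frames) if i < lo or i > hi]
    keep = int(round(len(outside) * (1.0 - mask_ratio(step, total_steps))))
    return sorted(inside + outside[:keep])
```

Keeping the annotated segment always visible means the reward signal stays reachable for hard samples, while the visible context gradually grows back to the full video.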

### Abstract

Video temporal grounding (VTG) aims to locate precise segments in videos based on language queries, which is a fundamental challenge in video understanding. While recent Multimodal Large Language Models (MLLMs) have shown promise in tackling VTG through reinforcement learning (RL), they overlook the challenges arising from both the quality and difficulty of training samples. (1) Partially annotated samples. Many samples contain relevant segments beyond the annotated interval, introducing ambiguous supervision. (2) Hard-to-ground samples. Samples with poor zero-shot performance produce consistently low and indistinguishable rewards during RL training, exhibiting no clear preference among multiple outputs and thus hindering learning efficiency. To address these challenges, we propose VideoTG-R1, a novel curriculum RL framework with reflected boundary annotations, enabling data-efficient training. Specifically, we propose a Boundary Reflection Agent that utilizes MLLMs to predict query-relevant timestamps outside the annotated intervals, allowing us to identify and filter out partially annotated samples, thereby reducing ambiguity. Furthermore, we introduce a Difficulty Estimation Agent to assess the training difficulty of each sample and design a curriculum RL strategy that dynamically masks the videos of hard-to-ground samples according to the training steps, easing the training difficulty and providing clearer preference. Experiments on the VTG and grounded VideoQA tasks demonstrate the effectiveness of our method. Remarkably, with only 10% of the training samples and 21% of the computational budget, VideoTG-R1 outperforms full-data counterparts under both group relative policy optimization (GRPO) and supervised fine-tuning (SFT). The code is available at https://github.com/ldong1111/VideoTG-R1.

![framework](asserts/framework.png)

![performance](asserts/performance.png)

## Code Environment Preparation

```bash
pip install -r requirements.txt
```

## Dataset Preparation

All training and evaluation datasets can be downloaded from [VideoMind's Hugging Face repository](https://huggingface.co/datasets/yeliudev/VideoMind-Dataset/tree/main).

## Evaluation

1. Video grounding

```bash
cd Eval

your_ckpt=xxx
dataset_name=charades # or anet

bash video_grounding.sh ${dataset_name}
```
2. Grounded QA

```bash
cd Eval

your_ckpt=xxx
dataset_name=rextime # or nextgqa

bash grounded_qa.sh ${dataset_name}
```
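Grounding quality on these benchmarks is typically reported as Recall@1 at temporal IoU thresholds (e.g. R1@0.5, R1@0.7). A minimal sketch of that metric, for reference only; the scripts under `Eval` are authoritative:

```python
def temporal_iou(pred: tuple[float, float], gt: tuple[float, float]) -> float:
    """IoU of two [start, end] segments in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0


def recall_at_iou(preds: list, gts: list, threshold: float = 0.5) -> float:
    """Fraction of queries whose top-1 predicted segment reaches the
    IoU threshold (R1@threshold)."""
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(gts)
```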

## Fast Training

1. Download the **intermediate_results** from Hugging Face first.

```bash
cd ./videotg_r1

bash grpo_train.sh
```

## Full Training

1. **Boundary reflection agent**
   1. Evaluate each dataset from `qvhighlights, didemo, tacos, queryd, hirest_grounding, hirest_step, cosmo_cap, internvid_vtime`.
   2. You can run the code on a single GPU or multiple GPUs.
```bash
cd ./BRA

dataset_name=cosmo_cap
each_fold_size=22000
fold_index=0

bash bra_test.sh ${dataset_name} ${each_fold_size} ${fold_index}
```
2. **Difficulty estimation agent**
   1. Evaluate each dataset from `qvhighlights, didemo, tacos, queryd, hirest_grounding, hirest_step, cosmo_cap, internvid_vtime`.
   2. You can run the code on a single GPU or multiple GPUs.
```bash
cd ./DEA

dataset_name=cosmo_cap
each_fold_size=22000
fold_index=0

bash bra_test.sh ${dataset_name} ${each_fold_size} ${fold_index}
```
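The difficulty estimation step can be illustrated as follows. This is a simplified sketch, assuming difficulty is bucketed by the base model's zero-shot temporal IoU against the annotation; the thresholds and function names are hypothetical:

```python
def temporal_iou(pred: tuple[float, float], gt: tuple[float, float]) -> float:
    """IoU of two [start, end] segments in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0


def estimate_difficulty(zero_shot_pred: tuple[float, float],
                        annotation: tuple[float, float],
                        easy_thr: float = 0.5, hard_thr: float = 0.1) -> str:
    """Bucket a training sample by how well the base model already
    grounds it zero-shot; 'hard' samples are the candidates for
    curriculum video masking."""
    iou = temporal_iou(zero_shot_pred, annotation)
    if iou >= easy_thr:
        return "easy"
    return "medium" if iou >= hard_thr else "hard"
```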

3. **Curriculum GRPO**

```bash
cd ./videotg_r1

bash grpo_train.sh
```

## Acknowledgement

VideoTG-R1 is built with reference to the following projects: [VideoMind](https://github.com/yeliudev/VideoMind) and [VideoChat-R1](https://github.com/OpenGVLab/VideoChat-R1). Thanks for their work!

## Citation

If you find our work helpful or inspiring, please feel free to cite it.
```bibtex
@article{dong2025videotgr1,
  author  = {Dong, Lu and Zhang, Haiyu and Lin, Han and Yan, Ziang and Zeng, Xiangyu and Zhang, Hongjie and Huang, Yifei and Wang, Yi and Ling, Zhen-Hua and Wang, Limin and Wang, Yali},
  title   = {VideoTG-R1: Boosting Video Temporal Grounding via Curriculum Reinforcement Learning on Reflected Boundary Annotations},
  journal = {arXiv preprint arXiv:2510.23397},
  year    = {2025},
}
```