---
license: apache-2.0
tags:
- pytorch
---
<a id="top"></a>
<div align="center">
<h1>DRONE: Cross-modal Representation Shift Refinement for Point-supervised Video Moment Retrieval</h1>
<p>
<b>Kun Wang</b><sup>1</sup>
<b>Yupeng Hu</b><sup>1&dagger;</sup>
<b>Hao Liu</b><sup>1</sup>
<b>Jiang Shao</b><sup>1</sup>
<b>Liqiang Nie</b><sup>2</sup>
</p>
<p>
<sup>1</sup>School of Software, Shandong University, Jinan, China<br>
<sup>2</sup>School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, China<br>
<sup>&dagger;</sup>Corresponding author
</p>
</div>
This repository provides the official implementation, pre-trained model weights, and configuration files for **DRONE**, a point-supervised Video Moment Retrieval (VMR) framework designed to mitigate cross-modal representation shift.

**Paper:** [Accepted by ACM TOIS 2026](https://dl.acm.org/doi/10.1145/3786606)

**GitHub Repository:** [iLearn-Lab/DRONE](https://github.com/iLearn-Lab/DRONE)
---
## Model Information
### 1. Model Name
**DRONE** (Cross-modal Representation Shift Refinement)
### 2. Task Type & Applicable Tasks
- **Task Type:** Point-supervised Video Moment Retrieval (VMR) / Vision-Language / Multimodal Learning
- **Applicable Tasks:** Localizing temporal segments in untrimmed videos that match natural language queries, utilizing only point-level supervision to reduce annotation costs while actively addressing cross-modal representation shifts.
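As background on how moment retrieval is scored, the overlap between a predicted segment and the ground-truth moment is commonly measured by temporal IoU (the exact metrics reported in the paper may differ; this helper is purely illustrative):

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) segments, in seconds.

    Returns 0.0 when the segments do not overlap.
    """
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# Example: prediction (2s, 8s) vs. ground truth (4s, 10s)
# overlap = 4s, union = 8s -> IoU = 0.5
score = temporal_iou((2.0, 8.0), (4.0, 10.0))
```

VMR benchmarks typically report recall at fixed IoU thresholds (e.g. "R@1, IoU=0.5"), counting a retrieval as correct when the score clears the threshold.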
### 3. Project Introduction
Point-supervised Video Moment Retrieval (VMR) aims to localize the temporal segment in a video that matches a natural language query using only single-frame annotations. **DRONE** addresses the cross-modal representation shift issue inherent in this setting, progressively improving temporal alignment and semantic consistency between video and text representations.
> **Method Highlight:** DRONE introduces **Pseudo-Frame Temporal Alignment (PTA)** and **Curriculum-Guided Semantic Refinement (CSR)**. Together, these modules systematically mitigate representation shifts, allowing the model to bridge the semantic gap between visual frames and textual queries effectively.
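To make the point-supervised setting concrete, a common strategy in this line of work (illustrative only; not necessarily DRONE's exact PTA formulation) is to expand the single annotated frame into a soft Gaussian prior over all frames, which then serves as frame-level pseudo supervision:

```python
import math

def gaussian_pseudo_labels(point_frame, num_frames, sigma=2.0):
    """Expand one annotated frame index into soft frame-level pseudo
    labels via a Gaussian centered on the annotation.

    The annotated frame gets weight 1.0; weights decay with temporal
    distance, controlled by `sigma` (a hypothetical hyperparameter).
    """
    labels = [math.exp(-((t - point_frame) ** 2) / (2 * sigma ** 2))
              for t in range(num_frames)]
    peak = max(labels)
    return [w / peak for w in labels]

# Frame 5 of a 10-frame clip is annotated; nearby frames receive
# smoothly decaying pseudo supervision.
weights = gaussian_pseudo_labels(point_frame=5, num_frames=10)
```

Such soft labels give the alignment module a temporal neighborhood to learn from, rather than a single frame, which is one way point supervision can be bootstrapped into segment-level training signals.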
### 4. Training Data Source
The model supports and is evaluated on three standard VMR datasets:
- **ActivityNet Captions**
- **Charades-STA**
- **TACoS**
*(Follows splits and feature preparation from [ViGA](https://github.com/r-cui/ViGA))*
---
## Usage & Basic Inference
### Step 1: Prepare the Environment
Clone the GitHub repository and set up the virtual environment:
```bash
git clone https://github.com/iLearn-Lab/DRONE.git
cd DRONE
```
```bash
python -m venv .venv
source .venv/bin/activate # Linux / Mac
# .venv\Scripts\activate # Windows
```
```bash
# PyTorch is also required; install a build matching your CUDA setup
pip install torch numpy scipy pyyaml tqdm
```
### Step 2: Download Model Weights & Data
1. **Pre-trained Checkpoints:** Download the model checkpoints (including `Act_ckpt/`, `Cha_ckpt/`, and `TACoS_ckpt/`).
2. **Datasets & Features:** Follow [ViGA](https://github.com/r-cui/ViGA)'s dataset preparation guidelines for ActivityNet Captions, Charades-STA, and TACoS.
3. **Configuration:** Before running, ensure you replace the local dataset root and feature paths in `src/config.yaml` and `src/utils/utils.py` with your actual local paths.
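As an illustration of step 3, the path entries to update might look like the following. The key names below are **placeholders, not the actual schema** — check `src/config.yaml` in the repository for the real keys:

```yaml
# Hypothetical example only -- consult src/config.yaml for the real key names.
dataset_root: /data/activitynet_captions   # root of the dataset annotations
feature_path: /data/features/c3d           # pre-extracted visual features
```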
### Step 3: Run Training & Evaluation
**Training from Scratch:**
Depending on the dataset you want to train on, run the corresponding command:
#### For ActivityNet Captions
```bash
python -m src.experiment.train --task activitynetcaptions
```
#### For Charades-STA
```bash
python -m src.experiment.train --task charadessta
```
#### For TACoS
```bash
python -m src.experiment.train --task tacos
```
**Evaluation:**
To evaluate a trained experiment folder (which should contain `config.yaml` and `model_best.pt`), run:
```bash
python -m src.experiment.eval --exp path/to/your/experiment_folder
```
---
## Limitations & Notes
**Disclaimer:** This framework and its pre-trained weights are intended for **academic research purposes only**.
- The model requires access to the original source datasets (ActivityNet Captions, Charades-STA, TACoS) for full evaluation.
- While designed to mitigate cross-modal representation shifts, performance relies on the quality of the point-level annotations and the inherent capacities of the selected visual backbones (C3D, I3D, VGG).
---
## Acknowledgements & Contact
- **Acknowledgement:** This implementation and data organization are inspired by the [ViGA](https://github.com/r-cui/ViGA) open-source community. Thanks to all collaborators and contributors of this project.
- **Contact:** If you have any questions, feel free to contact me at `khylon.kun.wang@gmail.com`.
---
## Citation
If you find our work or this repository useful in your research, please consider citing our paper:
```bibtex
@article{wang2026cross,
  title={Cross-Modal Representation Shift Refinement for Point-supervised Video Moment Retrieval},
  author={Wang, Kun and Hu, Yupeng and Liu, Hao and Shao, Jiang and Nie, Liqiang},
  journal={ACM Transactions on Information Systems},
  volume={44},
  number={3},
  pages={1--30},
  year={2026},
  publisher={ACM New York, NY}
}
```