---
license: apache-2.0
tags:
- pytorch
---

<a id="top"></a>
<div align="center">
<h1>DRONE: Cross-modal Representation Shift Refinement for Point-supervised Video Moment Retrieval</h1>

<p>
<b>Kun Wang</b><sup>1</sup>
<b>Yupeng Hu</b><sup>1✉</sup>
<b>Hao Liu</b><sup>1</sup>
<b>Jiang Shao</b><sup>1</sup>
<b>Liqiang Nie</b><sup>2</sup>
</p>

<p>
<sup>1</sup>School of Software, Shandong University, Jinan, China<br>
<sup>2</sup>School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, China<br>
<sup>✉</sup>Corresponding author
</p>
</div>

This repository provides the official implementation, pre-trained model weights, and configuration files for **DRONE**, a point-supervised Video Moment Retrieval (VMR) framework designed to mitigate cross-modal representation shift.

- **Paper:** [Accepted by ACM TOIS 2026](https://dl.acm.org/doi/10.1145/3786606)
- **GitHub Repository:** [iLearn-Lab/DRONE](https://github.com/iLearn-Lab/DRONE)

---
## Model Information

### 1. Model Name
**DRONE** (Cross-modal Representation Shift Refinement)

### 2. Task Type & Applicable Tasks
- **Task Type:** Point-supervised Video Moment Retrieval (VMR) / Vision-Language / Multimodal Learning
- **Applicable Tasks:** Localizing the temporal segments of untrimmed videos that match natural language queries, using only point-level supervision to reduce annotation cost while addressing cross-modal representation shift.

### 3. Project Introduction
Point-supervised Video Moment Retrieval (VMR) aims to localize the temporal segment in a video that matches a natural language query, using only single-frame annotations as supervision. **DRONE** addresses the cross-modal representation shift inherent in this setting by progressively improving the temporal alignment and semantic consistency between video and text representations.

> **Method Highlight:** DRONE introduces **Pseudo-Frame Temporal Alignment (PTA)** and **Curriculum-Guided Semantic Refinement (CSR)**. Together, these modules mitigate representation shift and help the model bridge the semantic gap between visual frames and textual queries.

### 4. Training Data Source
DRONE supports, and is evaluated on, three standard VMR datasets:
- **ActivityNet Captions**
- **Charades-STA**
- **TACoS**

*(Splits and feature preparation follow [ViGA](https://github.com/r-cui/ViGA).)*

---
## Usage & Basic Inference

### Step 1: Prepare the Environment
Clone the GitHub repository and set up a virtual environment:
```bash
git clone https://github.com/iLearn-Lab/DRONE.git
cd DRONE
```
```bash
python -m venv .venv
source .venv/bin/activate   # Linux / Mac
# .venv\Scripts\activate    # Windows
```
```bash
pip install numpy scipy pyyaml tqdm
```

### Step 2: Download Model Weights & Data
1. **Pre-trained Checkpoints:** Download the model checkpoints, which include `Act_ckpt/`, `Cha_ckpt/`, and `TACoS_ckpt/`.
2. **Datasets & Features:** Follow [ViGA](https://github.com/r-cui/ViGA)'s dataset preparation guidelines for ActivityNet Captions, Charades-STA, and TACoS.
3. **Configuration:** Before running, replace the dataset root and feature paths in `src/config.yaml` and `src/utils/utils.py` with your own local paths.
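
The path check in item 3 can be partly automated. The sketch below is not part of the DRONE repository: it walks an already-loaded configuration and reports string values that look like filesystem paths but do not exist on disk. The helper names (`iter_path_like`, `find_missing_paths`, `check_config_file`) and the "contains a path separator" heuristic are assumptions made for illustration. Nothing is assumed about the actual key layout of `src/config.yaml`, and paths hard-coded in `src/utils/utils.py` still need a manual check.

```python
from __future__ import annotations

import os
from typing import Any, Iterator


def iter_path_like(node: Any, prefix: str = "") -> Iterator[tuple[str, str]]:
    """Yield (dotted_key, value) for every string value that looks like a path.

    Heuristic (an assumption, not DRONE's logic): a string counts as a path
    if it contains "/" or the platform path separator.
    """
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{prefix}.{key}" if prefix else str(key)
            yield from iter_path_like(value, child)
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from iter_path_like(value, f"{prefix}[{index}]")
    elif isinstance(node, str) and ("/" in node or os.sep in node):
        yield prefix, node


def find_missing_paths(config: Any) -> list[tuple[str, str]]:
    """Return the path-like entries of a loaded config that do not exist locally."""
    return [(key, path) for key, path in iter_path_like(config)
            if not os.path.exists(path)]


def check_config_file(config_path: str = "src/config.yaml") -> list[tuple[str, str]]:
    """Load a YAML config and report its missing paths."""
    import yaml  # PyYAML, installed in Step 1

    with open(config_path, "r", encoding="utf-8") as handle:
        return find_missing_paths(yaml.safe_load(handle))
```

Calling `check_config_file()` from the repository root returns a list of `(key, path)` pairs that still need updating. An empty list means every path-like value resolves on the local machine.
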

### Step 3: Run Training & Evaluation

**Training from Scratch:**
Run the command for the dataset you want to train on:

#### For ActivityNet Captions
```bash
python -m src.experiment.train --task activitynetcaptions
```

#### For Charades-STA
```bash
python -m src.experiment.train --task charadessta
```

#### For TACoS
```bash
python -m src.experiment.train --task tacos
```
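
To queue all three datasets in one go, the commands above can be generated from a small script. This is a minimal sketch, assuming it is run from the repository root inside the activated environment. `TASKS`, `build_train_command`, and `train_all` are hypothetical names that do not exist in the repository, and only the `--task` values listed above are used.

```python
import subprocess
import sys

# Task names taken from the training commands above.
TASKS = ("activitynetcaptions", "charadessta", "tacos")


def build_train_command(task: str) -> list:
    """Build the argument list for one training run."""
    if task not in TASKS:
        raise ValueError(f"unknown task {task!r}; expected one of {TASKS}")
    # sys.executable keeps the run inside the current virtual environment.
    return [sys.executable, "-m", "src.experiment.train", "--task", task]


def train_all(tasks=TASKS, dry_run=True) -> list:
    """Print each training command and, unless dry_run is set, execute it."""
    commands = [build_train_command(task) for task in tasks]
    for command in commands:
        print(" ".join(command))
        if not dry_run:
            # check=True stops the queue as soon as one run fails.
            subprocess.run(command, check=True)
    return commands
```

By default `train_all()` only prints the commands. Pass `dry_run=False` to launch the runs one after another.
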

**Evaluation:**
To evaluate a trained experiment folder, which should contain `config.yaml` and `model_best.pt`, run:

```bash
python -m src.experiment.eval --exp path/to/your/experiment_folder
```
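
Before launching evaluation, it can help to confirm that the experiment folder is complete. The following is a minimal sketch, not part of the repository, that checks for the two files named above. `REQUIRED_FILES` and `missing_experiment_files` are hypothetical names; an empty result means both files are present.

```python
from pathlib import Path

# Files the evaluation step expects inside the experiment folder.
REQUIRED_FILES = ("config.yaml", "model_best.pt")


def missing_experiment_files(exp_dir) -> list:
    """Return the required files that are absent from exp_dir."""
    exp_path = Path(exp_dir)
    return [name for name in REQUIRED_FILES if not (exp_path / name).is_file()]
```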

---

## Limitations & Notes

**Disclaimer:** This framework and its pre-trained weights are intended for **academic research purposes only**.

- Full evaluation requires access to the original source datasets (ActivityNet Captions, Charades-STA, TACoS).
- Although DRONE is designed to mitigate cross-modal representation shift, its performance depends on the quality of the point-level annotations and on the capacity of the selected visual backbone (C3D, I3D, VGG).

---

## Acknowledgements & Contact

- **Acknowledgement:** This implementation and its data organization are inspired by the [ViGA](https://github.com/r-cui/ViGA) open-source project. Thanks to all collaborators and contributors of this project.
- **Contact:** If you have any questions, feel free to contact me at `khylon.kun.wang@gmail.com`.

---
## Citation

If you find our work or this repository useful in your research, please consider citing our paper:

```bibtex
@article{wang2026cross,
  title={Cross-Modal Representation Shift Refinement for Point-supervised Video Moment Retrieval},
  author={Wang, Kun and Hu, Yupeng and Liu, Hao and Shao, Jiang and Nie, Liqiang},
  journal={ACM Transactions on Information Systems},
  volume={44},
  number={3},
  pages={1--30},
  year={2026},
  publisher={ACM New York, NY}
}
```