	---
	license: apache-2.0
	tags:
	- pytorch
	---

	<a id="top"></a>
	<div align="center">
	<h1>🚀 DRONE: Cross-modal Representation Shift Refinement for Point-supervised Video Moment Retrieval</h1>

	<p>
	<b>Kun Wang</b><sup>1</sup>
	<b>Yupeng Hu</b><sup>1✉</sup>
	<b>Hao Liu</b><sup>1</sup>
	<b>Jiang Shao</b><sup>1</sup>
	<b>Liqiang Nie</b><sup>2</sup>
	</p>

	<p>
	<sup>1</sup>School of Software, Shandong University, Jinan, China<br>
	<sup>2</sup>School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, China<br>
	<sup>✉</sup>Corresponding author
	</p>
	</div>

	This repository provides the official implementation, pre-trained model weights, and configuration files for DRONE, a point-supervised Video Moment Retrieval (VMR) framework designed to mitigate cross-modal representation shift.

	- 🔗 Paper: [Accepted by ACM TOIS 2026](https://dl.acm.org/doi/10.1145/3786606)
	- 🔗 GitHub Repository: [iLearn-Lab/DRONE](https://github.com/iLearn-Lab/DRONE)

	---

	## 📌 Model Information

	### 1. Model Name
	DRONE (Cross-modal Representation Shift Refinement)

	### 2. Task Type & Applicable Tasks
	- Task Type: Point-supervised Video Moment Retrieval (VMR) / Vision-Language / Multimodal Learning
	- Applicable Tasks: Localizing the temporal segments of untrimmed videos that match natural language queries, using only point-level supervision to reduce annotation cost while mitigating cross-modal representation shift.

	### 3. Project Introduction
	Point-supervised Video Moment Retrieval (VMR) aims to localize the temporal segment in a video that matches a natural language query, using only single-frame annotations as supervision. DRONE addresses the cross-modal representation shift inherent in this setting by progressively improving temporal alignment and semantic consistency between video and text representations.
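
To make the supervision setting concrete, the following minimal sketch contrasts a segment-level label with a point-level one. The class names, fields, example query, and the uniform sampling of the annotated timestamp are all hypothetical illustrations and are not taken from DRONE's data format.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SegmentAnnotation:
    """Fully supervised label: start and end of the target moment, in seconds."""
    query: str
    start: float
    end: float


@dataclass(frozen=True)
class PointAnnotation:
    """Point-level label: a single timestamp assumed to lie inside the moment."""
    query: str
    timestamp: float


def to_point(annotation: SegmentAnnotation, rng: random.Random) -> PointAnnotation:
    """Simulate point supervision by sampling one timestamp inside the segment.

    Uniform sampling is an assumption made for illustration only.
    """
    if annotation.end < annotation.start:
        raise ValueError("segment end must not precede its start")
    timestamp = rng.uniform(annotation.start, annotation.end)
    return PointAnnotation(query=annotation.query, timestamp=timestamp)


# Hypothetical example: the boundaries are discarded, only one timestamp is kept.
full = SegmentAnnotation(query="a person opens the fridge", start=12.0, end=19.5)
point = to_point(full, random.Random(0))
```

The point label carries no boundary information, which is why the model has to infer the extent of the moment from the alignment between video and text.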

	> 💡 Method Highlight: DRONE introduces Pseudo-Frame Temporal Alignment (PTA) and Curriculum-Guided Semantic Refinement (CSR). Together, these modules systematically mitigate representation shifts, allowing the model to bridge the semantic gap between visual frames and textual queries effectively.

	### 4. Training Data Source
	The model is trained and evaluated on three standard VMR datasets:
	- ActivityNet Captions
	- Charades-STA
	- TACoS

	Dataset splits and feature preparation follow [ViGA](https://github.com/r-cui/ViGA).

	---

	## 🚀 Usage & Basic Inference

	### Step 1: Prepare the Environment
	Clone the GitHub repository and set up the virtual environment:
	```bash
	git clone https://github.com/iLearn-Lab/DRONE.git
	cd DRONE
	```
	```bash
	python -m venv .venv
	source .venv/bin/activate # Linux / Mac
	# .venv\Scripts\activate # Windows
	```
	```bash
	pip install numpy scipy pyyaml tqdm
	```
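
After installation, a quick check along the following lines can confirm that the packages from the `pip install` command are importable in the active environment. This is only a sketch: the helper name `missing_packages` is made up for illustration, and the card's `pytorch` tag suggests PyTorch is also needed, but since it is not part of the install command it is left out of the check.

```python
from importlib.util import find_spec
from typing import Dict, List

# pip distribution name -> importable module name.
# PyYAML is installed as "pyyaml" but imported as "yaml".
REQUIRED_PACKAGES: Dict[str, str] = {
    "numpy": "numpy",
    "scipy": "scipy",
    "pyyaml": "yaml",
    "tqdm": "tqdm",
}


def missing_packages(packages: Dict[str, str] = REQUIRED_PACKAGES) -> List[str]:
    """Return the pip names of packages whose module cannot be located."""
    return [
        pip_name
        for pip_name, module_name in packages.items()
        if find_spec(module_name) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages, try: pip install " + " ".join(missing))
    else:
        print("All listed packages are importable.")
```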

	### Step 2: Download Model Weights & Data
	1. Pre-trained Checkpoints: Download the model checkpoints, which include `Act_ckpt/`, `Cha_ckpt/`, and `TACoS_ckpt/`.
	2. Datasets & Features: Follow [ViGA](https://github.com/r-cui/ViGA)'s dataset preparation guidelines for ActivityNet Captions, Charades-STA, and TACoS.
	3. Configuration: Before running, replace the dataset root and feature paths in `src/config.yaml` and `src/utils/utils.py` with your own local paths.
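
Since a forgotten path edit is easy to miss, a small check such as the sketch below can list path-like values in a loaded configuration that do not exist on disk. It assumes the configuration is a nested mapping with string values. The helper names are hypothetical, the "looks like a path" test is a rough heuristic, and the actual layout of `src/config.yaml` may differ. Paths hard-coded in `src/utils/utils.py` are not covered.

```python
from pathlib import Path
from typing import Any, Iterator, List, Tuple


def iter_strings(node: Any, prefix: str = "") -> Iterator[Tuple[str, str]]:
    """Yield (dotted_key, value) for every string leaf of a nested config."""
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{prefix}.{key}" if prefix else str(key)
            yield from iter_strings(value, child)
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from iter_strings(value, f"{prefix}[{index}]")
    elif isinstance(node, str):
        yield prefix, node


def find_missing_paths(config: dict) -> List[Tuple[str, str]]:
    """Return string values that look like filesystem paths but do not exist."""
    missing = []
    for key, value in iter_strings(config):
        # Heuristic: any string containing a path separator is treated as a path.
        looks_like_path = "/" in value or "\\" in value
        if looks_like_path and not Path(value).expanduser().exists():
            missing.append((key, value))
    return missing


def load_config(path: str) -> dict:
    """Load a YAML file with PyYAML (installed in Step 1)."""
    import yaml  # imported lazily so the helpers above work without PyYAML

    with open(path, "r", encoding="utf-8") as handle:
        return yaml.safe_load(handle)
```

Calling `find_missing_paths(load_config("src/config.yaml"))` from the repository root returns a list of `(key, value)` pairs to review. An empty list means every path-like value resolved.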

	### Step 3: Run Training & Evaluation

	Training from Scratch:
	Run the command for the dataset you want to train on:

	#### For ActivityNet Captions
	`python -m src.experiment.train --task activitynetcaptions`

	#### For Charades-STA
	`python -m src.experiment.train --task charadessta`

	#### For TACoS
	`python -m src.experiment.train --task tacos`
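
To train on all three datasets in sequence, the commands above can be wrapped in a small launcher. The sketch below is one possible way to do it. The function names and the `dry_run` switch are hypothetical, and only the module path and `--task` values come from the commands above. It should be run from the repository root.

```python
import subprocess
import sys
from typing import List, Sequence

# Task names as accepted by --task in the commands above.
TASKS = ("activitynetcaptions", "charadessta", "tacos")


def build_train_command(task: str) -> List[str]:
    """Build the training command for one dataset, using the current interpreter."""
    if task not in TASKS:
        raise ValueError(f"unknown task {task!r}, expected one of {TASKS}")
    return [sys.executable, "-m", "src.experiment.train", "--task", task]


def train_all(tasks: Sequence[str] = TASKS, dry_run: bool = True) -> List[List[str]]:
    """Print each training command and, unless dry_run is set, execute it."""
    commands = [build_train_command(task) for task in tasks]
    for command in commands:
        print(" ".join(command))
        if not dry_run:
            # check=True stops the sequence if a training run fails.
            subprocess.run(command, check=True)
    return commands
```

Using `sys.executable` keeps the runs inside the virtual environment that launched the script. `dry_run` defaults to `True` so that nothing starts training by accident.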


	Evaluation (Eval):
	To evaluate a trained experiment folder (which should contain `config.yaml` and `model_best.pt`), run:

	`python -m src.experiment.eval --exp path/to/your/experiment_folder`
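
Before launching evaluation, it can help to confirm that the experiment folder holds the two files named above. The following sketch performs that check and builds the evaluation command. The helper names are hypothetical, and the check makes no assumption about what the files contain.

```python
import sys
from pathlib import Path
from typing import List, Union

# Files an experiment folder should contain, per the evaluation instructions.
REQUIRED_FILES = ("config.yaml", "model_best.pt")


def missing_experiment_files(exp_dir: Union[str, Path]) -> List[str]:
    """Return the required file names that are absent from exp_dir."""
    folder = Path(exp_dir)
    return [name for name in REQUIRED_FILES if not (folder / name).is_file()]


def build_eval_command(exp_dir: Union[str, Path]) -> List[str]:
    """Build the evaluation command, failing early if a required file is missing."""
    missing = missing_experiment_files(exp_dir)
    if missing:
        raise FileNotFoundError(f"{exp_dir} is missing: {', '.join(missing)}")
    return [sys.executable, "-m", "src.experiment.eval", "--exp", str(exp_dir)]
```

The returned list can be passed to `subprocess.run(..., check=True)` from the repository root.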

	---

	## ⚠️ Limitations & Notes

	Disclaimer: This framework and its pre-trained weights are intended for academic research purposes only.
	- The model requires access to the original source datasets (ActivityNet Captions, Charades-STA, TACoS) for full evaluation.
	- Although DRONE is designed to mitigate cross-modal representation shift, its performance depends on the quality of the point-level annotations and on the capacity of the selected visual backbone (C3D, I3D, VGG).

	---

	## 🤝 Acknowledgements & Contact

	- Acknowledgement: This implementation and its data organization are inspired by the open-source [ViGA](https://github.com/r-cui/ViGA) project. Thanks to all of its collaborators and contributors.
	- Contact: If you have any questions, feel free to contact me at `khylon.kun.wang@gmail.com`.

	---

	## 📝⭐️ Citation

	If you find our work or this repository useful in your research, please consider citing our paper:


	@article{wang2026cross,
	title={Cross-Modal Representation Shift Refinement for Point-supervised Video Moment Retrieval},
	author={Wang, Kun and Hu, Yupeng and Liu, Hao and Shao, Jiang and Nie, Liqiang},
	journal={ACM Transactions on Information Systems},
	volume={44},
	number={3},
	pages={1--30},
	year={2026},
	publisher={ACM New York, NY}
	}