---
license: apache-2.0
task_categories:
- video-retrieval
- image-retrieval
tags:
- composed-video-retrieval
- composed-image-retrieval
- vision-language
- pytorch
- icassp-2026
---

<a id="top"></a>
<div align="center">
<h1>🎬 (ICASSP 2026) RELATE: Enhance Composed Video Retrieval via Minimal-Redundancy Hierarchical Collaboration (Model Weights)</h1>
<div>
Shiqi Zhang<sup>1</sup>,
<a target="_blank" href="https://zivchen-ty.github.io/">Zhiwei Chen</a><sup>1</sup>,
<a target="_blank" href="https://lee-zixu.github.io/">Zixu Li</a><sup>1</sup>,
<a target="_blank" href="https://zhihfu.github.io/">Zhiheng Fu</a><sup>1</sup>,
Wenbo Wang<sup>1</sup>,
Jiajia Nie<sup>1</sup>,
<a target="_blank" href="https://faculty.sdu.edu.cn/weiyinwei1/zh_CN/index.htm">Yinwei Wei</a><sup>1</sup>,
<a target="_blank" href="https://faculty.sdu.edu.cn/huyupeng1/zh_CN/index.htm">Yupeng Hu</a><sup>1✉</sup>
</div>
<sup>1</sup>School of Software, Shandong University <br>
<sup>✉</sup> Corresponding author
<br/>
<p>
<a href="https://arxiv.org/abs/coming soon"><img alt='arXiv' src="https://img.shields.io/badge/arXiv-Coming.Soon-b31b1b.svg?style=flat-square"></a>
<a href="https://github.com/iLearn-Lab/ICASSP26-RELATE"><img alt='GitHub' src="https://img.shields.io/badge/GitHub-Repository-black?style=flat-square&logo=github"></a>
</p>
</div>

This repository hosts the official pre-trained model weights for **RELATE**, a minimal-redundancy hierarchical collaborative network designed to enhance both Composed Video Retrieval (CVR) and Composed Image Retrieval (CIR).
|
|
|
|
---
|
|
## 📌 Model Information
|
|
### 1. Model Name
**RELATE** (Enhance Composed Video Retrieval via Minimal-Redundancy Hierarchical Collaboration) checkpoints.
|
|
### 2. Task Type & Applicable Tasks
- **Task Type:** Composed Video Retrieval (CVR) and Composed Image Retrieval (CIR).
- **Applicable Tasks:** Retrieving target videos or images given a reference visual input and a modification text. The model improves on prior work by modeling the internal hierarchical structure of modification texts and by suppressing temporal redundancy in videos, two aspects existing methods largely neglect.
|
|
### 3. Project Introduction
**RELATE** is an open-source PyTorch framework built on top of BLIP-2. It achieves state-of-the-art (SOTA) performance across major benchmarks through three key innovations:
- 🧩 **Hierarchical Query Generation:** Parses the internal hierarchical structure of the modification text to capture the roles of different parts of speech, using noun phrases for object-level features and the complete sentence for global semantics.
- ⚙️ **Temporal Sparsification:** Adaptively attenuates redundant tokens corresponding to static backgrounds while amplifying tokens that carry critical dynamic information.
- 🎯 **Modification-Driven Modulation Learning:** Leverages the global semantics of the modification text to perform attention-based filtering on the sparsified visual features.
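The modulation idea above can be illustrated with a minimal sketch: the global text embedding acts as the attention query over sparsified video tokens. All dimensions, module names, and the single-layer design here are our own illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class ModulationFilter(nn.Module):
    """Illustrative sketch (not the official RELATE module): one
    cross-attention step in which the global text embedding filters
    temporally sparsified video tokens."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, text_global: torch.Tensor, video_tokens: torch.Tensor) -> torch.Tensor:
        # text_global: (B, 1, D) global sentence embedding (query)
        # video_tokens: (B, T, D) sparsified visual tokens (keys/values)
        fused, _ = self.attn(text_global, video_tokens, video_tokens)
        return fused  # (B, 1, D) modification-conditioned visual summary

B, T, D = 2, 16, 256
out = ModulationFilter(D)(torch.randn(B, 1, D), torch.randn(B, T, D))
print(out.shape)  # torch.Size([2, 1, 256])
```

The real model may stack multiple layers and use learned gating; consult the GitHub source for the actual implementation.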
|
|
### 4. Training Data Source & Hosted Weights
The RELATE framework supports and is evaluated on standard video and image retrieval benchmarks. This repository provides pre-trained weights for the following datasets:
* **CVR:** WebVid-CoVR
* **CIR:** FashionIQ and CIRR
|
|
*(Note: Please download the respective `.ckpt` or `.pt` files from the "Files and versions" tab of this Hugging Face repository.)*
|
|
---
|
|
## 🚀 Usage & Basic Inference
|
|
These weights are designed to be evaluated using the official Hydra-configured [RELATE GitHub repository](https://github.com/iLearn-Lab/ICASSP26-RELATE).
|
|
### Step 1: Prepare the Environment
We recommend using Anaconda to manage your environment. Clone the repository and install the required dependencies:
```bash
git clone https://github.com/iLearn-Lab/ICASSP26-RELATE
cd ICASSP26-RELATE
conda create -n relate python=3.8 -y
conda activate relate
conda install pytorch==2.1.0 torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
pip install -r requirements.txt
```
|
|
### Step 2: Download Model Weights
Download the required checkpoints from this repository and place them into your local workspace. Ensure your dataset paths are correctly configured in `configs/machine/default.yaml`.
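As an alternative to manual download, the `huggingface_hub` client can fetch a file programmatically. The repo id and filename in the commented example are placeholders; substitute the values shown in this repository's "Files and versions" tab.

```python
from huggingface_hub import hf_hub_download

def fetch_checkpoint(repo_id: str, filename: str) -> str:
    """Download one checkpoint file from the Hub and return its local cache path."""
    return hf_hub_download(repo_id=repo_id, filename=filename)

# Example (both identifiers are placeholders, not the real values):
# ckpt_path = fetch_checkpoint("iLearn-Lab/RELATE", "webvid-covr.ckpt")
```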
|
|
### Step 3: Run Evaluation
To evaluate a trained model, use `test.py` and specify the target benchmark and your checkpoint path via Hydra overrides:
```bash
python test.py \
    model.ckpt_path=/path/to/your/downloaded_checkpoint.ckpt \
    +test=webvid-covr  # or fashioniq / cirr-all
```
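Before launching a full evaluation, a quick sanity check on the downloaded file can catch a wrong path or a corrupted download. The Lightning-style key layout (`state_dict`, `epoch`, ...) used in the demo is an assumption; verify it against the actual RELATE checkpoint.

```python
import os
import tempfile
import torch

def inspect_checkpoint(path: str) -> list:
    """Load a checkpoint on CPU and return its top-level keys."""
    ckpt = torch.load(path, map_location="cpu")
    return list(ckpt.keys()) if isinstance(ckpt, dict) else []

# Demo with a toy Lightning-style checkpoint (the real layout may differ):
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "toy.ckpt")
    torch.save({"state_dict": {"w": torch.zeros(1)}, "epoch": 0}, path)
    print(inspect_checkpoint(path))  # ['state_dict', 'epoch']
```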
|
|
---
|
|
## ⚠️ Limitations & Notes
|
|
- **Configuration:** The entire framework is managed by **Hydra** and **Lightning Fabric**. Adjust hyperparameter overrides or modify the YAML files in the `configs/` directory to suit your local setup.
- **Environment Dependency:** This project was developed and tested with Python 3.8 and PyTorch 2.1.0.
|
|
---
|
|
## 📝 Citation
|
|
If you find our framework, code, or these weights useful in your research, please consider leaving a **Star** ⭐️ on our GitHub repository and citing our ICASSP 2026 paper:
|
|
```bibtex
@inproceedings{RELATE,
  title={RELATE: Enhance Composed Video Retrieval via Minimal-Redundancy Hierarchical Collaboration},
  author={Zhang, Shiqi and Chen, Zhiwei and Li, Zixu and Fu, Zhiheng and Wang, Wenbo and Nie, Jiajia and Wei, Yinwei and Hu, Yupeng},
  booktitle={Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2026}
}
```
|
|