---
license: apache-2.0
task_categories:
- video-retrieval
- image-retrieval
tags:
- composed-video-retrieval
- composed-image-retrieval
- vision-language
- pytorch
- icassp-2026
---

# 🎬 (ICASSP 2026) RELATE: Enhance Composed Video Retrieval via Minimal-Redundancy Hierarchical Collaboration (Model Weights)

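Shiqi Zhang<sup>1</sup>, Zhiwei Chen<sup>1</sup>, Zixu Li<sup>1</sup>, Zhiheng Fu<sup>1</sup>, Wenbo Wang<sup>1</sup>, Jiajia Nie<sup>1</sup>, Yinwei Wei<sup>1</sup>, Yupeng Hu<sup>1✉</sup>  
<sup>1</sup>School of Software, Shandong University  
<sup>✉</sup>Corresponding author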


This repository hosts the official pre-trained model weights for **RELATE**, a minimal-redundancy hierarchical collaborative network for Composed Video Retrieval (CVR) and Composed Image Retrieval (CIR).

---

## 📌 Model Information

### 1. Model Name
**RELATE** (Enhance Composed Video Retrieval via Minimal-Redundancy Hierarchical Collaboration) checkpoints.

### 2. Task Type & Applicable Tasks
- **Task Type:** Composed Video Retrieval (CVR) and Composed Image Retrieval (CIR).
- **Applicable Tasks:** Retrieving target videos or images from a reference visual input and a modification text. The model addresses two issues: the neglect of the internal hierarchical structure of modification texts, and the insufficient suppression of temporal redundancy in video.

### 3. Project Introduction
**RELATE** is an open-source PyTorch framework built on top of BLIP-2. It achieves state-of-the-art (SOTA) performance across major CVR and CIR benchmarks through three key innovations:
- 🧩 **Hierarchical Query Generation:** Parses the internal hierarchical structure of the modification text to understand the roles of its parts of speech, using noun phrases for object-level features and the complete text for global semantics.
- ✂️ **Temporal Sparsification:** Adaptively attenuates redundant tokens that correspond to static backgrounds while amplifying tokens that carry critical dynamic information.
- 🎯 **Modification-Driven Modulation Learning:** Uses the global semantics of the modification text to perform attention-based filtering on the sparsified visual features.

### 4. Training Data Source & Hosted Weights
The RELATE framework is evaluated on standard video and image retrieval benchmarks. This repository provides pre-trained weights for the following datasets:
* **CVR:** WebVid-CoVR dataset.
* **CIR:** FashionIQ and CIRR datasets.

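
The hosted weights can also be fetched programmatically rather than through the web interface. The following is a minimal sketch, assuming the `huggingface_hub` package is installed. The helper names `select_checkpoints` and `download_checkpoints`, the `checkpoints` target directory, and the `repo_id` argument are illustrative assumptions; `repo_id` stands for this repository's `<owner>/<name>` identifier, which must be filled in by the reader.

```python
from pathlib import Path

# File extensions this model card names for the hosted weights.
CHECKPOINT_SUFFIXES = (".ckpt", ".pt")


def select_checkpoints(filenames):
    """Keep only checkpoint files, sorted for a stable download order."""
    return sorted(name for name in filenames if name.endswith(CHECKPOINT_SUFFIXES))


def download_checkpoints(repo_id, local_dir="checkpoints"):
    """Download every checkpoint file found in `repo_id` into `local_dir`.

    `repo_id` is a placeholder (assumption): pass this repository's
    "<owner>/<name>" identifier on the Hugging Face Hub.
    """
    # Imported lazily so select_checkpoints works without huggingface_hub.
    from huggingface_hub import hf_hub_download, list_repo_files

    paths = []
    for name in select_checkpoints(list_repo_files(repo_id)):
        local_path = hf_hub_download(repo_id=repo_id, filename=name, local_dir=local_dir)
        paths.append(Path(local_path))
    return paths
```

The returned paths can then be passed to the `model.ckpt_path` override used for evaluation.
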

*(Note: Please download the respective `.ckpt` or `.pt` files from the "Files and versions" tab of this Hugging Face repository.)*

---

## 🚀 Usage & Basic Inference

These weights are designed to be evaluated using the official Hydra-configured [RELATE GitHub repository](https://github.com/iLearn-Lab/ICASSP26-RELATE).

### Step 1: Prepare the Environment
We recommend using Anaconda to manage your environment. Clone the repository and install the required dependencies:

```bash
git clone https://github.com/iLearn-Lab/ICASSP26-RELATE
# git names the clone directory after the repository
cd ICASSP26-RELATE
conda create -n relate python=3.8 -y
conda activate relate
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
pip install -r requirements.txt
```

### Step 2: Download Model Weights
Download the required checkpoints from this repository and place them in your local workspace. Ensure your dataset paths are correctly configured in `configs/machine/default.yaml`.

### Step 3: Run Evaluation
To evaluate a trained model, run `test.py` and specify the target benchmark and your checkpoint path via Hydra overrides:

```bash
python test.py \
    model.ckpt_path=/path/to/your/downloaded_checkpoint.ckpt \
    +test=webvid-covr  # or fashioniq / cirr-all
```

---

## ⚠️ Limitations & Notes

- **Configuration:** The entire framework is managed by **Hydra** and **Lightning Fabric**. Adjust the hyperparameter overrides or modify the YAML files in the `configs/` directory to suit your local setup.
- **Environment Dependency:** This project was developed and tested with Python 3.8 and PyTorch 2.1.0.
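
To evaluate the checkpoints for all three benchmarks in sequence, the Step 3 command can be wrapped in a small script. The following is a minimal sketch and not part of the official repository: `build_eval_command`, `run_all`, and the `checkpoints/<benchmark>.ckpt` file layout are illustrative assumptions, while the `model.ckpt_path` and `+test` overrides and the three benchmark keys are taken from Step 3. It defaults to a dry run that only prints the commands.

```python
import shlex
import subprocess
from pathlib import Path

# Hydra test keys listed in the Step 3 command of this model card.
BENCHMARKS = ("webvid-covr", "fashioniq", "cirr-all")


def build_eval_command(ckpt_path, benchmark):
    """Return the argv list that evaluates one checkpoint on one benchmark."""
    if benchmark not in BENCHMARKS:
        raise ValueError(f"unknown benchmark: {benchmark!r}")
    return [
        "python",
        "test.py",
        f"model.ckpt_path={ckpt_path}",
        f"+test={benchmark}",
    ]


def run_all(ckpt_dir="checkpoints", dry_run=True):
    """Print, and optionally execute, one evaluation per benchmark.

    Assumes a hypothetical layout of <ckpt_dir>/<benchmark>.ckpt; rename to
    match the checkpoint files you actually downloaded.
    """
    commands = []
    for benchmark in BENCHMARKS:
        ckpt_path = (Path(ckpt_dir) / f"{benchmark}.ckpt").as_posix()
        cmd = build_eval_command(ckpt_path, benchmark)
        print(shlex.join(cmd))
        if not dry_run:
            # Must be run from the root of the cloned RELATE repository.
            subprocess.run(cmd, check=True)
        commands.append(cmd)
    return commands


if __name__ == "__main__":
    run_all(dry_run=True)
```

Passing `dry_run=False` executes each command and stops at the first failure, since `subprocess.run` is called with `check=True`.
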

---

## 📝⭐️ Citation

If you find our framework, code, or these weights useful in your research, please consider leaving a **Star** ⭐️ on our GitHub repository and citing our ICASSP 2026 paper:

```bibtex
@inproceedings{RELATE,
  title={RELATE: Enhance Composed Video Retrieval via Minimal-Redundancy Hierarchical Collaboration},
  author={Zhang, Shiqi and Chen, Zhiwei and Li, Zixu and Fu, Zhiheng and Wang, Wenbo and Nie, Jiajia and Wei, Yinwei and Hu, Yupeng},
  booktitle={Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing},
  year={2026}
}
```