---
license: apache-2.0
task_categories:
- video-retrieval
- image-retrieval
tags:
- composed-video-retrieval
- composed-image-retrieval
- multimodal-retrieval
- vision-language
- pytorch
- acm-mm-2025
---

# 📹 (ACM MM 2025) HUD: Hierarchical Uncertainty-Aware Disambiguation Network for Composed Video Retrieval (Model Weights)

Zhiwei Chen¹, Yupeng Hu¹✉, Zixu Li¹, Zhiheng Fu¹, Haokun Wen², Weili Guan²

¹School of Software, Shandong University
²School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen)
✉ Corresponding author

This repository hosts the official pre-trained model weights for **HUD**, a novel framework tackling both Composed Video Retrieval (CVR) and Composed Image Retrieval (CIR) tasks by explicitly leveraging the disparity in information density between modalities.

---

## 📌 Model Information

### 1. Model Name

**HUD** (Hierarchical Uncertainty-Aware Disambiguation Network) Checkpoints.

### 2. Task Type & Applicable Tasks

- **Task Type:** Composed Video Retrieval (CVR) and Composed Image Retrieval (CIR).
- **Applicable Tasks:** Retrieving a target video or image based on a reference visual input and a text modifier. HUD targets two problems in this setting: ambiguity about which subject a modification refers to, and limited focus on detailed semantics.

### 3. Project Introduction

**HUD** is the first framework that explicitly leverages the disparity in information density between video and text. It achieves state-of-the-art (SOTA) performance through three key modules:

- 🎯 **Holistic Pronoun Disambiguation:** Exploits overlapping semantics through holistic cross-modal interaction to indirectly disambiguate pronoun referents.
- 🔍 **Atomistic Uncertainty Modeling:** Discerns key detail semantics via uncertainty modeling at the atomistic level, enhancing focus on fine-grained visual details.
- ⚖️ **Holistic-to-Atomistic Alignment:** Adaptively aligns the composed query representation with the target media by incorporating a learnable similarity bias.

### 4. Training Data Source & Hosted Weights

The HUD framework supports both video and image retrieval benchmarks. This repository provides pre-trained checkpoints evaluated on the following datasets:

* **CVR:** WebVid-CoVR dataset.
* **CIR:** FashionIQ and CIRR datasets.

*(Note: Download the respective `.ckpt` files hosted in the "Files and versions" tab of this repository.)*

---

## 🚀 Usage & Basic Inference

These weights are designed to be evaluated using the highly modular, Hydra-configured [HUD GitHub repository](https://github.com/ZivChen-Ty/HUD).
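As an alternative to downloading from the "Files and versions" tab by hand, the checkpoints can be fetched programmatically. The following is a minimal sketch, assuming the `huggingface_hub` package is installed. The helper names `select_checkpoints` and `download_checkpoints` are illustrative and not part of the HUD codebase, and the repository ID must be supplied by the reader, since it is not stated in this card.

```python
from pathlib import Path


def select_checkpoints(filenames):
    """Return only the .ckpt entries from a list of repository file names."""
    return sorted(name for name in filenames if name.endswith(".ckpt"))


def download_checkpoints(repo_id, local_dir="checkpoints"):
    """Download every .ckpt file found in a Hugging Face model repository.

    `repo_id` is an assumption supplied by the caller (for example
    "<namespace>/<repo-name>"); this card does not state it.
    """
    # Imported lazily so select_checkpoints works without the dependency.
    from huggingface_hub import hf_hub_download, list_repo_files

    Path(local_dir).mkdir(parents=True, exist_ok=True)
    paths = []
    for name in select_checkpoints(list_repo_files(repo_id)):
        # hf_hub_download returns the local path of the downloaded file.
        paths.append(
            hf_hub_download(repo_id=repo_id, filename=name, local_dir=local_dir)
        )
    return paths
```

Calling `download_checkpoints` with this repository's ID returns a list of local paths, any of which can then be passed as `model.ckpt_path` during evaluation.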
### Step 1: Prepare the Environment

We recommend using Anaconda. Clone the repository and install dependencies:

```bash
git clone https://github.com/iLearn-Lab/MM25-HUD
cd MM25-HUD
conda create -n hud python=3.8.10 -y
conda activate hud
conda install pytorch==2.1.0 torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
pip install -r requirements.txt
```

### Step 2: Download Model Weights

Download the specific checkpoints from this Hugging Face repository and place them into your local directory. Ensure your dataset paths are correctly configured in `configs/machine/default.yaml`.

### Step 3: Run Evaluation

To evaluate a trained model, use `test.py` and specify the target benchmark and checkpoint path via Hydra overrides:

```bash
python3 test.py \
    model.ckpt_path=/path/to/your/downloaded_checkpoint.ckpt \
    +test=webvid-covr # or fashioniq / cirr-all
```

---

## ⚠️ Limitations & Notes

- **Configuration:** HUD is entirely managed by **Hydra** and **Lightning Fabric**. Make sure to override configurations via the CLI or modify the YAML files in the `configs/` directory as needed.
- **Hardware & Environment:** The project was specifically developed and tested on Python 3.8.10, PyTorch 2.1.0, and an NVIDIA A40 48G GPU. Using significantly different environment settings may impact reproducibility.

---

## 📝⭐️ Citation

If you find our framework, code, or these weights useful in your research, please consider leaving a **Star** ⭐️ on our GitHub repository and citing our ACM MM 2025 paper:

```bibtex
@inproceedings{HUD,
  title = {HUD: Hierarchical Uncertainty-Aware Disambiguation Network for Composed Video Retrieval},
  author = {Chen, Zhiwei and Hu, Yupeng and Li, Zixu and Fu, Zhiheng and Wen, Haokun and Guan, Weili},
  booktitle = {Proceedings of the ACM International Conference on Multimedia},
  pages = {6143--6152},
  year = {2025}
}
```