---
license: apache-2.0
task_categories:
- video-retrieval
- image-retrieval
tags:
- composed-video-retrieval
- composed-image-retrieval
- multimodal-retrieval
- vision-language
- pytorch
- acm-mm-2025
---

<a id="top"></a>
<div align="center">
<h1>📹 (ACM MM 2025) HUD: Hierarchical Uncertainty-Aware Disambiguation Network for Composed Video Retrieval (Model Weights)</h1>
<div align="center">
<a target="_blank" href="https://zivchen-ty.github.io/">Zhiwei Chen</a><sup>1</sup>,
<a target="_blank" href="https://faculty.sdu.edu.cn/huyupeng1/zh_CN/index.htm">Yupeng Hu</a><sup>1✉</sup>,
<a target="_blank" href="https://lee-zixu.github.io/">Zixu Li</a><sup>1</sup>,
<a target="_blank" href="https://zhihfu.github.io/">Zhiheng Fu</a><sup>1</sup>,
<a target="_blank" href="https://haokunwen.github.io">Haokun Wen</a><sup>2</sup>,
<a target="_blank" href="https://homepage.hit.edu.cn/guanweili">Weili Guan</a><sup>2</sup>
</div>
<sup>1</sup>School of Software, Shandong University
<br />
<sup>2</sup>School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen)
<br />
<sup>✉</sup> Corresponding author
<br/>
<p>
<a href="https://acmmm2025.org/"><img src="https://img.shields.io/badge/ACM_MM-2025-blue.svg?style=flat-square" alt="ACM MM 2025"></a>
<a href="https://doi.org/10.1145/3746027.3755445"><img alt='Paper' src="https://img.shields.io/badge/Paper-dl.acm-green.svg?style=flat-square"></a>
<a href="https://zivchen-ty.github.io/HUD.github.io/"><img alt='Project Page' src="https://img.shields.io/badge/Website-orange?style=flat-square"></a>
<a href="https://github.com/ZivChen-Ty/HUD"><img alt='GitHub' src="https://img.shields.io/badge/GitHub-Repository-black?style=flat-square&logo=github"></a>
</p>
</div>

This repository hosts the official pre-trained model weights for **HUD**, a novel framework that tackles both Composed Video Retrieval (CVR) and Composed Image Retrieval (CIR) by explicitly leveraging the disparity in information density between modalities.

---

## 📌 Model Information

### 1. Model Name
**HUD** (Hierarchical Uncertainty-Aware Disambiguation Network) checkpoints.

### 2. Task Type & Applicable Tasks
- **Task Type:** Composed Video Retrieval (CVR) and Composed Image Retrieval (CIR).
- **Applicable Tasks:** Retrieving a target video or image based on a reference visual input and a text modifier (e.g., a reference clip of a dog running plus the modifier "the same scene, but on a beach"). HUD is designed to resolve ambiguity over which subject a modification refers to and to sharpen focus on fine-grained semantic details.

### 3. Project Introduction
**HUD** is the first framework that explicitly leverages the disparity in information density between video and text. It achieves state-of-the-art (SOTA) performance through three key modules:
- 🎯 **Holistic Pronoun Disambiguation:** Exploits overlapping semantics through holistic cross-modal interaction to indirectly disambiguate pronoun referents.
- 🔍 **Atomistic Uncertainty Modeling:** Discerns key detail semantics via uncertainty modeling at the atomistic level, enhancing focus on fine-grained visual details.
- ⚖️ **Holistic-to-Atomistic Alignment:** Adaptively aligns the composed query representation with the target media by incorporating a learnable similarity bias (see the sketch below).
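
To make the learnable-similarity-bias idea concrete, here is a minimal, self-contained PyTorch sketch. It is **not** the authors' implementation: the class name, the scalar bias, and the embedding shapes are illustrative assumptions only; the actual HUD module operates on richer holistic and atomistic representations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiasedSimilarity(nn.Module):
    """Cosine similarity between composed-query and target embeddings,
    shifted by a learnable scalar bias. Conceptual sketch only."""

    def __init__(self) -> None:
        super().__init__()
        # Learnable similarity bias, initialized to zero.
        self.bias = nn.Parameter(torch.zeros(1))

    def forward(self, query: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
        # query: (B, D) composed query embeddings; targets: (N, D) gallery embeddings.
        q = F.normalize(query, dim=-1)
        t = F.normalize(targets, dim=-1)
        return q @ t.T + self.bias  # (B, N) biased similarity scores

# Toy usage with random embeddings.
sim = BiasedSimilarity()
scores = sim(torch.randn(4, 512), torch.randn(100, 512))
ranking = scores.argsort(dim=-1, descending=True)  # per-query retrieval ranking
```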

### 4. Training Data Source & Hosted Weights
The HUD framework supports both video and image retrieval benchmarks. This repository provides pre-trained checkpoints evaluated on the following datasets:
* **CVR:** WebVid-CoVR dataset.
* **CIR:** FashionIQ and CIRR datasets.

*(Note: download the respective `.ckpt` files hosted in the "Files and versions" tab of this repository.)*

---

## 🚀 Usage & Basic Inference

These weights are designed to be evaluated with the modular, Hydra-configured [HUD GitHub repository](https://github.com/ZivChen-Ty/HUD).

### Step 1: Prepare the Environment
We recommend using Anaconda. Clone the repository and install the dependencies:
```bash
git clone https://github.com/iLearn-Lab/MM25-HUD
cd MM25-HUD
conda create -n hud python=3.8.10 -y
conda activate hud
conda install pytorch==2.1.0 torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
pip install -r requirements.txt
```

### Step 2: Download Model Weights
Download the checkpoints from this Hugging Face repository and place them in your local working directory. Ensure your dataset paths are correctly configured in `configs/machine/default.yaml`.
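
If you prefer to script the download, the `huggingface_hub` client can fetch a checkpoint directly. A minimal sketch, with placeholder repo ID and filename (check the "Files and versions" tab for the actual values):

```python
# pip install huggingface_hub
import torch
from huggingface_hub import hf_hub_download

# repo_id and filename are placeholders; substitute the actual values shown
# in this repository's "Files and versions" tab.
ckpt_path = hf_hub_download(
    repo_id="<this-repo-id>",        # hypothetical placeholder
    filename="webvid-covr.ckpt",     # hypothetical checkpoint name
)

# Optional sanity check: load on CPU and inspect the top-level keys.
state = torch.load(ckpt_path, map_location="cpu")
print(list(state.keys())[:10])
```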

### Step 3: Run Evaluation
To evaluate a trained model, use `test.py` and specify the target benchmark and checkpoint path via Hydra overrides:
```bash
python3 test.py \
  model.ckpt_path=/path/to/your/downloaded_checkpoint.ckpt \
  +test=webvid-covr  # or fashioniq / cirr-all
```

---

## ⚠️ Limitations & Notes

- **Configuration:** HUD is managed entirely by **Hydra** and **Lightning Fabric**. Override configurations via the CLI or modify the YAML files in the `configs/` directory as needed (see the example below).
- **Hardware & Environment:** The project was developed and tested with Python 3.8.10, PyTorch 2.1.0, and an NVIDIA A40 48G GPU. Using a significantly different environment may impact reproducibility.
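
A sketch of common Hydra override patterns with `test.py`; note that `machine.data_dir` below is a hypothetical key used purely for illustration, so consult the YAML files under `configs/` for the real key names:

```bash
# key=value overrides an existing config entry; +key=value adds a new one.
# "machine.data_dir" is a hypothetical key shown only to illustrate overriding
# a path; use the actual keys defined under configs/.
python3 test.py \
  model.ckpt_path=/path/to/your/downloaded_checkpoint.ckpt \
  machine.data_dir=/path/to/datasets \
  +test=fashioniq
```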

---

## 📝⭐️ Citation

If you find our framework, code, or these weights useful in your research, please consider leaving a **Star** ⭐️ on our GitHub repository and citing our ACM MM 2025 paper:

```bibtex
@inproceedings{HUD,
  title     = {HUD: Hierarchical Uncertainty-Aware Disambiguation Network for Composed Video Retrieval},
  author    = {Chen, Zhiwei and Hu, Yupeng and Li, Zixu and Fu, Zhiheng and Wen, Haokun and Guan, Weili},
  booktitle = {Proceedings of the ACM International Conference on Multimedia},
  pages     = {6143--6152},
  year      = {2025}
}
```