---
license: cc-by-4.0
datasets:
- HaHaJun1101/OACIRR
base_model:
- Salesforce/blip2-itm-vit-g
- Salesforce/blip2-itm-vit-g-coco
library_name: pytorch
tags:
- composed-image-retrieval
- object-anchored
- image-retrieval
- vision-language
- multimodal
- cvpr2026
---
# **🔍 Beyond Semantic Search: Towards Referential Anchoring in Composed Image Retrieval (CVPR 2026 Highlight)**
[**📖 Paper (arXiv)**](https://arxiv.org/abs/2604.05393) | [**🌐 Homepage**](https://hahajun1101.github.io/OACIR/) | [**🐙 Code (GitHub)**](https://github.com/HaHaJun1101/OACIR) | [**🤗 Dataset (OACIRR)**](https://huggingface.co/datasets/HaHaJun1101/OACIRR) | <a href="#downloading-the-adafocal-weights" style="color: red;">**🛜 Download Weights Now 👇**</a>
---
## 🔔 News
- **🌟 [2026-04-09]: Our paper has been selected as a ✨*Highlight*✨ at CVPR 2026!**
- **🔥 [2026-04-07]: The *AdaFocal* model checkpoints are officially released and ready to use!**
- **🔥 [2026-04-03]: The full training/evaluation code is officially released on GitHub!**
- **🔥 [2026-03-25]: The OACIRR Benchmark is officially released on Hugging Face!**
- **🎉 [2026-02-21]: Our paper "Beyond Semantic Search: Towards Referential Anchoring in Composed Image Retrieval" has been accepted to CVPR 2026!**
---
## 🤖 Model Description
- **Architecture: ViT-G (EVA-CLIP) + BLIP-2 Q-Former + Context-Aware Attention Modulator (CAAM)**
- **Task: Fine-grained Composed Image Retrieval (CIR) with Instance-level Consistency**
- **Training Data: Exclusively trained on the [OACIRR Union Dataset](https://huggingface.co/datasets/HaHaJun1101/OACIRR/tree/main)**
---
## ⚙️ AdaFocal Framework
To address the core challenges of the OACIR task, we propose **AdaFocal**, an effective framework that dynamically modulates visual attention for precise, instance-level retrieval. Our approach augments a multimodal fusion backbone with a lightweight **Context-Aware Attention Modulator (CAAM)**, enabling a nuanced balance between instance fidelity and compositional reasoning.
<p align="left">
<img
src="https://huggingface.co/datasets/HaHaJun1101/OACIRR/resolve/main/figures/AdaFocal_framework.png"
width="100%"
alt="AdaFocal Framework Overview"
/>
</p>
Specifically, **AdaFocal** employs a two-stage reasoning process: *Contextual Perception* and *Adaptive Focus*. It first perceives the query's compositional context to predict a modulation scalar (β). This learned signal then drives an Attention Activation Mechanism, which explicitly and adaptively intensifies the visual focus on the user-specified instance region (provided via bounding box) during multimodal feature fusion.
By dynamically re-weighting the attention distribution, **AdaFocal** seamlessly synthesizes the anchored instance, the global visual scene, and the textual modification into a coherent representation, establishing a robust and flexible baseline for identity-preserving retrieval.
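As a rough illustration of the *Adaptive Focus* step, the sketch below is **not** the released implementation: the module, tensor shapes, and names are hypothetical simplifications. It shows the core idea of predicting a non-negative scalar β from the pooled query context and using it to boost cross-attention logits on visual tokens inside the anchored bounding-box region:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CAAMSketch(nn.Module):
    """Illustrative (hypothetical) Context-Aware Attention Modulator sketch.

    A small MLP predicts a scalar beta from the pooled multimodal query
    context; beta additively boosts the attention logits of visual tokens
    inside the user-provided instance region before the softmax.
    """
    def __init__(self, dim: int = 768):
        super().__init__()
        self.beta_head = nn.Sequential(
            nn.Linear(dim, dim // 4), nn.ReLU(), nn.Linear(dim // 4, 1)
        )

    def forward(self, attn_logits, context, region_mask):
        # attn_logits: (B, Q, V) query-to-visual-token cross-attention logits
        # context:     (B, D)    pooled multimodal query context
        # region_mask: (B, V)    1.0 for tokens inside the anchored instance box
        beta = F.softplus(self.beta_head(context))          # (B, 1), beta >= 0
        boosted = attn_logits + beta.unsqueeze(1) * region_mask.unsqueeze(1)
        return boosted.softmax(dim=-1)                      # re-weighted attention
```

Because β ≥ 0, attention mass on the anchored-instance tokens can only grow relative to the unmodulated softmax, which matches the paper's intuition of intensifying focus on the user-specified region without discarding global context.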
---
## 🚀 How to Use
<a name="downloading-the-adafocal-weights"></a>
### 1. Download the AdaFocal Weights
You can download the checkpoints using Git LFS:
```bash
cd OACIR
git lfs install
git clone https://huggingface.co/HaHaJun1101/AdaFocal ./checkpoints
```
Alternatively, download them via the Hugging Face Python API:
```python
from huggingface_hub import snapshot_download
snapshot_download(repo_id="HaHaJun1101/AdaFocal", local_dir="OACIR/checkpoints", repo_type="model")
```
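Before running evaluation, a quick sanity check that both checkpoint variants landed in the expected directory can save a failed run (the paths below assume the `local_dir` used above):

```python
from pathlib import Path

# Hypothetical sanity check: confirm both released checkpoint files are
# present in OACIR/checkpoints before launching evaluate.sh.
expected = ["adafocal_scalar.pt", "adafocal_vector.pt"]
ckpt_dir = Path("OACIR/checkpoints")
missing = [f for f in expected if not (ckpt_dir / f).exists()]
if missing:
    print(f"Missing checkpoints: {missing}")
else:
    print("All checkpoints present.")
```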
### 2. Run Evaluation via Official Codebase
Once downloaded, you can directly evaluate the models using the `evaluate.sh` script provided in our GitHub codebase. Open `evaluate.sh` and set the path to your downloaded weights:
```bash
# Inside evaluate.sh
DATASET="Fashion"
MODEL_NAME="oacir_adafocal"
MODEL_WEIGHT="./checkpoints/adafocal_scalar.pt" # or adafocal_vector.pt
```
Then execute the script:
```bash
bash evaluate.sh
```
---
## 🏆 Model Performance on OACIRR
We provide two variants of the **AdaFocal** weights. You can instantly reproduce the following results using our provided `evaluate.sh` script.
| Model Variant | Component Type | R<sub>ID</sub>@1 (Avg) | R@1 (Avg) | R@5 (Avg) | Overall Avg | Weights File |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|
| **AdaFocal (Scalar β)** | Default Configuration | 81.52 | 63.08 | 90.98 | **78.53** | [`adafocal_scalar.pt`](https://huggingface.co/HaHaJun1101/AdaFocal/blob/main/adafocal_scalar.pt) |
| **AdaFocal (Vector β)** | Vector Ablation | 81.99 | 63.06 | 91.35 | **78.80** | [`adafocal_vector.pt`](https://huggingface.co/HaHaJun1101/AdaFocal/blob/main/adafocal_vector.pt) |
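The **Overall Avg** column appears to be the unweighted mean of the three averaged recall metrics; a quick check reproduces the table values (a sketch under that assumption, not an official metric definition):

```python
# Hypothetical reconstruction: Overall Avg as the unweighted mean of
# R_ID@1 (Avg), R@1 (Avg), and R@5 (Avg), rounded to two decimals.
def overall_avg(r_id_at_1: float, r_at_1: float, r_at_5: float) -> float:
    return round((r_id_at_1 + r_at_1 + r_at_5) / 3, 2)

print(overall_avg(81.52, 63.08, 90.98))  # Scalar beta -> 78.53
print(overall_avg(81.99, 63.06, 91.35))  # Vector beta -> 78.8
```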
*Detailed breakdowns across the 4 domains:*
| Variant | <font color=#990000>Fashion</font> (R<sub>ID</sub>@1 / R@1) | <font color=#CC3300>Car</font> (R<sub>ID</sub>@1 / R@1) | <font color=#003399>Product</font> (R<sub>ID</sub>@1 / R@1) | <font color=#006633>Landmark</font> (R<sub>ID</sub>@1 / R@1) |
|:---|:---:|:---:|:---:|:---:|
| **Scalar β** | 73.68 / 64.45 | 78.39 / 54.85 | 91.36 / 73.85 | 82.65 / 59.18 |
| **Vector β** | 75.71 / 65.97 | 77.97 / 54.35 | 91.39 / 73.30 | 82.90 / 58.63 |
---
## ✒️ Citation
If you find our dataset, models, or codebase useful in your research, please consider citing our paper:
```bibtex
@article{yang2026beyond,
title={Beyond Semantic Search: Towards Referential Anchoring in Composed Image Retrieval},
author={Yang, Yuxin and Zhou, Yinan and Chen, Yuxin and Zhang, Ziqi and Ma, Zongyang and Yuan, Chunfeng and Li, Bing and Gao, Jun and Hu, Weiming},
journal={arXiv preprint arXiv:2604.05393},
year={2026}
}
```