---
license: cc-by-4.0
datasets:
- HaHaJun1101/OACIRR
base_model:
- Salesforce/blip2-itm-vit-g
- Salesforce/blip2-itm-vit-g-coco
library_name: pytorch
tags:
- composed-image-retrieval
- object-anchored
- image-retrieval
- vision-language
- multimodal
- cvpr2026
---

# **🔍 Beyond Semantic Search: Towards Referential Anchoring in Composed Image Retrieval (CVPR 2026 Highlight)**


[**📖 Paper (arXiv)**](https://arxiv.org/abs/2604.05393) | [**🌐 Homepage**](https://hahajun1101.github.io/OACIR/) | [**🐙 Code (GitHub)**](https://github.com/HaHaJun1101/OACIR) | [**🤗 Dataset (OACIRR)**](https://huggingface.co/datasets/HaHaJun1101/OACIRR) | <a href="#downloading-the-adafocal-weights" style="color: red;">**🛜 Download Weights Now 👇**</a>


---


## 🔔 News
- **🌟 [2026-04-09]: Our paper has been selected as a ✨*Highlight*✨ at CVPR 2026!**
- **🔥 [2026-04-07]: The *AdaFocal* model checkpoints are officially released and now available for use!**
- **🔥 [2026-04-03]: The full training/evaluation code is officially released on GitHub!**
- **🔥 [2026-03-25]: The OACIRR benchmark is officially released on Hugging Face!**
- **🎉 [2026-02-21]: Our paper "Beyond Semantic Search: Towards Referential Anchoring in Composed Image Retrieval" has been accepted to CVPR 2026!**


---


## 🤖 Model Description


- **Architecture: ViT-G (EVA-CLIP) + BLIP-2 Q-Former + Context-Aware Attention Modulator (CAAM)**
- **Task: Fine-grained Composed Image Retrieval (CIR) with Instance-level Consistency**
- **Training Data: Exclusively trained on the [OACIRR Union Dataset](https://huggingface.co/datasets/HaHaJun1101/OACIRR/tree/main)**


---


## ⚙️ AdaFocal Framework


To address the core challenges of the OACIR task, we propose **AdaFocal**, an effective framework that dynamically modulates visual attention for precise, instance-level retrieval. Our approach augments a multimodal fusion backbone with a lightweight **Context-Aware Attention Modulator (CAAM)**, enabling a nuanced balance between instance fidelity and compositional reasoning.


<p align="left">
  <img
    src="https://huggingface.co/datasets/HaHaJun1101/OACIRR/resolve/main/figures/AdaFocal_framework.png"
    width="100%"
    alt="AdaFocal Framework Overview"
  />
</p>


Specifically, **AdaFocal** employs a two-stage reasoning process: *Contextual Perception* and *Adaptive Focus*. It first perceives the query's compositional context to predict a modulation scalar (β). This learned signal then drives an Attention Activation Mechanism, which explicitly and adaptively intensifies the visual focus on the user-specified instance region (provided via bounding box) during multimodal feature fusion.
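For intuition, the bounding-box anchor can be rasterized onto the ViT patch grid as a boolean mask before fusion. The sketch below is illustrative only: the function name, the 224-pixel input resolution, and the 14-pixel patch size are our assumptions, not the repository's exact implementation.

```python
import numpy as np

def bbox_to_patch_mask(bbox, image_size=224, patch_size=14):
    """Mark every ViT patch that the pixel-space box (x0, y0, x1, y1) overlaps."""
    grid = image_size // patch_size                  # e.g. a 16 x 16 patch grid
    x0, y0, x1, y1 = bbox
    starts = np.arange(grid) * patch_size            # left/top edge of each patch
    col_hit = (starts + patch_size > x0) & (starts < x1)
    row_hit = (starts + patch_size > y0) & (starts < y1)
    return row_hit[:, None] & col_hit[None, :]       # shape: (grid, grid)

# A box covering the top-left 28x28 pixels touches a 2x2 block of patches.
mask = bbox_to_patch_mask((0, 0, 28, 28))
```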


By dynamically re-weighting the attention distribution, **AdaFocal** seamlessly synthesizes the anchored instance, the global visual scene, and the textual modification into a coherent representation, establishing a robust and flexible baseline for identity-preserving retrieval.
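The re-weighting described above can be sketched as adding the predicted scalar β to the attention logits of patches inside the anchored region before the softmax. This is a minimal, framework-agnostic illustration; the function name and the additive form are our simplification, not the exact CAAM mechanism.

```python
import numpy as np

def modulate_attention(attn_logits, instance_mask, beta):
    """Boost attention logits on anchored patches by beta, then normalize.

    attn_logits:   (num_queries, num_patches) raw attention scores
    instance_mask: (num_patches,) boolean mask of the anchored region
    beta:          scalar predicted from the query's compositional context
    """
    boosted = attn_logits + beta * instance_mask.astype(float)
    e = np.exp(boosted - boosted.max(axis=-1, keepdims=True))  # stable softmax
    return e / e.sum(axis=-1, keepdims=True)

# With uniform logits, beta > 0 shifts attention mass onto the anchored patches.
logits = np.zeros((2, 4))
mask = np.array([False, True, True, False])
attn = modulate_attention(logits, mask, beta=2.0)
```

With β = 0 the modulation is a no-op (plain softmax), so the mechanism degrades gracefully to unanchored retrieval when no instance emphasis is needed.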


---


## 🚀 How to Use


<a name="downloading-the-adafocal-weights"></a>


### 1. Download the AdaFocal Weights


You can download the checkpoints using Git LFS:
```bash
cd OACIR
git lfs install
git clone https://huggingface.co/HaHaJun1101/AdaFocal ./checkpoints
```


Alternatively, download them via the Hugging Face Python API:
```python
from huggingface_hub import snapshot_download

snapshot_download(repo_id="HaHaJun1101/AdaFocal", local_dir="OACIR/checkpoints", repo_type="model")
```


### 2. Run Evaluation via Official Codebase


Once downloaded, you can directly evaluate the models using the `evaluate.sh` script provided in our GitHub codebase. Open `evaluate.sh` and set the path to your downloaded weights:
```bash
# Inside evaluate.sh
DATASET="Fashion"
MODEL_NAME="oacir_adafocal"
MODEL_WEIGHT="./checkpoints/adafocal_scalar.pt"  # or adafocal_vector.pt
```
Then execute the script:
```bash
bash evaluate.sh
```


---


## 🏆 Model Performance on OACIRR


We provide two variants of the **AdaFocal** weights. You can reproduce the following results using our provided `evaluate.sh` script.


| Model Variant | Component Type | R<sub>ID</sub>@1 (Avg) | R@1 (Avg) | R@5 (Avg) | Overall Avg | Weights File |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|
| **AdaFocal (Scalar β)** | Default Configuration | 81.52 | 63.08 | 90.98 | **78.53** | [`adafocal_scalar.pt`](https://huggingface.co/HaHaJun1101/AdaFocal/blob/main/adafocal_scalar.pt) |
| **AdaFocal (Vector β)** | Vector Ablation | 81.99 | 63.06 | 91.35 | **78.80** | [`adafocal_vector.pt`](https://huggingface.co/HaHaJun1101/AdaFocal/blob/main/adafocal_vector.pt) |
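From the numbers above, the Overall Avg column appears to be the unweighted mean of the three recall columns; this is our inference from the table, not a documented formula:

```python
# Unweighted mean of R_ID@1, R@1, and R@5 (observed relationship from the
# table above, not a formula stated by the authors).
scalar_avg = (81.52 + 63.08 + 90.98) / 3   # matches the reported 78.53
vector_avg = (81.99 + 63.06 + 91.35) / 3   # matches the reported 78.80
```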


*Detailed breakdowns across the 4 domains:*


| Variant | <font color=#990000>Fashion</font> (R<sub>ID</sub>@1 / R@1) | <font color=#CC3300>Car</font> (R<sub>ID</sub>@1 / R@1) | <font color=#003399>Product</font> (R<sub>ID</sub>@1 / R@1) | <font color=#006633>Landmark</font> (R<sub>ID</sub>@1 / R@1) |
|:---|:---:|:---:|:---:|:---:|
| **Scalar β** | 73.68 / 64.45 | 78.39 / 54.85 | 91.36 / 73.85 | 82.65 / 59.18 |
| **Vector β** | 75.71 / 65.97 | 77.97 / 54.35 | 91.39 / 73.30 | 82.90 / 58.63 |
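The R@K numbers above follow the standard recall-at-K definition: the fraction of queries whose ground-truth target ranks within the top K retrieved candidates (R<sub>ID</sub>@K additionally requires the retrieved image to preserve the anchored instance's identity). A generic sketch of the metric, not the repository's evaluation code:

```python
import numpy as np

def recall_at_k(sim, targets, k):
    """Fraction of queries whose target gallery index appears in the top-k.

    sim:     (num_queries, num_gallery) similarity matrix
    targets: (num_queries,) ground-truth gallery indices
    """
    topk = np.argsort(-sim, axis=1)[:, :k]           # indices of the k best matches
    hits = (topk == targets[:, None]).any(axis=1)    # per-query hit/miss
    return hits.mean()

# Toy gallery of 3 items: queries 0 and 1 rank their target first, query 2 does not.
sim = np.array([[0.9, 0.1, 0.3],
                [0.2, 0.8, 0.5],
                [0.4, 0.6, 0.1]])
targets = np.array([0, 1, 2])
r1 = recall_at_k(sim, targets, k=1)   # 2 of 3 queries hit -> 2/3
```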


---


## ✒️ Citation


If you find our dataset, models, or codebase useful in your research, please consider citing our paper:
```bibtex
@article{yang2026beyond,
  title={Beyond Semantic Search: Towards Referential Anchoring in Composed Image Retrieval},
  author={Yang, Yuxin and Zhou, Yinan and Chen, Yuxin and Zhang, Ziqi and Ma, Zongyang and Yuan, Chunfeng and Li, Bing and Gao, Jun and Hu, Weiming},
  journal={arXiv preprint arXiv:2604.05393},
  year={2026}
}
```