---
license: cc-by-4.0
datasets:
- HaHaJun1101/OACIRR
base_model:
- Salesforce/blip2-itm-vit-g
- Salesforce/blip2-itm-vit-g-coco
library_name: pytorch
tags:
- composed-image-retrieval
- object-anchored
- image-retrieval
- vision-language
- multimodal
- cvpr2026
---

# **🔍 Beyond Semantic Search: Towards Referential Anchoring in Composed Image Retrieval (CVPR 2026)**

[**📖 Paper (arXiv)**](https://arxiv.org) | [**🌐 Homepage**](https://hahajun1101.github.io/OACIR/) | [**🐙 Code (GitHub)**](https://github.com/HaHaJun1101/OACIR) | [**🤗 Dataset (OACIRR)**](https://huggingface.co/datasets/HaHaJun1101/OACIRR) | <a href="#downloading-the-adafocal-weights" style="color: red;">**🛜 Download Weights Now 👇**</a>

---

## 🔔 News
- **🔥 [2026-04-07]: The *AdaFocal* model checkpoints are officially released and now available for use!**
- **🔥 [2026-04-03]: The full training/evaluation code is officially released on GitHub!**
- **🔥 [2026-03-25]: The OACIRR Benchmark is officially released on Hugging Face!**
- **🎉 [2026-02-21]: Our paper "Beyond Semantic Search: Towards Referential Anchoring in Composed Image Retrieval" has been accepted to CVPR 2026!**

---

## 🤖 Model Description

- **Architecture:** ViT-G (EVA-CLIP) + BLIP-2 Q-Former + Context-Aware Attention Modulator (CAAM)
- **Task:** Fine-grained Composed Image Retrieval (CIR) with instance-level consistency
- **Training Data:** Trained exclusively on the [OACIRR Union Dataset](https://huggingface.co/datasets/HaHaJun1101/OACIRR)

---

## ⚙️ AdaFocal Framework

To address the core challenges of the OACIR task, we propose **AdaFocal**, an effective framework that dynamically modulates visual attention for precise, instance-level retrieval. Our approach augments a multimodal fusion backbone with a lightweight **Context-Aware Attention Modulator (CAAM)**, enabling a nuanced balance between instance fidelity and compositional reasoning.

<p align="left">
  <img
    src="https://huggingface.co/datasets/HaHaJun1101/OACIRR/resolve/main/figures/AdaFocal_framework.png"
    width="90%"
    alt="AdaFocal Framework Overview"
  />
</p>

Specifically, **AdaFocal** employs a two-stage reasoning process: *Contextual Perception* and *Adaptive Focus*. It first perceives the query's compositional context to predict a modulation scalar (β). This learned signal then drives an Attention Activation Mechanism, which explicitly and adaptively intensifies the visual focus on the user-specified instance region (provided via a bounding box) during multimodal feature fusion.

By dynamically re-weighting the attention distribution, **AdaFocal** seamlessly synthesizes the anchored instance, the global visual scene, and the textual modification into a coherent representation, establishing a robust and flexible baseline for identity-preserving retrieval.
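
The two-stage idea can be sketched in a few lines of PyTorch. The snippet below is illustrative only: the class name `CAAMSketch`, the Softplus β head, and the additive logit boost are simplifications we use for exposition, not the released CAAM implementation.

```python
import torch
import torch.nn as nn

class CAAMSketch(nn.Module):
    """Illustrative sketch (not the released implementation) of
    context-aware attention modulation: predict a non-negative scalar
    beta from the pooled query context, then boost attention logits on
    image tokens inside the user-specified instance region."""

    def __init__(self, dim: int):
        super().__init__()
        # Contextual Perception: pooled query context -> scalar beta >= 0
        self.beta_head = nn.Sequential(nn.Linear(dim, 1), nn.Softplus())

    def forward(self, attn_logits, query_ctx, instance_mask):
        # attn_logits:   (B, heads, Q, N) cross-attention logits over N image tokens
        # query_ctx:     (B, dim) pooled multimodal query representation
        # instance_mask: (B, N) 1.0 for tokens inside the bounding box, else 0.0
        beta = self.beta_head(query_ctx)  # (B, 1)
        # Adaptive Focus: intensify logits on the anchored region, then
        # renormalize so each attention row still sums to one
        boost = beta[:, None, None, :] * instance_mask[:, None, None, :]
        return torch.softmax(attn_logits + boost, dim=-1)
```

Because the boost is applied before the softmax, attention mass shifts toward the anchored instance by an amount the model learns from the query context, rather than by a fixed hand-tuned weight.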

---

## 🚀 How to Use

<a name="downloading-the-adafocal-weights"></a>

### 1. Download the AdaFocal Weights

You can download the checkpoints using Git LFS:
```bash
cd OACIR
git lfs install
git clone https://huggingface.co/HaHaJun1101/AdaFocal ./checkpoints
```

Alternatively, download them via the Hugging Face Python API:
```python
from huggingface_hub import snapshot_download

snapshot_download(repo_id="HaHaJun1101/AdaFocal", local_dir="OACIR/checkpoints", repo_type="model")
```

### 2. Run Evaluation via Official Codebase

Once downloaded, you can directly evaluate the models using the `evaluate.sh` script provided in our GitHub codebase. Open `evaluate.sh` and set the path to your downloaded weights:
```bash
# Inside evaluate.sh
DATASET="Fashion"
MODEL_NAME="oacir_adafocal"
MODEL_WEIGHT="./checkpoints/adafocal_scalar.pt"  # or adafocal_vector.pt
```
Then execute the script:
```bash
bash evaluate.sh
```

---

## 🏆 Model Performance on OACIRR

We provide two variants of the **AdaFocal** weights. You can reproduce the following results directly with the provided `evaluate.sh` script.

| Model Variant | Component Type | R<sub>ID</sub>@1 (Avg) | R@1 (Avg) | R@5 (Avg) | Overall Avg | Weights File |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|
| **AdaFocal (Scalar β)** | Default Configuration | 81.52 | 63.08 | 90.98 | **78.53** | [📥 `adafocal_scalar.pt`](https://huggingface.co/HaHaJun1101/AdaFocal/resolve/main/adafocal_scalar.pt) |
| **AdaFocal (Vector β)** | Vector Ablation | 81.99 | 63.06 | 91.35 | **78.80** | [📥 `adafocal_vector.pt`](https://huggingface.co/HaHaJun1101/AdaFocal/resolve/main/adafocal_vector.pt) |

*Detailed breakdowns across the 4 domains:*

| Variant | <font color=#990000>Fashion</font> (R<sub>ID</sub>@1 / R@1) | <font color=#CC3300>Car</font> (R<sub>ID</sub>@1 / R@1) | <font color=#003399>Product</font> (R<sub>ID</sub>@1 / R@1) | <font color=#006633>Landmark</font> (R<sub>ID</sub>@1 / R@1) |
|:---|:---:|:---:|:---:|:---:|
| **Scalar β** | 73.68 / 64.45 | 78.39 / 54.85 | 91.36 / 73.85 | 82.65 / 59.18 |
| **Vector β** | 75.71 / 65.97 | 77.97 / 54.35 | 91.39 / 73.30 | 82.90 / 58.63 |

---

## ✒️ Citation

If you find our dataset, models, or code useful in your research, please consider citing our paper.