---
library_name: transformers
pipeline_tag: zero-shot-image-classification
tags:
- vision
- uncertainty
- hyperbolic
---
# UNCHA: Uncertainty-guided Compositional Hyperbolic Alignment
## Overview
UNCHA is a hyperbolic vision-language model that improves part–whole compositional understanding by modeling **semantic representativeness as uncertainty**.
Unlike conventional vision-language models, UNCHA explicitly accounts for two observations:
* Not all parts contribute equally to representing a scene
* Some regions (e.g., main objects) are more informative than others
To address this, UNCHA introduces **uncertainty-aware alignment in hyperbolic space**, enabling better hierarchical and compositional reasoning.
- **Project Page:** [https://jeeit17.github.io/UNCHA-project_page/](https://jeeit17.github.io/UNCHA-project_page/)
- **Paper:** [Uncertainty-guided Compositional Alignment with Part-to-Whole Semantic Representativeness in Hyperbolic Vision-Language Models](https://arxiv.org/abs/2603.22042)
- **Code:** [https://github.com/jeeit17/UNCHA](https://github.com/jeeit17/UNCHA)
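To make "hyperbolic space" concrete: embeddings live on a hyperboloid, where distances grow exponentially toward the boundary and naturally encode hierarchy (generic concepts near the origin, specific parts farther out). The following is a minimal NumPy sketch of the Lorentz-model geometry such models are built on; the function names and the unit-curvature setting are illustrative, not taken from the UNCHA codebase.

```python
import numpy as np

def lorentz_inner(u, v):
    """Lorentzian inner product <u, v>_L = -u0*v0 + sum_i(ui * vi)."""
    return -u[0] * v[0] + np.dot(u[1:], v[1:])

def lift_to_hyperboloid(x, curv=1.0):
    """Lift a Euclidean vector x onto the hyperboloid by solving for the
    time-like coordinate x0 so that <x, x>_L = -1/curv."""
    x0 = np.sqrt(1.0 / curv + np.dot(x, x))
    return np.concatenate([[x0], x])

def lorentz_distance(u, v, curv=1.0):
    """Geodesic distance between two points on the hyperboloid."""
    inner = lorentz_inner(u, v)
    # On the manifold, -curv * <u,v>_L >= 1; clamp for numerical stability.
    return (1.0 / np.sqrt(curv)) * np.arccosh(np.clip(-curv * inner, 1.0, None))
```

A point is at zero distance from itself, and the distance grows quickly for points lifted from larger Euclidean vectors, which is what makes the space well suited to tree-like part-to-whole structure.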
---
## Download
```python
from huggingface_hub import snapshot_download
repo_path = snapshot_download("hayeonkim/uncha")
print("Repo downloaded to:", repo_path)
```
---
## Key Idea
UNCHA models **part-to-whole semantic representativeness** using uncertainty:
* **Low uncertainty** → highly representative part
* **High uncertainty** → less informative / noisy part
This uncertainty is integrated into:
* **Contrastive loss** → adaptive temperature scaling
* **Entailment loss** → calibrated hierarchical structure with entropy regularization
This leads to improved alignment in hyperbolic embedding space and stronger compositional reasoning.
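The adaptive temperature idea can be sketched in a few lines: give each pair a temperature that grows with its predicted uncertainty, so gradients from less representative (noisy) parts are softened. This is a hypothetical NumPy illustration of the mechanism, not the actual UNCHA loss; the scaling rule `base_temp * (1 + u)` and all names are assumptions for exposition.

```python
import numpy as np

def uncertainty_scaled_infonce(img_emb, txt_emb, uncertainty, base_temp=0.07):
    """Symmetric InfoNCE where row i is scaled by a per-sample temperature
    base_temp * (1 + u_i): higher uncertainty -> softer logits -> weaker
    gradient for that part. Purely illustrative of the mechanism."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T                        # (B, B) cosine similarities
    temp = base_temp * (1.0 + uncertainty)      # per-sample temperature
    logits = logits / temp[:, None]

    def xent(l):
        # Cross-entropy with matched pairs on the diagonal (stable softmax).
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    return 0.5 * (xent(logits) + xent(logits.T))
```

The entailment side works analogously: the uncertainty calibrates how strictly a part embedding must fall inside the entailment cone of its whole, with entropy regularization keeping the uncertainty estimates from collapsing.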
---
## Model Details
* Architecture: Hyperbolic Vision-Language Model
* Backbone: ViT-S/16 or ViT-B/16
* Training data: GRIT dataset (20.5M image–text pairs, 35.9M part annotations)
---
## Performance
UNCHA achieves strong performance across multiple tasks:
### Zero-shot classification (ViT-B/16)
| Method | ImageNet | CIFAR-10 | CIFAR-100 | SUN397 | Caltech-101 | STL-10 |
|--------|:--------:|:--------:|:---------:|:------:|:-----------:|:------:|
| CLIP | 40.6 | 78.9 | 48.3 | 43.0 | 70.7 | 92.4 |
| MERU | 40.1 | 78.6 | 49.3 | 43.0 | 73.0 | 92.8 |
| HyCoCLIP | 45.8 | 88.8 | 60.1 | 57.2 | 81.3 | 95.0 |
| **UNCHA (Ours)** | **48.8** | **90.4** | **63.2** | **57.7** | **83.9** | **95.7** |
### Multi-object representation (ViT-B/16, mAP)
| Method | ComCo 2obj | ComCo 5obj | SimCo 2obj | SimCo 5obj | VOC | COCO |
|--------|:----------:|:----------:|:----------:|:----------:|:---:|:----:|
| CLIP | 77.55 | 80.22 | 77.15 | 88.48 | 78.56 | 53.94 |
| HyCoCLIP | 72.90 | 72.90 | 75.71 | 82.85 | 80.43 | 58.12 |
| **UNCHA (Ours)** | **77.92** | **81.18** | **79.72** | **90.65** | **82.14** | **59.43** |
---
## Training
Training first requires preprocessing the GRIT dataset:
```bash
python utils/prepare_GRIT_webdataset.py \
--raw_webdataset_path datasets/train/GRIT/raw \
--processed_webdataset_path datasets/train/GRIT/processed
```
Then launch training:
```bash
./scripts/train.sh \
--config configs/train_uncha_vit_b.py \
--num-gpus 4
```
---
## Evaluation
### Zero-shot classification
```bash
python scripts/evaluate.py \
--config configs/eval_zero_shot_classification.py \
--checkpoint-path /path/to/ckpt
```
### Retrieval
```bash
python scripts/evaluate.py \
--config configs/eval_zero_shot_retrieval.py \
--checkpoint-path /path/to/ckpt
```
---
## Citation
```bibtex
@inproceedings{kim2026uncha,
author = {Kim, Hayeon and Jang, Ji Ha and Kim, Junghun James and Chun, Se Young},
title = {UNCHA: Uncertainty-guided Compositional Alignment with Part-to-Whole Semantic Representativeness in Hyperbolic Vision-Language Models},
booktitle = {CVPR},
year = {2026},
}
```
---
## Acknowledgements
This work was supported by IITP, NRF, MSIT, and Seoul National University programs.
We also acknowledge prior works including MERU, HyCoCLIP, and ATMG. |