---
library_name: transformers
pipeline_tag: zero-shot-image-classification
tags:
- vision
- uncertainty
- hyperbolic
---

# UNCHA: Uncertainty-guided Compositional Hyperbolic Alignment

## Overview

UNCHA is a hyperbolic vision-language model that improves part–whole compositional understanding by modeling **semantic representativeness as uncertainty**.


Unlike conventional vision-language models, UNCHA explicitly captures the fact that:

* Not all parts contribute equally to representing a scene
* Some regions (e.g., main objects) are more informative than others

To address this, UNCHA introduces **uncertainty-aware alignment in hyperbolic space**, enabling better hierarchical and compositional reasoning.

- **Project Page:** [https://jeeit17.github.io/UNCHA-project_page/](https://jeeit17.github.io/UNCHA-project_page/)
- **Paper:** [Uncertainty-guided Compositional Alignment with Part-to-Whole Semantic Representativeness in Hyperbolic Vision-Language Models](https://arxiv.org/abs/2603.22042)
- **Code:** [https://github.com/jeeit17/UNCHA](https://github.com/jeeit17/UNCHA)

---

## Download

```python
from huggingface_hub import snapshot_download

repo_path = snapshot_download("hayeonkim/uncha")

print("Repo downloaded to:", repo_path)
```
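Once the snapshot is on disk, you may want to locate the checkpoint files inside it. The helper below is a minimal sketch: `find_checkpoints` is a hypothetical name, and the suffix list is an assumption, since this card does not list the repository's file names.

```python
from pathlib import Path

# Assumed checkpoint extensions; the repository's actual file names are not documented here.
CHECKPOINT_SUFFIXES = {".pth", ".pt", ".ckpt", ".safetensors"}


def find_checkpoints(repo_path):
    """Return sorted paths under repo_path whose suffix looks like a checkpoint."""
    root = Path(repo_path)
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix in CHECKPOINT_SUFFIXES
    )
```

Calling `find_checkpoints(repo_path)` with the path returned by `snapshot_download` gives candidate files to pass as `--checkpoint-path` in the evaluation commands.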

---

## Key Idea

UNCHA models **part-to-whole semantic representativeness** using uncertainty:

* **Low uncertainty** → highly representative part
* **High uncertainty** → less informative / noisy part

This uncertainty is integrated into:

* **Contrastive loss** → adaptive temperature scaling
* **Entailment loss** → calibrated hierarchical structure with entropy regularization

This leads to improved alignment in hyperbolic embedding space and stronger compositional reasoning.
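To make the adaptive temperature idea concrete, here is a minimal sketch in plain Python. It is not the loss from the paper: the linear rule in `adaptive_temperature`, the base value `0.07`, and the choice that higher uncertainty raises the temperature are all assumptions made for illustration.

```python
import math


def adaptive_temperature(uncertainty, base_temperature=0.07):
    """Hypothetical rule: temperature grows linearly with a non-negative uncertainty."""
    return base_temperature * (1.0 + uncertainty)


def info_nce(similarities, positive_index, temperature):
    """Cross-entropy of the positive pair over temperature-scaled similarities."""
    logits = [s / temperature for s in similarities]
    max_logit = max(logits)  # subtract the max for numerical stability
    log_norm = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_norm - logits[positive_index]


# Similarity of one part embedding to three candidate whole-scene embeddings.
similarities = [0.9, 0.2, 0.1]
confident = info_nce(similarities, 0, adaptive_temperature(0.0))
uncertain = info_nce(similarities, 0, adaptive_temperature(2.0))
```

Under this assumed rule, the uncertain part is scored with a higher temperature, which flattens the softmax over candidates, so the same similarities yield a less peaked distribution than for the confident part.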

---

## Model Details

* Architecture: Hyperbolic Vision-Language Model
* Backbone: ViT-S/16 or ViT-B/16
* Training data: GRIT dataset (20.5M pairs, 35.9M part annotations)
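For readers new to hyperbolic embeddings, the sketch below computes geodesic distance in the Lorentz (hyperboloid) model, which is the model used by MERU. Whether UNCHA uses exactly this parameterization is an assumption here; the function names and the `curvature` argument are illustrative.

```python
import math


def lift_to_hyperboloid(space, curvature=1.0):
    """Lift space coordinates onto the hyperboloid <x, x>_L = -1/curvature."""
    time = math.sqrt(1.0 / curvature + sum(x * x for x in space))
    return [time] + list(space)


def lorentz_inner(x, y):
    """Lorentzian inner product: -x0*y0 + sum_i xi*yi."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))


def lorentz_distance(x, y, curvature=1.0):
    """Geodesic distance between two points on the hyperboloid."""
    # Clamp to 1 so rounding error cannot push the argument outside acosh's domain.
    arg = max(1.0, -curvature * lorentz_inner(x, y))
    return math.acosh(arg) / math.sqrt(curvature)


origin = lift_to_hyperboloid([0.0, 0.0])
point = lift_to_hyperboloid([1.0, 0.0])
distance = lorentz_distance(origin, point)
```

Distances grow quickly away from the origin, which is why hyperbolic space is often used to place general concepts near the origin and specific ones farther out.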

---

## Performance

With a ViT-B/16 backbone, UNCHA scores higher than every listed baseline in each column of the two tables below:

### Zero-shot classification (ViT-B/16)

| Method | ImageNet | CIFAR-10 | CIFAR-100 | SUN397 | Caltech-101 | STL-10 |
|--------|:--------:|:--------:|:---------:|:------:|:-----------:|:------:|
| CLIP   | 40.6 | 78.9 | 48.3 | 43.0 | 70.7 | 92.4 |
| MERU   | 40.1 | 78.6 | 49.3 | 43.0 | 73.0 | 92.8 |
| HyCoCLIP | 45.8 | 88.8 | 60.1 | 57.2 | 81.3 | 95.0 |
| **UNCHA (Ours)** | **48.8** | **90.4** | **63.2** | **57.7** | **83.9** | **95.7** |
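The per-dataset margin over HyCoCLIP, the strongest baseline in the table above, can be read off with a few lines of arithmetic. The numbers are copied from the table; nothing else is assumed.

```python
# Zero-shot accuracies (ViT-B/16) copied from the table above.
datasets = ["ImageNet", "CIFAR-10", "CIFAR-100", "SUN397", "Caltech-101", "STL-10"]
hycoclip = [45.8, 88.8, 60.1, 57.2, 81.3, 95.0]
uncha = [48.8, 90.4, 63.2, 57.7, 83.9, 95.7]

# Absolute gain in accuracy points, rounded to the table's precision.
gains = {name: round(u - h, 1) for name, u, h in zip(datasets, uncha, hycoclip)}
mean_gain = round(sum(gains.values()) / len(gains), 2)

for name, gain in gains.items():
    print(f"{name}: +{gain:.1f}")
print(f"mean: +{mean_gain:.2f}")
```

The gains range from 0.5 points (SUN397) to 3.1 points (CIFAR-100), averaging about 1.9 points across the six datasets.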

### Multi-object representation (ViT-B/16, mAP)

| Method | ComCo 2obj | ComCo 5obj | SimCo 2obj | SimCo 5obj | VOC | COCO |
|--------|:----------:|:----------:|:----------:|:----------:|:---:|:----:|
| CLIP   | 77.55 | 80.22 | 77.15 | 88.48 | 78.56 | 53.94 |
| HyCoCLIP | 72.90 | 72.90 | 75.71 | 82.85 | 80.43 | 58.12 |
| **UNCHA (Ours)** | **77.92** | **81.18** | **79.72** | **90.65** | **82.14** | **59.43** |


---

## Training

Training requires preprocessing the GRIT dataset first:

```bash
python utils/prepare_GRIT_webdataset.py \
    --raw_webdataset_path datasets/train/GRIT/raw \
    --processed_webdataset_path datasets/train/GRIT/processed
```

Then run:

```bash
./scripts/train.sh \
    --config configs/train_uncha_vit_b.py \
    --num-gpus 4
```

---

## Evaluation

### Zero-shot classification

```bash
python scripts/evaluate.py \
    --config configs/eval_zero_shot_classification.py \
    --checkpoint-path /path/to/ckpt
```

### Retrieval

```bash
python scripts/evaluate.py \
    --config configs/eval_zero_shot_retrieval.py \
    --checkpoint-path /path/to/ckpt
```
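To run both evaluations against the same checkpoint, the two commands above can be wrapped in a small driver. This is a convenience sketch, not part of the repository: `build_eval_command` and `run_all` are hypothetical helpers, and the script is assumed to be launched from the repository root.

```python
import subprocess
import sys

# Config paths taken from the commands above.
EVAL_CONFIGS = [
    "configs/eval_zero_shot_classification.py",
    "configs/eval_zero_shot_retrieval.py",
]


def build_eval_command(config, checkpoint_path):
    """Assemble the evaluate.py invocation shown above as an argument list."""
    return [
        sys.executable, "scripts/evaluate.py",
        "--config", config,
        "--checkpoint-path", checkpoint_path,
    ]


def run_all(checkpoint_path):
    """Run each evaluation in turn, stopping at the first failure."""
    for config in EVAL_CONFIGS:
        subprocess.run(build_eval_command(config, checkpoint_path), check=True)
```

Calling `run_all("/path/to/ckpt")` runs classification first and then retrieval; `check=True` raises `subprocess.CalledProcessError` if either run exits with a non-zero status.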

---

## Citation

```bibtex
@inproceedings{kim2026uncha,
  author    = {Kim, Hayeon and Jang, Ji Ha and Kim, Junghun James and Chun, Se Young},
  title     = {UNCHA: Uncertainty-guided Compositional Alignment with Part-to-Whole Semantic Representativeness in Hyperbolic Vision-Language Models},
  booktitle = {CVPR},
  year      = {2026},
}
```
---

## Acknowledgements

This work is supported by IITP, NRF, MSIT, and Seoul National University programs.
We also acknowledge prior work, including MERU, HyCoCLIP, and ATMG.