hayeonkim committed
Commit 315e6c2 (0 parents)

initial commit

Files changed (4):
  1. .gitattributes +1 -0
  2. README.md +173 -0
  3. uncha_vit_b.pth +3 -0
  4. uncha_vit_s.pth +3 -0
.gitattributes ADDED
@@ -0,0 +1 @@
*.pth filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,173 @@
# UNCHA: Uncertainty-guided Compositional Hyperbolic Alignment

## Overview

UNCHA is a hyperbolic vision-language model that improves part–whole compositional understanding by modeling **semantic representativeness as uncertainty**.

Unlike conventional vision-language models, UNCHA explicitly captures the fact that:

* Not all parts contribute equally to representing a scene
* Some regions (e.g., main objects) are more informative than others

To address this, UNCHA introduces **uncertainty-aware alignment in hyperbolic space**, enabling better hierarchical and compositional reasoning.

* Project Page: https://jeeit17.github.io/UNCHA-project_page/
* Paper: https://arxiv.org/abs/2603.22042

---

## Key Idea

UNCHA models **part-to-whole semantic representativeness** using uncertainty:

* **Low uncertainty** → highly representative part
* **High uncertainty** → less informative / noisy part

This uncertainty is integrated into:

* **Contrastive loss** → adaptive temperature scaling
* **Entailment loss** → calibrated hierarchical structure with entropy regularization

Together, these lead to improved alignment in the hyperbolic embedding space and stronger compositional reasoning.
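The adaptive-temperature idea can be sketched as an InfoNCE-style loss in which each sample's temperature is modulated by its uncertainty, so that low-uncertainty (highly representative) parts contribute sharper, more confident alignment signals. This is a minimal Euclidean sketch under stated assumptions, not the paper's exact hyperbolic formulation; `base_temp` and the scaling rule are illustrative:

```python
import torch
import torch.nn.functional as F


def uncertainty_scaled_contrastive_loss(img_emb, txt_emb, uncertainty, base_temp=0.07):
    """Illustrative InfoNCE loss with per-sample temperature scaling.

    Low uncertainty -> sharper temperature -> stronger alignment signal.
    `uncertainty` is assumed to lie in [0, 1); the exact modulation used
    by UNCHA may differ.
    """
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    # Confident (low-uncertainty) samples get a lower temperature.
    temp = base_temp * (1.0 + uncertainty)            # shape: (B,)
    logits = img_emb @ txt_emb.t() / temp.unsqueeze(1)  # (B, B)
    targets = torch.arange(img_emb.size(0))
    return F.cross_entropy(logits, targets)
```

The per-row temperature means each image's positive pair is contrasted against negatives at a confidence-dependent sharpness, which is the mechanism the bullet above describes.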

---

## Model Details

* Architecture: hyperbolic vision-language model
* Backbone: ViT-S/16 or ViT-B/16
* Training data: GRIT dataset (20.5M pairs, 35.9M part annotations)

---

## Performance

UNCHA achieves strong performance across multiple tasks:

### Zero-shot classification (ViT-B/16)

| Method | ImageNet | CIFAR-10 | CIFAR-100 | SUN397 | Caltech-101 | STL-10 |
|--------|:--------:|:--------:|:---------:|:------:|:-----------:|:------:|
| CLIP | 40.6 | 78.9 | 48.3 | 43.0 | 70.7 | 92.4 |
| MERU | 40.1 | 78.6 | 49.3 | 43.0 | 73.0 | 92.8 |
| HyCoCLIP | 45.8 | 88.8 | 60.1 | 57.2 | 81.3 | 95.0 |
| **UNCHA (Ours)** | **48.8** | **90.4** | **63.2** | **57.7** | **83.9** | **95.7** |

### Multi-object representation (ViT-B/16, mAP)

| Method | ComCo 2obj | ComCo 5obj | SimCo 2obj | SimCo 5obj | VOC | COCO |
|--------|:----------:|:----------:|:----------:|:----------:|:---:|:----:|
| CLIP | 77.55 | 80.22 | 77.15 | 88.48 | 78.56 | 53.94 |
| HyCoCLIP | 72.90 | 72.90 | 75.71 | 82.85 | 80.43 | 58.12 |
| **UNCHA (Ours)** | **77.92** | **81.18** | **79.72** | **90.65** | **82.14** | **59.43** |

---

## Usage

### Load model

```python
from transformers import AutoModel

model = AutoModel.from_pretrained("hayeonkim/uncha")
```

---

### Example (feature extraction)

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModel

model = AutoModel.from_pretrained("hayeonkim/uncha")
processor = AutoProcessor.from_pretrained("hayeonkim/uncha")

image = Image.open("example.jpg")
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

image_embedding = outputs.last_hidden_state
```
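Once embeddings have been extracted, zero-shot classification reduces to a cosine-similarity lookup against class-prompt embeddings. The sketch below is generic post-processing, not part of the released API; the function and tensor names are illustrative:

```python
import torch
import torch.nn.functional as F


def zero_shot_classify(image_emb, class_text_embs):
    """Return the index of the class prompt most similar to the image.

    image_emb:       (D,)   embedding of one image
    class_text_embs: (C, D) embeddings of C class prompts
    """
    image_emb = F.normalize(image_emb, dim=-1)
    class_text_embs = F.normalize(class_text_embs, dim=-1)
    sims = class_text_embs @ image_emb  # cosine similarity per class
    return sims.argmax().item()
```

For hyperbolic models, the negative Lorentzian distance is often used in place of cosine similarity, but the nearest-neighbor structure of the lookup is the same.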

---

## Training

Training requires preprocessing the GRIT dataset:

```bash
python utils/prepare_GRIT_webdataset.py \
  --raw_webdataset_path datasets/train/GRIT/raw \
  --processed_webdataset_path datasets/train/GRIT/processed
```

Then launch training:

```bash
./scripts/train.sh \
  --config configs/train_uncha_vit_b.py \
  --num-gpus 4
```

---

## Evaluation

### Zero-shot classification

```bash
python scripts/evaluate.py \
  --config configs/eval_zero_shot_classification.py \
  --checkpoint-path /path/to/ckpt
```

### Retrieval

```bash
python scripts/evaluate.py \
  --config configs/eval_zero_shot_retrieval.py \
  --checkpoint-path /path/to/ckpt
```

---

## Method Summary

UNCHA improves compositional alignment by:

1. Modeling representativeness via uncertainty
2. Weighting parts adaptively during contrastive learning
3. Structuring embeddings using hyperbolic geometry

This enables:

* Better part–whole hierarchy
* Improved multi-object reasoning
* Stronger zero-shot generalization
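Step 3 above typically relies on lifting Euclidean encoder outputs onto a hyperbolic manifold via the exponential map at the origin of the Lorentz model, as in MERU-style models. A minimal sketch follows; the function name and curvature handling are illustrative assumptions, not UNCHA's exact implementation:

```python
import torch


def lift_to_hyperboloid(x, curvature=1.0):
    """Exponential map at the origin of the Lorentz model.

    Lifts Euclidean embeddings x of shape (B, D) onto the hyperboloid
    satisfying -x0^2 + ||x_space||^2 = -1/c.
    """
    c = curvature
    x_norm = x.norm(dim=-1, keepdim=True).clamp(min=1e-8)
    # Space-like components from the exponential map.
    x_space = torch.sinh(c**0.5 * x_norm) * x / (c**0.5 * x_norm)
    # Time-like component fixed by the hyperboloid constraint.
    x_time = torch.sqrt(1.0 / c + (x_space**2).sum(dim=-1, keepdim=True))
    return torch.cat([x_time, x_space], dim=-1)
```

Points lifted this way satisfy the Minkowski constraint exactly, which is what makes entailment cones and hyperbolic distances well defined on the resulting embeddings.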

---

## Citation

```bibtex
@inproceedings{kim2026uncha,
  author    = {Kim, Hayeon and Jang, Ji Ha and Kim, Junghun James and Chun, Se Young},
  title     = {UNCHA: Uncertainty-guided Compositional Hyperbolic Alignment with Part-to-Whole Semantic Representativeness},
  booktitle = {CVPR},
  year      = {2026},
}
```

---

## Acknowledgements

This work was supported by IITP, NRF, MSIT, and Seoul National University programs.
We also acknowledge prior works including MERU, HyCoCLIP, and ATMG.

---
uncha_vit_b.pth ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:b8e83c1fa2f2f1d0b5920c157a9c41a5dbf53d0841030d5e97e6ed954336d4cc
size 1801077399
uncha_vit_s.pth ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:26bf4e6b43e77863bbe91876f4ea8c03fceaa7582bc1c3f6bb4e826b129ce461
size 1029727959