	---
	library_name: transformers
	pipeline_tag: zero-shot-image-classification
	tags:
	- vision
	- uncertainty
	- hyperbolic
	---

	# UNCHA: Uncertainty-guided Compositional Hyperbolic Alignment

	## Overview

	UNCHA is a hyperbolic vision-language model that improves part–whole compositional understanding by modeling semantic representativeness as uncertainty.


	Unlike conventional vision-language models, UNCHA explicitly captures the fact that:

	* Not all parts contribute equally to representing a scene
	* Some regions (e.g., main objects) are more informative than others

	To address this, UNCHA introduces uncertainty-aware alignment in hyperbolic space, enabling better hierarchical and compositional reasoning.

	- Project Page: [https://jeeit17.github.io/UNCHA-project_page/](https://jeeit17.github.io/UNCHA-project_page/)
	- Paper: [Uncertainty-guided Compositional Alignment with Part-to-Whole Semantic Representativeness in Hyperbolic Vision-Language Models](https://arxiv.org/abs/2603.22042)
	- Code: [https://github.com/jeeit17/UNCHA](https://github.com/jeeit17/UNCHA)

	---

	## Download

	```python
	from huggingface_hub import snapshot_download

	repo_path = snapshot_download("hayeonkim/uncha")

	print("Repo downloaded to:", repo_path)
	```
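
This card does not list the files in the repository, so it can help to inspect the snapshot before passing a checkpoint path to the evaluation scripts. The following is a minimal sketch using only the standard library; `list_snapshot_files` is a hypothetical helper written for illustration and is not part of `huggingface_hub`.

```python
from pathlib import Path


def list_snapshot_files(repo_path):
    """Return (relative_path, size_in_bytes) for every file under repo_path."""
    root = Path(repo_path)
    files = []
    for path in sorted(root.rglob("*")):
        if path.is_file():
            # stat() follows symlinks, so linked cache files report their real size.
            files.append((path.relative_to(root).as_posix(), path.stat().st_size))
    return files
```

Calling `list_snapshot_files(repo_path)` on the path returned by `snapshot_download` gives a list that can be printed or filtered by extension to locate the checkpoint.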

	---

	## Key Idea

	UNCHA models part-to-whole semantic representativeness using uncertainty:

	* Low uncertainty → highly representative part
	* High uncertainty → less informative / noisy part

	This uncertainty is integrated into:

	* Contrastive loss → adaptive temperature scaling
	* Entailment loss → calibrated hierarchical structure with entropy regularization

	This leads to improved alignment in hyperbolic embedding space and stronger compositional reasoning.
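
To make "adaptive temperature scaling" concrete, here is a minimal sketch of a contrastive (InfoNCE) loss in which each sample gets its own temperature. The scaling rule `tau_i = base_temperature * (1 + uncertainty_i)` and the function name `adaptive_info_nce` are assumptions chosen for illustration; this card does not state the formula UNCHA uses, so treat this as a toy version rather than the authors' loss.

```python
import math


def adaptive_info_nce(similarity, uncertainty, base_temperature=0.07):
    """Image-to-text InfoNCE where each row uses its own temperature.

    similarity[i][j]: similarity of image i and text j (true pair at j == i).
    uncertainty[i]:   non-negative uncertainty of sample i.
    Assumed rule (not from the paper): tau_i = base_temperature * (1 + uncertainty[i]).
    """
    total = 0.0
    for i, row in enumerate(similarity):
        tau = base_temperature * (1.0 + uncertainty[i])
        logits = [s / tau for s in row]
        # Log-sum-exp with the max subtracted for numerical stability.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]  # negative log-probability of the true pair
    return total / len(similarity)
```

Under this assumed rule, a higher uncertainty flattens that sample's softmax, so an uncertain part pulls less sharply toward its paired text than a confident one does.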

	---

	## Model Details

	* Architecture: Hyperbolic Vision-Language Model
	* Backbone: ViT-S/16 or ViT-B/16
	* Training data: GRIT dataset (20.5M pairs, 35.9M part annotations)
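
For readers new to hyperbolic embeddings, the sketch below shows one common parameterization, the Lorentz (hyperboloid) model, which MERU uses. Whether UNCHA uses exactly this formulation is an assumption here, and the function names are illustrative rather than taken from the UNCHA codebase.

```python
import math


def exp_map_origin(v, curvature=1.0):
    """Lift a Euclidean vector v onto the hyperboloid of curvature -curvature.

    Returns (time, space), where time is the extra Lorentz coordinate.
    """
    sqrt_c = math.sqrt(curvature)
    norm = math.sqrt(sum(x * x for x in v))
    # sinh(r) / r tends to 1 as r tends to 0, so the zero vector maps to space = 0.
    scale = math.sinh(sqrt_c * norm) / (sqrt_c * norm) if norm > 0 else 1.0
    space = [scale * x for x in v]
    time = math.sqrt(1.0 / curvature + sum(x * x for x in space))
    return time, space


def lorentz_distance(p, q, curvature=1.0):
    """Geodesic distance between two hyperboloid points given as (time, space)."""
    (pt, ps), (qt, qs) = p, q
    inner = sum(a * b for a, b in zip(ps, qs)) - pt * qt  # Lorentzian inner product
    # Clamp to >= 1 so rounding error cannot push acosh out of its domain.
    return math.acosh(max(1.0, -curvature * inner)) / math.sqrt(curvature)
```

A useful sanity check on this construction: the distance from the origin to `exp_map_origin(v)` equals the Euclidean norm of `v`, so embeddings with larger norm sit farther from the origin.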

	---

	## Performance

	UNCHA scores highest among the listed methods on every benchmark in the two tables below:

	### Zero-shot classification (ViT-B/16)

	| Method | ImageNet | CIFAR-10 | CIFAR-100 | SUN397 | Caltech-101 | STL-10 |
	|--------|:--------:|:--------:|:---------:|:------:|:-----------:|:------:|
	| CLIP | 40.6 | 78.9 | 48.3 | 43.0 | 70.7 | 92.4 |
	| MERU | 40.1 | 78.6 | 49.3 | 43.0 | 73.0 | 92.8 |
	| HyCoCLIP | 45.8 | 88.8 | 60.1 | 57.2 | 81.3 | 95.0 |
	| UNCHA (Ours) | 48.8 | 90.4 | 63.2 | 57.7 | 83.9 | 95.7 |
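
As a quick way to read the zero-shot table above, the snippet below computes the per-dataset difference between UNCHA and HyCoCLIP, the strongest baseline in that table. The numbers are copied from the table; the variable names are illustrative.

```python
# ViT-B/16 zero-shot scores copied from the table above.
datasets = ["ImageNet", "CIFAR-10", "CIFAR-100", "SUN397", "Caltech-101", "STL-10"]
hycoclip = [45.8, 88.8, 60.1, 57.2, 81.3, 95.0]
uncha = [48.8, 90.4, 63.2, 57.7, 83.9, 95.7]

# Round to one decimal to hide floating-point noise in the subtraction.
gains = {name: round(u - h, 1) for name, u, h in zip(datasets, uncha, hycoclip)}
mean_gain = round(sum(gains.values()) / len(gains), 2)

for name, gain in gains.items():
    print(f"{name}: +{gain}")
print(f"mean: +{mean_gain}")
```

The differences range from 0.5 points (SUN397) to 3.1 points (CIFAR-100), about 1.9 points on average.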

	### Multi-object representation (ViT-B/16, mAP)

	| Method | ComCo 2obj | ComCo 5obj | SimCo 2obj | SimCo 5obj | VOC | COCO |
	|--------|:----------:|:----------:|:----------:|:----------:|:---:|:----:|
	| CLIP | 77.55 | 80.22 | 77.15 | 88.48 | 78.56 | 53.94 |
	| HyCoCLIP | 72.90 | 72.90 | 75.71 | 82.85 | 80.43 | 58.12 |
	| UNCHA (Ours) | 77.92 | 81.18 | 79.72 | 90.65 | 82.14 | 59.43 |


	---

	## Training

	Before training, preprocess the GRIT dataset:

	```bash
	python utils/prepare_GRIT_webdataset.py \
	--raw_webdataset_path datasets/train/GRIT/raw \
	--processed_webdataset_path datasets/train/GRIT/processed
	```

	Then run:

	```bash
	./scripts/train.sh \
	--config configs/train_uncha_vit_b.py \
	--num-gpus 4
	```

	---

	## Evaluation

	### Zero-shot classification

	```bash
	python scripts/evaluate.py \
	--config configs/eval_zero_shot_classification.py \
	--checkpoint-path /path/to/ckpt
	```

	### Retrieval

	```bash
	python scripts/evaluate.py \
	--config configs/eval_zero_shot_retrieval.py \
	--checkpoint-path /path/to/ckpt
	```

	---

	## Citation

	```bibtex
	@inproceedings{kim2026uncha,
	author = {Kim, Hayeon and Jang, Ji Ha and Kim, Junghun James and Chun, Se Young},
	title = {UNCHA: Uncertainty-guided Compositional Alignment with Part-to-Whole Semantic Representativeness in Hyperbolic Vision-Language Models},
	booktitle = {CVPR},
	year = {2026},
	}
	```

	---

	## Acknowledgements

	This work is supported by IITP, NRF, MSIT, and Seoul National University programs.
	We also acknowledge prior works including MERU, HyCoCLIP, and ATMG.