---
language:
- en
library_name: transformers
tags:
- computer-vision
- image-segmentation
- semantic-segmentation
- co-segmentation
- part-segmentation
- multi-image-reasoning
- vision-language
pipeline_tag: image-segmentation
datasets:
- PLAN-Lab/MixedParts
---

# CALICO [CVPR 2025]

CALICO is a large vision-language model for part-focused semantic co-segmentation. Given a pair of images and a natural-language prompt, CALICO identifies common objects, common parts, or unique parts and predicts segmentation masks for the referenced regions.

This repository contains the released CALICO checkpoint. It is intended to be loaded with the [CALICO codebase](https://github.com/PLAN-Lab/CALICO).

## Model Details

- **Model name:** CALICO
- **Dataset:** Mixed Parts
- **Task:** part-focused semantic co-segmentation and segmentation-grounded vision-language reasoning
- **Checkpoint format:** merged Hugging Face checkpoint with safetensors shards
- **Vision-language image encoder:** Q-Former over EVA-CLIP-G visual features
- **Grounding decoder:** SAM ViT-H mask decoder
- **Correspondence features:** frozen DINOv2 features used by CALICO correspondence modules

CALICO includes two correspondence modules in the language-model forward pass:

- **Correspondence Extraction Module (CEM):** fuses Q-Former visual patch embeddings with frozen DINOv2 correspondence features.
- **Correspondence Adaptation Module (CAM):** adapts Q-Former queries from the current language state and reintegrates correspondence-aware visual features into image-token hidden states.
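
As a rough intuition for what "fusing" two feature streams can look like, the sketch below applies single-head cross-attention with a residual connection in NumPy. It is not the CALICO implementation: the function name `cross_attention_fuse`, the tensor shapes, the random projection weights, and the choice of cross-attention itself are all assumptions made for illustration. Refer to the CALICO codebase for the actual CEM and CAM.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_fuse(visual, corr, w_q, w_k, w_v):
    """Hypothetical fusion: visual tokens attend to correspondence features.

    visual: (n_vis, d) token embeddings (stand-in for Q-Former outputs)
    corr:   (n_corr, d_c) frozen features (stand-in for DINOv2 outputs)
    """
    q = visual @ w_q            # (n_vis, d)
    k = corr @ w_k              # (n_corr, d)
    v = corr @ w_v              # (n_corr, d)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (n_vis, n_corr)
    return visual + attn @ v    # residual keeps the original visual signal

rng = np.random.default_rng(0)
n_vis, n_corr, d, d_c = 32, 256, 64, 48   # illustrative sizes only
visual = rng.normal(size=(n_vis, d))
corr = rng.normal(size=(n_corr, d_c))
w_q = rng.normal(size=(d, d)) / np.sqrt(d)
w_k = rng.normal(size=(d_c, d)) / np.sqrt(d_c)
w_v = rng.normal(size=(d_c, d)) / np.sqrt(d_c)

fused = cross_attention_fuse(visual, corr, w_q, w_k, w_v)
print(fused.shape)
```

The output keeps the token count and width of the visual stream, so a module of this general kind could sit in a forward pass without changing downstream shapes.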

## Intended Use

Use CALICO for research on multi-image part-focused segmentation, semantic co-segmentation, and vision-language grounding. The model is released for evaluation and fine-tuning with the CALICO repository.

## Quick Start

Install the CALICO environment and prepare Mixed Parts following the repository docs:

- [Installation](https://github.com/PLAN-Lab/CALICO/blob/main/docs/INSTALL.md)
- [Data preparation](https://github.com/PLAN-Lab/CALICO/blob/main/docs/DATA.md)
- [Mixed Parts dataset](https://huggingface.co/datasets/PLAN-Lab/MixedParts)

Run evaluation from the CALICO repository root:

```bash
python evaluate.py \
  --merged_ckpt_path PLAN-Lab/CALICO \
  --dataset_dir ./data \
  --output_save_path ./evaluate_results/calico_mixed_parts \
  --val_dataset "MixedPartsObjectVal|MixedPartsPartVal" \
  --multi_image_filepath_prefix ./data/mixed_parts_data/mixed_parts_test.json \
  --mode test \
  --compute_metrics
```

The directory passed to `--dataset_dir` (`./data` in the command above) should contain:

```text
data/
├── coco_2017/
└── mixed_parts_data/
```
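
Before launching a long evaluation run, it can help to confirm that this layout is in place. The snippet below is a minimal sketch, not part of the CALICO repository: the helper name `missing_subdirs` is hypothetical, and it only checks that the two top-level directories exist, not their contents.

```python
from pathlib import Path

# Subdirectories named in the layout above.
REQUIRED_SUBDIRS = ("coco_2017", "mixed_parts_data")

def missing_subdirs(dataset_dir):
    """Return the required subdirectories that are absent from dataset_dir."""
    root = Path(dataset_dir)
    return [name for name in REQUIRED_SUBDIRS if not (root / name).is_dir()]

if __name__ == "__main__":
    missing = missing_subdirs("./data")
    if missing:
        print("Missing under ./data:", ", ".join(missing))
    else:
        print("Dataset layout looks complete.")
```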

## Training and Fine-Tuning

Fine-tuning is supported through `train.py` in the CALICO repository. See [docs/TRAINING.md](https://github.com/PLAN-Lab/CALICO/blob/main/docs/TRAINING.md) for recommended arguments and data layout.

## Citation

If you use CALICO or Mixed Parts, please cite:

```bibtex
@inproceedings{nguyen2025calico,
  title={CALICO: Part-Focused Semantic Co-Segmentation with Large Vision-Language Models},
  author={Nguyen, Kiet A. and Juvekar, Adheesh and Yu, Tianjiao and Wahed, Muntasir and Lourentzou, Ismini},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2025}
}
```