---
language:
- en
library_name: transformers
tags:
- computer-vision
- image-segmentation
- semantic-segmentation
- co-segmentation
- part-segmentation
- multi-image-reasoning
- vision-language
pipeline_tag: image-segmentation
datasets:
- PLAN-Lab/MixedParts
---

# CALICO [CVPR 2025]

CALICO is a large vision-language model for part-focused semantic co-segmentation. Given a pair of images and a natural-language prompt, CALICO identifies common objects, common parts, or unique parts and predicts segmentation masks for the referenced regions.

This repository contains the released CALICO checkpoint. It is intended to be loaded with the [CALICO codebase](https://github.com/PLAN-Lab/CALICO).

## Model Details

- **Model name:** CALICO
- **Dataset:** Mixed Parts
- **Task:** part-focused semantic co-segmentation and segmentation-grounded vision-language reasoning
- **Checkpoint format:** merged Hugging Face checkpoint with safetensors shards
- **Vision-language image encoder:** Q-Former over EVA-CLIP-G visual features
- **Grounding decoder:** SAM ViT-H mask decoder
- **Correspondence features:** frozen DINOv2 features used by CALICO correspondence modules

CALICO includes two correspondence modules in the language-model forward pass:

- **Correspondence Extraction Module (CEM):** fuses Q-Former visual patch embeddings with frozen DINOv2 correspondence features.
- **Correspondence Adaptation Module (CAM):** adapts Q-Former queries from the current language state and reintegrates correspondence-aware visual features into image-token hidden states.
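
The checkpoint format listed under Model Details is a set of safetensors shards. The following is a minimal sketch for inspecting such a checkpoint without loading any weights. It assumes the download follows the usual Transformers convention of a `model.safetensors.index.json` file whose `weight_map` maps tensor names to shard files; this has not been verified against the CALICO repository, and the function name and local path are illustrative.

```python
import json
from collections import defaultdict
from pathlib import Path


def tensors_per_shard(index_path):
    """Group tensor names by the shard file that stores them.

    Assumes the Transformers-style index layout:
    {"weight_map": {tensor_name: shard_file}}.
    """
    index = json.loads(Path(index_path).read_text())
    shards = defaultdict(list)
    for tensor_name, shard_file in index["weight_map"].items():
        shards[shard_file].append(tensor_name)
    return dict(shards)


if __name__ == "__main__":
    # Hypothetical local path to a downloaded copy of the checkpoint.
    index_file = Path("./CALICO/model.safetensors.index.json")
    if index_file.exists():
        for shard, names in sorted(tensors_per_shard(index_file).items()):
            print(f"{shard}: {len(names)} tensors")
```

Reading only the index keeps the check cheap, which can be useful for confirming that a download is complete before starting evaluation.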

## Intended Use

Use CALICO for research on multi-image part-focused segmentation, semantic co-segmentation, and vision-language grounding. The model is released for evaluation and fine-tuning with the CALICO repository.

## Quick Start

Install the CALICO environment and prepare Mixed Parts following the repository docs:

- [Installation](https://github.com/PLAN-Lab/CALICO/blob/main/docs/INSTALL.md)
- [Data preparation](https://github.com/PLAN-Lab/CALICO/blob/main/docs/DATA.md)
- [Mixed Parts dataset](https://huggingface.co/datasets/PLAN-Lab/MixedParts)

Run evaluation from the CALICO repository root:

```bash
python evaluate.py \
  --merged_ckpt_path PLAN-Lab/CALICO \
  --dataset_dir ./data \
  --output_save_path ./evaluate_results/calico_mixed_parts \
  --val_dataset "MixedPartsObjectVal|MixedPartsPartVal" \
  --multi_image_filepath_prefix ./data/mixed_parts_data/mixed_parts_test.json \
  --mode test \
  --compute_metrics
```
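
When the evaluation is launched from a script rather than a shell, the same command can be assembled as an argument list. The sketch below only reuses the flags and values from the command above; the function name `build_eval_command` and its parameters are illustrative and not part of the CALICO codebase.

```python
import shlex
import sys


def build_eval_command(
    ckpt="PLAN-Lab/CALICO",
    dataset_dir="./data",
    output_dir="./evaluate_results/calico_mixed_parts",
    val_datasets=("MixedPartsObjectVal", "MixedPartsPartVal"),
    test_json="./data/mixed_parts_data/mixed_parts_test.json",
):
    """Return the argument list for the evaluation command shown above."""
    return [
        sys.executable, "evaluate.py",
        "--merged_ckpt_path", ckpt,
        "--dataset_dir", dataset_dir,
        "--output_save_path", output_dir,
        # Validation splits are joined with "|", as in the shell command.
        "--val_dataset", "|".join(val_datasets),
        "--multi_image_filepath_prefix", test_json,
        "--mode", "test",
        "--compute_metrics",
    ]


if __name__ == "__main__":
    # Print the command for inspection. To execute it, pass the list to
    # subprocess.run(cmd, check=True) from the CALICO repository root.
    print(shlex.join(build_eval_command()))
```

Passing the list form to `subprocess.run` avoids shell interpretation of the `|` separator in the `--val_dataset` value, which is why the shell version needs quotes around it.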

`--dataset_dir ./data` should contain:

```text
data/
├── coco_2017/
└── mixed_parts_data/
```
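
A quick pre-flight check of this layout can catch a misplaced dataset before a long evaluation run. This is a minimal sketch that checks only the two top-level directories named above; the helper name `missing_dataset_dirs` is illustrative, and the contents of each directory are described in the data preparation docs rather than checked here.

```python
from pathlib import Path

# Top-level subdirectories named in the layout above.
REQUIRED_SUBDIRS = ("coco_2017", "mixed_parts_data")


def missing_dataset_dirs(dataset_dir):
    """Return the required subdirectories that are absent from dataset_dir."""
    root = Path(dataset_dir)
    return [name for name in REQUIRED_SUBDIRS if not (root / name).is_dir()]


if __name__ == "__main__":
    missing = missing_dataset_dirs("./data")
    if missing:
        print("Missing under ./data:", ", ".join(missing))
    else:
        print("Top-level dataset directories are present.")
```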

## Training and Fine-Tuning

Fine-tuning is supported through `train.py` in the CALICO repository. See [docs/TRAINING.md](https://github.com/PLAN-Lab/CALICO/blob/main/docs/TRAINING.md) for recommended arguments and data layout.

## Citation

If you use CALICO or Mixed Parts, please cite:

```bibtex
@article{nguyen2025calico,
  title={CALICO: Part-Focused Semantic Co-Segmentation with Large Vision-Language Models},
  author={Nguyen, Kiet A. and Juvekar, Adheesh and Yu, Tianjiao and Wahed, Muntasir and Lourentzou, Ismini},
  journal={In Proceedings for the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2025}
}
```