---
license: mit
language:
- en
tags:
- biology
- CV
- images
- animals
- image classification
- fine-grained classification
- birds
- pets
- interpretable
- transformers
- vision transformer
- prompt tuning
- explainable AI
- interpretable machine learning
- saliency map
- dino
metrics:
- accuracy
model_description: >-
Prompt-CAM is a simple yet effective interpretable transformer for fine-grained
image classification and analysis. It injects class-specific prompts into any
pretrained Vision Transformer (ViT) to produce per-class attention maps without
requiring any architectural modifications. This allows for exploration of fine-grained
trait distinctions between different specified species.
---
# Model Card for Prompt-CAM: Making Vision Transformers Interpretable for Fine-Grained Analysis
Prompt-CAM checkpoints are trained with ViT-B DINO and DINOv2 backbones on fine-grained image classification datasets (CUB-200-2011, Oxford-IIIT Pet, Stanford Cars, Stanford Dogs, Birds-525). The checkpoints produce per-class attention maps for exploring fine-grained trait distinctions between specified species or categories.
<!-- This modelcard aims to be a base template for new models. It has been generated using [this raw template](https://github.com/huggingface/huggingface_hub/blob/main/src/huggingface_hub/templates/modelcard_template.md?plain=1). And further altered to suit Imageomics Institute needs -->
## Model Details
### Model Description
Prompt-CAM is a **simple yet effective interpretable transformer** that requires no architectural modifications to pretrained ViTs. It injects **class-specific prompts** into any ViT to make attention maps interpretable for fine-grained analysis. The prompts act as class queries, and the attention between each prompt and the image patches produces human-interpretable heatmaps highlighting the visual traits the model uses to distinguish each class.
- **Developed by:** Arpita Chowdhury, Dipanjyoti Paul, Zheda Mai, Jianyang Gu, Ziheng Zhang, Kazi Sajeed Mehrab, Elizabeth G. Campolongo, Daniel Rubenstein, Charles V. Stewart, Anuj Karpatne, Tanya Berger-Wolf, Yu Su, and Wei-Lun Chao
- **Model type:** Vision Transformer with class-specific prompt injection
- **License:** MIT
- **Fine-tuned from model:** [ViT-B DINO](https://dl.fbaipublicfiles.com/dino/dino_vitbase16_pretrain/dino_vitbase16_pretrain.pth) and [ViT-B DINOv2](https://dl.fbaipublicfiles.com/dinov2/dinov2_vitb14/dinov2_vitb14_pretrain.pth)
### Model Sources
- **Repository:** [Imageomics/Prompt_CAM](https://github.com/Imageomics/Prompt_CAM)
- **Paper:** [Prompt-CAM: Making Vision Transformers Interpretable for Fine-Grained Analysis (CVPR 2025)](https://doi.org/10.1109/CVPR52734.2025.00413), [Open-Access](https://openaccess.thecvf.com/content/CVPR2025/papers/Chowdhury_Prompt-CAM_Making_Vision_Transformers_Interpretable_for_Fine-Grained_Analysis_CVPR_2025_paper.pdf) <!-- https://arxiv.org/pdf/2501.09333-->
- **Demo:** [Interactive Colab demo](https://colab.research.google.com/drive/1co1P5LXSVb-g0hqv8Selfjq4WGxSpIFe?usp=sharing) and local [demo.ipynb](https://github.com/Imageomics/Prompt_CAM/blob/main/demo.ipynb)
## Uses
### Direct Use
Prompt-CAM can be used directly for:
- **Fine-grained image classification:** predicting the species or category of an image among a large set of visually similar classes.
- **Visual interpretability:** generating per-class attention heatmaps that highlight which image regions and traits the model uses for each class, supporting scientific understanding of what distinguishes species or categories.
### Downstream Use
Prompt-CAM can be extended to new fine-grained datasets by following the [extension instructions](https://github.com/Imageomics/Prompt_CAM#to-add-a-new-dataset) in the repository. It is well-suited for biological image datasets where understanding discriminative traits (e.g., plumage patterns, markings) is as important as classification accuracy.
### Out-of-Scope Use
- The model is not designed for general-purpose object detection or segmentation.
- Performance may degrade significantly on image domains far from the training distribution (e.g., applying a bird-trained model to medical images).
## Bias, Risks, and Limitations
- Prompt-CAM inherits any biases present in the pretrained ViT backbone (DINO / DINOv2) and in the fine-tuning datasets (e.g., geographic or photographic biases in CUB-200-2011).
- Classification performance is tied to image quality; low-resolution or heavily occluded images may yield less reliable predictions and attention maps.
### Recommendations
Users should treat attention heatmaps as model explanations that need verification by domain experts, not as ground-truth biological annotations.
## How to Get Started with the Model
Set up the environment (using [`env_setup.sh`](https://github.com/Imageomics/Prompt_CAM/blob/main/env_setup.sh)):
```bash
conda create -n prompt_cam python=3.10
conda activate prompt_cam
source env_setup.sh
```
Download a checkpoint from this repository (see the [Training Data](#training-data) table below) and save it as `checkpoints/{backbone}/{dataset}/model.pt`. Then visualize class-specific attention maps by running:
```bash
CUDA_VISIBLE_DEVICES=0 python visualize.py \
--config ./experiment/config/prompt_cam/dino/cub/args.yaml \
--checkpoint ./checkpoints/dino/cub/model.pt \
--vis_cls 23
```
Output heatmaps are saved to `visualization/dino/cub/class_23/`.
For an interactive experience, see the [Colab demo](https://colab.research.google.com/drive/1co1P5LXSVb-g0hqv8Selfjq4WGxSpIFe?usp=sharing) or [demo.ipynb](https://github.com/Imageomics/Prompt_CAM/blob/main/demo.ipynb).
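The checkpoint placement step above can also be scripted. The following is a minimal sketch, assuming `huggingface_hub` is installed; the helper names `checkpoint_destination` and `fetch_checkpoint` are hypothetical and are not part of the Prompt-CAM repository.

```python
from pathlib import Path
import shutil


def checkpoint_destination(backbone: str, dataset: str, root: str = "checkpoints") -> Path:
    """Return checkpoints/{backbone}/{dataset}/model.pt under `root`."""
    return Path(root) / backbone / dataset / "model.pt"


def fetch_checkpoint(filename: str, backbone: str, dataset: str,
                     repo_id: str = "imageomics/Prompt-CAM") -> Path:
    """Download `filename` from the Hub and copy it to the expected location."""
    # Imported lazily so the path helper works without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    cached = hf_hub_download(repo_id=repo_id, filename=filename)
    dest = checkpoint_destination(backbone, dataset)
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy(cached, dest)
    return dest
```

For the CUB checkpoint listed in the Training Data table, the call would be `fetch_checkpoint("Prompt_CAM_checkpoint_dino_cub.pt", "dino", "cub")`, which matches the `--checkpoint` path used in the command above.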
## Training Details
### Training Data
Each checkpoint is trained on the official training split of its respective dataset.
| Backbone | Dataset | Checkpoint |
|----------|---------|------------|
| DINO (ViT-B/16) | [CUB-200-2011](https://www.vision.caltech.edu/datasets/cub_200_2011/) | [Prompt_CAM_checkpoint_dino_cub.pt](https://huggingface.co/imageomics/Prompt-CAM/resolve/main/Prompt_CAM_checkpoint_dino_cub.pt) |
| DINO (ViT-B/16) | [Stanford Cars](https://ai.stanford.edu/~jkrause/cars/car_dataset.html) | Coming soon |
| DINO (ViT-B/16) | [Stanford Dogs](http://vision.stanford.edu/aditya86/ImageNetDogs/) | Coming soon |
| DINO (ViT-B/16) | [Oxford-IIIT Pet](https://www.robots.ox.ac.uk/~vgg/data/pets/) | Coming soon |
| DINO (ViT-B/16) | [Birds-525](https://www.kaggle.com/datasets/gpiosenka/100-bird-species) | Coming soon |
| DINOv2 (ViT-B/14) | [CUB-200-2011](https://www.vision.caltech.edu/datasets/cub_200_2011/) | Coming soon |
| DINOv2 (ViT-B/14) | [Stanford Dogs](http://vision.stanford.edu/aditya86/ImageNetDogs/) | Coming soon |
| DINOv2 (ViT-B/14) | [Oxford-IIIT Pet](https://www.robots.ox.ac.uk/~vgg/data/pets/) | Coming soon |
### Training Procedure
Only the class-specific prompt tokens are trained; the pretrained ViT backbone weights are kept frozen. The number of prompt tokens equals the number of classes in the dataset.
Dataset images are organized as:
```
dataset_name/
├── train/
│   ├── 001.ClassName/
│   │   ├── img1.jpg
│   │   └── ...
│   └── ...
└── val/
    ├── 001.ClassName/
    │   ├── img2.jpg
    │   └── ...
    └── ...
```
Please see the [Data Preparation section](https://github.com/Imageomics/Prompt_CAM#data-preparation) of our GitHub repository for more details on training and validation setup, including preprocessing scripts.
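Before training, it can help to confirm that a dataset folder follows the layout above. This is a small sketch using only the standard library; the function names and the set of accepted image suffixes are assumptions for illustration, not the repository's preprocessing scripts.

```python
from pathlib import Path

# Assumed image extensions; adjust to the dataset at hand.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def index_split(split_dir: Path) -> dict[str, list[Path]]:
    """Map each class folder name to its sorted image files."""
    index = {}
    for class_dir in sorted(p for p in split_dir.iterdir() if p.is_dir()):
        images = sorted(f for f in class_dir.iterdir()
                        if f.suffix.lower() in IMAGE_SUFFIXES)
        index[class_dir.name] = images
    return index


def check_layout(root: str) -> list[str]:
    """Return the sorted class names after checking that train/ and val/ agree."""
    train = index_split(Path(root) / "train")
    val = index_split(Path(root) / "val")
    if set(train) != set(val):
        raise ValueError(f"class folders differ: {sorted(set(train) ^ set(val))}")
    empty = [name for name, files in train.items() if not files]
    if empty:
        raise ValueError(f"no training images for: {empty}")
    return sorted(train)
```

The length of the returned list is the number of classes, which per the Training Procedure section is also the number of prompt tokens.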
#### Preprocessing
| Step | Train | Val |
|------|-------|-----|
| Resize | 240 × 240 | 224 × 224 |
| Crop | RandomCrop 224 × 224 | None |
| Flip | RandomHorizontalFlip | None |
| Normalize | ImageNet Inception mean/std | ImageNet Inception mean/std |
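
The training column of the table can be sketched in NumPy as follows. This assumes the image has already been resized to 240 × 240, and that "ImageNet Inception mean/std" refers to the `timm` Inception constants (0.5 per channel for both mean and std); the repository's actual transform pipeline may differ. The function names are hypothetical.

```python
import numpy as np

# Assumed to match timm's Inception normalization: 0.5 for every channel.
INCEPTION_MEAN = np.array([0.5, 0.5, 0.5], dtype=np.float32)
INCEPTION_STD = np.array([0.5, 0.5, 0.5], dtype=np.float32)


def normalize(image_uint8: np.ndarray) -> np.ndarray:
    """Scale an HxWx3 uint8 image to [0, 1], then apply mean/std."""
    scaled = image_uint8.astype(np.float32) / 255.0
    return (scaled - INCEPTION_MEAN) / INCEPTION_STD


def train_augment(image_240: np.ndarray, rng: np.random.Generator,
                  crop: int = 224) -> np.ndarray:
    """Random crop and random horizontal flip on an already resized image."""
    height, width = image_240.shape[:2]
    top = int(rng.integers(0, height - crop + 1))
    left = int(rng.integers(0, width - crop + 1))
    patch = image_240[top:top + crop, left:left + crop]
    if rng.random() < 0.5:
        patch = patch[:, ::-1]  # flip left-right
    return normalize(patch)
```

With these constants, pixel values end up in the range [-1, 1]. Validation images would skip `train_augment` and go straight from a 224 × 224 resize to `normalize`.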
#### Training Hyperparameters
| Hyperparameter | Value |
|----------------|-------|
| Optimizer | SGD |
| Learning rate | 0.001 to 0.005 (dataset-dependent) |
| Min LR | 1e-6 |
| Momentum | 0.9 |
| Weight decay | 0.001 |
| Epochs | 100 to 130 |
| Warmup epochs | 20 |
| Warmup LR init | 1e-6 |
| Batch size | 16 |
| Drop path rate | 0.0 |
| VPT dropout | 0.0 |
| Precision | fp32 |
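
To illustrate how the warmup and minimum learning-rate values interact, here is a sketch of a per-epoch schedule. The table does not name the decay shape, so the cosine decay after warmup is an assumption, as is the function name; the defaults take the lower end of the ranges in the table.

```python
import math


def learning_rate(epoch: int, base_lr: float = 0.001, total_epochs: int = 100,
                  warmup_epochs: int = 20, warmup_lr_init: float = 1e-6,
                  min_lr: float = 1e-6) -> float:
    """Linear warmup followed by cosine decay (the decay shape is an assumption)."""
    if epoch < warmup_epochs:
        # Climb linearly from warmup_lr_init toward base_lr.
        step = (base_lr - warmup_lr_init) / warmup_epochs
        return warmup_lr_init + step * epoch
    # Fraction of the post-warmup phase that has elapsed, in [0, 1].
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

Under these assumptions the rate starts at 1e-6, reaches the base value at epoch 20, and returns to 1e-6 at the final epoch.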
#### Speeds, Sizes, Times
- **Hardware:** NVIDIA RTX A6000
- **Training time:** ≤ 1 hour per checkpoint
- **Checkpoint size:** ~350 MB (ViT-B backbone + prompt tokens)
## Evaluation
To evaluate a checkpoint on the test split, run:
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 main.py \
--config ./experiment/config/prompt_cam/dino/cub/args.yaml \
--gpu_num 4
```
### Testing Data, Factors & Metrics
#### Testing Data
Each model is evaluated on the official test (val) split of its training dataset.
| Backbone | Dataset | Checkpoint |
|----------|---------|------------|
| DINO (ViT-B/16) | [CUB-200-2011](https://www.vision.caltech.edu/datasets/cub_200_2011/) | [Prompt_CAM_checkpoint_dino_cub.pt](https://huggingface.co/imageomics/Prompt-CAM/resolve/main/Prompt_CAM_checkpoint_dino_cub.pt) |
| DINO (ViT-B/16) | [Stanford Cars](https://ai.stanford.edu/~jkrause/cars/car_dataset.html) | Coming soon |
| DINO (ViT-B/16) | [Stanford Dogs](http://vision.stanford.edu/aditya86/ImageNetDogs/) | Coming soon |
| DINO (ViT-B/16) | [Oxford-IIIT Pet](https://www.robots.ox.ac.uk/~vgg/data/pets/) | Coming soon |
| DINO (ViT-B/16) | [Birds-525](https://www.kaggle.com/datasets/gpiosenka/100-bird-species) | Coming soon |
| DINOv2 (ViT-B/14) | [CUB-200-2011](https://www.vision.caltech.edu/datasets/cub_200_2011/) | Coming soon |
| DINOv2 (ViT-B/14) | [Stanford Dogs](http://vision.stanford.edu/aditya86/ImageNetDogs/) | Coming soon |
| DINOv2 (ViT-B/14) | [Oxford-IIIT Pet](https://www.robots.ox.ac.uk/~vgg/data/pets/) | Coming soon |
#### Metrics
Top-1 classification accuracy on the official test split.
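For reference, top-1 accuracy counts a prediction as correct only when the highest-scoring class equals the label. A minimal NumPy sketch, with a hypothetical function name and not taken from the repository's evaluation code:

```python
import numpy as np


def top1_accuracy(logits: np.ndarray, labels: np.ndarray) -> float:
    """Percentage of samples whose highest-scoring class matches the label."""
    predictions = logits.argmax(axis=1)
    return float((predictions == labels).mean() * 100.0)
```

Here `logits` has shape `(num_samples, num_classes)` and `labels` holds integer class indices of shape `(num_samples,)`.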
### Results
| Backbone | Dataset | acc@1 |
|----------|---------|-------|
| DINO (ViT-B/16) | CUB-200-2011 | 73.2 |
| DINO (ViT-B/16) | Stanford Cars | 83.2 |
| DINO (ViT-B/16) | Stanford Dogs | 81.1 |
| DINO (ViT-B/16) | Oxford-IIIT Pet | 91.3 |
| DINO (ViT-B/16) | Birds-525 | 98.8 |
| DINOv2 (ViT-B/14) | CUB-200-2011 | 74.1 |
| DINOv2 (ViT-B/14) | Stanford Dogs | 81.3 |
| DINOv2 (ViT-B/14) | Oxford-IIIT Pet | 92.7 |
## Environmental Impact
Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://doi.org/10.48550/arXiv.1910.09700).
- **Hardware Type:** NVIDIA RTX A6000
- **Hours used:** ≤ 1 hour per checkpoint
- **Cloud Provider:** N/A (local cluster)
- **Compute Region:** United States
- **Carbon Emitted:** ~0.13 kg CO<sub>2</sub> eq. per checkpoint
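
An estimate of this kind is energy use multiplied by grid carbon intensity. The sketch below shows the arithmetic; the 300 W power draw and the 0.432 kg CO2 eq. per kWh intensity are assumed inputs chosen for illustration and are not stated in this card.

```python
def carbon_kg(power_watts: float, hours: float, kg_co2_per_kwh: float) -> float:
    """Energy drawn (kWh) multiplied by the grid's carbon intensity."""
    energy_kwh = power_watts / 1000.0 * hours
    return energy_kwh * kg_co2_per_kwh


# Assumed inputs: 300 W board power, 1 hour, 0.432 kg CO2 eq. per kWh.
estimate = carbon_kg(power_watts=300.0, hours=1.0, kg_co2_per_kwh=0.432)
print(f"{estimate:.2f} kg CO2 eq.")
```

With these assumed inputs the result rounds to 0.13 kg, in line with the figure listed above; a different power draw or regional intensity would change it proportionally.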
## Technical Specifications
### Model Architecture and Objective
Prompt-CAM adds a set of learnable class-specific prompt tokens to the input sequence of a frozen pretrained ViT. Each prompt token corresponds to one class. During the forward pass, the self-attention between each class prompt and the image patch tokens produces a spatial attention map that reveals which patches are most relevant for that class. Only the prompt tokens are optimized during training; all ViT parameters remain frozen.
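The mechanism described above can be sketched for a single attention head in NumPy. This is an illustrative simplification rather than the repository's implementation: the projection weights are passed in as plain matrices, there is no CLS token, no multi-head split, and the function name is hypothetical.

```python
import numpy as np


def prompt_attention_maps(prompts: np.ndarray, patches: np.ndarray,
                          w_q: np.ndarray, w_k: np.ndarray,
                          grid: tuple[int, int]) -> np.ndarray:
    """Attention from each class prompt to the image patches, as 2D maps.

    prompts: (C, D) class prompt tokens, one per class.
    patches: (N, D) patch tokens, with N == grid[0] * grid[1].
    """
    num_classes = prompts.shape[0]
    tokens = np.concatenate([prompts, patches], axis=0)    # (C + N, D)
    queries = prompts @ w_q                                 # (C, d)
    keys = tokens @ w_k                                     # (C + N, d)
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])  # (C, C + N)
    scores = scores - scores.max(axis=-1, keepdims=True)    # numerical stability
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=-1, keepdims=True)          # softmax over all tokens
    patch_attn = attn[:, num_classes:]                      # keep patch columns only
    return patch_attn.reshape(num_classes, *grid)
```

For a 224 × 224 input with 16 × 16 patches the grid is 14 × 14, so each class prompt yields one 14 × 14 map that can be upsampled and overlaid on the image. Because the softmax runs over prompts and patches together, each map sums to at most 1.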
### Compute Infrastructure
#### Hardware
NVIDIA RTX A6000 GPU.
#### Software
- Python 3.10
- PyTorch
- timm 1.0.24
See [`env_setup.sh`](https://github.com/Imageomics/Prompt_CAM/blob/main/env_setup.sh) for the full environment.
## Citation
**BibTeX:**
[Paper (DOI)](https://doi.org/10.1109/CVPR52734.2025.00413) | [Open-access PDF](https://openaccess.thecvf.com/content/CVPR2025/papers/Chowdhury_Prompt-CAM_Making_Vision_Transformers_Interpretable_for_Fine-Grained_Analysis_CVPR_2025_paper.pdf)
If you find our work helpful, please consider citing our paper:
```bibtex
@inproceedings{Chowdhury_2025_CVPR,
author = {Chowdhury, Arpita and Paul, Dipanjyoti and Mai, Zheda and Gu, Jianyang and Zhang, Ziheng and Mehrab, Kazi Sajeed and Campolongo, Elizabeth G. and Rubenstein, Daniel and Stewart, Charles V. and Karpatne, Anuj and Berger-Wolf, Tanya and Su, Yu and Chao, Wei-Lun},
title = {Prompt-CAM: Making Vision Transformers Interpretable for Fine-Grained Analysis},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2025},
pages = {4375--4385},
doi = {10.1109/CVPR52734.2025.00413}
}
```
Model Citation:
```bibtex
@software{Chowdhury_Prompt_CAM_2025,
author = {Chowdhury, Arpita and Paul, Dipanjyoti and Mai, Zheda and Gu, Jianyang and Zhang, Ziheng and Mehrab, Kazi Sajeed and Campolongo, Elizabeth G. and Rubenstein, Daniel and Stewart, Charles V. and Karpatne, Anuj and Berger-Wolf, Tanya and Su, Yu and Chao, Wei-Lun},
license = {MIT},
title = {{Prompt-CAM: Making Vision Transformers Interpretable for Fine-Grained Analysis}},
doi = {<doi once generated>},
url = {https://huggingface.co/imageomics/Prompt-CAM},
version = {1.0.0},
month = {June},
year = {2026}
}
```
**APA:**
Paper:
Chowdhury, A., Paul, D., Mai, Z., Gu, J., Zhang, Z., Mehrab, K. S., Campolongo, E. G., Rubenstein, D., Stewart, C. V., Karpatne, A., Berger-Wolf, T., Su, Y., & Chao, W.-L. (2025). Prompt-CAM: Making Vision Transformers Interpretable for Fine-Grained Analysis. *2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)*, 4375–4385. doi:10.1109/CVPR52734.2025.00413
Model Citation:
Chowdhury, A., Paul, D., Mai, Z., Gu, J., Zhang, Z., Mehrab, K. S., Campolongo, E. G., Rubenstein, D., Stewart, C. V., Karpatne, A., Berger-Wolf, T., Su, Y., & Chao, W.-L. (2025). Prompt-CAM: Making Vision Transformers Interpretable for Fine-Grained Analysis (Version 1.0.0). https://huggingface.co/imageomics/Prompt-CAM
## Acknowledgements
Our model builds on pretrained [DINO](https://github.com/facebookresearch/dino) and [DINOv2](https://github.com/facebookresearch/dinov2) ViT backbones. We thank the authors for their excellent work.
We also acknowledge:
- [VPT](https://github.com/KMnP/vpt)
- [PETL_VISION](https://github.com/OSU-MLB/PETL_Vision)
This work was supported by the [Imageomics Institute](https://imageomics.org), which is funded by the US National Science Foundation's Harnessing the Data Revolution (HDR) program under [Award #2118240](https://www.nsf.gov/awardsearch/showAward?AWD_ID=2118240) (Imageomics: A New Frontier of Biological Information Powered by Knowledge-Guided Machine Learning). Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
## Model Card Authors
Arpita Chowdhury
## Model Card Contact
Arpita Chowdhury: [GitHub Issues](https://github.com/Imageomics/Prompt_CAM/issues), email: arpitachowdhurytonney@gmail.com