---
license: apache-2.0
tags:
- image-cropping
- aesthetic-cropping
- computer-vision
- retrieval-augmented
- conditional-detr
pipeline_tag: image-to-image
library_name: pytorch
datasets:
- BWGZK/procrop_dataset
language:
- en
---
# ProCrop: Learning Aesthetic Image Cropping from Professional Compositions
[![arXiv](https://img.shields.io/badge/arXiv-2505.22490-b31b1b.svg)](https://arxiv.org/abs/2505.22490)
[![GitHub](https://img.shields.io/badge/GitHub-ProCrop-blue)](https://github.com/BWGZK-keke/ProCrop)
This is the **primary supervised checkpoint** for the AAAI 2026 paper "ProCrop: Learning Aesthetic Image Cropping from Professional Compositions" by Zhang et al.
## Model Description
ProCrop is a retrieval-augmented framework for aesthetic image cropping that leverages professional photography compositions as guidance. Given a query image, ProCrop:
1. **Retrieves** compositionally similar professional images from a large database (AVA / CGL) using SAM embeddings and Faiss nearest-neighbor search.
2. **Fuses** retrieved features with the query via cross-attention.
3. **Predicts** diverse crop proposals ranked by aesthetic score using a Conditional DETR decoder.
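The retrieval step can be illustrated with a brute-force nearest-neighbor search over an embedding database. The actual pipeline uses Faiss over precomputed SAM embeddings; the function name, the 256-dim embeddings, and the cosine-similarity metric below are illustrative stand-ins, not the repo's code:

```python
import numpy as np

def retrieve_top_k(query_emb, db_embs, k=5):
    # Cosine similarity between the query embedding and every database entry.
    q = query_emb / np.linalg.norm(query_emb)
    db = db_embs / np.linalg.norm(db_embs, axis=1, keepdims=True)
    sims = db @ q
    # Indices of the k most compositionally similar professional images.
    return np.argsort(-sims)[:k]

rng = np.random.default_rng(0)
db = rng.standard_normal((1000, 256))          # stand-in for the SAM embedding database
q = db[42] + 0.01 * rng.standard_normal(256)   # query close to database entry 42
print(retrieve_top_k(q, db, k=3))
```

Faiss replaces the brute-force matrix product with an index, which matters at the scale of the AVA/CGL databases but does not change the logic.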
## Reported Performance (FLMS supervised setting)
| Metric | Value |
|--------|-------|
| **IoU** | **0.843** |
| **BDE (Displacement)** | **0.036** |
This checkpoint matches the FLMS row of Table 3 in the paper.
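For reference, the two metrics can be computed as below. This is a generic sketch of the standard IoU and boundary displacement error (BDE) definitions for corner-format boxes, not the repo's exact evaluation code:

```python
def iou(a, b):
    """Intersection over union for boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def bde(a, b, w, h):
    """Mean displacement of the four box edges, normalized by image width/height."""
    dx = (abs(a[0] - b[0]) + abs(a[2] - b[2])) / w
    dy = (abs(a[1] - b[1]) + abs(a[3] - b[3])) / h
    return (dx + dy) / 4.0

pred = (10, 10, 110, 110)
gt = (20, 20, 120, 120)
print(iou(pred, gt), bde(pred, gt, 200, 200))
```

Higher IoU and lower BDE are better; a BDE of 0.036 means the predicted crop edges are displaced by under 4% of the image dimension on average.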
## Checkpoint Details
| Property | Value |
|----------|-------|
| File | `procrop_flms_supervised.pth` |
| Size | 512 MB |
| Original filename | `checkpoint0008200.8425250053405762.pth` |
| Trainable params | ~44.8M |
| Backbone | ResNet-50 (DC5) + Transformer encoder/decoder |
| Training data | CPCDataset (supervised) + AVA retrieval references |
| Evaluation | FLMS test set, IoU = 0.8425 |
| Training epoch | 83 |
| Crop queries | 24 (Conditional DETR style) |
## How to Use
### 1. Clone the GitHub repository
```bash
git clone https://github.com/BWGZK-keke/ProCrop.git
cd ProCrop
pip install -r requirements.txt
pip install git+https://github.com/openai/CLIP.git
```
### 2. Download this checkpoint
```python
from huggingface_hub import hf_hub_download
ckpt_path = hf_hub_download(
    repo_id="BWGZK/ProCrop",
    filename="procrop_flms_supervised.pth",
)
```
Or with the CLI:
```bash
huggingface-cli download BWGZK/ProCrop procrop_flms_supervised.pth --local-dir ./checkpoints
```
### 3. Run inference on a single image
```bash
cd cropping
python test_singleimage.py \
    --dataset_root /path/to/your/images \
    --retrieval_cache_dir /path/to/retrieval_tables \
    --retrieval_img_dir /path/to/CGL_images \
    --resume ./checkpoints/procrop_flms_supervised.pth \
    --crop_savepath ./results
```
### 4. Evaluate on FLMS
```bash
cd cropping
python main_cpc.py \
    --dataset_root /path/to/FLMS \
    --retrieval_cache_dir /path/to/retrieval_tables \
    --resume ./checkpoints/procrop_flms_supervised.pth \
    --eval
```
You also need:
- **Precomputed retrieval tables** from [BWGZK/procrop_dataset](https://huggingface.co/datasets/BWGZK/procrop_dataset)
- **SAM ViT-B checkpoint** if training on GAIC/CAD: [download here](https://dl.fbaipublicfiles.com/segment_anything/sam_vit_b_01ec64.pth)
## Architecture
ProCrop extends **Conditional DETR** with a retrieval augmentation module:
- **Backbone**: ResNet-50 with dilated C5 stage
- **Encoder**: 6-layer transformer encoder for the query image
- **Retrieval fusion**: Cross-attention between query features and top-K retrieved SAM embeddings (64×256)
- **Decoder**: 6-layer transformer decoder with N=24 learnable crop queries
- **Heads**:
- 4-dim bounding-box MLP (3 layers)
- 1-dim aesthetic-score classification head (binary focal loss)
- **EMA self-distillation**: Mean-teacher framework for weakly-supervised training on CAD
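The aesthetic-score head is trained with a binary (sigmoid) focal loss. A minimal NumPy sketch of that loss follows; the α=0.25, γ=2.0 defaults are the usual ones from the focal-loss literature, assumed here rather than read from the repo's config:

```python
import numpy as np

def binary_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    # Sigmoid focal loss: down-weights easy, well-classified examples.
    p = 1.0 / (1.0 + np.exp(-logits))
    pt = np.where(targets == 1, p, 1.0 - p)          # prob. of the true class
    at = np.where(targets == 1, alpha, 1.0 - alpha)  # class balancing weight
    return float(np.mean(-at * (1.0 - pt) ** gamma * np.log(pt + 1e-8)))

logits = np.array([2.0, -1.0, 0.5])   # raw scores from the aesthetic head
targets = np.array([1, 0, 1])         # 1 = good crop, 0 = bad crop
print(binary_focal_loss(logits, targets))
```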
Core implementation: [`cropping/models/conditional_detr_cpc.py`](https://github.com/BWGZK-keke/ProCrop/blob/main/cropping/models/conditional_detr_cpc.py)
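At inference time, with 24 scored crop proposals per image, selecting the final crop reduces to an argmax over the aesthetic scores. A sketch, assuming DETR-style normalized (cx, cy, w, h) box outputs (the actual post-processing lives in the repo's model code):

```python
import numpy as np

def pick_best_crop(boxes, scores):
    """boxes: (N, 4) normalized (cx, cy, w, h); scores: (N,) aesthetic scores."""
    best = int(np.argmax(scores))
    cx, cy, w, h = boxes[best]
    # Center format -> corner format, clamped to the unit image.
    x1, y1 = max(0.0, cx - w / 2), max(0.0, cy - h / 2)
    x2, y2 = min(1.0, cx + w / 2), min(1.0, cy + h / 2)
    return x1, y1, x2, y2

boxes = np.array([[0.5, 0.5, 0.8, 0.6],
                  [0.4, 0.5, 0.9, 0.9]])
scores = np.array([0.2, 1.3])
print(pick_best_crop(boxes, scores))
```

Keeping all 24 ranked proposals instead of only the top one is what gives ProCrop its diverse crop suggestions.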
## Related Resources
- **Code (GitHub)**: https://github.com/BWGZK-keke/ProCrop
- **Paper (arXiv)**: https://arxiv.org/abs/2505.22490
- **Dataset (HuggingFace)**: https://huggingface.co/datasets/BWGZK/procrop_dataset
- CAD dataset (242K weakly annotated images)
- Precomputed retrieval tables
- Pre-extracted SAM embedding databases
## Citation
```bibtex
@article{ProCrop2025,
  title={ProCrop: Learning Aesthetic Image Cropping from Professional Compositions},
  author={Zhang, Ke and Ding, Tianyu and Jiang, Jiachen and Chen, Tianyi and Zharkov, Ilya and Patel, Vishal M. and Liang, Luming},
  journal={arXiv preprint arXiv:2505.22490},
  year={2025}
}
```
## License
Apache 2.0. The model builds on [ConditionalDETR](https://github.com/Atten4Vis/ConditionalDETR), [RALF](https://github.com/CyberAgentAILab/RALF), and [Segment Anything](https://github.com/facebookresearch/segment-anything) — please consult their respective licenses.