---
license: mit
language:
- en
library_name: pytorch
tags: [crowd-counting, localization, PET]
---
# Hierarchical Training on Partial Annotations Enables Density-Robust Crowd Counting and Localization
## Abstract
Reliable crowd analysis requires both accurate counting and precise head-point
localization under severe density and scale variation. In practice, dense
scenes exhibit heavy occlusion and perspective distortion, while the same
camera can undergo abrupt distribution shifts over time due to zoom and
viewpoint changes or event dynamics. We present a model obtained by fine-tuning
the Point Query Transformer (PET) on a
curated, multi-source dataset with partial and heterogeneous annotations. Our
training recipe combines (i) a hierarchical iterative loop that aligns count
distributions across partial ground truth, fine-tuned predictions, and the
pre-trained baseline to guide outlier-driven data refinement, (ii)
multi-resolution patch training (128×128, 256×256, and 512×512) to reduce
scale sensitivity, (iii) count-aware patch sampling to mitigate long-tailed
density skew, and (iv) adaptive background-query loss weighting to prevent
resolution-dependent background dominance. This approach improves F1@4px and
F1@8px scores on ShanghaiTech Part A (SHHA), ShanghaiTech Part B (SHHB),
JHU-Crowd++, and UCF-QNRF, and exhibits more stable behavior during
sparse-to-dense density transitions.
For the full data-curation and training recipe, refer to our technical
report: [Technical Report](TechnicalReport.pdf).
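To make the count-aware sampling idea concrete, here is a minimal, hypothetical
sketch (not the release training code): patches are drawn with probability
proportional to their head count, floored at 1 so sparse patches stay
reachable, and each draw is paired with one of the three training resolutions.
```python
import random

# Hypothetical sketch of count-aware patch sampling with mixed resolutions.
# `patches` is assumed to be a list of (image_crop, head_points) pairs; the
# actual pipeline's data structures and sampling schedule may differ.
PATCH_SIZES = [128, 256, 512]

def sample_patch(patches):
    # Weight each candidate by its head count (floored at 1) so dense
    # patches are over-sampled relative to the long sparse tail.
    weights = [max(len(points), 1) for _, points in patches]
    crop, points = random.choices(patches, weights=weights, k=1)[0]
    # Pair the draw with a random training resolution to reduce
    # scale sensitivity.
    size = random.choice(PATCH_SIZES)
    return crop, points, size
```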
## Evaluation and Results
Across four benchmarks, PET-Finetuned shows the strongest overall transfer,
with consistent gains in both counting and localization on SHHB, UCF-QNRF, and
JHU-Crowd++. On SHHB, it reduces MAE/MSE to 13.794/22.163 from 19.472/29.651
(PET-SHHA) and 19.579/28.398 (APGCC-SHHA), while increasing F1@8 to 0.820.
The same pattern holds on UCF-QNRF (MAE 105.772, MSE 199.544, F1@8 0.738) and
JHU-Crowd++ (MAE 74.778, MSE 271.886, F1@8 0.698), where PET-Finetuned
outperforms both references by clear margins. On SHHA, counting error is higher
than for PET-SHHA and APGCC-SHHA (MAE 62.742 vs. 48.879/48.725), but
localization is the best in the table (F1@4px 0.614, F1@8px 0.794), indicating
a stronger precision-recall balance for head-point prediction at both matching
thresholds.
> **Note (evaluation protocol):** PET-SHHA and APGCC-SHHA numbers in this
> section can differ from values reported in the original papers. The original
> works typically train one model per target dataset and evaluate in-domain. In
> contrast, `PET-Finetuned(Ours)` is initialized from PET-SHHA weights and
> fine-tuned in our framework. For cross-dataset baseline comparison, we use
> the best public SHHA checkpoints released by the authors for PET-SHHA and
> APGCC-SHHA (the APGCC authors publicly release only their SHHA-best
> checkpoint).
> Therefore, the PET-SHHA and APGCC-SHHA rows above reflect transfer from SHHA
> initialization rather than per-dataset retraining. All metrics in this
> section are evaluated at `threshold = 0.5`.
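For reference, F1@kpx metrics of this kind are commonly computed by one-to-one
point matching within a pixel threshold. The sketch below is an illustrative
implementation using Hungarian matching; the official evaluation scripts may
differ in matching details.
```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def f1_at_k(pred, gt, k):
    """Match predicted and GT head points one-to-one (Hungarian assignment
    on Euclidean distance); matched pairs within k pixels count as TP."""
    if len(pred) == 0 or len(gt) == 0:
        return 0.0
    dist = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(dist)
    tp = int((dist[rows, cols] <= k).sum())
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gt)
    return 2 * precision * recall / (precision + recall)

# Example: pred and gt are (N, 2) arrays of (x, y) head coordinates.
pred = np.array([[10.0, 12.0], [50.0, 48.0]])
gt = np.array([[11.0, 12.0], [80.0, 80.0]])
print(f1_at_k(pred, gt, k=4))  # 0.5: one of two points matched within 4 px
```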
### ShanghaiTech Part A (SHHA)
| Model | MAE | MSE | AP@4px | AR@4px | F1@4px | AP@8px | AR@8px | F1@8px |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PET-Finetuned(Ours) | 62.742 | 102.996 | **0.615** | **0.613** | **0.614** | **0.796** | **0.793** | **0.794** |
| PET-SHHA | 48.879 | **76.520** | 0.596 | 0.604 | 0.600 | 0.781 | 0.792 | 0.786 |
| APGCC-SHHA | **48.725** | 76.721 | 0.439 | 0.428 | 0.433 | 0.773 | 0.754 | 0.764 |
### ShanghaiTech Part B (SHHB)
| Model | MAE | MSE | AP@4px | AR@4px | F1@4px | AP@8px | AR@8px | F1@8px |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PET-Finetuned(Ours) | **13.794** | **22.163** | **0.666** | **0.596** | **0.629** | **0.869** | **0.777** | **0.820** |
| PET-SHHA | 19.472 | 29.651 | 0.640 | 0.547 | 0.590 | 0.847 | 0.724 | 0.781 |
| APGCC-SHHA | 19.579 | 28.398 | 0.517 | 0.441 | 0.476 | 0.837 | 0.714 | 0.771 |
### UCF-QNRF
| Model | MAE | MSE | AP@4px | AR@4px | F1@4px | AP@8px | AR@8px | F1@8px |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PET-Finetuned(Ours) | **105.772** | **199.544** | **0.533** | **0.505** | **0.519** | **0.759** | **0.719** | **0.738** |
| PET-SHHA | 123.135 | 240.943 | 0.495 | 0.487 | 0.491 | 0.708 | 0.696 | 0.702 |
| APGCC-SHHA | 126.763 | 228.998 | 0.311 | 0.284 | 0.297 | 0.638 | 0.583 | 0.609 |
### JHU-Crowd++
| Model | MAE | MSE | AP@4px | AR@4px | F1@4px | AP@8px | AR@8px | F1@8px |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PET-Finetuned(Ours) | **74.778** | **271.886** | **0.467** | **0.491** | **0.479** | **0.681** | **0.715** | **0.698** |
| PET-SHHA | 115.861 | 393.281 | 0.379 | 0.449 | 0.411 | 0.582 | 0.690 | 0.632 |
| APGCC-SHHA | 102.461 | 331.883 | 0.303 | 0.330 | 0.316 | 0.578 | 0.630 | 0.603 |
## Qualitative Analysis
Full-resolution qualitative comparisons in the report use horizontally stacked
panels ordered as `PET-Finetuned(Ours)`, `PET-SHHA`, and `APGCC-SHHA`, with
predicted points drawn in green, yellow, and red, respectively (a minimal
panel-stacking sketch follows the example images below). Inference for these
comparisons uses `threshold = 0.5` and `upper_bound = -1`. Qualitatively,
`PET-Finetuned(Ours)` shows fewer sparse-scene false positives, stronger
dense-scene recall under occlusion, and more stable localization under
perspective and scale variation.
[![Qualitative comparison for pexels-558331748-30295833](images/pexels-558331748-30295833.jpg)](images/pexels-558331748-30295833.jpg)
[![Qualitative comparison for pexels-ilyasajpg-7038431](images/pexels-ilyasajpg-7038431.jpg)](images/pexels-ilyasajpg-7038431.jpg)
[![Qualitative comparison for pexels-peter-almario-388108-19472286](images/pexels-peter-almario-388108-19472286.jpg)](images/pexels-peter-almario-388108-19472286.jpg)
[![Qualitative comparison for pexels-rafeeque-kodungookaran-374579689-18755903](images/pexels-rafeeque-kodungookaran-374579689-18755903.jpg)](images/pexels-rafeeque-kodungookaran-374579689-18755903.jpg)
[![Qualitative comparison for pexels-wendywei-4945353](images/pexels-wendywei-4945353.jpg)](images/pexels-wendywei-4945353.jpg)
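As an illustration of how such comparison strips can be assembled, here is a
minimal sketch with Pillow; the file names are placeholders, not files shipped
with this release.
```python
from PIL import Image

# Hypothetical sketch: stack three per-model renderings horizontally in the
# report's order (ours, PET-SHHA, APGCC-SHHA). File names are placeholders.
paths = ["ours.jpg", "pet_shha.jpg", "apgcc_shha.jpg"]
panels = [Image.open(p) for p in paths]

# Resize all panels to a common height so the strip lines up.
h = min(img.height for img in panels)
panels = [img.resize((img.width * h // img.height, h)) for img in panels]

strip = Image.new("RGB", (sum(img.width for img in panels), h))
x = 0
for img in panels:
    strip.paste(img, (x, 0))
    x += img.width
strip.save("comparison.jpg")
```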
## Model Inference
Use the official PET repository to run single-image inference with the model
from this release.
1. Clone PET and move into the repository root.
```bash
git clone https://github.com/cxliu0/PET.git
cd PET
```
2. Install dependencies.
```bash
pip install -r requirements.txt
pip install safetensors pillow
```
3. Copy `test.py` from this release folder into the PET repository root.
4. Place `PET_Finetuned.safetensors` in the PET repository root.
5. Run inference (the paths below are placeholders).
```bash
python test.py \
--image_path path/to/image.jpg \
--resume PET_Finetuned.safetensors \
--device cpu \
--output_json outputs/prediction.json \
--output_image outputs/prediction.jpg
```
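To work with the saved predictions programmatically, a minimal consumer of the
`--output_json` file might look like the sketch below. The exact schema is an
assumption (a `points` list of `[x, y]` coordinates plus a scalar `count`);
check the `test.py` shipped with this release for the actual keys.
```python
import json

# Minimal, hypothetical consumer of the inference output. The keys
# "points" and "count" are assumptions; verify against test.py.
with open("outputs/prediction.json") as f:
    pred = json.load(f)

points = pred.get("points", [])          # assumed: list of [x, y] head coords
count = pred.get("count", len(points))   # assumed: scalar predicted count
print(f"predicted {count} heads; first few points: {points[:5]}")
```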
## Summary
We present a practical adaptation of PET for density-robust
crowd counting and head-point localization under partial and heterogeneous
annotations. The training framework combines a hierarchical iterative
fine-tuning loop with outlier-driven data refinement, mixed patch-resolution
optimization (128×128/256×256/512×512), count-aware sampling for dense-scene
emphasis, and adaptive background-query loss weighting to stabilize supervision
across scales.
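To illustrate the intent of the adaptive background-query weighting (this is
not the release implementation, and the inverse-linear decay rule and constants
below are assumptions): larger patches contain proportionally more background
queries, so their loss contribution is scaled down with patch size.
```python
# Illustrative sketch only: scale the background-query classification loss
# weight down as patch size grows, so background terms do not dominate at
# higher resolutions. The decay rule and constants are assumptions.
def background_weight(patch_size: int, base: float = 0.5, ref: int = 256) -> float:
    return base * ref / patch_size

print({s: background_weight(s) for s in (128, 256, 512)})
# {128: 1.0, 256: 0.5, 512: 0.25}
```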
Under the reported cross-dataset transfer protocol from SHHA initialization,
the model achieves the strongest overall transfer on SHHB, UCF-QNRF, and
JHU-Crowd++, while maintaining the best localization balance on SHHA at both
matching thresholds. Qualitative evidence is consistent with these trends,
showing fewer sparse-scene false positives and stronger dense-scene recall
under occlusion and perspective variation.