---
license: mit
language:
- en
library_name: pytorch
tags: [crowd-counting, localization, PET]
---
# Hierarchical Training on Partial Annotations Enables Density-Robust Crowd Counting and Localization
## Abstract
Reliable crowd analysis requires both accurate counting and precise head-point
localization under severe density and scale variation. In practice, dense
scenes exhibit heavy occlusion and perspective distortion, while the same
camera can undergo abrupt distribution shifts over time due to zoom and
viewpoint changes or event dynamics. We present a model obtained by fine-tuning
the Point Query Transformer (PET) on a
curated, multi-source dataset with partial and heterogeneous annotations. Our
training recipe combines (i) a hierarchical iterative loop that aligns count
distributions across partial ground truth, fine-tuned predictions, and the
pre-trained baseline to guide outlier-driven data refinement, (ii)
multi-resolution patch training (128×128, 256×256, and 512×512) to reduce
scale sensitivity, (iii) count-aware patch sampling to mitigate long-tailed
density skew, and (iv) adaptive background-query loss weighting to prevent
resolution-dependent background dominance. This approach improves F1@4px and
F1@8px scores on ShanghaiTech Part A (SHHA), ShanghaiTech Part B (SHHB),
JHU-Crowd++, and UCF-QNRF, and exhibits more stable behavior during
sparse-to-dense density transitions.
For the full data-curation and training recipe, refer to our technical
report: [Technical Report](TechnicalReport.pdf).
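To make the count-aware sampling idea concrete, here is a minimal, hypothetical
sketch (not the release training code): patches are drawn with probability
proportional to their head count, floored at 1 so sparse patches stay
reachable, and each draw is paired with one of the three training resolutions.
```python
import random

# Hypothetical sketch of count-aware patch sampling with mixed resolutions.
# `patches` is assumed to be a list of (image_crop, head_points) pairs; the
# actual pipeline's data structures and sampling schedule may differ.
PATCH_SIZES = [128, 256, 512]

def sample_patch(patches):
    # Weight each candidate by its head count (floored at 1) so dense
    # patches are over-sampled relative to the long sparse tail.
    weights = [max(len(points), 1) for _, points in patches]
    crop, points = random.choices(patches, weights=weights, k=1)[0]
    # Pair the draw with a random training resolution to reduce
    # scale sensitivity.
    size = random.choice(PATCH_SIZES)
    return crop, points, size
```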
## Evaluation and Results
Across four benchmarks, PET-Finetuned shows the strongest overall transfer,
with consistent gains in both counting and localization on SHHB, UCF-QNRF, and
JHU-Crowd++. On SHHB, it reduces MAE/MSE to 13.794/22.163 from 19.472/29.651
(PET-SHHA) and 19.579/28.398 (APGCC-SHHA), while increasing F1@8 to 0.820.
The same pattern holds on UCF-QNRF (MAE 105.772, MSE 199.544, F1@8 0.738) and
JHU-Crowd++ (MAE 74.778, MSE 271.886, F1@8 0.698), where PET-Finetuned
outperforms both references by clear margins. On SHHA, counting error is higher
than for PET-SHHA and APGCC-SHHA (MAE 62.742 vs. 48.879/48.725), but
localization is the best in the table (F1@4px 0.614, F1@8px 0.794), indicating
a stronger precision-recall balance for head-point prediction at both matching
thresholds.
> **Note (evaluation protocol):** PET-SHHA and APGCC-SHHA numbers in this
> section can differ from values reported in the original papers. The original
> works typically train one model per target dataset and evaluate in-domain. In
> contrast, `PET-Finetuned(Ours)` is initialized from PET-SHHA weights and
> fine-tuned in our framework. For cross-dataset baseline comparison, we use
> the best public SHHA checkpoints released by the authors for PET-SHHA and
> APGCC-SHHA (the APGCC authors publicly release only their SHHA-best
> checkpoint).
> Therefore, the PET-SHHA and APGCC-SHHA rows above reflect transfer from SHHA
> initialization rather than per-dataset retraining. All metrics in this
> section are evaluated at `threshold = 0.5`.
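For reference, F1@kpx metrics of this kind are commonly computed by one-to-one
point matching within a pixel threshold. The sketch below is an illustrative
implementation using Hungarian matching; the official evaluation scripts may
differ in matching details.
```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def f1_at_k(pred, gt, k):
    """Match predicted and GT head points one-to-one (Hungarian assignment
    on Euclidean distance); matched pairs within k pixels count as TP."""
    if len(pred) == 0 or len(gt) == 0:
        return 0.0
    dist = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(dist)
    tp = int((dist[rows, cols] <= k).sum())
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gt)
    return 2 * precision * recall / (precision + recall)

# Example: pred and gt are (N, 2) arrays of (x, y) head coordinates.
pred = np.array([[10.0, 12.0], [50.0, 48.0]])
gt = np.array([[11.0, 12.0], [80.0, 80.0]])
print(f1_at_k(pred, gt, k=4))  # 0.5: one of two points matched within 4 px
```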
### ShanghaiTech Part A (SHHA)
| Model | MAE | MSE | AP@4px | AR@4px | F1@4px | AP@8px | AR@8px | F1@8px |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PET-Finetuned(Ours) | 62.742 | 102.996 | **0.615** | **0.613** | **0.614** | **0.796** | **0.793** | **0.794** |
| PET-SHHA | 48.879 | **76.520** | 0.596 | 0.604 | 0.600 | 0.781 | 0.792 | 0.786 |
| APGCC-SHHA | **48.725** | 76.721 | 0.439 | 0.428 | 0.433 | 0.773 | 0.754 | 0.764 |
### ShanghaiTech Part B (SHHB)
| Model | MAE | MSE | AP@4px | AR@4px | F1@4px | AP@8px | AR@8px | F1@8px |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PET-Finetuned(Ours) | **13.794** | **22.163** | **0.666** | **0.596** | **0.629** | **0.869** | **0.777** | **0.820** |
| PET-SHHA | 19.472 | 29.651 | 0.640 | 0.547 | 0.590 | 0.847 | 0.724 | 0.781 |
| APGCC-SHHA | 19.579 | 28.398 | 0.517 | 0.441 | 0.476 | 0.837 | 0.714 | 0.771 |
### UCF-QNRF
| Model | MAE | MSE | AP@4px | AR@4px | F1@4px | AP@8px | AR@8px | F1@8px |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PET-Finetuned(Ours) | **105.772** | **199.544** | **0.533** | **0.505** | **0.519** | **0.759** | **0.719** | **0.738** |
| PET-SHHA | 123.135 | 240.943 | 0.495 | 0.487 | 0.491 | 0.708 | 0.696 | 0.702 |
| APGCC-SHHA | 126.763 | 228.998 | 0.311 | 0.284 | 0.297 | 0.638 | 0.583 | 0.609 |
### JHU-Crowd++
| Model | MAE | MSE | AP@4px | AR@4px | F1@4px | AP@8px | AR@8px | F1@8px |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PET-Finetuned(Ours) | **74.778** | **271.886** | **0.467** | **0.491** | **0.479** | **0.681** | **0.715** | **0.698** |
| PET-SHHA | 115.861 | 393.281 | 0.379 | 0.449 | 0.411 | 0.582 | 0.690 | 0.632 |
| APGCC-SHHA | 102.461 | 331.883 | 0.303 | 0.330 | 0.316 | 0.578 | 0.630 | 0.603 |
## Qualitative Analysis
Full-resolution qualitative comparisons in the report use horizontally stacked
panels ordered as `PET-Finetuned(Ours)`, `PET-SHHA`, and `APGCC-SHHA`, with
predicted points drawn in green, yellow, and red, respectively (a minimal
panel-stacking sketch follows the example images below). Inference for these
comparisons uses `threshold = 0.5` and `upper_bound = -1`. Qualitatively,
`PET-Finetuned(Ours)` shows fewer sparse-scene false positives, stronger
dense-scene recall under occlusion, and more stable localization under
perspective and scale variation.
[![Qualitative comparison for pexels-558331748-30295833](images/pexels-558331748-30295833.jpg)](images/pexels-558331748-30295833.jpg)
[![Qualitative comparison for pexels-ilyasajpg-7038431](images/pexels-ilyasajpg-7038431.jpg)](images/pexels-ilyasajpg-7038431.jpg)
[![Qualitative comparison for pexels-peter-almario-388108-19472286](images/pexels-peter-almario-388108-19472286.jpg)](images/pexels-peter-almario-388108-19472286.jpg)
[![Qualitative comparison for pexels-rafeeque-kodungookaran-374579689-18755903](images/pexels-rafeeque-kodungookaran-374579689-18755903.jpg)](images/pexels-rafeeque-kodungookaran-374579689-18755903.jpg)
[![Qualitative comparison for pexels-wendywei-4945353](images/pexels-wendywei-4945353.jpg)](images/pexels-wendywei-4945353.jpg)
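As an illustration of how such comparison strips can be assembled, here is a
minimal sketch with Pillow; the file names are placeholders, not files shipped
with this release.
```python
from PIL import Image

# Hypothetical sketch: stack three per-model renderings horizontally in the
# report's order (ours, PET-SHHA, APGCC-SHHA). File names are placeholders.
paths = ["ours.jpg", "pet_shha.jpg", "apgcc_shha.jpg"]
panels = [Image.open(p) for p in paths]

# Resize all panels to a common height so the strip lines up.
h = min(img.height for img in panels)
panels = [img.resize((img.width * h // img.height, h)) for img in panels]

strip = Image.new("RGB", (sum(img.width for img in panels), h))
x = 0
for img in panels:
    strip.paste(img, (x, 0))
    x += img.width
strip.save("comparison.jpg")
```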
## Model Inference
Use the official PET repository to run single-image inference with the model
from this release.
1. Clone PET and move into the repository root.
```bash
git clone https://github.com/cxliu0/PET.git
cd PET
```
2. Install dependencies.
```bash
pip install -r requirements.txt
pip install safetensors pillow
```
3. Copy `test.py` from this release folder into the PET repository root.
4. Place `PET_Finetuned.safetensors` in the PET repository root.
5. Run inference (the paths below are placeholders).
```bash
python test.py \
--image_path path/to/image.jpg \
--resume PET_Finetuned.safetensors \
--device cpu \
--output_json outputs/prediction.json \
--output_image outputs/prediction.jpg
```
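To work with the saved predictions programmatically, a minimal consumer of the
`--output_json` file might look like the sketch below. The exact schema is an
assumption (a `points` list of `[x, y]` coordinates plus a scalar `count`);
check the `test.py` shipped with this release for the actual keys.
```python
import json

# Minimal, hypothetical consumer of the inference output. The keys
# "points" and "count" are assumptions; verify against test.py.
with open("outputs/prediction.json") as f:
    pred = json.load(f)

points = pred.get("points", [])          # assumed: list of [x, y] head coords
count = pred.get("count", len(points))   # assumed: scalar predicted count
print(f"predicted {count} heads; first few points: {points[:5]}")
```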
## Summary
We present a practical adaptation of PET for density-robust
crowd counting and head-point localization under partial and heterogeneous
annotations. The training framework combines a hierarchical iterative
fine-tuning loop with outlier-driven data refinement, mixed patch-resolution
optimization (128×128/256×256/512×512), count-aware sampling for dense-scene
emphasis, and adaptive background-query loss weighting to stabilize supervision
across scales.
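To illustrate the intent of the adaptive background-query weighting (this is
not the release implementation, and the inverse-linear decay rule and constants
below are assumptions): larger patches contain proportionally more background
queries, so their loss contribution is scaled down with patch size.
```python
# Illustrative sketch only: scale the background-query classification loss
# weight down as patch size grows, so background terms do not dominate at
# higher resolutions. The decay rule and constants are assumptions.
def background_weight(patch_size: int, base: float = 0.5, ref: int = 256) -> float:
    return base * ref / patch_size

print({s: background_weight(s) for s in (128, 256, 512)})
# {128: 1.0, 256: 0.5, 512: 0.25}
```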
Under the reported cross-dataset transfer protocol from SHHA initialization,
the model achieves the strongest overall transfer on SHHB, UCF-QNRF, and
JHU-Crowd++, while maintaining the best localization balance on SHHA at both
matching thresholds. Qualitative evidence is consistent with these trends,
showing fewer sparse-scene false positives and stronger dense-scene recall
under occlusion and perspective variation.