---
license: mit
language:
- en
library_name: pytorch
tags: [crowd-counting, localization, PET]
---

# Hierarchical Training on Partial Annotations Enables Density-Robust Crowd Counting and Localization

## Abstract

Reliable crowd analysis requires both accurate counting and precise head-point
localization under severe density and scale variation. In practice, dense
scenes exhibit heavy occlusion and perspective distortion, while the same
camera can undergo abrupt distribution shifts over time due to zoom and
viewpoint changes or event dynamics. We present a model obtained by
fine-tuning the Point Query Transformer (PET) on a curated, multi-source
dataset with partial and heterogeneous annotations. Our training recipe
combines (i) a hierarchical iterative loop that aligns count distributions
across partial ground truth, fine-tuned predictions, and the pre-trained
baseline to guide outlier-driven data refinement, (ii) multi-patch-resolution
training (128x128, 256x256, and 512x512) to reduce scale sensitivity, (iii)
count-aware patch sampling to mitigate long-tailed density skew, and (iv)
adaptive background-query loss weighting to prevent resolution-dependent
background dominance. This approach improves F1@4px and F1@8px on
ShanghaiTech Part A (SHHA), ShanghaiTech Part B (SHHB), JHU-Crowd++, and
UCF-QNRF, and exhibits more stable behavior during sparse-to-dense density
transitions.

For details of data curation and the training recipe, see our technical
report: [Technical Report](TechnicalReport.pdf).
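
As a concrete illustration of components (ii) and (iii), the sketch below draws
training crops at mixed resolutions and weights candidate crops by their local
head count. It is a minimal sketch with illustrative parameter values (candidate
count, weighting), not the exact sampling code from our recipe:

```python
import numpy as np

PATCH_SIZES = [128, 256, 512]  # multi-patch-resolution training (component ii)

def sample_count_aware_patch(image, points, rng, n_candidates=16):
    """Draw one training crop, favouring candidates that contain more heads.

    image:  H x W x 3 array; points: (N, 2) array of (x, y) head coordinates.
    Illustrative only -- candidate count and weighting are hypothetical choices.
    """
    h, w = image.shape[:2]
    size = int(rng.choice(PATCH_SIZES))   # pick a patch resolution
    size = min(size, h, w)                # guard against small images

    # Propose random crop origins and score each by the number of heads inside.
    xs = rng.integers(0, w - size + 1, n_candidates)
    ys = rng.integers(0, h - size + 1, n_candidates)
    counts = np.array([
        np.sum((points[:, 0] >= x) & (points[:, 0] < x + size) &
               (points[:, 1] >= y) & (points[:, 1] < y + size))
        for x, y in zip(xs, ys)
    ])

    # Count-aware sampling (component iii): denser candidates are more likely,
    # but the +1 keeps sparse/background crops from disappearing entirely.
    probs = (counts + 1) / (counts + 1).sum()
    k = rng.choice(n_candidates, p=probs)
    x, y = xs[k], ys[k]

    crop = image[y:y + size, x:x + size]
    in_crop = ((points[:, 0] >= x) & (points[:, 0] < x + size) &
               (points[:, 1] >= y) & (points[:, 1] < y + size))
    return crop, points[in_crop] - np.array([x, y])

# Example usage: rng = np.random.default_rng(0)
#                crop, crop_points = sample_count_aware_patch(img, heads, rng)
```

Component (iv) would then rescale the loss assigned to background point queries
depending on the patch resolution, so that large, sparsely populated crops do
not let background queries dominate the objective; that weighting is omitted
from the sketch.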

## Evaluation and Results

Across four benchmarks, PET-Finetuned shows the strongest overall transfer,
with consistent gains in both counting and localization on SHHB, UCF-QNRF, and
JHU-Crowd++. On SHHB, it reduces MAE/MSE to 13.794/22.163 from 19.472/29.651
(PET-SHHA) and 19.579/28.398 (APGCC-SHHA), while increasing F1@8px to 0.820.
The same pattern holds on UCF-QNRF (MAE 105.772, MSE 199.544, F1@8px 0.738)
and JHU-Crowd++ (MAE 74.778, MSE 271.886, F1@8px 0.698), where PET-Finetuned
outperforms both references by clear margins. On SHHA, counting error is
higher than for PET-SHHA and APGCC-SHHA (MAE 62.742 vs 48.879/48.725), but
localization is the best in the table (F1@4px 0.614, F1@8px 0.794), indicating
a stronger precision-recall balance for head-point prediction at both matching
thresholds.

> **Note (evaluation protocol):** PET-SHHA and APGCC-SHHA numbers in this
> section can differ from values reported in the original papers. The original
> works typically train one model per target dataset and evaluate in-domain.
> In contrast, `PET-Finetuned(Ours)` is initialized from PET-SHHA weights and
> fine-tuned in our framework. For cross-dataset baseline comparison, we use
> the best publicly released SHHA checkpoints from the respective authors for
> PET-SHHA and APGCC-SHHA (APGCC publicly provides only its SHHA-best
> checkpoint). Therefore, the PET-SHHA and APGCC-SHHA rows below reflect
> transfer from SHHA initialization rather than per-dataset retraining. All
> metrics in this section are evaluated at `threshold = 0.5`.
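
In the tables that follow, localization is scored by matching predicted head
points to ground-truth points within a pixel threshold (4 px or 8 px). Below is
a minimal sketch of the F1@τ computation for a single image, assuming
one-to-one Hungarian matching on Euclidean distances; the exact matching
protocol used by the evaluation scripts may differ in detail:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def point_f1(pred, gt, tau):
    """Precision, recall, and F1 for head points at pixel threshold tau.

    pred, gt: (N, 2) and (M, 2) arrays of (x, y) coordinates.
    """
    if len(pred) == 0 or len(gt) == 0:
        return 0.0, 0.0, 0.0
    dist = cdist(pred, gt)                   # pairwise Euclidean distances
    row, col = linear_sum_assignment(dist)   # one-to-one matching
    tp = int(np.sum(dist[row, col] <= tau))  # matches within tau pixels
    precision = tp / len(pred)
    recall = tp / len(gt)
    f1 = 0.0 if tp == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```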

### ShanghaiTech Part A (SHHA)

| Model | MAE | MSE | AP@4px | AR@4px | F1@4px | AP@8px | AR@8px | F1@8px |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PET-Finetuned(Ours) | 62.742 | 102.996 | **0.615** | **0.613** | **0.614** | **0.796** | **0.793** | **0.794** |
| PET-SHHA | 48.879 | **76.520** | 0.596 | 0.604 | 0.600 | 0.781 | 0.792 | 0.786 |
| APGCC-SHHA | **48.725** | 76.721 | 0.439 | 0.428 | 0.433 | 0.773 | 0.754 | 0.764 |

### ShanghaiTech Part B (SHHB)

| Model | MAE | MSE | AP@4px | AR@4px | F1@4px | AP@8px | AR@8px | F1@8px |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PET-Finetuned(Ours) | **13.794** | **22.163** | **0.666** | **0.596** | **0.629** | **0.869** | **0.777** | **0.820** |
| PET-SHHA | 19.472 | 29.651 | 0.640 | 0.547 | 0.590 | 0.847 | 0.724 | 0.781 |
| APGCC-SHHA | 19.579 | 28.398 | 0.517 | 0.441 | 0.476 | 0.837 | 0.714 | 0.771 |

### UCF-QNRF

| Model | MAE | MSE | AP@4px | AR@4px | F1@4px | AP@8px | AR@8px | F1@8px |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PET-Finetuned(Ours) | **105.772** | **199.544** | **0.533** | **0.505** | **0.519** | **0.759** | **0.719** | **0.738** |
| PET-SHHA | 123.135 | 240.943 | 0.495 | 0.487 | 0.491 | 0.708 | 0.696 | 0.702 |
| APGCC-SHHA | 126.763 | 228.998 | 0.311 | 0.284 | 0.297 | 0.638 | 0.583 | 0.609 |

### JHU-Crowd++

| Model | MAE | MSE | AP@4px | AR@4px | F1@4px | AP@8px | AR@8px | F1@8px |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PET-Finetuned(Ours) | **74.778** | **271.886** | **0.467** | **0.491** | **0.479** | **0.681** | **0.715** | **0.698** |
| PET-SHHA | 115.861 | 393.281 | 0.379 | 0.449 | 0.411 | 0.582 | 0.690 | 0.632 |
| APGCC-SHHA | 102.461 | 331.883 | 0.303 | 0.330 | 0.316 | 0.578 | 0.630 | 0.603 |

## Qualitative Analysis

Full-resolution qualitative comparisons in the report use horizontally stacked
panels ordered as `PET-Finetuned(Ours)`, `PET-SHHA`, and `APGCC-SHHA`, with
point colors green, yellow, and red, respectively. Inference for these
comparisons uses `threshold = 0.5` and `upper_bound = -1`. Qualitatively,
`PET-Finetuned(Ours)` shows fewer sparse-scene false positives, stronger
dense-scene recall under occlusion, and more stable localization under
perspective and scale variation.

![Qualitative comparison](images/pexels-558331748-30295833.jpg)

![Qualitative comparison](images/pexels-ilyasajpg-7038431.jpg)

![Qualitative comparison](images/pexels-peter-almario-388108-19472286.jpg)

![Qualitative comparison](images/pexels-rafeeque-kodungookaran-374579689-18755903.jpg)

![Qualitative comparison](images/pexels-wendywei-4945353.jpg)

## Model Inference

Use the official PET repository to run single-image inference with the
released model.

1. Clone PET and move into the repository root.
   ```bash
   git clone https://github.com/cxliu0/PET.git
   cd PET
   ```
2. Install dependencies.
   ```bash
   pip install -r requirements.txt
   pip install safetensors pillow
   ```
3. Copy `test.py` from this release folder into the PET repository root.
4. Place `PET_Finetuned.safetensors` in the PET repository root (a sketch for
   loading the checkpoint follows this list).
5. Run inference (dummy example).
   ```bash
   python test.py \
     --image_path path/to/image.jpg \
     --resume PET_Finetuned.safetensors \
     --device cpu \
     --output_json outputs/prediction.json \
     --output_image outputs/prediction.jpg
   ```
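
The released checkpoint is a plain `safetensors` state dict, so it can also be
inspected or loaded outside `test.py`. A minimal sketch using the `safetensors`
package follows; building the PET model itself still requires the repository
and is only indicated in the comments:

```python
from safetensors.torch import load_file

# Read the released checkpoint as an ordinary {name: tensor} state dict.
state_dict = load_file("PET_Finetuned.safetensors", device="cpu")

n_params = sum(t.numel() for t in state_dict.values())
print(f"{len(state_dict)} tensors, {n_params / 1e6:.1f}M parameters")

# With a PET model built from the repository (e.g. the model-building entry
# point that test.py uses), the weights are then loaded the usual PyTorch way:
#   model.load_state_dict(state_dict, strict=False)
#   model.eval()
```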

## Summary

We present a practical adaptation of PET for density-robust crowd counting and
head-point localization under partial and heterogeneous annotations. The
training framework combines a hierarchical iterative fine-tuning loop with
outlier-driven data refinement, mixed patch-resolution optimization
(128x128/256x256/512x512), count-aware sampling for dense-scene emphasis, and
adaptive background-query loss weighting to stabilize supervision across
scales.
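
As a simplified illustration of the outlier-driven refinement step, each
iteration can compare the partial ground-truth count of an image with the
counts predicted by the fine-tuned model and the pre-trained baseline, and
flag images where both models diverge strongly from the annotation. The field
names and thresholds below are hypothetical, not the exact criterion from the
technical report:

```python
def flag_count_outliers(records, rel_tol=0.5, min_heads=20):
    """Flag images whose annotated count diverges from both model estimates.

    records: iterable of dicts with keys 'image_id', 'gt_count' (partial GT),
    'finetuned_count', and 'baseline_count'. Names and thresholds are
    illustrative only.
    """
    outliers = []
    for r in records:
        gt = max(r["gt_count"], 1)
        dev_ft = abs(r["finetuned_count"] - gt) / gt
        dev_base = abs(r["baseline_count"] - gt) / gt
        # Only trust the flag when both models disagree with the partial GT
        # by a large relative margin on a reasonably populated scene.
        if gt >= min_heads and dev_ft > rel_tol and dev_base > rel_tol:
            outliers.append(r["image_id"])
    return outliers
```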

Under the reported cross-dataset transfer protocol from SHHA initialization,
the model achieves the strongest overall transfer on SHHB, UCF-QNRF, and
JHU-Crowd++, while maintaining the best localization balance on SHHA at both
matching thresholds. Qualitative evidence is consistent with these trends,
showing fewer sparse-scene false positives and stronger dense-scene recall
under occlusion and perspective variation.