File size: 10,369 Bytes
e7766c6
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
---
license: cc
datasets:
- sshao0516/CrowdHuman
- aveocr/Market-1501-v15.09.15.zip
language:
- en
base_model:
- lakeAGI/PersonViT
- Ultralytics/YOLO26
tags:
- person_search
- PRW
- Tranformer
- PersonViT
- YOLO26
- ablation
---
# Detector Training, file: Detector_Training.ipynb

## Overview

This notebook presents a structured ablation study on the transferability of **YOLO26** to pedestrian detection on the **PRW** (Person Re-identification in the Wild) dataset.
It constitutes the **detection stage** of a two-stage person search pipeline, where high recall is a primary design objective since any missed detection propagates as an unrecoverable error to the downstream Re-ID module.

The study evaluates the impact of:
- CrowdHuman intermediate pre-training vs. direct COCO initialisation
- Full fine-tuning vs. partial fine-tuning (frozen backbone layers 0-9)
- Model scale: YOLO26-Small vs. YOLO26-Large

## Requirements

### Platform

The notebook was developed and executed on **Kaggle** with a **dual NVIDIA T4 GPU** configuration (16 GB VRAM each). Multi-GPU training is enabled by default via `cfg.use_multi_gpu = True`.
To run on a single GPU or CPU, set this flag to `False` in the `Config` class.

### Dependencies

All required packages are installed in the first cell:
- ultralytics, opencv-python-headless, scipy, pandas, tqdm


### Input Datasets

Before running the notebook, add the following two datasets as input sources in the Kaggle session (Notebook -> Input -> Add Input):

| Dataset | Kaggle URL | Version |
|---|---|---|
| CrowdHuman | https://www.kaggle.com/datasets/leducnhuan/crowdhuman | 1 |
| PRW -- Person Re-identification in the Wild | https://www.kaggle.com/datasets/edoardomerli/prw-person-re-identification-in-the-wild | 1 |

The dataset paths are pre-configured in the `Config` class and match the default Kaggle mount locations. No manual path editing is required.

## Notebook Structure

| Phase | Description |
|---|---|
| 0 - Baseline | YOLO26-Small COCO zero-shot eval on PRW |
| 1 - CrowdHuman Pre-training | Fine-tune Small on CrowdHuman, then eval on PRW |
| 2 - Strategy Comparison | Full FT vs. Partial FT (freeze backbone layers 0-9) |
| 3 - Scale-up | Best strategy applied to YOLO26-Large |
| Final | Cross-model comparison: metrics, params, GFLOPs, speed |

## Outputs

All results are saved under `/kaggle/working/yolo_ablation/`:
- `all_results.json` -- incremental results registry, persisted after each run
- `all_results.csv` -- final summary table sorted by mAP@0.5
- `plots/` -- all generated figures (bar charts, radar charts, training curves)

Model checkpoints are saved under `/kaggle/working/yolo_runs/{run_name}/weights/`.

## Key Results

| Model | Strategy | mAP@0.5 (%) | Recall (%) | Params (M) | Latency (ms) |
|---|---|---|---|---|---|
| YOLO26-Large | Full FT | 96.24 | 91.38 | 26.2 | 27.88 |
| YOLO26-Small | Full FT | 94.96 | 89.32 | 9.9 | 7.72 |
| YOLO26-Small | Partial FT | 94.91 | 89.33 | 9.9 | 7.65 |
| YOLO26-Small | Zero-shot (CH) | 88.23 | 83.41 | 9.9 | 6.85 |
| YOLO26-Small | Zero-shot (COCO) | 85.82 | 79.42 | 10.0 | 6.29 |

The Large model achieves the highest recall (91.38%) at the cost of 3.6x higher latency.
The Small model with full or partial fine-tuning offers a competitive alternative for latency-constrained deployments, with recall above 89% at under 8 ms per image.

## Reproducibility

A global seed (42) is applied to Python, NumPy, PyTorch, and CUDA. All training cells are idempotent: if a valid checkpoint already exists, training is skipped and the existing weights are used.
Results can be fully regenerated from saved checkpoints without re-running any training phase.

## Estimated Runtime

Full end-to-end execution (dataset conversion, all training phases, evaluation, and plotting) takes approximately **8-10 hours** on a Kaggle dual T4 session, depending on dataset I/O speed and early stopping behaviour.

# Re-ID Training, file: ReIdentificator_Training.ipynb


## Overview


This notebook presents a structured ablation study on the fine-tuning of **PersonViT** for person re-identification on the **Person Re-identification in the Wild** (PRW) dataset.
It constitutes the **Re-ID stage** of a two-stage person search pipeline, where the input is a set of pedestrian crops produced by the upstream YOLO26 detector and the output is an L2-normalised embedding used for cosine-similarity gallery retrieval.

The study follows a **small-first, scale-up** design: the full ablation over fine-tuning strategies and loss functions is conducted on the lightweight **ViT-Small** (22 M parameters) to minimise GPU time, and only the winning configuration is then replicated on **ViT-Base** (86 M parameters). This reduces total compute by approximately 3x compared to running the ablation directly on ViT-Base, while preserving full causal interpretability of each experimental variable.

The study evaluates the impact of:
- Source domain selection (Duke, Market-1501, MSMT17, Occluded-Duke) in zero-shot evaluation
- Fine-tuning strategy: Full FT vs. Partial FT vs. Freeze+Retrain
- Metric learning loss function: TripletMarginLoss vs. ArcFaceLoss vs. NTXentLoss
- Model scale: ViT-Small vs. ViT-Base


## Requirements


### Platform

The notebook was developed and executed on **Kaggle** with a single **NVIDIA T4 GPU** (16 GB VRAM). Mixed-precision training (fp16) is enabled by default via `cfg.use_amp = True`.


### Dependencies


All required packages are installed in the first cell:
- albumentations, opencv-python-headless, scipy
- torchmetrics, timm, einops, yacs
- pytorch-metric-learning
- thop


### Input Datasets and Models


Before running the notebook, add the following dataset and model sources as inputs in the Kaggle session (Notebook -> Input -> Add Input):

| Resource | Type | Kaggle URL | Version |
|---|---|---|---|
| PRW -- Person Re-identification in the Wild | Dataset | [https://www.kaggle.com/datasets/edoardomerli/prw-person-re-identification-in-the-wild](https://www.kaggle.com/datasets/edoardomerli/prw-person-re-identification-in-the-wild) | 1 |
| PersonViT Pre-trained Weights | Model | [https://www.kaggle.com/models/simonerimondi/personvit](https://www.kaggle.com/models/simonerimondi/personvit) | 4 |

The dataset and model paths are pre-configured in the `Config` class and match the default Kaggle mount locations. No manual path editing is required.

The PersonViT model code is cloned at runtime from a personal GitHub fork ([github.com/simoswish02/PersonViT](https://github.com/simoswish02/PersonViT)) that fixes an import error present in the original upstream repository.


## Notebook Structure


| Phase | Description |
|---|---|
| 0 - Pretrained Baselines | Zero-shot evaluation of all four ViT-Small checkpoints on PRW |
| 1 - Strategy Comparison | Full FT vs. Partial FT vs. Freeze+Retrain, ArcFace loss fixed (ViT-Small) |
| 2 - Loss Comparison | TripletMarginLoss vs. ArcFaceLoss vs. NTXentLoss, best strategy fixed (ViT-Small) |
| 3 - Scale-Up | Winning configuration replicated on ViT-Base (embedding dim 768) |
| Final | Cross-model comparison: metrics, params, GFLOPs, latency |


## Outputs


All results are saved under `./evaluation_results/`:
- `all_results.json` -- incremental results registry, persisted after each run
- `all_results.csv` -- final summary table sorted by mAP
- `plots/` -- all generated figures (bar charts, radar charts, training curves, Small vs. Base delta table)

Model checkpoints are saved under `/kaggle/working/personvit_finetuning/`.


## Key Results


| Run | Strategy | Loss | mAP (%) | Rank-1 (%) | Params (M) | Latency (ms) |
|---|---|---|---|---|---|---|
| vit_base_full_triplet | Full FT | Triplet | 85.65 | 94.51 | 86.5 | 11.80 |
| full_triplet | Full FT | Triplet | 81.50 | 93.44 | 22.0 | 7.02 |
| full_ntxent | Full FT | NTXent | 80.77 | 93.15 | 22.0 | 7.47 |
| full_arcface | Full FT | ArcFace | 78.10 | 93.39 | 22.0 | 7.34 |
| freeze_arcface | Freeze+Retrain | ArcFace | 75.64 | 93.19 | 22.0 | 7.32 |
| partial_arcface | Partial FT | ArcFace | 75.62 | 93.00 | 22.0 | 7.56 |
| market1501 (zero-shot) | Pretrained | -- | 75.26 | 92.90 | 21.6 | 7.62 |

ViT-Base with Full FT and TripletMarginLoss achieves the best mAP (85.65%) at the cost of 1.68x higher latency compared to the best ViT-Small run. ViT-Small with Full FT and TripletMarginLoss is the recommended alternative for throughput-constrained deployments, with mAP of 81.50% at 7.02 ms per image.


## Reproducibility


A global seed (42) is applied to Python, NumPy, PyTorch, and CUDA. All training cells are idempotent: if a run key is already present in `RESULTS`, training is skipped and the existing entry is used directly. Results can be fully restored from `all_results.json` after a kernel restart without re-running any training phase.


## Estimated Runtime


Full end-to-end execution (all four phases, evaluation, and plotting) takes approximately **15-20 hours** on a Kaggle single T4 session, depending on dataset I/O speed and Kaggle session availability.


# References

[1] Zheng, L., Zhang, H., Sun, S., Chandraker, M., Yang, Y., and Tian, Q.
"Person Re-identification in the Wild."
*IEEE Conference on Computer Vision and Pattern Recognition (CVPR)*, 2017.
[https://arxiv.org/abs/1604.02531](https://arxiv.org/abs/1604.02531)

[2] Hu, B., Wang, X., and Liu, W.
"PersonViT: Large-scale Self-supervised Vision Transformer for Person Re-Identification."
*Machine Vision and Applications*, 2025. DOI: 10.1007/s00138-025-01659-y
[https://arxiv.org/abs/2408.05398](https://arxiv.org/abs/2408.05398)

[3] hustvl.
"PersonViT — Official GitHub Repository."
[https://github.com/hustvl/PersonViT](https://github.com/hustvl/PersonViT)

[4] lakeAGI.
"PersonViT Pre-trained Weights (ViT-Base and ViT-Small)."
*Hugging Face Model Hub*, 2024.
[https://huggingface.co/lakeAGI/PersonViTReID/tree/main](https://huggingface.co/lakeAGI/PersonViTReID/tree/main)

[5] He, S., Luo, H., Wang, P., Wang, F., Li, H., and Jiang, W.
"TransReID: Transformer-based Object Re-Identification."
*IEEE International Conference on Computer Vision (ICCV)*, 2021.
[https://arxiv.org/abs/2102.04378](https://arxiv.org/abs/2102.04378)

[6] Musgrave, K., Belongie, S., and Lim, S.
"PyTorch Metric Learning."
*arXiv preprint arXiv:2008.09164*, 2020.
[https://arxiv.org/abs/2008.09164](https://arxiv.org/abs/2008.09164)