Stage 5b: Popcount Reformulation
Reformulate the Stage 0 classifier so that each of the 40 feature channels is first binarized against an offline-calibrated INT8 threshold; two 20-input popcounts then replace Stage 5's 40-term 8-bit signed adder tree.
Reformulation
Stage 5 form:
score = sum(pos_dims) − sum(neg_dims) // 40 × 8-bit adder tree
output = score > T // signed comparator
Stage 5b form:
for each dim i: b_i = (f_i > t_i) // 40 × 8-bit comparators
count_pos = popcount(b_0 .. b_19) // 20 → 5 bits
count_neg = popcount(b_20 .. b_39) // 20 → 5 bits
score_int = count_pos − count_neg // 6-bit subtract
output = score_int > K // small comparator
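As a software cross-check of the Stage 5b form, here is a minimal Python sketch of a bit-accurate reference model. It assumes dims 0..19 form the positive group and dims 20..39 the negative group, as in the pseudocode above. The function name `popcount_classify` and the example inputs are hypothetical and are not taken from the repository.

```python
from typing import Sequence

N_DIMS = 40
N_POS = 20  # assumption: dims 0..19 are the positive group, 20..39 the negative group


def popcount_classify(features: Sequence[int], thresholds: Sequence[int], k: int) -> bool:
    """Binarize each dim, popcount each half, subtract, compare against K."""
    if len(features) != N_DIMS or len(thresholds) != N_DIMS:
        raise ValueError("expected 40 features and 40 thresholds")
    for v in list(features) + list(thresholds):
        if not -128 <= v <= 127:
            raise ValueError("values must fit in signed INT8")
    bits = [int(f > t) for f, t in zip(features, thresholds)]
    count_pos = sum(bits[:N_POS])      # 0..20, fits in 5 bits
    count_neg = sum(bits[N_POS:])      # 0..20, fits in 5 bits
    score_int = count_pos - count_neg  # -20..20, fits in 6-bit signed
    return score_int > k


# Example: all 20 positive dims fire and no negative dim fires, so score_int = 20.
print(popcount_classify([10] * 20 + [-10] * 20, [0] * 40, k=13))  # True
```

A model like this could be run against the RTL on the same feature vectors to confirm the two agree; note that with K = 13 a score of exactly 13 does not trigger, because the comparison is strict.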
Offline calibration (see calibrate_popcount.py):
- For each of the 40 classifier dims, sweep a per-dim threshold on COCO val pooled-LN features, pick the one that best separates person-positive vs person-negative.
- Pick the final integer threshold K by sweeping K ∈ [−20, 20].
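The two calibration steps above can be sketched as follows. This illustrates the idea and is not the contents of calibrate_popcount.py: the separation criterion (TPR minus FPR of the single-bit split, signed by the dim's polarity), the candidate set (the observed feature values), and all function names are assumptions.

```python
from typing import Sequence, Tuple


def f1_score(preds: Sequence[bool], labels: Sequence[bool]) -> float:
    """Standard F1 = 2*TP / (2*TP + FP + FN); 0.0 if undefined."""
    tp = sum(p and y for p, y in zip(preds, labels))
    fp = sum(p and not y for p, y in zip(preds, labels))
    fn = sum((not p) and y for p, y in zip(preds, labels))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(values: Sequence[int], labels: Sequence[bool], polarity: int) -> int:
    """Pick t maximizing polarity * (TPR - FPR) of the bit (value > t).

    polarity = +1 for a positive-group dim, -1 for a negative-group dim.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best_t, best_j = min(values), float("-inf")
    for t in sorted(set(values)):
        tpr = sum(v > t for v, y in zip(values, labels) if y) / max(n_pos, 1)
        fpr = sum(v > t for v, y in zip(values, labels) if not y) / max(n_neg, 1)
        j = polarity * (tpr - fpr)
        if j > best_j:
            best_t, best_j = t, j
    return best_t


def sweep_k(scores: Sequence[int], labels: Sequence[bool]) -> Tuple[int, float]:
    """Sweep K over [-20, 20]; return the first K with the highest F1, and that F1."""
    best = (-20, -1.0)
    for k in range(-20, 21):
        f1 = f1_score([s > k for s in scores], labels)
        if f1 > best[1]:
            best = (k, f1)
    return best
```

In this sketch, `best_threshold` would be called once per dim on that dim's calibration features, and `sweep_k` once on the resulting popcount scores.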
Accuracy
| Variant | F1 | ΔF1 vs Stage 0 |
|---|---|---|
| Stage 0 additive (float) | 0.884 | n/a |
| Stage 5b popcount (quantized, K=13) | 0.876 | −0.008 |
40 per-dim thresholds live in per_dim_thresholds.json; K = 13.
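One detail when converting a calibrated float threshold to its INT8 form is the rounding direction. The sketch below assumes a symmetric quantization in which a channel's real value is `q * scale` with `scale > 0` and zero point 0; the actual scheme and JSON layout used in the repository are not specified here, and `quantize_threshold` is a hypothetical name. Under that assumption, flooring keeps the integer comparison exactly equivalent to the float one.

```python
import math


def quantize_threshold(t_float: float, scale: float) -> int:
    """Return integer t_q such that (q > t_q) == (q * scale > t_float) for integer q.

    Assumes real_value = q * scale with scale > 0 (symmetric, zero point 0).
    For integer q, q > t_float / scale holds exactly when q > floor(t_float / scale).
    """
    if scale <= 0:
        raise ValueError("scale must be positive")
    t_q = math.floor(t_float / scale)
    if not -128 <= t_q <= 127:
        raise ValueError("threshold falls outside the INT8 range")
    return t_q


# Example: with scale 0.25, a float threshold of 1.2 corresponds to 4.8 steps, so t_q = 4.
print(quantize_threshold(1.2, 0.25))  # 4
```

Rounding to nearest instead of flooring would shift the decision boundary for any feature value that lands between the two candidates.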
Synthesis (Yosys + ABC, target gate library AND + XOR)
Both variants are synthesized with their constants baked in, which is the realistic deployment shape: calibrated thresholds are hard-wired.
| Variant | Cells | AND | NOT | XOR |
|---|---|---|---|---|
| Stage 5 additive (threshold baked) | 3,129 | 1,133 | 1,281 | 715 |
| Stage 5b popcount (thresholds baked) | 907 | 378 | 452 | 77 |
| Reduction | −71 % | −67 % | −65 % | −89 % |
The XOR drop is the cleanest signal: popcount barely needs any, because there is no multi-bit addition chain to implement.
For historical reference, the unfolded-threshold variants (thresholds as runtime inputs) came in at Stage 5 = 3,220 cells and Stage 5b = 3,040 cells, a much narrower gap. Constant folding is where the popcount reformulation earns its keep.
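The percentages in the table and the size of the unfolded gap can be reproduced from the reported cell counts. This is a small arithmetic check only; the dictionary and function names are illustrative.

```python
# Cell counts copied from the synthesis table (constants baked in).
stage5 = {"Cells": 3129, "AND": 1133, "NOT": 1281, "XOR": 715}
stage5b = {"Cells": 907, "AND": 378, "NOT": 452, "XOR": 77}


def reduction_pct(before: int, after: int) -> float:
    """Percentage change from before to after (negative means fewer cells)."""
    return 100.0 * (after - before) / before


folded = {name: round(reduction_pct(stage5[name], stage5b[name])) for name in stage5}
unfolded = reduction_pct(3220, 3040)  # runtime-threshold variants

print(folded)                             # {'Cells': -71, 'AND': -67, 'NOT': -65, 'XOR': -89}
print(f"unfolded gap: {unfolded:.1f} %")  # unfolded gap: -5.6 %
```

With runtime thresholds the popcount variant saves under 6 % of cells, versus 71 % once the thresholds are folded in.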
Files
- calibrate_popcount.py: per-dim threshold calibration + F1 sweep
- per_dim_thresholds.json: calibrated float thresholds + INT8-quantized forms used in the Verilog
- person_classifier_popcount.v: RTL with runtime thresholds
- person_classifier_popcount_folded.v: RTL with thresholds baked in
- person_classifier_sum_folded.v: Stage 5-equivalent additive classifier with threshold baked in (fair-comparison baseline)
- synth*.ys, synth*.log: Yosys scripts and logs
Deployment implication
907 gates in a modern 22 nm FD-SOI process occupy less than 0.01 mm². The circuit fits inside an always-on wake block of a camera sensor ISP. The logic is purely combinational and produces a result in one cycle; the only external cost is the per-frame backbone forward pass that feeds the 40 selected INT8 channels.