# Stage 5b: Popcount Reformulation

Reformulate the Stage 0 classifier so that each of the 40 feature channels is first binarized against an offline-calibrated INT8 threshold. Two 20-input popcounts then replace Stage 5's 40-term 8-bit signed adder tree.

## Reformulation

Stage 5 form:

```
score  = sum(pos_dims) − sum(neg_dims)   // 40 × 8-bit adder tree
output = score > T                       // signed comparator
```

Stage 5b form:

```
for each dim i:
    b_i = (f_i > t_i)                    // 40 × 8-bit comparators
count_pos = popcount(b_0 .. b_19)        // 20 → 5 bits
count_neg = popcount(b_20 .. b_39)       // 20 → 5 bits
score_int = count_pos − count_neg        // 6-bit subtract
output    = score_int > K                // small comparator
```

Offline calibration (see `calibrate_popcount.py`):

- For each of the 40 classifier dims, sweep a per-dim threshold on COCO val pooled-LN features and pick the one that best separates person-positive from person-negative samples.
- Pick the final integer threshold K by sweeping K ∈ [−20, 20].

## Accuracy

```
variant                                 F1      ΔF1 vs Stage 0
Stage 0 additive (float)                0.884   n/a
Stage 5b popcount (quantized, K=13)     0.876   −0.008
```

The 40 per-dim thresholds live in `per_dim_thresholds.json`; K = 13.

## Synthesis (Yosys + ABC, target gate library AND + XOR, plus the inverters reported as NOT below)

Both variants are synthesized with their constants baked in. This is the realistic deployment shape, in which the calibrated thresholds are hard-wired.

| Variant | Cells | AND | NOT | XOR |
|---|---:|---:|---:|---:|
| Stage 5 additive (threshold baked) | 3,129 | 1,133 | 1,281 | 715 |
| **Stage 5b popcount (thresholds baked)** | **907** | **378** | **452** | **77** |
| Reduction | **−71 %** | −67 % | −65 % | −89 % |

The XOR drop is the cleanest signal: the popcount form needs very few XOR cells because it has no wide multi-bit addition chain to implement.

For historical reference, the unfolded-threshold variants (thresholds as runtime inputs) came in at Stage 5 = 3,220 cells and Stage 5b = 3,040 cells, a much narrower gap of roughly 6 %. Constant folding is where the popcount reformulation earns its keep.

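
As a software cross-check of the Stage 5b form and the two calibration sweeps described under Reformulation, the following is a minimal pure-Python sketch. It is not the contents of `calibrate_popcount.py`: the function names are hypothetical, the split of dims 0..19 as positive and 20..39 as negative is read off the pseudocode, and the separation criterion (the gap between true-positive and false-positive rate) is an assumption, since the text only says "best separates".

```python
N_POS = 20  # assumption: dims 0..19 vote for "person", dims 20..39 vote against


def popcount_score(feature, thresholds):
    """Stage 5b integer score in [-20, 20] for one 40-element INT8 feature vector."""
    bits = [f > t for f, t in zip(feature, thresholds)]  # b_i = (f_i > t_i)
    return sum(bits[:N_POS]) - sum(bits[N_POS:])         # count_pos - count_neg


def calibrate_thresholds(features, labels):
    """Per-dim INT8 threshold sweep; the TPR - FPR criterion is an assumption."""
    pos = [f for f, y in zip(features, labels) if y == 1]
    neg = [f for f, y in zip(features, labels) if y == 0]
    thresholds = []
    for i in range(len(features[0])):
        sign = 1 if i < N_POS else -1  # negative dims should fire on non-person inputs

        def gap(t):
            tpr = sum(f[i] > t for f in pos) / len(pos)
            fpr = sum(f[i] > t for f in neg) / len(neg)
            return sign * (tpr - fpr)

        # t = 127 is skipped because f > 127 can never fire for INT8 inputs.
        thresholds.append(max(range(-128, 127), key=gap))
    return thresholds


def sweep_k(scores, labels):
    """Return (K, F1) for the K in [-20, 20] that maximizes F1 of score > K."""
    best_k, best_f1 = -20, -1.0
    for k in range(-20, 21):
        preds = [s > k for s in scores]
        tp = sum(p and y == 1 for p, y in zip(preds, labels))
        fp = sum(p and y == 0 for p, y in zip(preds, labels))
        fn = sum((not p) and y == 1 for p, y in zip(preds, labels))
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        if f1 > best_f1:
            best_k, best_f1 = k, f1
    return best_k, best_f1
```

A model like this can serve as a golden reference when simulating the RTL: feed the same INT8 vectors to both and compare `popcount_score(f, thresholds) > K` against the circuit output.
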
## Files

- `calibrate_popcount.py`: per-dim threshold calibration + F1 sweep
- `per_dim_thresholds.json`: calibrated float thresholds + INT8-quantized forms used in the Verilog
- `person_classifier_popcount.v`: RTL with runtime thresholds
- `person_classifier_popcount_folded.v`: RTL with thresholds baked in
- `person_classifier_sum_folded.v`: Stage 5-equivalent additive classifier with threshold baked in (fair-comparison baseline)
- `synth*.ys`, `synth*.log`: Yosys scripts and logs

## Deployment implication

At 907 gates, the circuit occupies less than 0.01 mm² in a modern 22 nm FD-SOI process, so it fits inside the always-on wake block of a camera sensor ISP. The logic is purely combinational and produces a decision in a single cycle; the only external cost is the per-frame backbone forward pass that feeds the 40 selected INT8 channels.
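
The area claim can be turned into a per-cell budget with simple arithmetic. The sketch below only divides the stated 0.01 mm² bound by the 907-cell count; the function name is hypothetical, and no actual standard-cell area for any 22 nm FD-SOI library is assumed.

```python
def max_avg_cell_area_um2(cells, area_budget_mm2):
    """Largest average placed-cell area (in µm²) that still fits the area budget."""
    return area_budget_mm2 * 1e6 / cells  # 1 mm² = 1e6 µm²


budget = max_avg_cell_area_um2(907, 0.01)
print(f"{budget:.2f} um^2 per cell")  # roughly 11 µm² per cell
```

The sub-0.01 mm² figure therefore holds as long as the average cell footprint, including placement and routing overhead, stays below about 11 µm².
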