
Stage 5b: Popcount Reformulation

Reformulate the Stage 0 classifier so that each of the 40 feature channels is first binarized against an offline-calibrated INT8 threshold. Two 20-bit popcounts then replace Stage 5's 40-term 8-bit signed adder tree.

Reformulation

Stage 5 form:

score  = sum(pos_dims) − sum(neg_dims)                                // 40 × 8-bit adder tree
output = score > T                                                    // signed comparator

Stage 5b form:

for each dim i:  b_i = (f_i > t_i)                                   // 40 Γ— 8-bit comparators
count_pos      = popcount(b_0 .. b_19)                               // 20 → 5 bits
count_neg      = popcount(b_20 .. b_39)                              // 20 → 5 bits
score_int      = count_pos − count_neg                               // 6-bit subtract
output         = score_int > K                                        // small comparator
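
A minimal software model of the Stage 5b form can be useful for checking the RTL against known inputs. The sketch below assumes the channel layout shown in the pseudocode (dims 0..19 positive, dims 20..39 negative); the function name `popcount_classify` and the toy inputs are illustrative and do not come from the repository.

```python
from typing import Sequence

N_POS = 20  # dims 0..19 vote for "person" (layout assumed from the pseudocode)
N_NEG = 20  # dims 20..39 vote against


def popcount_classify(features: Sequence[int], thresholds: Sequence[int], k: int) -> bool:
    """Software model of the Stage 5b decision rule (illustrative only)."""
    if len(features) != N_POS + N_NEG or len(thresholds) != N_POS + N_NEG:
        raise ValueError("expected 40 features and 40 thresholds")
    # One comparator per channel: b_i = (f_i > t_i).
    bits = [int(f > t) for f, t in zip(features, thresholds)]
    count_pos = sum(bits[:N_POS])      # popcount, range 0..20 (fits in 5 bits)
    count_neg = sum(bits[N_POS:])      # popcount, range 0..20 (fits in 5 bits)
    score_int = count_pos - count_neg  # range -20..20 (fits in 6 signed bits)
    return score_int > k


# Toy inputs, not calibrated values: 15 positive channels fire, 1 negative channel fires.
features = [10] * 15 + [-10] * 5 + [10] * 1 + [-10] * 19
thresholds = [0] * 40
print(popcount_classify(features, thresholds, k=13))  # 15 - 1 = 14 > 13, prints True
```

Because each count is bounded by 20, the difference always lies in [−20, 20], which is why a 6-bit signed subtract is sufficient.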

Offline calibration (see calibrate_popcount.py):

  • For each of the 40 classifier dims, sweep a per-dim threshold on COCO val pooled-LN features, pick the one that best separates person-positive vs person-negative.
  • Pick the final integer threshold K by sweeping K ∈ [−20, 20].
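
The two calibration steps can be sketched as follows. This is not the contents of calibrate_popcount.py: the separation criterion (TPR minus FPR, i.e. Youden's J) is an assumption, since the text only says "best separates", and the helper names `best_threshold`, `f1_score`, and `best_k` are hypothetical. For a negative-evidence channel, one would pass inverted labels to `best_threshold`.

```python
from typing import Sequence, Tuple


def best_threshold(values: Sequence[int], labels: Sequence[bool]) -> int:
    """Pick the INT8 threshold t that maximizes TPR - FPR of the bit (value > t).

    The criterion is an assumption made for this sketch.
    """
    n_pos = sum(1 for y in labels if y)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative sample")
    best_t, best_j = -128, float("-inf")
    for t in range(-128, 127):
        tp = sum(1 for v, y in zip(values, labels) if y and v > t)
        fp = sum(1 for v, y in zip(values, labels) if not y and v > t)
        j = tp / n_pos - fp / n_neg
        if j > best_j:  # strict comparison keeps the lowest best threshold
            best_t, best_j = t, j
    return best_t


def f1_score(preds: Sequence[bool], labels: Sequence[bool]) -> float:
    tp = sum(1 for p, y in zip(preds, labels) if p and y)
    fp = sum(1 for p, y in zip(preds, labels) if p and not y)
    fn = sum(1 for p, y in zip(preds, labels) if not p and y)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_k(scores: Sequence[int], labels: Sequence[bool]) -> Tuple[int, float]:
    """Sweep K over [-20, 20]; return the (K, F1) pair with the highest F1."""
    best = (-20, -1.0)
    for k in range(-20, 21):
        f1 = f1_score([s > k for s in scores], labels)
        if f1 > best[1]:
            best = (k, f1)
    return best


# Toy data only, to show the call shape.
print(best_threshold([-5, -3, 2, 4], [False, False, True, True]))  # -3
print(best_k([-4, 0, 3, 6], [False, False, True, True]))           # (0, 1.0)
```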

Accuracy

Variant                                  F1      ΔF1 vs Stage 0
Stage 0 additive (float)                0.884   (baseline)
Stage 5b popcount (quantized, K=13)     0.876   −0.008

40 per-dim thresholds live in per_dim_thresholds.json; K = 13.

Synthesis (Yosys + ABC, target gate library AND + XOR)

Both variants are synthesized with their constants baked in, which is the realistic deployment shape: calibrated thresholds are hard-wired.

Variant                                Cells    AND     NOT     XOR
Stage 5 additive (threshold baked)     3,129    1,133   1,281   715
Stage 5b popcount (thresholds baked)   907      378     452     77
Reduction                              −71 %    −67 %   −65 %   −89 %
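
The reduction row can be recomputed from the cell counts in the table. The short check below uses only those numbers; the dictionary names are illustrative.

```python
# Cell counts copied from the synthesis table.
stage5 = {"Cells": 3129, "AND": 1133, "NOT": 1281, "XOR": 715}
stage5b = {"Cells": 907, "AND": 378, "NOT": 452, "XOR": 77}

# The per-type counts should add up to the total cell count.
assert stage5["AND"] + stage5["NOT"] + stage5["XOR"] == stage5["Cells"]
assert stage5b["AND"] + stage5b["NOT"] + stage5b["XOR"] == stage5b["Cells"]

# Percentage change from Stage 5 to Stage 5b, rounded to whole percent.
reduction = {key: round(100 * (stage5b[key] / stage5[key] - 1)) for key in stage5}
print(reduction)  # {'Cells': -71, 'AND': -67, 'NOT': -65, 'XOR': -89}
```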

The XOR drop is the cleanest signal: popcount needs very few XORs, because the 40-term 8-bit addition chain is replaced by counting single bits.

For historical reference, the unfolded-threshold variants (thresholds as runtime inputs) came in at Stage 5 = 3,220 cells and Stage 5b = 3,040 cells, a much narrower gap of about 6 %. Constant folding is where the popcount reformulation earns its keep.
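
One way to see why folding helps the comparators in particular: a greater-than test against a known constant collapses to a chain with one 2-input AND or OR per bit and no XOR. The sketch below models that standard construction in Python. It is an illustration of the principle under that assumption, not the netlist that Yosys + ABC actually emit, and the function names are hypothetical.

```python
def gt_const(x: int, const: int, width: int = 8) -> bool:
    """Evaluate unsigned x > const using one AND or OR per bit, chosen by the constant."""
    gt = False  # with no bits examined the operands are equal, so "not greater"
    for i in range(width):  # LSB to MSB; higher bits override lower ones
        x_bit = bool((x >> i) & 1)
        if (const >> i) & 1:
            gt = x_bit and gt  # constant bit 1: x bit must be 1, then defer to lower bits
        else:
            gt = x_bit or gt   # constant bit 0: an x bit of 1 wins outright
    return gt


def gt_const_signed(x: int, const: int) -> bool:
    """Signed INT8 compare: flip the sign bit of both operands, then compare unsigned."""
    return gt_const((x & 0xFF) ^ 0x80, (const & 0xFF) ^ 0x80)


# Exhaustive check against Python's own comparison.
assert all(gt_const(x, c) == (x > c) for x in range(256) for c in range(256))
assert all(
    gt_const_signed(x, c) == (x > c)
    for x in range(-128, 128)
    for c in range(-128, 128)
)
```

With a runtime threshold the comparator cannot pick the gate per bit in advance, which is consistent with the much larger cell counts of the unfolded variants.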

Files

  • calibrate_popcount.py: per-dim threshold calibration + F1 sweep
  • per_dim_thresholds.json: calibrated float thresholds + INT8-quantized forms used in the Verilog
  • person_classifier_popcount.v: RTL with runtime thresholds
  • person_classifier_popcount_folded.v: RTL with thresholds baked in
  • person_classifier_sum_folded.v: Stage 5-equivalent additive classifier with threshold baked in (fair-comparison baseline)
  • synth*.ys, synth*.log: Yosys scripts and logs

Deployment implication

On a modern 22 nm FD-SOI process, 907 gates occupy less than 0.01 mm². The circuit fits inside an always-on wake block of a camera sensor ISP. The logic is purely combinational and completes in one cycle; the only external cost is the per-frame backbone forward pass that feeds the 40 selected INT8 channels.