# Stage 5b: Popcount Reformulation
Reformulate the Stage 0 classifier so that each of the 40 feature channels is first binarized against an offline-calibrated INT8 threshold. Two 20-bit popcounts then replace Stage 5's 40-term 8-bit signed adder tree.
## Reformulation
Stage 5 form:
```
score  = sum(pos_dims) - sum(neg_dims)   // 40 × 8-bit adder tree
output = score > T // signed comparator
```
Stage 5b form:
```
for each dim i: b_i = (f_i > t_i)        // 40 × 8-bit comparators
count_pos = popcount(b_0 .. b_19)        // 20 → 5 bits
count_neg = popcount(b_20 .. b_39)       // 20 → 5 bits
score_int = count_pos - count_neg        // 6-bit subtract
output = score_int > K // small comparator
```
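The two forms can be sketched in Python for side-by-side comparison. This is a minimal illustration, not the project's RTL or calibration code: the function names and the constants `N_POS` / `N_NEG` are hypothetical, and the channel layout (first 20 positive, last 20 negative) is taken from the pseudocode above.

```python
from typing import Sequence

N_POS = 20  # channels 0..19 count as evidence for "person" (assumed layout)
N_NEG = 20  # channels 20..39 count as evidence against


def additive_classify(features: Sequence[int], threshold: int) -> bool:
    """Stage 5 form: signed sum of positive dims minus sum of negative dims."""
    score = sum(features[:N_POS]) - sum(features[N_POS:N_POS + N_NEG])
    return score > threshold


def popcount_classify(
    features: Sequence[int], thresholds: Sequence[int], k: int
) -> bool:
    """Stage 5b form: binarize each channel, then compare two popcounts."""
    if len(features) != N_POS + N_NEG or len(thresholds) != len(features):
        raise ValueError("expected 40 features and 40 thresholds")
    bits = [int(f > t) for f, t in zip(features, thresholds)]
    count_pos = sum(bits[:N_POS])      # 0..20, fits in 5 bits
    count_neg = sum(bits[N_POS:])      # 0..20, fits in 5 bits
    score_int = count_pos - count_neg  # -20..20, fits in 6 signed bits
    return score_int > k
```

With all thresholds at 0 and K = 13, an input needs at least 14 more firing positive channels than firing negative channels to be classified as a person.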
Offline calibration (see `calibrate_popcount.py`):
- For each of the 40 classifier dims, sweep a per-dim threshold on COCO val pooled-LN features and pick the value that best separates person-positive from person-negative samples.
- Pick the final integer threshold K by sweeping K ∈ [−20, 20].
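The two calibration steps might look roughly like the sketch below. It is not the contents of `calibrate_popcount.py`: the function names are hypothetical, the separation criterion (difference in firing rate between the two classes) is an assumption because the text does not say how "best separates" is measured, and selecting K by F1 is inferred from the file description rather than stated.

```python
from typing import List, Sequence, Tuple


def f1_score(preds: Sequence[bool], labels: Sequence[bool]) -> float:
    tp = sum(p and y for p, y in zip(preds, labels))
    fp = sum(p and not y for p, y in zip(preds, labels))
    fn = sum((not p) and y for p, y in zip(preds, labels))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(values: Sequence[int], labels: Sequence[bool], positive: bool) -> int:
    """Sweep every INT8 threshold t and keep the one whose bit (v > t)
    separates the classes best. Assumed criterion: firing rate on the class
    the channel should respond to, minus firing rate on the other class."""
    n_person = sum(labels) or 1
    n_other = (len(labels) - sum(labels)) or 1
    best_t, best_sep = -128, float("-inf")
    for t in range(-128, 128):
        fire_person = sum(v > t for v, y in zip(values, labels) if y) / n_person
        fire_other = sum(v > t for v, y in zip(values, labels) if not y) / n_other
        sep = fire_person - fire_other if positive else fire_other - fire_person
        if sep > best_sep:
            best_t, best_sep = t, sep
    return best_t


def calibrate(
    features: Sequence[Sequence[int]], labels: Sequence[bool], n_pos: int = 20
) -> Tuple[List[int], int]:
    """Return (per-dim thresholds, K). Dims below n_pos are positive channels."""
    n_dims = len(features[0])
    thresholds = [
        best_threshold([row[d] for row in features], labels, positive=d < n_pos)
        for d in range(n_dims)
    ]
    scores = []
    for row in features:
        bits = [int(f > t) for f, t in zip(row, thresholds)]
        scores.append(sum(bits[:n_pos]) - sum(bits[n_pos:]))
    # Sweep K over every reachable score, i.e. [-20, 20] for a 20/20 split.
    best_k = max(
        range(-(n_dims - n_pos), n_pos + 1),
        key=lambda k: f1_score([s > k for s in scores], labels),
    )
    return thresholds, best_k
```

Thresholds are fitted independently per channel, so the sweep costs 256 candidate values per dim, and the K sweep afterwards only needs the precomputed integer scores.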
## Accuracy
```
variant                               F1      ΔF1 vs Stage 0
Stage 0 additive (float)              0.884   n/a
Stage 5b popcount (quantized, K=13)   0.876   −0.008
```
40 per-dim thresholds live in `per_dim_thresholds.json`; K = 13.
## Synthesis (Yosys + ABC, target gate library AND + XOR)
Both variants are synthesized with their constants baked in, which is the realistic deployment shape: calibrated thresholds are hard-wired.
| Variant | Cells | AND | NOT | XOR |
|---|---:|---:|---:|---:|
| Stage 5 additive (threshold baked) | 3,129 | 1,133 | 1,281 | 715 |
| **Stage 5b popcount (thresholds baked)** | **907** | **378** | **452** | **77** |
| Reduction | **−71 %** | −67 % | −65 % | −89 % |
The XOR drop is the cleanest signal: popcount barely needs any, because there is no multi-bit addition chain to implement.
For historical reference, the unfolded-threshold variants (thresholds as runtime inputs) came in at Stage 5 = 3,220 cells and Stage 5b = 3,040 cells, a much narrower gap. Constant folding is where the popcount reformulation earns its keep.
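As a quick arithmetic check, the reductions can be recomputed from the cell counts quoted above. The helper name `reduction_pct` is illustrative only; the numbers are copied from this section.

```python
def reduction_pct(before: int, after: int) -> float:
    """Relative change in percent; negative means fewer cells."""
    return 100.0 * (after - before) / before


# (Stage 5, Stage 5b) cell counts with thresholds baked in.
folded = {
    "Cells": (3129, 907),
    "AND": (1133, 378),
    "NOT": (1281, 452),
    "XOR": (715, 77),
}
# Total cells with thresholds left as runtime inputs.
unfolded_cells = (3220, 3040)

for name, (stage5, stage5b) in folded.items():
    print(f"{name:5s} {reduction_pct(stage5, stage5b):+.1f} %")
print(f"Unfolded cells {reduction_pct(*unfolded_cells):+.1f} %")
```

The unfolded gap works out to under 6 %, against 71 % once the thresholds are folded in.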
## Files
- `calibrate_popcount.py`: per-dim threshold calibration + F1 sweep
- `per_dim_thresholds.json`: calibrated float thresholds + INT8-quantized forms used in the Verilog
- `person_classifier_popcount.v`: RTL with runtime thresholds
- `person_classifier_popcount_folded.v`: RTL with thresholds baked in
- `person_classifier_sum_folded.v`: Stage 5-equivalent additive classifier with threshold baked in (fair-comparison baseline)
- `synth*.ys`, `synth*.log`: Yosys scripts and logs
## Deployment implication
907 gates in a modern 22 nm FD-SOI process occupy less than 0.01 mm². The circuit fits inside an always-on wake block of a camera sensor ISP. The logic is purely combinational and produces a result in one cycle; the only external cost is the per-frame backbone forward pass that feeds the 40 selected INT8 channels.