# Stage 5b: Popcount Reformulation

Reformulate the Stage 0 classifier so that each of the 40 feature channels is first binarized against an offline-calibrated INT8 threshold; two 20-bit popcounts then replace Stage 5's 40-term 8-bit signed adder tree.

## Reformulation

Stage 5 form:

```
score  = sum(pos_dims) − sum(neg_dims)       // 40 × 8-bit adder tree
output = score > T                           // signed comparator
```

Stage 5b form:

```
for each dim i:  b_i = (f_i > t_i)           // 40 × 8-bit comparators
count_pos      = popcount(b_0 .. b_19)       // 20 → 5 bits
count_neg      = popcount(b_20 .. b_39)      // 20 → 5 bits
score_int      = count_pos − count_neg       // 6-bit subtract
output         = score_int > K               // small comparator
```
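
The Stage 5b form can be mirrored in software as a bit-level reference model, which is handy for checking RTL outputs against known vectors. The following is a minimal sketch, not the repository's code: the function name `popcount_classify` is hypothetical, and it assumes signed INT8 comparison and that channels 0..19 are the positive group and 20..39 the negative group, as the pseudocode above suggests.

```python
from typing import Sequence


def popcount_classify(features: Sequence[int],
                      thresholds: Sequence[int],
                      k: int) -> bool:
    """Reference model of the Stage 5b form for 40 INT8 channels.

    Assumption: indices 0..19 feed count_pos and 20..39 feed count_neg.
    """
    if len(features) != 40 or len(thresholds) != 40:
        raise ValueError("expected 40 features and 40 thresholds")

    # 40 independent comparators: b_i = (f_i > t_i)
    bits = [int(f > t) for f, t in zip(features, thresholds)]

    # Two 20-bit popcounts; each result lies in 0..20 and fits in 5 bits.
    count_pos = sum(bits[:20])
    count_neg = sum(bits[20:])

    # The difference lies in -20..20 and fits in 6-bit two's complement.
    score_int = count_pos - count_neg
    return score_int > k


# Example: every positive channel fires, no negative channel fires.
strong_person = [127] * 20 + [-128] * 20
decision = popcount_classify(strong_person, [0] * 40, k=13)
```

With all thresholds at 0 and K = 13, the example vector gives `count_pos = 20`, `count_neg = 0`, so the decision is positive; at least 14 more positive than negative channels must fire to cross K = 13.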

Offline calibration (see `calibrate_popcount.py`):
- For each of the 40 classifier dims, sweep a per-dim threshold on COCO val pooled-LN features, pick the one that best separates person-positive vs person-negative.
- Pick the final integer threshold K by sweeping K ∈ [−20, 20].
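
The two calibration steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the contents of `calibrate_popcount.py`: the helper names are hypothetical, the per-dim criterion (difference in firing rate between the two classes) is an assumed stand-in for "best separates", and K is chosen by maximizing F1.

```python
from typing import Sequence, Tuple


def best_threshold(values: Sequence[int], labels: Sequence[bool],
                   polarity: int = 1) -> int:
    """Sweep INT8 thresholds for one dim and keep the best separator.

    Assumed criterion: polarity * (firing rate on positives - firing rate
    on negatives), where a bit fires when value > t. Use polarity=-1 for
    dims in the negative group.
    """
    n_pos = sum(1 for y in labels if y)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one sample of each class")
    best_t, best_gap = -128, float("-inf")
    for t in range(-128, 127):
        fire_pos = sum(1 for v, y in zip(values, labels) if y and v > t)
        fire_neg = sum(1 for v, y in zip(values, labels) if not y and v > t)
        gap = polarity * (fire_pos / n_pos - fire_neg / n_neg)
        if gap > best_gap:  # strict '>' keeps the lowest tied threshold
            best_t, best_gap = t, gap
    return best_t


def f1_score(pred: Sequence[bool], labels: Sequence[bool]) -> float:
    """F1 = 2*TP / (2*TP + FP + FN), defined as 0.0 when the denominator is 0."""
    tp = sum(1 for p, y in zip(pred, labels) if p and y)
    fp = sum(1 for p, y in zip(pred, labels) if p and not y)
    fn = sum(1 for p, y in zip(pred, labels) if not p and y)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def sweep_k(scores: Sequence[int], labels: Sequence[bool]) -> Tuple[int, float]:
    """Pick K in [-20, 20] that maximizes F1 of (score_int > K)."""
    best_k, best_f1 = -20, -1.0
    for k in range(-20, 21):
        f1 = f1_score([s > k for s in scores], labels)
        if f1 > best_f1:
            best_k, best_f1 = k, f1
    return best_k, best_f1
```

In use, `best_threshold` would be called once per dim on that dim's calibration values, and `sweep_k` once on the resulting `count_pos − count_neg` scores. Ties resolve to the lowest threshold or K here; a real calibration might prefer the midpoint of the tied range for margin.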

## Accuracy

```
variant                                  F1      ΔF1 vs Stage 0
Stage 0 additive (float)                0.884   -
Stage 5b popcount (quantized, K=13)     0.876   −0.008
```

40 per-dim thresholds live in `per_dim_thresholds.json`; K = 13.

## Synthesis (Yosys + ABC, target gate library AND + XOR)

Both variants are synthesized with their constants baked in, the realistic deployment shape in which calibrated thresholds are hard-wired.

| Variant | Cells | AND | NOT | XOR |
|---|---:|---:|---:|---:|
| Stage 5 additive (threshold baked) | 3,129 | 1,133 | 1,281 | 715 |
| **Stage 5b popcount (thresholds baked)** | **907** | **378** | **452** | **77** |
| Reduction | **−71 %** | −67 % | −65 % | −89 % |

The XOR drop is the cleanest signal: popcount barely needs any, because there is no multi-bit addition chain to implement.

For historical reference, the unfolded-threshold variants (thresholds as runtime inputs) came in at 3,220 cells for Stage 5 and 3,040 cells for Stage 5b, a much narrower gap of about 6 %. Constant folding is where the popcount reformulation earns its keep.
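
The reduction figures can be reproduced from the reported cell counts alone. The snippet below is a small arithmetic check; the helper name `reduction_pct` is illustrative, and the numbers are copied from the table and the paragraph above.

```python
def reduction_pct(baseline: int, candidate: int) -> float:
    """Relative change from baseline to candidate, in percent."""
    return 100.0 * (candidate - baseline) / baseline


# Folded (thresholds baked in): Stage 5 additive vs Stage 5b popcount.
folded = {
    "cells": reduction_pct(3129, 907),
    "AND": reduction_pct(1133, 378),
    "NOT": reduction_pct(1281, 452),
    "XOR": reduction_pct(715, 77),
}

# Unfolded (thresholds as runtime inputs): total cells only.
unfolded_cells = reduction_pct(3220, 3040)

for name, pct in folded.items():
    print(f"folded {name}: {pct:.0f} %")
print(f"unfolded cells: {unfolded_cells:.1f} %")
```

The folded totals round to the percentages in the table, while the unfolded pair differs by under 6 %, which is the gap the constant-folding remark refers to.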

## Files

- `calibrate_popcount.py`: per-dim threshold calibration + F1 sweep
- `per_dim_thresholds.json`: calibrated float thresholds + INT8-quantized forms used in the Verilog
- `person_classifier_popcount.v`: RTL with runtime thresholds
- `person_classifier_popcount_folded.v`: RTL with thresholds baked in
- `person_classifier_sum_folded.v`: Stage 5-equivalent additive classifier with threshold baked in (fair-comparison baseline)
- `synth*.ys`, `synth*.log`: Yosys scripts and logs

## Deployment implication

In a modern 22 nm FD-SOI process, 907 gates occupy under 0.01 mm², so the circuit fits inside an always-on wake block of a camera sensor ISP. The logic is purely combinational and produces its result within a single cycle; the only external cost is the per-frame backbone forward pass that feeds the 40 selected INT8 channels.