Add per-parameter pruning rows + joint pruning result. R1 remains leading Pareto point.
README.md CHANGED
@@ -170,8 +170,14 @@ Cofiber Threshold variants trained on full COCO 2017 train (117,266 images). Fro

  | box32 pruned R2 | 768→32→4 | 91,640 | ~62,000 | **5.9** | 20.4 | **1.5** |
  | box32 pruned R3 | 768→32→4 | 91,640 | ~47,000 | 5.1 | 17.1 | 1.4 |
  | **dim20** | **768→20→80 cls, 20→16→4 reg** | **22,076** | **22,076** | **3.9** | **14.8** | 0.9 |
- | **dim20
- | **dim20
  | **dim15** | **768→15→80 cls, 15→16→4 reg** | **17,751** | **17,751** | **3.0** | **11.5** | 0.7 |
  | **dim10** | **768→10→80 cls, 10→16→4 reg** | **13,426** | **13,426** | **1.5** | **5.6** | 0.4 |
  | **dim5** | **768→5→80 cls, 5→16→4 reg** | **9,101** | **9,101** | **0.3** | **1.3** | 0.1 |
@@ -180,7 +186,17 @@ Pruning improved mAP from 5.7 to 5.9 at R2 (~62K nonzero) by removing noisy prot

  The dim15, dim10, and dim5 variants push the bottleneck further with the same SVD-initialization recipe applied to the top 15, top 10, and top 5 directions of the pruned R2 prototype matrix. Dim15 (17,751 parameters, 67% SVD energy retention) reaches 3.0 mAP. Dim10 (13,426 parameters, 61% energy retention) reaches 1.5 mAP, the smallest 80-class COCO detector to clear the 1.0 mAP threshold. Dim5 (9,101 parameters, 53% energy retention) drops to 0.3 mAP. The mAP scaling across dim20 → dim15 → dim10 is roughly geometric (3.9 → 3.0 → 1.5), but breaks sharply between 10 and 5 dimensions, where the curve falls off a cliff. Five directions sit below the intrinsic capacity needed for 80-class separation; the floor lies between 5 and 10 bottleneck dimensions, and finer probes at dim7/dim8 would localize the exact cliff.
- Dim20 was then itself put through magnitude pruning. The
#### Training recipe

  | box32 pruned R2 | 768→32→4 | 91,640 | ~62,000 | **5.9** | 20.4 | **1.5** |
  | box32 pruned R3 | 768→32→4 | 91,640 | ~47,000 | 5.1 | 17.1 | 1.4 |
  | **dim20** | **768→20→80 cls, 20→16→4 reg** | **22,076** | **22,076** | **3.9** | **14.8** | 0.9 |
+ | **dim20 R1** (project 25.7% sparse) | 768→20→80 cls | 22,076 | 18,121 | **3.9** | 14.6 | 0.8 |
+ | **dim20 R2** (project 26.6% sparse) | 768→20→80 cls | 22,076 | 17,988 | 3.8 | 14.5 | 0.7 |
+ | dim20 cls_weight pruned (37%) | 596 of 1600 cls weights zeroed | 22,076 | 21,480 | 3.8 | 14.4 | 0.7 |
+ | dim20 reg_hidden pruned (17%) | 55 of 320 reg weights zeroed | 22,076 | 22,021 | 3.8 | 14.5 | 0.7 |
+ | dim20 reg_out pruned (12%) | 8 of 64 reg weights zeroed | 22,076 | 22,068 | 3.8 | 14.5 | 0.7 |
+ | dim20 ctr_weight pruned (90%) | 18 of 20 ctr weights zeroed | 22,076 | 22,058 | 3.7 | 14.2 | 0.7 |
+ | dim20 R1 + cls greedy | project 25.7% + cls 45% sparse | 22,076 | 17,406 | 3.5 | 13.4 | 0.6 |
+ | dim20 joint (from R1) | whole-head magnitude pruning | 22,076 | 17,129 | 3.6 | 13.7 | 0.6 |
  | **dim15** | **768→15→80 cls, 15→16→4 reg** | **17,751** | **17,751** | **3.0** | **11.5** | 0.7 |
  | **dim10** | **768→10→80 cls, 10→16→4 reg** | **13,426** | **13,426** | **1.5** | **5.6** | 0.4 |
  | **dim5** | **768→5→80 cls, 5→16→4 reg** | **9,101** | **9,101** | **0.3** | **1.3** | 0.1 |
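The dim-series parameter totals scale linearly in the bottleneck width, which makes them easy to sanity-check: each bottleneck dimension carries 768 project weights, 80 cls weights, 16 reg-hidden weights, and 1 ctr weight (865 per dimension) on top of a dim-independent remainder. A minimal check against the table's totals (the 4,776-weight remainder, covering reg_out, biases, and everything else that does not scale with the bottleneck, is inferred from the table, not stated in the README):

```python
# Per-bottleneck-dimension weights, read off the dim20 parameter names:
# project 768*d, cls_weight d*80, reg_hidden d*16, ctr_weight d*1.
PER_DIM = 768 + 80 + 16 + 1          # 865 weights per bottleneck dim
FIXED = 22_076 - 20 * PER_DIM        # dim-independent remainder: 4,776

def head_params(d: int) -> int:
    """Total head parameters for bottleneck width d."""
    return PER_DIM * d + FIXED

# Reproduces the table's parameter column exactly.
for d, expected in [(20, 22_076), (15, 17_751), (10, 13_426), (5, 9_101)]:
    assert head_params(d) == expected
print(PER_DIM, FIXED)                # 865 4776
```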

  The dim15, dim10, and dim5 variants push the bottleneck further with the same SVD-initialization recipe applied to the top 15, top 10, and top 5 directions of the pruned R2 prototype matrix. Dim15 (17,751 parameters, 67% SVD energy retention) reaches 3.0 mAP. Dim10 (13,426 parameters, 61% energy retention) reaches 1.5 mAP, the smallest 80-class COCO detector to clear the 1.0 mAP threshold. Dim5 (9,101 parameters, 53% energy retention) drops to 0.3 mAP. The mAP scaling across dim20 → dim15 → dim10 is roughly geometric (3.9 → 3.0 → 1.5), but breaks sharply between 10 and 5 dimensions, where the curve falls off a cliff. Five directions sit below the intrinsic capacity needed for 80-class separation; the floor lies between 5 and 10 bottleneck dimensions, and finer probes at dim7/dim8 would localize the exact cliff.
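The SVD-initialization recipe can be sketched as a rank-k truncation of the prototype matrix: keep the top-k singular directions, and read energy retention as the kept fraction of squared singular-value mass. The function names and the toy matrix below are illustrative, not the repo's actual code:

```python
import numpy as np

def svd_energy_retention(W: np.ndarray, k: int) -> float:
    """Fraction of squared singular-value mass kept by the top-k directions."""
    s = np.linalg.svd(W, compute_uv=False)
    return float((s[:k] ** 2).sum() / (s ** 2).sum())

def svd_init(W: np.ndarray, k: int):
    """Split W (out x in) into a k-dim bottleneck pair initialized from the
    top-k singular directions, so up @ down is the best rank-k approximation."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    down = np.sqrt(s[:k])[:, None] * Vt[:k]   # in -> k projection
    up = U[:, :k] * np.sqrt(s[:k])            # k -> out expansion
    return down, up

# Toy stand-in for the pruned R2 prototype matrix (80 classes x 768 features).
W = np.random.default_rng(0).standard_normal((80, 768))
down, up = svd_init(W, 10)
print(down.shape, up.shape)                   # (10, 768) (80, 10)
print(round(svd_energy_retention(W, 10), 2))
```

Retention falls as k shrinks, which is the 67% → 61% → 53% progression quoted for dim15/dim10/dim5.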
+ Dim20 was then itself put through magnitude pruning. The mAP-driven pruner bisects over the magnitude-sorted weight list of a chosen parameter, uses full pycocotools mAP@[0.5:0.95] as the retention metric (1000 val images), and rolls back any pass that fails the 95% retention floor on full verification. It was run separately on each learned parameter of dim20, and also in a joint-magnitude variant that ranks every weight in the head against every other.
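One pass of that pruner can be sketched roughly as below; `eval_map` and `full_eval_map` stand in for the 1000-image pycocotools proxy and the full verification eval, and all names are illustrative assumptions rather than the repo's actual interface:

```python
import numpy as np

def prune_param_pass(weights, eval_map, full_eval_map, floor=0.95):
    """One magnitude-pruning pass over a single parameter tensor.

    Bisects on the prune count k over the magnitude-sorted weight list:
    zeroing the k smallest-magnitude weights is accepted while the proxy
    mAP stays above the retention floor.  The winning mask is then checked
    against the full eval, and the whole pass rolls back if it fails there.
    """
    order = np.argsort(np.abs(weights), axis=None)     # smallest first
    proxy_base = eval_map(weights)                     # 1000-image proxy
    full_base = full_eval_map(weights)                 # full verification

    def masked(k):
        flat = weights.copy().ravel()
        flat[order[:k]] = 0.0
        return flat.reshape(weights.shape)

    lo, hi, best = 0, weights.size, 0
    while lo <= hi:                                    # bisect on k
        k = (lo + hi) // 2
        if eval_map(masked(k)) >= floor * proxy_base:
            best, lo = k, k + 1                        # at least k is safe
        else:
            hi = k - 1

    pruned = masked(best)
    if full_eval_map(pruned) < floor * full_base:      # failed verification
        return weights, 0                              # roll the pass back
    return pruned, best
```

Running this once per parameter would give the per-parameter rows above; the joint variant sorts all head weights together instead of one tensor at a time.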
+ The leading Pareto point is **R1 (project layer 25.7% sparse, 18,121 nonzero, 3.9 mAP)**: the same mAP as unpruned dim20 with 18% fewer effective parameters, and the highest mAP-per-10K-parameter ratio in the table at 2.15. R2 pushes to 26.6% project sparsity (17,988 nonzero) at a small mAP cost (3.8). Per-parameter slack measurements:
+ - `project.weight`: 26.6% prunable (the sparsity that produced R1/R2)
+ - `cls_weight`: 37% prunable in isolation, 3.8 mAP at 21,480 nonzero
+ - `reg_hidden.weight`: 17% prunable, 3.8 mAP at 22,021 nonzero
+ - `reg_out.weight`: 12% prunable, 3.8 mAP at 22,068 nonzero
+ - `ctr_weight`: **90% prunable** (only 2 of 20 centerness weights load-bear), 3.7 mAP at 22,058 nonzero
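The zeroed-weight counts, quoted sparsities, and nonzero totals in these rows are mutually consistent; a quick arithmetic check (tensor sizes read off the table, and the 768×20 project size inferred from the architecture column):

```python
TOTAL = 22_076                      # unpruned dim20 head parameters
PROJECT = 768 * 20                  # project-layer weights (15,360)

# R1/R2: sparsity is quoted over the project layer only.
for tag, nonzero, quoted in [("R1", 18_121, 25.7), ("R2", 17_988, 26.6)]:
    assert round(100 * (TOTAL - nonzero) / PROJECT, 1) == quoted

# Per-parameter rows: (tensor size, weights zeroed, quoted sparsity %, nonzero).
rows = [
    ("cls_weight",        1600, 596, 37, 21_480),
    ("reg_hidden.weight",  320,  55, 17, 22_021),
    ("reg_out.weight",      64,   8, 12, 22_068),
    ("ctr_weight",          20,  18, 90, 22_058),
]
for name, size, zeroed, pct, nonzero in rows:
    assert round(100 * zeroed / size) == pct
    assert TOTAL - zeroed == nonzero
print("all rows consistent")
```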
+ Greedy stacking of cls_weight pruning on top of R1 reaches 17,406 nonzero but drops to 3.5 mAP, revealing an interaction between the parameters: the cls_weight slack was measured against unpruned dim20, and part of it comes from weights that compensate for the surviving project subspace, so removing them after pruning project costs more mAP than the per-parameter measurement suggested. Joint magnitude pruning across all 22K head weights (starting from R1) finds 17,129 nonzero at 3.6 mAP, the smallest dim20 found, but it does not Pareto-dominate R1: the bisection's 1000-image mAP proxy was systematically optimistic relative to the full 5000-image eval, so the 95% retention floor measured during pruning accepted more aggressive cuts than the full eval would have. R1 remains the leading point of the dim20 pruning Pareto.
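The Pareto structure is checkable mechanically from the table's (nonzero, mAP) pairs; a small dominance filter over the quoted numbers:

```python
# (variant, nonzero parameters, mAP) for the dim20 pruning points
points = [
    ("dim20 unpruned",  22_076, 3.9),
    ("R1",              18_121, 3.9),
    ("R2",              17_988, 3.8),
    ("R1 + cls greedy", 17_406, 3.5),
    ("joint",           17_129, 3.6),
]

def dominates(a, b):
    """a dominates b: no more parameters, no less mAP, strictly better in one."""
    return a[1] <= b[1] and a[2] >= b[2] and (a[1] < b[1] or a[2] > b[2])

front = [p for p in points if not any(dominates(q, p) for q in points)]
print([name for name, *_ in front])   # ['R1', 'R2', 'joint']
```

R1 survives as the highest-mAP front point; the unpruned head and the greedy stack are dominated, matching the text.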

#### Training recipe