Add per-parameter pruning rows + joint pruning result. R1 remains leading Pareto point.
README.md CHANGED
@@ -170,8 +170,14 @@ Cofiber Threshold variants trained on full COCO 2017 train (117,266 images). Fro

  | box32 pruned R2 | 768→32→4 | 91,640 | ~62,000 | **5.9** | 20.4 | **1.5** |
  | box32 pruned R3 | 768→32→4 | 91,640 | ~47,000 | 5.1 | 17.1 | 1.4 |
  | **dim20** | **768→20→80 cls, 20→16→4 reg** | **22,076** | **22,076** | **3.9** | **14.8** | 0.9 |
- | **dim20
- | **dim20
  | **dim15** | **768→15→80 cls, 15→16→4 reg** | **17,751** | **17,751** | **3.0** | **11.5** | 0.7 |
  | **dim10** | **768→10→80 cls, 10→16→4 reg** | **13,426** | **13,426** | **1.5** | **5.6** | 0.4 |
  | **dim5** | **768→5→80 cls, 5→16→4 reg** | **9,101** | **9,101** | **0.3** | **1.3** | 0.1 |
@@ -180,7 +186,17 @@ Pruning improved mAP from 5.7 to 5.9 at R2 (~62K nonzero) by removing noisy prot

  The dim15, dim10, and dim5 variants push the bottleneck further with the same SVD-initialization recipe applied to the top 15, top 10, and top 5 directions of the pruned R2 prototype matrix. Dim15 (17,751 parameters, 67% SVD energy retention) reaches 3.0 mAP. Dim10 (13,426 parameters, 61% energy retention) reaches 1.5 mAP, the smallest 80-class COCO detector to clear the 1.0 mAP threshold. Dim5 (9,101 parameters, 53% energy retention) drops to 0.3 mAP. The mAP scaling across dim20 → dim15 → dim10 is roughly geometric (3.9 → 3.0 → 1.5), but breaks sharply between 10 and 5 dimensions, where the curve falls off a cliff. Five directions sit below the intrinsic capacity needed for 80-class separation; the floor lies between 5 and 10 bottleneck dimensions, and finer probes at dim7/dim8 would localize the exact cliff.
- Dim20 was then itself put through magnitude pruning. The
#### Training recipe

  | box32 pruned R2 | 768→32→4 | 91,640 | ~62,000 | **5.9** | 20.4 | **1.5** |
  | box32 pruned R3 | 768→32→4 | 91,640 | ~47,000 | 5.1 | 17.1 | 1.4 |
  | **dim20** | **768→20→80 cls, 20→16→4 reg** | **22,076** | **22,076** | **3.9** | **14.8** | 0.9 |
+ | **dim20 R1** (project 25.7% sparse) | 768→20→80 cls | 22,076 | 18,121 | **3.9** | 14.6 | 0.8 |
+ | **dim20 R2** (project 26.6% sparse) | 768→20→80 cls | 22,076 | 17,988 | 3.8 | 14.5 | 0.7 |
+ | dim20 cls_weight pruned (37%) | 596 of 1600 cls weights zeroed | 22,076 | 21,480 | 3.8 | 14.4 | 0.7 |
+ | dim20 reg_hidden pruned (17%) | 55 of 320 reg weights zeroed | 22,076 | 22,021 | 3.8 | 14.5 | 0.7 |
+ | dim20 reg_out pruned (12%) | 8 of 64 reg weights zeroed | 22,076 | 22,068 | 3.8 | 14.5 | 0.7 |
+ | dim20 ctr_weight pruned (90%) | 18 of 20 ctr weights zeroed | 22,076 | 22,058 | 3.7 | 14.2 | 0.7 |
+ | dim20 R1 + cls greedy | project 25.7% + cls 45% sparse | 22,076 | 17,406 | 3.5 | 13.4 | 0.6 |
+ | dim20 joint (from R1) | whole-head magnitude pruning | 22,076 | 17,129 | 3.6 | 13.7 | 0.6 |
  | **dim15** | **768→15→80 cls, 15→16→4 reg** | **17,751** | **17,751** | **3.0** | **11.5** | 0.7 |
  | **dim10** | **768→10→80 cls, 10→16→4 reg** | **13,426** | **13,426** | **1.5** | **5.6** | 0.4 |
  | **dim5** | **768→5→80 cls, 5→16→4 reg** | **9,101** | **9,101** | **0.3** | **1.3** | 0.1 |
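The dim-series parameter totals scale linearly in the bottleneck width, which makes them easy to sanity-check: each bottleneck dimension carries 768 project weights, 80 cls weights, 16 reg-hidden weights, and 1 ctr weight (865 per dimension) on top of a dim-independent remainder. A minimal check against the table's totals (the 4,776-weight remainder, covering reg_out, biases, and everything else that does not scale with the bottleneck, is inferred from the table, not stated in the README):

```python
# Per-bottleneck-dimension weights, read off the dim20 parameter names:
# project 768*d, cls_weight d*80, reg_hidden d*16, ctr_weight d*1.
PER_DIM = 768 + 80 + 16 + 1          # 865 weights per bottleneck dim
FIXED = 22_076 - 20 * PER_DIM        # dim-independent remainder: 4,776

def head_params(d: int) -> int:
    """Total head parameters for bottleneck width d."""
    return PER_DIM * d + FIXED

# Reproduces the table's parameter column exactly.
for d, expected in [(20, 22_076), (15, 17_751), (10, 13_426), (5, 9_101)]:
    assert head_params(d) == expected
print(PER_DIM, FIXED)                # 865 4776
```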

  The dim15, dim10, and dim5 variants push the bottleneck further with the same SVD-initialization recipe applied to the top 15, top 10, and top 5 directions of the pruned R2 prototype matrix. Dim15 (17,751 parameters, 67% SVD energy retention) reaches 3.0 mAP. Dim10 (13,426 parameters, 61% energy retention) reaches 1.5 mAP, the smallest 80-class COCO detector to clear the 1.0 mAP threshold. Dim5 (9,101 parameters, 53% energy retention) drops to 0.3 mAP. The mAP scaling across dim20 → dim15 → dim10 is roughly geometric (3.9 → 3.0 → 1.5), but breaks sharply between 10 and 5 dimensions, where the curve falls off a cliff. Five directions sit below the intrinsic capacity needed for 80-class separation; the floor lies between 5 and 10 bottleneck dimensions, and finer probes at dim7/dim8 would localize the exact cliff.
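The SVD-initialization recipe can be sketched as a rank-k truncation of the prototype matrix: keep the top-k singular directions, and read energy retention as the kept fraction of squared singular-value mass. The function names and the toy matrix below are illustrative, not the repo's actual code:

```python
import numpy as np

def svd_energy_retention(W: np.ndarray, k: int) -> float:
    """Fraction of squared singular-value mass kept by the top-k directions."""
    s = np.linalg.svd(W, compute_uv=False)
    return float((s[:k] ** 2).sum() / (s ** 2).sum())

def svd_init(W: np.ndarray, k: int):
    """Split W (out x in) into a k-dim bottleneck pair initialized from the
    top-k singular directions, so up @ down is the best rank-k approximation."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    down = np.sqrt(s[:k])[:, None] * Vt[:k]   # in -> k projection
    up = U[:, :k] * np.sqrt(s[:k])            # k -> out expansion
    return down, up

# Toy stand-in for the pruned R2 prototype matrix (80 classes x 768 features).
W = np.random.default_rng(0).standard_normal((80, 768))
down, up = svd_init(W, 10)
print(down.shape, up.shape)                   # (10, 768) (80, 10)
print(round(svd_energy_retention(W, 10), 2))
```

Retention falls as k shrinks, which is the 67% → 61% → 53% progression quoted for dim15/dim10/dim5.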
+ Dim20 was then itself put through magnitude pruning. The mAP-driven pruner bisects over the magnitude-sorted weight list of a chosen parameter, uses full pycocotools mAP@[0.5:0.95] as the retention metric (1000 val images), and rolls back any pass that fails the 95% retention floor on full verification. It was run separately on each learned parameter of dim20, and also in a joint-magnitude variant that ranks every weight in the head against every other.
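One pass of that pruner can be sketched roughly as below; `eval_map` and `full_eval_map` stand in for the 1000-image pycocotools proxy and the full verification eval, and all names are illustrative assumptions rather than the repo's actual interface:

```python
import numpy as np

def prune_param_pass(weights, eval_map, full_eval_map, floor=0.95):
    """One magnitude-pruning pass over a single parameter tensor.

    Bisects on the prune count k over the magnitude-sorted weight list:
    zeroing the k smallest-magnitude weights is accepted while the proxy
    mAP stays above the retention floor.  The winning mask is then checked
    against the full eval, and the whole pass rolls back if it fails there.
    """
    order = np.argsort(np.abs(weights), axis=None)     # smallest first
    proxy_base = eval_map(weights)                     # 1000-image proxy
    full_base = full_eval_map(weights)                 # full verification

    def masked(k):
        flat = weights.copy().ravel()
        flat[order[:k]] = 0.0
        return flat.reshape(weights.shape)

    lo, hi, best = 0, weights.size, 0
    while lo <= hi:                                    # bisect on k
        k = (lo + hi) // 2
        if eval_map(masked(k)) >= floor * proxy_base:
            best, lo = k, k + 1                        # at least k is safe
        else:
            hi = k - 1

    pruned = masked(best)
    if full_eval_map(pruned) < floor * full_base:      # failed verification
        return weights, 0                              # roll the pass back
    return pruned, best
```

Running this once per parameter would give the per-parameter rows above; the joint variant sorts all head weights together instead of one tensor at a time.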
+ The leading Pareto point is **R1 (project layer 25.7% sparse, 18,121 nonzero, 3.9 mAP)**: the same mAP as unpruned dim20 with 18% fewer effective parameters, and the highest mAP-per-10K-parameter ratio in the table at 2.15. R2 pushes to 26.6% project sparsity (17,988 nonzero) at a small mAP cost (3.8). Per-parameter slack measurements:
+ - `project.weight`: 26.6% prunable (the sparsity that produced R1/R2)
+ - `cls_weight`: 37% prunable in isolation, 3.8 mAP at 21,480 nonzero
+ - `reg_hidden.weight`: 17% prunable, 3.8 mAP at 22,021 nonzero
+ - `reg_out.weight`: 12% prunable, 3.8 mAP at 22,068 nonzero
+ - `ctr_weight`: **90% prunable** (only 2 of 20 centerness weights load-bear), 3.7 mAP at 22,058 nonzero
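The zeroed-weight counts, quoted sparsities, and nonzero totals in these rows are mutually consistent; a quick arithmetic check (tensor sizes read off the table, and the 768×20 project size inferred from the architecture column):

```python
TOTAL = 22_076                      # unpruned dim20 head parameters
PROJECT = 768 * 20                  # project-layer weights (15,360)

# R1/R2: sparsity is quoted over the project layer only.
for tag, nonzero, quoted in [("R1", 18_121, 25.7), ("R2", 17_988, 26.6)]:
    assert round(100 * (TOTAL - nonzero) / PROJECT, 1) == quoted

# Per-parameter rows: (tensor size, weights zeroed, quoted sparsity %, nonzero).
rows = [
    ("cls_weight",        1600, 596, 37, 21_480),
    ("reg_hidden.weight",  320,  55, 17, 22_021),
    ("reg_out.weight",      64,   8, 12, 22_068),
    ("ctr_weight",          20,  18, 90, 22_058),
]
for name, size, zeroed, pct, nonzero in rows:
    assert round(100 * zeroed / size) == pct
    assert TOTAL - zeroed == nonzero
print("all rows consistent")
```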
+ Greedy stacking of cls_weight pruning on top of R1 reaches 17,406 nonzero but drops to 3.5 mAP, revealing an interaction between the parameters: the cls_weight slack was measured against unpruned dim20, and part of it comes from weights that compensate for the surviving project subspace, so removing them after pruning project costs more mAP than the per-parameter measurement suggested. Joint magnitude pruning across all 22K head weights (starting from R1) finds 17,129 nonzero at 3.6 mAP, the smallest dim20 found, but it does not Pareto-dominate R1: the bisection's 1000-image mAP proxy was systematically optimistic relative to the full 5000-image eval, so the 95% retention floor measured during pruning accepted more aggressive cuts than the full eval would have. R1 remains the leading point of the dim20 pruning Pareto.
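The Pareto structure is checkable mechanically from the table's (nonzero, mAP) pairs; a small dominance filter over the quoted numbers:

```python
# (variant, nonzero parameters, mAP) for the dim20 pruning points
points = [
    ("dim20 unpruned",  22_076, 3.9),
    ("R1",              18_121, 3.9),
    ("R2",              17_988, 3.8),
    ("R1 + cls greedy", 17_406, 3.5),
    ("joint",           17_129, 3.6),
]

def dominates(a, b):
    """a dominates b: no more parameters, no less mAP, strictly better in one."""
    return a[1] <= b[1] and a[2] >= b[2] and (a[1] < b[1] or a[2] > b[2])

front = [p for p in points if not any(dominates(q, p) for q in points)]
print([name for name, *_ in front])   # ['R1', 'R2', 'joint']
```

R1 survives as the highest-mAP front point; the unpruned head and the greedy stack are dominated, matching the text.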

#### Training recipe