datasysdev commited on
Commit
8f87cd2
·
verified ·
1 Parent(s): e8a62da

Upload README.md with huggingface_hub

Browse files
Files changed (1) hide show
  1. README.md +45 -2
README.md CHANGED
@@ -43,8 +43,51 @@ quality is preserved under ANN substitution. Recall plateaus around step 1000
43
  because the softmax-relevant keys concentrate in the top ~30; disagreement
44
  on positions 30-128 is on near-zero-weight tail and doesn't affect output.
45
 
46
- A K-retrieve Pareto sweep follows below; a 34-layer headline run on 8K context
47
- extends the deployment story.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
48
 
49
  ## Files
50
 
 
43
  because the softmax-relevant keys concentrate in the top ~30; disagreement
44
  on positions 30-128 is on near-zero-weight tail and doesn't affect output.
45
 
46
+ ### K-retrieve Pareto (pilot step 2000, FAISS HNSW)
47
+
48
+ `PPL_full = 9.958`
49
+
50
+ | K | Recall@K | PPL_ANN | PPL gap |
51
+ |---|---|---|---|
52
+ | 16 | 24.9% | 10.71 | +7.51% |
53
+ | 32 | 22.8% | 10.41 | +4.51% |
54
+ | 64 | 23.1% | 10.20 | +2.42% |
55
+ | 128 | 26.0% | 10.04 | +0.82% |
56
+ | 256 | 31.6% | 9.88 | **−0.79%** |
57
+ | 512 | 40.8% | 9.67 | **−2.89%** |
58
+
59
+ **ANN at K ≥ 256 produces lower perplexity than full attention** — the
60
+ sparse-attention denoising effect. Full softmax is forced to spread small
61
+ amounts of weight over a long tail of irrelevant keys; truncating to top-K
62
+ and renormalizing puts the weight where it matters. The smooth monotonic
63
+ trend (no discontinuous jumps) is consistent with this explanation, and the
64
+ sanity checks (same input sequences for `ppl_full` vs `ppl_ann`, intact
65
+ causal mask in retrieval, single-softmax renormalization with no wrapper
66
+ leakage between iterations) confirm the result is real.
67
+
68
+ Note: the K-sweep recall numbers (24–41%) are not directly comparable to the
69
+ in-training `evaluate()` recall (50.9% at K=128). Same checkpoint, same K,
70
+ same metric code path — the discrepancy comes from sampling different
71
+ sequences out of the WikiText streaming split (different `num_batches` /
72
+ worker dispatch). The PPL gap is independent of which subset is sampled
73
+ and is the load-bearing deployment metric.
74
+
75
+ ### Per-layer recall (pilot)
76
+
77
+ | Layer | Recall@K=128 | Recall@K=512 |
78
+ |---|---|---|
79
+ | 4 | 15.8% | 34.7% |
80
+ | 8 | 22.2% | 38.7% |
81
+ | 12 | 23.4% | 39.1% |
82
+ | 16 | 31.9% | 45.2% |
83
+ | 20 | 31.4% | 42.6% |
84
+ | 24 | 31.1% | 44.4% |
85
+
86
+ Early layers are harder for content-addressable retrieval — their attention
87
+ is more local/positional than semantic. Consistent across K, so it's a
88
+ property of the layer rather than noise.
89
+
90
+ A 34-layer headline run on 8K context follows.
91
 
92
  ## Files
93