Upload README.md with huggingface_hub
Browse files
README.md
CHANGED
|
@@ -43,8 +43,51 @@ quality is preserved under ANN substitution. Recall plateaus around step 1000
|
|
| 43 |
because the softmax-relevant keys concentrate in the top ~30; disagreement
|
| 44 |
on positions 30-128 is on near-zero-weight tail and doesn't affect output.
|
| 45 |
|
| 46 |
-
|
| 47 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 48 |
|
| 49 |
## Files
|
| 50 |
|
|
|
|
| 43 |
because the softmax-relevant keys concentrate in the top ~30; disagreement
|
| 44 |
on positions 30-128 is on near-zero-weight tail and doesn't affect output.
|
| 45 |
|
| 46 |
+
### K-retrieve Pareto (pilot step 2000, FAISS HNSW)
|
| 47 |
+
|
| 48 |
+
`PPL_full = 9.958`
|
| 49 |
+
|
| 50 |
+
| K | Recall@K | PPL_ANN | PPL gap |
|
| 51 |
+
|---|---|---|---|
|
| 52 |
+
| 16 | 24.9% | 10.71 | +7.51% |
|
| 53 |
+
| 32 | 22.8% | 10.41 | +4.51% |
|
| 54 |
+
| 64 | 23.1% | 10.20 | +2.42% |
|
| 55 |
+
| 128 | 26.0% | 10.04 | +0.82% |
|
| 56 |
+
| 256 | 31.6% | 9.88 | **−0.79%** |
|
| 57 |
+
| 512 | 40.8% | 9.67 | **−2.89%** |
|
| 58 |
+
|
| 59 |
+
**ANN at K ≥ 256 produces lower perplexity than full attention** — the
|
| 60 |
+
sparse-attention denoising effect. Full softmax is forced to spread small
|
| 61 |
+
amounts of weight over a long tail of irrelevant keys; truncating to top-K
|
| 62 |
+
and renormalizing puts the weight where it matters. The smooth monotonic
|
| 63 |
+
trend (no discontinuous jumps) is consistent with this explanation, and the
|
| 64 |
+
sanity checks (same input sequences for `ppl_full` vs `ppl_ann`, intact
|
| 65 |
+
causal mask in retrieval, single-softmax renormalization with no wrapper
|
| 66 |
+
leakage between iterations) confirm the result is real.
|
| 67 |
+
|
| 68 |
+
Note: the K-sweep recall numbers (24–41%) are not directly comparable to the
|
| 69 |
+
in-training `evaluate()` recall (50.9% at K=128). Same checkpoint, same K,
|
| 70 |
+
same metric code path — the discrepancy comes from sampling different
|
| 71 |
+
sequences out of the WikiText streaming split (different `num_batches` /
|
| 72 |
+
worker dispatch). The PPL gap is independent of which subset is sampled
|
| 73 |
+
and is the load-bearing deployment metric.
|
| 74 |
+
|
| 75 |
+
### Per-layer recall (pilot)
|
| 76 |
+
|
| 77 |
+
| Layer | Recall@K=128 | Recall@K=512 |
|
| 78 |
+
|---|---|---|
|
| 79 |
+
| 4 | 15.8% | 34.7% |
|
| 80 |
+
| 8 | 22.2% | 38.7% |
|
| 81 |
+
| 12 | 23.4% | 39.1% |
|
| 82 |
+
| 16 | 31.9% | 45.2% |
|
| 83 |
+
| 20 | 31.4% | 42.6% |
|
| 84 |
+
| 24 | 31.1% | 44.4% |
|
| 85 |
+
|
| 86 |
+
Early layers are harder for content-addressable retrieval — their attention
|
| 87 |
+
is more local/positional than semantic. Consistent across K, so it's a
|
| 88 |
+
property of the layer rather than noise.
|
| 89 |
+
|
| 90 |
+
A 34-layer headline run on 8K context follows.
|
| 91 |
|
| 92 |
## Files
|
| 93 |
|