datasysdev committed on
Commit b42f744 · verified · 1 Parent(s): c740ca3

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +44 -17
README.md CHANGED
@@ -70,22 +70,31 @@ Sanity checks pass: same input sequences for `ppl_full` vs `ppl_ann`,
 intact causal mask in retrieval, single-softmax renormalization with no
 wrapper leakage between iterations.

- ### Deployment knobs (L = 4096)
-
- | Use case | K | PPL gap | Attention compute reduction |
- |---|---|---|---|
- | Quality-improving | 256 | −0.79% | ~16× |
- | Quality-improving | 512 | −2.89% | ~8× |
- | Quality-preserving | 128 | +0.82% | ~32× |
- | Aggressive | 64 | +2.42% | ~64× |
- | Speed-only | 32 | +4.51% | ~128× |
-
- Note: the K-sweep recall numbers (24–41%) are not directly comparable to the
- in-training `evaluate()` recall (50.9% at K=128). Same checkpoint, same K,
- same metric code path; the discrepancy comes from sampling different
- sequences out of the WikiText streaming split (different `num_batches` /
- worker dispatch). The PPL gap is independent of which subset is sampled
- and is the load-bearing deployment metric.

  ### Per-layer recall (pilot)
 
@@ -102,7 +111,24 @@ Early layers are harder for content-addressable retrieval — their attention
 is more local/positional than semantic. Consistent across K, so it's a
 property of the layer rather than noise.

- A 34-layer headline run on 8K context follows.

 ## Files

@@ -144,7 +170,8 @@ the trained layers and run with FAISS HNSW retrieval at inference time.
 ## Training recipe

 - Frozen base: Qwen3-4B-Instruct-2507 (36 layers, hidden 2560, GQA 32:8).
- - Data: WikiText-103 raw, packed to 4K-token sequences.
 - 2000 steps, batch 8, lr 1e-4 (cosine, 100-step warmup), AdamW.
 - `α=β=1` (contrastive + KL distillation, both layers averaged).
 - bf16 weights, fp32 loss math.

 intact causal mask in retrieval, single-softmax renormalization with no
 wrapper leakage between iterations.

+ ### Compute / quality knobs (FLOP-counted)
+
+ `L = 4096`. The compute reduction applies to the attention scoring step and
+ is `L / K`. These are FLOP estimates, not measured wall-clock — the FAISS
+ path in this repo is a research prototype that does CPU index builds and
+ GPU↔CPU transfers, so it is not the right thing to time.
+
+ | K | PPL gap (ANN vs. full) | Attention scoring reduction |
+ |---|---|---|
+ | 512 | −2.89% | ~8× |
+ | 256 | −0.79% | ~16× |
+ | 128 | +0.82% | ~32× |
+ | 64 | +2.42% | ~64× |
+ | 32 | +4.51% | ~128× |
+ | 16 | +7.51% | ~256× |
+
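+ As a back-of-the-envelope check of the `L / K` arithmetic behind the
+ reduction column, a minimal sketch (the FLOP model is an assumption: it
+ counts only query–key scoring and ignores value aggregation and search
+ cost; the `ppl_gap` helper is illustrative, not the repo's API):
+
+ ```python
+ # Sketch of the table's arithmetic under the stated assumptions.
+ L = 4096
+
+ def ppl_gap(ppl_ann: float, ppl_full: float) -> float:
+     """Relative PPL change of ANN vs. full attention, in percent.
+     Negative means the ANN run had lower perplexity on this eval slice."""
+     return 100.0 * (ppl_ann - ppl_full) / ppl_full
+
+ for K in (512, 256, 128, 64, 32, 16):
+     # Scoring cost per query is proportional to keys scored: L vs. K.
+     print(f"K={K:4d}  scoring reduction ≈ {L / K:.0f}×")
+ ```
+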
+ Eval scope: 12 sequences × 4K tokens of WikiText-103 validation (~50K
+ tokens). Read these as "what we observed on this slice", not
+ population-level estimates.
+
+ The K-sweep recall numbers (24–41%) and the in-training `evaluate()` recall
+ (50.9% at K=128) come from different sampled subsets of the streaming split
+ and shouldn't be directly compared. The repo also reports `mass@K` (sum of
+ teacher attention probability captured by the search top-K) — that's the
+ more direct retrieval-quality metric when softmax is sharp.
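+
+ A minimal sketch of how `mass@K` can be computed from a teacher attention
+ row and the key positions returned by the search (the function name and
+ tensor shapes are illustrative, not the repo's API):
+
+ ```python
+ import torch
+
+ def mass_at_k(teacher_probs: torch.Tensor, retrieved_idx: torch.Tensor) -> torch.Tensor:
+     """Teacher attention probability mass captured by the retrieved top-K.
+
+     teacher_probs: (num_queries, L) softmax rows from full attention.
+     retrieved_idx: (num_queries, K) key positions returned by the search.
+     Returns a per-query value in [0, 1]; 1.0 means the search kept every
+     key the teacher actually attends to.
+     """
+     captured = teacher_probs.gather(-1, retrieved_idx)  # (num_queries, K)
+     return captured.sum(dim=-1)
+
+ # Toy example with a sharp softmax: two retrieved keys capture 90% of the mass.
+ probs = torch.tensor([[0.70, 0.20, 0.05, 0.03, 0.02]])
+ idx = torch.tensor([[0, 1]])  # K = 2
+ print(mass_at_k(probs, idx))  # tensor([0.9000])
+ ```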

 ### Per-layer recall (pilot)

 is more local/positional than semantic. Consistent across K, so it's a
 property of the layer rather than noise.

+ ### Caveats / what's next
+
+ - **Packing**: pilot training and eval ran with sequence packing on (no
+ segment-level causal mask, since transformers' default forward doesn't
+ build one; see the sketch after this list). The relative PPL gap between
+ full and ANN is internally consistent under this confound, but the
+ negative gap at K≥256 has at least three candidate explanations we
+ haven't disentangled — (a) sparse-softmax denoising, (b) ANN happening to
+ filter cross-document keys that full attention attends to, (c) sample
+ noise on a small eval. The default config now has packing off so the next
+ run isolates (a).
+ - **Exact-topK oracle**: a four-way Pareto (full vs. exact top-K vs.
+ search-topK exact vs. search-ANN) is the natural follow-up to separate
+ "denoising from any sparsity" from "denoising from learned projections."
+ - **Wall-clock**: not measured. The FAISS path in the repo is a CPU-side
+ research prototype, not a deployable runtime. A GPU-resident topk kernel
+ is the next engineering step.
+ - **34-layer headline**: queued (`make_headline_config()` is wired); its
+ checkpoints will be mirrored here when it runs.
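+
+ For concreteness, a minimal sketch of the segment-level (block-diagonal
+ causal) mask that packing would need, built from per-token document ids.
+ The function name and shapes are illustrative, not the repo's API:
+
+ ```python
+ import torch
+
+ def segment_causal_mask(doc_ids: torch.Tensor) -> torch.Tensor:
+     """Boolean (L, L) mask: True where query i may attend to key j.
+
+     doc_ids: (L,) document id per packed token. Each position attends only
+     to earlier positions (causal) from the same document, which is what a
+     segment-level mask adds on top of plain causality.
+     """
+     L = doc_ids.shape[0]
+     causal = torch.tril(torch.ones(L, L, dtype=torch.bool))
+     same_doc = doc_ids.unsqueeze(0) == doc_ids.unsqueeze(1)
+     return causal & same_doc
+
+ # Toy packed sequence: two documents of lengths 3 and 2.
+ print(segment_causal_mask(torch.tensor([0, 0, 0, 1, 1])).int())
+ ```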

 ## Files

 ## Training recipe

 - Frozen base: Qwen3-4B-Instruct-2507 (36 layers, hidden 2560, GQA 32:8).
+ - Data: WikiText-103 raw, 4K-token sequences (packing was on at training
+ time; default in the repo is now off — see Caveats).
 - 2000 steps, batch 8, lr 1e-4 (cosine, 100-step warmup), AdamW.
 - `α=β=1` (contrastive + KL distillation, both layers averaged).
 - bf16 weights, fp32 loss math.
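+
+ A minimal sketch of the optimizer/schedule part of this recipe (the variable
+ names are illustrative and the parameter group is a stand-in; the repo's
+ trainer code may differ):
+
+ ```python
+ import math
+ import torch
+ from torch.optim import AdamW
+ from torch.optim.lr_scheduler import LambdaLR
+
+ total_steps, warmup_steps, base_lr = 2000, 100, 1e-4
+
+ params = [torch.nn.Parameter(torch.zeros(1))]  # stand-in for the trained layers
+ opt = AdamW(params, lr=base_lr)
+
+ def lr_lambda(step: int) -> float:
+     # 100-step linear warmup, then cosine decay over the remaining steps.
+     if step < warmup_steps:
+         return step / max(1, warmup_steps)
+     progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
+     return 0.5 * (1.0 + math.cos(math.pi * progress))
+
+ sched = LambdaLR(opt, lr_lambda)
+ # Per step: loss = contrastive + kl (α = β = 1), loss math in fp32 over
+ # bf16 weights; then opt.step(); sched.step().
+ ```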