guygrigsby committed on
Commit d064e52 · verified · 1 Parent(s): eda3b84

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +9 -9
README.md CHANGED
@@ -24,26 +24,26 @@ Code, full writeup, and methodology: **[github.com/guygrigsby/diff-mlx](https://
  | `diff/latest.safetensors` | Differential Attention | 162M params, 2.0B tokens, seed 0 |
  | `vanilla/latest.safetensors` | Vanilla MHA baseline | 162M params, 2.0B tokens, seed 0 |

- Each variant folder also includes its `config.json` and training `metrics.jsonl`. The two models share a **byte-identical paired init** and identical data order, so their difference isolates the attention variant.
+ Each variant folder also has its `config.json` and training `metrics.jsonl`. The two models share a **byte-identical paired init** and identical data order, so the difference between them isolates the attention variant.

  ## Model

- - Pre-norm LLaMA-style transformer: dim 768, 12 layers, RoPE (interleaved), SwiGLU, RMSNorm, tied embeddings, vocab 100277 (cl100k_base).
+ - Pre-norm LLaMA-style transformer: dim 768, 12 layers, interleaved RoPE, SwiGLU, RMSNorm, tied embeddings, vocab 100277 (cl100k_base).
  - Context length 2048. bf16 mixed precision.
- - Trained on a FineWeb-Edu sample, 2.0B tokens, effective batch 32, peak LR 4e-4, 1000-step warmup, single M5 Max.
+ - Trained on a FineWeb-Edu sample, 2.0B tokens, effective batch 32, peak LR 4e-4, 1000-step warmup, on one M5 Max.

- ## Headline result (the interesting part)
+ ## The headline (the interesting part)

- On held-out validation, **vanilla edges out diff** at this scale, despite diff winning on train loss:
+ On held-out validation, **vanilla edges out diff** at this scale, even though diff wins on train loss:

  | metric | diff | vanilla | δ (diff − vanilla) |
  |---|---|---|---|
- | Final train loss (last 1000-step mean) | 3.0414 | 3.1526 | −0.111 (diff lower) |
- | Held-out val (75M tok) @ step 30000 | 3.3616 | 3.3265 | +0.035 (vanilla lower) |
+ | final train loss (last 1000-step mean) | 3.0414 | 3.1526 | −0.111 (diff lower) |
+ | held-out val (75M tok) @ step 30000 | 3.3616 | 3.3265 | +0.035 (vanilla lower) |

- Diff's train-loss advantage is memorization: its val loss *rose* over the final leg while train loss kept falling. A position-binned eval found vanilla uniformly better across the whole 2048-token window, with no widening of diff's deficit at later positions, so the architecture's signature long-context advantage did not appear here either.
+ Diff's train-loss lead is memorization: its val loss *rose* over the final leg while train loss kept falling. A position-binned eval put vanilla uniformly ahead across the whole 2048-token window, with no widening of diff's deficit at later positions, so the architecture's long-context edge didn't show up here either.

- This is **three orders of magnitude below** the paper's 3B-param / 1T-token setup, so it refutes nothing about the paper. It is an honest negative for this small-scale, short-context, single-seed regime. See the repo writeup for the full discussion.
+ This sits **three orders of magnitude below** the paper's 3B-param / 1T-token setup, so it refutes nothing about the paper. It's an honest negative for this small-scale, short-context, single-seed regime. Full discussion in the repo writeup.

  ## Loading

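The δ column and the scale comparison quoted in the README text can be re-derived with a few lines of arithmetic. The sketch below is illustrative only: the loss values and the 162M / 2.0B and 3B / 1T figures are copied from the README, while the names `METRICS`, `delta`, and `orders_of_magnitude` are hypothetical and not taken from the repo. Treating training compute as proportional to params × tokens is an assumption made here, not something the README states.

```python
import math

# Loss values copied from the README table, as (diff, vanilla) pairs.
METRICS = {
    "final train loss (last 1000-step mean)": (3.0414, 3.1526),
    "held-out val (75M tok) @ step 30000": (3.3616, 3.3265),
}


def delta(diff_loss: float, vanilla_loss: float) -> float:
    """Return diff minus vanilla; a negative value means diff is lower."""
    return diff_loss - vanilla_loss


def orders_of_magnitude(small: float, large: float) -> float:
    """Return log10(large / small), i.e. the gap in powers of ten."""
    return math.log10(large / small)


# Recompute the delta column, rounded to 4 decimals to hide float noise.
deltas = {name: round(delta(d, v), 4) for name, (d, v) in METRICS.items()}

# Scale gap between this run (162M params, 2.0B tokens) and the
# paper's setup as quoted in the README (3B params, 1T tokens).
param_gap = orders_of_magnitude(162e6, 3e9)
token_gap = orders_of_magnitude(2.0e9, 1e12)

# Assumption: compute scales roughly with params * tokens, so the
# log-gaps add.
compute_gap = param_gap + token_gap

for name, d in deltas.items():
    print(f"{name}: {d:+.4f}")
print(f"params gap:  {param_gap:.2f} orders of magnitude")
print(f"tokens gap:  {token_gap:.2f} orders of magnitude")
print(f"compute gap: {compute_gap:.2f} orders of magnitude")
```

The recomputed deltas agree with the table's rounded −0.111 and +0.035. Under the params × tokens assumption, the gap is about 2.7 orders of magnitude in tokens and close to 4 in compute, so the README's "three orders of magnitude" reads as a round figure between the two.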