Commit f9d8091 · committed by markhenry · verified · 1 parent: 274f995

Upload README.md with huggingface_hub

  1. README.md +48 -0
README.md ADDED
@@ -0,0 +1,48 @@
+ ---
+ tags:
+ - pytorch
+ - language-model
+ - gpt
+ license: mit
+ ---
+
+ # vanilla-10b
+
+ Vanilla GPT baseline trained to compare against [aemack-org/cayley-10b](https://huggingface.co/aemack-org/cayley-10b).
+
+ ## Architecture
+
+ | Parameter | Value |
+ |-----------|-------|
+ | n_layer | 12 |
+ | n_head | 8 |
+ | n_embd | 1024 |
+ | block_size | 1024 |
+ | vocab_size | 50304 |
+ | bias | False |
+ | norm | RMSNorm (affine) |
+ | MLP | GELU, 4x expansion |
+ | tokenizer | GPT-2 (tiktoken) |
+ | dtype | bfloat16 |
+ | sparsity | none (vanilla) |
+
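As a rough check on model size, the Architecture table can be turned into a back-of-the-envelope parameter count. This is a sketch under assumptions the README does not state: a GPT-2-style block (Q, K, V and output projections plus a two-matrix MLP), no learned position-embedding table counted, input and output embeddings tied by default, and one gain vector per affine RMSNorm. `GPTConfig` and `estimate_params` are hypothetical names used for illustration, not part of the released code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GPTConfig:
    # Values copied from the Architecture table.
    n_layer: int = 12
    n_head: int = 8
    n_embd: int = 1024
    block_size: int = 1024
    vocab_size: int = 50304
    mlp_ratio: int = 4  # "4x expansion"


def estimate_params(cfg: GPTConfig, tied_embeddings: bool = True) -> dict:
    """Rough parameter count for a bias-free GPT-style stack (assumed layout)."""
    d = cfg.n_embd
    attn = 4 * d * d                 # Q, K, V and output projections
    mlp = 2 * cfg.mlp_ratio * d * d  # up-projection and down-projection
    norms = 2 * d                    # two affine RMSNorm gains per block (assumption)
    per_block = attn + mlp + norms

    embedding = cfg.vocab_size * d
    unembedding = 0 if tied_embeddings else embedding
    final_norm = d                   # assumed final RMSNorm before the output head

    total = cfg.n_layer * per_block + embedding + unembedding + final_norm
    return {
        "head_dim": d // cfg.n_head,
        "per_block": per_block,
        "embedding": embedding,
        "total": total,
    }


if __name__ == "__main__":
    for tied in (True, False):
        stats = estimate_params(GPTConfig(), tied_embeddings=tied)
        print(f"tied={tied}: {stats['total'] / 1e6:.1f}M parameters, head_dim={stats['head_dim']}")
```

Under these assumptions the estimate comes to roughly 203M parameters with tied embeddings (about 254M untied) and a head dimension of 128. The actual checkpoint may differ if it uses a different attention or embedding layout.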
+ ## Training
+
+ | Parameter | Value |
+ |-----------|-------|
+ | optimizer | Muon (hidden 2D) + AdamW (embeddings) |
+ | muon_lr | 0.006 |
+ | adamw_lr | 0.006 |
+ | lr_schedule | linear_warmdown (warmdown_frac=0.2) |
+ | batch_size | 80 |
+ | seq_len | 1024 |
+ | max_iters | 16000 |
+ | tokens seen | ~1.3B |
+ | dataset | FineWeb-Edu-10B |
+ | best_val_loss | 6.2834 |
+
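The token count and schedule rows in the Training table can be checked with a few lines of arithmetic. The sketch below assumes that `batch_size` counts sequences per optimizer step (no gradient accumulation or multi-device scaling), and that `linear_warmdown` holds the learning rate constant and then decays it linearly to zero over the final `warmdown_frac` of training. Neither assumption is confirmed by the README, and `tokens_seen` and `lr_at` are hypothetical helpers.

```python
# Values copied from the Training table.
BATCH_SIZE = 80
SEQ_LEN = 1024
MAX_ITERS = 16_000
BASE_LR = 0.006
WARMDOWN_FRAC = 0.2


def tokens_seen(batch_size: int = BATCH_SIZE,
                seq_len: int = SEQ_LEN,
                iters: int = MAX_ITERS) -> int:
    """Tokens processed, assuming one batch of full-length sequences per step."""
    return batch_size * seq_len * iters


def lr_at(step: int,
          base_lr: float = BASE_LR,
          max_iters: int = MAX_ITERS,
          warmdown_frac: float = WARMDOWN_FRAC) -> float:
    """Assumed schedule: constant, then linear decay to zero over the last fraction."""
    warmdown_iters = round(max_iters * warmdown_frac)
    warmdown_start = max_iters - warmdown_iters
    if warmdown_iters == 0 or step < warmdown_start:
        return base_lr
    remaining = max(max_iters - step, 0)
    return base_lr * remaining / warmdown_iters


if __name__ == "__main__":
    print(f"tokens seen: {tokens_seen():,}")
    for step in (0, 12_800, 14_400, 16_000):
        print(f"step {step:>6}: lr = {lr_at(step):.4f}")
```

With these assumptions the product 80 × 1024 × 16000 is 1,310,720,000 tokens, consistent with the ~1.3B in the table, and the decay would begin at step 12,800.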
+ ## Purpose
+
+ Interpretability comparison baseline. Trained with identical hyperparameters to
+ `cayley-10b` but without the CayleySAE bottleneck at `mlp_in`. Enables direct
+ comparison of residual stream representations.