---
license: mit
tags:
- scaling-laws
- hyperbolic-geometry
- language-model
- research
---

# HyperScale: Euclidean vs Hyperbolic Output Layer Scaling Laws

Scaling law experiments comparing **Euclidean** (standard dot-product) vs **Hyperbolic** (Lorentz model) output layers for Qwen3 language models on OpenWebText.

## Project

- **Repository**: [github.com/ObliviateRickLin/HyperScale](https://github.com/ObliviateRickLin/HyperScale)
- **Base model**: Qwen3 architecture (custom sizes, untied embeddings)
- **Dataset**: OpenWebText (8.39B tokens total)
- **Optimizer**: NanochatMuon (Muon for 2D transformer matrices + per-group AdamW)
- **Training**: DeepSpeed ZeRO-2, bf16 mixed precision, 4x H100 80GB

## Key Differences

| Component | Euclidean | Hyperbolic |
|---|---|---|
| Output layer | Standard linear (dot-product logits) | Lorentz hyperboloid (Minkowski inner product logits) |
| lm_head init | zeros | std=0.02 |
| embed_tokens init | std=1.0 | std=1.0 |
| tie_word_embeddings | false | false |
| Logit computation | `hidden @ lm_head.T` (fp32) | `<expmap(h), expmap(w)>_L * scale` (fp32, mean-centered) |
| logit_scale | N/A | `d_model / sinh(1)` (learnable) |

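The hyperbolic logit computation in the last two table rows can be sketched in NumPy. This is a minimal illustration assuming the exponential map at the origin of the Lorentz model with curvature -1; the function names are illustrative, not the repo's API.

```python
import numpy as np

def expmap0(v, eps=1e-8):
    """Lift Euclidean vectors onto the Lorentz hyperboloid via the exponential
    map at the origin: (cosh(||v||), sinh(||v||) * v / ||v||)."""
    norm = np.maximum(np.linalg.norm(v, axis=-1, keepdims=True), eps)
    return np.concatenate([np.cosh(norm), np.sinh(norm) * v / norm], axis=-1)

def lorentz_logits(hidden, weight, scale):
    """Mean-centered logits from the Minkowski inner product
    <x, y>_L = -x0*y0 + sum_i xi*yi of the lifted hidden states and rows
    of lm_head, scaled by the (learnable) logit_scale."""
    h = expmap0(hidden)   # (batch, d+1)
    w = expmap0(weight)   # (vocab, d+1)
    minkowski = -h[:, :1] @ w[:, :1].T + h[:, 1:] @ w[:, 1:].T
    logits = minkowski * scale
    return logits - logits.mean(axis=-1, keepdims=True)

d_model = 8
scale = d_model / np.sinh(1.0)  # the logit_scale init from the table
rng = np.random.default_rng(0)
hidden = rng.standard_normal((2, d_model)) * 0.02   # toy hidden states
weight = rng.standard_normal((16, d_model)) * 0.02  # toy lm_head (vocab=16)
print(lorentz_logits(hidden, weight, scale).shape)  # (2, 16)
```

Each lifted point satisfies `<x, x>_L = -(cosh^2 - sinh^2) = -1`, i.e. it lies on the unit hyperboloid.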
## Known Issues

**Missing c_proj zero init**: Neither the Euclidean nor the Hyperbolic models zero-initialize `self_attn.o_proj` and `mlp.down_proj` (the Qwen3 equivalents of Karpathy's c_proj). In the nanochat/HypGPT reference, these *are* zeroed alongside lm_head. This is a known confound, and it affects the Euclidean runs (lm_head=zeros) more than the Hyperbolic runs (lm_head std=0.02).

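The missing init could be applied as below. This is a sketch assuming a Hugging Face-style Qwen3 module tree; the helper name and the tiny stand-in modules are illustrative, not the repo's code.

```python
import torch
import torch.nn as nn

def zero_init_proj_layers(model: nn.Module) -> None:
    """Zero the residual-branch output projections (nanochat-style c_proj
    init): every Linear named `self_attn.o_proj` or `mlp.down_proj`."""
    for name, module in model.named_modules():
        if isinstance(module, nn.Linear) and (
            name.endswith("self_attn.o_proj") or name.endswith("mlp.down_proj")
        ):
            nn.init.zeros_(module.weight)
            if module.bias is not None:
                nn.init.zeros_(module.bias)

# Tiny stand-in module tree to demonstrate the traversal (not the real model).
class _Attn(nn.Module):
    def __init__(self):
        super().__init__()
        self.o_proj = nn.Linear(4, 4)

class _MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.down_proj = nn.Linear(4, 4)

class _Block(nn.Module):
    def __init__(self):
        super().__init__()
        self.self_attn = _Attn()
        self.mlp = _MLP()

block = _Block()
zero_init_proj_layers(block)
print(float(block.self_attn.o_proj.weight.abs().sum()))  # 0.0
```

Zeroing these projections makes every residual block start as the identity, which is the motivation behind the nanochat convention.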
## Results

Eval loss by model size and token budget. A token budget of `t1_N` means `1/N` of the full 8.39B-token dataset. **Delta < 0 means hyperbolic is better.**

| Size | Params | Tokens | Hyp loss | Euc loss | Delta |
|---|---|---|---|---|---|
| p020m | 20M | 65.6M (t1_128) | 9.9145 | 10.0184 | -0.1039 |
| p020m | 20M | 131M (t1_64) | 8.4028 | 8.4610 | -0.0582 |
| p020m | 20M | 262M (t1_32) | 7.1117 | 7.4237 | -0.3120 |
| p020m | 20M | 524M (t1_16) | 6.0795 | 6.4958 | -0.4163 |
| p047m | 47M | 65.6M (t1_128) | 8.5956 | 8.6146 | -0.0190 |
| p047m | 47M | 262M (t1_32) | 6.0944 | 6.3668 | -0.2724 |
| p047m | 47M | 524M (t1_16) | 5.5383 | 5.7709 | -0.2326 |
| p109m | 109M | 131M (t1_64) | 6.1340 | 6.3509 | -0.2169 |
| p109m | 109M | 262M (t1_32) | 5.5677 | 5.7453 | -0.1776 |
| p109m | 109M | 524M (t1_16) | 5.3841 | 5.5431 | -0.1590 |
| p223m | 223M | 131M (t1_64) | 5.8612 | 6.0407 | -0.1795 |
| p407m | 407M | 65.6M (t1_128) | 6.8377 | 7.1486 | -0.3109 |
| p407m | 407M | 262M (t1_32) | 4.5280 | 5.1510 | -0.6230 |
| p407m | 407M | 524M (t1_16) | 4.4119 | 4.5080 | -0.0961 |
| p407m | 407M | 1.05B (t1_8) | 3.4614 | 3.9945 | -0.5331 |
| p407m | 407M | 2.10B (t1_4) | 3.5738 | 3.6206 | -0.0468 |
| p407m | 407M | 4.20B (t1_2) | 3.3230 | 3.3503 | -0.0273 |
| p407m | 407M | 8.39B (t1_1) | 3.1236 | -- | -- |
| p686m | 686M | 65.6M (t1_128) | 11.9171* | 6.5685 | -- |
| p686m | 686M | 131M (t1_64) | 5.5675 | 5.7105 | -0.1430 |
| p686m | 686M | 262M (t1_32) | 4.9169 | 5.0321 | -0.1152 |
| p686m | 686M | 524M (t1_16) | 4.2684 | 4.3334 | -0.0650 |
| p686m | 686M | 1.05B (t1_8) | 3.7796 | 3.8389 | -0.0593 |
| p686m | 686M | 4.20B (t1_2) | 3.3219 | 3.2237 | +0.0982 |
| p1083m | 1.08B | 65.6M (t1_128) | 6.9223 | 7.9453 | -1.0230 |
| p1083m | 1.08B | 262M (t1_32) | 4.4833 | 7.0858* | -- |
| p1083m | 1.08B | 524M (t1_16) | 4.1417 | 4.2060 | -0.0643 |
| p1083m | 1.08B | 1.05B (t1_8) | 3.3901 | 3.7304 | -0.3403 |
| p1083m | 1.08B | 4.20B (t1_2) | 3.1223 | 3.2614 | -0.1391 |
| p1083m | 1.08B | 8.39B (t1_1) | 2.9269 | -- | -- |
| p1621m | 1.62B | all | NaN | 3.30-6.54 | -- |
| p2324m | 2.32B | 262M (t1_32) | 4.6533 | 4.7401 | -0.0868 |
| p2324m | 2.32B | 524M (t1_16) | 4.0000 | 4.0594 | -0.0594 |
| p2324m | 2.32B | 4.20B (t1_2) | 3.0077 | 3.2524 | -0.2447 |
| p2324m | 2.32B | 8.39B (t1_1) | 3.1524 | -- | -- |

\* Anomalous values due to training instability/divergence. The p1621m hyperbolic run diverged entirely (all NaN).

## Summary

- **Hyperbolic achieves lower eval loss** in nearly all matched comparisons
- Typical improvement: ~5-13% relative reduction in loss
- The gap is larger at medium token budgets (t1_32 to t1_8) and shrinks at high token budgets (t1_2, t1_1)
- **Caveat**: the lm_head init difference (zeros vs std=0.02) confounds the comparison
- Hyperbolic models show instability at larger sizes (p686m+ at t1_128; p1621m diverges entirely)

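The ~5-13% range can be reproduced directly from the results table. A small sketch using three matched rows copied from the table above:

```python
# Matched (hyp, euc) eval losses copied from the results table.
pairs = {
    "p020m t1_16": (6.0795, 6.4958),
    "p407m t1_8": (3.4614, 3.9945),
    "p1083m t1_2": (3.1223, 3.2614),
}
for name, (hyp, euc) in pairs.items():
    delta = hyp - euc                 # the table's Delta column
    rel = (euc - hyp) / euc * 100     # % relative reduction vs Euclidean
    print(f"{name}: delta={delta:+.4f}, relative reduction={rel:.1f}%")
```

These three rows give roughly 6.4%, 13.3%, and 4.3% relative reductions, bracketing the range quoted above.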
## Repository Structure

```
checkpoints/
  qwen3/                            # Euclidean models
    qwen3_p{SIZE}_t1_{N}_.../
      attempt{K}_{DATE}/
        checkpoint-{STEP}/          # Final checkpoint
  qwen3_hyp/                        # Hyperbolic models
    qwen3_hyp_p{SIZE}_t1_{N}_.../
      attempt{K}_{DATE}/
        checkpoint-{STEP}/

results/                            # Training logs (trainer_state.json)
  scaling_law/owt/
    qwen3/owt_scaling_v3/
    qwen3_hyp/owt_scaling_v3/
```

## Experiment Configuration

| Parameter | Value |
|---|---|
| Architecture | Qwen3 (custom sizes) |
| Vocab size | 151,936 |
| Context length | 1,024 |
| Dataset | OpenWebText (8.39B tokens) |
| Optimizer | NanochatMuon (Muon + per-group AdamW) |
| Muon targets | 2D transformer matrices |
| AdamW groups | embed (lr=0.2*scale), lm_head (lr=0.004*scale), misc (lr=0.004*scale) |
| LR scaling | (d_model/768)^(-0.5) |
| Precision | bf16 mixed precision |
| Infrastructure | 4x NVIDIA H100 80GB, DeepSpeed ZeRO-2 |

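How the per-group base LRs combine with the width scaling can be sketched as below. The group names and base values come from the table; the function itself is illustrative, not the repo's code.

```python
def group_lrs(d_model, base_lrs=None):
    """Per-group AdamW learning rates under the (d_model/768)^(-0.5)
    width scaling listed in the configuration table."""
    if base_lrs is None:
        base_lrs = {"embed": 0.2, "lm_head": 0.004, "misc": 0.004}
    scale = (d_model / 768) ** -0.5
    return {name: lr * scale for name, lr in base_lrs.items()}

print(group_lrs(768))   # scale = 1.0 at the 768-dim reference width
print(group_lrs(3072))  # scale = 0.5: wider models get smaller LRs
```

The inverse-square-root rule keeps effective update magnitudes roughly constant as the hidden dimension grows.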
## Citation

```bibtex
@misc{hyperscale2026,
  title={HyperScale: Scaling Laws for Hyperbolic Output Layers in Language Models},
  author={Jinrui Lin},
  year={2026},
  url={https://github.com/ObliviateRickLin/HyperScale}
}
```