---
license: mit
tags:
- scaling-laws
- hyperbolic-geometry
- language-model
- research
---

# HyperScale: Euclidean vs Hyperbolic Output Layer Scaling Laws

Scaling-law experiments comparing **Euclidean** (standard dot-product) and **Hyperbolic** (Lorentz-model) output layers for Qwen3 language models on OpenWebText.

## Project

- **Repository**: [github.com/ObliviateRickLin/HyperScale](https://github.com/ObliviateRickLin/HyperScale)
- **Base model**: Qwen3 architecture (custom sizes, untied embeddings)
- **Dataset**: OpenWebText (8.39B tokens total)
- **Optimizer**: NanochatMuon (Muon for 2D transformer matrices + per-group AdamW)
- **Training**: DeepSpeed ZeRO-2, bf16 mixed precision, 4x H100 80GB
## Key Differences

| | Euclidean | Hyperbolic |
|---|---|---|
| Output layer | Standard linear (dot-product logits) | Lorentz hyperboloid (Minkowski inner-product logits) |
| lm_head init | zeros | std=0.02 |
| embed_tokens init | std=1.0 | std=1.0 |
| tie_word_embeddings | false | false |
| Logit computation | `hidden @ lm_head.T` (fp32) | `<expmap(h), expmap(w)>_L * scale` (fp32, mean-centered) |
| logit_scale | N/A | `d_model / sinh(1)` (learnable) |
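
The hyperbolic logit computation in the table can be sketched numerically: lift hidden states and output-embedding rows onto the Lorentz hyperboloid via the exponential map at the origin, score with the Minkowski inner product, scale, and mean-center. A minimal NumPy sketch, not the repo's implementation; shapes, the curvature convention (c = -1), and the fp64 cast are assumptions:

```python
import numpy as np

def expmap0(v, eps=1e-6):
    # Exponential map at the hyperboloid origin (curvature -1):
    # lifts a Euclidean tangent vector v onto the Lorentz hyperboloid.
    norm = np.maximum(np.linalg.norm(v, axis=-1, keepdims=True), eps)
    time = np.cosh(norm)                    # time-like coordinate x0
    space = np.sinh(norm) * v / norm        # space-like coordinates
    return np.concatenate([time, space], axis=-1)

def lorentz_logits(hidden, weight, logit_scale):
    # Score with the Minkowski inner product <x, y>_L = -x0*y0 + <x_s, y_s>,
    # then scale and mean-center per position, as the table describes.
    h = expmap0(hidden.astype(np.float64))  # (batch, d+1)
    w = expmap0(weight.astype(np.float64))  # (vocab, d+1)
    mink = -np.outer(h[:, 0], w[:, 0]) + h[:, 1:] @ w[:, 1:].T
    logits = mink * logit_scale
    return logits - logits.mean(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d_model, vocab = 64, 100
hidden = rng.normal(0, 0.02, (2, d_model))
weight = rng.normal(0, 0.02, (vocab, d_model))
scale = d_model / np.sinh(1.0)              # logit_scale init from the table
logits = lorentz_logits(hidden, weight, scale)
```

Every lifted point satisfies the hyperboloid constraint `<x, x>_L = -1`, which is what makes the Minkowski score a geometrically meaningful similarity.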
## Known Issues

**Missing c_proj zero init**: Neither the Euclidean nor the Hyperbolic models zero-initialize `self_attn.o_proj` and `mlp.down_proj` (the Qwen3 equivalents of Karpathy's `c_proj`). In the nanochat/HypGPT reference implementations, these are zeroed alongside `lm_head`. This is a known confound: the Euclidean models (`lm_head` = zeros) are more affected than the Hyperbolic ones (`lm_head` std=0.02).
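
The missing init would amount to zeroing every residual-branch output projection by name. A minimal sketch of the name filter, assuming Hugging Face Qwen3 parameter naming; the example names are illustrative, and for the hyperbolic configuration `lm_head` would keep its std=0.02 init:

```python
# Karpathy-style "c_proj" zero init: residual-branch output projections
# (attention o_proj, MLP down_proj) are zeroed at init, alongside lm_head.
ZERO_INIT_SUFFIXES = (
    "self_attn.o_proj.weight",
    "mlp.down_proj.weight",
    "lm_head.weight",
)

def should_zero_init(param_name: str) -> bool:
    return param_name.endswith(ZERO_INIT_SUFFIXES)

# Illustrative parameter names in HF Qwen3 convention:
names = [
    "model.layers.0.self_attn.o_proj.weight",
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.0.mlp.down_proj.weight",
    "lm_head.weight",
]
to_zero = [n for n in names if should_zero_init(n)]
```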
## Results

Token budget `t1_N` means `1/N` of the full 8.39B-token dataset. Delta = Hyp − Euc; **Delta < 0 means hyperbolic is better.**
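The budget naming can be made concrete with a small helper (a sketch; `tokens_for` is not a function from the repo):

```python
# Token-budget naming: t1_N is 1/N of the full 8.39B-token dataset.
FULL_TOKENS = 8.39e9

def tokens_for(budget: str) -> float:
    # "t1_128" -> 8.39e9 / 128, i.e. ~65.5M tokens
    n = int(budget.split("_")[1])
    return FULL_TOKENS / n
```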
| Size | Params | Tokens | Hyp | Euc | Delta |
|---|---|---|---|---|---|
| p020m | 20M | 65.6M (t1_128) | 9.9145 | 10.0184 | -0.1039 |
| p020m | 20M | 131M (t1_64) | 8.4028 | 8.4610 | -0.0582 |
| p020m | 20M | 262M (t1_32) | 7.1117 | 7.4237 | -0.3120 |
| p020m | 20M | 524M (t1_16) | 6.0795 | 6.4958 | -0.4163 |
| p047m | 47M | 65.6M (t1_128) | 8.5956 | 8.6146 | -0.0190 |
| p047m | 47M | 262M (t1_32) | 6.0944 | 6.3668 | -0.2724 |
| p047m | 47M | 524M (t1_16) | 5.5383 | 5.7709 | -0.2326 |
| p109m | 109M | 131M (t1_64) | 6.1340 | 6.3509 | -0.2169 |
| p109m | 109M | 262M (t1_32) | 5.5677 | 5.7453 | -0.1776 |
| p109m | 109M | 524M (t1_16) | 5.3841 | 5.5431 | -0.1590 |
| p223m | 223M | 131M (t1_64) | 5.8612 | 6.0407 | -0.1795 |
| p407m | 407M | 65.6M (t1_128) | 6.8377 | 7.1486 | -0.3109 |
| p407m | 407M | 262M (t1_32) | 4.5280 | 5.1510 | -0.6230 |
| p407m | 407M | 524M (t1_16) | 4.4119 | 4.5080 | -0.0961 |
| p407m | 407M | 1.05B (t1_8) | 3.4614 | 3.9945 | -0.5331 |
| p407m | 407M | 2.10B (t1_4) | 3.5738 | 3.6206 | -0.0468 |
| p407m | 407M | 4.20B (t1_2) | 3.3230 | 3.3503 | -0.0273 |
| p407m | 407M | 8.39B (t1_1) | 3.1236 | -- | -- |
| p686m | 686M | 65.6M (t1_128) | 11.9171\* | 6.5685 | -- |
| p686m | 686M | 131M (t1_64) | 5.5675 | 5.7105 | -0.1430 |
| p686m | 686M | 262M (t1_32) | 4.9169 | 5.0321 | -0.1152 |
| p686m | 686M | 524M (t1_16) | 4.2684 | 4.3334 | -0.0650 |
| p686m | 686M | 1.05B (t1_8) | 3.7796 | 3.8389 | -0.0593 |
| p686m | 686M | 4.20B (t1_2) | 3.3219 | 3.2237 | +0.0982 |
| p1083m | 1.08B | 65.6M (t1_128) | 6.9223 | 7.9453 | -1.0230 |
| p1083m | 1.08B | 262M (t1_32) | 4.4833 | 7.0858\* | -- |
| p1083m | 1.08B | 524M (t1_16) | 4.1417 | 4.2060 | -0.0643 |
| p1083m | 1.08B | 1.05B (t1_8) | 3.3901 | 3.7304 | -0.3403 |
| p1083m | 1.08B | 4.20B (t1_2) | 3.1223 | 3.2614 | -0.1391 |
| p1083m | 1.08B | 8.39B (t1_1) | 2.9269 | -- | -- |
| p1621m | 1.62B | all | NaN | 3.30-6.54 | -- |
| p2324m | 2.32B | 262M (t1_32) | 4.6533 | 4.7401 | -0.0868 |
| p2324m | 2.32B | 524M (t1_16) | 4.0000 | 4.0594 | -0.0594 |
| p2324m | 2.32B | 4.20B (t1_2) | 3.0077 | 3.2524 | -0.2447 |
| p2324m | 2.32B | 8.39B (t1_1) | 3.1524 | -- | -- |

\* Anomalous values due to training instability/divergence; the p1621m hyperbolic run diverged entirely (all NaN).

## Summary

- **Hyperbolic is consistently better** in most matched comparisons (lower eval loss).
- Average improvement: roughly a 5-13% relative reduction in loss.
- The improvement is larger at medium token budgets (t1_32, t1_8) and diminishes at high token budgets (t1_2, t1_1).
- **Caveat**: the lm_head init difference (zeros vs std=0.02) confounds the comparison.
- Hyperbolic models show instability at larger sizes (p686m+ with t1_128; p1621m diverges entirely).
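
The quoted 5-13% range can be reproduced from matched rows of the Results table (a sketch; `rel_reduction` is not a repo function):

```python
# Relative loss reduction of hyperbolic vs Euclidean: (euc - hyp) / euc.
def rel_reduction(hyp: float, euc: float) -> float:
    return (euc - hyp) / euc

small = rel_reduction(6.0795, 6.4958)   # p020m @ t1_16
large = rel_reduction(3.4614, 3.9945)   # p407m @ t1_8
```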
## Repository Structure

```
checkpoints/
  qwen3/                            # Euclidean models
    qwen3_p{SIZE}_t1_{N}_.../
      attempt{K}_{DATE}/
        checkpoint-{STEP}/          # Final checkpoint
  qwen3_hyp/                        # Hyperbolic models
    qwen3_hyp_p{SIZE}_t1_{N}_.../
      attempt{K}_{DATE}/
        checkpoint-{STEP}/

results/                            # Training logs (trainer_state.json)
  scaling_law/owt/
    qwen3/owt_scaling_v3/
    qwen3_hyp/owt_scaling_v3/
```

## Experiment Configuration

| Parameter | Value |
|---|---|
| Architecture | Qwen3 (custom sizes) |
| Vocab size | 151,936 |
| Context length | 1,024 |
| Dataset | OpenWebText (8.39B tokens) |
| Optimizer | NanochatMuon (Muon + per-group AdamW) |
| Muon targets | 2D transformer matrices |
| AdamW groups | embed (lr=0.2*scale), lm_head (lr=0.004*scale), misc (lr=0.004*scale) |
| LR scaling | (d_model/768)^(-0.5) |
| Precision | bf16 mixed precision |
| Infrastructure | 4x NVIDIA H100 80GB, DeepSpeed ZeRO-2 |
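
The LR-scaling rule from the table can be sketched as follows (`group_lrs` is an illustrative helper, not repo code):

```python
# Per-group learning rates: each base LR is multiplied by (d_model / 768) ** -0.5,
# so wider models get proportionally smaller learning rates.
def group_lrs(d_model: int) -> dict:
    scale = (d_model / 768) ** -0.5
    return {
        "embed": 0.2 * scale,
        "lm_head": 0.004 * scale,
        "misc": 0.004 * scale,
    }
```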
## Citation

```bibtex
@misc{hyperscale2026,
  title={HyperScale: Scaling Laws for Hyperbolic Output Layers in Language Models},
  author={Jinrui Lin},
  year={2026},
  url={https://github.com/ObliviateRickLin/HyperScale}
}
```