---
license: mit
tags:
- scaling-laws
- hyperbolic-geometry
- language-model
- research
---
# HyperScale: Euclidean vs Hyperbolic Output Layer Scaling Laws
Scaling law experiments comparing **Euclidean** (standard dot-product) vs **Hyperbolic** (Lorentz model) output layers for Qwen3 language models on OpenWebText.
## Project
- **Repository**: [github.com/ObliviateRickLin/HyperScale](https://github.com/ObliviateRickLin/HyperScale)
- **Base model**: Qwen3 architecture (custom sizes, untied embeddings)
- **Dataset**: OpenWebText (8.39B tokens total)
- **Optimizer**: NanochatMuon (Muon for 2D transformer matrices + per-group AdamW)
- **Training**: DeepSpeed ZeRO-2, bf16 mixed precision, 4x H100 80GB
## Key Differences
| | Euclidean | Hyperbolic |
|---|---|---|
| Output layer | Standard linear (dot-product logits) | Lorentz hyperboloid (Minkowski inner product logits) |
| lm_head init | zeros | std=0.02 |
| embed_tokens init | std=1.0 | std=1.0 |
| tie_word_embeddings | false | false |
| Logit computation | `hidden @ lm_head.T` (fp32) | `<expmap(h), expmap(w)>_L * scale` (fp32, mean-centered) |
| logit_scale | N/A | `d_model / sinh(1)` (learnable) |
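The hyperbolic logit computation from the table can be sketched in plain NumPy. This is an illustrative reconstruction, not the repository's code: the exact exponential-map formulation, the curvature (assumed -1 here), and the placement of the mean-centering step (applied to the logits here) are assumptions.

```python
import numpy as np

def expmap0(v, eps=1e-8):
    """Exponential map at the origin of the Lorentz hyperboloid
    (curvature -1): lifts a Euclidean vector in R^d to R^(d+1)."""
    norm = np.maximum(np.linalg.norm(v, axis=-1, keepdims=True), eps)
    return np.concatenate([np.cosh(norm), np.sinh(norm) * v / norm], axis=-1)

def minkowski_inner(x, y):
    """Lorentz (Minkowski) inner product: -x_0*y_0 + sum_i x_i*y_i."""
    return -x[..., 0] * y[..., 0] + np.sum(x[..., 1:] * y[..., 1:], axis=-1)

def lorentz_logits(hidden, lm_head, logit_scale):
    """Scaled Lorentz inner products between lifted hidden states and
    lifted lm_head rows, mean-centered over the vocabulary."""
    h = expmap0(hidden)                         # (batch, d+1)
    w = expmap0(lm_head)                        # (vocab, d+1)
    logits = minkowski_inner(h[:, None, :], w[None, :, :])
    return (logits - logits.mean(axis=-1, keepdims=True)) * logit_scale

# Toy shapes; init values mirror the table (lm_head std=0.02,
# logit_scale initialized to d_model / sinh(1)).
d_model, vocab = 8, 16
rng = np.random.default_rng(0)
hidden = rng.normal(size=(2, d_model))
lm_head = rng.normal(scale=0.02, size=(vocab, d_model))
logits = lorentz_logits(hidden, lm_head, d_model / np.sinh(1.0))
```

Every lifted point satisfies `<x, x>_L = -1` (it lies on the hyperboloid), and the Minkowski inner product of two such points is `-cosh` of their hyperbolic distance, so these logits are monotone in distance on the manifold.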
## Known Issues
**Missing c_proj zero-init**: Neither the Euclidean nor the Hyperbolic models zero-initialize `self_attn.o_proj` and `mlp.down_proj` (the Qwen3 equivalents of Karpathy's `c_proj`). In the nanochat/HypGPT references, these layers ARE zeroed alongside `lm_head`. This is a known confound, and it affects the Euclidean setup (`lm_head` = zeros) more than the Hyperbolic one (`lm_head` std = 0.02).
## Results
Token budget `t1_N` means `1/N` of the full 8.39B-token dataset. **Delta = Hyp - Euc, so Delta < 0 means hyperbolic is better.**
| Size | Params | Tokens | Hyp | Euc | Delta |
|---|---|---|---|---|---|
| p020m | 20M | 65.6M (t1_128) | 9.9145 | 10.0184 | -0.1039 |
| p020m | 20M | 131M (t1_64) | 8.4028 | 8.4610 | -0.0582 |
| p020m | 20M | 262M (t1_32) | 7.1117 | 7.4237 | -0.3120 |
| p020m | 20M | 524M (t1_16) | 6.0795 | 6.4958 | -0.4163 |
| p047m | 47M | 65.6M (t1_128) | 8.5956 | 8.6146 | -0.0190 |
| p047m | 47M | 262M (t1_32) | 6.0944 | 6.3668 | -0.2724 |
| p047m | 47M | 524M (t1_16) | 5.5383 | 5.7709 | -0.2326 |
| p109m | 109M | 131M (t1_64) | 6.1340 | 6.3509 | -0.2169 |
| p109m | 109M | 262M (t1_32) | 5.5677 | 5.7453 | -0.1776 |
| p109m | 109M | 524M (t1_16) | 5.3841 | 5.5431 | -0.1590 |
| p223m | 223M | 131M (t1_64) | 5.8612 | 6.0407 | -0.1795 |
| p407m | 407M | 65.6M (t1_128) | 6.8377 | 7.1486 | -0.3109 |
| p407m | 407M | 262M (t1_32) | 4.5280 | 5.1510 | -0.6230 |
| p407m | 407M | 524M (t1_16) | 4.4119 | 4.5080 | -0.0961 |
| p407m | 407M | 1.05B (t1_8) | 3.4614 | 3.9945 | -0.5331 |
| p407m | 407M | 2.10B (t1_4) | 3.5738 | 3.6206 | -0.0468 |
| p407m | 407M | 4.20B (t1_2) | 3.3230 | 3.3503 | -0.0273 |
| p407m | 407M | 8.39B (t1_1) | 3.1236 | -- | -- |
| p686m | 686M | 65.6M (t1_128) | 11.9171* | 6.5685 | -- |
| p686m | 686M | 131M (t1_64) | 5.5675 | 5.7105 | -0.1430 |
| p686m | 686M | 262M (t1_32) | 4.9169 | 5.0321 | -0.1152 |
| p686m | 686M | 524M (t1_16) | 4.2684 | 4.3334 | -0.0650 |
| p686m | 686M | 1.05B (t1_8) | 3.7796 | 3.8389 | -0.0593 |
| p686m | 686M | 4.20B (t1_2) | 3.3219 | 3.2237 | +0.0982 |
| p1083m | 1.08B | 65.6M (t1_128) | 6.9223 | 7.9453 | -1.0230 |
| p1083m | 1.08B | 262M (t1_32) | 4.4833 | 7.0858* | -- |
| p1083m | 1.08B | 524M (t1_16) | 4.1417 | 4.2060 | -0.0643 |
| p1083m | 1.08B | 1.05B (t1_8) | 3.3901 | 3.7304 | -0.3403 |
| p1083m | 1.08B | 4.20B (t1_2) | 3.1223 | 3.2614 | -0.1391 |
| p1083m | 1.08B | 8.39B (t1_1) | 2.9269 | -- | -- |
| p1621m | 1.62B | all | NaN | 3.30-6.54 | -- |
| p2324m | 2.32B | 262M (t1_32) | 4.6533 | 4.7401 | -0.0868 |
| p2324m | 2.32B | 524M (t1_16) | 4.0000 | 4.0594 | -0.0594 |
| p2324m | 2.32B | 4.20B (t1_2) | 3.0077 | 3.2524 | -0.2447 |
| p2324m | 2.32B | 8.39B (t1_1) | 3.1524 | -- | -- |
\* Anomalous values caused by training instability/divergence. The p1621m hyperbolic runs diverged entirely (all NaN).
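The budget labels and deltas above are simple arithmetic; a small helper (hypothetical names, not from the repository) makes the conventions explicit:

```python
TOTAL_TOKENS = 8.39e9  # full OpenWebText budget from the table

def budget_tokens(n):
    """t1_N denotes 1/N of the full token budget."""
    return TOTAL_TOKENS / n

def relative_improvement(hyp_loss, euc_loss):
    """Percent reduction in eval loss of hyperbolic vs Euclidean."""
    return (euc_loss - hyp_loss) / euc_loss * 100

# Reproducing two entries from the table:
print(budget_tokens(16))                      # ~5.24e8 tokens, i.e. 524M (t1_16)
print(relative_improvement(4.5280, 5.1510))   # p407m @ t1_32: ~12.1%
```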
## Summary
- **Hyperbolic is better in nearly all matched comparisons** (lower eval loss); the only positive delta is p686m at t1_2 (+0.0982)
- Average improvement: roughly 5-13% relative reduction in eval loss
- The improvement is larger at medium token budgets (t1_32, t1_8) and diminishes at high token budgets (t1_2, t1_1)
- **Caveat**: the `lm_head` init difference (zeros vs std=0.02) confounds the comparison
- Hyperbolic models show instability at larger sizes (p686m is anomalous at t1_128; p1621m diverges entirely)
## Repository Structure
```
checkpoints/
qwen3/ # Euclidean models
qwen3_p{SIZE}_t1_{N}_.../
attempt{K}_{DATE}/
checkpoint-{STEP}/ # Final checkpoint
qwen3_hyp/ # Hyperbolic models
qwen3_hyp_p{SIZE}_t1_{N}_.../
attempt{K}_{DATE}/
checkpoint-{STEP}/
results/ # Training logs (trainer_state.json)
scaling_law/owt/
qwen3/owt_scaling_v3/
qwen3_hyp/owt_scaling_v3/
```
## Experiment Configuration
| Parameter | Value |
|---|---|
| Architecture | Qwen3 (custom sizes) |
| Vocab size | 151,936 |
| Context length | 1,024 |
| Dataset | OpenWebText (8.39B tokens) |
| Optimizer | NanochatMuon (Muon + per-group AdamW) |
| Muon targets | 2D transformer matrices |
| AdamW groups | embed (lr=0.2*scale), lm_head (lr=0.004*scale), misc (lr=0.004*scale) |
| LR scaling | (d_model/768)^(-0.5) |
| Precision | bf16 mixed precision |
| Infrastructure | 4x NVIDIA H100 80GB, DeepSpeed ZeRO-2 |
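The per-group learning rates and the width-scaling rule from the table combine as follows. This is a sketch using the tabulated values; the function and group names are illustrative, not the repository's API.

```python
# Base AdamW learning rates per parameter group (from the table above).
BASE_LR = {"embed": 0.2, "lm_head": 0.004, "misc": 0.004}

def group_lrs(d_model, ref_width=768):
    """Apply the (d_model/768)^(-0.5) width scaling to each group's base LR."""
    scale = (d_model / ref_width) ** -0.5
    return {name: lr * scale for name, lr in BASE_LR.items()}

# A model twice the reference width trains each group at base_lr / sqrt(2):
lrs = group_lrs(d_model=1536)
```

At the reference width (d_model=768) the scale is 1.0 and the base rates are used unchanged; wider models get proportionally smaller learning rates.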
## Citation
```bibtex
@misc{hyperscale2026,
title={HyperScale: Scaling Laws for Hyperbolic Output Layers in Language Models},
author={Jinrui Lin},
year={2026},
url={https://github.com/ObliviateRickLin/HyperScale}
}
```