---
license: mit
tags:
- scaling-laws
- hyperbolic-geometry
- language-model
- research
---

# HyperScale: Euclidean vs Hyperbolic Output Layer Scaling Laws

Scaling law experiments comparing **Euclidean** (standard dot-product) vs **Hyperbolic** (Lorentz model) output layers for Qwen3 language models on OpenWebText.

## Project

- **Repository**: [github.com/ObliviateRickLin/HyperScale](https://github.com/ObliviateRickLin/HyperScale)
- **Base model**: Qwen3 architecture (custom sizes, untied embeddings)
- **Dataset**: OpenWebText (8.39B tokens total)
- **Optimizer**: NanochatMuon (Muon for 2D transformer matrices + per-group AdamW)
- **Training**: DeepSpeed ZeRO-2, bf16 mixed precision, 4x H100 80GB

## Key Differences

| | Euclidean | Hyperbolic |
|---|---|---|
| Output layer | Standard linear (dot-product logits) | Lorentz hyperboloid (Minkowski inner product logits) |
| lm_head init | zeros | std=0.02 |
| embed_tokens init | std=1.0 | std=1.0 |
| tie_word_embeddings | false | false |
| Logit computation | `hidden @ lm_head.T` (fp32) | `<expmap(h), expmap(w)>_L * scale` (fp32, mean-centered) |
| logit_scale | N/A | `d_model / sinh(1)` (learnable) |
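The hyperbolic logit recipe in the table can be sketched as follows. This is a minimal NumPy illustration of the Lorentz-model computation (exponential map at the origin, Minkowski inner product, learnable scale, mean-centering); the function names are hypothetical and this is not code from the repository.

```python
import numpy as np

def expmap0(v, eps=1e-6):
    """Exponential map at the hyperboloid origin: R^d -> Lorentz model in R^(d+1)."""
    norm = np.maximum(np.linalg.norm(v, axis=-1, keepdims=True), eps)
    time = np.cosh(norm)                 # x_0 (time-like) coordinate
    space = np.sinh(norm) * v / norm     # spatial coordinates
    return np.concatenate([time, space], axis=-1)

def lorentz_logits(hidden, weight, scale):
    """Logits = <expmap(h), expmap(w)>_L * scale, mean-centered, in fp32."""
    h = expmap0(hidden.astype(np.float32))
    w = expmap0(weight.astype(np.float32))
    sign = np.ones(h.shape[-1], dtype=h.dtype)
    sign[0] = -1.0                       # Minkowski metric: negate the time term
    logits = (h * sign) @ w.T * scale
    return logits - logits.mean(axis=-1, keepdims=True)

d_model = 8
scale = d_model / np.sinh(1.0)           # logit_scale init from the table
```

A quick sanity check: for any tangent vector `v`, the Minkowski inner product `<expmap0(v), expmap0(v)>_L` equals `-1`, confirming that mapped points lie on the unit hyperboloid.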

## Known Issues

**Missing c_proj zero init**: Neither the Euclidean nor the Hyperbolic models zero-initialize `self_attn.o_proj` and `mlp.down_proj` (the Qwen3 equivalents of Karpathy's `c_proj`). In the nanochat/HypGPT reference implementations, these are zeroed alongside `lm_head`. This is a known confound: the Euclidean models (`lm_head` = zeros) are more affected than the Hyperbolic ones (`lm_head` std = 0.02).
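A minimal sketch of the missing init, assuming the Hugging Face Qwen3 module layout (`self_attn.o_proj`, `mlp.down_proj`); the helper name is hypothetical and this is not code from the repository.

```python
import torch.nn as nn

def zero_init_residual_projections(model: nn.Module) -> None:
    """Zero the residual-stream projections (nanochat/HypGPT-style c_proj init)."""
    for name, module in model.named_modules():
        if name.endswith(("self_attn.o_proj", "mlp.down_proj")):
            nn.init.zeros_(module.weight)
```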

## Results

Eval losses per model size and token budget. A token budget of `t1_N` means `1/N` of the full 8.39B-token dataset. Delta = Hyp − Euc; **Delta < 0 means hyperbolic is better.**

| Size | Params | Tokens | Hyp | Euc | Delta |
|---|---|---|---|---|---|
| p020m | 20M | 65.6M (t1_128) | 9.9145 | 10.0184 | -0.1039 |
| p020m | 20M | 131M (t1_64) | 8.4028 | 8.4610 | -0.0582 |
| p020m | 20M | 262M (t1_32) | 7.1117 | 7.4237 | -0.3120 |
| p020m | 20M | 524M (t1_16) | 6.0795 | 6.4958 | -0.4163 |
| p047m | 47M | 65.6M (t1_128) | 8.5956 | 8.6146 | -0.0190 |
| p047m | 47M | 262M (t1_32) | 6.0944 | 6.3668 | -0.2724 |
| p047m | 47M | 524M (t1_16) | 5.5383 | 5.7709 | -0.2326 |
| p109m | 109M | 131M (t1_64) | 6.1340 | 6.3509 | -0.2169 |
| p109m | 109M | 262M (t1_32) | 5.5677 | 5.7453 | -0.1776 |
| p109m | 109M | 524M (t1_16) | 5.3841 | 5.5431 | -0.1590 |
| p223m | 223M | 131M (t1_64) | 5.8612 | 6.0407 | -0.1795 |
| p407m | 407M | 65.6M (t1_128) | 6.8377 | 7.1486 | -0.3109 |
| p407m | 407M | 262M (t1_32) | 4.5280 | 5.1510 | -0.6230 |
| p407m | 407M | 524M (t1_16) | 4.4119 | 4.5080 | -0.0961 |
| p407m | 407M | 1.05B (t1_8) | 3.4614 | 3.9945 | -0.5331 |
| p407m | 407M | 2.10B (t1_4) | 3.5738 | 3.6206 | -0.0468 |
| p407m | 407M | 4.20B (t1_2) | 3.3230 | 3.3503 | -0.0273 |
| p407m | 407M | 8.39B (t1_1) | 3.1236 | -- | -- |
| p686m | 686M | 131M (t1_64) | 5.5675 | 5.7105 | -0.1430 |
| p686m | 686M | 262M (t1_32) | 4.9169 | 5.0321 | -0.1152 |
| p686m | 686M | 524M (t1_16) | 4.2684 | 4.3334 | -0.0650 |
| p686m | 686M | 1.05B (t1_8) | 3.7796 | 3.8389 | -0.0593 |
| p686m | 686M | 4.20B (t1_2) | 3.3219 | 3.2237 | +0.0982 |
| p686m | 686M | 65.6M (t1_128) | 11.9171* | 6.5685 | -- |
| p1083m | 1.08B | 65.6M (t1_128) | 6.9223 | 7.9453 | -1.0230 |
| p1083m | 1.08B | 262M (t1_32) | 4.4833 | 7.0858* | -- |
| p1083m | 1.08B | 524M (t1_16) | 4.1417 | 4.2060 | -0.0643 |
| p1083m | 1.08B | 1.05B (t1_8) | 3.3901 | 3.7304 | -0.3403 |
| p1083m | 1.08B | 4.20B (t1_2) | 3.1223 | 3.2614 | -0.1391 |
| p1083m | 1.08B | 8.39B (t1_1) | 2.9269 | -- | -- |
| p1621m | 1.62B | all | NaN | 3.30-6.54 | -- |
| p2324m | 2.32B | 262M (t1_32) | 4.6533 | 4.7401 | -0.0868 |
| p2324m | 2.32B | 524M (t1_16) | 4.0000 | 4.0594 | -0.0594 |
| p2324m | 2.32B | 4.20B (t1_2) | 3.0077 | 3.2524 | -0.2447 |
| p2324m | 2.32B | 8.39B (t1_1) | 3.1524 | -- | -- |

\* Anomalous values due to training instability or divergence; the p1621m hyperbolic runs diverged entirely (NaN at every token budget).
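The Delta column and the relative improvements cited in the Summary can be recomputed from any row; a quick check using the p407m / t1_32 row:

```python
# Values taken from the p407m / t1_32 row of the results table.
hyp, euc = 4.5280, 5.1510
delta = hyp - euc    # the Delta column; negative favors hyperbolic
rel = delta / euc    # relative change in eval loss vs Euclidean
print(f"delta = {delta:+.4f}, relative = {rel:+.2%}")  # delta = -0.6230, relative = -12.09%
```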

## Summary

- **Hyperbolic is better in nearly all matched comparisons** (lower eval loss); the only exception is p686m at t1_2
- Typical improvement: ~5-13% relative reduction in eval loss
- The gap is larger at medium token budgets (t1_32, t1_8) and shrinks at high token budgets (t1_2, t1_1)
- **Caveat**: the `lm_head` init difference (zeros vs std=0.02) confounds the comparison
- Hyperbolic models show instability at larger sizes (e.g. p686m at t1_128; p1621m diverges entirely)

## Repository Structure

```
checkpoints/
  qwen3/                            # Euclidean models
    qwen3_p{SIZE}_t1_{N}_.../
      attempt{K}_{DATE}/
        checkpoint-{STEP}/          # Final checkpoint
  qwen3_hyp/                        # Hyperbolic models
    qwen3_hyp_p{SIZE}_t1_{N}_.../
      attempt{K}_{DATE}/
        checkpoint-{STEP}/

results/                            # Training logs (trainer_state.json)
  scaling_law/owt/
    qwen3/owt_scaling_v3/
    qwen3_hyp/owt_scaling_v3/
```

## Experiment Configuration

| Parameter | Value |
|---|---|
| Architecture | Qwen3 (custom sizes) |
| Vocab size | 151,936 |
| Context length | 1,024 |
| Dataset | OpenWebText (8.39B tokens) |
| Optimizer | NanochatMuon (Muon + per-group AdamW) |
| Muon targets | 2D transformer matrices |
| AdamW groups | embed (lr=0.2*scale), lm_head (lr=0.004*scale), misc (lr=0.004*scale) |
| LR scaling | (d_model/768)^(-0.5) |
| Precision | bf16 mixed precision |
| Infrastructure | 4x NVIDIA H100 80GB, DeepSpeed ZeRO-2 |
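The per-group learning rates follow from the two rows above; a small illustrative helper (the function name is hypothetical; the base rates and the scale rule are taken from the table):

```python
def adamw_group_lrs(d_model: int) -> dict:
    """Per-group AdamW lrs with the (d_model/768)^(-0.5) scale from the table."""
    scale = (d_model / 768) ** -0.5
    return {
        "embed":   0.2   * scale,
        "lm_head": 0.004 * scale,
        "misc":    0.004 * scale,
    }
```

For example, at `d_model = 3072` the scale is `(3072/768)^(-0.5) = 0.5`, so the embed lr drops from 0.2 to 0.1.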

## Citation

```bibtex
@misc{hyperscale2026,
  title={HyperScale: Scaling Laws for Hyperbolic Output Layers in Language Models},
  author={Jinrui Lin},
  year={2026},
  url={https://github.com/ObliviateRickLin/HyperScale}
}
```