	---
	license: mit
	tags:
	- scaling-laws
	- hyperbolic-geometry
	- language-model
	- research
	---

	# HyperScale: Euclidean vs Hyperbolic Output Layer Scaling Laws

	Scaling law experiments comparing Euclidean (standard dot-product) vs Hyperbolic (Lorentz model) output layers for Qwen3 language models on OpenWebText.

	## Project

	- Repository: [github.com/ObliviateRickLin/HyperScale](https://github.com/ObliviateRickLin/HyperScale)
	- Base model: Qwen3 architecture (custom sizes, untied embeddings)
	- Dataset: OpenWebText (8.39B tokens total)
	- Optimizer: NanochatMuon (Muon for 2D transformer matrices + per-group AdamW)
	- Training: DeepSpeed ZeRO-2, bf16 mixed precision, 4x H100 80GB

	## Key Differences

	| | Euclidean | Hyperbolic |
	|---|---|---|
	| Output layer | Standard linear (dot-product logits) | Lorentz hyperboloid (Minkowski inner product logits) |
	| lm_head init | zeros | std=0.02 |
	| embed_tokens init | std=1.0 | std=1.0 |
	| tie_word_embeddings | false | false |
	| Logit computation | `hidden @ lm_head.T` (fp32) | `<expmap(h), expmap(w)>_L * scale` (fp32, mean-centered) |
	| logit_scale | N/A | `d_model / sinh(1)` (learnable) |
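
The hyperbolic logit formula in the table can be illustrated with a minimal pure-Python sketch. It is not the repository's implementation: it assumes curvature -1, the exponential map taken at the hyperboloid origin, and mean-centering across the vocabulary axis, and the function names are hypothetical. The real models compute this on fp32 tensors.

```python
import math

def expmap0(v):
    """Exponential map at the hyperboloid origin (curvature -1 is an assumption).

    Maps a tangent vector v in R^d to a point x in R^(d+1) with <x, x>_L = -1.
    """
    norm = math.sqrt(sum(c * c for c in v))
    if norm < 1e-12:
        # The zero vector maps to the origin of the hyperboloid.
        return [1.0] + [0.0] * len(v)
    coef = math.sinh(norm) / norm
    return [math.cosh(norm)] + [coef * c for c in v]

def minkowski_inner(x, y):
    """Lorentzian inner product: -x0*y0 + sum_i xi*yi."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))

def hyperbolic_logits(hidden, lm_head_rows, scale):
    """Scaled Minkowski inner products against every vocab row, then mean-centered."""
    h = expmap0(hidden)
    raw = [scale * minkowski_inner(h, expmap0(w)) for w in lm_head_rows]
    mean = sum(raw) / len(raw)  # centering over the vocab axis (assumed)
    return [r - mean for r in raw]

def initial_logit_scale(d_model):
    """Initial value of the learnable scale from the table: d_model / sinh(1)."""
    return d_model / math.sinh(1.0)

# Toy example: 4-dimensional hidden state, 3 "vocabulary" rows.
hidden = [0.3, -0.1, 0.2, 0.05]
lm_head_rows = [
    [0.3, -0.1, 0.2, 0.05],   # same direction and norm as hidden
    [-0.4, 0.2, 0.0, 0.1],
    [0.0, 0.0, 0.0, 0.0],     # maps to the hyperboloid origin
]
logits = hyperbolic_logits(hidden, lm_head_rows, initial_logit_scale(len(hidden)))
print(logits)
```

The Minkowski inner product of two points on the hyperboloid is at most -1 and equals -1 only when the points coincide, so the row that matches the hidden state gets the largest logit. Mean-centering does not change the softmax output; it only shifts the logits.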

	## Known Issues

	Missing c_proj zero init: neither the Euclidean nor the Hyperbolic models zero-initialize `self_attn.o_proj` and `mlp.down_proj` (the Qwen3 equivalents of Karpathy's `c_proj`). In the nanochat/HypGPT reference, these are zeroed alongside lm_head. This is a known confound: the Euclidean models (lm_head=zeros) are more affected than the Hyperbolic models (lm_head std=0.02).
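
As a rough illustration of which weights the missing zero init would cover, the sketch below filters parameter names by suffix. The suffixes assume Hugging Face-style Qwen3 parameter naming and the helper name is hypothetical; none of this is taken from the HyperScale code.

```python
# Suffixes assume Hugging Face-style Qwen3 parameter names (an assumption).
ZERO_INIT_SUFFIXES = ("self_attn.o_proj.weight", "mlp.down_proj.weight")

def names_to_zero_init(param_names):
    """Return the parameter names that a c_proj-style zero init would cover."""
    return [name for name in param_names if name.endswith(ZERO_INIT_SUFFIXES)]

example_names = [
    "model.embed_tokens.weight",
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.0.self_attn.o_proj.weight",
    "model.layers.0.mlp.up_proj.weight",
    "model.layers.0.mlp.down_proj.weight",
    "lm_head.weight",
]
print(names_to_zero_init(example_names))
```

With PyTorch, the same filter could be applied while iterating `model.named_parameters()`, calling `torch.nn.init.zeros_` on each match.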

	## Results

	Token budget `t1_N` means `1/N` of the full 8.39B token dataset. Hyp and Euc are eval losses, and Delta = Hyp - Euc, so Delta < 0 means hyperbolic is better.

	| Size | Params | Tokens | Hyp | Euc | Delta |
	|---|---|---|---|---|---|
	| p020m | 20M | 65.6M (t1_128) | 9.9145 | 10.0184 | -0.1039 |
	| p020m | 20M | 131M (t1_64) | 8.4028 | 8.4610 | -0.0582 |
	| p020m | 20M | 262M (t1_32) | 7.1117 | 7.4237 | -0.3120 |
	| p020m | 20M | 524M (t1_16) | 6.0795 | 6.4958 | -0.4163 |
	| p047m | 47M | 65.6M (t1_128) | 8.5956 | 8.6146 | -0.0190 |
	| p047m | 47M | 262M (t1_32) | 6.0944 | 6.3668 | -0.2724 |
	| p047m | 47M | 524M (t1_16) | 5.5383 | 5.7709 | -0.2326 |
	| p109m | 109M | 131M (t1_64) | 6.1340 | 6.3509 | -0.2169 |
	| p109m | 109M | 262M (t1_32) | 5.5677 | 5.7453 | -0.1776 |
	| p109m | 109M | 524M (t1_16) | 5.3841 | 5.5431 | -0.1590 |
	| p223m | 223M | 131M (t1_64) | 5.8612 | 6.0407 | -0.1795 |
	| p407m | 407M | 65.6M (t1_128) | 6.8377 | 7.1486 | -0.3109 |
	| p407m | 407M | 262M (t1_32) | 4.5280 | 5.1510 | -0.6230 |
	| p407m | 407M | 524M (t1_16) | 4.4119 | 4.5080 | -0.0961 |
	| p407m | 407M | 1.05B (t1_8) | 3.4614 | 3.9945 | -0.5331 |
	| p407m | 407M | 2.10B (t1_4) | 3.5738 | 3.6206 | -0.0468 |
	| p407m | 407M | 4.20B (t1_2) | 3.3230 | 3.3503 | -0.0273 |
	| p407m | 407M | 8.39B (t1_1) | 3.1236 | -- | -- |
	| p686m | 686M | 65.6M (t1_128) | 11.9171* | 6.5685 | -- |
	| p686m | 686M | 131M (t1_64) | 5.5675 | 5.7105 | -0.1430 |
	| p686m | 686M | 262M (t1_32) | 4.9169 | 5.0321 | -0.1152 |
	| p686m | 686M | 524M (t1_16) | 4.2684 | 4.3334 | -0.0650 |
	| p686m | 686M | 1.05B (t1_8) | 3.7796 | 3.8389 | -0.0593 |
	| p686m | 686M | 4.20B (t1_2) | 3.3219 | 3.2237 | +0.0982 |
	| p1083m | 1.08B | 65.6M (t1_128) | 6.9223 | 7.9453 | -1.0230 |
	| p1083m | 1.08B | 262M (t1_32) | 4.4833 | 7.0858* | -- |
	| p1083m | 1.08B | 524M (t1_16) | 4.1417 | 4.2060 | -0.0643 |
	| p1083m | 1.08B | 1.05B (t1_8) | 3.3901 | 3.7304 | -0.3403 |
	| p1083m | 1.08B | 4.20B (t1_2) | 3.1223 | 3.2614 | -0.1391 |
	| p1083m | 1.08B | 8.39B (t1_1) | 2.9269 | -- | -- |
	| p1621m | 1.62B | all | NaN | 3.30-6.54 | -- |
	| p2324m | 2.32B | 262M (t1_32) | 4.6533 | 4.7401 | -0.0868 |
	| p2324m | 2.32B | 524M (t1_16) | 4.0000 | 4.0594 | -0.0594 |
	| p2324m | 2.32B | 4.20B (t1_2) | 3.0077 | 3.2524 | -0.2447 |
	| p2324m | 2.32B | 8.39B (t1_1) | 3.1524 | -- | -- |

	\* Anomalous values due to training instability/divergence. p1621m hyp diverged entirely (all NaN).
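
The Tokens and Delta columns follow from two small formulas, sketched below. The 8.39B total is the rounded figure quoted in this README, so the computed budgets can differ from the table in the last digit; the function names are hypothetical.

```python
TOTAL_TOKENS = 8.39e9  # rounded OpenWebText token count quoted in this README

def tokens_for_budget(n):
    """Approximate token count for budget t1_n, i.e. 1/n of the dataset."""
    return TOTAL_TOKENS / n

def delta(hyp_loss, euc_loss):
    """Delta as tabulated: hyperbolic eval loss minus Euclidean eval loss."""
    return hyp_loss - euc_loss

for n in (128, 64, 32, 16, 8, 4, 2, 1):
    print(f"t1_{n}: {tokens_for_budget(n) / 1e6:.1f}M tokens")

# p020m at t1_128, using the two losses from the first table row.
print(round(delta(9.9145, 10.0184), 4))
```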

	## Summary

	- Hyperbolic has lower eval loss in 28 of the 29 matched comparisons that report a Delta; the exception is p686m at t1_2
	- Relative reduction in loss: roughly 0.2-13% per matched row in the Results table (about 4% on average)
	- Improvement tends to be larger at medium token budgets (t1_32, t1_8) and smaller at t1_2; t1_1 has no Euclidean run to compare against
	- Caveat: init difference (lm_head zeros vs std=0.02) confounds the comparison
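	- Hyperbolic models show instability at larger sizes: p686m at t1_128 ends at an anomalous loss (11.9171) and p1621m diverges to NaN at every token budget. The Euclidean p1083m run at t1_32 is also anomalous (7.0858)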
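
The aggregate figures in this summary can be rechecked from the Results table. The sketch below copies the rows that report both losses without an anomaly marker and computes the win count, the relative reduction `(Euc - Hyp) / Euc`, and a per-budget average. The choice of relative-reduction formula and the variable names are assumptions, not the author's analysis script.

```python
from collections import defaultdict

# (size, budget N, hyp loss, euc loss), copied from the Results table.
MATCHED = [
    ("p020m", 128, 9.9145, 10.0184), ("p020m", 64, 8.4028, 8.4610),
    ("p020m", 32, 7.1117, 7.4237), ("p020m", 16, 6.0795, 6.4958),
    ("p047m", 128, 8.5956, 8.6146), ("p047m", 32, 6.0944, 6.3668),
    ("p047m", 16, 5.5383, 5.7709),
    ("p109m", 64, 6.1340, 6.3509), ("p109m", 32, 5.5677, 5.7453),
    ("p109m", 16, 5.3841, 5.5431),
    ("p223m", 64, 5.8612, 6.0407),
    ("p407m", 128, 6.8377, 7.1486), ("p407m", 32, 4.5280, 5.1510),
    ("p407m", 16, 4.4119, 4.5080), ("p407m", 8, 3.4614, 3.9945),
    ("p407m", 4, 3.5738, 3.6206), ("p407m", 2, 3.3230, 3.3503),
    ("p686m", 64, 5.5675, 5.7105), ("p686m", 32, 4.9169, 5.0321),
    ("p686m", 16, 4.2684, 4.3334), ("p686m", 8, 3.7796, 3.8389),
    ("p686m", 2, 3.3219, 3.2237),
    ("p1083m", 128, 6.9223, 7.9453), ("p1083m", 16, 4.1417, 4.2060),
    ("p1083m", 8, 3.3901, 3.7304), ("p1083m", 2, 3.1223, 3.2614),
    ("p2324m", 32, 4.6533, 4.7401), ("p2324m", 16, 4.0000, 4.0594),
    ("p2324m", 2, 3.0077, 3.2524),
]

# Count rows where the hyperbolic loss is strictly lower.
wins = sum(1 for _, _, hyp, euc in MATCHED if hyp < euc)

# Relative reduction in loss per row; negative means hyperbolic is worse.
rel = [(euc - hyp) / euc for _, _, hyp, euc in MATCHED]
mean_rel = sum(rel) / len(rel)

print(f"hyperbolic better in {wins} of {len(MATCHED)} matched rows")
print(f"mean relative reduction: {100 * mean_rel:.1f}%")
print(f"range: {100 * min(rel):.1f}% to {100 * max(rel):.1f}%")

# Average relative reduction per token budget, smallest budget first.
by_budget = defaultdict(list)
for _, n, hyp, euc in MATCHED:
    by_budget[n].append((euc - hyp) / euc)
for n in sorted(by_budget, reverse=True):
    values = by_budget[n]
    print(f"t1_{n}: {100 * sum(values) / len(values):.1f}% over {len(values)} rows")
```

Each budget has only a handful of rows, so the per-budget averages are sensitive to single runs.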

	## Repository Structure

	```
	checkpoints/
	  qwen3/                            # Euclidean models
	    qwen3_p{SIZE}_t1_{N}_.../
	      attempt{K}_{DATE}/
	        checkpoint-{STEP}/          # Final checkpoint
	  qwen3_hyp/                        # Hyperbolic models
	    qwen3_hyp_p{SIZE}_t1_{N}_.../
	      attempt{K}_{DATE}/
	        checkpoint-{STEP}/

	results/                            # Training logs (trainer_state.json)
	  scaling_law/owt/
	    qwen3/owt_scaling_v3/
	    qwen3_hyp/owt_scaling_v3/
	```
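
A small helper along the following lines could be used to work with this layout. It assumes that run directory names start with the `qwen3_p{SIZE}_t1_{N}_` or `qwen3_hyp_p{SIZE}_t1_{N}_` pattern shown above, that `{SIZE}` looks like `407m`, and that `trainer_state.json` has the usual Hugging Face Trainer layout with a `log_history` list. The rest of the run name is elided in this README and is not guessed here; the `_example` suffix and the function names are hypothetical.

```python
import json
import re
from pathlib import Path

# Matches the leading part of a run directory name; the remainder is ignored.
RUN_NAME = re.compile(r"^(qwen3(?:_hyp)?)_p(\d+)m_t1_(\d+)(?:_|$)")

def parse_run_name(name):
    """Split a run directory name into (family, size in millions, budget N)."""
    match = RUN_NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognized run name: {name}")
    family, size_m, budget = match.groups()
    return family, int(size_m), int(budget)

def final_eval_loss(trainer_state_path):
    """Return the last eval_loss logged in a trainer_state.json, or None."""
    state = json.loads(Path(trainer_state_path).read_text())
    losses = [
        entry["eval_loss"]
        for entry in state.get("log_history", [])
        if "eval_loss" in entry
    ]
    return losses[-1] if losses else None

# The "_example" suffix stands in for the part of the name elided above.
print(parse_run_name("qwen3_hyp_p407m_t1_8_example"))
```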

	## Experiment Configuration

	| Parameter | Value |
	|---|---|
	| Architecture | Qwen3 (custom sizes) |
	| Vocab size | 151,936 |
	| Context length | 1,024 |
	| Dataset | OpenWebText (8.39B tokens) |
	| Optimizer | NanochatMuon (Muon + per-group AdamW) |
	| Muon targets | 2D transformer matrices |
	| AdamW groups | embed (lr=0.2\*scale), lm_head (lr=0.004\*scale), misc (lr=0.004\*scale) |
	| LR scaling | (d_model/768)^(-0.5) |
	| Precision | bf16 mixed precision |
	| Infrastructure | 4x NVIDIA H100 80GB, DeepSpeed ZeRO-2 |
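
The LR scaling row can be made concrete with a short sketch. It reads each AdamW learning rate as a base value multiplied by `scale = (d_model/768)^(-0.5)`. The function names and the example widths are hypothetical and are not the project's actual model sizes.

```python
def lr_scale(d_model, base_width=768):
    """Width-dependent LR multiplier: (d_model / 768) ** -0.5."""
    return (d_model / base_width) ** -0.5

def adamw_group_lrs(d_model):
    """Per-group AdamW learning rates, each a base value times the scale."""
    scale = lr_scale(d_model)
    return {
        "embed": 0.2 * scale,
        "lm_head": 0.004 * scale,
        "misc": 0.004 * scale,
    }

# Illustrative widths only.
for d_model in (384, 768, 1536, 3072):
    print(d_model, adamw_group_lrs(d_model))
```

At `d_model = 768` the scale is 1, so the base values apply unchanged. Quadrupling the width halves every AdamW learning rate.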

	## Citation

	```bibtex
	@misc{hyperscale2026,
	title={HyperScale: Scaling Laws for Hyperbolic Output Layers in Language Models},
	author={Jinrui Lin},
	year={2026},
	url={https://github.com/ObliviateRickLin/HyperScale}
	}
	```
