datasysdev committed · Commit 1dd3829 · verified · 1 Parent(s): 9d8f2ef

Upload README.md with huggingface_hub

---
license: apache-2.0
tags:
- hyperbolic
- lorentz
- geometric-deep-learning
- language-model
- pretraining
datasets:
- wikimedia/wikipedia
language:
- en
base_model:
- Graph-and-Geometric-Learning/helm
pipeline_tag: text-generation
---

# HELM-D 130M: Hyperbolic Efficient Language Model

A 130M-parameter language model that operates entirely on the **Lorentz manifold** (the hyperboloid model of hyperbolic space). All embeddings, attention, and optimization live in hyperbolic space — the model is geometrically native, not a Euclidean model with post-hoc hyperbolic modifications.

Pretrained on an NVIDIA H200 at **193K tokens/sec** using Flash Attention 2, selective BF16, and torch.compile optimizations.

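To make the hyperboloid constraint concrete, here is a minimal sketch in plain PyTorch (not the model's own code) of how a Lorentz point's time coordinate follows from its spatial part, and of the Minkowski inner product that replaces the Euclidean dot product on this manifold:

```python
import torch

def lift_to_lorentz(x_space: torch.Tensor, k: float = 1.0) -> torch.Tensor:
    """Points on the hyperboloid satisfy -x0^2 + ||x_s||^2 = -1/k,
    so the time coordinate is x0 = sqrt(1/k + ||x_s||^2)."""
    x0 = torch.sqrt(1.0 / k + (x_space ** 2).sum(-1, keepdim=True))
    return torch.cat([x0, x_space], dim=-1)

def minkowski_inner(u: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Minkowski inner product <u, v>_L = -u0*v0 + sum_i ui*vi."""
    return -u[..., 0] * v[..., 0] + (u[..., 1:] * v[..., 1:]).sum(-1)

# Lift 4 random 384-dim spatial vectors (matching the model's width);
# every lifted point satisfies <x, x>_L = -1/K with K = 1.
x = lift_to_lorentz(torch.randn(4, 384, dtype=torch.float64))
assert torch.allclose(minkowski_inner(x, x),
                      torch.full((4,), -1.0, dtype=torch.float64))
```

The same two operations underlie the attention and tokenizer-surgery tricks described below; only the spatial part carries free parameters, the time coordinate is always derived.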
## Architecture

| Parameter | Value |
|---|---|
| Architecture | L6W384A6 (6 layers, width 384, 6 heads) |
| Parameters | 130M |
| Manifold | Lorentz (hyperboloid, curvature K=1) |
| Tokenizer | Qwen3-30B-A3B (151,669 vocab) |
| Context length | 2048 |
| Attention | Flash Attention 2 (spatial-only with time reconstruction) |
| Optimizer | RiemannianAdam (geoopt) |

## Training

Pretrained on 100K English Wikipedia articles + 100K Python source files (~221M unique tokens, ~4 epochs). This is a **proof-of-concept checkpoint** — it validates the hyperbolic training pipeline but does not produce coherent text, owing to the small dataset.

### Performance (H200)

| Configuration | ms/step | tok/s | Speedup |
|---|---|---|---|
| Original FP32 | 5,966 | 43,917 | 1.0× |
| + BF16 logits | 3,601 | 72,770 | 1.7× |
| + FA2 (width=384) | 1,875 | 140,025 | 3.2× |
| **+ torch.compile + python -O** | **1,357** | **193,000** | **4.4×** |

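As a sanity check on the table: the tok/s column is consistent with a fixed token count per step. A global batch of 128 sequences × 2048 tokens (an inference from the numbers, not stated in the card) reproduces every reported figure to within ~0.3%:

```python
# Hypothetical batch geometry: 128 sequences x 2048 context = 262,144
# tokens per optimizer step. Inferred from the table, not stated in the card.
tokens_per_step = 128 * 2048

for ms_per_step, reported_tok_s in [(5966, 43917), (3601, 72770),
                                    (1875, 140025), (1357, 193000)]:
    implied = tokens_per_step / (ms_per_step / 1000.0)
    # Each implied throughput lands within ~0.3% of the reported figure.
    assert abs(implied - reported_tok_s) / reported_tok_s < 0.003
```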
### Training Curve

Loss stabilized around 6.5–7.0 after exhausting the 221M-token dataset (4+ epochs).

## Checkpoints

| File | Step | Description |
|---|---|---|
| `h200_step2400.pt` | 2400 | End of first torch.compile run (stable, loss ~7.0) |
| `h200_step4100.pt` | 4100 | Final checkpoint with all optimizations (-O flag, geoopt patch) |

Each checkpoint contains:
- `model_state_dict`: Full model weights (FP32, Lorentz manifold)
- `optimizer_state_dict`: RiemannianAdam state
- `global_step`: Training step counter

### Loading

```python
import torch
from helm.hypercore.manifolds import Lorentz
from helm.modules.helm_d import LTransformerDecoder

# Rebuild the architecture exactly as trained: curvature 1.0 on all
# three manifolds, 6 layers x width 384 x 6 heads, Qwen3 vocab.
model = LTransformerDecoder(
    manifold_in=Lorentz(1.0),
    manifold_hidden=Lorentz(1.0),
    manifold_out=Lorentz(1.0),
    arch="L6W384A6",
    vocab_size=151669,
    context_length=2048,
)

# weights_only=False because the checkpoint also stores optimizer state.
ckpt = torch.load("h200_step4100.pt", map_location="cpu", weights_only=False)
model.load_state_dict(ckpt["model_state_dict"], strict=False)
```

## Tokenizer Surgery

The original HELM uses the Llama-3.1 tokenizer (128K vocab). We transferred embeddings to the Qwen3-30B-A3B tokenizer (151K vocab) using the **Lorentzian Fréchet mean** — computing the geometric centroid on the hyperboloid for each novel token by decomposing it into Llama sub-tokens and projecting via the Einstein midpoint.

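The midpoint step can be sketched in a few lines (assuming the simple average-then-rescale form of the Lorentz midpoint; the repo's actual implementation may differ): average the sub-token embeddings in ambient coordinates, then rescale the result back onto the hyperboloid.

```python
import torch

def minkowski_inner(u: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Minkowski inner product <u, v>_L = -u0*v0 + sum_i ui*vi."""
    return -u[..., 0] * v[..., 0] + (u[..., 1:] * v[..., 1:]).sum(-1)

def lorentz_midpoint(points: torch.Tensor, k: float = 1.0) -> torch.Tensor:
    """Centroid of Lorentz points (rows of `points`): arithmetic mean,
    then rescale so <m, m>_L = -1/k, i.e. back onto the hyperboloid."""
    m = points.mean(dim=0)
    scale = torch.sqrt(k * torch.clamp(-minkowski_inner(m, m), min=1e-12))
    return m / scale

# Demo: centroid of 3 hypothetical sub-token embeddings (K = 1).
pts = torch.randn(3, 384, dtype=torch.float64)
x0 = torch.sqrt(1.0 + (pts ** 2).sum(-1, keepdim=True))
subtokens = torch.cat([x0, pts], dim=-1)   # 3 points on the manifold
mid = lorentz_midpoint(subtokens)
assert torch.allclose(minkowski_inner(mid, mid),
                      torch.tensor(-1.0, dtype=torch.float64))
```

Unlike an iterative Karcher mean, this closed form needs no optimization loop, which is presumably why it was chosen for the one-off vocabulary transfer.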
## Key Optimizations

- **Flash Attention 2**: Runs on spatial dimensions only (strips the Lorentz time coordinate), then reconstructs it via manifold projection after attention.
- **Selective BF16**: Only the output projection (Euclidean) uses BF16; all Lorentz operations remain FP32.
- **python -O**: Strips the manifold code's 30+ NaN-checking `assert` statements, eliminating GPU→CPU synchronization stalls.
- **geoopt patch**: `torch.norm(p=2)` → `torch.linalg.vector_norm(ord=2)` for torch.compile compatibility.
- **Width 384**: Aligned to 64-wide Tensor Core tiles (the original width was 390).

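The attention trick in the first bullet can be sketched as follows, using PyTorch's SDPA as a stand-in for the Flash Attention 2 kernel (the function name and shapes are illustrative, not from the repo):

```python
import torch
import torch.nn.functional as F

def spatial_attention_relift(q, k, v, curv: float = 1.0):
    """Attend over the spatial Lorentz coordinates only, then rebuild
    the time coordinate so outputs land back on the hyperboloid."""
    # q, k, v: (batch, heads, seq, 1 + spatial_dim) Lorentz points.
    q_s, k_s, v_s = q[..., 1:], k[..., 1:], v[..., 1:]      # strip time coord
    out_s = F.scaled_dot_product_attention(q_s, k_s, v_s)   # Euclidean scores
    # Re-lift: x0 = sqrt(1/K + ||out_s||^2) projects back onto the manifold.
    x0 = torch.sqrt(1.0 / curv + (out_s ** 2).sum(-1, keepdim=True))
    return torch.cat([x0, out_s], dim=-1)

q = torch.randn(1, 6, 16, 65)   # 6 heads, seq 16, 64 spatial dims + time
out = spatial_attention_relift(q, q, q)
assert out.shape == q.shape
```

This is exactly the compromise noted further down: the kernel scores with Euclidean dot products, dropping the −q₀k₀ term of the Minkowski product, and geometry is restored only after the softmax-weighted sum.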
## Intended Use

This checkpoint serves as a **seed for Network Morphism** — upscaling to 1B+ parameters by zero-padding Lorentz spatial dimensions and cloning transformer layers. The learned manifold geometry, token distributions, and attention patterns transfer to the larger model.

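The zero-padding step is geometry-preserving, which a short sketch makes clear (`widen_spatial` is a hypothetical helper, not a function from the repo; it assumes embeddings store the full Lorentz point with the time coordinate first). Padding the spatial part with zeros leaves ‖x_s‖ unchanged, so the existing time coordinate still satisfies the hyperboloid constraint:

```python
import torch

def widen_spatial(emb: torch.Tensor, new_spatial_dim: int) -> torch.Tensor:
    """Zero-pad the spatial part of Lorentz embeddings (time coordinate
    at index 0). ||x_s|| is unchanged, so points stay on the manifold."""
    vocab, dim = emb.shape
    pad = torch.zeros(vocab, new_spatial_dim - (dim - 1), dtype=emb.dtype)
    return torch.cat([emb, pad], dim=-1)

# Demo with a hypothetical 384 -> 1024 spatial widening.
x_s = torch.randn(5, 384, dtype=torch.float64)
x0 = torch.sqrt(1.0 + (x_s ** 2).sum(-1, keepdim=True))
emb = torch.cat([x0, x_s], dim=-1)
wide = widen_spatial(emb, 1024)
mink = -wide[..., 0] ** 2 + (wide[..., 1:] ** 2).sum(-1)
assert torch.allclose(mink, torch.full((5,), -1.0, dtype=torch.float64))
```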
## Geometric Compromises

- FA2 computes Euclidean dot products instead of Minkowski inner products (drops the −q₀k₀ term)
- Periodic re-projection of embeddings onto the manifold every 100 steps
- Einstein midpoint used instead of the iterative Karcher mean for tokenizer surgery

## Citation

Based on:
```bibtex
@article{helm2024,
  title={Hyperbolic Efficient Language Models},
  author={Graph and Geometric Learning Lab},
  year={2024},
  url={https://github.com/Graph-and-Geometric-Learning/helm}
}
```

## Source Code

[unixsysdev/helm (h200-optimizations branch)](https://github.com/unixsysdev/helm/tree/h200-optimizations)