datasysdev committed on
Commit 5f5d723 · verified · 1 Parent(s): eaa45c8

docs: update model card for 200M CoT training run

Files changed (1)
  1. README.md +174 -88
README.md CHANGED
@@ -1,136 +1,222 @@
  ---
- license: apache-2.0
  tags:
  - hyperbolic
  - lorentz
  - geometric-deep-learning
  - language-model
- - pretraining
- datasets:
- - wikimedia/wikipedia
- language:
- - en
- base_model:
- - Graph-and-Geometric-Learning/helm
  pipeline_tag: text-generation
  ---

- # HELM-D 130M: Hyperbolic Efficient Language Model

- A 130M parameter language model that operates entirely on the **Lorentz manifold** (hyperboloid model of hyperbolic space). All embeddings, attention, and optimization live in hyperbolic space — the model is geometrically native, not a Euclidean model with post-hoc hyperbolic modifications.

- Pretrained on an NVIDIA H200 at **193K tokens/sec** using Flash Attention 2, selective BF16, and torch.compile optimizations.

- ## Architecture

  | Parameter | Value |
  |---|---|
- | Architecture | L6W384A6 (6 layers, width 384, 6 heads) |
- | Parameters | 130M |
- | Manifold | Lorentz (hyperboloid, curvature K=1) |
- | Tokenizer | Qwen3-30B-A3B (151,669 vocab) |
- | Context length | 2048 |
- | Attention | Flash Attention 2 (spatial-only with time reconstruction) |
- | Optimizer | RiemannianAdam (geoopt) |

- ## Training

- Pretrained on 100K English Wikipedia articles + 100K Python source files (~221M unique tokens, ~4 epochs). This is a **proof-of-concept checkpoint** — it validates the hyperbolic training pipeline but does not produce coherent text generation due to the small dataset size.

- ### Performance (H200)

- | Configuration | ms/step | tok/s | Speedup |
- |---|---|---|---|
- | Original FP32 | 5,966 | 43,917 | 1.0× |
- | + BF16 logits | 3,601 | 72,770 | 1.7× |
- | + FA2 (width=384) | 1,875 | 140,025 | 3.2× |
- | **+ torch.compile + python -O** | **1,357** | **193,000** | **4.4×** |

- ### Training Curve

- Loss stabilized around 6.5-7.0 after exhausting the 221M-token dataset (4+ epochs).

- ## Checkpoints

- | File | Step | Description |
  |---|---|---|
- | `h200_step2400.pt` | 2400 | End of first torch.compile run (stable, loss ~7.0) |
- | `h200_step4100.pt` | 4100 | Final checkpoint with all optimizations (-O flag, geoopt patch) |

- Each checkpoint contains:
- - `model_state_dict`: Full model weights (FP32, Lorentz manifold)
- - `optimizer_state_dict`: RiemannianAdam state
- - `global_step`: Training step counter

- ### Loading

  ```python
- import torch
- from helm.hypercore.manifolds import Lorentz
- from helm.modules.helm_d import LTransformerDecoder
-
- model = LTransformerDecoder(
-     manifold_in=Lorentz(1.0),
-     manifold_hidden=Lorentz(1.0),
-     manifold_out=Lorentz(1.0),
-     arch="L6W384A6",
-     vocab_size=151669,
-     context_length=2048,
- )
- ckpt = torch.load("h200_step4100.pt", map_location="cpu", weights_only=False)
- model.load_state_dict(ckpt["model_state_dict"], strict=False)
  ```

- ## Tokenizer Surgery

- The original HELM uses the Llama-3.1 tokenizer (128K vocab). We transferred embeddings to the Qwen3-30B-A3B tokenizer (151K vocab) using the **Lorentzian Fréchet Mean** — computing the geometric centroid on the hyperboloid for novel tokens by decomposing them into Llama sub-tokens and projecting via the Einstein midpoint.

- ## Key Optimizations

- - **Flash Attention 2**: Runs on spatial dimensions only (strips the Lorentz time coordinate), reconstructs via manifold projection after attention.
- - **Selective BF16**: Only the output projection (Euclidean) uses BF16. All Lorentz operations remain FP32.
- - **python -O**: Strips 30+ `assert torch.isnan()` checks from the manifold code, eliminating GPU→CPU synchronization stalls.
- - **geoopt patch**: `torch.norm(p=2)` → `torch.linalg.vector_norm(ord=2)` for torch.compile compatibility.
- - **Width 384**: Aligned to 64-wide Tensor Core tiles (original was 390).

- ## Intended Use

- This checkpoint serves as a **seed for Network Morphism** — upscaling to 1B+ parameters by zero-padding Lorentz spatial dimensions and cloning transformer layers. The learned manifold geometry, token distributions, and attention patterns transfer to the larger model.

- ## Geometric Compromises

- - FA2 computes Euclidean dot products instead of Minkowski inner products (drops the -q₀k₀ term)
- - Periodic re-projection of embeddings onto the manifold every 100 steps
- - Einstein midpoint used instead of the iterative Karcher mean for tokenizer surgery
- ## Citation

- Based on:
- ```bibtex
- @article{helm2024,
-   title={Hyperbolic Efficient Language Models},
-   author={Graph and Geometric Learning Lab},
-   year={2024},
-   url={https://github.com/Graph-and-Geometric-Learning/helm}
- }
- ```

- ## Source Code

- [unixsysdev/helm (h200-optimizations branch)](https://github.com/unixsysdev/helm/tree/h200-optimizations)
 
 
- ## Roadmap: 1.37B Pretraining

- The 130M checkpoints in this repo are seeds for the **1.37B HELM-D** model (L24W1536A24), upscaled via Network Morphism:

- 1. **Width 384→1536**: Zero-pad Lorentz spatial dims (manifold constraint preserved exactly)
- 2. **Depth 6→24 layers**: Interleaved cloning — repeats the full 6-layer pipeline 4× with residual scaling
- 3. **All linear weights**: Top-left corner placement in expanded matrices, remainder N(0, 0.001)

- The 1.37B model is currently training on **2B tokens from FineWeb-Edu** on a single NVIDIA H200.

- ### Next Steps

- - **KL divergence distillation** from Qwen3-30B using Nebius SWE-agent trajectories (80K agentic tool-use sequences)
- - **Context extension** to 128K via NTK-RoPE scaling
- - **Fine-tuning** on agentic coding trajectories for downstream tool-use tasks
  ---
+ language:
+ - en
  tags:
  - hyperbolic
  - lorentz
  - geometric-deep-learning
  - language-model
+ - chain-of-thought
+ - reasoning
  pipeline_tag: text-generation
+ license: mit
+ datasets:
+ - open-thoughts/OpenThoughts-114k
+ - HuggingFaceTB/smollm-corpus
  ---

+ # HELM-D: Hyperbolic Chain-of-Thought Reasoning Engine
+
+ > Fork of [Graph-and-Geometric-Learning/helm](https://github.com/Graph-and-Geometric-Learning/helm) — a **200M parameter** fully hyperbolic transformer trained on NVIDIA H200 for structured reasoning.
+ >
+ > **Checkpoints**: [datasysdev/helm-d-130m-hyperbolic](https://huggingface.co/datasysdev/helm-d-130m-hyperbolic) on HuggingFace
+
+ All computations live on the [Lorentz manifold](https://en.wikipedia.org/wiki/Hyperboloid_model): $-x_0^2 + x_1^2 + \dots + x_d^2 = -1$. The model uses hyperbolic embeddings, Lorentzian attention, and Riemannian optimization — making it natively suited for hierarchical data like code ASTs, dependency trees, and chain-of-thought reasoning traces.
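For intuition, the time coordinate is fully determined by the spatial coordinates. A minimal plain-Python sketch (editorial illustration, not the repo's code; the function names are invented here) that lifts a spatial vector onto the hyperboloid and checks the Minkowski constraint:

```python
import math

def lift_to_hyperboloid(spatial):
    """Given spatial coords x_1..x_d, compute the time coord x_0
    so that -x_0^2 + sum(x_i^2) = -1 (curvature K = 1)."""
    x0 = math.sqrt(1.0 + sum(v * v for v in spatial))
    return [x0] + list(spatial)

def minkowski_norm_sq(x):
    """Lorentzian inner product of x with itself: -x_0^2 + sum x_i^2."""
    return -x[0] ** 2 + sum(v * v for v in x[1:])

point = lift_to_hyperboloid([0.3, -1.2, 0.5])
assert abs(minkowski_norm_sq(point) - (-1.0)) < 1e-12
```

Every valid point therefore satisfies the constraint exactly by construction, which is what the "verified at 1.0000±0.0000" line in the table below measures in practice.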

+ ---

+ ## Current Training Run

+ Training a **200M parameter** HELM-D from scratch on a multi-domain reasoning corpus:

  | Parameter | Value |
  |---|---|
+ | Architecture | `L16W768A12` (16 layers, 768 width, 12 heads) |
+ | Parameters | **200M** (175.8M Euclidean + 24.6M Hyperbolic) |
+ | Tokenizer | TinyLlama 32K (dense coverage, no dead tokens) |
+ | Context | 4096 tokens (full CoT traces fit in one pass) |
+ | Throughput | **130K tok/s** on a single H200 |
+ | Optimizer | Dual-group RiemannianAdam (see below) |
+ | Learning Rate | 3e-4, cosine decay with 500-step warmup |
+ | Gradient Clip | 0.5 |
+ | Manifold | Lorentz $-x_0^2 + \|x\|^2 = -1$, verified at 1.0000±0.0000 |
+
+ ### Training Data (60/20/20 Mix)
+
+ | Domain | Weight | Source | Purpose |
+ |---|---|---|---|
+ | CoT Reasoning | 60% | [OpenThoughts-114k](https://huggingface.co/datasets/open-thoughts/OpenThoughts-114k) | Math, code, science reasoning with `<think>` traces |
+ | Python Code | 20% | [SmolLM-Corpus python-edu](https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus) | Educational Python |
+ | Text | 20% | [SmolLM-Corpus cosmopedia-v2](https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus) | General knowledge |

+ Streamed via `interleave_datasets` with a **512-chunk shuffle buffer** to prevent domain clustering (see Key Changes below).

+ ---

+ ## Key Changes from Upstream HELM

+ ### 1. Tokenizer: Llama-3.1 → TinyLlama 32K

+ The original HELM uses the Llama-3.1 tokenizer (128K vocab). We switched to **TinyLlama's 32K tokenizer** for the CoT training run:

+ - **Dense coverage**: no dead tokens; every token gets trained
+ - **Smaller embedding matrix**: 32K × 768 vs 128K × 768 — significant VRAM savings
+ - **Better for small models**: 200M params can't support a 128K vocab efficiently
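The VRAM claim is easy to sanity-check. A back-of-the-envelope calculation (editorial sketch; assumes FP32 weights, the stated width of 768, and the rounded 32K/128K vocab figures; `embedding_params` is a name invented here):

```python
def embedding_params(vocab_size, width):
    """Parameter count of a vocab_size x width embedding matrix."""
    return vocab_size * width

small = embedding_params(32_000, 768)   # TinyLlama-class vocab
large = embedding_params(128_000, 768)  # Llama-3.1-class vocab (~128K)

print(f"32K vocab:  {small / 1e6:.1f}M params, {small * 4 / 2**20:.1f} MiB in FP32")
print(f"128K vocab: {large / 1e6:.1f}M params, {large * 4 / 2**20:.1f} MiB in FP32")
```

The 32K figure (~24.6M parameters) also lines up with the "24.6M Hyperbolic" entry in the table above, consistent with the token embeddings being the parameters that live on the manifold.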

+ ### 2. Architecture: L6W384A6 → L16W768A12

+ Scaled up from the original 31M parameter toy model to a **200M parameter** engine:
+
+ | | Original | Ours |
  |---|---|---|
+ | Layers | 6 | **16** |
+ | Width | 390 | **768** |
+ | Heads | 6 | **12** |
+ | Head dim | 65 | **64** (Tensor Core aligned) |
+ | Parameters | 31M | **200M** |
+
+ ### 3. Dual-Group Optimizer (Matching Original Authors)

+ The original HELM repo uses **two separate optimizers**: AdamW for Euclidean params and RiemannianAdam for hyperbolic params, with `weight_decay=0.0` on manifold parameters.

+ We implement this as a single RiemannianAdam with dual parameter groups:

  ```python
+ optimizer = RiemannianAdam([
+     {"params": euclidean_params, "weight_decay": 0.01},  # 175.8M params
+     {"params": hyperbolic_params, "weight_decay": 0.0},  # 24.6M params
+ ], lr=3e-4)
  ```

+ **Why**: Standard L2 weight decay pulls parameters toward the Euclidean origin `[0,0,...,0]`, which is **not on the Lorentz manifold**. Applying decay to manifold parameters causes the optimizer to constantly drag embeddings off the $-1$ surface, then the `expmap` projection violently snaps them back — destabilizing training.
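The effect is easy to demonstrate numerically: uniformly shrinking a hyperboloid point toward the origin breaks the constraint, and re-projection restores it. A plain-Python sketch (editorial illustration only; function names invented here):

```python
import math

def minkowski_norm_sq(x):
    """Lorentzian self inner product: -x_0^2 + sum x_i^2 (equals -1 on the manifold)."""
    return -x[0] ** 2 + sum(v * v for v in x[1:])

def reproject(x):
    """Recompute the time coord from the spatial coords to restore the constraint."""
    spatial = list(x[1:])
    x0 = math.sqrt(1.0 + sum(v * v for v in spatial))
    return [x0] + spatial

# A valid point on the hyperboloid (time coord derived from the spatial part).
p = reproject([0.0, 0.6, -0.8])
assert abs(minkowski_norm_sq(p) + 1.0) < 1e-12

# Naive L2 weight decay: shrink every coordinate toward the origin by 1%.
decayed = [0.99 * v for v in p]
print(minkowski_norm_sq(decayed))  # scales by 0.99^2, i.e. -0.9801: off the manifold

# Re-projection snaps the point back onto the -1 surface.
assert abs(minkowski_norm_sq(reproject(decayed)) + 1.0) < 1e-12
```

Because scaling every coordinate by $c$ scales the Minkowski form by $c^2$, decay moves the point to the $-c^2$ surface each step, which is exactly the drift described above.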

+ ### 4. Shuffle Buffer Dataloader

+ The streaming `interleave_datasets` pipeline interleaves at the **document** level. Since OpenThoughts reasoning traces can be 4,000-16,000 tokens (1-4 consecutive 4096-token chunks), the model receives bursts of pure math followed by bursts of pure code — causing catastrophic loss spikes.

+ **Fix**: A 512-chunk shuffle buffer accumulates tokenized chunks before yielding, ensuring every batch is a representative mix of all 3 domains:

+ ```
+ Documents → Tokenize → Pack into 4096-token chunks → Buffer (512) → Shuffle → Yield to GPU
+ ```

+ This eliminated gradient-norm spikes of 46+ and stabilized the loss descent.
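A minimal version of such a buffer can be sketched as follows (editorial illustration; the actual `train_cot.py` implementation may differ, and `shuffle_buffer` is a name invented here):

```python
import random

def shuffle_buffer(chunks, buffer_size=512, seed=0):
    """Accumulate incoming chunks; once the buffer is full, yield a randomly
    chosen element for each new arrival, so clustered domains get mixed."""
    rng = random.Random(seed)
    buf = []
    for chunk in chunks:
        buf.append(chunk)
        if len(buf) >= buffer_size:
            idx = rng.randrange(len(buf))
            buf[idx], buf[-1] = buf[-1], buf[idx]  # move random pick to the end
            yield buf.pop()
    rng.shuffle(buf)  # drain the remainder once the stream ends
    yield from buf

# Domain-clustered input: a burst of math chunks, then a burst of code chunks.
stream = ["math"] * 20 + ["code"] * 20
mixed = list(shuffle_buffer(stream, buffer_size=8))
assert sorted(mixed) == sorted(stream)  # nothing lost, only reordered
```

With a buffer much larger than a single document's chunk run (512 vs 1-4 chunks here), consecutive batches drawn from the output are a mix of domains rather than bursts.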

+ ### 5. TF32 Tensor Core Acceleration

+ ```python
+ torch.backends.cuda.matmul.allow_tf32 = True
+ torch.backends.cudnn.allow_tf32 = True
+ torch.set_float32_matmul_precision("high")
+ ```

+ Throughput: **40K → 130K tok/s** (3.25× speedup). All upstream Lorentz operations remain in FP32 — only matmul operations use TF32's 10-bit mantissa through the Tensor Cores.

+ ### 6. LR Override on Checkpoint Resume
+
+ PyTorch's `optimizer.load_state_dict()` restores the learning rate from the checkpoint, silently overriding CLI arguments. We force the LR after restore:
+
+ ```python
+ for pg in optimizer.param_groups:
+     pg["lr"] = args.lr
+     pg["initial_lr"] = args.lr
+ ```
+
+ ---
+
+ ## Quick Start
+
+ ### Requirements
+
+ ```bash
+ pip install torch
+ pip install flash-attn --no-build-isolation
+ pip install geoopt transformers datasets
  ```

+ ### Training on H200

+ ```bash
+ export PYTHONPATH=/path/to/helm-src:$PYTHONPATH
+ export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

+ # Fresh training
+ python3 -O train_cot.py \
+     --batch_size 16 --grad_accum 8 \
+     --lr 3e-4 --seq_len 4096 \
+     --save_dir /tmp/checkpoints/cot \
+     --log_every 1

+ # Resume from checkpoint
+ python3 -O train_cot.py \
+     --batch_size 16 --grad_accum 8 \
+     --lr 3e-4 --save_dir /tmp/checkpoints/cot \
+     --log_every 1 --resume
+ ```
+
+ ### Generation Test
+
+ ```bash
+ python3 test_gen.py --checkpoint /tmp/checkpoints/cot/cot_step5000.pt
+ ```
+
+ ---
+
+ ## Architecture Decisions
+
+ ### Gradient Clipping: 1.0 → 0.5

+ The original authors use `grad_clip=1.0` on a 6-layer model. At 16 layers, gradient variance compounds across 10 additional layers, so we halve the clip threshold — a heuristic for the deeper model, not an exact equivalence.
+
+ ### LR Scaling: 4e-4 → 3e-4
+
+ The original authors use `lr=4e-4` on a 31M model. As parameter count and depth grow, optimal learning rates tend to decrease; 3e-4 is a reasonable choice at 200M parameters.
+
+ ### Flash Attention 2
+
+ FA2 computes Euclidean dot products, but hyperbolic attention requires the Minkowski inner product $\langle x, y \rangle_{\mathcal{L}} = -x_0 y_0 + \sum x_i y_i$. We run FA2 on **spatial dimensions only** (stripping the time coordinate), then reconstruct via manifold projection: $x_0 = \sqrt{\|x_{1:d}\|^2 + 1}$.
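The spatial-only trick can be sketched with a toy single-head attention in plain Python (editorial illustration with tiny dimensions, not the FA2 kernel; names invented here):

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(v - m) for v in xs]
    s = sum(es)
    return [e / s for e in es]

def spatial_attention(q, k, v):
    """Single-head attention over *spatial* coordinates only (the Lorentz time
    coordinate x_0 is assumed stripped beforehand); the output's time coordinate
    is then reconstructed so each result lies back on the hyperboloid."""
    d = len(q[0])
    out = []
    for qi in q:
        scores = softmax([sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k])
        mixed = [sum(w * vj[t] for w, vj in zip(scores, v)) for t in range(d)]
        x0 = math.sqrt(1.0 + sum(c * c for c in mixed))  # time reconstruction
        out.append([x0] + mixed)
    return out

q = [[0.1, 0.2], [0.3, -0.1]]
k = [[0.2, 0.0], [-0.1, 0.4]]
v = [[1.0, 0.0], [0.0, 1.0]]
for row in spatial_attention(q, k, v):
    # each output satisfies -x_0^2 + ||x_spatial||^2 = -1 by construction
    assert abs(-row[0] ** 2 + sum(c * c for c in row[1:]) + 1.0) < 1e-12
```

Note the compromise the section describes: the scores use the Euclidean dot product over spatial parts, dropping the $-x_0 y_0$ term of the true Minkowski product; only the output is projected back onto the manifold.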
+
+ ### Periodic Re-projection
+
+ Embeddings are snapped back to $-x_0^2 + \|x\|^2 = -1$ every 100 steps to correct constraint drift from mixed-precision gradient updates.
+
+ ---

+ ## Files
+
+ | File | Description |
+ |---|---|
+ | `train_cot.py` | **Main training script** — 200M HELM-D with streaming 60/20/20 mix, shuffle buffer, dual optimizer |
+ | `test_gen.py` | Temperature sweep generation test with repetition penalty grid |
+ | `train_h200.py` | H200 pretraining with FA2, BF16, torch.compile (130M seed model) |
+ | `train_h200_130m.py` | 130M config (L6W384A6) for seed training |
+ | `tokenizer_surgery.py` | Llama→Qwen3 embedding transfer via Lorentzian Fréchet Mean |
+ | `upscale_130m_to_1b.py` | Network Morphism: 130M→1.37B (Lorentz zero-pad + layer cloning) |
+ | `setup_h200.sh` | H200 environment setup (CUDA, PyTorch, Flash Attention) |
+ | `helm/modules/helm_d.py` | HELM-D decoder with RoPE odd-dim fix, BF16 output projection |
+ | `helm/hypercore/` | Lorentz manifold operations, Riemannian optimizers |
+
+ ---
+
+ ## Known Issues
+
+ - **torch.compile modes**: `max-autotune` and `reduce-overhead` crash with CUDAGraphs in LorentzEmbeddings. Only the default mode works.
+ - **geoopt + torch.compile**: Requires patching `torch.norm` → `torch.linalg.vector_norm` in geoopt's `lorentz/math.py`.
+ - **Tokenizer max length warnings**: The TinyLlama tokenizer reports `max_length=2048` but we use a 4096 seq_len — harmless, since we handle truncation ourselves.
+
+ ---
+
+ ## Citation
+
+ Based on:
+ ```bibtex
+ @article{he2025helm,
+   title={HELM: Hyperbolic Large Language Models via Mixture-of-Curvature Experts},
+   author={He, Neil and Anand, Rishabh and Madhu, Hiren and Maatouk, Ali and Krishnaswamy, Smita and Tassiulas, Leandros and Yang, Menglin and Ying, Rex},
+   journal={arXiv preprint arXiv:2505.24722},
+   year={2025},
+ }
+ ```

+ ## License

+ MIT; see [LICENSE](LICENSE).