---
language:
- en
tags:
- hyperbolic
- lorentz
- geometric-deep-learning
- language-model
- chain-of-thought
- reasoning
pipeline_tag: text-generation
license: mit
datasets:
- open-thoughts/OpenThoughts-114k
- HuggingFaceTB/smollm-corpus
---

# HELM-D: Hyperbolic Chain-of-Thought Reasoning Engine
Fork of Graph-and-Geometric-Learning/helm: a 200M parameter fully hyperbolic transformer trained on an NVIDIA H200 for structured reasoning.

**Checkpoints:** `datasysdev/helm-d-130m-hyperbolic` on HuggingFace

All computations live on the Lorentz manifold: $-x_0^2 + x_1^2 + \dots + x_d^2 = -1$. The model uses hyperbolic embeddings, Lorentzian attention, and Riemannian optimization, making it natively suited to hierarchical data such as code ASTs, dependency trees, and chain-of-thought reasoning traces.
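As a concrete illustration of the constraint (a minimal pure-Python sketch, not code from this repo): lifting a spatial vector onto the hyperboloid by reconstructing the time coordinate guarantees that the Minkowski self-inner product is exactly $-1$.

```python
import math

def lift_to_lorentz(spatial):
    # Choose x0 = sqrt(|x_{1:d}|^2 + 1) so that -x0^2 + |x_{1:d}|^2 = -1.
    x0 = math.sqrt(sum(v * v for v in spatial) + 1.0)
    return [x0] + list(spatial)

def lorentz_inner(x, y):
    # Minkowski inner product: -x0*y0 + sum_i xi*yi.
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))

x = lift_to_lorentz([0.3, -1.2, 0.5])
print(round(lorentz_inner(x, x), 6))  # -1.0 for any on-manifold point
```

This is the same identity behind the "verified at 1.0000 ± 0.0000" manifold check reported below.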
## Current Training Run
Training a 200M parameter HELM-D from scratch on a multi-domain reasoning corpus:
| Parameter | Value |
|---|---|
| Architecture | L16W768A12 (16 layers, 768 width, 12 heads) |
| Parameters | 200M (175.8M Euclidean + 24.6M Hyperbolic) |
| Tokenizer | TinyLlama 32K (dense coverage, no dead tokens) |
| Context | 4096 tokens (full CoT traces fit in one pass) |
| Throughput | 130K tok/s on single H200 |
| Optimizer | Dual-group RiemannianAdam (see below) |
| Learning Rate | 3e-4, cosine decay with 500-step warmup |
| Gradient Clip | 0.5 |
| Manifold | Lorentz $-x_0^2 + \|x\|^2 = -1$, verified at 1.0000 ± 0.0000 |
### Training Data (60/20/20 Mix)
| Domain | Weight | Source | Purpose |
|---|---|---|---|
| CoT Reasoning | 60% | OpenThoughts-114k | Math, code, science reasoning with `<think>` traces |
| Python Code | 20% | SmolLM-Corpus python-edu | Educational Python |
| Text | 20% | SmolLM-Corpus cosmopedia-v2 | General knowledge |
Streamed via `interleave_datasets` with a 512-chunk shuffle buffer to prevent domain clustering (see Architecture Decisions below).
## Key Changes from Upstream HELM
### 1. Tokenizer: Llama-3.1 → TinyLlama 32K
The original HELM uses the Llama-3.1 tokenizer (128K vocab). We switched to TinyLlama's 32K tokenizer for the CoT training run:
- Dense coverage: no dead tokens; every token gets trained
- Smaller embedding matrix: 32K × 768 vs 128K × 768, a significant VRAM saving
- Better for small models: 200M params can't support a 128K vocab efficiently
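A back-of-envelope comparison of the embedding matrix alone (assuming FP32 storage and nominal vocab sizes of 32,000 for TinyLlama and 128,256 for Llama-3.1; untied input/output embeddings would double both figures):

```python
def emb_mib(vocab_size, width, bytes_per_param=4):
    # Size of a vocab x width FP32 embedding matrix in MiB.
    return vocab_size * width * bytes_per_param / 2**20

tinyllama = emb_mib(32_000, 768)   # TinyLlama 32K vocab
llama31 = emb_mib(128_256, 768)    # Llama-3.1 128K vocab
print(f"{tinyllama:.0f} MiB vs {llama31:.0f} MiB")  # ~94 vs ~376
```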
### 2. Architecture: L6W384A6 → L16W768A12
Scaled up from the original 31M parameter toy model to a 200M parameter engine:
| | Original | Ours |
|---|---|---|
| Layers | 6 | 16 |
| Width | 390 | 768 |
| Heads | 6 | 12 |
| Head dim | 65 | 64 (Tensor Core aligned) |
| Parameters | 31M | 200M |
### 3. Dual-Group Optimizer (Matching Original Authors)
The original HELM repo uses two separate optimizers: `AdamW` for Euclidean params and `RiemannianAdam` for hyperbolic params, with `weight_decay=0.0` on manifold parameters.
We implement this as a single `RiemannianAdam` with dual parameter groups:

```python
optimizer = RiemannianAdam([
    {"params": euclidean_params, "weight_decay": 0.01},  # 175.8M params
    {"params": hyperbolic_params, "weight_decay": 0.0},  # 24.6M params
], lr=3e-4)
```
**Why:** Standard L2 weight decay pulls parameters toward the Euclidean origin $[0, 0, \dots, 0]$, which does not lie on the Lorentz manifold. Applying decay to manifold parameters makes the optimizer constantly drag embeddings off the $-1$ surface, after which the `expmap` projection violently snaps them back, destabilizing training.
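The effect is easy to see numerically. This standalone sketch (illustrative only, not repo code) applies one multiplicative decay step to an on-manifold point and checks the constraint before and after:

```python
import math

def lorentz_constraint(x):
    # -x0^2 + |x_{1:d}|^2; should be exactly -1 on the manifold.
    return -x[0] ** 2 + sum(v * v for v in x[1:])

# An on-manifold point: x0 chosen so the constraint holds.
x = [math.sqrt(1 + 0.6**2 + 0.8**2), 0.6, 0.8]
print(round(lorentz_constraint(x), 9))  # -1.0

# One step of L2 weight decay shrinks every coordinate toward 0...
decayed = [v * (1 - 0.01) for v in x]
print(round(lorentz_constraint(decayed), 9))  # -0.9801: off the surface
```

Because the constraint is a quadratic form, uniform shrinkage by a factor $c$ scales it by $c^2$, so any nonzero decay step leaves the hyperboloid.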
### 4. Shuffle Buffer Dataloader
The streaming `interleave_datasets` interleaves at the document level. Since OpenThoughts reasoning traces can run 4,000-16,000 tokens (1-4 consecutive 4096-token chunks), the model receives bursts of pure math followed by bursts of pure code, causing catastrophic loss spikes.
**Fix:** A 512-chunk shuffle buffer accumulates tokenized chunks before yielding, ensuring every batch is a representative mix of all three domains:

```
Documents → Tokenize → Pack into 4096-token chunks → Buffer (512) → Shuffle → Yield to GPU
```
This eliminated gradient-norm spikes above 46 and stabilized the loss descent.
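A minimal version of such a buffer (a generic sketch, not the repo's implementation) looks like this: fill the buffer first, then yield a random element for each new arrival, and drain the remainder shuffled at end of stream.

```python
import random

def shuffle_buffer(chunks, buffer_size=512, seed=0):
    # Accumulate chunks; once the buffer is full, swap a random element
    # to the end and yield it, so consecutive outputs mix all domains.
    rng = random.Random(seed)
    buf = []
    for chunk in chunks:
        buf.append(chunk)
        if len(buf) >= buffer_size:
            i = rng.randrange(len(buf))
            buf[i], buf[-1] = buf[-1], buf[i]
            yield buf.pop()
    rng.shuffle(buf)
    yield from buf

mixed = list(shuffle_buffer(["math"] * 4 + ["code"] * 4, buffer_size=4))
print(mixed)  # same multiset of chunks, locally shuffled order
```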
### 5. TF32 Tensor Core Acceleration
```python
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
torch.set_float32_matmul_precision("high")
```
Throughput: 40K → 130K tok/s (3.25× speedup). All upstream Lorentz operations remain in FP32; only matmul operations use TF32's 10-bit mantissa through the Tensor Cores.
### 6. LR Override on Checkpoint Resume
PyTorch's `optimizer.load_state_dict()` restores the learning rate from the checkpoint, silently overriding CLI arguments. We force the LR after restore:
```python
for pg in optimizer.param_groups:
    pg["lr"] = args.lr
    pg["initial_lr"] = args.lr
```
## Quick Start

### Requirements
```bash
pip install torch flash-attn --no-build-isolation
pip install geoopt transformers datasets
```
### Training on H200

```bash
export PYTHONPATH=/path/to/helm-src:$PYTHONPATH
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

# Fresh training
python3 -O train_cot.py \
    --batch_size 16 --grad_accum 8 \
    --lr 3e-4 --seq_len 4096 \
    --save_dir /tmp/checkpoints/cot \
    --log_every 1

# Resume from checkpoint
python3 -O train_cot.py \
    --batch_size 16 --grad_accum 8 \
    --lr 3e-4 --save_dir /tmp/checkpoints/cot \
    --log_every 1 --resume
```
### Generation Test

```bash
python3 test_gen.py --checkpoint /tmp/checkpoints/cot/cot_step5000.pt
```
## Architecture Decisions

### Gradient Clipping: 1.0 → 0.5

The original authors use `grad_clip=1.0` on a 6-layer model. At 16 layers, gradient variance compounds across 10 additional layers, so we tighten the clip; 0.5 on 16 layers plays roughly the role that 1.0 plays on 6.
### LR Scaling: 4e-4 → 3e-4

The original authors use `lr=4e-4` on a 31M model. As parameter count and depth scale up, stable learning rates typically decrease; 3e-4 proved a good operating point at 200M parameters.
### Flash Attention 2

FA2 computes Euclidean dot products, but hyperbolic attention requires the Minkowski inner product $\langle x, y \rangle_{\mathcal{L}} = -x_0 y_0 + \sum_i x_i y_i$. We run FA2 on the spatial dimensions only (stripping the time coordinate), then reconstruct via manifold projection: $x_0 = \sqrt{\|x_{1:d}\|^2 + 1}$.
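One way to see why spatial-only attention loses nothing: the Minkowski score splits into the spatial dot product (what FA2 computes) plus a rank-one correction from the reconstructed time coordinates. A toy sketch of that identity (illustrative only, not the actual FA2 integration):

```python
import math

def lift(spatial):
    # Reconstruct x0 = sqrt(|x_{1:d}|^2 + 1) on the hyperboloid.
    return [math.sqrt(sum(v * v for v in spatial) + 1.0)] + list(spatial)

queries = [[0.4, -0.1], [0.9, 0.2]]
keys = [[-0.2, 0.5], [0.3, 0.3]]

# Reference: Minkowski inner products on the full lifted vectors.
def minkowski(x, y):
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))

ref = [[minkowski(lift(q), lift(k)) for k in keys] for q in queries]

# Spatial path: plain dot products plus a rank-1 time-coordinate term.
q0 = [lift(q)[0] for q in queries]
k0 = [lift(k)[0] for k in keys]
fa2 = [[sum(a * b for a, b in zip(q, k)) - q0[i] * k0[j]
        for j, k in enumerate(keys)] for i, q in enumerate(queries)]

assert all(abs(ref[i][j] - fa2[i][j]) < 1e-12
           for i in range(2) for j in range(2))
```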
### Periodic Re-projection

Embeddings are snapped back to $-x_0^2 + \|x\|^2 = -1$ every 100 steps to correct constraint drift from mixed-precision gradient updates.
## Files

| File | Description |
|---|---|
| `train_cot.py` | Main training script: 200M HELM-D with streaming 60/20/20 mix, shuffle buffer, dual optimizer |
| `test_gen.py` | Temperature-sweep generation test with repetition-penalty grid |
| `train_h200.py` | H200 pretraining with FA2, BF16, torch.compile (130M seed model) |
| `train_h200_130m.py` | 130M config (L6W384A6) for seed training |
| `tokenizer_surgery.py` | Llama → Qwen3 embedding transfer via Lorentzian Fréchet mean |
| `upscale_130m_to_1b.py` | Network morphism: 130M → 1.37B (Lorentz zero-pad + layer cloning) |
| `setup_h200.sh` | H200 environment setup (CUDA, PyTorch, Flash Attention) |
| `helm/modules/helm_d.py` | HELM-D decoder with RoPE odd-dim fix, BF16 output projection |
| `helm/hypercore/` | Lorentz manifold operations, Riemannian optimizers |
## Known Issues

- **torch.compile modes**: `max-autotune` and `reduce-overhead` crash with CUDAGraphs in LorentzEmbeddings; only the default mode works.
- **geoopt + torch.compile**: requires patching `torch.norm` → `torch.linalg.vector_norm` in geoopt's `lorentz/math.py`.
- **Tokenizer max-length warnings**: the TinyLlama tokenizer reports `max_length=2048` but we use a 4096 seq_len; this is harmless (we handle truncation ourselves).
## Citation

Based on:

```bibtex
@article{he2025helm,
  title={HELM: Hyperbolic Large Language Models via Mixture-of-Curvature Experts},
  author={He, Neil and Anand, Rishabh and Madhu, Hiren and Maatouk, Ali and Krishnaswamy, Smita and Tassiulas, Leandros and Yang, Menglin and Ying, Rex},
  journal={arXiv preprint arXiv:2505.24722},
  year={2025},
}
```
## License

MIT; see LICENSE.