datasysdev committed · Commit 1dd3829 · verified · 1 Parent(s): 9d8f2ef

Upload README.md with huggingface_hub

---
license: apache-2.0
tags:
- hyperbolic
- lorentz
- geometric-deep-learning
- language-model
- pretraining
datasets:
- wikimedia/wikipedia
language:
- en
base_model:
- Graph-and-Geometric-Learning/helm
pipeline_tag: text-generation
---

# HELM-D 130M: Hyperbolic Efficient Language Model

A 130M-parameter language model that operates entirely on the **Lorentz manifold** (the hyperboloid model of hyperbolic space). All embeddings, attention, and optimization live in hyperbolic space — the model is geometrically native, not a Euclidean model with post-hoc hyperbolic modifications.

Pretrained on an NVIDIA H200 at **193K tokens/sec** using Flash Attention 2, selective BF16, and torch.compile optimizations.

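To make the hyperboloid constraint concrete, here is a minimal sketch in plain PyTorch (not the model's own code) of how a Lorentz point's time coordinate follows from its spatial part, and of the Minkowski inner product that replaces the Euclidean dot product on this manifold:

```python
import torch

def lift_to_lorentz(x_space: torch.Tensor, k: float = 1.0) -> torch.Tensor:
    """Points on the hyperboloid satisfy -x0^2 + ||x_s||^2 = -1/k,
    so the time coordinate is x0 = sqrt(1/k + ||x_s||^2)."""
    x0 = torch.sqrt(1.0 / k + (x_space ** 2).sum(-1, keepdim=True))
    return torch.cat([x0, x_space], dim=-1)

def minkowski_inner(u: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Minkowski inner product <u, v>_L = -u0*v0 + sum_i ui*vi."""
    return -u[..., 0] * v[..., 0] + (u[..., 1:] * v[..., 1:]).sum(-1)

# Lift 4 random 384-dim spatial vectors (matching the model's width);
# every lifted point satisfies <x, x>_L = -1/K with K = 1.
x = lift_to_lorentz(torch.randn(4, 384, dtype=torch.float64))
assert torch.allclose(minkowski_inner(x, x),
                      torch.full((4,), -1.0, dtype=torch.float64))
```

The same two operations underlie the attention and tokenizer-surgery tricks described below; only the spatial part carries free parameters, the time coordinate is always derived.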
## Architecture

| Parameter | Value |
|---|---|
| Architecture | L6W384A6 (6 layers, width 384, 6 heads) |
| Parameters | 130M |
| Manifold | Lorentz (hyperboloid, curvature K=1) |
| Tokenizer | Qwen3-30B-A3B (151,669 vocab) |
| Context length | 2048 |
| Attention | Flash Attention 2 (spatial-only with time reconstruction) |
| Optimizer | RiemannianAdam (geoopt) |

## Training

Pretrained on 100K English Wikipedia articles + 100K Python source files (~221M unique tokens, ~4 epochs). This is a **proof-of-concept checkpoint** — it validates the hyperbolic training pipeline but does not produce coherent text, owing to the small dataset.

### Performance (H200)

| Configuration | ms/step | tok/s | Speedup |
|---|---|---|---|
| Original FP32 | 5,966 | 43,917 | 1.0× |
| + BF16 logits | 3,601 | 72,770 | 1.7× |
| + FA2 (width=384) | 1,875 | 140,025 | 3.2× |
| **+ torch.compile + python -O** | **1,357** | **193,000** | **4.4×** |

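As a sanity check on the table: the tok/s column is consistent with a fixed token count per step. A global batch of 128 sequences × 2048 tokens (an inference from the numbers, not stated in the card) reproduces every reported figure to within ~0.3%:

```python
# Hypothetical batch geometry: 128 sequences x 2048 context = 262,144
# tokens per optimizer step. Inferred from the table, not stated in the card.
tokens_per_step = 128 * 2048

for ms_per_step, reported_tok_s in [(5966, 43917), (3601, 72770),
                                    (1875, 140025), (1357, 193000)]:
    implied = tokens_per_step / (ms_per_step / 1000.0)
    # Each implied throughput lands within ~0.3% of the reported figure.
    assert abs(implied - reported_tok_s) / reported_tok_s < 0.003
```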
### Training Curve

Loss stabilized around 6.5–7.0 after exhausting the 221M-token dataset (4+ epochs).

## Checkpoints

| File | Step | Description |
|---|---|---|
| `h200_step2400.pt` | 2400 | End of first torch.compile run (stable, loss ~7.0) |
| `h200_step4100.pt` | 4100 | Final checkpoint with all optimizations (-O flag, geoopt patch) |

Each checkpoint contains:
- `model_state_dict`: Full model weights (FP32, Lorentz manifold)
- `optimizer_state_dict`: RiemannianAdam state
- `global_step`: Training step counter

### Loading

```python
import torch
from helm.hypercore.manifolds import Lorentz
from helm.modules.helm_d import LTransformerDecoder

# Rebuild the architecture exactly as trained: curvature 1.0 on all
# three manifolds, 6 layers x width 384 x 6 heads, Qwen3 vocab.
model = LTransformerDecoder(
    manifold_in=Lorentz(1.0),
    manifold_hidden=Lorentz(1.0),
    manifold_out=Lorentz(1.0),
    arch="L6W384A6",
    vocab_size=151669,
    context_length=2048,
)

# weights_only=False because the checkpoint also stores optimizer state.
ckpt = torch.load("h200_step4100.pt", map_location="cpu", weights_only=False)
model.load_state_dict(ckpt["model_state_dict"], strict=False)
```

## Tokenizer Surgery

The original HELM uses the Llama-3.1 tokenizer (128K vocab). We transferred embeddings to the Qwen3-30B-A3B tokenizer (151K vocab) using the **Lorentzian Fréchet mean** — computing the geometric centroid on the hyperboloid for each novel token by decomposing it into Llama sub-tokens and projecting via the Einstein midpoint.

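The midpoint step can be sketched in a few lines (assuming the simple average-then-rescale form of the Lorentz midpoint; the repo's actual implementation may differ): average the sub-token embeddings in ambient coordinates, then rescale the result back onto the hyperboloid.

```python
import torch

def minkowski_inner(u: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Minkowski inner product <u, v>_L = -u0*v0 + sum_i ui*vi."""
    return -u[..., 0] * v[..., 0] + (u[..., 1:] * v[..., 1:]).sum(-1)

def lorentz_midpoint(points: torch.Tensor, k: float = 1.0) -> torch.Tensor:
    """Centroid of Lorentz points (rows of `points`): arithmetic mean,
    then rescale so <m, m>_L = -1/k, i.e. back onto the hyperboloid."""
    m = points.mean(dim=0)
    scale = torch.sqrt(k * torch.clamp(-minkowski_inner(m, m), min=1e-12))
    return m / scale

# Demo: centroid of 3 hypothetical sub-token embeddings (K = 1).
pts = torch.randn(3, 384, dtype=torch.float64)
x0 = torch.sqrt(1.0 + (pts ** 2).sum(-1, keepdim=True))
subtokens = torch.cat([x0, pts], dim=-1)   # 3 points on the manifold
mid = lorentz_midpoint(subtokens)
assert torch.allclose(minkowski_inner(mid, mid),
                      torch.tensor(-1.0, dtype=torch.float64))
```

Unlike an iterative Karcher mean, this closed form needs no optimization loop, which is presumably why it was chosen for the one-off vocabulary transfer.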
## Key Optimizations

- **Flash Attention 2**: Runs on spatial dimensions only (strips the Lorentz time coordinate), then reconstructs it via manifold projection after attention.
- **Selective BF16**: Only the output projection (Euclidean) uses BF16; all Lorentz operations remain FP32.
- **python -O**: Strips the manifold code's 30+ NaN-checking `assert` statements, eliminating GPU→CPU synchronization stalls.
- **geoopt patch**: `torch.norm(p=2)` → `torch.linalg.vector_norm(ord=2)` for torch.compile compatibility.
- **Width 384**: Aligned to 64-wide Tensor Core tiles (the original width was 390).

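The attention trick in the first bullet can be sketched as follows, using PyTorch's SDPA as a stand-in for the Flash Attention 2 kernel (the function name and shapes are illustrative, not from the repo):

```python
import torch
import torch.nn.functional as F

def spatial_attention_relift(q, k, v, curv: float = 1.0):
    """Attend over the spatial Lorentz coordinates only, then rebuild
    the time coordinate so outputs land back on the hyperboloid."""
    # q, k, v: (batch, heads, seq, 1 + spatial_dim) Lorentz points.
    q_s, k_s, v_s = q[..., 1:], k[..., 1:], v[..., 1:]      # strip time coord
    out_s = F.scaled_dot_product_attention(q_s, k_s, v_s)   # Euclidean scores
    # Re-lift: x0 = sqrt(1/K + ||out_s||^2) projects back onto the manifold.
    x0 = torch.sqrt(1.0 / curv + (out_s ** 2).sum(-1, keepdim=True))
    return torch.cat([x0, out_s], dim=-1)

q = torch.randn(1, 6, 16, 65)   # 6 heads, seq 16, 64 spatial dims + time
out = spatial_attention_relift(q, q, q)
assert out.shape == q.shape
```

This is exactly the compromise noted further down: the kernel scores with Euclidean dot products, dropping the −q₀k₀ term of the Minkowski product, and geometry is restored only after the softmax-weighted sum.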
## Intended Use

This checkpoint serves as a **seed for Network Morphism** — upscaling to 1B+ parameters by zero-padding Lorentz spatial dimensions and cloning transformer layers. The learned manifold geometry, token distributions, and attention patterns transfer to the larger model.

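The zero-padding step is geometry-preserving, which a short sketch makes clear (`widen_spatial` is a hypothetical helper, not a function from the repo; it assumes embeddings store the full Lorentz point with the time coordinate first). Padding the spatial part with zeros leaves ‖x_s‖ unchanged, so the existing time coordinate still satisfies the hyperboloid constraint:

```python
import torch

def widen_spatial(emb: torch.Tensor, new_spatial_dim: int) -> torch.Tensor:
    """Zero-pad the spatial part of Lorentz embeddings (time coordinate
    at index 0). ||x_s|| is unchanged, so points stay on the manifold."""
    vocab, dim = emb.shape
    pad = torch.zeros(vocab, new_spatial_dim - (dim - 1), dtype=emb.dtype)
    return torch.cat([emb, pad], dim=-1)

# Demo with a hypothetical 384 -> 1024 spatial widening.
x_s = torch.randn(5, 384, dtype=torch.float64)
x0 = torch.sqrt(1.0 + (x_s ** 2).sum(-1, keepdim=True))
emb = torch.cat([x0, x_s], dim=-1)
wide = widen_spatial(emb, 1024)
mink = -wide[..., 0] ** 2 + (wide[..., 1:] ** 2).sum(-1)
assert torch.allclose(mink, torch.full((5,), -1.0, dtype=torch.float64))
```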
## Geometric Compromises

- FA2 computes Euclidean dot products instead of Minkowski inner products (drops the −q₀k₀ term)
- Periodic re-projection of embeddings onto the manifold every 100 steps
- Einstein midpoint used instead of the iterative Karcher mean for tokenizer surgery

## Citation

Based on:
```bibtex
@article{helm2024,
  title={Hyperbolic Efficient Language Models},
  author={Graph and Geometric Learning Lab},
  year={2024},
  url={https://github.com/Graph-and-Geometric-Learning/helm}
}
```

## Source Code

[unixsysdev/helm (h200-optimizations branch)](https://github.com/unixsysdev/helm/tree/h200-optimizations)