---
language:
- en
tags:
- hyperbolic
- lorentz
- geometric-deep-learning
- language-model
- chain-of-thought
- reasoning
pipeline_tag: text-generation
license: mit
datasets:
- open-thoughts/OpenThoughts-114k
- HuggingFaceTB/smollm-corpus
---
# HELM-D: Hyperbolic Chain-of-Thought Reasoning Engine
> Fork of [Graph-and-Geometric-Learning/helm](https://github.com/Graph-and-Geometric-Learning/helm) — a **200M parameter** fully hyperbolic transformer trained on an NVIDIA H200 for structured reasoning.
>
> **Checkpoints**: [datasysdev/helm-d-130m-hyperbolic](https://huggingface.co/datasysdev/helm-d-130m-hyperbolic) on HuggingFace
All computations live on the [Lorentz manifold](https://en.wikipedia.org/wiki/Hyperboloid_model): $-x_0^2 + x_1^2 + \dots + x_d^2 = -1$. The model uses hyperbolic embeddings, Lorentzian attention, and Riemannian optimization — making it natively suited for hierarchical data like code ASTs, dependency trees, and chain-of-thought reasoning traces.
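As a standalone illustration (not code from this repo), a point can be lifted onto the hyperboloid by solving the constraint for the time coordinate, and membership checked with the Minkowski inner product:

```python
import math

def lift_to_hyperboloid(spatial):
    """Solve -x_0^2 + ||x_{1:d}||^2 = -1 for the time coordinate x_0
    and prepend it, placing the point on the Lorentz manifold."""
    x0 = math.sqrt(1.0 + sum(v * v for v in spatial))
    return [x0] + list(spatial)

def minkowski_inner(x, y):
    """Minkowski inner product <x, y>_L = -x_0*y_0 + sum_i x_i*y_i."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))

p = lift_to_hyperboloid([0.3, -0.7, 1.2])
print(minkowski_inner(p, p))  # -1.0 up to float error: p is on the manifold
```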
---
## Current Training Run
Training a **200M parameter** HELM-D from scratch on a multi-domain reasoning corpus:
| Parameter | Value |
|---|---|
| Architecture | `L16W768A12` (16 layers, 768 width, 12 heads) |
| Parameters | **200M** (175.8M Euclidean + 24.6M Hyperbolic) |
| Tokenizer | TinyLlama 32K (dense coverage, no dead tokens) |
| Context | 4096 tokens (full CoT traces fit in one pass) |
| Throughput | **130K tok/s** on single H200 |
| Optimizer | Dual-group RiemannianAdam (see below) |
| Learning Rate | 3e-4, cosine decay with 500-step warmup |
| Gradient Clip | 0.5 |
| Manifold | Lorentz $-x_0^2 + \|x\|^2 = -1$, verified at 1.0000±0.0000 |
### Training Data (60/20/20 Mix)
| Domain | Weight | Source | Purpose |
|---|---|---|---|
| CoT Reasoning | 60% | [OpenThoughts-114k](https://huggingface.co/datasets/open-thoughts/OpenThoughts-114k) | Math, code, science reasoning with `<think>` traces |
| Python Code | 20% | [SmolLM-Corpus python-edu](https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus) | Educational Python |
| Text | 20% | [SmolLM-Corpus cosmopedia-v2](https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus) | General knowledge |
Streamed via `interleave_datasets` with a **512-chunk shuffle buffer** to prevent domain clustering (see Architecture Decisions below).
---
## Key Changes from Upstream HELM
### 1. Tokenizer: Llama-3.1 → TinyLlama 32K
The original HELM uses the Llama-3.1 tokenizer (128K vocab). We switched to **TinyLlama's 32K tokenizer** for the CoT training run:
- **Dense coverage**: No dead tokens — every token gets trained
- **Smaller embedding matrix**: 32K × 768 vs 128K × 768 — significant VRAM savings
- **Better for small models**: 200M params can't support 128K vocab efficiently
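Back-of-envelope arithmetic for the embedding savings (exact vocab sizes assumed here: 128,256 for Llama-3.1, 32,000 for TinyLlama):

```python
WIDTH = 768  # model width; the embedding matrix is vocab x width
for name, vocab in [("Llama-3.1", 128_256), ("TinyLlama", 32_000)]:
    params = vocab * WIDTH
    print(f"{name}: {params / 1e6:.1f}M embedding params, "
          f"~{params * 4 / 1e6:.0f} MB in fp32")
# Llama-3.1: 98.5M embedding params, ~394 MB in fp32
# TinyLlama: 24.6M embedding params, ~98 MB in fp32
```

The 24.6M result is the same figure listed as the hyperbolic parameter count in the training table above.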
### 2. Architecture: L6W384A6 → L16W768A12
Scaled up from the original 31M parameter toy model to a **200M parameter** engine:
| | Original | Ours |
|---|---|---|
| Layers | 6 | **16** |
| Width | 390 | **768** |
| Heads | 6 | **12** |
| Head dim | 65 | **64** (Tensor Core aligned) |
| Parameters | 31M | **200M** |
### 3. Dual-Group Optimizer (Matching Original Authors)
The original HELM repo uses **two separate optimizers**: AdamW for Euclidean params and RiemannianAdam for hyperbolic params, with `weight_decay=0.0` on manifold parameters.
We implement this as a single RiemannianAdam with dual parameter groups:
```python
optimizer = RiemannianAdam([
{"params": euclidean_params, "weight_decay": 0.01}, # 175.8M params
{"params": hyperbolic_params, "weight_decay": 0.0}, # 24.6M params
], lr=3e-4)
```
**Why**: Standard L2 weight decay pulls parameters toward the Euclidean origin `[0,0,...,0]`, which is **not on the Lorentz manifold**. Applying decay to manifold parameters causes the optimizer to constantly drag embeddings off the $-1$ surface, then the `expmap` projection violently snaps them back — destabilizing training.
### 4. Shuffle Buffer Dataloader
The streaming `interleave_datasets` interleaves at the **document** level. Since OpenThoughts reasoning traces can be 4,000-16,000 tokens (1-4 consecutive 4096-token chunks), the model receives bursts of pure math followed by bursts of pure code — causing catastrophic loss spikes.
**Fix**: A 512-chunk shuffle buffer accumulates tokenized chunks before yielding, ensuring every batch is a representative mix of all 3 domains:
```
Documents → Tokenize → Pack into 4096-token chunks → Buffer (512) → Shuffle → Yield to GPU
```
This eliminated gradient spikes of 46+ and stabilized the loss descent.
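A dependency-free sketch of such a buffer (reservoir-style; names are illustrative, and the real dataloader tokenizes and packs before buffering):

```python
import random

def shuffle_buffer(chunks, buffer_size=512, seed=0):
    """Fill a buffer, then swap each incoming chunk with a randomly
    chosen buffered one before yielding, so long same-domain runs from
    document-level interleaving come out mixed."""
    rng = random.Random(seed)
    buf = []
    for chunk in chunks:
        if len(buf) < buffer_size:
            buf.append(chunk)
            continue
        i = rng.randrange(buffer_size)
        yield buf[i]
        buf[i] = chunk
    rng.shuffle(buf)  # flush the remainder in random order
    yield from buf

# A burst of 600 math chunks followed by 600 code chunks...
stream = ["math"] * 600 + ["code"] * 600
mixed = list(shuffle_buffer(stream))
print("first code chunk at output position", mixed.index("code"))
```

Each yielded chunk is drawn from whatever mixture currently sits in the buffer rather than from a single burst, which is what keeps every batch representative.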
### 5. TF32 Tensor Core Acceleration
```python
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
torch.set_float32_matmul_precision("high")
```
Throughput: **40K → 130K tok/s** (3.25× speedup). All upstream Lorentz operations remain in FP32 — only matmul operations use TF32's 10-bit mantissa through the Tensor Cores.
### 6. LR Override on Checkpoint Resume
PyTorch's `optimizer.load_state_dict()` restores the learning rate from the checkpoint, silently overriding CLI arguments. We force the LR after restore:
```python
for pg in optimizer.param_groups:
pg["lr"] = args.lr
pg["initial_lr"] = args.lr
```
---
## Quick Start
### Requirements
```bash
pip install torch
pip install flash-attn --no-build-isolation  # needs torch installed first to build
pip install geoopt transformers datasets
```
### Training on H200
```bash
export PYTHONPATH=/path/to/helm-src:$PYTHONPATH
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
# Fresh training
python3 -O train_cot.py \
--batch_size 16 --grad_accum 8 \
--lr 3e-4 --seq_len 4096 \
--save_dir /tmp/checkpoints/cot \
--log_every 1
# Resume from checkpoint
python3 -O train_cot.py \
--batch_size 16 --grad_accum 8 \
--lr 3e-4 --save_dir /tmp/checkpoints/cot \
--log_every 1 --resume
```
### Generation Test
```bash
python3 test_gen.py --checkpoint /tmp/checkpoints/cot/cot_step5000.pt
```
---
## Architecture Decisions
### Gradient Clipping: 1.0 → 0.5
The original authors use `grad_clip=1.0` on a 6-layer model. At 16 layers, gradient noise accumulates across the additional depth, so we tighten the clip to 0.5. This is a heuristic rather than an exact equivalence: the intent is to keep per-update magnitudes in roughly the regime that 1.0 produced at 6 layers.
### LR Scaling: 4e-4 → 3e-4
The original authors use `lr=4e-4` on the 31M model. Optimal learning rates tend to shrink as parameter count and depth grow, so we lowered the peak LR to 3e-4, which trained stably at 200M.
### Flash Attention 2
FA2 computes Euclidean dot products, but hyperbolic attention requires the Minkowski inner product $\langle x, y \rangle_{\mathcal{L}} = -x_0 y_0 + \sum x_i y_i$. We run FA2 on **spatial dimensions only** (strip the time coordinate), then reconstruct via manifold projection: $x_0 = \sqrt{\|x_{1:d}\|^2 + 1}$.
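In sketch form, with plain softmax attention standing in for the actual FlashAttention-2 kernel (shapes and names are illustrative):

```python
import torch

def lorentz_attention_spatial(q, k, v):
    """Attend over spatial coordinates only, then lift the result back
    onto the hyperboloid via x_0 = sqrt(||x_{1:d}||^2 + 1)."""
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d ** 0.5  # Euclidean dot products
    out = scores.softmax(dim=-1) @ v             # spatial part of the output
    time = torch.sqrt(out.pow(2).sum(-1, keepdim=True) + 1.0)
    return torch.cat([time, out], dim=-1)        # [..., 1 + d], on-manifold

q, k, v = (torch.randn(2, 8, 64) for _ in range(3))
x = lorentz_attention_spatial(q, k, v)
# -x_0^2 + ||x_{1:d}||^2 equals -1 for every token, up to float error.
```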
### Periodic Re-projection
Embeddings are snapped back to $-x_0^2 + \|x\|^2 = -1$ every 100 steps to correct constraint drift from mixed-precision gradient updates.
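A sketch of the re-projection step (function and attribute names are illustrative, not from the repo):

```python
import torch

@torch.no_grad()
def reproject_to_lorentz(emb):
    """Recompute the time coordinate from the spatial part so that
    -x_0^2 + ||x_{1:d}||^2 = -1 holds exactly again after drift."""
    spatial = emb[..., 1:]
    emb[..., :1] = torch.sqrt(spatial.pow(2).sum(-1, keepdim=True) + 1.0)
    return emb

# In the training loop (step counter and embedding attribute illustrative):
# if step % 100 == 0:
#     reproject_to_lorentz(model.tok_emb.weight)
```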
---
## Files
| File | Description |
|---|---|
| `train_cot.py` | **Main training script** — 200M HELM-D with streaming 60/20/20 mix, shuffle buffer, dual optimizer |
| `test_gen.py` | Temperature sweep generation test with repetition penalty grid |
| `train_h200.py` | H200 pretraining with FA2, BF16, torch.compile (130M seed model) |
| `train_h200_130m.py` | 130M config (L6W384A6) for seed training |
| `tokenizer_surgery.py` | Llama→Qwen3 embedding transfer via Lorentzian Fréchet Mean |
| `upscale_130m_to_1b.py` | Network Morphism: 130M→1.37B (Lorentz zero-pad + layer cloning) |
| `setup_h200.sh` | H200 environment setup (CUDA, PyTorch, Flash Attention) |
| `helm/modules/helm_d.py` | HELM-D decoder with RoPE odd-dim fix, BF16 output projection |
| `helm/hypercore/` | Lorentz manifold operations, Riemannian optimizers |
---
## Known Issues
- **torch.compile modes**: `max-autotune` and `reduce-overhead` crash with CUDAGraphs in LorentzEmbeddings. Only default mode works.
- **geoopt + torch.compile**: Requires patching `torch.norm` → `torch.linalg.vector_norm` in geoopt's `lorentz/math.py`.
- **Tokenizer max length warnings**: TinyLlama tokenizer reports max_length=2048 but we use 4096 seq_len — this is harmless (we handle truncation ourselves).
---
## Citation
Based on:
```bibtex
@article{he2025helm,
title={HELM: Hyperbolic Large Language Models via Mixture-of-Curvature Experts},
author={He, Neil and Anand, Rishabh and Madhu, Hiren and Maatouk, Ali and Krishnaswamy, Smita and Tassiulas, Leandros and Yang, Menglin and Ying, Rex},
journal={arXiv preprint arXiv:2505.24722},
year={2025},
}
```
## License
MIT — see [LICENSE](LICENSE).