---
language:
  - en
tags:
  - hyperbolic
  - lorentz
  - geometric-deep-learning
  - language-model
  - chain-of-thought
  - reasoning
pipeline_tag: text-generation
license: mit
datasets:
  - open-thoughts/OpenThoughts-114k
  - HuggingFaceTB/smollm-corpus
---

# HELM-D: Hyperbolic Chain-of-Thought Reasoning Engine

Fork of Graph-and-Geometric-Learning/helm: a 200M-parameter fully hyperbolic transformer trained on an NVIDIA H200 for structured reasoning.

Checkpoints: `datasysdev/helm-d-130m-hyperbolic` on Hugging Face

All computations live on the Lorentz manifold: $-x_0^2 + x_1^2 + \dots + x_d^2 = -1$. The model uses hyperbolic embeddings, Lorentzian attention, and Riemannian optimization, making it natively suited for hierarchical data like code ASTs, dependency trees, and chain-of-thought reasoning traces.
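The two operations this implies (the Minkowski inner product and the lift onto the hyperboloid) can be sketched in a few lines of PyTorch; this is an illustration, not the repo's code:

```python
import torch

def lorentz_inner(x, y):
    # Minkowski inner product: -x0*y0 + <x_spatial, y_spatial>
    return -x[..., 0] * y[..., 0] + (x[..., 1:] * y[..., 1:]).sum(-1)

def project_to_hyperboloid(spatial):
    # Lift spatial coordinates onto the manifold by solving
    # -x0^2 + |x_spatial|^2 = -1  =>  x0 = sqrt(|x_spatial|^2 + 1)
    x0 = torch.sqrt((spatial ** 2).sum(-1, keepdim=True) + 1.0)
    return torch.cat([x0, spatial], dim=-1)

points = project_to_hyperboloid(torch.randn(4, 8))
# Every lifted point satisfies <x, x>_L = -1 up to float error
```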


## Current Training Run

Training a 200M parameter HELM-D from scratch on a multi-domain reasoning corpus:

| Parameter | Value |
|---|---|
| Architecture | L16W768A12 (16 layers, 768 width, 12 heads) |
| Parameters | 200M (175.8M Euclidean + 24.6M hyperbolic) |
| Tokenizer | TinyLlama 32K (dense coverage, no dead tokens) |
| Context | 4096 tokens (full CoT traces fit in one pass) |
| Throughput | 130K tok/s on a single H200 |
| Optimizer | Dual-group RiemannianAdam (see below) |
| Learning rate | 3e-4, cosine decay with 500-step warmup |
| Gradient clip | 0.5 |
| Manifold | Lorentz $-x_0^2 + \|x\|^2 = -1$, verified at 1.0000 ± 0.0000 |

### Training Data (60/20/20 Mix)

| Domain | Weight | Source | Purpose |
|---|---|---|---|
| CoT reasoning | 60% | OpenThoughts-114k | Math, code, science reasoning with `<think>` traces |
| Python code | 20% | SmolLM-Corpus `python-edu` | Educational Python |
| Text | 20% | SmolLM-Corpus `cosmopedia-v2` | General knowledge |

Streamed via `interleave_datasets` with a 512-chunk shuffle buffer to prevent domain clustering (see Architecture Decisions below).


## Key Changes from Upstream HELM

### 1. Tokenizer: Llama-3.1 → TinyLlama 32K

The original HELM uses the Llama-3.1 tokenizer (128K vocab). We switched to TinyLlama's 32K tokenizer for the CoT training run:

- **Dense coverage:** no dead tokens; every token gets trained
- **Smaller embedding matrix:** 32K × 768 vs 128K × 768, a significant VRAM saving
- **Better for small models:** 200M params can't support a 128K vocab efficiently
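Back-of-envelope arithmetic on the embedding-matrix point, using nominal vocab sizes (128,256 for Llama-3.1, 32,000 for TinyLlama):

```python
d_model = 768
llama_vocab, tinyllama_vocab = 128_256, 32_000

def embedding_params(vocab_size: int) -> int:
    # One vocab x d_model embedding matrix (doubled again if untied from the LM head)
    return vocab_size * d_model

saved = embedding_params(llama_vocab) - embedding_params(tinyllama_vocab)
print(f"~{saved / 1e6:.1f}M fewer embedding parameters")  # ~73.9M
```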

### 2. Architecture: L6W384A6 → L16W768A12

Scaled up from the original 31M parameter toy model to a 200M parameter engine:

| | Original | Ours |
|---|---|---|
| Layers | 6 | 16 |
| Width | 390 | 768 |
| Heads | 6 | 12 |
| Head dim | 65 | 64 (Tensor Core aligned) |
| Parameters | 31M | 200M |

### 3. Dual-Group Optimizer (Matching Original Authors)

The original HELM repo uses two separate optimizers: AdamW for Euclidean params and RiemannianAdam for hyperbolic params, with `weight_decay=0.0` on manifold parameters.

We implement this as a single `RiemannianAdam` with dual parameter groups:

```python
optimizer = RiemannianAdam([
    {"params": euclidean_params,  "weight_decay": 0.01},  # 175.8M params
    {"params": hyperbolic_params, "weight_decay": 0.0},   # 24.6M params
], lr=3e-4)
```

**Why:** standard L2 weight decay pulls parameters toward the Euclidean origin $[0, 0, \dots, 0]$, which is not on the Lorentz manifold. Applying decay to manifold parameters causes the optimizer to constantly drag embeddings off the $-1$ surface, after which the expmap projection violently snaps them back, destabilizing training.

### 4. Shuffle Buffer Dataloader

The streaming `interleave_datasets` pipeline interleaves at the document level. Since OpenThoughts reasoning traces can run 4,000–16,000 tokens (1–4 consecutive 4096-token chunks), the model receives bursts of pure math followed by bursts of pure code, causing catastrophic loss spikes.

**Fix:** a 512-chunk shuffle buffer accumulates tokenized chunks before yielding, ensuring every batch is a representative mix of all three domains:

```
Documents → Tokenize → Pack into 4096-token chunks → Buffer (512) → Shuffle → Yield to GPU
```

This eliminated gradient-norm spikes of 46+ and stabilized the loss descent.
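The buffer itself is only a few lines; a simplified sketch (the training code's version also handles tokenization and packing upstream):

```python
import random

def shuffle_buffer(chunks, buffer_size=512, seed=0):
    """Hold up to `buffer_size` chunks and yield a random one each time a new
    chunk arrives, so adjacent outputs mix all domains."""
    rng = random.Random(seed)
    buffer = []
    for chunk in chunks:
        buffer.append(chunk)
        if len(buffer) >= buffer_size:
            # Swap a random element to the end and pop it
            idx = rng.randrange(len(buffer))
            buffer[idx], buffer[-1] = buffer[-1], buffer[idx]
            yield buffer.pop()
    rng.shuffle(buffer)
    yield from buffer  # drain the remainder at end of stream

# A bursty stream (domains arriving in runs) comes out interleaved
stream = ["math"] * 50 + ["code"] * 50 + ["text"] * 50
mixed = list(shuffle_buffer(stream, buffer_size=32))
```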

### 5. TF32 Tensor Core Acceleration

```python
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
torch.set_float32_matmul_precision("high")
```

Throughput: 40K → 130K tok/s (3.25× speedup). All upstream Lorentz operations remain in FP32; only matmuls use TF32's 10-bit mantissa on the Tensor Cores.

### 6. LR Override on Checkpoint Resume

PyTorch's `optimizer.load_state_dict()` restores the learning rate from the checkpoint, silently overriding CLI arguments. We force the LR after restoring:

```python
for pg in optimizer.param_groups:
    pg["lr"] = args.lr
    pg["initial_lr"] = args.lr
```

## Quick Start

### Requirements

```bash
pip install torch flash-attn --no-build-isolation
pip install geoopt transformers datasets
```

### Training on H200

```bash
export PYTHONPATH=/path/to/helm-src:$PYTHONPATH
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

# Fresh training
python3 -O train_cot.py \
    --batch_size 16 --grad_accum 8 \
    --lr 3e-4 --seq_len 4096 \
    --save_dir /tmp/checkpoints/cot \
    --log_every 1

# Resume from checkpoint
python3 -O train_cot.py \
    --batch_size 16 --grad_accum 8 \
    --lr 3e-4 --save_dir /tmp/checkpoints/cot \
    --log_every 1 --resume
```

### Generation Test

```bash
python3 test_gen.py --checkpoint /tmp/checkpoints/cot/cot_step5000.pt
```

## Architecture Decisions

### Gradient Clipping: 1.0 → 0.5

The original authors use `grad_clip=1.0` on a 6-layer model. With 16 layers, gradient noise compounds across 10 additional layers, so we halve the clip threshold to keep per-layer update magnitudes near the scale the original setting was tuned for.
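In PyTorch this is the standard per-step global-norm clip (the one-layer model here is a stand-in):

```python
import torch

model = torch.nn.Linear(16, 16)  # stand-in for the 16-layer HELM-D
model(torch.randn(4, 16)).sum().backward()

# Rescale all gradients together so their global L2 norm is at most 0.5;
# returns the pre-clip norm, useful for spotting spikes in logs
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.5)
```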

### LR Scaling: 4e-4 → 3e-4

The original authors use `lr=4e-4` on a 31M model. As parameter count and depth grow, optimal learning rates typically shrink; 3e-4 follows that trend at 200M parameters.

### Flash Attention 2

FA2 computes Euclidean dot products, but hyperbolic attention requires the Minkowski inner product $\langle x, y \rangle_{\mathcal{L}} = -x_0 y_0 + \sum x_i y_i$. We run FA2 on spatial dimensions only (strip the time coordinate), then reconstruct via manifold projection: $x_0 = \sqrt{|x_{1:d}|^2 + 1}$.
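A sketch of that strip-and-lift pattern, with `scaled_dot_product_attention` standing in for FA2 (PyTorch dispatches it to Flash kernels on supported GPUs); this illustrates the idea, not the repo's exact code:

```python
import torch
import torch.nn.functional as F

def lorentz_attention_spatial(q, k, v):
    # Strip the time coordinate and run fused attention on spatial dims only
    q_s, k_s, v_s = q[..., 1:], k[..., 1:], v[..., 1:]
    out_s = F.scaled_dot_product_attention(q_s, k_s, v_s)
    # Lift back onto the manifold: x0 = sqrt(|x_spatial|^2 + 1) restores
    # -x0^2 + |x_spatial|^2 = -1 for every output row
    x0 = torch.sqrt((out_s ** 2).sum(-1, keepdim=True) + 1.0)
    return torch.cat([x0, out_s], dim=-1)

q = k = v = torch.randn(1, 2, 5, 9)  # (batch, heads, seq, 1 + spatial_dim)
out = lorentz_attention_spatial(q, k, v)
```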

### Periodic Re-projection

Embeddings are snapped back to $-x_0^2 + |x|^2 = -1$ every 100 steps to correct constraint drift from mixed-precision gradient updates.
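The snap-back is closed-form: keep the spatial coordinates and recompute the time coordinate from the constraint. A minimal sketch:

```python
import torch

@torch.no_grad()
def reproject_(weight: torch.Tensor) -> None:
    # In-place snap onto -x0^2 + |x_spatial|^2 = -1
    spatial = weight[..., 1:]
    weight[..., 0] = torch.sqrt((spatial ** 2).sum(-1) + 1.0)

emb = torch.randn(16, 9)  # drifted embedding table (constraint violated)
reproject_(emb)           # in training, called every 100 optimizer steps
```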


## Files

| File | Description |
|---|---|
| `train_cot.py` | Main training script: 200M HELM-D with streaming 60/20/20 mix, shuffle buffer, dual optimizer |
| `test_gen.py` | Temperature-sweep generation test with repetition-penalty grid |
| `train_h200.py` | H200 pretraining with FA2, BF16, `torch.compile` (130M seed model) |
| `train_h200_130m.py` | 130M config (L6W384A6) for seed training |
| `tokenizer_surgery.py` | Llama → Qwen3 embedding transfer via Lorentzian Fréchet Mean |
| `upscale_130m_to_1b.py` | Network morphism: 130M → 1.37B (Lorentz zero-pad + layer cloning) |
| `setup_h200.sh` | H200 environment setup (CUDA, PyTorch, Flash Attention) |
| `helm/modules/helm_d.py` | HELM-D decoder with RoPE odd-dim fix, BF16 output projection |
| `helm/hypercore/` | Lorentz manifold operations, Riemannian optimizers |

## Known Issues

- **`torch.compile` modes:** `max-autotune` and `reduce-overhead` crash with CUDAGraphs in `LorentzEmbeddings`. Only `default` mode works.
- **geoopt + `torch.compile`:** requires patching `torch.norm` → `torch.linalg.vector_norm` in geoopt's `lorentz/math.py`.
- **Tokenizer max-length warnings:** the TinyLlama tokenizer reports `max_length=2048` but we use `seq_len=4096`. This is harmless; we handle truncation ourselves.

## Citation

Based on:

```bibtex
@article{he2025helm,
  title={HELM: Hyperbolic Large Language Models via Mixture-of-Curvature Experts},
  author={He, Neil and Anand, Rishabh and Madhu, Hiren and Maatouk, Ali and Krishnaswamy, Smita and Tassiulas, Leandros and Yang, Menglin and Ying, Rex},
  journal={arXiv preprint arXiv:2505.24722},
  year={2025}
}
```

## License

MIT; see LICENSE.