## Source Code

[unixsysdev/helm (h200-optimizations branch)](https://github.com/unixsysdev/helm/tree/h200-optimizations)

## Roadmap: 1.37B Pretraining

The 130M checkpoints in this repo are seeds for the **1.37B HELM-D** model (L24W1536A24), upscaled via Network Morphism:

1. **Width 384→1536**: Zero-pad Lorentz spatial dims (manifold constraint preserved exactly)
2. **Depth 6→24 layers**: Interleaved cloning, repeating the full 6-layer pipeline 4× with residual scaling
3. **All linear weights**: Top-left corner placement in expanded matrices, remainder N(0, 0.001)
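The three morphism steps can be sketched roughly as below. This is a minimal numpy sketch, not the repo's actual implementation: the function names, the time-coordinate-first hyperboloid convention, and the exact residual-scaling rule for cloned layers are all assumptions.

```python
import numpy as np

def widen_lorentz(x, new_dim):
    """Step 1: zero-pad the spatial part of Lorentz points.

    With x = (t, s_1..s_d), the Lorentz inner product
    <x, x>_L = -t^2 + |s|^2 is unchanged by appending zero spatial
    coordinates, so points stay on the hyperboloid exactly.
    """
    pad = new_dim - x.shape[-1]
    return np.concatenate([x, np.zeros(x.shape[:-1] + (pad,))], axis=-1)

def widen_linear(w, new_out, new_in, std=0.001, rng=None):
    """Step 3: old weights go in the top-left corner of the expanded
    matrix; the remainder is drawn from N(0, std^2)."""
    rng = np.random.default_rng(0) if rng is None else rng
    w_new = rng.normal(0.0, std, size=(new_out, new_in))
    w_new[: w.shape[0], : w.shape[1]] = w
    return w_new

def deepen(layers, times=4):
    """Step 2: interleaved cloning -- repeat the full pipeline `times`
    times, scaling each clone's residual branch by 1/times (assumed
    rule) so the deepened stack stays close to the original at init."""
    return [(layer, 1.0 / times) for _ in range(times) for layer in layers]
```

Zero-padding spatial coordinates is what makes the width step exact: the Lorentz constraint depends only on t and the spatial norm, and appended zeros change neither.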
The 1.37B model is currently training on **2B tokens from FineWeb-Edu** on a single NVIDIA H200.

### Next Steps
- **KL divergence distillation** from Qwen3-30B using Nebius SWE-agent trajectories (80K agentic tool-use sequences)
- **Context extension** to 128K via NTK-RoPE scaling
- **Fine-tuning** on agentic coding trajectories for downstream tool-use tasks
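The distillation objective named in the first bullet is the standard token-level KL between teacher and student next-token distributions. A minimal numpy sketch (the temperature, the KL direction, and the mean-over-positions reduction are assumptions; the README does not specify them):

```python
import numpy as np

def kl_distill_loss(teacher_logits, student_logits, T=1.0):
    """Token-level KL(teacher || student) on temperature-softened
    distributions, averaged over positions. The T^2 factor keeps
    gradient magnitudes comparable across temperatures."""
    def log_softmax(z):
        z = z / T
        z = z - z.max(axis=-1, keepdims=True)
        return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    log_p = log_softmax(np.asarray(teacher_logits, dtype=float))
    log_q = log_softmax(np.asarray(student_logits, dtype=float))
    return float((np.exp(log_p) * (log_p - log_q)).sum(-1).mean() * T * T)
```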
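NTK-aware RoPE scaling, mentioned in the context-extension bullet, raises the rotary base instead of compressing positions, so the highest-frequency rotary components are barely changed while low-frequency ones get interpolated. A sketch, assuming the common `base * scale**(d/(d-2))` formulation (the model's head dimension and trained context length are not stated here, so `scale` is left as a parameter):

```python
import numpy as np

def ntk_rope_inv_freqs(head_dim, scale, base=10000.0):
    """NTK-aware RoPE: multiply the base by scale**(d/(d-2)) rather
    than dividing positions by `scale`. `scale` is the ratio of the
    target context length to the trained context length."""
    new_base = base * scale ** (head_dim / (head_dim - 2))
    return 1.0 / new_base ** (np.arange(0, head_dim, 2) / head_dim)
```

With `scale=1.0` this reduces to vanilla RoPE frequencies, which is why the same code path can serve both the short-context and extended-context configurations.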