## Source Code

[unixsysdev/helm (h200-optimizations branch)](https://github.com/unixsysdev/helm/tree/h200-optimizations)

## Roadmap: 1.37B Pretraining

The 130M checkpoints in this repo are seeds for the **1.37B HELM-D** model (L24W1536A24), upscaled via Network Morphism:

1. **Width 384→1536**: Zero-pad Lorentz spatial dims (manifold constraint preserved exactly)
2. **Depth 6→24 layers**: Interleaved cloning, repeating the full 6-layer pipeline 4× with residual scaling
3. **All linear weights**: Top-left corner placement in expanded matrices, remainder N(0, 0.001)
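The three morphism steps can be sketched roughly as below. This is a minimal numpy sketch, not the repo's actual implementation: the function names, the time-coordinate-first hyperboloid convention, and the exact residual-scaling rule for cloned layers are all assumptions.

```python
import numpy as np

def widen_lorentz(x, new_dim):
    """Step 1: zero-pad the spatial part of Lorentz points.

    With x = (t, s_1..s_d), the Lorentz inner product
    <x, x>_L = -t^2 + |s|^2 is unchanged by appending zero spatial
    coordinates, so points stay on the hyperboloid exactly.
    """
    pad = new_dim - x.shape[-1]
    return np.concatenate([x, np.zeros(x.shape[:-1] + (pad,))], axis=-1)

def widen_linear(w, new_out, new_in, std=0.001, rng=None):
    """Step 3: old weights go in the top-left corner of the expanded
    matrix; the remainder is drawn from N(0, std^2)."""
    rng = np.random.default_rng(0) if rng is None else rng
    w_new = rng.normal(0.0, std, size=(new_out, new_in))
    w_new[: w.shape[0], : w.shape[1]] = w
    return w_new

def deepen(layers, times=4):
    """Step 2: interleaved cloning -- repeat the full pipeline `times`
    times, scaling each clone's residual branch by 1/times (assumed
    rule) so the deepened stack stays close to the original at init."""
    return [(layer, 1.0 / times) for _ in range(times) for layer in layers]
```

Zero-padding spatial coordinates is what makes the width step exact: the Lorentz constraint depends only on t and the spatial norm, and appended zeros change neither.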
The 1.37B model is currently training on **2B tokens from FineWeb-Edu** on a single NVIDIA H200.

### Next Steps
- **KL divergence distillation** from Qwen3-30B using Nebius SWE-agent trajectories (80K agentic tool-use sequences)
- **Context extension** to 128K via NTK-RoPE scaling
- **Fine-tuning** on agentic coding trajectories for downstream tool-use tasks
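The distillation objective named in the first bullet is the standard token-level KL between teacher and student next-token distributions. A minimal numpy sketch (the temperature, the KL direction, and the mean-over-positions reduction are assumptions; the README does not specify them):

```python
import numpy as np

def kl_distill_loss(teacher_logits, student_logits, T=1.0):
    """Token-level KL(teacher || student) on temperature-softened
    distributions, averaged over positions. The T^2 factor keeps
    gradient magnitudes comparable across temperatures."""
    def log_softmax(z):
        z = z / T
        z = z - z.max(axis=-1, keepdims=True)
        return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    log_p = log_softmax(np.asarray(teacher_logits, dtype=float))
    log_q = log_softmax(np.asarray(student_logits, dtype=float))
    return float((np.exp(log_p) * (log_p - log_q)).sum(-1).mean() * T * T)
```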
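NTK-aware RoPE scaling, mentioned in the context-extension bullet, raises the rotary base instead of compressing positions, so the highest-frequency rotary components are barely changed while low-frequency ones get interpolated. A sketch, assuming the common `base * scale**(d/(d-2))` formulation (the model's head dimension and trained context length are not stated here, so `scale` is left as a parameter):

```python
import numpy as np

def ntk_rope_inv_freqs(head_dim, scale, base=10000.0):
    """NTK-aware RoPE: multiply the base by scale**(d/(d-2)) rather
    than dividing positions by `scale`. `scale` is the ratio of the
    target context length to the trained context length."""
    new_base = base * scale ** (head_dim / (head_dim - 2))
    return 1.0 / new_base ** (np.arange(0, head_dim, 2) / head_dim)
```

With `scale=1.0` this reduces to vanilla RoPE frequencies, which is why the same code path can serve both the short-context and extended-context configurations.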