Sparse-First Trained Transformer (dim=2048)
A byte-level transformer trained with sparse-first training, a framework in which every stage of the pipeline operates on the active parameter subset rather than the full model.
Architecture
- dim: 2048
- layers: 4
- heads: 64 (GQA, 32 KV heads)
- FFN: 4096
- vocab: 256 (byte-level)
- params: 152M
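The snippet below is a minimal sketch of these hyperparameters plus a rough parameter count as a sanity check. `SparseFirstConfig` and `approx_params` are illustrative names, not the repository's actual API, and the count ignores norms, biases, and any untied output head.

```python
from dataclasses import dataclass

# Illustrative config mirroring the card above; field names are assumptions,
# not the repository's configuration class.
@dataclass
class SparseFirstConfig:
    dim: int = 2048          # model width
    n_layers: int = 4
    n_heads: int = 64        # query heads (head_dim = dim // n_heads = 32)
    n_kv_heads: int = 32     # GQA: 2 query heads share each KV head
    ffn_dim: int = 4096
    vocab_size: int = 256    # byte-level: one token per byte value

def approx_params(c: SparseFirstConfig) -> int:
    """Rough parameter count; ignores norms, biases, and any untied output head."""
    head_dim = c.dim // c.n_heads
    attn = c.dim * c.dim                              # wq
    attn += 2 * c.dim * (c.n_kv_heads * head_dim)     # wk, wv (GQA)
    attn += c.dim * c.dim                             # wo
    ffn = 3 * c.dim * c.ffn_dim                       # gate, up, down (gated FFN)
    embed = c.vocab_size * c.dim
    return c.n_layers * (attn + ffn) + embed

cfg = SparseFirstConfig()
print(f"{approx_params(cfg) / 1e6:.0f}M")  # ~152M, matching the card
```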
Training
Trained with the Helix DNA optimizer on an RTX 5090:
- 74 steps/s at dim=2048
- gate↔up (G≡C, 3 H-bonds), wq↔wo (A≡T), wk↔wv (A≡T) — 3 DNA pairs per layer
- Conductor-driven sparsity: only hot rows receive gradients, optimizer updates, and weight writeback (see the sparse-update sketch after this list)
- Immune system: an automatic checkpoint at each new loss floor, with a revert if the loss rebounds (sketched below)
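A minimal sketch of the hot-row update path, assuming the conductor exposes the selected rows as an index tensor. `sparse_row_update` and the SGD-with-momentum rule are illustrative stand-ins, not the Helix DNA optimizer's actual update; the real framework also restricts the backward pass to these rows.

```python
import torch

def sparse_row_update(weight: torch.Tensor,
                      grad: torch.Tensor,
                      momentum: torch.Tensor,
                      hot_rows: torch.Tensor,
                      lr: float = 1e-3,
                      beta: float = 0.9) -> None:
    """SGD-with-momentum restricted to the rows selected by the conductor."""
    g = grad.index_select(0, hot_rows)        # gradients only for hot rows
    m = momentum.index_select(0, hot_rows)
    m = beta * m + g                          # per-row momentum update
    momentum.index_copy_(0, hot_rows, m)      # optimizer-state writeback, hot rows only
    weight.index_add_(0, hot_rows, -lr * m)   # weight writeback, hot rows only

# Toy usage: update ~5% of a 4096x2048 matrix.
w = torch.randn(4096, 2048)
g = torch.randn_like(w)
state = torch.zeros_like(w)
hot = torch.randperm(4096)[:205]
sparse_row_update(w, g, state, hot)
```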
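And a hedged sketch of the immune-system behaviour: snapshot the weights at each new loss floor and roll back when the loss rebounds past a tolerance. The class name, threshold, and revert policy are placeholders, not the framework's implementation; `model` is any PyTorch `nn.Module`.

```python
import copy

class ImmuneSystem:
    """Checkpoint at new loss floors, revert if the loss rebounds (illustrative)."""

    def __init__(self, rebound_tolerance: float = 0.05):
        self.best_loss = float("inf")
        self.checkpoint = None
        self.rebound_tolerance = rebound_tolerance

    def step(self, model, loss: float) -> bool:
        """Returns True if the model was reverted to the last checkpoint."""
        if loss < self.best_loss:
            self.best_loss = loss                           # new loss floor
            self.checkpoint = copy.deepcopy(model.state_dict())
            return False
        if (self.checkpoint is not None
                and loss > self.best_loss * (1 + self.rebound_tolerance)):
            model.load_state_dict(self.checkpoint)          # revert on rebound
            return True
        return False
```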
Multi-GPU
On dual H100 SXM GPUs connected via NVLink, using Helix Dispatch (interleaved position parallelism):
- 21.7 steps/s at dim=4096 — 1.54x faster than PyTorch DDP
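The sketch below shows one plausible reading of interleaved position parallelism: sequence positions dealt round-robin across devices, so each GPU holds every n-th token. This is an assumption about how Helix Dispatch shards work, not its actual API.

```python
import torch

def interleave_positions(tokens: torch.Tensor, n_gpus: int) -> list[torch.Tensor]:
    """Split a (batch, seq_len) token tensor into per-GPU interleaved shards."""
    return [tokens[:, rank::n_gpus] for rank in range(n_gpus)]

tokens = torch.arange(16).unsqueeze(0)          # one sequence of 16 positions
shards = interleave_positions(tokens, n_gpus=2)
# shards[0] holds positions 0, 2, 4, ...; shards[1] holds positions 1, 3, 5, ...
```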
Usage
```bash
brew install open-ai-org/tap/ai
ai pull open-ai-org/sparse-first-2048
ai infer sparse-first-2048 "Hello"
```
Paper
Sparse-First Training: A Biologically-Inspired Framework