---
license: mit
tags:
  - sparse-first
  - helix
  - mongoose
language:
  - en
---

# Sparse-First Trained Transformer (dim=2048)

A byte-level transformer trained with sparse-first training, a framework where every stage of the pipeline (gradients, optimizer updates, weight writeback) operates on the active parameter subset rather than the full model.

## Architecture

- **dim**: 2048
- **layers**: 4
- **heads**: 64 (GQA, 32 KV heads)
- **FFN**: 4096
- **vocab**: 256 (byte-level)
- **params**: 152M (a parameter-count check appears at the end of this card)

## Training

Trained with the [Helix DNA optimizer](https://github.com/open-ai-org/helix) on an RTX 5090:

- **74 steps/s** at dim=2048
- Three DNA pairs per layer: gate↔up (G≡C, 3 H-bonds), wq↔wo (A≡T), wk↔wv (A≡T)
- Conductor-driven sparsity: only hot rows get gradients, optimizer updates, and weight writeback (sketched at the end of this card)
- Immune system: an automatic checkpoint is taken at each new loss floor, and training reverts to it if the loss rebounds

## Multi-GPU

On dual H100 SXM GPUs over NVLink with Helix Dispatch (interleaved position parallelism, sketched at the end of this card):

- **21.7 steps/s** at dim=4096, 1.54x faster than PyTorch DDP

## Usage

```bash
brew install open-ai-org/tap/ai
ai pull open-ai-org/sparse-first-2048
ai infer sparse-first-2048 "Hello"
```

## Paper

[Sparse-First Training: A Biologically-Inspired Framework](https://github.com/open-ai-org/ai/blob/master/docs/sparse-first-training.md)

## Framework

- [mongoose](https://github.com/open-ai-org/mongoose) — GPU compute engine
- [ai](https://github.com/open-ai-org/ai) — CLI
- [helix](https://github.com/open-ai-org/helix) — DNA optimizer
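
## Sketch: parameter count

A quick consistency check on the architecture table above. The config field names are hypothetical (this is not the repo's config schema), the output head is assumed tied to the byte embedding, and norm weights are omitted as negligible:

```python
from dataclasses import dataclass

# Hypothetical config mirroring the card's architecture table.
@dataclass
class Config:
    dim: int = 2048
    layers: int = 4
    heads: int = 64       # query heads -> head_dim = dim // heads = 32
    kv_heads: int = 32    # GQA: wk/wv project dim -> kv_heads * head_dim
    ffn: int = 4096
    vocab: int = 256      # byte-level

cfg = Config()
head_dim = cfg.dim // cfg.heads
kv_dim = cfg.kv_heads * head_dim
attn = 2 * cfg.dim * cfg.dim + 2 * cfg.dim * kv_dim  # wq, wo + wk, wv
ffn = 2 * cfg.dim * cfg.ffn + cfg.ffn * cfg.dim      # gate, up + down
total = cfg.layers * (attn + ffn) + cfg.vocab * cfg.dim  # + byte embedding
print(f"{total / 1e6:.0f}M parameters")  # -> 152M, matching the card
```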
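
## Sketch: conductor-driven sparsity

A minimal sketch of the hot-row idea, not the helix API. The conductor's actual row-selection policy is not documented on this card, so top-k by gradient norm is a stand-in assumption, and plain SGD stands in for the DNA optimizer:

```python
import torch

def hot_row_step(weight: torch.Tensor, grad: torch.Tensor,
                 k: int = 64, lr: float = 1e-2) -> torch.Tensor:
    """Update only the k 'hot' rows; cold rows see no gradient math,
    no optimizer-state traffic, and no weight writeback."""
    # Conductor stand-in: rank rows by gradient norm, keep the top k.
    hot = torch.topk(grad.norm(dim=1), k).indices
    # Sparse writeback: only the selected rows are touched.
    weight[hot] -= lr * grad[hot]
    return hot

w = torch.randn(2048, 2048)
g = torch.randn_like(w)
hot = hot_row_step(w, g)
print(f"updated {hot.numel()} of {w.shape[0]} rows")
```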
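
## Sketch: interleaved position parallelism

An illustration of what "interleaved" could mean for Helix Dispatch, inferred from the name only and not from the implementation: sequence positions are dealt out round-robin, so each GPU holds an even, interleaved slice of every sequence rather than a contiguous chunk:

```python
def interleave_positions(seq_len: int, n_gpus: int) -> list[list[int]]:
    # Round-robin: GPU g owns positions g, g + n_gpus, g + 2 * n_gpus, ...
    return [list(range(g, seq_len, n_gpus)) for g in range(n_gpus)]

# Eight positions on two GPUs:
# GPU 0 -> [0, 2, 4, 6], GPU 1 -> [1, 3, 5, 7]
print(interleave_positions(8, 2))
```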