---
license: mit
tags:
- sparse-first
- helix
- mongoose
language:
- en
---

# Sparse-First Trained Transformer (dim=2048)
|
|
A byte-level transformer trained with sparse-first training — the framework where every stage of the pipeline operates on the active parameter subset, not the full model.
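
The gist, as a minimal NumPy sketch (illustrative only; the row-selection rule, the toy gradient, and all names here are assumptions, not the framework's actual API): the forward pass, the gradient, the optimizer step, and the weight writeback all touch only the rows currently selected as active.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((2048, 2048)).astype(np.float32)  # full weight matrix stays resident
x = rng.standard_normal(2048).astype(np.float32)

# Select the active subset (here: the 256 rows with the largest L2 norm;
# the real selection policy belongs to the framework, this is just for illustration).
hot_rows = np.argsort(np.linalg.norm(W, axis=1))[-256:]

# Forward pass computes only the hot outputs.
y_hot = W[hot_rows] @ x

# Gradient, optimizer step, and writeback exist only for the hot rows.
grad_hot = np.outer(y_hot, x)          # toy gradient for the hot slice
W[hot_rows] -= 1e-3 * grad_hot         # writeback touches 256 of 2048 rows
```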
|
|
## Architecture
|
|
- **dim**: 2048
- **layers**: 4
- **heads**: 64 (GQA, 32 KV heads)
- **FFN**: 4096
- **vocab**: 256 (byte-level)
- **params**: 152M
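
Spelled out as a config object plus a rough parameter count (field names and the gate/up/down FFN layout are assumptions for illustration, not the framework's own definitions), these numbers land at roughly the quoted 152M:

```python
from dataclasses import dataclass

@dataclass
class ModelConfig:
    dim: int = 2048
    n_layers: int = 4
    n_heads: int = 64        # query heads
    n_kv_heads: int = 32     # grouped-query attention
    ffn_dim: int = 4096
    vocab_size: int = 256    # byte-level

    def approx_params(self) -> int:
        head_dim = self.dim // self.n_heads        # 32
        kv_dim = self.n_kv_heads * head_dim        # 1024
        attn = 2 * self.dim * self.dim + 2 * self.dim * kv_dim   # wq, wo, wk, wv
        ffn = 3 * self.dim * self.ffn_dim          # gate, up, down (SwiGLU assumed)
        return self.n_layers * (attn + ffn) + self.vocab_size * self.dim

print(f"{ModelConfig().approx_params() / 1e6:.0f}M parameters")   # -> 152M parameters
```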
|
|
## Training
|
|
Trained with the [Helix DNA optimizer](https://github.com/open-ai-org/helix) on an RTX 5090:
- **74 steps/s** at dim=2048
- gate↔up (G≡C, 3 H-bonds), wq↔wo (A=T), wk↔wv (A=T) — 3 DNA pairs per layer
- Conductor-driven sparsity: only hot rows get gradients, optimizer updates, and weight writeback (see the sketch after this list)
- Immune system: automatic checkpoint at loss floors, revert on rebound
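
A rough sketch of the conductor-driven step in plain PyTorch (a hand-rolled illustration of the idea, not Helix's implementation; the top-k hot-row selection, the momentum layout, and every name here are assumptions):

```python
import torch

dim, ffn_dim, hot_k, lr = 2048, 4096, 256, 1e-3
w_gate = torch.randn(ffn_dim, dim)       # full FFN gate matrix
m_gate = torch.zeros(ffn_dim, dim)       # momentum buffer (kept full-size for simplicity)
x = torch.randn(dim)

# Conductor picks the hot rows (selection policy is illustrative).
hot = torch.topk(w_gate.norm(dim=1), hot_k).indices

# Forward on the hot slice only; gradients exist only for that slice.
w_hot = w_gate[hot].clone().requires_grad_(True)
loss = torch.relu(w_hot @ x).sum()
loss.backward()

# Optimizer update and weight writeback, both restricted to the hot rows.
m_gate[hot] = 0.9 * m_gate[hot] + w_hot.grad
w_gate[hot] -= lr * m_gate[hot]
```

In a real loop the hot set changes between steps and the optimizer state would follow it; the sketch only shows that no dense work ever lands on the cold rows.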
|
|
## Multi-GPU
|
|
On dual H100 SXM NVLink with Helix Dispatch (interleaved position parallelism):
- **21.7 steps/s** at dim=4096 — 1.54x faster than PyTorch DDP
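
Read literally, interleaved position parallelism assigns each GPU every Nth token position rather than a contiguous chunk. A minimal sketch of that assignment (purely illustrative; this is not Helix Dispatch's scheduling code):

```python
def interleaved_positions(seq_len: int, world_size: int, rank: int) -> list[int]:
    """Token positions owned by `rank` under a strided (interleaved) split."""
    return list(range(rank, seq_len, world_size))

# Two GPUs splitting an 8-token sequence:
print(interleaved_positions(8, 2, 0))  # [0, 2, 4, 6]
print(interleaved_positions(8, 2, 1))  # [1, 3, 5, 7]
```

One plausible motivation for the interleaving: under a causal mask, later positions attend over more context, so contiguous chunks would overload the rank holding the tail, while a strided split keeps the load roughly even.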
|
|
## Usage
|
|
```bash
brew install open-ai-org/tap/ai
ai pull open-ai-org/sparse-first-2048
ai infer sparse-first-2048 "Hello"
```
|
|
## Paper
|
|
[Sparse-First Training: A Biologically-Inspired Framework](https://github.com/open-ai-org/ai/blob/master/docs/sparse-first-training.md)
|
|
## Framework
|
|
- [mongoose](https://github.com/open-ai-org/mongoose) — GPU compute engine
- [ai](https://github.com/open-ai-org/ai) — CLI
- [helix](https://github.com/open-ai-org/helix) — DNA optimizer
|
|