---
license: mit
tags:
- sparse-first
- helix
- mongoose
language:
- en
---
# Sparse-First Trained Transformer (dim=2048)
A byte-level transformer trained with sparse-first training, a framework in which every stage of the pipeline operates on the active parameter subset rather than the full model.
## Architecture
- **dim**: 2048
- **layers**: 4
- **heads**: 64 (GQA, 32 KV heads)
- **FFN**: 4096
- **vocab**: 256 (byte-level)
- **params**: 152M
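
The 152M figure can be roughly reproduced from the other numbers in this list. The sketch below is a back-of-the-envelope check, not the project's own accounting. It assumes a gated FFN with gate, up, and down projections (only gate and up are named in this card), untied input and output embeddings, no biases, and it ignores norm weights.

```python
# Rough parameter count from the architecture list above.
# Assumptions (not stated in the model card): a gated FFN with gate, up,
# and down projections; untied input/output embeddings; no biases; norm
# weights ignored.

DIM = 2048
LAYERS = 4
HEADS = 64
KV_HEADS = 32
FFN = 4096
VOCAB = 256


def count_params() -> int:
    head_dim = DIM // HEADS                       # 32
    kv_dim = KV_HEADS * head_dim                  # 1024; GQA shrinks wk and wv
    attention = 2 * DIM * DIM + 2 * DIM * kv_dim  # wq and wo, then wk and wv
    ffn = 3 * DIM * FFN                           # gate, up, down (down is assumed)
    embeddings = 2 * VOCAB * DIM                  # input table plus output head
    return LAYERS * (attention + ffn) + embeddings


if __name__ == "__main__":
    total = count_params()
    print(f"{total:,} parameters (~{total / 1e6:.0f}M)")
```

Under these assumptions the total is 152,043,520, which rounds to the stated 152M. Almost all of it sits in the four transformer layers; the byte-level vocabulary contributes only about 1M.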
## Training
Trained with the [Helix DNA optimizer](https://github.com/open-ai-org/helix) on an RTX 5090:
- **74 steps/s** at dim=2048
- 3 DNA pairs per layer: gate↔up (G≡C, 3 H-bonds), wq↔wo (A≡T), wk↔wv (A≡T)
- Conductor-driven sparsity: only hot rows get gradients, optimizer updates, and weight writeback
- Immune system: automatic checkpoint at loss floors, revert on rebound
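
To make the "only hot rows" idea concrete, here is a minimal sketch of a row-sparse update. It is illustrative only: the function and argument names are hypothetical, plain SGD stands in for the Helix optimizer, and how the Conductor actually selects hot rows is not described in this card, so the indices are simply passed in.

```python
# Illustrative row-sparse update: only "hot" rows are read, updated, and
# written back. All names are hypothetical; plain SGD stands in for Helix.

def sparse_sgd_step(weights, hot_rows, row_grads, lr=0.1):
    """Update only the rows listed in hot_rows, in place.

    weights:   list of rows (each a list of floats)
    hot_rows:  indices of the active rows
    row_grads: gradients for the hot rows only, in the same order
    """
    for row_index, grad in zip(hot_rows, row_grads):
        row = weights[row_index]
        for col, g in enumerate(grad):
            row[col] -= lr * g
    return weights


if __name__ == "__main__":
    w = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
    sparse_sgd_step(w, hot_rows=[0, 2], row_grads=[[1.0, 0.0], [0.0, 1.0]], lr=0.5)
    print(w)  # row 1 is never touched
```

The cost of a step then scales with the number of hot rows rather than the full matrix, which is the property the list above describes for gradients, optimizer updates, and weight writeback.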
## Multi-GPU
On two NVLink-connected H100 SXM GPUs with Helix Dispatch (interleaved position parallelism):
- **21.7 steps/s** at dim=4096 — 1.54x faster than PyTorch DDP
## Usage
```bash
brew install open-ai-org/tap/ai
ai pull open-ai-org/sparse-first-2048
ai infer sparse-first-2048 "Hello"
```
## Paper
[Sparse-First Training: A Biologically-Inspired Framework](https://github.com/open-ai-org/ai/blob/master/docs/sparse-first-training.md)
## Framework
- [mongoose](https://github.com/open-ai-org/mongoose) — GPU compute engine
- [ai](https://github.com/open-ai-org/ai) — CLI
- [helix](https://github.com/open-ai-org/helix) — DNA optimizer