---
license: mit
tags:
- sparse-first
- helix
- mongoose
language:
- en
---
# Sparse-First Trained Transformer (dim=2048)
A byte-level transformer trained with sparse-first training, a framework in which every stage of the pipeline operates on the active parameter subset rather than the full model.
## Architecture
- **dim**: 2048
- **layers**: 4
- **heads**: 64 (GQA, 32 KV heads)
- **FFN**: 4096
- **vocab**: 256 (byte-level)
- **params**: 152M
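The 152M figure can be reproduced from the shapes above. A minimal sketch, assuming a gated SwiGLU-style FFN (gate/up/down), untied input and output embeddings, and biasless linears; none of those details are stated on the card:

```python
# Hypothetical parameter count for the card's shapes. Assumptions
# (not stated on the card): gated SwiGLU-style FFN, untied embeddings,
# no biases, negligible norm parameters.

DIM, LAYERS, HEADS, KV_HEADS, FFN, VOCAB = 2048, 4, 64, 32, 4096, 256
HEAD_DIM = DIM // HEADS       # 32
KV_DIM = KV_HEADS * HEAD_DIM  # 1024 (GQA: fewer KV heads than Q heads)

attn = DIM * DIM      # wq
attn += DIM * KV_DIM  # wk (projects down to the KV-head width)
attn += DIM * KV_DIM  # wv
attn += DIM * DIM     # wo

ffn = DIM * FFN  # gate
ffn += DIM * FFN  # up
ffn += FFN * DIM  # down

total = LAYERS * (attn + ffn) + 2 * VOCAB * DIM  # + embed and output head
print(f"{total / 1e6:.1f}M parameters")  # 152.0M
```

Note how cheap the byte-level vocabulary is: the two embedding matrices together contribute only about 1M of the 152M parameters.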
## Training
Trained with the [Helix DNA optimizer](https://github.com/open-ai-org/helix) on an RTX 5090:
- **74 steps/s** at dim=2048
- Three DNA-style weight pairs per layer: gate↔up (G≡C, 3 H-bonds), wq↔wo (A=T, 2 H-bonds), wk↔wv (A=T, 2 H-bonds)
- Conductor-driven sparsity: only hot rows get gradients, optimizer updates, and weight writeback
- Immune system: automatic checkpoints at each new loss floor, reverting if the loss rebounds
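The hot-row sparsity described above can be illustrated with a toy update step. Everything here (`sparse_step`, `hot_rows`, the plain-SGD rule, the shapes) is illustrative, not the actual Helix/conductor implementation:

```python
# Toy illustration of conductor-driven sparsity: only rows marked hot
# receive gradients and weight writeback. Names and update rule are
# illustrative, not Helix internals.

def sparse_step(weight, grad, hot_rows, lr=0.1):
    """Update only the hot rows; cold rows get no gradient,
    no optimizer update, and no writeback."""
    for r in hot_rows:
        weight[r] = [w - lr * g for w, g in zip(weight[r], grad[r])]
    return weight

W = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
G = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
W = sparse_step(W, G, hot_rows={0, 2})
print(W)  # rows 0 and 2 move toward 0.95; row 1 is untouched
```

The point of doing this at every stage, rather than only masking the final writeback, is that the cold rows never cost backward FLOPs or optimizer-state traffic in the first place.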
## Multi-GPU
On dual NVLink-connected H100 SXM GPUs with Helix Dispatch (interleaved position parallelism):
- **21.7 steps/s** at dim=4096 — 1.54x faster than PyTorch DDP
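Reading "interleaved position parallelism" as a round-robin assignment of sequence positions to devices, the partitioning can be sketched as follows; this interpretation is an assumption, not Helix Dispatch's documented scheme:

```python
# Toy round-robin split of sequence positions across devices.
# Assumed reading of "interleaved position parallelism"; not
# Helix Dispatch's actual dispatch logic.

def interleave(seq_len, n_devices):
    """Device d owns positions d, d + n_devices, d + 2*n_devices, ..."""
    return [list(range(d, seq_len, n_devices)) for d in range(n_devices)]

shards = interleave(seq_len=8, n_devices=2)
print(shards)  # [[0, 2, 4, 6], [1, 3, 5, 7]]
```

Interleaving (rather than splitting the sequence into contiguous halves) keeps per-device work balanced when per-position cost varies, e.g. when attention cost grows with position under a causal mask.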
## Usage
```bash
brew install open-ai-org/tap/ai
ai pull open-ai-org/sparse-first-2048
ai infer sparse-first-2048 "Hello"
```
## Paper
[Sparse-First Training: A Biologically-Inspired Framework](https://github.com/open-ai-org/ai/blob/master/docs/sparse-first-training.md)
## Framework
- [mongoose](https://github.com/open-ai-org/mongoose) — GPU compute engine
- [ai](https://github.com/open-ai-org/ai) — CLI
- [helix](https://github.com/open-ai-org/helix) — DNA optimizer