---
license: mit
tags:
  - sparse-first
  - helix
  - mongoose
language:
  - en
---

# Sparse-First Trained Transformer (dim=2048)

A byte-level transformer trained with sparse-first training, a framework where every stage of the pipeline (gradients, optimizer updates, weight writeback) operates on the active parameter subset rather than the full model.

## Architecture

- **dim**: 2048
- **layers**: 4
- **heads**: 64 (GQA, 32 KV heads)
- **FFN**: 4096
- **vocab**: 256 (byte-level)
- **params**: 152M (a parameter-count check appears at the end of this card)

## Training

Trained with the [Helix DNA optimizer](https://github.com/open-ai-org/helix) on an RTX 5090:

- **74 steps/s** at dim=2048
- Three DNA pairs per layer: gate↔up (G≡C, 3 H-bonds), wq↔wo (A≡T), wk↔wv (A≡T)
- Conductor-driven sparsity: only hot rows get gradients, optimizer updates, and weight writeback (sketched at the end of this card)
- Immune system: an automatic checkpoint is taken at each new loss floor, and training reverts to it if the loss rebounds

## Multi-GPU

On dual H100 SXM GPUs over NVLink with Helix Dispatch (interleaved position parallelism, sketched at the end of this card):

- **21.7 steps/s** at dim=4096, 1.54x faster than PyTorch DDP

## Usage

```bash
brew install open-ai-org/tap/ai
ai pull open-ai-org/sparse-first-2048
ai infer sparse-first-2048 "Hello"
```

## Paper

[Sparse-First Training: A Biologically-Inspired Framework](https://github.com/open-ai-org/ai/blob/master/docs/sparse-first-training.md)

## Framework

- [mongoose](https://github.com/open-ai-org/mongoose) — GPU compute engine
- [ai](https://github.com/open-ai-org/ai) — CLI
- [helix](https://github.com/open-ai-org/helix) — DNA optimizer
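
## Sketch: parameter count

A quick consistency check on the architecture table above. The config field names are hypothetical (this is not the repo's config schema), the output head is assumed tied to the byte embedding, and norm weights are omitted as negligible:

```python
from dataclasses import dataclass

# Hypothetical config mirroring the card's architecture table.
@dataclass
class Config:
    dim: int = 2048
    layers: int = 4
    heads: int = 64       # query heads -> head_dim = dim // heads = 32
    kv_heads: int = 32    # GQA: wk/wv project dim -> kv_heads * head_dim
    ffn: int = 4096
    vocab: int = 256      # byte-level

cfg = Config()
head_dim = cfg.dim // cfg.heads
kv_dim = cfg.kv_heads * head_dim
attn = 2 * cfg.dim * cfg.dim + 2 * cfg.dim * kv_dim  # wq, wo + wk, wv
ffn = 2 * cfg.dim * cfg.ffn + cfg.ffn * cfg.dim      # gate, up + down
total = cfg.layers * (attn + ffn) + cfg.vocab * cfg.dim  # + byte embedding
print(f"{total / 1e6:.0f}M parameters")  # -> 152M, matching the card
```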
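
## Sketch: conductor-driven sparsity

A minimal sketch of the hot-row idea, not the helix API. The conductor's actual row-selection policy is not documented on this card, so top-k by gradient norm is a stand-in assumption, and plain SGD stands in for the DNA optimizer:

```python
import torch

def hot_row_step(weight: torch.Tensor, grad: torch.Tensor,
                 k: int = 64, lr: float = 1e-2) -> torch.Tensor:
    """Update only the k 'hot' rows; cold rows see no gradient math,
    no optimizer-state traffic, and no weight writeback."""
    # Conductor stand-in: rank rows by gradient norm, keep the top k.
    hot = torch.topk(grad.norm(dim=1), k).indices
    # Sparse writeback: only the selected rows are touched.
    weight[hot] -= lr * grad[hot]
    return hot

w = torch.randn(2048, 2048)
g = torch.randn_like(w)
hot = hot_row_step(w, g)
print(f"updated {hot.numel()} of {w.shape[0]} rows")
```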
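
## Sketch: interleaved position parallelism

An illustration of what "interleaved" could mean for Helix Dispatch, inferred from the name only and not from the implementation: sequence positions are dealt out round-robin, so each GPU holds an even, interleaved slice of every sequence rather than a contiguous chunk:

```python
def interleave_positions(seq_len: int, n_gpus: int) -> list[list[int]]:
    # Round-robin: GPU g owns positions g, g + n_gpus, g + 2 * n_gpus, ...
    return [list(range(g, seq_len, n_gpus)) for g in range(n_gpus)]

# Eight positions on two GPUs:
# GPU 0 -> [0, 2, 4, 6], GPU 1 -> [1, 3, 5, 7]
print(interleave_positions(8, 2))
```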