Sparse-First Trained Transformer (dim=2048)

A byte-level transformer built with sparse-first training, a framework in which every stage of the pipeline (gradients, optimizer updates, weight writeback) operates on the active parameter subset rather than the full model.

Architecture

  • dim: 2048
  • layers: 4
  • heads: 64 (GQA, 32 KV heads)
  • FFN: 4096
  • vocab: 256 (byte-level)
  • params: 152M
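
The 152M figure is consistent with two assumptions not stated in the list: a SwiGLU-style FFN (gate/up/down matrices, implied by the gate↔up pairing below) and tied input/output embeddings. A quick check, with norm weights ignored as negligible:

dim, layers, ffn, vocab = 2048, 4, 4096, 256
n_heads, n_kv_heads = 64, 32
head_dim = dim // n_heads                        # 32

attn = dim * n_heads * head_dim                  # wq: 4.19M
attn += 2 * dim * n_kv_heads * head_dim          # wk + wv (GQA): 2 x 2.10M
attn += n_heads * head_dim * dim                 # wo: 4.19M
mlp = 3 * dim * ffn                              # gate + up + down: 25.2M

total = layers * (attn + mlp) + vocab * dim      # + 0.5M tied embedding
print(f"{total / 1e6:.1f}M params")              # 151.5M, reported as 152M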

Training

Trained with the Helix DNA optimizer on a single RTX 5090:

  • 74 steps/s at dim=2048
  • gate↔up (G≡C, 3 H-bonds), wq↔wo (A=T, 2 H-bonds), wk↔wv (A=T, 2 H-bonds): three DNA pairs per layer
  • Conductor-driven sparsity: only hot rows get gradients, optimizer updates, and weight writeback (see the sketch after this list)
  • Immune system: automatic checkpoint at each new loss floor, revert on rebound (sketched below)
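
A minimal sketch of the hot-row step, assuming rows are ranked by per-row gradient magnitude (the conductor's actual selection policy is not documented in this card):

import numpy as np

def sparse_row_update(weight, grad, lr=1e-3, hot_fraction=0.1):
    """Apply the update only to the hottest rows of one weight matrix.

    Cold rows get no gradient use, no optimizer-state update, and no
    weight writeback, so their memory traffic is skipped entirely.
    """
    row_norms = np.linalg.norm(grad, axis=1)      # heat proxy (assumption)
    k = max(1, int(hot_fraction * weight.shape[0]))
    hot = np.argpartition(row_norms, -k)[-k:]     # indices of the k hottest rows
    weight[hot] -= lr * grad[hot]                 # writeback touches hot rows only
    return hot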
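
And a minimal sketch of the immune system's checkpoint/revert loop, assuming a simple rebound tolerance (the actual trigger conditions are not documented in this card):

import copy

class ImmuneSystem:
    """Snapshot weights at each new loss floor; hand back the snapshot on rebound."""

    def __init__(self, rebound_tolerance=0.05):   # tolerance value is an assumption
        self.best_loss = float("inf")
        self.snapshot = None
        self.rebound_tolerance = rebound_tolerance

    def observe(self, state, loss):
        if loss < self.best_loss:                 # new loss floor
            self.best_loss = loss
            self.snapshot = copy.deepcopy(state)  # automatic checkpoint
            return None
        rebound = self.best_loss * (1 + self.rebound_tolerance)
        if self.snapshot is not None and loss > rebound:
            return self.snapshot                  # caller reverts to this state
        return None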

Multi-GPU

On dual NVLink-connected H100 SXM GPUs with Helix Dispatch (interleaved position parallelism, sketched below):

  • 21.7 steps/s at dim=4096, 1.54x faster than PyTorch DDP (roughly 14.1 steps/s)
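
A minimal sketch of the interleaving itself (assumption: this shows only the round-robin position assignment; Helix Dispatch's scheduling and inter-GPU communication are not described in this card):

def shard_positions(seq_len, num_gpus):
    """Assign sequence position i to GPU i % num_gpus (interleaved, not blocked)."""
    return [list(range(rank, seq_len, num_gpus)) for rank in range(num_gpus)]

# 8 positions across 2 GPUs: GPU 0 gets [0, 2, 4, 6], GPU 1 gets [1, 3, 5, 7].
# Interleaving balances per-position cost across devices, e.g. causal
# attention work that grows with position index.
print(shard_positions(8, 2))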

Usage

brew install open-ai-org/tap/ai
ai pull open-ai-org/sparse-first-2048
ai infer sparse-first-2048 "Hello"

Paper

Sparse-First Training: A Biologically-Inspired Framework
