Sparse-First Trained Transformer (dim=2048)

A byte-level transformer built with sparse-first training, a framework in which every stage of the pipeline (gradients, optimizer updates, weight writeback) operates on the active parameter subset rather than the full model.

Architecture

  • dim: 2048
  • layers: 4
  • heads: 64 (GQA, 32 KV heads)
  • FFN: 4096
  • vocab: 256 (byte-level)
  • params: 152M
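
The 152M figure is consistent with two assumptions not stated in the list: a SwiGLU-style FFN (gate/up/down matrices, implied by the gate↔up pairing below) and tied input/output embeddings. A quick check, with norm weights ignored as negligible:

dim, layers, ffn, vocab = 2048, 4, 4096, 256
n_heads, n_kv_heads = 64, 32
head_dim = dim // n_heads                        # 32

attn = dim * n_heads * head_dim                  # wq: 4.19M
attn += 2 * dim * n_kv_heads * head_dim          # wk + wv (GQA): 2 x 2.10M
attn += n_heads * head_dim * dim                 # wo: 4.19M
mlp = 3 * dim * ffn                              # gate + up + down: 25.2M

total = layers * (attn + mlp) + vocab * dim      # + 0.5M tied embedding
print(f"{total / 1e6:.1f}M params")              # 151.5M, reported as 152M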

Training

Trained with the Helix DNA optimizer on a single RTX 5090:

  • 74 steps/s at dim=2048
  • gate↔up (G≡C, 3 H-bonds), wq↔wo (A=T, 2 H-bonds), wk↔wv (A=T, 2 H-bonds): three DNA pairs per layer
  • Conductor-driven sparsity: only hot rows get gradients, optimizer updates, and weight writeback (see the sketch after this list)
  • Immune system: automatic checkpoint at each new loss floor, revert on rebound (sketched below)
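
A minimal sketch of the hot-row step, assuming rows are ranked by per-row gradient magnitude (the conductor's actual selection policy is not documented in this card):

import numpy as np

def sparse_row_update(weight, grad, lr=1e-3, hot_fraction=0.1):
    """Apply the update only to the hottest rows of one weight matrix.

    Cold rows get no gradient use, no optimizer-state update, and no
    weight writeback, so their memory traffic is skipped entirely.
    """
    row_norms = np.linalg.norm(grad, axis=1)      # heat proxy (assumption)
    k = max(1, int(hot_fraction * weight.shape[0]))
    hot = np.argpartition(row_norms, -k)[-k:]     # indices of the k hottest rows
    weight[hot] -= lr * grad[hot]                 # writeback touches hot rows only
    return hot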
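
And a minimal sketch of the immune system's checkpoint/revert loop, assuming a simple rebound tolerance (the actual trigger conditions are not documented in this card):

import copy

class ImmuneSystem:
    """Snapshot weights at each new loss floor; hand back the snapshot on rebound."""

    def __init__(self, rebound_tolerance=0.05):   # tolerance value is an assumption
        self.best_loss = float("inf")
        self.snapshot = None
        self.rebound_tolerance = rebound_tolerance

    def observe(self, state, loss):
        if loss < self.best_loss:                 # new loss floor
            self.best_loss = loss
            self.snapshot = copy.deepcopy(state)  # automatic checkpoint
            return None
        rebound = self.best_loss * (1 + self.rebound_tolerance)
        if self.snapshot is not None and loss > rebound:
            return self.snapshot                  # caller reverts to this state
        return None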

Multi-GPU

On dual NVLink-connected H100 SXM GPUs with Helix Dispatch (interleaved position parallelism, sketched below):

  • 21.7 steps/s at dim=4096, 1.54x faster than PyTorch DDP (roughly 14.1 steps/s)
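
A minimal sketch of the interleaving itself (assumption: this shows only the round-robin position assignment; Helix Dispatch's scheduling and inter-GPU communication are not described in this card):

def shard_positions(seq_len, num_gpus):
    """Assign sequence position i to GPU i % num_gpus (interleaved, not blocked)."""
    return [list(range(rank, seq_len, num_gpus)) for rank in range(num_gpus)]

# 8 positions across 2 GPUs: GPU 0 gets [0, 2, 4, 6], GPU 1 gets [1, 3, 5, 7].
# Interleaving balances per-position cost across devices, e.g. causal
# attention work that grows with position index.
print(shard_positions(8, 2))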

Usage

brew install open-ai-org/tap/ai
ai pull open-ai-org/sparse-first-2048
ai infer sparse-first-2048 "Hello"

Paper

Sparse-First Training: A Biologically-Inspired Framework
