---
license: mit
tags:
- sparse-first
- helix
- mongoose
language:
- en
---

# Sparse-First Trained Transformer (dim=2048)
|
|
A byte-level transformer trained with sparse-first training — the framework where every stage of the pipeline operates on the active parameter subset, not the full model.
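
The gist, as a minimal NumPy sketch (illustrative only; the row-selection rule, the toy gradient, and all names here are assumptions, not the framework's actual API): the forward pass, the gradient, the optimizer step, and the weight writeback all touch only the rows currently selected as active.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((2048, 2048)).astype(np.float32)  # full weight matrix stays resident
x = rng.standard_normal(2048).astype(np.float32)

# Select the active subset (here: the 256 rows with the largest L2 norm;
# the real selection policy belongs to the framework, this is just for illustration).
hot_rows = np.argsort(np.linalg.norm(W, axis=1))[-256:]

# Forward pass computes only the hot outputs.
y_hot = W[hot_rows] @ x

# Gradient, optimizer step, and writeback exist only for the hot rows.
grad_hot = np.outer(y_hot, x)          # toy gradient for the hot slice
W[hot_rows] -= 1e-3 * grad_hot         # writeback touches 256 of 2048 rows
```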
|
|
## Architecture
|
|
- **dim**: 2048
- **layers**: 4
- **heads**: 64 (GQA, 32 KV heads)
- **FFN**: 4096
- **vocab**: 256 (byte-level)
- **params**: 152M
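
Spelled out as a config object plus a rough parameter count (field names and the gate/up/down FFN layout are assumptions for illustration, not the framework's own definitions), these numbers land at roughly the quoted 152M:

```python
from dataclasses import dataclass

@dataclass
class ModelConfig:
    dim: int = 2048
    n_layers: int = 4
    n_heads: int = 64        # query heads
    n_kv_heads: int = 32     # grouped-query attention
    ffn_dim: int = 4096
    vocab_size: int = 256    # byte-level

    def approx_params(self) -> int:
        head_dim = self.dim // self.n_heads        # 32
        kv_dim = self.n_kv_heads * head_dim        # 1024
        attn = 2 * self.dim * self.dim + 2 * self.dim * kv_dim   # wq, wo, wk, wv
        ffn = 3 * self.dim * self.ffn_dim          # gate, up, down (SwiGLU assumed)
        return self.n_layers * (attn + ffn) + self.vocab_size * self.dim

print(f"{ModelConfig().approx_params() / 1e6:.0f}M parameters")   # -> 152M parameters
```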
|
|
## Training
|
|
Trained with the [Helix DNA optimizer](https://github.com/open-ai-org/helix) on an RTX 5090:
- **74 steps/s** at dim=2048
- gate↔up (G≡C, 3 H-bonds), wq↔wo (A=T), wk↔wv (A=T) — 3 DNA pairs per layer
- Conductor-driven sparsity: only hot rows get gradients, optimizer updates, and weight writeback (see the sketch after this list)
- Immune system: automatic checkpoint at loss floors, revert on rebound
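
A rough sketch of the conductor-driven step in plain PyTorch (a hand-rolled illustration of the idea, not Helix's implementation; the top-k hot-row selection, the momentum layout, and every name here are assumptions):

```python
import torch

dim, ffn_dim, hot_k, lr = 2048, 4096, 256, 1e-3
w_gate = torch.randn(ffn_dim, dim)       # full FFN gate matrix
m_gate = torch.zeros(ffn_dim, dim)       # momentum buffer (kept full-size for simplicity)
x = torch.randn(dim)

# Conductor picks the hot rows (selection policy is illustrative).
hot = torch.topk(w_gate.norm(dim=1), hot_k).indices

# Forward on the hot slice only; gradients exist only for that slice.
w_hot = w_gate[hot].clone().requires_grad_(True)
loss = torch.relu(w_hot @ x).sum()
loss.backward()

# Optimizer update and weight writeback, both restricted to the hot rows.
m_gate[hot] = 0.9 * m_gate[hot] + w_hot.grad
w_gate[hot] -= lr * m_gate[hot]
```

In a real loop the hot set changes between steps and the optimizer state would follow it; the sketch only shows that no dense work ever lands on the cold rows.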
|
|
## Multi-GPU
|
|
On dual H100 SXM NVLink with Helix Dispatch (interleaved position parallelism):
- **21.7 steps/s** at dim=4096 — 1.54x faster than PyTorch DDP
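
Read literally, interleaved position parallelism assigns each GPU every Nth token position rather than a contiguous chunk. A minimal sketch of that assignment (purely illustrative; this is not Helix Dispatch's scheduling code):

```python
def interleaved_positions(seq_len: int, world_size: int, rank: int) -> list[int]:
    """Token positions owned by `rank` under a strided (interleaved) split."""
    return list(range(rank, seq_len, world_size))

# Two GPUs splitting an 8-token sequence:
print(interleaved_positions(8, 2, 0))  # [0, 2, 4, 6]
print(interleaved_positions(8, 2, 1))  # [1, 3, 5, 7]
```

One plausible motivation for the interleaving: under a causal mask, later positions attend over more context, so contiguous chunks would overload the rank holding the tail, while a strided split keeps the load roughly even.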
|
|
## Usage
|
|
```bash
brew install open-ai-org/tap/ai
ai pull open-ai-org/sparse-first-2048
ai infer sparse-first-2048 "Hello"
```
|
|
## Paper
|
|
[Sparse-First Training: A Biologically-Inspired Framework](https://github.com/open-ai-org/ai/blob/master/docs/sparse-first-training.md)
|
|
## Framework
|
|
- [mongoose](https://github.com/open-ai-org/mongoose) — GPU compute engine
- [ai](https://github.com/open-ai-org/ai) — CLI
- [helix](https://github.com/open-ai-org/helix) — DNA optimizer
|
|