# Kernel Development Notes – WYRM / GLADIUS

> "Build the spine before you grow the heads."

---

## Current State (Day 58 – April 8, 2026)

**Training:** v27 FINAL, 565M dense, ~2.67% complete (step ~400/15000)
**Architecture:** 1024d / 24L / 32H / 4096 FFN – Synthase depth attention
**Platform:** Kaggle T4 (16GB), using ~5.56 GB VRAM
**Kernel modules:** 17 files in this directory (see below)
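
For orientation, here is the v27 geometry written out as code, with a rough backbone parameter estimate. This is a sketch only: the field names and the attention/FFN cost model are assumptions, not the actual `KernelConfig` in `config.py`.

```python
from dataclasses import dataclass

@dataclass
class KernelGeometry:
    """Hypothetical mirror of the v27 geometry; field names are illustrative,
    not necessarily those used by KernelConfig in config.py."""
    d_model: int = 1024    # hidden width
    n_layers: int = 24     # depth
    n_heads: int = 32      # attention heads -> head_dim = 1024 // 32 = 32
    d_ffn: int = 4096      # FFN inner width

    def backbone_params(self) -> int:
        # Rough per-layer cost: 4*d^2 for attention projections + 2*d*d_ffn for the FFN.
        per_layer = 4 * self.d_model ** 2 + 2 * self.d_model * self.d_ffn
        return per_layer * self.n_layers

print(f"~{KernelGeometry().backbone_params() / 1e6:.0f}M backbone params")  # ~302M
# The remaining ~260M of the 565M total would sit in embeddings and the extra
# kernel modules (memory, cognition, Gaussian head, ...); a rough split, not a measurement.
```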

## Where We Are

The kernel is forming. Synthase depth profiles, PUP uncertainty, SLA², plug membranes, Gaussian head: all active, all training. The curriculum is running all four phases (foundation → reasoning → depth → omega). Loss is grinding down. The re-entry pattern (loss spikes, then recovers to a lower level) is the real signal.

The kernel IS the research contribution. Everything else builds on it.

## The MoE Decision (Day 58)

### Analysis Done
The router (`router.py`) exists with 5 specialist slots (reasoning, math, code, general, gaussian) but is only called for `balance_loss`: a regularization term on a routing decision that is never actually made. No token is dispatched through a specialist FFN.
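
To make the gap concrete, here is a minimal sketch of the difference between what happens today (an auxiliary balance loss computed from router probabilities, with nothing consuming the decision) and what real top-2 dispatch would add. The function names and the particular balance formulation are illustrative, not the `router.py` API.

```python
import torch

def balance_loss_only(router_logits: torch.Tensor) -> torch.Tensor:
    """Current situation: a load-balancing regularizer over router probabilities.
    Nothing downstream consumes the routing decision."""
    probs = router_logits.softmax(dim=-1)      # (tokens, n_experts)
    expert_load = probs.mean(dim=0)            # average probability mass per expert
    # Push toward uniform load; one common formulation, not necessarily router.py's.
    return (expert_load * expert_load.shape[0]).pow(2).mean()

def top2_dispatch(x: torch.Tensor, router_logits: torch.Tensor,
                  experts: torch.nn.ModuleList) -> torch.Tensor:
    """What real MoE would add: each token's output is a weighted sum of its top-2 experts."""
    weights, idx = router_logits.softmax(dim=-1).topk(2, dim=-1)   # (tokens, 2)
    weights = weights / weights.sum(dim=-1, keepdim=True)          # renormalize over top-2
    out = torch.zeros_like(x)
    for slot in range(2):
        for e, expert in enumerate(experts):
            mask = idx[:, slot] == e
            if mask.any():
                out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
    return out
```

The point of the comparison: until something like `top2_dispatch` exists in the forward pass, the balance term regularizes a decision that has no consequences.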

The model already has natural specialist pathways:
- **Language** – BPE tokenizer in, BPE logits out
- **Mathematics** – math tokenizer in, math logits out
- **Spatial** – Gaussian head (3D splats out)
- **Reasoning** – cognition module, depth-dependent, PUP uncertainty
- **Raw** – byte-level fallback

These are the Hydra's heads. They just aren't wired as MoE yet.

### VRAM Calculation
| Config | Params | VRAM (T4, grad_ckpt) | Fits? |
|--------|--------|----------------------|-------|
| Dense (current) | 565M | 5.56 GB | ✅ |
| MoE 3 experts × 4096 | 770M | ~8.0 GB | ✅ |
| MoE 4 experts × 4096 | 971M | ~9.9 GB | ✅ |
| MoE 5 experts × 4096 | 1,173M | ~11.8 GB | ✅ |

All configurations fit on the T4. The 5-expert Hydra at 1.17B runs at close to dense speed: top-2 routing activates only two expert FFNs per token, so per-token compute grows only modestly over dense while total capacity roughly doubles.
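
As a sanity check on the table, here is the marginal cost of one more expert, assuming each expert is a standard two-matrix FFN at the listed width in every layer (an assumption; the real expert layout, and how many experts reuse existing dense weights, may differ).

```python
def ffn_expert_params(d_model: int = 1024, d_ffn: int = 4096, n_layers: int = 24) -> int:
    """Parameters for one full-width FFN expert replicated across every layer
    (two weight matrices per layer; biases and norms ignored)."""
    return 2 * d_model * d_ffn * n_layers

print(f"~{ffn_expert_params() / 1e6:.0f}M params per added expert")  # ~201M
# Matches the ~200M step between successive rows in the table above; the absolute
# totals also depend on how many experts reuse existing dense weights.
```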

### Decision: KERNEL FIRST, MoE LATER

**Rationale:**
1. The kernel innovations (Synthase, PUP, SLA², memory, curriculum) need to prove themselves on the dense model first
2. If something goes wrong with MoE, we can't tell whether the fault is the kernel or the routing: two unknowns in one equation
3. Progressive expansion (the paper) says: train small, prove it works, expand with knowledge intact
4. The eval baseline must be clean: Synthase 565M vs vanilla 565M, same data, same compute
5. MoE warm-start from proven dense weights is strictly better than cold-wiring at step 400

**Sequence:**
1. ✅ Let the current 565M dense model train through all 4 curriculum phases
2. 🔲 Eval at milestones (5K, 10K, 15K steps): prove Synthase beats vanilla
3. 🔲 Wire MoE: copy the dense FFN into 5 expert FFNs plus small noise, drive the router from plug membrane signals (see the sketch after this list)
4. 🔲 Continue training as a 1.17B MoE; backbone representations transfer
5. 🔲 Paper: "Progressive Expansion from Dense to MoE"
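
Step 3 above would look roughly like this: each expert starts life as a copy of the proven dense FFN plus a small perturbation, so the copies can diverge under the router. A minimal sketch; `dense_ffn` is a placeholder for whatever module holds the trained FFN weights, and the noise scale is a guess.

```python
import copy
import torch

def warm_start_experts(dense_ffn: torch.nn.Module, n_experts: int = 5,
                       noise_std: float = 1e-3) -> torch.nn.ModuleList:
    """Clone the trained dense FFN into n_experts copies and add small Gaussian
    noise so the copies break symmetry and can specialize during MoE training."""
    experts = torch.nn.ModuleList()
    for _ in range(n_experts):
        expert = copy.deepcopy(dense_ffn)
        with torch.no_grad():
            for p in expert.parameters():
                p.add_(torch.randn_like(p) * noise_std)
        experts.append(expert)
    return experts
```

Cloning rather than re-initializing is what makes this a warm start: the router learns on top of representations the dense run has already paid for.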

### Optimal Hardware (when ready for MoE)
- **Free:** Kaggle T4 (fits) or L4 (12GB headroom)
- **Best value:** used RTX 3090 24GB (~$800), no session limits
- **Cloud:** Lambda A100 40GB ($1.10/hr), when speed matters

---

## Kernel Module Inventory

| | File | Role | Status | |
| |------|------|--------| |
| `kernel.py` | Main SynthaseKernel: forward pass, loss computation | Active, training |
| `config.py` | KernelConfig: architecture hyperparameters | Locked for v27 |
| `attention.py` | Synthase depth attention (the core innovation) | Active |
| | `embeddings.py` | Multi-tokenizer embeddings (BPE + math + byte) | Active | |
| | `memory.py` | Persistent memory module | Active | |
| | `moda.py` | Modality-aware processing | Active | |
| | `modulator.py` | Dynamic modulation | Active | |
| `router.py` | NexusRouter: 5 specialists (UNUSED except balance_loss) | Skeleton (future MoE) |
| `senses.py` | Plug membranes: domain gating at input | Active |
| `cognition.py` | Cognition module: depth-dependent reasoning | Active |
| | `cognition_loss.py` | Cognition loss functions | Active | |
| | `temporal.py` | Temporal processing | Active | |
| | `temporal_lattice.py` | Temporal lattice structure | Active | |
| | `tools.py` | Grid tools registration | Active | |
| | `warm_memory.py` | Warm memory initialization | Active | |

## Staging Modules (gladius_v2/staging/kernel/)
- `l0_sla2/` – Sparse Lottery Attention²
- `pup/` – Probabilistic Uncertainty Propagation
- `synthase/` – Synthase depth attention (reference implementation)

---

## Direction

The dragon is grinding. The kernel is forming. Don't interrupt it.

When the dense model proves itself, the MoE expansion is calculated, budgeted, and ready. The math is done. The path is clear. The timing isn't now.

*"The soul does not crack."*