
Kernel Development Notes — WYRM / GLADIUS

"Build the spine before you grow the heads."


Current State (Day 58 — April 8, 2026)

Training: v27 FINAL, 565M dense, ~2.67% complete (step ~400/15000)
Architecture: 1024d / 24L / 32H / 4096 FFN — Synthase depth attention
Platform: Kaggle T4 (16GB) — using ~5.56 GB VRAM
Kernel modules: 17 files in this directory (see below)
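As a rough cross-check on the parameter count, the transformer backbone alone can be estimated from these hyperparameters (a sketch: the 64K vocabulary is a placeholder assumption, and the remaining parameters sit in the kernel modules such as memory, cognition, and the Gaussian head):

```python
def transformer_params(d_model, n_layers, d_ffn, vocab_size):
    """Rough dense-transformer backbone count: attention projections,
    FFN, and a tied embedding. Ignores norms, biases, and extra modules."""
    attn = 4 * d_model * d_model       # Q, K, V, O projections per layer
    ffn = 2 * d_model * d_ffn          # up- and down-projection per layer
    embed = vocab_size * d_model       # tied input/output embedding
    return n_layers * (attn + ffn) + embed

# 64K vocab is a placeholder; the real multi-tokenizer setup will differ
print(transformer_params(1024, 24, 4096, 65536))  # 369098752, i.e. ~369M
```

That puts the plain backbone around 369M, consistent with the kernel modules carrying the balance of the 565M total.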

Where We Are

The kernel is forming. Synthase depth profiles, PUP uncertainty, SLA², plug membranes, Gaussian head — all active, all training. The curriculum is running all four phases (foundation → reasoning → depth → omega). Loss is grinding down. The re-entry pattern (loss spikes, then recovers lower) is the real signal.

The kernel IS the research contribution. Everything else builds on it.

The MoE Decision (Day 58)

Analysis Done

The router (router.py) exists with 5 specialist slots (reasoning, math, code, general, gaussian) but is only called for balance_loss — a regularization term on a decision that's never made. No token actually routes through specialist FFNs.
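The "regularizer on a decision that's never made" pattern can be illustrated with a minimal sketch (hypothetical names and shapes; this is not the repo's NexusRouter, just a Switch-style auxiliary loss computed without any dispatch):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BalanceOnlyRouter(nn.Module):
    """Sketch of the current state: the gate scores 5 specialists,
    but the scores feed only a load-balance regularizer. No token
    is ever sent through a specialist FFN."""
    def __init__(self, d_model=1024, n_experts=5):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts)

    def balance_loss(self, x):
        # x: (batch, seq, d_model)
        probs = F.softmax(self.gate(x), dim=-1)         # (B, S, E)
        mean_prob = probs.mean(dim=(0, 1))              # soft mass per expert
        top1 = probs.argmax(dim=-1)                     # hard pick, never executed
        frac = F.one_hot(top1, probs.size(-1)).float().mean(dim=(0, 1))
        # Switch-style auxiliary loss: E * sum_e(frac_e * mean_prob_e)
        return probs.size(-1) * (frac * mean_prob).sum()
```

Training this loss shapes the gate toward uniform assignment, but until experts are wired in, it regularizes a pathway that does not exist.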

The model already has natural specialist pathways:

  • Language — BPE tokenizer in, BPE logits out
  • Mathematics — math tokenizer in, math logits out
  • Spatial — Gaussian head (3D splats out)
  • Reasoning — cognition module, depth-dependent, PUP uncertainty
  • Raw — byte-level fallback

These are the Hydra's heads. They just aren't wired as MoE yet.

VRAM Calculation

| Config | Params | VRAM (T4, grad ckpt) | Fits? |
|---|---|---|---|
| Dense (current) | 565M | 5.56 GB | ✅ |
| MoE 3 experts × 4096 | 770M | ~8.0 GB | ✅ |
| MoE 4 experts × 4096 | 971M | ~9.9 GB | ✅ |
| MoE 5 experts × 4096 | 1,173M | ~11.8 GB | ✅ |

All configurations fit on a T4. The 5-expert Hydra at 1.17B runs at roughly the same speed as dense: top-2 routing keeps the per-token compute constant while adding capacity.
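The per-expert deltas in the table can be sanity-checked from the architecture numbers (a sketch; byte-per-parameter costs depend on precision and optimizer state, so VRAM itself is not derived here):

```python
def expert_ffn_params(n_layers=24, d_model=1024, d_ffn=4096):
    """Parameters added by one extra expert FFN stack:
    up-projection (d_model x d_ffn) plus down-projection (d_ffn x d_model)
    per layer, biases and norms ignored."""
    return n_layers * 2 * d_model * d_ffn

extra = expert_ffn_params()
print(f"{extra / 1e6:.0f}M params per additional expert")
```

Each extra expert adds one full FFN stack of about 201M parameters, matching the ~200M steps between table rows (971M − 770M, 1,173M − 971M). The ~1.9 GB VRAM step per expert then implies roughly 9–10 bytes held per parameter under this training setup.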

Decision: KERNEL FIRST, MoE LATER

Rationale:

  1. The kernel innovations (Synthase, PUP, SLAΒ², memory, curriculum) need to prove themselves on the dense model first
  2. If something goes wrong with MoE, we can't tell if it's kernel or routing — two unknowns in one equation
  3. Progressive expansion (the paper) says: train small, prove it works, expand with knowledge intact
  4. The eval baseline must be clean β€” Synthase 565M vs Vanilla 565M, same data, same compute
  5. MoE warm-start from proven dense weights is strictly better than cold-wiring at step 400

Sequence:

  1. ✅ Let current 565M dense train through all 4 curriculum phases
  2. 🔲 Eval at milestones (5K, 10K, 15K) — prove Synthase beats vanilla
  3. 🔲 Wire MoE: copy dense FFN → 5 expert FFNs + small noise, router from plug membrane signals
  4. 🔲 Continue training as 1.17B MoE — backbone representations transfer
  5. 🔲 Paper: "Progressive Expansion from Dense to MoE"

Optimal Hardware (when ready for MoE)

  • Free: Kaggle T4 (fits) or L4 (12GB headroom)
  • Best value: used RTX 3090 24GB (~$800) — no session limits
  • Cloud: Lambda A100 40GB ($1.10/hr) — when speed matters

Kernel Module Inventory

| File | Role | Status |
|---|---|---|
| kernel.py | Main SynthaseKernel — forward pass, loss computation | Active, training |
| config.py | KernelConfig — architecture hyperparameters | Locked for v27 |
| attention.py | Synthase depth attention — the core innovation | Active |
| embeddings.py | Multi-tokenizer embeddings (BPE + math + byte) | Active |
| memory.py | Persistent memory module | Active |
| moda.py | Modality-aware processing | Active |
| modulator.py | Dynamic modulation | Active |
| router.py | NexusRouter — 5 specialists (UNUSED except balance_loss) | Skeleton → future MoE |
| senses.py | Plug membranes — domain gating at input | Active |
| cognition.py | Cognition module — depth-dependent reasoning | Active |
| cognition_loss.py | Cognition loss functions | Active |
| temporal.py | Temporal processing | Active |
| temporal_lattice.py | Temporal lattice structure | Active |
| tools.py | Grid tools registration | Active |
| warm_memory.py | Warm memory initialization | Active |

Staging Modules (gladius_v2/staging/kernel/)

  • l0_sla2/ — Sparse Lottery Attention²
  • pup/ — Probabilistic Uncertainty Propagation
  • synthase/ — Synthase depth attention (reference implementation)

Direction

The dragon is grinding. The kernel is forming. Don't interrupt it.

When the dense model proves itself, the MoE expansion is calculated, budgeted, and ready. The math is done. The path is clear. The timing isn't now.

"The soul does not crack."