# Kernel Development Notes – WYRM / GLADIUS

> "Build the spine before you grow the heads."

---

## Current State (Day 58 – April 8, 2026)

**Training:** v27 FINAL, 565M dense, ~2.67% complete (step ~400/15000)
**Architecture:** 1024d / 24L / 32H / 4096 FFN – Synthase depth attention
**Platform:** Kaggle T4 (16GB), using ~5.56 GB VRAM
**Kernel modules:** 17 files in this directory (see below)
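
For orientation, here is the v27 geometry written out as code, with a rough backbone parameter estimate. This is a sketch only: the field names and the attention/FFN cost model are assumptions, not the actual `KernelConfig` in `config.py`.

```python
from dataclasses import dataclass

@dataclass
class KernelGeometry:
    """Hypothetical mirror of the v27 geometry; field names are illustrative,
    not necessarily those used by KernelConfig in config.py."""
    d_model: int = 1024    # hidden width
    n_layers: int = 24     # depth
    n_heads: int = 32      # attention heads -> head_dim = 1024 // 32 = 32
    d_ffn: int = 4096      # FFN inner width

    def backbone_params(self) -> int:
        # Rough per-layer cost: 4*d^2 for attention projections + 2*d*d_ffn for the FFN.
        per_layer = 4 * self.d_model ** 2 + 2 * self.d_model * self.d_ffn
        return per_layer * self.n_layers

print(f"~{KernelGeometry().backbone_params() / 1e6:.0f}M backbone params")  # ~302M
# The remaining ~260M of the 565M total would sit in embeddings and the extra
# kernel modules (memory, cognition, Gaussian head, ...); a rough split, not a measurement.
```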

## Where We Are

The kernel is forming. Synthase depth profiles, PUP uncertainty, SLA², plug membranes, Gaussian head: all active, all training. The curriculum is running all four phases (foundation → reasoning → depth → omega). Loss is grinding down. The re-entry pattern (loss spikes, then recovers to a lower level) is the real signal.

The kernel IS the research contribution. Everything else builds on it.

## The MoE Decision (Day 58)

### Analysis Done
The router (`router.py`) exists with 5 specialist slots (reasoning, math, code, general, gaussian) but is only called for `balance_loss`: a regularization term on a routing decision that is never actually made. No token is dispatched through a specialist FFN.
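
To make the gap concrete, here is a minimal sketch of the difference between what happens today (an auxiliary balance loss computed from router probabilities, with nothing consuming the decision) and what real top-2 dispatch would add. The function names and the particular balance formulation are illustrative, not the `router.py` API.

```python
import torch

def balance_loss_only(router_logits: torch.Tensor) -> torch.Tensor:
    """Current situation: a load-balancing regularizer over router probabilities.
    Nothing downstream consumes the routing decision."""
    probs = router_logits.softmax(dim=-1)      # (tokens, n_experts)
    expert_load = probs.mean(dim=0)            # average probability mass per expert
    # Push toward uniform load; one common formulation, not necessarily router.py's.
    return (expert_load * expert_load.shape[0]).pow(2).mean()

def top2_dispatch(x: torch.Tensor, router_logits: torch.Tensor,
                  experts: torch.nn.ModuleList) -> torch.Tensor:
    """What real MoE would add: each token's output is a weighted sum of its top-2 experts."""
    weights, idx = router_logits.softmax(dim=-1).topk(2, dim=-1)   # (tokens, 2)
    weights = weights / weights.sum(dim=-1, keepdim=True)          # renormalize over top-2
    out = torch.zeros_like(x)
    for slot in range(2):
        for e, expert in enumerate(experts):
            mask = idx[:, slot] == e
            if mask.any():
                out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
    return out
```

The point of the comparison: until something like `top2_dispatch` exists in the forward pass, the balance term regularizes a decision that has no consequences.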

The model already has natural specialist pathways:
- **Language** – BPE tokenizer in, BPE logits out
- **Mathematics** – math tokenizer in, math logits out
- **Spatial** – Gaussian head (3D splats out)
- **Reasoning** – cognition module, depth-dependent, PUP uncertainty
- **Raw** – byte-level fallback

These are the Hydra's heads. They just aren't wired as MoE yet.

### VRAM Calculation
| Config | Params | VRAM (T4, grad_ckpt) | Fits? |
|--------|--------|----------------------|-------|
| Dense (current) | 565M | 5.56 GB | ✅ |
| MoE 3 experts × 4096 | 770M | ~8.0 GB | ✅ |
| MoE 4 experts × 4096 | 971M | ~9.9 GB | ✅ |
| MoE 5 experts × 4096 | 1,173M | ~11.8 GB | ✅ |

All configurations fit on the T4. The 5-expert Hydra at 1.17B runs at close to dense speed: top-2 routing activates only two expert FFNs per token, so per-token compute grows only modestly over dense while total capacity roughly doubles.
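
As a sanity check on the table, here is the marginal cost of one more expert, assuming each expert is a standard two-matrix FFN at the listed width in every layer (an assumption; the real expert layout, and how many experts reuse existing dense weights, may differ).

```python
def ffn_expert_params(d_model: int = 1024, d_ffn: int = 4096, n_layers: int = 24) -> int:
    """Parameters for one full-width FFN expert replicated across every layer
    (two weight matrices per layer; biases and norms ignored)."""
    return 2 * d_model * d_ffn * n_layers

print(f"~{ffn_expert_params() / 1e6:.0f}M params per added expert")  # ~201M
# Matches the ~200M step between successive rows in the table above; the absolute
# totals also depend on how many experts reuse existing dense weights.
```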

### Decision: KERNEL FIRST, MoE LATER

**Rationale:**
1. The kernel innovations (Synthase, PUP, SLA², memory, curriculum) need to prove themselves on the dense model first
2. If something goes wrong with MoE, we can't tell whether the fault is the kernel or the routing: two unknowns in one equation
3. Progressive expansion (the paper) says: train small, prove it works, expand with knowledge intact
4. The eval baseline must be clean: Synthase 565M vs vanilla 565M, same data, same compute
5. MoE warm-start from proven dense weights is strictly better than cold-wiring at step 400

**Sequence:**
1. ✅ Let the current 565M dense model train through all 4 curriculum phases
2. 🔲 Eval at milestones (5K, 10K, 15K steps): prove Synthase beats vanilla
3. 🔲 Wire MoE: copy the dense FFN into 5 expert FFNs plus small noise, drive the router from plug membrane signals (see the sketch after this list)
4. 🔲 Continue training as a 1.17B MoE; backbone representations transfer
5. 🔲 Paper: "Progressive Expansion from Dense to MoE"
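
Step 3 above would look roughly like this: each expert starts life as a copy of the proven dense FFN plus a small perturbation, so the copies can diverge under the router. A minimal sketch; `dense_ffn` is a placeholder for whatever module holds the trained FFN weights, and the noise scale is a guess.

```python
import copy
import torch

def warm_start_experts(dense_ffn: torch.nn.Module, n_experts: int = 5,
                       noise_std: float = 1e-3) -> torch.nn.ModuleList:
    """Clone the trained dense FFN into n_experts copies and add small Gaussian
    noise so the copies break symmetry and can specialize during MoE training."""
    experts = torch.nn.ModuleList()
    for _ in range(n_experts):
        expert = copy.deepcopy(dense_ffn)
        with torch.no_grad():
            for p in expert.parameters():
                p.add_(torch.randn_like(p) * noise_std)
        experts.append(expert)
    return experts
```

Cloning rather than re-initializing is what makes this a warm start: the router learns on top of representations the dense run has already paid for.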

### Optimal Hardware (when ready for MoE)
- **Free:** Kaggle T4 (fits) or L4 (12GB headroom)
- **Best value:** used RTX 3090 24GB (~$800), no session limits
- **Cloud:** Lambda A100 40GB ($1.10/hr), when speed matters

---

## Kernel Module Inventory

| | File | Role | Status | |
| |------|------|--------| |
| `kernel.py` | Main SynthaseKernel: forward pass, loss computation | Active, training |
| `config.py` | KernelConfig: architecture hyperparameters | Locked for v27 |
| `attention.py` | Synthase depth attention (the core innovation) | Active |
| | `embeddings.py` | Multi-tokenizer embeddings (BPE + math + byte) | Active | |
| | `memory.py` | Persistent memory module | Active | |
| | `moda.py` | Modality-aware processing | Active | |
| | `modulator.py` | Dynamic modulation | Active | |
| `router.py` | NexusRouter: 5 specialists (UNUSED except balance_loss) | Skeleton (future MoE) |
| `senses.py` | Plug membranes: domain gating at input | Active |
| `cognition.py` | Cognition module: depth-dependent reasoning | Active |
| | `cognition_loss.py` | Cognition loss functions | Active | |
| | `temporal.py` | Temporal processing | Active | |
| | `temporal_lattice.py` | Temporal lattice structure | Active | |
| | `tools.py` | Grid tools registration | Active | |
| | `warm_memory.py` | Warm memory initialization | Active | |

## Staging Modules (gladius_v2/staging/kernel/)
- `l0_sla2/` – Sparse Lottery Attention²
- `pup/` – Probabilistic Uncertainty Propagation
- `synthase/` – Synthase depth attention (reference implementation)

---

## Direction

The dragon is grinding. The kernel is forming. Don't interrupt it.

When the dense model proves itself, the MoE expansion is calculated, budgeted, and ready. The math is done. The path is clear. The timing isn't now.

*"The soul does not crack."*