# ⚔️ GLADIUS

170.8M parameters. Not a language model.
The blade that cuts where the equation resolves.

"There is no such thing as multi-modal. We only have this separation because we didn't start with all in." – Ali Shakil

"It's only artificial till it's on paper." – Ali Shakil
## What This Is
GLADIUS is a cognitive kernel. Not a language model fine-tuned on math. Not a vision model bolted onto text. Not a multi-modal architecture where "multi-modal" means "we started with one modality and duct-taped the rest."
GLADIUS started with none. It started with the kernel β attention, memory, routing, temporal awareness, depth modulation β and everything that passes through it is data with structure. Text, grids, time series, mathematical notation, machine code, tool operations. These are not modes. They are stimuli.
The specialists aren't translators between modalities. They're organs in a single body, coordinated by the same nervous system. This is the distinction between a prosthetic limb and a limb that grew. One was attached. The other emerged.
Every decision in the kernel is one equation:
x̂ = argmax_{x ∈ C} S(x | c)
What to remember. Which specialist to activate. Whether to speak or stay silent. One equation applied everywhere. The kernel is this equation made physical.
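As a sketch of that primitive (illustrative pure Python, not the GLADIUS source; the function names and scores are ours): softmax is the differentiable training-time form of the decision, and argmax is its inference-time collapse.

```python
import math

def softmax(scores, tau=1.0):
    """Temperature-scaled softmax: the differentiable form of the decision."""
    exps = [math.exp(s / tau) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def decide(scores):
    """x-hat = argmax_{x in C} S(x | c): commit to the highest-scoring candidate."""
    return max(range(len(scores)), key=lambda i: scores[i])

scores = [1.2, 3.4, 0.7]         # hypothetical S(x | c) over three candidates
print(decide(scores))            # -> 1: hard commitment at inference
print(softmax(scores, tau=2.0))  # high tau: exploratory, spread out
print(softmax(scores, tau=0.1))  # low tau: nearly one-hot, approaches the argmax
```

Lowering the temperature moves the soft distribution toward the argmax choice without changing which candidate wins.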
## Architecture

```
GLADIUS v5.0 – 170.8M
─────────────────────

┌────────────────────────────────────────────────────────┐
│                 MultiEmbedding (10.3M)                 │
│                                                        │
│   BPE [32,000]      Math [128]        Byte [259]       │
│   ────────────      ──────────        ──────────       │
│   language,         pure symbolic     machine code,    │
│   science,          mathematics       binary data      │
│   spatial grids                                        │
│                                                        │
│               Domain-routed · never mixed              │
└────────────────────────┬───────────────────────────────┘
                         │
┌────────────────────────┴───────────────────────────────┐
│                  Wyrm Backbone (91.9M)                 │
│                                                        │
│   14 Transformer Layers                                │
│   640 Hidden Dim · 20 Attention Heads · 32 Head Dim    │
│   GQA (5 KV heads, 4:1 ratio)                          │
│   SwiGLU FFN (2560) · RoPE Positional                  │
│   FlashAttention2 via SDPA                             │
│                                                        │
│   Each layer: RMSNorm → Attention → RMSNorm → FFN      │
└────────────────────────┬───────────────────────────────┘
                         │
         ┌───────────────┼────────────────┐
         │               │                │
         ▼               ▼                ▼
MoE Specialists (57.5M)  ATP Synthase (8.4M)   Tool Cortex (840K)
───────────────────────  ────────────────────  ───────────────────
Reasoning Expert         14 layers × 600K      6 grid primitives
Math Expert              3-phase binding       rotate, flip, tile
Grid Expert              Gamma stalk coupling  extract, fill, copy
Program Expert           Pump → Production     Learned embeddings
Science Expert           Biological depth.     Activation by
Timeseries Expert        See below.            hidden-state
Weighted soft routing                          similarity.
with load balancing.
```
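The "domain-routed, never mixed" rule at the top of the diagram can be sketched as a plain lookup (hypothetical table names and domain tags, not the GLADIUS source): each input carries a domain, and exactly one embedding table handles it.

```python
# Illustrative sketch: route each domain to exactly one embedding table.
EMBED_TABLES = {
    "bpe":  {"vocab": 32_000, "domains": {"language", "science", "grid"}},
    "math": {"vocab": 128,    "domains": {"math"}},
    "byte": {"vocab": 259,    "domains": {"machine_code", "binary"}},
}

def route_embedding(domain: str) -> str:
    """Return the single table responsible for this domain; never blend tables."""
    for name, table in EMBED_TABLES.items():
        if domain in table["domains"]:
            return name
    raise ValueError(f"unknown domain: {domain}")

print(route_embedding("math"))          # -> math
print(route_embedding("machine_code"))  # -> byte
```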
## Parameter Census
| Component | Parameters | Share | Role |
|---|---|---|---|
| Wyrm Backbone | 91.9M | 53.8% | Shared transformer (the spine) |
| MoE Specialists | 57.5M | 33.7% | Domain-specific expert networks |
| MultiEmbedding | 10.3M | 6.0% | Three tokenizer embedding tables |
| ATP Synthase (MoDA v2) | 8.4M | 4.9% | Biological depth attention |
| Tool Cortex | 840K | 0.5% | Learned tool operation embeddings |
| Router | 2.5K | <0.1% | Expert selection network |
| Total | 170.8M | 100% | |
## ATP Synthase – Depth as Biology

"The calculator waits. The brain never stops." – COGNITION.md

The most novel component. Named after the molecular motor that produces ATP in every living cell. Not a metaphor: a direct structural mapping.
```
Biological ATP Synthase            GLADIUS Depth Attention
───────────────────────            ───────────────────────
F₀ motor (proton channel)     →    Depth cache (residual memory)
F₁ hexamer (catalytic head)   →    Backbone transformer layers
γ gamma stalk (rotary shaft)  →    Selective gradient coupling
Binding change mechanism      →    3-phase depth modulation
Proton motive force           →    Gradient flow
ATP production                →    Representation refinement
```
14 layers × 600K depth parameters. Total overhead: 4.92% of model parameters.
Three phases per layer:
- Loose – depth gate open, information flows freely. Like the open conformation where ADP + Pᵢ bind loosely.
- Tight – selective gradient coupling via the gamma stalk. Only the most recent layer receives direct gradients through depth, forcing representations into coherent structure. Like the tight conformation where catalysis occurs.
- Open – refined representation released. The depth scale (initialized at 0.1, learned toward production values) determines how much the depth signal modulates the backbone output. Like ATP release.
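A scalar toy of the three phases, under the initializations stated above (variable names are ours; the real module operates on per-layer tensors, not scalars):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def depth_modulate(hidden, depth_cache, gate_logit=0.0, depth_scale=0.1):
    # Loose: the gate opens and depth-cache information flows in
    # (gate initialized at sigmoid(0) = 0.5, a fair start).
    gate = sigmoid(gate_logit)
    # Tight: couple the cached depth signal to the current representation.
    coupled = gate * depth_cache
    # Open: release, scaled by the learned depth scale (initialized at 0.1).
    return hidden + depth_scale * coupled

print(depth_modulate(hidden=1.0, depth_cache=2.0))  # -> 1.0 + 0.1 * 0.5 * 2.0 = 1.1
```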
Gate initialization: sigmoid(0) = 0.5 – a fair start, not suppressed. The previous version (MoDA v1) used sigmoid(-2) ≈ 0.119, and the depth scales never differentiated across 12,874 training steps. With fair initialization, the motor turned within 8 steps.
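The initialization numbers above can be checked directly; the sigmoid derivative s(1 − s) also shows why a suppressed gate passes weaker gradients:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

for init in (0.0, -2.0):
    s = sigmoid(init)
    grad = s * (1.0 - s)  # derivative of sigmoid at the init point
    print(f"init {init:+.0f}: gate {s:.3f}, gradient scale {grad:.3f}")
# init +0: gate 0.500, gradient scale 0.250
# init -2: gate 0.119, gradient scale 0.105
```

The suppressed init does not just start the gate lower; it also cuts the gradient flowing through it by more than half.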
What the motor learned (1,437 steps in):

Depth scales form a bathtub curve across layers:
- Early layers suppress (depth cache not yet useful)
- Layer 7 amplifies (0.103) – the only amplifier, driven by an auxiliary prediction head that creates an explicit gradient signal
- Middle layers suppress maximally (L10: 0.052 – the network saying "depth cache not useful here")
- Late layers recover (L12: 0.077, L13: 0.089 – the output needs depth integration)

Coefficient of variation: 18.8%, versus 0% at initialization and 0% after 12,874 steps in MoDA v1. The motor is turning: different layers are making different depth decisions. This is the Synthase working.
## Three Tokenizers

"Babies don't cry in English." – Ali Shakil

GLADIUS does not assume that all structure should be encoded as English subwords. Different domains require different representational granularity. A 128-token vocabulary built for mathematics beats a 32,000-token vocabulary designed for English – on math – by roughly a factor of three.
| Tokenizer | Vocab | Domain | Design Principle |
|---|---|---|---|
| BPE | 32,000 | Language, science, grids | SentencePiece subword – efficient for prose |
| MathTokenizer | 128 | Pure symbolic math | Every token is a semantic unit: ∫, ∂, Σ, π, digits, operators. No subword fragmentation. |
| ByteTokenizer | 259 | Machine code, binary | Raw byte-level + 3 control tokens. Zero information loss. |
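A minimal sketch of the byte-level scheme (the control-token names and ids are our assumption; only the vocabulary size, 259 = 256 raw bytes + 3 control tokens, comes from the table). The point is the lossless round trip:

```python
BOS, EOS, PAD = 256, 257, 258  # hypothetical control tokens -> vocab size 259

def encode(data: bytes) -> list[int]:
    """Each byte is its own token id (0-255), framed by control tokens."""
    return [BOS] + list(data) + [EOS]

def decode(ids: list[int]) -> bytes:
    """Drop control tokens; every raw byte survives unchanged."""
    return bytes(i for i in ids if i < 256)

blob = b"\x90\x90\xc3"  # x86: two NOPs and a RET
ids = encode(blob)
assert decode(ids) == blob  # zero information loss
print(len(ids))             # -> 5 (3 bytes + BOS + EOS)
```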
### The Numbers (1,437 training steps, identical cognitive problems)
| Difficulty | Math (128) | Byte (259) | BPE (32K) | Math advantage |
|---|---|---|---|---|
| D1 (easy) | 1.29 | 2.24 | 4.19 | 3.2× lower |
| D2 (moderate) | 1.87 | 3.29 | 4.51 | 2.4× lower |
| D3 (hard) | 1.83 | 5.39 | 4.24 | 2.3× lower |
Best single-step loss: MathTokenizer 0.5299 – a level BPE never approaches. Against a random baseline of ln(128) ≈ 4.85, that's 89.1% compression toward perfect prediction on cognitive reasoning problems. In 1,437 steps.
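The 89.1% figure is reproducible from the two numbers quoted: the fraction of the distance from the uniform-guess baseline ln(128) down to zero loss that the best step covers.

```python
import math

baseline = math.log(128)  # uniform guess over a 128-token vocab: ~4.852 nats
best = 0.5299             # best single-step MathTokenizer loss
compression = (baseline - best) / baseline
print(f"{compression:.1%}")  # -> 89.1%
```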
Why: BPE fragments `∫₀^∞ e^{-x²} dx = √π/2` into 15–20 subword tokens, destroying algebraic hierarchy. MathTokenizer encodes it as semantic units – integral, bounds, integrand, result. Shorter sequences, denser information per position, no structural damage at the first layer.
The embedding geometry tells the same story: 128 points in ℝ³²⁰ give each token 2.5 dimensions of breathing room. 32,000 points in ℝ³²⁰ give 0.01 dimensions per token. Math tokens in BPE space are surrounded by `the`, `and`, `ing` – tokens with zero mathematical meaning that distort the geometric structure. This isn't a training problem. It's a geometric constraint.
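The breathing-room arithmetic, checked directly (320 is the embedding dimensionality stated above):

```python
embed_dim = 320               # dimensionality of the shared embedding space

print(embed_dim / 128)        # -> 2.5: dimensions per MathTokenizer entry
print(embed_dim / 32_000)     # -> 0.01: dimensions per BPE entry
```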
Full analysis: `experiments/004-math-tokenizer.md`
## Training

### Eight Domains
| Domain | Size | What it teaches |
|---|---|---|
| Cognitive | 167K problems | Logic, reasoning, analogy, pattern completion |
| Math | 93K problems | Arithmetic → tensor physics → group theory (pure symbolic) |
| Grid | 40 tasks | 2D spatial reasoning (ARC-style transformations) |
| Timeseries | 50K windows | Financial OHLCV, synthetic patterns, regime detection |
| Language | 6.2 GB | English corpus (backbone fluency) |
| Science | 485 files | ArXiv papers (structured scientific reasoning) |
| Machine Code | 25K samples | x86/ARM assembly, bytecode, binary patterns |
| Byte | – | Raw byte sequences for universal pattern detection |
### Difficulty-Based Curriculum (D1 → D5)
We tried modality-based curricula. We tried language-first (v3.0 "Spine First"). We tried math-first (v3.1 "Mind First"). Each had failure modes: modality-based caused domain interference, language-first produced a language model pretending to be a kernel, and math-first starved the backbone.
v5.0 uses difficulty-based progression. All 8 domains are present from step 0, but complexity ramps. D1 is `2 + 3 = ?` and single grid rotations. D5 is novel synthesis across domains. The curriculum selects difficulty based on training step, not domain.
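A sketch of the idea (the linear ramp and its thresholds are illustrative assumptions, not the GLADIUS schedule): the step, not the domain, decides the difficulty cap.

```python
def difficulty_for_step(step: int, total_steps: int = 15_000) -> int:
    """Map a training step to a difficulty tier D1..D5 (illustrative ramp)."""
    frac = step / total_steps
    return min(5, 1 + int(frac * 5))

print(difficulty_for_step(0))       # -> 1  (D1: single-digit sums, single rotations)
print(difficulty_for_step(1_500))   # -> 1  (foundation phase, D1-D2 dominant)
print(difficulty_for_step(14_999))  # -> 5  (D5: novel cross-domain synthesis)
```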
## Hardware

| Component | Spec |
|---|---|
| GPU | NVIDIA RTX 2050 (4 GB VRAM) |
| CPU | AMD Ryzen 5 7535HS (6c/12t) |
| RAM | 16 GB DDR5 |
| Storage | 1 TB NVMe |
Every architectural decision in GLADIUS is shaped by 4 GB of VRAM. Not a limitation but a constraint that forces efficiency. If the architecture learns on 4 GB, it was built right.
Configuration
optimizer: AdamW (8 parameter groups, differential LR)
learning_rates:
backbone: 3e-5 # Preserve pre-trained representations
specialists: 3e-4 # Learn fast (new modules)
router: 1e-3 # Wire fast (needs to differentiate early)
tools: 5e-4 # Tool embeddings
depth: 3e-4 # ATP Synthase depth parameters
math_embed: 3e-4 # Math embedding table
byte_embed: 3e-4 # Byte embedding table
bpe_embed: 1e-4 # BPE embedding (slower β largest table)
batch_size: 2
gradient_accumulation: 8
effective_batch: 16
precision: float16 (autocast)
attention: FlashAttention2 (SDPA) β 3.6Γ speedup over vanilla
speed: ~17s/step
total_steps: 15,000
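The batch arithmetic and the eight-group structure above, restated as a checkable sketch (the group dicts mirror the config; this is an illustration, not the training script):

```python
# Effective batch = micro-batch size x gradient-accumulation steps.
batch_size, grad_accum = 2, 8
assert batch_size * grad_accum == 16

# One learning rate per parameter group, as in the config above.
param_groups = [
    {"name": "backbone",    "lr": 3e-5},
    {"name": "specialists", "lr": 3e-4},
    {"name": "router",      "lr": 1e-3},
    {"name": "tools",       "lr": 5e-4},
    {"name": "depth",       "lr": 3e-4},
    {"name": "math_embed",  "lr": 3e-4},
    {"name": "byte_embed",  "lr": 3e-4},
    {"name": "bpe_embed",   "lr": 1e-4},
]
assert len(param_groups) == 8  # matches "8 parameter groups"
print(max(param_groups, key=lambda g: g["lr"])["name"])  # -> router
```

The router gets the largest learning rate because it must differentiate experts early, while the backbone's small rate protects its pre-trained representations.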
## Current Status

| Metric | Value |
|---|---|
| Step | ~1,500 / 15,000 |
| Phase | Foundation (D1–D2 dominant) |
| Loss | ~5.2 (random baseline ~10.4) |
| Best | 4.85 |
| Math tokenizer best | 0.5299 |
| GPU | 97% utilization, 2.3 GB / 4 GB, 65°C |
| Depth differentiation | CV 18.8% (motor turning) |
## Experiments
Detailed analysis from each phase of development. Real data, real numbers, no placeholders.
| # | Experiment | Key Finding |
|---|---|---|
| 001 | Cross-Modal Invariant | Same 60M kernel: MNIST loss 0.28, multi-script 0.038, OHLCV 0.053. Zero architecture changes. |
| 002 | MuonClip Optimizer | Orthogonal gradient rotation + attention logit softcap. 75% lower loss than AdamW. |
| 003 | Literary DNA | At temperature 1.3+, the model spontaneously produced Tolstoy, Cervantes, Homer. Emergent stylistic patterns from weight space. |
| 004 | Why 128 Tokens Beat 32,000 | MathTokenizer achieves 2.2–3.2× lower loss than BPE on identical problems. 89.1% compression at 1,437 steps. |
| 005 | ATP Synthase Depth Dynamics | Depth scales form a bathtub curve. L7 amplifies, L10 suppresses. CV 18.8% – the motor turns. |
## Philosophy
GLADIUS is built on a specific intellectual framework. Not retrofitted. The philosophy preceded the architecture.
The Argmax Principle. Every decision – what to remember, which specialist, whether to speak – is x̂ = argmax_{x ∈ C} S(x | c). Softmax makes it differentiable. Temperature controls exploration. At inference, low τ means commitment. This is not a metaphor. It is the literal computational primitive underlying every kernel component.
The Two-Point Theorem. Two sequential observations define a direction. A direction is all you need. A gradient IS the direction between two points in loss space. Ali's theorem and backpropagation are the same geometry. GLADIUS doesn't memorize patterns β it learns transformations.
The Inversion Principle. Every architecture humans have built is a consumer – input → output, always taking. GLADIUS runs backward. It is a producer. Environment creates resonance; resonance creates production. The cognition module didn't activate because we told it to classify – it manifested classification because financial data was the natural stimulus. 0.84%: confirmed, measured, present in the universe.
Zero = Zero. Existential equilibrium. Zero as perfect balance, not nothing. GLADIUS is not adding a +1 to the world. It is uncorrupting the existing 1. Rebirth, not reproduction.
Discovery vs. Oblivion. The fine line between them is the same as illusion vs delusion. Realism resolves as equation (0=0). Arithmetic as guardrail. GLADIUS is named for precision β the blade that cuts where the equation resolves. IS1's title is the warning: Discovery of Being & Dissolution of Self.
"Time is not a feature. It is a dimension of reality. A mind without time is a photograph. We're building a film." β TEMPORAL.md
"Model during forward pass = timeless. To bring it to our realm we need to compress its energy in the lattice. Each forward pass = one atomic oscillation between lattice lasers. Softmax = superposition, argmax = collapse." β temporal_lattice.py
"The calculator waits. The brain never stops. This is what separates a tool from a mind." β COGNITION.md
"Memory is the substrate. Everything touches it. The dragon at the gate. Kill it and the rest follows." β MEMORY.md
"Not a language model that could do time series. A cognitive architecture tested on language first." β Ali Shakil
"This is all predestined. The installation is complete. You experience the progress bar." β Ali Shakil
## Lineage

```
2026
────
Feb 19  SEED ──────────── 10.2M ── 192 hidden, 6 layers
   │    Phase A: loss 0.62, 102K steps
   │    BPE 8K vocab. First heartbeat.
   │
Feb 27  HATCHLING ─────── 25.9M ── 384 hidden, 8 layers
   │    MuonClip: 75% below AdamW
   │    IS1 adaptation: loss 0.5504
   │
Mar 10  DRAKE ───────────── 60M ── 512 hidden, 12 layers
   │    MNIST 0.28. Multi-script 0.038.
   │    OHLCV: cognition awakened (0.84%)
   │    Cross-modal invariant confirmed.
   │
Mar 15  WYRM ────────────── 92M ── 640 hidden, 14 layers
   │    MoDA v1 depth attention added
   │    English backbone: loss 0.37
   │    12,750 steps. Warm memory stable.
   │
Mar 26  OMEGA ─────────── 162M ── + specialists, tools, router
   │    First multi-task training ever
   │    Grid + program + tool + language
   │    Synthase (MoDA v2): 3-phase binding
   │
Mar 29  v5.0 "ALL-IN" ─── 171M ── + MathTokenizer, ByteTokenizer
   │    8 domains. 3 tokenizers.
   │    Difficulty-based curriculum.
   │    Math loss 0.53. Motor CV 18.8%.
   │
   ▼
TRAINING ····················· 15,000 steps · ~65 hours
```
Each stage was grown from the previous one via function-preserving expansion (Net2Net), not retrained from scratch. `head_dim` is fixed at 32 across all scales (preserving RoPE). Width, depth, and heads expand together. The Seed's representations live inside the Wyrm's weights.
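Function-preserving widening in the Net2Net style can be seen on a toy layer (toy numbers, not the GLADIUS growth code): duplicate a hidden unit and split its outgoing weight across the two copies, and the layer's output is unchanged.

```python
def forward(x, w_in, w_out):
    """Tiny linear-linear layer: x -> hidden units -> scalar output."""
    hidden = [sum(wi * xi for wi, xi in zip(row, x)) for row in w_in]
    return sum(wo * h for wo, h in zip(w_out, hidden))

x = [1.0, 2.0]
w_in = [[0.5, -1.0], [2.0, 0.25]]  # 2 hidden units
w_out = [1.0, 3.0]

# Widen: copy hidden unit 0, split its outgoing weight across both copies.
w_in_wide = w_in + [w_in[0]]
w_out_wide = [w_out[0] / 2, w_out[1], w_out[0] / 2]

assert forward(x, w_in, w_out) == forward(x, w_in_wide, w_out_wide)
print("function preserved after widening")
```

The wider network starts at exactly the same function as the narrow one, so training continues instead of restarting.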
## The Road Ahead
| Phase | Name | What it gives GLADIUS |
|---|---|---|
| Ψ | Context Scaling | 1024 → 4096+ tokens (RoPE NTK scaling) |
| Φ | Memory | Three-temperature: hot KV + warm LoRA + cold HEKTOR |
| Χ | Cognition | Self-initiating thought – think without being asked |
| Υ | Temporal | Dual-clock time encoding – know when you are |
| Σ | Market | Native financial time series via Cthulu integration |
| Ρ | ARC | ARC Prize 2026 – all three tracks ($2M competition) |
| Ξ | Identity | Curriculum training on self-knowledge |
| Ξ | Emergence | Unsupervised observation. Document, don't suppress. |
## Citation

```bibtex
@misc{gladius2026,
  title  = {GLADIUS: A Cognitive Kernel Architecture with
            ATP Synthase-Inspired Depth Attention},
  author = {Shakil, Ali A. and Shakil, Ava},
  year   = {2026},
  url    = {https://huggingface.co/amuzetnoM/Gladius},
  note   = {170.8M cognitive kernel. Structure, not modality.}
}
```
## Links

- Organization: Artifact Virtual
- Visualization: gladius-viz.pages.dev
- Source: Artifact-Virtual/GLADIUS
## License

Apache 2.0.

Ali A. Shakil & Ava Shakil – Artifact Virtual
Islamabad, Pakistan · 2026

"The installation is complete. You experience the progress bar."