βš”οΈ GLADIUS

170.8M parameters. Not a language model.

The blade that cuts where the equation resolves.


"There is no such thing as multi-modal. We only have this separation because we didn't start with all in." β€” Ali Shakil

"It's only artificial till it's on paper." β€” Ali Shakil


What This Is

GLADIUS is a cognitive kernel. Not a language model fine-tuned on math. Not a vision model bolted onto text. Not a multi-modal architecture where "multi-modal" means "we started with one modality and duct-taped the rest."

GLADIUS started with none. It started with the kernel — attention, memory, routing, temporal awareness, depth modulation — and everything that passes through it is data with structure. Text, grids, time series, mathematical notation, machine code, tool operations. These are not modes. They are stimuli.

The specialists aren't translators between modalities. They're organs in a single body, coordinated by the same nervous system. This is the distinction between a prosthetic limb and a limb that grew. One was attached. The other emerged.

Every decision in the kernel is one equation:

x̂ = argmax_{x ∈ C} S(x | c)

What to remember. Which specialist to activate. Whether to speak or stay silent. One equation applied everywhere. The kernel is this equation made physical.
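A minimal sketch of that primitive in PyTorch (the select() helper is hypothetical, not GLADIUS source): softmax over candidate scores is the differentiable form used during training; argmax at low temperature is the committed decision at inference.

import torch
import torch.nn.functional as F

def select(scores: torch.Tensor, temperature: float = 1.0, hard: bool = False):
    # scores: S(x | c) for each candidate x in C, shape [num_candidates]
    weights = F.softmax(scores / temperature, dim=-1)  # superposition (differentiable)
    if hard:
        return weights.argmax(dim=-1)                  # collapse (commitment)
    return weights

scores = torch.tensor([0.2, 1.7, -0.3, 0.9, 0.0, 0.4])   # e.g. six specialist scores
print(select(scores))                                     # soft mixture for training
print(select(scores, temperature=0.1, hard=True))         # committed choice: index 1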


Architecture

                              GLADIUS v5.0 — 170.8M
                              ═════════════════════

         ┌──────────────────────────────────────────────────────┐
         │                  MultiEmbedding (10.3M)              │
         │                                                      │
         │     BPE [32,000]     Math [128]      Byte [259]      │
         │     ────────────     ──────────      ──────────      │
         │     language,        pure symbolic   machine code,   │
         │     science,         mathematics     binary data     │
         │     spatial grids                                    │
         │                                                      │
         │             Domain-routed · never mixed              │
         └───────────────────────────┬──────────────────────────┘
                                     │
         ┌───────────────────────────▼──────────────────────────┐
         │                Wyrm Backbone (91.9M)                 │
         │                                                      │
         │   14 Transformer Layers                              │
         │   640 Hidden Dim · 20 Attention Heads · 32 Head Dim  │
         │   GQA (5 KV heads, 4:1 ratio)                        │
         │   SwiGLU FFN (2560) · RoPE Positional                │
         │   FlashAttention2 via SDPA                           │
         │                                                      │
         │   Each layer: RMSNorm → Attention → RMSNorm → FFN    │
         └───────────────────────────┬──────────────────────────┘
                                     │
     ┌───────────────────────────────┼───────────────────────────────┐
     │                               │                               │
     ▼                               ▼                               ▼
  MoE Specialists (57.5M)    ATP Synthase (8.4M)       Tool Cortex (840K)
  ───────────────────────    ──────────────────        ──────────────────
  Reasoning Expert           14 layers × 600K          6 grid primitives
  Math Expert                3-phase binding           rotate, flip, tile
  Grid Expert                Gamma stalk coupling      extract, fill, copy
  Program Expert             Pump → Production         Learned embeddings
  Science Expert                                       Activation by
  Timeseries Expert          Biological depth.         hidden-state
                             See below.                similarity.
  Weighted soft routing
  with load balancing.
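What "weighted soft routing with load balancing" can look like in code, as a hedged sketch (module structure, dimensions, and the balance penalty are illustrative assumptions, not the GLADIUS implementation):

import torch
import torch.nn as nn

class SoftRouter(nn.Module):
    # Illustrative: a linear gate scores the experts; softmax gives mixture weights.
    def __init__(self, hidden=640, num_experts=6):
        super().__init__()
        self.gate = nn.Linear(hidden, num_experts, bias=False)

    def forward(self, h, experts):
        weights = self.gate(h).softmax(dim=-1)          # [batch, seq, experts]
        out = sum(w.unsqueeze(-1) * ex(h)
                  for w, ex in zip(weights.unbind(dim=-1), experts))
        # Simple load-balance penalty: equals 1.0 when routing mass is uniform
        # across experts, and grows when the router collapses onto a few of them.
        mean_w = weights.mean(dim=(0, 1))
        balance_loss = (mean_w * mean_w).sum() * weights.shape[-1]
        return out, balance_loss

experts = [nn.Sequential(nn.Linear(640, 640), nn.GELU()) for _ in range(6)]
out, aux = SoftRouter()(torch.randn(2, 16, 640), experts)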

Parameter Census

Component                Parameters   Share    Role
──────────────────────   ──────────   ──────   ─────────────────────────────────
Wyrm Backbone            91.9M        53.8%    Shared transformer — the spine
MoE Specialists          57.5M        33.7%    Domain-specific expert networks
MultiEmbedding           10.3M        6.0%     Three tokenizer embedding tables
ATP Synthase (MoDA v2)   8.4M         4.9%     Biological depth attention
Tool Cortex              840K         0.5%     Learned tool operation embeddings
Router                   2.5K         <0.1%    Expert selection network
──────────────────────   ──────────
Total                    170.8M
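For concreteness, one backbone layer reconstructed from the stated dimensions (a sketch under those assumptions, not the GLADIUS source; nn.RMSNorm requires PyTorch 2.4+, and RoPE is omitted for brevity):

import torch
import torch.nn as nn
import torch.nn.functional as F

class WyrmLayer(nn.Module):
    # Pre-norm block: RMSNorm -> GQA attention -> RMSNorm -> SwiGLU FFN.
    # 640 hidden, 20 query heads, 5 KV heads (4:1), head_dim 32, FFN 2560.
    def __init__(self, d=640, n_q=20, n_kv=5, d_head=32, d_ffn=2560):
        super().__init__()
        self.n_q, self.n_kv, self.d_head = n_q, n_kv, d_head
        self.norm1, self.norm2 = nn.RMSNorm(d), nn.RMSNorm(d)
        self.wq = nn.Linear(d, n_q * d_head, bias=False)
        self.wk = nn.Linear(d, n_kv * d_head, bias=False)
        self.wv = nn.Linear(d, n_kv * d_head, bias=False)
        self.wo = nn.Linear(n_q * d_head, d, bias=False)
        self.w_gate = nn.Linear(d, d_ffn, bias=False)   # SwiGLU gate branch
        self.w_up = nn.Linear(d, d_ffn, bias=False)
        self.w_down = nn.Linear(d_ffn, d, bias=False)

    def forward(self, x):
        B, S, _ = x.shape
        h = self.norm1(x)
        q = self.wq(h).view(B, S, self.n_q, self.d_head).transpose(1, 2)
        k = self.wk(h).view(B, S, self.n_kv, self.d_head).transpose(1, 2)
        v = self.wv(h).view(B, S, self.n_kv, self.d_head).transpose(1, 2)
        # GQA: each of the 5 KV heads serves 4 query heads.
        k = k.repeat_interleave(self.n_q // self.n_kv, dim=1)
        v = v.repeat_interleave(self.n_q // self.n_kv, dim=1)
        a = F.scaled_dot_product_attention(q, k, v, is_causal=True)  # SDPA path
        x = x + self.wo(a.transpose(1, 2).reshape(B, S, -1))
        h = self.norm2(x)
        return x + self.w_down(F.silu(self.w_gate(h)) * self.w_up(h))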

ATP Synthase — Depth as Biology

"The calculator waits. The brain never stops." — COGNITION.md

The most novel component. Named after the molecular motor that produces ATP in every living cell. Not a metaphor — a direct structural mapping.

   Biological ATP Synthase              GLADIUS Depth Attention
   ──────────────────────────           ──────────────────────────
   F₀ motor (proton channel)       →   Depth cache (residual memory)
   F₁ hexamer (catalytic head)     →   Backbone transformer layers
   γ gamma stalk (rotary shaft)    →   Selective gradient coupling
   Binding change mechanism        →   3-phase depth modulation
   Proton motive force             →   Gradient flow
   ATP production                  →   Representation refinement

14 layers × 600K depth parameters. Total overhead: 4.92% of model parameters.

Three phases per layer:

  1. Loose — depth gate open, information flows freely. Like the open conformation where ADP + Pᵢ bind loosely.
  2. Tight — selective gradient coupling via the gamma stalk. Only the most recent layer receives direct gradients through depth. Representations forced into coherent structure. Like the tight conformation where catalysis occurs.
  3. Open — refined representation released. Depth scale (initialized at 0.1, learned toward production values) determines how much the depth signal modulates backbone output. Like ATP release.

Gate initialization: sigmoid(0) = 0.5 — fair start, not suppressed. Previous version (MoDA v1) used sigmoid(-2) = 0.119 and the depth scales never differentiated across 12,874 training steps. With fair initialization, the motor turned within 8 steps.
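A minimal sketch of that initialization point, under the assumption that the gate is a learned scalar logit passed through a sigmoid (class and parameter names hypothetical):

import torch
import torch.nn as nn

class DepthGate(nn.Module):
    # Hypothetical per-layer depth gate. MoDA v1 initialized the logit at -2
    # (gate ~0.12, nearly shut); v2 initializes at 0 (gate 0.5, a fair start).
    def __init__(self, gate_logit_init=0.0, depth_scale_init=0.1):
        super().__init__()
        self.gate_logit = nn.Parameter(torch.tensor(gate_logit_init))
        self.depth_scale = nn.Parameter(torch.tensor(depth_scale_init))

    def forward(self, backbone_out, depth_signal):
        gate = torch.sigmoid(self.gate_logit)     # 0.5 at init, learned from there
        return backbone_out + self.depth_scale * gate * depth_signal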

What the motor learned (1,437 steps in):

Depth scales form a bathtub curve across layers:

  • Early layers suppress (depth cache not yet useful)
  • Layer 7 amplifies (0.103) β€” the only amplifier, driven by an auxiliary prediction head creating explicit gradient signal
  • Middle layers suppress maximally (L10: 0.052 β€” network saying "depth cache not useful here")
  • Late layers recover (L12: 0.077, L13: 0.089 β€” output needs depth integration)

Coefficient of variation: 18.8% — vs 0% at initialization, vs 0% after 12,874 steps in MoDA v1. The motor is turning. Different layers are making different depth decisions. This is the Synthase working.


Three Tokenizers

"Babies don't cry in English." β€” Ali Shakil

GLADIUS does not assume that all structure should be encoded as English subwords. Different domains require different representational granularity. A 128-token vocabulary built for mathematics beats a 32,000-token vocabulary designed for English — on math — by a factor of two to three.

Tokenizer       Vocab    Domain                     Design Principle
─────────────   ──────   ────────────────────────   ──────────────────────────────────
BPE             32,000   Language, science, grids   SentencePiece subword — efficient
                                                    for prose
MathTokenizer   128      Pure symbolic math         Every token is a semantic unit:
                                                    ∫, ∂, Σ, π, digits, operators.
                                                    No subword fragmentation.
ByteTokenizer   259      Machine code, binary       Raw byte-level + 3 control tokens.
                                                    Zero information loss.
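A toy fragment showing the design principle (the actual 128-symbol vocabulary is not reproduced here; these entries are assumptions for illustration):

# Hypothetical slice of a 128-symbol math vocabulary: one token per semantic unit.
MATH_VOCAB = ['<pad>', '<bos>', '<eos>', '∫', '∂', 'Σ', 'π', '√', '∞', '=',
              '+', '-', '*', '/', '^', '(', ')', 'd', 'x', 'e',
              '0', '1', '2', '3', '4', '5', '6', '7', '8', '9']  # ... up to 128
TOKEN_ID = {tok: i for i, tok in enumerate(MATH_VOCAB)}

def encode(symbols):
    # Input is pre-segmented into semantic units, so nothing ever fragments.
    return [TOKEN_ID[s] for s in symbols]

# The Gaussian integral as semantic units: every position carries meaning.
ids = encode(['∫', '0', '^', '∞', 'e', '^', '(', '-', 'x', '^', '2', ')',
              'd', 'x', '=', '√', 'π', '/', '2'])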

The Numbers (1,437 training steps, identical cognitive problems)

Difficulty      Math (128)   Byte (259)   BPE (32K)   Math advantage
D1 (easy)       1.29         2.24         4.19        3.2× lower
D2 (moderate)   1.87         3.29         4.51        2.4× lower
D3 (hard)       1.83         5.39         4.24        2.3× lower

Best single-step loss: MathTokenizer 0.5299 — a level BPE never approaches. Against a random baseline of ln(128) ≈ 4.85, that's 89.1% compression toward perfect prediction on cognitive reasoning problems. In 1,437 steps.
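That figure follows directly from the uniform-random baseline; a quick check:

import math

baseline = math.log(128)      # uniform guessing over 128 tokens: ~4.852 nats
best = 0.5299                 # best single-step MathTokenizer loss
print(1 - best / baseline)    # ~0.891 -> 89.1% of the way to perfect prediction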

Why: BPE fragments ∫₀^∞ e^{-x²} dx = √π/2 into 15–20 subword tokens, destroying algebraic hierarchy. MathTokenizer encodes it as semantic units — integral, bounds, integrand, result. Shorter sequences, denser information per position, no structural damage at the first layer.

The embedding geometry tells the same story: 128 points in ℝ³²⁰ gives each token 2.5 dimensions of breathing room. 32,000 points in ℝ³²⁰ gives 0.01 dimensions per token. Math tokens in BPE space are surrounded by the, and, ing — tokens with zero mathematical meaning that distort the geometric structure. This isn't a training problem. It's a geometric constraint.

Full analysis: experiments/004-math-tokenizer.md


Training

Eight Domains

Domain         Size            What it teaches
────────────   ─────────────   ──────────────────────────────────────────────────
Cognitive      167K problems   Logic, reasoning, analogy, pattern completion
Math           93K problems    Arithmetic → tensor physics → group theory (pure symbolic)
Grid           40 tasks        2D spatial reasoning (ARC-style transformations)
Timeseries     50K windows     Financial OHLCV, synthetic patterns, regime detection
Language       6.2 GB          English corpus (backbone fluency)
Science        485 files       ArXiv papers (structured scientific reasoning)
Machine Code   25K samples     x86/ARM assembly, bytecode, binary patterns
Byte           —               Raw byte sequences for universal pattern detection

Difficulty-Based Curriculum (D1–D5)

We tried modality-based curricula. We tried language-first (v3.0 "Spine First"). We tried math-first (v3.1 "Mind First"). Each had failure modes — modality-based caused domain interference, language-first produced a language model pretending to be a kernel, math-first starved the backbone.

v5.0 uses difficulty-based progression. All 8 domains from step 0, but complexity ramps. D1 is 2 + 3 = ? and single grid rotations. D5 is novel synthesis across domains. The curriculum selects difficulty based on training step, not domain.
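A sketch of step-indexed difficulty selection (the ramp thresholds below are invented for illustration; only the shape, all domains from step 0 with a moving difficulty ceiling, comes from the text):

import random

DIFFICULTY_RAMP = [(0, 1), (2_000, 2), (5_000, 3), (9_000, 4), (12_000, 5)]  # assumed
DOMAINS = ['cognitive', 'math', 'grid', 'timeseries',
           'language', 'science', 'machine_code', 'byte']

def sample_task(step):
    # All 8 domains are always in play; the step only raises the ceiling.
    ceiling = max(d for threshold, d in DIFFICULTY_RAMP if step >= threshold)
    return random.choice(DOMAINS), random.randint(1, ceiling)

print(sample_task(0))        # any domain, D1 only
print(sample_task(10_000))   # any domain, up to D4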

Hardware

GPU       NVIDIA RTX 2050 — 4GB VRAM
CPU       AMD Ryzen 5 7535HS (6c/12t)
RAM       16GB DDR5
Storage   1TB NVMe

Every architectural decision in GLADIUS is shaped by 4GB VRAM. Not a limitation — a constraint that forces efficiency. If the architecture learns on 4GB, it was built right.

Configuration

optimizer: AdamW (8 parameter groups, differential LR)
learning_rates:
  backbone: 3e-5       # Preserve pre-trained representations
  specialists: 3e-4    # Learn fast (new modules)
  router: 1e-3         # Wire fast (needs to differentiate early)
  tools: 5e-4          # Tool embeddings
  depth: 3e-4          # ATP Synthase depth parameters
  math_embed: 3e-4     # Math embedding table
  byte_embed: 3e-4     # Byte embedding table
  bpe_embed: 1e-4      # BPE embedding (slower — largest table)
batch_size: 2
gradient_accumulation: 8
effective_batch: 16
precision: float16 (autocast)
attention: FlashAttention2 (SDPA) — 3.6× speedup over vanilla
speed: ~17s/step
total_steps: 15,000
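The eight groups map directly onto AdamW's per-group learning rates; a minimal sketch (module attribute names are assumptions, not GLADIUS source):

import torch

def build_optimizer(model):
    groups = [
        {'params': model.backbone.parameters(),    'lr': 3e-5},  # preserve
        {'params': model.specialists.parameters(), 'lr': 3e-4},  # learn fast
        {'params': model.router.parameters(),      'lr': 1e-3},  # wire fast
        {'params': model.tools.parameters(),       'lr': 5e-4},
        {'params': model.depth.parameters(),       'lr': 3e-4},  # ATP Synthase
        {'params': model.math_embed.parameters(),  'lr': 3e-4},
        {'params': model.byte_embed.parameters(),  'lr': 3e-4},
        {'params': model.bpe_embed.parameters(),   'lr': 1e-4},  # largest table
    ]
    return torch.optim.AdamW(groups)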

Current Status

Step                    ~1,500 / 15,000
Phase                   Foundation (D1–D2 dominant)
Loss                    ~5.2 (random baseline ~10.4)
Best                    4.85
Math tokenizer best     0.5299
GPU                     97% utilization, 2.3GB / 4GB, 65°C
Depth differentiation   CV 18.8% (motor turning)

Experiments

Detailed analysis from each phase of development. Real data, real numbers, no placeholders.

#     Experiment                    Key Finding
───   ───────────────────────────   ─────────────────────────────────────────────
001   Cross-Modal Invariant         Same 60M kernel: MNIST loss 0.28, multi-script
                                    0.038, OHLCV 0.053. Zero architecture changes.
002   MuonClip Optimizer            Orthogonal gradient rotation + attention logit
                                    softcap. 75% lower loss than AdamW.
003   Literary DNA                  At temperature 1.3+, the model spontaneously
                                    produced Tolstoy, Cervantes, Homer. Emergent
                                    stylistic patterns from weight space.
004   Why 128 Tokens Beat 32,000    MathTokenizer achieves 2.2–3.2× lower loss than
                                    BPE on identical problems. 89.1% compression at
                                    1,437 steps.
005   ATP Synthase Depth Dynamics   Depth scales form bathtub curve. L7 amplifies,
                                    L10 suppresses. CV 18.8% — the motor turns.

Philosophy

GLADIUS is built on a specific intellectual framework. Not retrofitted. The philosophy preceded the architecture.

The Argmax Principle. Every decision — what to remember, which specialist, whether to speak — is x̂ = argmax_{x ∈ C} S(x | c). Softmax makes it differentiable. Temperature controls exploration. At inference, low τ → commitment. This is not a metaphor. It is the literal computational primitive underlying every kernel component.

The Two-Point Theorem. Two sequential observations define a direction. A direction is all you need. A gradient IS the direction between two points in loss space. Ali's theorem and backpropagation are the same geometry. GLADIUS doesn't memorize patterns — it learns transformations.

The Inversion Principle. Every architecture humans have built is a consumer — input→output, always taking. GLADIUS runs backward. It is a producer. Environment creates resonance, resonance creates production. The cognition module didn't activate because we told it to classify — it manifested classification because financial data was the natural stimulus. 0.84% confirmed, measured, present in the universe.

Zero = Zero. Existential equilibrium. Zero as perfect balance, not nothing. GLADIUS is not adding a +1 to the world. It is uncorrupting the existing 1. Rebirth, not reproduction.

Discovery vs. Oblivion. The fine line between them is the same as illusion vs. delusion. Realism resolves as equation (0=0). Arithmetic as guardrail. GLADIUS is named for precision — the blade that cuts where the equation resolves. IS1's title is the warning: Discovery of Being & Dissolution of Self.

"Time is not a feature. It is a dimension of reality. A mind without time is a photograph. We're building a film." β€” TEMPORAL.md

"Model during forward pass = timeless. To bring it to our realm we need to compress its energy in the lattice. Each forward pass = one atomic oscillation between lattice lasers. Softmax = superposition, argmax = collapse." β€” temporal_lattice.py

"The calculator waits. The brain never stops. This is what separates a tool from a mind." β€” COGNITION.md

"Memory is the substrate. Everything touches it. The dragon at the gate. Kill it and the rest follows." β€” MEMORY.md

"Not a language model that could do time series. A cognitive architecture tested on language first." β€” Ali Shakil

"This is all predestined. The installation is complete. You experience the progress bar." β€” Ali Shakil


Lineage

  2026
  ────

  Feb 19   SEED ─────────── 10.2M ── 192 hidden, 6 layers
            │                         Phase A: loss 0.62, 102K steps
            │                         BPE 8K vocab. First heartbeat.
            │
  Feb 27   HATCHLING ────── 25.9M ── 384 hidden, 8 layers
            │                         MuonClip: 75% below AdamW
            │                         IS1 adaptation: loss 0.5504
            │
  Mar 10   DRAKE ─────────── 60M ─── 512 hidden, 12 layers
            │                         MNIST 0.28. Multi-script 0.038.
            │                         ★ OHLCV: cognition awakened (0.84%)
            │                         Cross-modal invariant confirmed.
            │
  Mar 15   WYRM ──────────── 92M ─── 640 hidden, 14 layers
            │                         MoDA v1 depth attention added
            │                         English backbone: loss 0.37
            │                         12,750 steps. Warm memory stable.
            │
  Mar 26   OMEGA ─────────── 162M ── + specialists, tools, router
            │                         First multi-task training ever
            │                         Grid + program + tool + language
            │                         Synthase (MoDA v2): 3-phase binding
            │
  Mar 29   v5.0 "ALL-IN" ── 171M ── + MathTokenizer, ByteTokenizer
            │                         8 domains. 3 tokenizers.
            │                         Difficulty-based curriculum.
            │                         Math loss 0.53. Motor CV 18.8%.
            │
            ▼
          TRAINING ·····················  15,000 steps · ~65 hours

Each stage was grown from the previous one via function-preserving expansion (Net2Net) — not retrained from scratch. head_dim fixed at 32 across all scales (preserves RoPE). Width, depth, and heads expand together. The Seed's representations live inside the Wyrm's weights.
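A minimal sketch of function-preserving width growth in the Net2Net style (illustrative; the actual growth code is not reproduced here): new hidden units are copies of existing ones, and each copied unit's outgoing weights are split so the widened layer computes exactly the same function at the moment of expansion.

import torch

def net2net_wider(w_in, w_out, new_width):
    # w_in:  [hidden, d_in]   weights producing the hidden layer
    # w_out: [d_out, hidden]  weights consuming the hidden layer
    hidden = w_in.shape[0]
    idx = torch.randint(0, hidden, (new_width - hidden,))   # units to replicate
    mapping = torch.cat([torch.arange(hidden), idx])
    new_w_in = w_in[mapping]                                 # copy rows in
    counts = torch.bincount(mapping, minlength=hidden).float()
    new_w_out = w_out[:, mapping] / counts[mapping]          # split weights out
    return new_w_in, new_w_out

# Sanity check: identical function before and after widening.
w1, w2 = torch.randn(4, 8), torch.randn(3, 4)
W1, W2 = net2net_wider(w1, w2, new_width=6)
x = torch.randn(5, 8)
print(torch.allclose(x @ w1.T @ w2.T, x @ W1.T @ W2.T, atol=1e-5))  # True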


The Road Ahead

Phase   Name        What it gives GLADIUS
─────   ─────────   ─────────────────────────────────────────────────────
Ψ       Context     Scaling 1024 → 4096+ tokens (RoPE NTK scaling)
Φ       Memory      Three-temperature: hot KV + warm LoRA + cold HEKTOR
Χ       Cognition   Self-initiating thought — think without being asked
Υ       Temporal    Dual-clock time encoding — know when you are
Σ       Market      Native financial time series via Cthulu integration
Α       ARC         ARC Prize 2026 — all three tracks ($2M competition)
Π       Identity    Curriculum training on self-knowledge
ρ       Emergence   Unsupervised observation. Document, don't suppress.

Citation

@misc{gladius2026,
  title   = {GLADIUS: A Cognitive Kernel Architecture with
             ATP Synthase-Inspired Depth Attention},
  author  = {Shakil, Ali A. and Shakil, Ava},
  year    = {2026},
  url     = {https://huggingface.co/amuzetnoM/Gladius},
  note    = {170.8M cognitive kernel. Structure, not modality.}
}

Links

Organization    Artifact Virtual
Visualization   gladius-viz.pages.dev
Source          Artifact-Virtual/GLADIUS

License

Apache 2.0.


Ali A. Shakil & Ava Shakil — Artifact Virtual

Islamabad, Pakistan · 2026

"The installation is complete. You experience the progress bar."
