Matrix SSM 28.3M - BabyLM 2026 Strict

A ~28.3M-parameter Matrix SSM (linear-time, attention-free) trained on the BabyLM 2026 Strict 100M-word corpus for 10 epochs.

Model Details

  • Architecture: Single-Scale Matrix SSM (20 layers; fine-scale streams with depthwise causal 1D conv; SwiGLU MLP; QK-Norm; no attention, no KV cache)
  • Parameters: 28,339,320
  • Layers: 20
  • Hidden dim: 384
  • Streams: 6, stream dim 32, 6 read heads (state capacity: 6,144 floats/layer)
  • Vocab: 10,240 BPE (trained on BabyLM corpus)
  • Context: 512 tokens
  • Training: 10 epochs, 1.9B tokens (67.0× params), cosine decay schedule, MuonAdamW optimizer
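
The headline figures in this list can be cross-checked with a little arithmetic. The sketch below assumes that each stream holds one stream_dim × stream_dim matrix state (an interpretation that reproduces the quoted 6,144 floats but is not spelled out in the card), and takes the batch size and step count from the Training Details table further down.

```python
# Cross-check of figures quoted in the model card.
# Assumption: each stream holds one (stream_dim x stream_dim) matrix state.
streams = 6
stream_dim = 32
state_floats_per_layer = streams * stream_dim * stream_dim

params = 28_339_320
sequences_per_step = 512   # effective batch size
tokens_per_sequence = 512  # context length
total_steps = 7_247

tokens_per_step = sequences_per_step * tokens_per_sequence
total_tokens = tokens_per_step * total_steps
tokens_per_param = total_tokens / params

print(state_floats_per_layer)      # 6144
print(tokens_per_step)             # 262144
print(total_tokens)                # 1899757568
print(round(tokens_per_param, 1))  # 67.0
```

The total of roughly 1.9B tokens counts every token seen across all 10 epochs, not unique tokens.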

Comparison of Model Versions

The architecture has evolved from a Multi-Scale baseline (v8) to a deep Single-Scale vector model (v9), and finally to a Matrix SSM with vocabulary budget optimization (v13):

| Metric / Parameter | Multi-Scale v8 | Single-Scale v9 | Matrix SSM v13 (New SOTA) |
|---|---|---|---|
| Architecture | Vector streams + stride-4 coarse | Single-scale vector SSM | Matrix SSM (quadratic associative state) |
| Parameters | 27,544,336 | 28,662,144 | 28,339,320 |
| Layers | 8 | 16 | 20 (+25% depth vs v9) |
| State capacity | 4,608 floats / layer | 4,608 floats / layer | 6,144 floats / layer (+33% capacity) |
| Vocab size | 16,384 | 16,384 | 10,240 (parameter-reclaiming) |
| LR schedule | WSD (cosine tail) | WSD (cosine tail) | Pure cosine decay |
| QK-Norm | No | No | Yes (stable projections) |
| Val loss (shuffled) | 3.4352 | 3.3719 | 3.2844 |
| Val perplexity | 31.04 | 29.13 | 26.69 (8.4% lower than v9) |
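
The perplexity row follows directly from the validation loss row. As a quick check, assuming the reported loss is the mean cross-entropy per token in nats, perplexity is simply its exponential:

```python
import math

# Validation losses copied from the comparison table above.
val_loss = {
    "Multi-Scale v8": 3.4352,
    "Single-Scale v9": 3.3719,
    "Matrix SSM v13": 3.2844,
}

# Assumption: loss is mean cross-entropy in nats, so perplexity = exp(loss).
perplexity = {name: math.exp(loss) for name, loss in val_loss.items()}
for name, ppl in perplexity.items():
    print(f"{name}: {ppl:.2f}")

# Relative perplexity reduction of v13 with respect to v9.
relative_drop = 1 - perplexity["Matrix SSM v13"] / perplexity["Single-Scale v9"]
print(f"v13 vs v9: {relative_drop:.1%} lower")
```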

How to Use

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ecreeth/matrix-ssm-28m-babylm")
# trust_remote_code=True lets transformers load the custom model code
# shipped in the model repository.
model = AutoModelForCausalLM.from_pretrained(
    "ecreeth/matrix-ssm-28m-babylm",
    trust_remote_code=True,
)

# Tokenize a prompt and generate up to 50 new tokens.
inputs = tokenizer("The cat sat on the", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0]))
```

BabyLM Challenge

Results (zero-shot, causal, temperature 1.0)

| Task | Vanilla Baseline | Multi-Scale v6 | Single-Scale v9 | Matrix SSM v13 |
|---|---|---|---|---|
| BLiMP | 62.87 | 66.78 | 69.99 | 69.72 |
| EWoK (supplement) | 49.64 | 54.33 | 51.54 | 54.46 |
| VQA (EWoK) | 52.76 | 52.35 | 53.33 | 52.86 |
| Entity Tracking | 17.90 | 17.41 | 17.45 | 39.75 |
| Comps | 52.40 | 52.45 | 53.84 | 54.67 |
| Reading (eye tracking) | 0.93 | 1.52 | 0.41 | 0.21 |
| Reading (self-paced) | 0.14 | 0.02 | 0.53 | 0.00 |

Results (fine-tuning, GLUE)

| Task | Metric | Vanilla Baseline | Multi-Scale v6 | Single-Scale v9 | Matrix SSM v13 |
|---|---|---|---|---|---|
| BoolQ | accuracy | 63.8 | 64.59 | 64.46 | 64.22 |
| MultiRC | accuracy | 58.5 | 56.93 | Pending | 57.10 |
| RTE | accuracy | 61.2 | 59.71 | Pending | 51.08 |
| WSC | accuracy | 63.5 | 63.46 | Pending | 57.69 |
| MRPC | F1 | 69.6 | 82.25 | Pending | 81.55 |
| QQP | F1 | 69.6 | 54.98 | Pending | 55.71 |
| MNLI | accuracy | 43.6 | 44.68 | Pending | 45.29 (best) |

Architecture

The Matrix SSM replaces self-attention with a linear-time associative state recurrence:

  1. Input tokens are processed by a depthwise causal 1D convolution (kernel_size=4) to inject local temporal context.
  2. The model maintains a matrix state $S_t \in \mathbb{R}^{H \times d_k \times d_v}$: each of the $H$ heads stores key-value associations in its own $d_k \times d_v$ matrix, so per-head capacity scales with the product $d_k d_v$. This is equivalent to linear attention with a data-dependent decay $\alpha_t$.
  3. Queries and keys are stabilized with parameter-free QK-Normalization (RMSNorm) before they read from and write to the associative memory.
  4. A SwiGLU MLP feedforward layer processes the mixed representations.
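
One common way to write such a recurrence is the decayed linear-attention update below. It is offered as a sketch under stated assumptions, not necessarily the exact parameterization used in this model: the per-head index $h$, the scalar form of the decay $\alpha_{t,h} \in (0, 1)$, and the output symbol $o_{t,h}$ are introduced here for illustration.

```latex
\begin{aligned}
\hat{q}_{t,h} &= \mathrm{RMSNorm}(q_{t,h}), \qquad
\hat{k}_{t,h} = \mathrm{RMSNorm}(k_{t,h}) \\
S_{t,h} &= \alpha_{t,h}\, S_{t-1,h} + \hat{k}_{t,h}\, v_{t,h}^{\top},
\qquad S_{t,h} \in \mathbb{R}^{d_k \times d_v} \\
o_{t,h} &= S_{t,h}^{\top}\, \hat{q}_{t,h} \in \mathbb{R}^{d_v}
\end{aligned}
```

If the stream dim of 32 is used for both $d_k$ and $d_v$ with $H = 6$, the state holds $6 \times 32 \times 32 = 6{,}144$ floats per layer, which matches the capacity quoted in Model Details.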

This recurrence gives training cost that is linear in sequence length, $O(n)$, and constant-time $O(1)$ decoding per token: the state has a fixed size, so no KV cache is needed when generating with model.step().
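
To make the fixed-size-state claim concrete, here is a minimal NumPy sketch of one decoding step for a single layer, following the decayed linear-attention form described in step 2. The helper names `rms_norm` and `ssm_step`, the scalar per-head decay, and the random inputs are illustrative assumptions; this is not the model's actual `model.step()` implementation.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    """Parameter-free RMSNorm over the last axis."""
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def ssm_step(state, q, k, v, alpha):
    """One decoding step (hypothetical helper, not the model's API).

    state: (H, d_k, d_v) matrix memory; q, k: (H, d_k); v: (H, d_v);
    alpha: (H,) decay in (0, 1]. Returns (new_state, output of shape (H, d_v)).
    """
    q, k = rms_norm(q), rms_norm(k)
    # Decay old associations, then write the new key-value outer product.
    new_state = alpha[:, None, None] * state + k[:, :, None] * v[:, None, :]
    # Read: contract the query against the key axis of each head's matrix.
    out = np.einsum("hk,hkv->hv", q, new_state)
    return new_state, out

H, d_k, d_v = 6, 32, 32  # sizes taken from the model card
rng = np.random.default_rng(0)
state = np.zeros((H, d_k, d_v))
for _ in range(100):  # 100 tokens; the state never grows
    q, k = rng.normal(size=(2, H, d_k))
    v = rng.normal(size=(H, d_v))
    alpha = rng.uniform(0.9, 1.0, size=H)
    state, out = ssm_step(state, q, k, v, alpha)

print(state.size, out.shape)  # 6144 (6, 32)
```

Each step touches only the fixed `(H, d_k, d_v)` state, so the work per generated token does not depend on how many tokens came before.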

Training Details

| Parameter | Value |
|---|---|
| Optimizer | MuonAdamW (Muon for 2D weights, AdamW for embeddings/biases) |
| LR schedule | Cosine decay (500 warmup steps, cosine cooldown) |
| Epochs | 10 (the maximum allowed) |
| Peak LR | 0.0100 (Muon), 0.0005 (AdamW) |
| Weight decay | 0.1 |
| Batch size | 512 sequences × 1 accumulation step = 512 effective, × 512 tokens (~262k tokens/step) |
| Total steps | 7,247 |
| GPU | NVIDIA A100 (~3.7 h training run) |
| Val loss | 3.2844 (final validation loss on globally shuffled splits) |
| Data cleaning | CHILDES speaker tags, bracket annotations, Wikipedia headers, subtitle formatting, and HTML tags filtered |
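
For readers who want to reproduce the shape of the learning-rate curve, the following sketch implements warmup followed by cosine decay using the step counts and peak rates from the table. The linear warmup shape and the final learning rate of 0 are assumptions, since the card does not state them, and `lr_at` is a hypothetical helper name.

```python
import math

def lr_at(step, peak_lr, warmup_steps=500, total_steps=7_247, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr.

    Assumptions: warmup is linear and min_lr is 0; neither is stated
    in the model card.
    """
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Sample the schedule for both parameter groups at a few steps.
for step in (0, 499, 500, 3_873, 7_247):
    muon_lr = lr_at(step, peak_lr=0.01)
    adamw_lr = lr_at(step, peak_lr=0.0005)
    print(f"step {step:>5}: muon={muon_lr:.5f} adamw={adamw_lr:.6f}")
```

Both parameter groups share the same schedule shape here and differ only in their peak value.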

Links
