Matrix SSM 28.3M — BabyLM 2026 Strict

A ~28.3M-parameter Matrix SSM (linear-time, attention-free) trained on the BabyLM 2026 Strict 100M-word corpus for 10 epochs.

Model Details

Architecture: Single-Scale Matrix SSM (20 layers; fine-scale streams with depthwise causal 1D conv; SwiGLU MLP; QK-Norm; no attention, no KV cache)
Parameters: 28,339,320
Layers: 20
Hidden dim: 384
Streams: 6, stream dim 32, 6 read heads (state capacity: 6,144 floats/layer)
Vocab: 10,240 BPE (trained on BabyLM corpus)
Context: 512 tokens
Training: 10 epochs, 1.9B tokens (67.0× params), cosine decay schedule, MuonAdamW optimizer
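
The figures above can be cross-checked with a little arithmetic. The sketch below assumes each stream holds a stream_dim × stream_dim matrix state (an assumption that happens to reproduce the quoted capacity) and takes the batch size and step count from the Training Details table. Variable names are illustrative.

```python
# Values quoted in this model card.
params = 28_339_320
streams, stream_dim = 6, 32
batch_size, context = 512, 512
total_steps = 7_247

# Assumption: every stream keeps one stream_dim x stream_dim matrix state.
state_floats = streams * stream_dim * stream_dim

# Token budget: sequences per step x tokens per sequence x optimizer steps.
tokens_per_step = batch_size * context
total_tokens = tokens_per_step * total_steps
tokens_per_param = total_tokens / params

print(f"state capacity : {state_floats:,} floats/layer")
print(f"tokens per step: {tokens_per_step:,}")
print(f"total tokens   : {total_tokens / 1e9:.2f}B")
print(f"tokens / param : {tokens_per_param:.1f}")
```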

Comparison of Model Versions

The architecture has evolved from a Multi-Scale baseline (v8) to a deep Single-Scale vector model (v9), and finally to a Matrix SSM with vocabulary budget optimization (v13):

Metric / Parameter	Multi-Scale v8	Single-Scale v9	Matrix SSM v13 (current best)
Architecture	Vector streams + stride-4 Coarse	Single-Scale Vector SSM	Matrix SSM (Quadratic associative state)
Parameters	27,544,336	28,662,144	28,339,320
Layers	8	16	20 (+25% depth)
State Capacity	4,608 floats / layer	4,608 floats / layer	6,144 floats / layer (+33% capacity)
Vocab Size	16,384	16,384	10,240 (parameter-reclaiming)
LR Schedule	WSD (Cosine tail)	WSD (Cosine tail)	Pure Cosine Decay
QK-Norm	No	No	Yes (Stable projections)
Val Loss (Shuffled)	3.4352	3.3719	3.2844
Val Perplexity	31.04	29.13	26.69 (8.4% lower than v9)
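
The perplexity row follows from the loss row, assuming the validation loss is a mean cross-entropy in nats per token (the card does not state the unit, but exponentiating the losses reproduces the table to two decimals). A minimal check:

```python
import math

# Validation losses from the comparison table (assumed to be nats per token).
val_loss = {"v8": 3.4352, "v9": 3.3719, "v13": 3.2844}

# Perplexity is the exponential of the mean cross-entropy loss.
ppl = {name: math.exp(loss) for name, loss in val_loss.items()}

# Relative perplexity change of v13 measured against v9.
rel_change = (ppl["v13"] - ppl["v9"]) / ppl["v9"]

for name, value in ppl.items():
    print(f"{name}: perplexity {value:.2f}")
print(f"v13 vs v9: {rel_change:+.1%}")
```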

How to Use

from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "ecreeth/matrix-ssm-28m-babylm"

tokenizer = AutoTokenizer.from_pretrained(repo_id)
# trust_remote_code lets transformers load the custom modeling code from the repo.
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)

# Keep the prompt plus generated tokens within the 512-token context.
inputs = tokenizer("The cat sat on the", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

BabyLM Challenge

Track: Strict (100M words, 10 epochs max)
Eval repo: babylm-eval
Leaderboard: BabyLM-Leaderboard-2026

Results (zero-shot, causal, temperature 1.0)

Task	Score (Vanilla Baseline)	Score (Multi-Scale v6)	Score (Single-Scale v9)	Score (Matrix SSM v13)
BLiMP	62.87	66.78	69.99	69.72
EWOK (supplement)	49.64	54.33	51.54	54.46
VQA (EWoK)	52.76	52.35	53.33	52.86
Entity Tracking	17.90	17.41	17.45	39.75
Comps	52.40	52.45	53.84	54.67
Reading (eye tracking)	0.93	1.52	0.41	0.21
Reading (self-paced)	0.14	0.02	0.53	0.00

Results (fine-tuning, GLUE)

Task	Metric	Score (Vanilla Baseline)	Score (Multi-Scale v6)	Score (Single-Scale v9)	Score (Matrix SSM v13)
BOOLQ	accuracy	63.8	64.59	64.46	64.22
MULTIRC	accuracy	58.5	56.93	Pending / TBD	57.10
RTE	accuracy	61.2	59.71	Pending / TBD	51.08
WSC	accuracy	63.5	63.46	Pending / TBD	57.69
MRPC	f1	69.6	82.25	Pending / TBD	81.55
QQP	f1	69.6	54.98	Pending / TBD	55.71
MNLI	accuracy	43.6	44.68	Pending / TBD	45.29 (best)

Architecture

The Matrix SSM replaces self-attention with a linear-time associative state recurrence:

Input tokens are processed by a depthwise causal 1D convolution (kernel_size=4) to inject local temporal context.
The model maintains a matrix state $S_t \in \mathbb{R}^{H \times d_k \times d_v}$: each of the $H$ heads stores key-value associations in a $d_k \times d_v$ matrix, so capacity grows quadratically with head dimension. This is equivalent to linear attention with a data-dependent decay $\alpha_t$.
Queries and keys are stabilized with parameter-free QK-Norm (RMSNorm) before the associative memory is read.
A SwiGLU MLP feedforward layer processes the mixed representations.

This yields training cost that is linear in sequence length, $O(n)$, and constant-time $O(1)$ decoding per token (with no KV cache) using model.step().
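
The steps above can be sketched in NumPy for a single head. This is an illustration under stated assumptions, not the model's actual code: the update rule $S_t = \alpha_t S_{t-1} + k_t v_t^\top$, the scalar per-step decay, the random projections, and the function names (`step`, `scan`, and so on) are all hypothetical. The final check compares the linear-time recurrence with a quadratic-time decayed linear attention reference.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    """Parameter-free RMSNorm over the last axis."""
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def causal_depthwise_conv(x, kernel):
    """x: (T, C), kernel: (K, C). Output at t depends only on x[t-K+1 .. t]."""
    K = kernel.shape[0]
    padded = np.concatenate([np.zeros((K - 1, x.shape[1])), x], axis=0)
    return sum(kernel[j] * padded[j : j + x.shape[0]] for j in range(K))

def step(S, q, k, v, alpha):
    """One constant-time decoding step. S: (d_k, d_v), q/k: (d_k,), v: (d_v,)."""
    S = alpha * S + np.outer(k, v)  # decay old associations, write the new one
    return S, q @ S                 # read the memory with the query

def scan(q, k, v, alpha):
    """Run the recurrence over a whole sequence; cost is linear in T."""
    S = np.zeros((k.shape[1], v.shape[1]))
    outputs = []
    for t in range(q.shape[0]):
        S, y = step(S, q[t], k[t], v[t], alpha[t])
        outputs.append(y)
    return np.stack(outputs)

def decayed_linear_attention(q, k, v, alpha):
    """Quadratic-time reference: y_t = sum_{s<=t} (prod_{r=s+1..t} alpha_r) (q_t . k_s) v_s."""
    log_a = np.cumsum(np.log(alpha))
    decay = np.exp(log_a[:, None] - log_a[None, :])  # decay[t, s] for s <= t
    weights = np.tril(decay) * (q @ k.T)             # causal mask via lower triangle
    return weights @ v

rng = np.random.default_rng(0)
T, C, d = 16, 8, 4
x = rng.standard_normal((T, C))
h = causal_depthwise_conv(x, rng.standard_normal((4, C)))  # kernel_size=4

# Hypothetical random projections standing in for learned weights.
Wq, Wk, Wv = (rng.standard_normal((C, d)) for _ in range(3))
q, k, v = rms_norm(h @ Wq), rms_norm(h @ Wk), h @ Wv
alpha = rng.uniform(0.8, 1.0, size=T)  # data-dependent in the model; random here

y_scan = scan(q, k, v, alpha)
y_ref = decayed_linear_attention(q, k, v, alpha)
print("recurrence matches decayed linear attention:", np.allclose(y_scan, y_ref))
```

The recurrent form carries only the fixed-size state S between tokens, which is why decoding needs no KV cache.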

Training Details

Parameter	Value
Optimizer	MuonAdamW (Muon for 2D weights, AdamW for embeddings/biases)
LR schedule	Cosine decay with 500 warmup steps
Epochs	10 (the track maximum)
Peak LR	0.0100 (muon), 0.0005 (adamw)
Weight decay	0.1
Batch size	512 sequences × 512 tokens, no gradient accumulation (262,144 tokens/step)
Total steps	7,247
GPU	NVIDIA A100 (~3.7 h training run)
Val loss	3.2844 (final validation loss on globally shuffled splits)
Data cleaning	CHILDES speaker tags, bracket annotations, Wikipedia headers, subtitle formatting, HTML tags filtered
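
The schedule in the table can be sketched as a function of the step index. The card gives the warmup length, total steps, and peak learning rates; the linear warmup shape and the decay floor of zero are assumptions, and `lr_at` is a hypothetical helper rather than part of the training code.

```python
import math

def lr_at(step, peak_lr, total_steps=7_247, warmup_steps=500, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr.

    Assumptions: the warmup shape and min_lr are not stated in the card.
    """
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Both parameter groups share the schedule shape and differ only in peak LR.
for step in (0, 499, 500, 3_873, 7_247):
    print(step, f"muon={lr_at(step, 0.01):.5f}", f"adamw={lr_at(step, 0.0005):.6f}")
```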
