Matrix SSM 28.3M - BabyLM 2026 Strict
A ~28.3M-parameter Matrix SSM (linear-time, attention-free) trained on the BabyLM 2026 Strict 100M-word corpus for 10 epochs.
Model Details
- Architecture: Single-Scale Matrix SSM (20 layers; fine-scale streams with depthwise causal 1D conv; SwiGLU MLP; QK-Norm; no attention, no KV cache)
- Parameters: 28,339,320
- Layers: 20
- Hidden dim: 384
- Streams: 6, stream dim 32, 6 read heads (state capacity: 6,144 floats/layer)
- Vocab: 10,240 BPE (trained on BabyLM corpus)
- Context: 512 tokens
- Training: 10 epochs, 1.9B tokens (67.0× params), cosine decay schedule, MuonAdamW optimizer
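The derived figures in this list can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers quoted in this card (batch size and step count come from the Training Details table). It assumes each stream holds one `stream_dim × stream_dim` matrix state, which is an interpretation of the state-capacity figure rather than something read from the model code.

```python
# Figures copied from this model card.
n_params = 28_339_320
n_streams = 6
stream_dim = 32
context_len = 512      # tokens per sequence
batch_size = 512       # sequences per optimizer step
total_steps = 7_247

# Assumption: one stream_dim x stream_dim matrix state per stream.
state_floats_per_layer = n_streams * stream_dim * stream_dim

tokens_per_step = batch_size * context_len
total_tokens = total_steps * tokens_per_step
tokens_per_param = total_tokens / n_params

print(f"state capacity: {state_floats_per_layer:,} floats/layer")
print(f"tokens/step:    {tokens_per_step:,}")
print(f"total tokens:   {total_tokens / 1e9:.2f}B")
print(f"tokens/param:   {tokens_per_param:.1f}x")
```

Under that assumption the numbers reproduce the 6,144 floats/layer, roughly 262k tokens per step, 1.9B total tokens, and the 67.0× token-to-parameter ratio quoted above.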
Comparison of Model Versions
The architecture has evolved from a Multi-Scale baseline (v8) to a deep Single-Scale vector model (v9), and finally to a Matrix SSM with vocabulary budget optimization (v13):
| Metric / Parameter | Multi-Scale v8 | Single-Scale v9 | Matrix SSM v13 (current best) |
|---|---|---|---|
| Architecture | Vector streams + stride-4 Coarse | Single-Scale Vector SSM | Matrix SSM (Quadratic associative state) |
| Parameters | 27,544,336 | 28,662,144 | 28,339,320 |
| Layers | 8 | 16 | 20 (+25% depth) |
| State Capacity | 4,608 floats / layer | 4,608 floats / layer | 6,144 floats / layer (+33% capacity) |
| Vocab Size | 16,384 | 16,384 | 10,240 (frees embedding parameters) |
| LR Schedule | WSD (Cosine tail) | WSD (Cosine tail) | Pure Cosine Decay |
| QK-Norm | No | No | Yes (Stable projections) |
| Val Loss (Shuffled) | 3.4352 | 3.3719 | 3.2844 |
| Val Perplexity | 31.04 | 29.13 | 26.69 (8.4% lower than v9) |
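The perplexity row follows from the validation-loss row, assuming the loss is mean cross-entropy in nats per token (so perplexity is `exp(loss)`). A short check using the values from the table:

```python
import math

# Validation losses copied from the table above (assumed to be nats per token).
val_loss = {
    "Multi-Scale v8": 3.4352,
    "Single-Scale v9": 3.3719,
    "Matrix SSM v13": 3.2844,
}

# Perplexity is the exponential of the mean cross-entropy loss.
perplexity = {name: math.exp(loss) for name, loss in val_loss.items()}
for name, ppl in perplexity.items():
    print(f"{name}: {ppl:.2f}")

# Relative perplexity reduction of v13 compared with v9.
rel_drop = 1.0 - perplexity["Matrix SSM v13"] / perplexity["Single-Scale v9"]
print(f"v13 vs v9: {rel_drop:.1%} lower")
```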
How to Use
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ecreeth/matrix-ssm-28m-babylm")
# The model uses custom architecture code from the repo, hence trust_remote_code=True.
model = AutoModelForCausalLM.from_pretrained(
    "ecreeth/matrix-ssm-28m-babylm",
    trust_remote_code=True,
)

# Generate up to 50 new tokens from a short prompt.
inputs = tokenizer("The cat sat on the", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0]))
```
BabyLM Challenge
- Track: Strict (100M words, 10 epochs max)
- Eval repo: babylm-eval
- Leaderboard: BabyLM-Leaderboard-2026
Results (zero-shot, causal, temperature 1.0)
| Task | Score (Vanilla Baseline) | Score (Multi-Scale v6) | Score (Single-Scale v9) | Score (Matrix SSM v13) |
|---|---|---|---|---|
| BLiMP | 62.87 | 66.78 | 69.99 | 69.72 |
| EWoK (supplement) | 49.64 | 54.33 | 51.54 | 54.46 |
| VQA (EWoK) | 52.76 | 52.35 | 53.33 | 52.86 |
| Entity Tracking | 17.90 | 17.41 | 17.45 | 39.75 |
| Comps | 52.40 | 52.45 | 53.84 | 54.67 |
| Reading (eye tracking) | 0.93 | 1.52 | 0.41 | 0.21 |
| Reading (self-paced) | 0.14 | 0.02 | 0.53 | 0.00 |
Results (fine-tuning, GLUE)
| Task | Metric | Score (Vanilla Baseline) | Score (Multi-Scale v6) | Score (Single-Scale v9) | Score (Matrix SSM v13) |
|---|---|---|---|---|---|
| BOOLQ | accuracy | 63.8 | 64.59 | 64.46 | 64.22 |
| MULTIRC | accuracy | 58.5 | 56.93 | Pending / TBD | 57.10 |
| RTE | accuracy | 61.2 | 59.71 | Pending / TBD | 51.08 |
| WSC | accuracy | 63.5 | 63.46 | Pending / TBD | 57.69 |
| MRPC | f1 | 69.6 | 82.25 | Pending / TBD | 81.55 |
| QQP | f1 | 69.6 | 54.98 | Pending / TBD | 55.71 |
| MNLI | accuracy | 43.6 | 44.68 | Pending / TBD | 45.29 (best) |
Architecture
The Matrix SSM replaces self-attention with a linear-time associative state recurrence:
- Input tokens are processed by a depthwise causal 1D convolution (kernel_size=4) to inject local temporal context.
- The model maintains a matrix state $S_t \in \mathbb{R}^{H \times d_k \times d_v}$ that stores key-value associations as one $d_k \times d_v$ matrix per head, which is equivalent to linear attention with data-dependent decay $\alpha_t$.
- Queries and keys are stabilized with parameter-free QK-Normalization (RMSNorm) before they read from the associative memory.
- A SwiGLU MLP feedforward layer processes the mixed representations.
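As an illustration of the first bullet, here is a minimal pure-Python sketch of a depthwise causal 1D convolution with `kernel_size=4`. The function name `depthwise_causal_conv1d` and the kernel layout (last tap weights the current step) are assumptions made for this example and are not taken from the model's source code.

```python
def depthwise_causal_conv1d(x, kernels):
    """Depthwise causal convolution over time.

    x:       list of T time steps, each a list of C channel values.
    kernels: list of C per-channel kernels, each of length K;
             kernels[c][-1] weights the current step, kernels[c][0] the oldest.
    Output step t depends only on input steps t-K+1 .. t (never on the future).
    """
    T, C = len(x), len(x[0])
    K = len(kernels[0])
    out = []
    for t in range(T):
        row = []
        for c in range(C):
            acc = 0.0
            for j in range(K):
                src = t - (K - 1) + j      # input step seen by tap j
                if src >= 0:               # implicit left zero-padding
                    acc += kernels[c][j] * x[src][c]
            row.append(acc)
        out.append(row)
    return out

# Two channels: channel 0 averages the last 4 steps, channel 1 copies the input.
x = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0], [5.0, 50.0]]
kernels = [[0.25, 0.25, 0.25, 0.25], [0.0, 0.0, 0.0, 1.0]]
y = depthwise_causal_conv1d(x, kernels)
print([row[0] for row in y])
print([row[1] for row in y])
```

Each channel has its own kernel and channels are never mixed, which is what makes the convolution depthwise. Changing a later input step leaves all earlier outputs untouched, which is what makes it causal.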
The recurrence gives training cost that is linear in sequence length, $O(n)$, and constant-time $O(1)$ per-token decoding with no KV cache via model.step().
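The equivalence between the matrix-state recurrence and decayed linear attention can be checked numerically. The sketch below is a single-head toy in pure Python, not the model's implementation: it assumes a scalar decay $\alpha_t$ per step, the update $S_t = \alpha_t S_{t-1} + k_t v_t^\top$, and the readout $y_t = S_t^\top q_t$. The standalone `step` function here is hypothetical and only mirrors the idea behind `model.step()`.

```python
import math
import random

def rms_norm(x, eps=1e-6):
    """Parameter-free RMSNorm: scale x to unit root-mean-square."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def step(state, q, k, v, alpha):
    """One decoding step for one head; cost depends on d_k * d_v, not on position.

    state is a d_k x d_v matrix (list of rows). Returns (new_state, output).
    """
    q, k = rms_norm(q), rms_norm(k)
    # S_t = alpha_t * S_{t-1} + k_t v_t^T  (decay old associations, write a new one)
    new_state = [[alpha * s_ij + k_i * v_j for s_ij, v_j in zip(row, v)]
                 for row, k_i in zip(state, k)]
    # y_t = S_t^T q_t  (read the memory with the query)
    y = [sum(q_i * row[j] for q_i, row in zip(q, new_state))
         for j in range(len(v))]
    return new_state, y

def decayed_linear_attention(qs, ks, vs, alphas):
    """Reference: y_t = sum_{s<=t} (alpha_{s+1} * ... * alpha_t) (q_t . k_s) v_s."""
    outputs = []
    for t in range(len(qs)):
        q_t = rms_norm(qs[t])
        y = [0.0] * len(vs[0])
        for s in range(t + 1):
            decay = math.prod(alphas[s + 1:t + 1])  # empty product is 1 when s == t
            score = decay * sum(a * b for a, b in zip(q_t, rms_norm(ks[s])))
            y = [y_j + score * v_j for y_j, v_j in zip(y, vs[s])]
        outputs.append(y)
    return outputs

# Tiny random example: 6 steps, key dim 4, value dim 3.
random.seed(0)
T, d_k, d_v = 6, 4, 3
qs = [[random.gauss(0, 1) for _ in range(d_k)] for _ in range(T)]
ks = [[random.gauss(0, 1) for _ in range(d_k)] for _ in range(T)]
vs = [[random.gauss(0, 1) for _ in range(d_v)] for _ in range(T)]
alphas = [random.uniform(0.5, 1.0) for _ in range(T)]

state = [[0.0] * d_v for _ in range(d_k)]   # S_0 = 0
recurrent = []
for q, k, v, a in zip(qs, ks, vs, alphas):
    state, y = step(state, q, k, v, a)
    recurrent.append(y)

reference = decayed_linear_attention(qs, ks, vs, alphas)
max_diff = max(abs(a - b)
               for r, f in zip(recurrent, reference)
               for a, b in zip(r, f))
print(f"max |recurrent - reference| = {max_diff:.2e}")
```

The recurrent path touches only the fixed-size state at each step, so decoding needs no cache of past keys and values. The reference path recomputes every past interaction and is quadratic in sequence length.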
Training Details
| Parameter | Value |
|---|---|
| Optimizer | MuonAdamW (Muon for 2D weights, AdamW for embeddings/biases) |
| LR schedule | Cosine decay (500 warmup steps, then cosine cooldown) |
| Epochs | 10 (the maximum allowed) |
| Peak LR | 0.0100 (Muon), 0.0005 (AdamW) |
| Weight decay | 0.1 |
| Batch size | 512 sequences × 1 accumulation step = 512 effective, × 512 tokens each (~262k tokens/step) |
| Total steps | 7,247 |
| GPU | NVIDIA A100 GPU (~3.7h training run) |
| Val loss | 3.2844 (final validation loss on globally shuffled splits) |
| Data cleaning | CHILDES speaker tags, bracket annotations, Wikipedia headers, subtitle formatting, HTML tags filtered |
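The learning-rate schedule in the table can be sketched as a small function. The warmup length, step count, and peak rates come from the table. The linear warmup shape and the final learning rate of zero are assumptions, since the card does not state them, and `lr_at` is a hypothetical helper name.

```python
import math

def lr_at(step, peak_lr, warmup_steps=500, total_steps=7_247, min_lr=0.0):
    """Learning rate at a given optimizer step.

    Assumptions (not stated in the card): linear warmup, decay floor min_lr = 0.
    """
    if step < warmup_steps:
        # Linear ramp that reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the cosine phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Peak rates from the table: 0.01 for the Muon group, 0.0005 for the AdamW group.
for step in (0, 499, 500, 3_000, 7_247):
    print(step, f"muon={lr_at(step, 0.01):.6f}", f"adamw={lr_at(step, 0.0005):.6f}")
```

Both parameter groups would follow the same curve shape and differ only in their peak value.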