---
license: apache-2.0
language:
- en
library_name: mlx
pipeline_tag: text-generation
tags:
- rodan
- tiny-language-model
- mlx
- apple-silicon
- byte-bpe
---

# Rodan-10M

A ~11M-parameter language model trained start to finish on one Apple M2 with MLX. The aim was a tiny model
that actually holds up for its size, scored on benchmark performance per parameter rather than raw leaderboard rank.

| Model | Stage | Purpose |
|---|---|---|
| **Rodan-10M-Base** | pretraining | foundation: commonsense + knowledge |
| Rodan-10M-Chat *(released)* | instruction fold | chat / instruction following |
| Rodan-10M-Reasoning *(released)* | recursive depth + CoT fold + DPO | verifiable math + reasoning |

This card covers the base model only. The chat and reasoning stages are separate models with their own
repos and cards.

## Architecture

Decoder-only transformer, wide per layer (the proportions take a cue from Gemma-style edge models), 11.46M params.

```
vocab_size      8192      byte-level BPE
dim             320
n_layers        8
n_heads         8         head_dim 40
n_kv_heads      1         MQA (8 query heads share 1 KV head)
ffn_hidden      768       SwiGLU
max_seq_len     512
norm            RMSNorm (eps 1e-5)
position        RoPE (base 200000), applied after QK-norm
tied_embeddings true
value_residual  true      mix layer-0 values into later layers
ple_rank        16        factorized per-layer value-embeddings
lrm             true      learnable per-row/col weight multipliers (Falcon LRM)
recurse         1         re-run the shared block stack N times (1 = base; >1 used by the reasoning stage)
```

The `recurse` knob is a recursive-depth mechanism (Universal-Transformer-style weight sharing, inspired by
the TRM/HRM "tiny recursive reasoning" line). Setting `recurse=N` runs the same 8 blocks N times over the
residual stream, so you get the effective depth of `8·N` layers at **zero extra parameters**. The base runs
`recurse=1` (it's a plain 8-layer model). The reasoning stage warm-starts these weights and trains at
`recurse=2` (16 effective layers, still 11.46M params), letting the model spend more compute per token on
hard problems without growing. It is not the full TRM/HRM algorithm (no separate answer/latent states, no
deep supervision); it's the shared-recursion idea applied to an autoregressive LM.
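
The shared-recursion loop described above can be sketched in a few lines. This is a minimal illustration, not the training code: `run_stack`, `make_toy_block`, and the counting wrapper are hypothetical names, and the toy block stands in for a real transformer block.

```python
from typing import Callable, Dict, List, Sequence

Vector = List[float]
Block = Callable[[Vector], Vector]


def run_stack(x: Vector, blocks: Sequence[Block], recurse: int = 1) -> Vector:
    """Apply the same list of blocks `recurse` times to the residual stream."""
    if recurse < 1:
        raise ValueError("recurse must be >= 1")
    for _ in range(recurse):      # outer loop: re-enter the shared stack
        for block in blocks:      # inner loop: the same 8 blocks, same weights
            x = block(x)
    return x


def make_toy_block(scale: float) -> Block:
    """Hypothetical stand-in for a transformer block: x + scale * x."""
    return lambda x: [v + scale * v for v in x]


blocks = [make_toy_block(0.01) for _ in range(8)]    # n_layers = 8
calls: Dict[str, int] = {"n": 0}


def counting(block: Block) -> Block:
    """Wrap a block so every application is counted."""
    def wrapped(x: Vector) -> Vector:
        calls["n"] += 1
        return block(x)
    return wrapped


out = run_stack([1.0, 2.0], [counting(b) for b in blocks], recurse=2)
effective_depth = calls["n"]      # block applications, i.e. 8 * recurse
print(len(blocks), effective_depth)
```

The list of blocks (and so the parameter count) is the same for any `recurse`; only the number of block applications grows.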

It was built in two passes: a from-scratch base on 262M tokens, then a warm-start continue on another
115M tokens that adds LRM, raises the RoPE base from 10k to 200k, and mixes in 21% arithmetic/reasoning data
(Falcon's reasoning-in-pretraining idea). That second pass is the 11.46M v6 checkpoint.
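
To see what raising the RoPE base does, the sketch below compares the rotation frequencies at base 10k and 200k for `head_dim = 40`. It assumes the standard RoPE frequency formula `base^(-2i/head_dim)`; the function names are illustrative and not taken from the training code.

```python
import math
from typing import List


def rope_inv_freqs(head_dim: int, base: float) -> List[float]:
    """Standard RoPE inverse frequencies, one per rotated pair of channels."""
    return [base ** (-(2 * i) / head_dim) for i in range(head_dim // 2)]


def longest_wavelength(head_dim: int, base: float) -> float:
    """Positions needed for the slowest-rotating pair to complete one full turn."""
    return 2 * math.pi / rope_inv_freqs(head_dim, base)[-1]


for base in (10_000, 200_000):
    print(f"base={base:>7}: longest wavelength ~ {longest_wavelength(40, base):,.0f} positions")
```

Under this formula the fastest pair is unchanged and every other pair rotates more slowly at the larger base.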

Pre-norm residual blocks: `x += Attn(RMSNorm(x))`, then `x += SwiGLU(RMSNorm(x))`. Layer-0's attention
values feed the value-residual mix in every later layer, and each layer also adds its own low-rank value-PLE.
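
The block structure above, sketched on plain Python lists. `rms_norm` and `pre_norm_block` follow the formulas in the text; the toy sub-layers are placeholders, and `mix_values` is an assumed convex-blend form of the value-residual mix, since the card does not spell out the exact mixing rule.

```python
import math
from typing import Callable, List

Vector = List[float]
SubLayer = Callable[[Vector], Vector]


def rms_norm(x: Vector, gain: Vector, eps: float = 1e-5) -> Vector:
    """RMSNorm: scale x by 1 / sqrt(mean(x^2) + eps), then by a per-channel gain."""
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v * inv_rms for v, g in zip(x, gain)]


def pre_norm_block(x: Vector, attn: SubLayer, ffn: SubLayer,
                   attn_gain: Vector, ffn_gain: Vector) -> Vector:
    """x += Attn(RMSNorm(x)), then x += SwiGLU(RMSNorm(x))."""
    x = [a + b for a, b in zip(x, attn(rms_norm(x, attn_gain)))]
    x = [a + b for a, b in zip(x, ffn(rms_norm(x, ffn_gain)))]
    return x


def mix_values(v_layer: Vector, v_layer0: Vector, alpha: float) -> Vector:
    """ASSUMED form of the value-residual mix: blend in layer-0's values."""
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(v_layer, v_layer0)]


# Toy sub-layers (hypothetical stand-ins, not real attention or SwiGLU).
toy_attn: SubLayer = lambda h: [0.5 * v for v in h]
toy_ffn: SubLayer = lambda h: [0.0 for _ in h]

ones = [1.0, 1.0]
normed = rms_norm([3.0, 4.0], ones, eps=0.0)
out = pre_norm_block([3.0, 4.0], toy_attn, toy_ffn, ones, ones)
blended = mix_values([0.0, 0.0], [1.0, 2.0], alpha=0.9)
print(normed, out, blended)
```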

Why these specific choices at 11M, where every parameter has to earn its place:

- 8k vocab with tied embeddings. Only about 23% of the params sit in the embedding table, versus roughly
  70% for a 49k-vocab model this size. That frees most of the budget for the layers that do the computing.
- MQA, because it's the cheapest attention that still works, which leaves params for depth and embeddings.
- value-residual does most of the heavy lifting. A checkpoint probe shows later layers blending 77-99% of
  layer-0's values, so it acts as a shared value memory and a gradient highway at once.
- LRM (learnable row/col multipliers) probed about 20% off identity, so the model is genuinely using it.
- QK-norm for attention stability, from the nanoGPT-speedrun stack.
- value-PLE we tried and then removed. The probe found it dead: 0.2% contribution, weight-decayed to near
  zero. v9 drops it and lands at 10.41M with no loss in quality.
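
The parameter budget behind these bullets can be checked with back-of-envelope arithmetic from the config. This is a rough sketch: it ignores norms, LRM multipliers, and any small projections, and it assumes value-PLE stores one `vocab_size × ple_rank` table per layer, which the card does not state explicitly.

```python
# Config values from the architecture table.
vocab_size, dim, n_layers = 8192, 320, 8
n_heads, n_kv_heads, ffn_hidden, ple_rank = 8, 1, 768, 16
head_dim = dim // n_heads                         # 40

embed = vocab_size * dim                          # tied with the output head, counted once
attn_mqa = 2 * dim * dim + 2 * dim * n_kv_heads * head_dim   # Wq, Wo + shared Wk, Wv
attn_mha = 2 * dim * dim + 2 * dim * n_heads * head_dim      # same, with 8 KV heads
ffn = 3 * dim * ffn_hidden                        # SwiGLU: gate, up, down projections
core = embed + n_layers * (attn_mqa + ffn)

# ASSUMPTION: value-PLE is one (vocab_size x ple_rank) table per layer.
ple = n_layers * vocab_size * ple_rank

reported_total = 11.46e6                          # total from the card
print(f"embedding share     : {embed / reported_total:.1%}")
print(f"core (no PLE)       : {core / 1e6:.2f}M")
print(f"PLE tables          : {ple / 1e6:.2f}M")
print(f"MQA saving per layer: {(attn_mha - attn_mqa) / 1e3:.1f}k params")
```

Under these assumptions the embedding table comes to about 23% of 11.46M and the PLE tables to about 1.05M, in line with the figures quoted in this card; the unmodeled pieces (norms, multipliers) account for the small remainder.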

## Training

- Optimizer: Muon on the 2D hidden weights, AdamW on the embeddings, norms, and LRM multipliers, joined
  through MultiOptimizer, cosine LR, grad-clip 1.0.
- Framework: MLX on Apple Silicon, with an `mx.compile`d step. About 0.6-0.7 it/s on one fanless M2 MacBook Air.
- Data: a warm-start chain of short stages, fresh tokens each time so nothing gets re-looped and memorized.
  Here are the base (v6) and the challenger that followed it (v9):

  | Source | v6 base (mixed5) | v9 (mixed8) | Content |
  |---|---|---|---|
  | Cosmopedia v2 | 27% | 31% | synthetic textbooks → commonsense |
  | dolmino-mix-1124 (pes2o + StackExchange) | 35% | 26% | academic papers + Q&A → knowledge/ARC |
  | synthetic arithmetic (ArithMark-style) | 21% | 19% | computation → ArithMark |
  | FineMath-4plus | 10% | 15% | math prose |
  | science-QA (SciQ/OBQA/QASC/ARC-train) | 6% | 9% | science MC |
  | **tokens** | ~0.38B | +0.12B fresh | curated, no raw web |

  Two things we found out the hard way. First, adding FineWeb-Edu (45%, then 25%) lost to v6 both times, in
  a clean monotonic line: raw web hurts at 11M. The model is too small to digest it, and the curated
  synthetic-plus-academic mix wins instead. Second, the probe that killed value-PLE also confirmed
  value-residual and LRM are doing real work. So v9 is the pure-curated, PLE-free version at 10.41M: it
  drops both of the things we'd shown were dead weight and keeps the recipe that worked.
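
The optimizer bullet mentions a cosine LR schedule and grad-clip 1.0. A minimal, framework-free sketch of both follows. It assumes the clip is by global L2 norm and that the schedule has no warmup; the peak learning rate in the usage lines is a hypothetical value, as the card does not give one.

```python
import math
from typing import List


def cosine_lr(step: int, total_steps: int, peak_lr: float, min_lr: float = 0.0) -> float:
    """Cosine decay from peak_lr at step 0 to min_lr at total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def clip_by_global_norm(grads: List[List[float]], max_norm: float = 1.0) -> List[List[float]]:
    """Rescale all gradients together so their combined L2 norm is at most max_norm."""
    total = math.sqrt(sum(g * g for tensor in grads for g in tensor))
    scale = min(1.0, max_norm / (total + 1e-6))
    return [[g * scale for g in tensor] for tensor in grads]


peak = 1e-3                                   # hypothetical peak LR, not from the card
schedule = [cosine_lr(s, 100, peak) for s in (0, 50, 100)]
clipped = clip_by_global_norm([[3.0], [4.0]])   # global norm 5 before clipping
untouched = clip_by_global_norm([[0.3], [0.4]])  # global norm 0.5, left as-is
print(schedule, clipped, untouched)
```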

Training-compute efficiency, from the actual runs (perplexity vs cumulative FLOPs, `6·N·tokens`):

![Perplexity vs Training Compute](flops_efficiency.png)
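
For reference, the `6·N·tokens` estimate works out as follows for the v6 base. This is a sketch using the parameter and token counts quoted in this card; taking `N` as the total parameter count (embeddings included) is an assumption.

```python
def train_flops(n_params: float, n_tokens: float) -> float:
    """Training compute via the 6 * N * tokens rule of thumb."""
    return 6.0 * n_params * n_tokens


rodan_v6 = train_flops(11.46e6, 0.38e9)    # v6 base: 11.46M params, ~0.38B tokens
token_ratio = 25e9 / 0.38e9                # ~25B-token board models vs. Rodan v6

print(f"Rodan v6 training compute: {rodan_v6:.2e} FLOPs")
print(f"token ratio: {token_ratio:.1f}x")
```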

Intelligence per parameter (board avg vs log-params; the shaded region is above the size-fit line):

![Intelligence per parameter](intelligence_per_param.png)

The fit runs over the board models, with a residual σ of about 3.07 that matches the board's own. Rodan v6
sits roughly +0.3σ above the size-fit line, above-trend per parameter, ahead of liodon and the other
similar-size models that fall below the line. It does this on roughly 1/65th the tokens of the leading
models, which train on about 25B.

Training loss and data mix, v6 vs v9:

![Training loss and data mix](loss_datamix.png)

v9 starts from v6, drops the dead PLE (bringing the model to 10.41M params), and trains on the pure-curated
mix. The result was a tie: board avg 35.70 against v6's 35.80, a 0.10 gap that's well inside the noise, at 9%
fewer parameters. It gave up about 1.7 points of HellaSwag and picked up 2.0 on ArithMark (28.4, the folded
arithmetic finally showing), and the per-param number came out about even too (~+0.32σ vs v6's +0.31σ). Two
conclusions fall out of that. First, PLE really was dead weight, since cutting 1.05M params changed nothing.
Second, across the variants we ran, the board avg never rose above v6's 35.80: raw web lowered it and the
leaner pure-curated mix only matched it, so none of them beat the base, and v6 stays the packaged checkpoint.
Unique tokens stay around 0.5B the whole way, a small fraction of what the leading models use, so there is
likely more to gain from additional curated tokens.

## Evaluation

Zero-shot through lm-eval-harness, with a custom MLX backend for `loglikelihood`. We use acc_norm for the
length-sensitive multiple-choice tasks (HellaSwag, ARC, OpenBookQA) and plain acc otherwise.
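
The difference between the two metrics is easiest to see on a toy example. The sketch below assumes acc_norm divides each choice's log-likelihood by the choice's length in characters before taking the argmax (check the harness source for the exact definition); the choices and scores are made up for illustration and are not model outputs.

```python
from typing import List, Tuple


def pick_choices(choices: List[str], loglikelihoods: List[float]) -> Tuple[int, int]:
    """Return (acc pick, acc_norm pick) as indices into `choices`."""
    idx = range(len(choices))
    acc_pick = max(idx, key=lambda i: loglikelihoods[i])
    # ASSUMPTION: acc_norm normalizes by the continuation's length in characters.
    acc_norm_pick = max(idx, key=lambda i: loglikelihoods[i] / len(choices[i]))
    return acc_pick, acc_norm_pick


# Hypothetical numbers, not model outputs.
choices = ["yes", "it will float on the water"]
loglikelihoods = [-4.0, -9.0]
acc_pick, acc_norm_pick = pick_choices(choices, loglikelihoods)
print(choices[acc_pick], "|", choices[acc_norm_pick])
```

Here the short answer wins on raw log-likelihood, while the longer one wins once scores are divided by length. That is why the normalized metric is used on tasks whose answer choices vary a lot in length.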

"The board" throughout is the [Open SLM Leaderboard](https://huggingface.co/spaces/AxiomicLabs/Open_SLM_Leaderboard)
(AxiomicLabs, sub-150M tier). Zero-shot, limit 1000 examples per task.
Board avg = (HellaSwag + (ARC-E + ARC-C)/2 + PIQA + ArithMark) / 4.

| Task | Metric | Score | Random |
|---|---|---|---|
| SciQ | acc | 67.5 | 25 |
| PIQA | acc | 56.0 | 50 |
| COPA | acc | 55.0 | 50 |
| ARC-Easy | acc_norm | 35.6 | 25 |
| HellaSwag | acc_norm | 31.8 | 25 |
| OpenBookQA | acc_norm | 27.0 | 25 |
| ArithMark-2 | acc | 26.4 | 25 |
| ARC-Challenge | acc_norm | 22.4 | 25 |
| Winogrande | acc | 49.8 | 50 |
| LogicMark | acc | 44.8 | 25 |
| BoolQ | acc | 37.6 | ~50 |
| CommonsenseQA | acc | 20.7 | 20 |
| **Board avg (÷4)** | | **35.80** | |
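
The board average in the last row can be reproduced from the table with the formula given above. In this small sketch, the `ArithMark` key corresponds to the ArithMark-2 row, and `board_avg` is an illustrative helper name.

```python
from typing import Dict

# v6 scores copied from the table above.
scores: Dict[str, float] = {
    "HellaSwag": 31.8,
    "ARC-Easy": 35.6,
    "ARC-Challenge": 22.4,
    "PIQA": 56.0,
    "ArithMark": 26.4,   # the ArithMark-2 row
}


def board_avg(s: Dict[str, float]) -> float:
    """(HellaSwag + mean(ARC-E, ARC-C) + PIQA + ArithMark) / 4."""
    arc = (s["ARC-Easy"] + s["ARC-Challenge"]) / 2
    return (s["HellaSwag"] + arc + s["PIQA"] + s["ArithMark"]) / 4


avg = board_avg(scores)
print(f"{avg:.2f}")   # prints 35.80
```

Note that the two ARC scores are averaged first, so ARC counts as one of the four components, not two.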

For context, at 11.46M it's just over the 10M line, but it outscores the sub-10M leader (liodon) on about
1/65th the tokens:

| Model | Params | Tokens | Board avg (÷4) |
|---|---|---|---|
| **Rodan-10M-Base (v6)** | 11.46M | ~0.38B | **35.80** |
| Liodon SLM-10M | 10M | 25B | 35.09 |
| GPT-S-5M (Axiomic) | 5.2M | 25B | 34.75 |

![v6 benchmarks](v6_v9_metrics.png)

v6 sits above the size-fit line (~+0.3σ), above-trend per parameter, ahead of liodon. The v9 challenger
(PLE-free, 10.41M, pure-curated) tied it: 35.70 board avg at 9% fewer params, about even on per-param too.
v9 confirmed that PLE was dead weight, but since it didn't beat v6's board score, v6 stays the base. From
here the work moved to the capability stages (chat, reasoning).

What the model is actually like: it's solid for 11M on commonsense and science multiple-choice. SciQ
(67.5), PIQA (56.0), ARC-Easy (35.6), HellaSwag (31.8), and COPA (55.0) are all clearly above random.
Arithmetic has crept off the random floor (ArithMark 26.4) thanks to the folded-in computation data, though
it's a modest lift and actually generating arithmetic is still weak. On the harder abstract
reasoning tasks (Winogrande, CommonsenseQA, ARC-Challenge, OpenBookQA) and on open-ended generation it's near
chance, partly the limited capacity at this size and partly loglikelihood length-bias. It's a solid base for
discrimination; the deeper reasoning is the job of the separate Chat and Reasoning models.

## Limitations

- English only, ~11M params. This is a research and teaching base, not something to put in front of users or
  trust for facts.
- It's reliable only on the easy commonsense and science multiple-choice where it beats random. On abstract
  reasoning (Winogrande, CommonsenseQA, ARC-Challenge) and arithmetic it's at or near chance.
- No instruction tuning or safety alignment yet. It completes text; it does not follow instructions.
- Trained on about one epoch of a curated mix, so coverage of rare facts is thin compared to models trained
  on far more tokens.

## Files

A standard model repo: `model.safetensors` (weights), `tokenizer.json` (8k byte-level BPE), `config.json`.
Trained on a single Apple M2 with MLX in about six hours.

## License

Weights are released under Apache-2.0. Data falls under the respective dataset licenses (Cosmopedia,
dolmino-mix ODC-By, AllenAI QA sets, FineMath).