---
license: apache-2.0
language:
- en
library_name: mlx
pipeline_tag: text-generation
tags:
- rodan
- tiny-language-model
- mlx
- apple-silicon
- byte-bpe
---
# Rodan-10M
A ~11M-parameter language model trained start to finish on one Apple M2 with MLX. The aim was a tiny model
that actually holds up for its size, judged by benchmark score per parameter rather than raw leaderboard rank.
| Model | Stage | Purpose |
|---|---|---|
| **Rodan-10M-Base** | pretraining | foundation: commonsense + knowledge |
| Rodan-10M-Chat *(released)* | instruction fold | chat / instruction following |
| Rodan-10M-Reasoning *(released)* | recursive depth + CoT fold + DPO | verifiable math + reasoning |
This card covers the base model only. The chat and reasoning stages are separate models with their own
repos and cards.
## Architecture
Decoder-only transformer, wide per layer (the proportions take a cue from Gemma-style edge models), 11.46M params.
```
vocab_size 8192 byte-level BPE
dim 320
n_layers 8
n_heads 8 head_dim 40
n_kv_heads 1 MQA (8 query heads share 1 KV head)
ffn_hidden 768 SwiGLU
max_seq_len 512
norm RMSNorm (eps 1e-5)
position RoPE (base 200000), applied after QK-norm
tied_embeddings true
value_residual true mix layer-0 values into later layers
ple_rank 16 factorized per-layer value-embeddings
lrm true learnable per-row/col weight multipliers (Falcon LRM)
recurse 1 re-run the shared block stack N times (1 = base; >1 used by the reasoning stage)
```
The `recurse` knob is a recursive-depth mechanism (Universal-Transformer-style weight sharing, inspired by
the TRM/HRM "tiny recursive reasoning" line). Setting `recurse=N` runs the same 8 blocks N times over the
residual stream, so you get the effective depth of `8·N` layers at **zero extra parameters**. The base runs
`recurse=1` (it's a plain 8-layer model). The reasoning stage warm-starts these weights and trains at
`recurse=2` (16 effective layers, still 10.41M params), letting the model spend more compute per token on
hard problems without growing. It is not the full TRM/HRM algorithm (no separate answer/latent states, no
deep supervision); it's the shared-recursion idea applied to an autoregressive LM.
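
The shared-recursion idea can be pictured with a toy loop. This is a minimal sketch rather than the model's code: `make_toy_block` and `run_stack` are hypothetical names, and each "block" is a stand-in residual update instead of a real attention/FFN layer. It only shows that `recurse=2` applies the same 8 blocks twice, so depth doubles while the set of blocks (and so the parameter count) stays fixed.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Block = Callable[[Vector], Vector]


def make_toy_block(scale: float) -> Block:
    """Hypothetical stand-in for one transformer block: a residual update x + f(x)."""
    def block(x: Vector) -> Vector:
        return [v + scale * v for v in x]
    return block


def run_stack(x: Vector, blocks: Sequence[Block], recurse: int = 1) -> Vector:
    """Run the same list of blocks `recurse` times over the residual stream."""
    for _ in range(recurse):
        for block in blocks:
            x = block(x)
    return x


blocks = [make_toy_block(0.01) for _ in range(8)]  # 8 blocks, as in the config
x0 = [1.0, -2.0, 0.5]

base = run_stack(x0, blocks, recurse=1)    # 8 block applications
deeper = run_stack(x0, blocks, recurse=2)  # 16 applications of the same 8 blocks

n_blocks = len(blocks)                     # unchanged by `recurse`
effective_depth = n_blocks * 2
```

Running the second pass on the output of the first gives the same result as `recurse=2`, which is all the knob does in this sketch.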
The base was built in two passes: a from-scratch run on 262M tokens, then a warm-start continuation on another
115M tokens that adds LRM, raises the RoPE base from 10k to 200k, and mixes in 21% arithmetic/reasoning data
(Falcon's reasoning-in-pretraining idea). That second pass is the 11.46M v6 checkpoint.
Pre-norm residual blocks: `x += Attn(RMSNorm(x))`, then `x += SwiGLU(RMSNorm(x))`. Layer-0's attention
values feed the value-residual mix in every later layer, and each layer also adds its own low-rank value-PLE.
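
The block structure above can be sketched in plain Python. The RMSNorm and the two residual updates follow the formulas in the text; the sub-layers are placeholders, and `mix_values` is only an assumed form of the value-residual mix (a convex blend with layer-0 values), since the card does not give the exact formula.

```python
import math
from typing import Callable, List

Vector = List[float]
SubLayer = Callable[[Vector], Vector]


def rms_norm(x: Vector, weight: Vector, eps: float = 1e-5) -> Vector:
    """RMSNorm: divide by the root-mean-square of x, then apply a learned gain."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [w * v * inv_rms for w, v in zip(weight, x)]


def pre_norm_block(x: Vector, attn: SubLayer, ffn: SubLayer,
                   attn_gain: Vector, ffn_gain: Vector) -> Vector:
    """x += Attn(RMSNorm(x)), then x += SwiGLU(RMSNorm(x))."""
    x = [a + b for a, b in zip(x, attn(rms_norm(x, attn_gain)))]
    x = [a + b for a, b in zip(x, ffn(rms_norm(x, ffn_gain)))]
    return x


def mix_values(v_layer: Vector, v_layer0: Vector, lam: float) -> Vector:
    """ASSUMED mix: `lam` is the share taken from layer-0 values."""
    return [(1.0 - lam) * a + lam * b for a, b in zip(v_layer, v_layer0)]


def halve(h: Vector) -> Vector:
    """Placeholder sub-layer standing in for attention or SwiGLU."""
    return [0.5 * v for v in h]


gain = [1.0, 1.0]
normed = rms_norm([3.0, 4.0], gain)
out = pre_norm_block([3.0, 4.0], halve, halve, gain, gain)
mixed = mix_values([1.0, 1.0], [0.0, 2.0], lam=0.9)  # mostly layer-0 values
```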
Why these specific choices at 11M, where every parameter has to earn its place:
- 8k vocab with tied embeddings. Only about 23% of the params sit in the embedding table, versus roughly
70% for a 49k-vocab model this size. That frees most of the budget for the layers that do the computing.
- MQA, because it's the cheapest attention that still works, which leaves params for depth and embeddings.
- value-residual does most of the heavy lifting. A checkpoint probe shows later layers blending 77-99% of
layer-0's values, so it acts as a shared value memory and a gradient highway at once.
- LRM (learnable row/col multipliers) probed about 20% off identity, so the model is genuinely using it.
- QK-norm for attention stability, from the nanoGPT-speedrun stack.
- value-PLE we tried and then removed. The probe found it dead: 0.2% contribution, weight-decayed to near
zero. v9 drops it and lands at 10.41M with no loss in quality.
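
The parameter budget behind these choices can be checked with a few lines of arithmetic. The sketch below counts only the large matrices from the config (no biases, and the small norm, QK-norm, and LRM vectors are left out). The PLE layout, one `vocab × rank` table per layer, is an assumption; it is used here only because it lands close to the 1.05M the text says was cut. The 49,152-entry vocabulary is likewise an assumed stand-in for "49k".

```python
# Config values from the architecture table.
VOCAB, DIM, N_LAYERS = 8192, 320, 8
N_HEADS, N_KV_HEADS, HEAD_DIM = 8, 1, 40
FFN_HIDDEN, PLE_RANK = 768, 16

embedding = VOCAB * DIM                      # tied, so counted once
q_proj = DIM * N_HEADS * HEAD_DIM
kv_proj = 2 * DIM * N_KV_HEADS * HEAD_DIM    # MQA: a single K and V head
o_proj = N_HEADS * HEAD_DIM * DIM
attn = q_proj + kv_proj + o_proj
ffn = 3 * DIM * FFN_HIDDEN                   # SwiGLU: gate, up, down
layers = N_LAYERS * (attn + ffn)

core = embedding + layers                    # large matrices only

# ASSUMPTION: one (vocab x rank) value-embedding table per layer.
ple_tables = N_LAYERS * VOCAB * PLE_RANK

# Share of the reported 11.46M total that sits in the embedding table.
embedding_share = embedding / 11.46e6

# Same layers with an assumed 49,152-entry vocabulary, for comparison.
big_vocab_embedding = 49152 * DIM
big_vocab_share = big_vocab_embedding / (big_vocab_embedding + layers)

print(f"core matrices:   {core:,}")
print(f"assumed PLE:     {ple_tables:,}")
print(f"embedding share: {embedding_share:.1%} (8k) vs {big_vocab_share:.1%} (49k)")
```

Under these assumptions the large matrices come to about 10.36M, which leaves only a small remainder for norms and multipliers in the 10.41M PLE-free count.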
## Training
- Optimizer: Muon on the 2D hidden weights, AdamW on the embeddings, norms, and LRM multipliers, joined
through MultiOptimizer, cosine LR, grad-clip 1.0.
- Framework: MLX on Apple Silicon, with an `mx.compile`d step. About 0.6-0.7 it/s on one fanless M2 MacBook Air.
- Data: a warm-start chain of short stages, fresh tokens each time so nothing gets re-looped and memorized.
Here are the base (v6) and the challenger that followed it (v9):
| Source | v6 base (mixed5) | v9 (mixed8) | Content |
|---|---|---|---|
| Cosmopedia v2 | 27% | 31% | synthetic textbooks → commonsense |
| dolmino-mix-1124 (pes2o + StackExchange) | 35% | 26% | academic papers + Q&A → knowledge/ARC |
| synthetic arithmetic (ArithMark-style) | 21% | 19% | computation → ArithMark |
| FineMath-4plus | 10% | 15% | math prose |
| science-QA (SciQ/OBQA/QASC/ARC-train) | 6% | 9% | science MC |
| **tokens** | ~0.38B | +0.12B fresh | curated, no raw web |
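
As a rough illustration of the optimizer split listed at the top of this section, the routing can be written as a rule over parameter names and shapes, alongside a cosine learning-rate schedule. This is a sketch only: the parameter names, the `"embed"` naming convention, and the peak learning rate are hypothetical, and the real run wires actual MLX optimizers together rather than returning strings.

```python
import math
from typing import Dict, Tuple


def pick_optimizer(name: str, shape: Tuple[int, ...]) -> str:
    """2D hidden weights go to Muon; embeddings, norms, and multipliers go to AdamW."""
    is_matrix = len(shape) == 2
    is_embedding = "embed" in name  # hypothetical naming convention
    return "muon" if is_matrix and not is_embedding else "adamw"


def cosine_lr(step: int, total_steps: int, peak_lr: float, min_lr: float = 0.0) -> float:
    """Cosine decay from peak_lr at step 0 to min_lr at total_steps."""
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Hypothetical parameter names with shapes taken from the config.
params: Dict[str, Tuple[int, ...]] = {
    "tok_embed.weight": (8192, 320),
    "layers.0.attn.wq.weight": (320, 320),
    "layers.0.attn.wq.row_mult": (320,),
    "layers.0.attn_norm.weight": (320,),
    "layers.0.ffn.w_gate.weight": (768, 320),
}
groups = {name: pick_optimizer(name, shape) for name, shape in params.items()}

lr_start = cosine_lr(0, 100, peak_lr=1e-3)
lr_mid = cosine_lr(50, 100, peak_lr=1e-3)
lr_end = cosine_lr(100, 100, peak_lr=1e-3)
```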
Two things we found out the hard way. First, adding FineWeb-Edu (45%, then 25%) lost to v6 both times, in
a clean monotonic line: raw web hurts at 11M. The model is too small to digest it, and the curated
synthetic-plus-academic mix wins instead. Second, the probe that killed value-PLE also confirmed
value-residual and LRM are doing real work. So v9 is the pure-curated, PLE-free version at 10.41M: it
drops both of the things we'd shown were dead weight and keeps the recipe that worked.
Figure: training-compute efficiency from the actual runs (perplexity vs cumulative FLOPs, estimated as `6·N·tokens`).

Figure: intelligence per parameter (board avg vs log-params; the shaded region is above the size-fit line).

The fit runs over the board models, with a residual σ of about 3.07 that matches the board's own. Rodan v6
sits roughly +0.3σ above the size-fit line, so it is above trend per parameter and ahead of Liodon and the
other similar-size models that fall below the line. It does this on roughly 1/65th of the tokens used by the
leading models, which train on about 25B.
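
The compute and token figures quoted here follow from simple arithmetic. The sketch below applies the `6·N·tokens` estimate to the two passes described earlier, using the final 11.46M count for both passes as a simplifying assumption (the first pass ran before LRM was added, so its true count was slightly lower).

```python
N_PARAMS = 11.46e6      # v6 parameter count
PASS1_TOKENS = 262e6    # from-scratch pass
PASS2_TOKENS = 115e6    # warm-start continuation
LEADER_TOKENS = 25e9    # token budget quoted for the leading board models


def train_flops(n_params: float, n_tokens: float) -> float:
    """Standard training-compute estimate: 6 * N * tokens."""
    return 6.0 * n_params * n_tokens


total_tokens = PASS1_TOKENS + PASS2_TOKENS
flops = train_flops(N_PARAMS, total_tokens)
token_ratio = LEADER_TOKENS / total_tokens

print(f"tokens seen:    {total_tokens / 1e9:.2f}B")
print(f"training FLOPs: {flops:.2e}")
print(f"leaders use about {token_ratio:.0f}x more tokens")
```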
Figure: training loss and data mix, v6 vs v9.

v9 starts from v6, drops the dead PLE to reach 10.41M, and trains on the pure-curated mix. The result was a
tie: board avg 35.70 against v6's 35.80, a 0.10 gap that's well inside the noise, at 9% fewer parameters. It
gave up about 1.7 points of HellaSwag and picked up 2.0 on ArithMark (28.4, the folded arithmetic finally
showing), and the per-param number came out about even too (~+0.32σ vs v6's +0.31σ). Two conclusions fall
out of that. First, PLE really was dead weight, since cutting 1.05M params changed nothing. Second, across the
variants we ran, the board avg never rose above about 35.8: raw web lowered it and the leaner pure-curated mix
only matched v6, so none of them beat the base, and v6 stays the packaged checkpoint. Unique tokens stay
around 0.5B the whole way, a small fraction of what the leading models use, so there is likely more to gain
from additional curated tokens.
## Evaluation
Zero-shot through lm-eval-harness, with a custom MLX backend for `loglikelihood`. We use acc_norm for the
length-sensitive multiple-choice tasks (HellaSwag, ARC, OpenBookQA) and plain acc otherwise.
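
The difference between the two metrics is easy to see on a toy example. This sketch assumes acc_norm divides each choice's summed log-likelihood by the length of its text (measured in UTF-8 bytes here) before taking the argmax; the log-likelihood values and choices below are made up for illustration.

```python
from typing import List


def pick_acc(loglikelihoods: List[float]) -> int:
    """acc: choose the continuation with the highest total log-likelihood."""
    return max(range(len(loglikelihoods)), key=lambda i: loglikelihoods[i])


def pick_acc_norm(loglikelihoods: List[float], choices: List[str]) -> int:
    """acc_norm: same argmax after dividing by continuation length in bytes."""
    return max(
        range(len(choices)),
        key=lambda i: loglikelihoods[i] / len(choices[i].encode("utf-8")),
    )


# Made-up scores: the long answer has a lower total but a better per-byte score.
choices = ["no", "the cat sat on the mat"]
lls = [-12.0, -15.0]

raw_pick = pick_acc(lls)
norm_pick = pick_acc_norm(lls, choices)
```

Without normalization the short answer wins simply because it has fewer tokens to pay for, which is why the length-sensitive tasks use acc_norm.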
"The board" throughout is the [Open SLM Leaderboard](https://huggingface.co/spaces/AxiomicLabs/Open_SLM_Leaderboard)
(AxiomicLabs, sub-150M tier). Zero-shot, limit 1000 examples per task.
Board avg = (HellaSwag + (ARC-E + ARC-C)/2 + PIQA + ArithMark) / 4.
| Task | Metric | Score | Random |
|---|---|---|---|
| SciQ | acc | 67.5 | 25 |
| PIQA | acc | 56.0 | 50 |
| COPA | acc | 55.0 | 50 |
| ARC-Easy | acc_norm | 35.6 | 25 |
| HellaSwag | acc_norm | 31.8 | 25 |
| OpenBookQA | acc_norm | 27.0 | 25 |
| ArithMark-2 | acc | 26.4 | 25 |
| ARC-Challenge | acc_norm | 22.4 | 25 |
| Winogrande | acc | 49.8 | 50 |
| LogicMark | acc | 44.8 | 25 |
| BoolQ | acc | 37.6 | ~50 |
| CommonsenseQA | acc | 20.7 | 20 |
| **Board avg (÷4)** | | **35.80** | |
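
The board average in the last row can be reproduced from the table with the formula given above. A minimal sketch, with `board_avg` as an illustrative helper name:

```python
def board_avg(hellaswag: float, arc_easy: float, arc_challenge: float,
              piqa: float, arithmark: float) -> float:
    """(HellaSwag + mean(ARC-E, ARC-C) + PIQA + ArithMark) / 4."""
    arc = (arc_easy + arc_challenge) / 2
    return (hellaswag + arc + piqa + arithmark) / 4


# Scores for v6 taken from the table above.
v6_avg = board_avg(hellaswag=31.8, arc_easy=35.6, arc_challenge=22.4,
                   piqa=56.0, arithmark=26.4)
print(f"{v6_avg:.2f}")
```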
For context: at 11.46M, Rodan is just over the 10M line, but it outscores the sub-10M leader (Liodon) on about
1/65th the tokens:
| Model | Params | Tokens | Board avg (÷4) |
|---|---|---|---|
| **Rodan-10M-Base (v6)** | 11.46M | ~0.38B | **35.80** |
| Liodon SLM-10M | 10M | 25B | 35.09 |
| GPT-S-5M (Axiomic) | 5.2M | 25B | 34.75 |

v6 sits above the size-fit line (~+0.3σ), so it is above trend per parameter and ahead of Liodon. The v9 challenger
(PLE-free, 10.41M, pure-curated) tied it: 35.70 board avg at 9% fewer params, about even on per-param too.
v9 confirmed that PLE was dead weight, but since it didn't beat v6's board score, v6 stays the base. From
here the work moved to the capability stages (chat, reasoning).
What the model is actually like: it's solid for 11M on commonsense and science multiple-choice. SciQ
(67.5), PIQA (56.0), ARC-Easy (35.6), HellaSwag (31.8), and COPA (55.0) are all clearly above random.
Arithmetic has crept off the random floor (ArithMark 26.4) thanks to the folded-in computation data, though
it's a modest lift and actually generating arithmetic is still weak. On the harder abstract reasoning tasks
(Winogrande, CommonsenseQA, ARC-Challenge, OpenBookQA) and on open-ended generation it's near chance,
partly because of the limited capacity at this size and partly because of loglikelihood length bias. It's a
solid base for discriminative (multiple-choice) tasks; the deeper reasoning is the job of the separate Chat
and Reasoning models.
## Limitations
- English only, ~11M params. This is a research and teaching base, not something to put in front of users or
trust for facts.
- It's reliable only on the easy commonsense and science multiple-choice tasks where it beats random. On
abstract reasoning (Winogrande, CommonsenseQA, ARC-Challenge) it's at chance, and on arithmetic it's barely above it.
- No instruction tuning or safety alignment yet. It completes text; it does not follow instructions.
- Trained on about one epoch of a curated mix, so coverage of rare facts is thin compared to models trained
on far more tokens.
## Files
A standard model repo: `model.safetensors` (weights), `tokenizer.json` (8k byte-level BPE), `config.json`.
Trained on a single Apple M2 with MLX in about six hours.
## License
Weights are open (Apache-2.0, per the card metadata). Data falls under the respective dataset licenses
(Cosmopedia, dolmino-mix ODC-By, AllenAI QA sets, FineMath).