	---
	license: apache-2.0
	language:
	- en
	library_name: mlx
	pipeline_tag: text-generation
	tags:
	- rodan
	- tiny-language-model
	- mlx
	- apple-silicon
	- byte-bpe
	---

	# Rodan-10M

	A ~11M-parameter language model trained start to finish on one Apple M2 with MLX. The aim was a tiny model
	that actually holds up for its size, judged on benchmark score per parameter rather than raw leaderboard rank.

	| Model | Stage | Purpose |
	|---|---|---|
	| Rodan-10M-Base | pretraining | foundation: commonsense + knowledge |
	| Rodan-10M-Chat (released) | instruction fold | chat / instruction following |
	| Rodan-10M-Reasoning (released) | recursive depth + CoT fold + DPO | verifiable math + reasoning |

	This card covers the base model only. The chat and reasoning stages are separate models with their own
	repos and cards.

	## Architecture

	Decoder-only transformer, wide per layer (the proportions take a cue from Gemma-style edge models), 11.46M params.

	```
	vocab_size 8192 byte-level BPE
	dim 320
	n_layers 8
	n_heads 8 head_dim 40
	n_kv_heads 1 MQA (8 query heads share 1 KV head)
	ffn_hidden 768 SwiGLU
	max_seq_len 512
	norm RMSNorm (eps 1e-5)
	position RoPE (base 200000), applied after QK-norm
	tied_embeddings true
	value_residual true mix layer-0 values into later layers
	ple_rank 16 factorized per-layer value-embeddings
	lrm true learnable per-row/col weight multipliers (Falcon LRM)
	recurse 1 re-run the shared block stack N times (1 = base; >1 used by the reasoning stage)
	```

	The `recurse` knob is a recursive-depth mechanism (Universal-Transformer-style weight sharing, inspired by
	the TRM/HRM "tiny recursive reasoning" line). Setting `recurse=N` runs the same 8 blocks N times over the
	residual stream, so you get the effective depth of `8·N` layers at zero extra parameters. The base runs
	`recurse=1` (it's a plain 8-layer model). The reasoning stage warm-starts these weights and trains at
	`recurse=2` (16 effective layers, still 10.41M params), letting the model spend more compute per token on
	hard problems without growing. It is not the full TRM/HRM algorithm (no separate answer/latent states, no
	deep supervision); it's the shared-recursion idea applied to an autoregressive LM.
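The shared-recursion idea fits in a few lines. This is a minimal sketch in plain Python rather than the repo's MLX code: `run_stack`, `effective_depth`, and the toy blocks are hypothetical names, and each "block" only adds 1 so the number of applications is easy to see.

```python
from typing import Callable, List, Sequence

Block = Callable[[List[float]], List[float]]

def run_stack(x: List[float], blocks: Sequence[Block], recurse: int = 1) -> List[float]:
    """Run the same block stack `recurse` times over the residual stream."""
    if recurse < 1:
        raise ValueError("recurse must be >= 1")
    for _ in range(recurse):
        for block in blocks:  # the same 8 blocks (same weights) on every pass
            x = block(x)
    return x

def effective_depth(n_layers: int, recurse: int) -> int:
    """Layers applied per token: n_layers * recurse, with no new parameters."""
    return n_layers * recurse

calls: List[int] = []  # records which block ran, in order

def make_block(index: int) -> Block:
    def block(x: List[float]) -> List[float]:
        calls.append(index)
        return [v + 1.0 for v in x]  # toy stand-in for attention + FFN
    return block

blocks = [make_block(i) for i in range(8)]
out = run_stack([0.0], blocks, recurse=2)
print(out, len(calls), effective_depth(8, 2))
```

With `recurse=2` the eight toy blocks are each applied twice in order, which is the "16 effective layers from 8 blocks' worth of weights" behaviour described above.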

	The base was built in two passes: a from-scratch run on 262M tokens, then a warm-start continuation on another
	115M tokens that adds LRM, raises the RoPE base from 10k to 200k, and mixes in 21% arithmetic/reasoning data
	(Falcon's reasoning-in-pretraining idea). That second pass is the 11.46M v6 checkpoint.
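To see what the RoPE base change does numerically, here is a small sketch using the standard RoPE inverse-frequency formula, `base ** (-2i / head_dim)`. It assumes the model uses that standard form with `head_dim = 40`; the function names are illustrative and not taken from the training code.

```python
import math
from typing import List

def rope_inv_freq(head_dim: int, base: float) -> List[float]:
    """Standard RoPE inverse frequencies, one per rotated pair of dimensions."""
    return [base ** (-(2 * i) / head_dim) for i in range(head_dim // 2)]

def longest_wavelength(head_dim: int, base: float) -> float:
    """Wavelength in tokens of the slowest-rotating pair."""
    return 2 * math.pi / rope_inv_freq(head_dim, base)[-1]

old = longest_wavelength(40, 10_000.0)   # first pass
new = longest_wavelength(40, 200_000.0)  # warm-start continuation (v6)
print(f"longest wavelength: {old:,.0f} tokens at base 10k, {new:,.0f} at base 200k")
```

A larger base slows the low-frequency pairs down, so their rotation angles change less across the 512-token context. The sketch only illustrates the arithmetic and makes no claim about how much of the quality change came from this knob.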

	Pre-norm residual blocks: `x += Attn(RMSNorm(x))`, then `x += SwiGLU(RMSNorm(x))`. Layer-0's attention
	values feed the value-residual mix in every later layer, and each layer also adds its own low-rank value-PLE.
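The block structure above can be sketched in plain Python. `rms_norm` follows the usual RMSNorm definition without a learned gain, `pre_norm_block` takes the attention and feed-forward sublayers as callables, and `mix_values` is an assumed form of the value-residual mix (a convex blend with weight `lam`). The card does not spell out the exact mixing formula, so treat that part as a guess.

```python
import math
from typing import Callable, List

Vector = List[float]
Sublayer = Callable[[Vector], Vector]

def rms_norm(x: Vector, eps: float = 1e-5) -> Vector:
    """Divide x by its root-mean-square (learned gain omitted for brevity)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def pre_norm_block(x: Vector, attn: Sublayer, ffn: Sublayer) -> Vector:
    """x += attn(RMSNorm(x)), then x += ffn(RMSNorm(x))."""
    x = [a + b for a, b in zip(x, attn(rms_norm(x)))]
    x = [a + b for a, b in zip(x, ffn(rms_norm(x)))]
    return x

def mix_values(v_layer: Vector, v_first: Vector, lam: float) -> Vector:
    """Assumed value-residual mix: lam * layer-0 values + (1 - lam) * this layer's."""
    return [lam * f + (1.0 - lam) * l for l, f in zip(v_layer, v_first)]

def zero(h: Vector) -> Vector:
    """Sublayer that contributes nothing, so the residual path is all that remains."""
    return [0.0 for _ in h]

normed = rms_norm([3.0, 4.0])
unchanged = pre_norm_block([3.0, 4.0], zero, zero)
mixed = mix_values([0.0, 0.0], [1.0, 2.0], lam=0.77)
print(normed, unchanged, mixed)
```

With zero sublayers the block returns its input unchanged, which is the property that makes pre-norm residual stacks easy to warm-start and to re-run recursively.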

	Why these specific choices at 11M, where every parameter has to earn its place:

	- 8k vocab with tied embeddings. Only about 23% of the params sit in the embedding table, versus roughly
	70% for a 49k-vocab model this size. That frees most of the budget for the layers that do the computing.
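	- MQA, because sharing one KV head is the cheapest attention that still works: K and V together cost 2·320·40 = 25.6k params per layer instead of 204.8k with 8 KV heads, which leaves params for depth and embeddings.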
	- value-residual does most of the heavy lifting. A checkpoint probe shows later layers blending 77-99% of
	layer-0's values, so it acts as a shared value memory and a gradient highway at once.
	- LRM (learnable row/col multipliers) probed about 20% off identity, so the model is genuinely using it.
	- QK-norm for attention stability, from the nanoGPT-speedrun stack.
	- value-PLE we tried and then removed. The probe found it dead: 0.2% contribution, weight-decayed to near
	zero. v9 drops it and lands at 10.41M with no loss in quality.
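The parameter budget behind these choices can be roughly reproduced from the config. The sketch below assumes bias-free linear layers, three `dim × ffn_hidden` matrices for SwiGLU, and a per-layer `vocab_size × ple_rank` table for value-PLE. None of those layout details are confirmed by the card, and norms, LRM multipliers, and the PLE up-projection are left out, so the total comes up slightly short of the reported 11.46M.

```python
vocab_size, dim, n_layers = 8192, 320, 8
n_heads, n_kv_heads, head_dim = 8, 1, 40
ffn_hidden, ple_rank = 768, 16

embedding = vocab_size * dim               # tied with the output head, so counted once
q_proj = dim * n_heads * head_dim
kv_proj = 2 * dim * n_kv_heads * head_dim  # MQA: a single shared K and V head
o_proj = n_heads * head_dim * dim
attention = q_proj + kv_proj + o_proj
swiglu = 3 * dim * ffn_hidden              # gate, up, and down projections
blocks = n_layers * (attention + swiglu)
ple_tables = n_layers * vocab_size * ple_rank  # assumed layout of the value-PLE tables

REPORTED_TOTAL = 11.46e6                   # figure quoted in this card
counted = embedding + blocks + ple_tables

print(f"embedding  {embedding:>10,} ({embedding / REPORTED_TOTAL:.1%} of reported total)")
print(f"blocks     {blocks:>10,}")
print(f"ple tables {ple_tables:>10,}")
print(f"counted    {counted:>10,} of {REPORTED_TOTAL:,.0f} reported")
```

Under these assumptions the embedding table is about 23% of the reported total, and the PLE tables account for roughly the 1.05M parameters that v9 removes.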

	## Training

	- Optimizer: Muon on the 2D hidden weights, AdamW on the embeddings, norms, and LRM multipliers, joined
	through MultiOptimizer, cosine LR, grad-clip 1.0.
	- Framework: MLX on Apple Silicon, with an `mx.compile`d step. About 0.6-0.7 it/s on one fanless M2 MacBook Air.
	- Data: a warm-start chain of short stages, with fresh tokens each time so nothing gets re-looped and memorized.
	The mixes for the base (v6) and the challenger that followed it (v9):

	| Source | v6 base (mixed5) | v9 (mixed8) | Content |
	|---|---|---|---|
	| Cosmopedia v2 | 27% | 31% | synthetic textbooks → commonsense |
	| dolmino-mix-1124 (pes2o + StackExchange) | 35% | 26% | academic papers + Q&A → knowledge/ARC |
	| synthetic arithmetic (ArithMark-style) | 21% | 19% | computation → ArithMark |
	| FineMath-4plus | 10% | 15% | math prose |
	| science-QA (SciQ/OBQA/QASC/ARC-train) | 6% | 9% | science MC |
	| tokens | ~0.38B | +0.12B fresh | curated, no raw web |
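The optimizer split from the Training list (Muon on 2D hidden weights, AdamW on everything else) comes down to a routing rule over parameters. Here is a sketch of that rule in plain Python. The parameter names and the substring checks are hypothetical and will not match the actual checkpoint keys; in the real run the two optimizers are joined through MLX's MultiOptimizer, which this sketch does not reproduce.

```python
from typing import Dict, Tuple

def pick_optimizer(name: str, shape: Tuple[int, ...]) -> str:
    """Route 2D hidden weight matrices to Muon and everything else to AdamW."""
    is_matrix = len(shape) == 2
    is_embedding = "embed" in name  # hypothetical naming convention
    is_lrm = "lrm" in name          # hypothetical naming convention
    if is_matrix and not is_embedding and not is_lrm:
        return "muon"
    return "adamw"

# Hypothetical parameter names, with shapes taken from the architecture listing.
params: Dict[str, Tuple[int, ...]] = {
    "tok_embed.weight": (8192, 320),
    "layers.0.attn.q_proj.weight": (320, 320),
    "layers.0.attn.k_proj.weight": (40, 320),
    "layers.0.ffn.gate.weight": (768, 320),
    "layers.0.attn_norm.weight": (320,),
    "layers.0.attn.q_proj.lrm_row": (320,),
}

groups = {name: pick_optimizer(name, shape) for name, shape in params.items()}
for name, group in groups.items():
    print(f"{group:>5}  {name}")
```

The embedding table is 2D but is still routed to AdamW, matching the list above. That is why the rule checks the parameter's role as well as its rank.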

	Two things we found out the hard way. First, adding FineWeb-Edu (45%, then 25%) lost to v6 both times, and
	the drop was monotonic in the raw-web share: raw web hurts at 11M. The model is too small to digest it, and
	the curated synthetic-plus-academic mix wins instead. Second, the probe that killed value-PLE also confirmed
	value-residual and LRM are doing real work. So v9 is the pure-curated, PLE-free version at 10.41M: it
	drops the two things we'd shown were dead weight (raw web and value-PLE) and keeps the recipe that worked.

	Training-compute efficiency, from the actual runs (perplexity vs cumulative FLOPs, `6·N·tokens`):

	![Perplexity vs Training Compute](flops_efficiency.png)
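For reference, the `6·N·tokens` estimate on the x-axis works out as follows for the two v6 passes. The sketch uses the 11.46M figure for both passes as a simplification, since the first pass ran before LRM was added and its exact count is not given here.

```python
def train_flops(n_params: float, n_tokens: float) -> float:
    """Rule-of-thumb training compute: about 6 FLOPs per parameter per token."""
    return 6.0 * n_params * n_tokens

N_PARAMS = 11.46e6                       # v6 parameter count from this card
pass_one = train_flops(N_PARAMS, 262e6)  # from-scratch base
pass_two = train_flops(N_PARAMS, 115e6)  # warm-start continuation
total = pass_one + pass_two

print(f"pass one: {pass_one:.3e} FLOPs")
print(f"pass two: {pass_two:.3e} FLOPs")
print(f"total:    {total:.3e} FLOPs")
```

That puts the whole v6 run at roughly 2.6e16 FLOPs under this approximation.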

	Intelligence per parameter (board avg vs log-params; the shaded region is above the size-fit line):

	![Intelligence per parameter](intelligence_per_param.png)

	The fit runs over the board models, with a residual σ of about 3.07 that matches the board's own. Rodan v6
	sits roughly +0.3σ above the size-fit line, so it is above trend per parameter and ahead of Liodon and the
	other similar-size models that fall below the line. It does this on roughly 1/65th of the tokens used by the
	leading models, which train on about 25B.

	Training loss and data mix, v6 vs v9:

	![Training loss and data mix](loss_datamix.png)

	v9 starts from v6, drops the dead PLE to land at 10.41M, and trains on the pure-curated mix. The result was a
	tie: board avg 35.70 against v6's 35.80, a 0.10 gap that's well inside the noise, at 9% fewer parameters. It
	gave up about 1.7 points of HellaSwag and picked up 2.0 on ArithMark (28.4, the folded arithmetic finally
	showing), and the per-param number came out about even too (~+0.32σ vs v6's +0.31σ). Two conclusions fall
	out of that. First, PLE really was dead weight, since cutting 1.05M params changed nothing. Second, none of
	the variants we ran beat the base: raw web lowered the board avg and the leaner pure-curated mix only matched
	v6's 35.8, so v6 stays the packaged checkpoint. Unique tokens stay around 0.5B the whole way, a
	small fraction of what the leading models use, so there is likely more to gain from additional curated tokens.

	## Evaluation

	Zero-shot through lm-eval-harness, with a custom MLX backend for `loglikelihood`. We use acc_norm for the
	length-sensitive multiple-choice tasks (HellaSwag, ARC, OpenBookQA) and plain acc otherwise.
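The difference between the two metrics is easiest to see on a toy example. In this sketch `acc` picks the choice with the highest raw log-likelihood, while `acc_norm` first divides each log-likelihood by the length of the choice text. The character-count normalizer is our reading of the harness and should be treated as an assumption, and the log-likelihood values are made up for illustration, not model outputs.

```python
from typing import Sequence

def pick_acc(loglikelihoods: Sequence[float]) -> int:
    """Index of the choice with the highest raw log-likelihood."""
    return max(range(len(loglikelihoods)), key=lambda i: loglikelihoods[i])

def pick_acc_norm(loglikelihoods: Sequence[float], choices: Sequence[str]) -> int:
    """Index of the choice with the highest length-normalized log-likelihood."""
    return max(range(len(choices)), key=lambda i: loglikelihoods[i] / len(choices[i]))

# Illustrative numbers only: the long answer has a lower total log-likelihood
# simply because it contains more tokens.
choices = ["yes", "a much longer but correct answer"]
lls = [-4.0, -9.0]

raw_pick = pick_acc(lls)
norm_pick = pick_acc_norm(lls, choices)
print(f"acc picks choice {raw_pick}, acc_norm picks choice {norm_pick}")
```

Raw log-likelihood favours the short answer here, while the normalized score favours the long one. That is why the length-sensitive tasks are reported with acc_norm.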

	"The board" throughout is the [Open SLM Leaderboard](https://huggingface.co/spaces/AxiomicLabs/Open_SLM_Leaderboard)
	(AxiomicLabs, sub-150M tier). Zero-shot, limit 1000 examples per task.
	Board avg = (HellaSwag + (ARC-E + ARC-C)/2 + PIQA + ArithMark) / 4.

	| Task | Metric | Score | Random |
	|---|---|---|---|
	| SciQ | acc | 67.5 | 25 |
	| PIQA | acc | 56.0 | 50 |
	| COPA | acc | 55.0 | 50 |
	| ARC-Easy | acc_norm | 35.6 | 25 |
	| HellaSwag | acc_norm | 31.8 | 25 |
	| OpenBookQA | acc_norm | 27.0 | 25 |
	| ArithMark-2 | acc | 26.4 | 25 |
	| ARC-Challenge | acc_norm | 22.4 | 25 |
	| Winogrande | acc | 49.8 | 50 |
	| LogicMark | acc | 44.8 | 25 |
	| BoolQ | acc | 37.6 | ~50 |
	| CommonsenseQA | acc | 20.7 | 20 |
	| Board avg (÷4) | | 35.80 | |
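The board average can be recomputed from the table with the formula given above. This sketch assumes the "ArithMark" term in the formula is the ArithMark-2 row.

```python
from typing import Dict

def board_avg(scores: Dict[str, float]) -> float:
    """(HellaSwag + mean(ARC-E, ARC-C) + PIQA + ArithMark) / 4."""
    arc = (scores["ARC-Easy"] + scores["ARC-Challenge"]) / 2
    return (scores["HellaSwag"] + arc + scores["PIQA"] + scores["ArithMark"]) / 4

v6_scores = {
    "HellaSwag": 31.8,
    "ARC-Easy": 35.6,
    "ARC-Challenge": 22.4,
    "PIQA": 56.0,
    "ArithMark": 26.4,  # the ArithMark-2 row in the table
}

v6_avg = board_avg(v6_scores)
print(f"board avg: {v6_avg:.2f}")
```

The two ARC splits are averaged first, so together they count as one of the four terms.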

	For context, at 11.46M it's just over the 10M line, but it outscores the 10M-tier leader (Liodon) on about
	1/65th the tokens:

	| Model | Params | Tokens | Board avg (÷4) |
	|---|---|---|---|
	| Rodan-10M-Base (v6) | 11.46M | ~0.38B | 35.80 |
	| Liodon SLM-10M | 10M | 25B | 35.09 |
	| GPT-S-5M (Axiomic) | 5.2M | 25B | 34.75 |

	![v6 benchmarks](v6_v9_metrics.png)

	v6 sits about +0.3σ above the size-fit line, so it is above trend per parameter and ahead of Liodon. The v9
	challenger (PLE-free, 10.41M, pure-curated) tied it: 35.70 board avg at 9% fewer params, about even on
	per-param too. v9 confirmed that PLE was dead weight, but since it didn't beat v6's board score, v6 stays the
	base. From here the work moved to the capability stages (chat, reasoning).

	What the model is actually like: it's solid for 11M on commonsense and science multiple-choice. SciQ
	(67.5), PIQA (56.0), COPA (55.0), ARC-Easy (35.6), and HellaSwag (31.8) are all clearly above random.
	Arithmetic has crept off the random floor (ArithMark 26.4) thanks to the folded-in computation
	data, though it's a modest lift and actually generating arithmetic is still weak. On the harder abstract
	reasoning tasks (Winogrande, CommonsenseQA, ARC-Challenge, OpenBookQA) and on open-ended generation it's near
	chance, partly because of the limited capacity at this size and partly because of loglikelihood length-bias.
	It's a solid base for discrimination; the deeper reasoning is the job of the separate Chat and Reasoning models.

	## Limitations

	- English only, ~11M params. This is a research and teaching base, not something to put in front of users or
	trust for facts.
	- It's reliable only on the easy commonsense and science multiple-choice where it beats random. On abstract
	reasoning (Winogrande, CommonsenseQA, ARC-Challenge) and arithmetic it's at or near chance.
	- No instruction tuning or safety alignment yet. It completes text; it does not follow instructions.
	- Trained on about one epoch of a curated mix, so coverage of rare facts is thin compared to models trained
	on far more tokens.

	## Files

	A standard model repo: `model.safetensors` (weights), `tokenizer.json` (8k byte-level BPE), `config.json`.
	Trained on a single Apple M2 with MLX in about six hours.

	## License

	Weights are open under Apache-2.0, per the `license` field in the card metadata. Data falls under the
	respective dataset licenses (Cosmopedia, dolmino-mix ODC-By, AllenAI QA sets, FineMath).