Instructions to use bfuzzy1/Rodan-Base with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- MLX
How to use bfuzzy1/Rodan-Base with MLX:
```python
# Make sure mlx-lm is installed
# pip install --upgrade mlx-lm
# if on a CUDA device, also pip install mlx[cuda]

# Generate text with mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("bfuzzy1/Rodan-Base")
prompt = "Once upon a time in"
text = generate(model, tokenizer, prompt=prompt, verbose=True)
```
- Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- LM Studio
- MLX LM
How to use bfuzzy1/Rodan-Base with MLX LM:
Generate or start a chat session
```shell
# Install MLX LM
uv tool install mlx-lm

# Generate some text
mlx_lm.generate --model "bfuzzy1/Rodan-Base" --prompt "Once upon a time"
```
| license: apache-2.0 | |
| language: | |
| - en | |
| library_name: mlx | |
| pipeline_tag: text-generation | |
| tags: | |
| - rodan | |
| - tiny-language-model | |
| - mlx | |
| - apple-silicon | |
| - byte-bpe | |
| # Rodan-10M | |
| A ~11M-parameter language model trained start to finish on one Apple M2 with MLX. The aim was a tiny model | |
| that holds up for its size, judged by benchmark score per parameter rather than by raw leaderboard rank. | |
| | Model | Stage | Purpose | | |
| |---|---|---| | |
| | **Rodan-10M-Base** | pretraining | foundation: commonsense + knowledge | | |
| | Rodan-10M-Chat *(released)* | instruction fold | chat / instruction following | | |
| | Rodan-10M-Reasoning *(released)* | recursive depth + CoT fold + DPO | verifiable math + reasoning | | |
| This card covers the base model only. The chat and reasoning stages are separate models with their own | |
| repos and cards. | |
| ## Architecture | |
| Decoder-only transformer, wide per layer (the proportions take a cue from Gemma-style edge models), 11.46M params. | |
| ``` | |
| vocab_size 8192 byte-level BPE | |
| dim 320 | |
| n_layers 8 | |
| n_heads 8 head_dim 40 | |
| n_kv_heads 1 MQA (8 query heads share 1 KV head) | |
| ffn_hidden 768 SwiGLU | |
| max_seq_len 512 | |
| norm RMSNorm (eps 1e-5) | |
| position RoPE (base 200000), applied after QK-norm | |
| tied_embeddings true | |
| value_residual true mix layer-0 values into later layers | |
| ple_rank 16 factorized per-layer value-embeddings | |
| lrm true learnable per-row/col weight multipliers (Falcon LRM) | |
| recurse 1 re-run the shared block stack N times (1 = base; >1 used by the reasoning stage) | |
| ``` | |
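As a rough sanity check on the config above, the weight matrices can be tallied by hand. This is only a sketch: it assumes bias-free linear layers and a three-matrix SwiGLU (gate, up, down), neither of which the card states, and it does not count norms, LRM multipliers, or the value-PLE.

```python
# Rough parameter tally from the config table.
# Assumptions (not stated in the card): no biases, SwiGLU with three matrices.
vocab_size, dim, n_layers = 8192, 320, 8
n_heads, n_kv_heads, head_dim, ffn_hidden = 8, 1, 40, 768

embedding = vocab_size * dim                 # tied, so counted once
q_proj = dim * n_heads * head_dim            # 8 query heads
kv_proj = 2 * dim * n_kv_heads * head_dim    # MQA: one K head and one V head
o_proj = n_heads * head_dim * dim
attention = q_proj + kv_proj + o_proj
swiglu = 3 * dim * ffn_hidden                # gate, up, down
per_layer = attention + swiglu
matrices = embedding + n_layers * per_layer

print(f"embedding  {embedding:>11,}")
print(f"per layer  {per_layer:>11,}")
print(f"matrices   {matrices:>11,}")
print(f"embedding share of 11.46M: {embedding / 11.46e6:.1%}")
```

Under these assumptions the matrices come to about 10.36M, which leaves roughly 1.1M of the stated 11.46M for the norms, the LRM multipliers, and the value-PLE.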
| The `recurse` knob is a recursive-depth mechanism (Universal-Transformer-style weight sharing, inspired by | |
| the TRM/HRM "tiny recursive reasoning" line). Setting `recurse=N` runs the same 8 blocks N times over the | |
| residual stream, so you get the effective depth of `8·N` layers at **zero extra parameters**. The base runs | |
| `recurse=1` (it's a plain 8-layer model). The reasoning stage warm-starts these weights and trains at | |
| `recurse=2` (16 effective layers, still 10.41M params), letting the model spend more compute per token on | |
| hard problems without growing. It is not the full TRM/HRM algorithm (no separate answer/latent states, no | |
| deep supervision); it's the shared-recursion idea applied to an autoregressive LM. | |
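The shared-recursion idea can be sketched in a few lines of plain Python. The `CountingBlock` class and `run_stack` function are hypothetical stand-ins, not the model's code: each toy block holds one scalar "weight", and the loop simply re-runs the same eight blocks `recurse` times over the residual stream.

```python
from typing import List, Sequence


class CountingBlock:
    """Toy stand-in for a transformer block with a single scalar weight."""

    def __init__(self, scale: float) -> None:
        self.scale = scale
        self.calls = 0

    def __call__(self, h: Sequence[float]) -> List[float]:
        self.calls += 1
        return [self.scale * v for v in h]


def run_stack(x: Sequence[float], blocks: Sequence[CountingBlock], recurse: int = 1) -> List[float]:
    """Run the same block list `recurse` times, adding each output to the residual."""
    h = list(x)
    for _ in range(recurse):
        for block in blocks:
            update = block(h)
            h = [a + b for a, b in zip(h, update)]
    return h


blocks = [CountingBlock(0.01 * (i + 1)) for i in range(8)]
out = run_stack([1.0, -2.0], blocks, recurse=2)

n_params = len(blocks)                           # one weight per block, unchanged by recursion
effective_depth = sum(b.calls for b in blocks)   # block applications actually executed
print(n_params, effective_depth)
```

With `recurse=2` every block is applied twice, so sixteen block applications run while the number of distinct weights stays at eight.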
| It was built in two passes: a from-scratch base on 262M tokens, then a warm-start continue on another | |
| 115M tokens that adds LRM, raises the RoPE base from 10k to 200k, and mixes in 21% arithmetic/reasoning data | |
| (Falcon's reasoning-in-pretraining idea). That second pass is the 11.46M v6 checkpoint. | |
| Pre-norm residual blocks: `x += Attn(RMSNorm(x))`, then `x += SwiGLU(RMSNorm(x))`. Layer-0's attention | |
| values feed the value-residual mix in every later layer, and each layer also adds its own low-rank value-PLE. | |
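The pre-norm update order above can be written out with a minimal RMSNorm. This is an illustrative sketch on plain Python lists: `attn` and `ffn` are placeholder callables, and the norm weights are assumed to be simple per-channel gains.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]


def rms_norm(x: Sequence[float], weight: Sequence[float], eps: float = 1e-5) -> Vector:
    """Divide by the root-mean-square of x, then apply a per-channel gain."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [w * v * inv_rms for w, v in zip(weight, x)]


def pre_norm_block(
    x: Sequence[float],
    attn: Callable[[Vector], Vector],
    ffn: Callable[[Vector], Vector],
    w_attn: Sequence[float],
    w_ffn: Sequence[float],
) -> Vector:
    """x += attn(norm(x)), then x += ffn(norm(x)); the sublayers only see normalized input."""
    h = [a + b for a, b in zip(x, attn(rms_norm(x, w_attn)))]
    return [a + b for a, b in zip(h, ffn(rms_norm(h, w_ffn)))]


ones = [1.0, 1.0]
normed = rms_norm([3.0, 4.0], ones)
normed_mean_sq = sum(v * v for v in normed) / len(normed)

# Placeholder sublayers: attention returns zeros, the FFN halves its input.
y = pre_norm_block([3.0, 4.0], lambda h: [0.0 for _ in h], lambda h: [0.5 * v for v in h], ones, ones)
print(normed_mean_sq, y)
```

The residual stream itself is never normalized in place; only the copies fed to the sublayers are.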
| Why these specific choices at 11M, where every parameter has to earn its place: | |
| - 8k vocab with tied embeddings. Only about 23% of the params sit in the embedding table, versus roughly | |
| 70% for a 49k-vocab model this size. That frees most of the budget for the layers that do the computing. | |
| - MQA, because a single shared KV head is the cheapest attention variant that still works, which leaves params for depth and embeddings. | |
| - value-residual does most of the heavy lifting. A checkpoint probe shows later layers blending 77-99% of | |
| layer-0's values, so it acts as a shared value memory and a gradient highway at once. | |
| - LRM (learnable row/col multipliers) probed about 20% off identity, so the model is genuinely using it. | |
| - QK-norm for attention stability, from the nanoGPT-speedrun stack. | |
| - value-PLE we tried and then removed. The probe found it dead: 0.2% contribution, weight-decayed to near | |
| zero. v9 drops it and lands at 10.41M with no loss in quality. | |
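Two of the mechanisms in the list above are easy to show in miniature. The card does not give their exact formulas, so both functions below are assumptions: `mix_values` uses a convex blend with a mixing weight `lam`, and `apply_lrm` scales each weight entry by a row multiplier times a column multiplier.

```python
from typing import List, Sequence


def mix_values(v_layer: Sequence[float], v_first: Sequence[float], lam: float) -> List[float]:
    """Assumed value-residual form: blend this layer's values with layer-0's values."""
    return [(1.0 - lam) * a + lam * b for a, b in zip(v_layer, v_first)]


def apply_lrm(
    weight: Sequence[Sequence[float]],
    row_mult: Sequence[float],
    col_mult: Sequence[float],
) -> List[List[float]]:
    """Assumed LRM form: W[i][j] * row_mult[i] * col_mult[j]."""
    return [
        [row_mult[i] * w * col_mult[j] for j, w in enumerate(row)]
        for i, row in enumerate(weight)
    ]


# lam = 0.77 is the low end of the 77-99% blend reported by the probe.
mixed = mix_values([1.0, 0.0], [0.0, 1.0], lam=0.77)

# Multipliers about 20% away from 1.0, echoing the "20% off identity" probe result.
scaled = apply_lrm([[1.0, 2.0], [3.0, 4.0]], row_mult=[1.2, 1.0], col_mult=[1.0, 0.8])
print(mixed, scaled)
```

Multipliers of exactly 1.0 leave the matrix unchanged, which is why the distance from identity indicates whether LRM is being used at all.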
| ## Training | |
| - Optimizer: Muon on the 2D hidden weights, AdamW on the embeddings, norms, and LRM multipliers, joined | |
| through MultiOptimizer, cosine LR, grad-clip 1.0. | |
| - Framework: MLX on Apple Silicon, with an `mx.compile`d step. About 0.6-0.7 it/s on one fanless M2 MacBook Air. | |
| - Data: a warm-start chain of short stages, fresh tokens each time so nothing gets re-looped and memorized. | |
| Here are the base (v6) and the challenger that followed it (v9): | |
| | Source | v6 base (mixed5) | v9 (mixed8) | Content | | |
| |---|---|---|---| | |
| | Cosmopedia v2 | 27% | 31% | synthetic textbooks → commonsense | | |
| | dolmino-mix-1124 (pes2o + StackExchange) | 35% | 26% | academic papers + Q&A → knowledge/ARC | | |
| | synthetic arithmetic (ArithMark-style) | 21% | 19% | computation → ArithMark | | |
| | FineMath-4plus | 10% | 15% | math prose | | |
| | science-QA (SciQ/OBQA/QASC/ARC-train) | 6% | 9% | science MC | | |
| | **tokens** | ~0.38B | +0.12B fresh | curated, no raw web | | |
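To make the mix table concrete, the v9 shares can be turned into an approximate token budget per source. This sketch assumes the percentages are shares of tokens and that they apply to the 0.12B fresh tokens; the card does not spell out either point.

```python
# Shares copied from the table, in whole percent.
v6_mix = {"Cosmopedia v2": 27, "dolmino-mix-1124": 35, "synthetic arithmetic": 21,
          "FineMath-4plus": 10, "science-QA": 6}
v9_mix = {"Cosmopedia v2": 31, "dolmino-mix-1124": 26, "synthetic arithmetic": 19,
          "FineMath-4plus": 15, "science-QA": 9}

v9_fresh_tokens = 120_000_000  # "+0.12B fresh" from the table

# Integer arithmetic keeps the per-source budgets exact.
v9_budget = {name: v9_fresh_tokens * pct // 100 for name, pct in v9_mix.items()}

for name, tokens in v9_budget.items():
    print(f"{name:<22}{tokens:>12,}")

# The v6 column sums to 99, presumably from rounding; v9 sums to 100.
print(sum(v6_mix.values()), sum(v9_mix.values()))
```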
| Two things we found out the hard way. First, adding FineWeb-Edu (45%, then 25%) lost to v6 both times, in | |
| a clean monotonic line: raw web hurts at 11M. The model is too small to digest it, and the curated | |
| synthetic-plus-academic mix wins instead. Second, the probe that killed value-PLE also confirmed | |
| value-residual and LRM are doing real work. So v9 is the pure-curated, PLE-free version at 10.41M: it | |
| drops both of the things we'd shown were dead weight and keeps the recipe that worked. | |
| Training-compute efficiency, from the actual runs (perplexity vs cumulative FLOPs, `6·N·tokens`): | |
|  | |
| Intelligence per parameter (board avg vs log-params; the shaded region is above the size-fit line): | |
|  | |
| The fit runs over the board models, with a residual σ of about 3.07 that matches the board's own. Rodan v6 | |
| sits roughly 0.3σ above the size-fit line, ahead of Liodon and the other similar-size models, which fall | |
| below it. It does this on roughly 1/65th the tokens of the leading models, which train on about 25B. | |
| Training loss and data mix, v6 vs v9: | |
|  | |
| v9 starts from v6, drops the dead PLE down to 10.41M, and trains on the pure-curated mix. The result was a | |
| tie: board avg 35.70 against v6's 35.80, a 0.10 gap that's well inside the noise, at 9% fewer parameters. It | |
| gave up about 1.7 points of HellaSwag and picked up 2.0 on ArithMark (28.4, the folded arithmetic finally | |
| showing), and the per-param number came out about even too (~+0.32σ vs v6's +0.31σ). Two conclusions fall | |
| out of that. First, PLE really was dead weight, since cutting 1.05M params changed nothing. Second, no | |
| variant beat the base: raw web lowered the board avg and the leaner pure-curated mix only matched v6's | |
| 35.80, so v6 stays the packaged checkpoint. Unique tokens stay around 0.5B the whole way, a small | |
| fraction of what the leading models use, so there is likely more to gain from additional curated tokens. | |
| ## Evaluation | |
| Zero-shot through lm-eval-harness, with a custom MLX backend for `loglikelihood`. We use acc_norm for the | |
| length-sensitive multiple-choice tasks (HellaSwag, ARC, OpenBookQA) and plain acc otherwise. | |
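The difference between acc and acc_norm comes down to how the winning choice is picked from per-choice log-likelihoods. The sketch below is illustrative only: the log-likelihood values are made up, and normalizing by character length is an assumption about the harness, so check lm-eval-harness for the exact normalizer.

```python
from typing import Sequence


def pick_acc(loglikelihoods: Sequence[float]) -> int:
    """acc: choose the continuation with the highest total log-likelihood."""
    return max(range(len(loglikelihoods)), key=lambda i: loglikelihoods[i])


def pick_acc_norm(loglikelihoods: Sequence[float], choices: Sequence[str]) -> int:
    """acc_norm: divide each log-likelihood by the continuation's length first."""
    return max(range(len(choices)), key=lambda i: loglikelihoods[i] / len(choices[i]))


# Hypothetical example: the long ending has a lower total score only because it is long.
choices = ["yes", "a much longer but correct ending"]
loglikelihoods = [-6.0, -20.0]

raw_pick = pick_acc(loglikelihoods)
norm_pick = pick_acc_norm(loglikelihoods, choices)
print(raw_pick, norm_pick)
```

Here the raw score favors the short choice while the length-normalized score favors the long one, which is why the normalized metric is used for tasks with endings of varying length.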
| "The board" throughout is the [Open SLM Leaderboard](https://huggingface.co/spaces/AxiomicLabs/Open_SLM_Leaderboard) | |
| (AxiomicLabs, sub-150M tier). Zero-shot, limit 1000 examples per task. | |
| Board avg = (HellaSwag + (ARC-E + ARC-C)/2 + PIQA + ArithMark) / 4. | |
| | Task | Metric | Score | Random | | |
| |---|---|---|---| | |
| | SciQ | acc | 67.5 | 25 | | |
| | PIQA | acc | 56.0 | 50 | | |
| | COPA | acc | 55.0 | 50 | | |
| | ARC-Easy | acc_norm | 35.6 | 25 | | |
| | HellaSwag | acc_norm | 31.8 | 25 | | |
| | OpenBookQA | acc_norm | 27.0 | 25 | | |
| | ArithMark-2 | acc | 26.4 | 25 | | |
| | ARC-Challenge | acc_norm | 22.4 | 25 | | |
| | Winogrande | acc | 49.8 | 50 | | |
| | LogicMark | acc | 44.8 | 25 | | |
| | BoolQ | acc | 37.6 | ~50 | | |
| | CommonsenseQA | acc | 20.7 | 20 | | |
| | **Board avg (÷4)** | | **35.80** | | | |
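The board average in the last row can be recomputed from the table using the formula given above. The sketch just plugs in the five scores that enter it.

```python
scores = {
    "HellaSwag": 31.8,
    "ARC-Easy": 35.6,
    "ARC-Challenge": 22.4,
    "PIQA": 56.0,
    "ArithMark": 26.4,
}


def board_avg(s: dict) -> float:
    """(HellaSwag + mean(ARC-E, ARC-C) + PIQA + ArithMark) / 4."""
    arc = (s["ARC-Easy"] + s["ARC-Challenge"]) / 2
    return (s["HellaSwag"] + arc + s["PIQA"] + s["ArithMark"]) / 4


avg = board_avg(scores)
print(f"{avg:.2f}")  # prints 35.80
```

The two ARC scores are averaged first, so ARC counts as a single task in the four-way mean.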
| For context, at 11.46M it's just over the 10M line, but it outscores the sub-10M leader (liodon) on about | |
| 1/65th the tokens: | |
| | Model | Params | Tokens | Board avg (÷4) | | |
| |---|---|---|---| | |
| | **Rodan-10M-Base (v6)** | 11.46M | ~0.38B | **35.80** | | |
| | Liodon SLM-10M | 10M | 25B | 35.09 | | |
| | GPT-S-5M (Axiomic) | 5.2M | 25B | 34.75 | | |
|  | |
| v6 sits above the size-fit line (~+0.3σ), above-trend per parameter, ahead of liodon. The v9 challenger | |
| (PLE-free, 10.41M, pure-curated) tied it: 35.70 board avg at 9% fewer params, about even on per-param too. | |
| v9 confirmed that PLE was dead weight, but since it didn't beat v6's board score, v6 stays the base. From | |
| here the work moved to the capability stages (chat, reasoning). | |
| What the model is actually like: it's solid for 11M on commonsense and science multiple-choice. SciQ | |
| (67.5), PIQA (56.0), ARC-Easy (35.6), HellaSwag (31.8), and COPA (55.0) are all clearly above random. Arithmetic has crept off the random floor (ArithMark 26.4) thanks to the folded-in computation | |
| data, though it's a modest lift and actually generating arithmetic is still weak. On the harder abstract | |
| reasoning tasks (Winogrande, CommonsenseQA, ARC-Challenge, OpenBookQA) and on open-ended generation it's near | |
| chance, partly the limited capacity at this size and partly loglikelihood length-bias. It's a solid base for | |
| discrimination; the deeper reasoning is the job of the separate Chat and Reasoning models. | |
| ## Limitations | |
| - English only, ~11M params. This is a research and teaching base, not something to put in front of users or | |
| trust for facts. | |
| - It's reliable only on the easy commonsense and science multiple-choice where it beats random. On abstract | |
| reasoning (Winogrande, CommonsenseQA, ARC-Challenge) it's at chance, and on arithmetic it's barely above it | |
| (ArithMark 26.4 against 25 random). | |
| - No instruction tuning or safety alignment yet. It completes text; it does not follow instructions. | |
| - Trained on about one epoch of a curated mix, so coverage of rare facts is thin compared to models trained | |
| on far more tokens. | |
| ## Files | |
| A standard model repo: `model.safetensors` (weights), `tokenizer.json` (8k byte-level BPE), `config.json`. | |
| Trained on a single Apple M2 with MLX in about six hours. | |
| ## License | |
| Weights are released under Apache-2.0. Data falls under the respective dataset licenses (Cosmopedia, | |
| dolmino-mix ODC-By, AllenAI QA sets, FineMath). | |