---
license: apache-2.0
language:
- en
library_name: mlx
pipeline_tag: text-generation
tags:
- rodan
- tiny-language-model
- mlx
- reasoning
- chain-of-thought
- dpo
base_model: bfuzzy1/Rodan-Chat
---
# Rodan-10M-Reasoning

A 10.41M-parameter reasoning model trained on a single Apple M2 with MLX. It stacks on the chat model and
adds **recurrent depth**: the same 8 transformer blocks run twice per forward pass, giving the effective
depth of a 16-layer network at **zero extra parameters**. The idea is to spend more compute per token on
hard problems without growing the model.

> What it is, honestly. The recurrence *mechanism* works: the probes show the second pass doing real
> compositional computation, and the activation patching maps a genuine arithmetic circuit. The model does
> **accurate single-step arithmetic** and reads **natural-language word problems** into the right operation.
> A final **DPO** pass (verifiable preference pairs, KL-leashed) then fixed its restraint: it now answers
> simple facts directly instead of doing arithmetic on them (math-on-non-math prompts dropped from ~half to
> ~1 in 8), at no board cost. On the board it sits at **35.41**, about level with the base (35.80), because
> recurrent depth doesn't move discrimination benchmarks. The win is in *what it does*, not the board number.
>
> Part of the Rodan-10M series. Lineage: base v6 → v9 (PLE-free) → Chat (instruction fold) → **Reasoning
> (this model)**. Warm-started from Chat, so it keeps instruction-following and ChatML.

## Architecture

Same as the base/chat stack (dim 320, 8 layers, 8 heads, MQA with 1 KV head, SwiGLU 768, RMSNorm, RoPE base
200k, QK-norm, tied embeddings, value-residual, LRM, no PLE), with two changes:

- **`recurse=2`**: the 8 blocks run twice over the residual stream (16 effective layers, still 10.41M params).
- **ChatML + `<think>` template** for reasoning turns; direct answers for simple ones.

Trained in **bfloat16** (~8× faster than fp32 on this M2 at this depth/length), seq 512.

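
The `recurse=2` idea can be illustrated with a toy, pure-Python sketch. It is not the model's code: `make_block` and `forward` are hypothetical names, and each "block" here is a one-parameter residual function standing in for a full transformer block. The point it shows is only that looping the same blocks doubles the applied depth while the parameter count stays fixed.

```python
import math

def make_block(weight):
    """Return a toy residual block; `weight` stands in for a block's parameters."""
    def block(x):
        # Residual update x + f(x), the same shape as a transformer block's update.
        return [v + weight * math.tanh(v) for v in x]
    return block

def forward(blocks, x, recurse=2):
    """Run the same stack of blocks `recurse` times over the residual stream."""
    applied = 0
    for _ in range(recurse):
        for block in blocks:
            x = block(x)
            applied += 1
    return x, applied

weights = [0.1 * (i + 1) for i in range(8)]  # 8 toy blocks, one parameter each
blocks = [make_block(w) for w in weights]

x0 = [0.5, -1.0, 2.0]
out1, depth1 = forward(blocks, x0, recurse=1)
out2, depth2 = forward(blocks, x0, recurse=2)

print(depth1, depth2, len(weights))  # 8 16 8
```

Running with `recurse=1` is the cheaper inference setting; the second pass simply continues from where the first pass left the residual stream.
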
## Training recipe

Warm-started from Chat, then trained at `recurse=2` on a natural-language-reasoning mix. The key lesson from
the first attempt: an arithmetic-symbol-heavy fold made the model narrow (it tried to compute *everything*).
This version leads with word problems and adds a slice of direct-answer examples to teach restraint.

| share | source | mode |
|---|---|---|
| 24% | natural-language word problems (synthesized) | `<think>` → answer |
| 21% | symbolic arithmetic CoT | `<think>` → answer |
| 8% | answer-only facts | direct, no `<think>` |
| 2% | GSM8K | `<think>` → answer |
| 45% | replay (smol-smoltalk + curated: Cosmopedia / dolmino / FineMath / sci-QA) | mixed |

No web data anywhere; the curated-only lineage has held since v6. Optimizer: Muon + AdamW, LR 1.8e-3 / Muon 9e-3,
seq 512, 7000 steps, bf16.

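
One common way to realise a mix like the table above is to pick the source of each training example by weighted sampling. The sketch below assumes that approach; the card does not say how the mix was actually built, and the short source keys are hypothetical labels. Only the percentages come from the table.

```python
import random

# Shares copied from the table above; the keys are made-up labels.
MIX = {
    "word_problems": 24,
    "symbolic_cot": 21,
    "answer_only": 8,
    "gsm8k": 2,
    "replay": 45,
}

def sample_sources(n, seed=0):
    """Draw n source names with probability proportional to their share."""
    rng = random.Random(seed)
    names = list(MIX)
    return rng.choices(names, weights=[MIX[k] for k in names], k=n)

batch = sample_sources(10_000)
observed = {k: round(100 * batch.count(k) / len(batch), 1) for k in MIX}
print(observed)  # each value should land close to its share in MIX
```
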
|  | |
| ## Does the recursion work? | |
| Measured directly, the same way we probed value-residual and LRM on the base. The second pass earns its keep: | |
|  | |
| The model leans hard on the second pass, run it at recurse 1 and held-out loss is much worse (ppl 5.72 vs | |
| 4.29). It flips the predicted token on ~23% of positions, and raises the probability of the correct next token | |
| almost everywhere (+0.26 log-prob on average). It sharpens digits (entropy drops 0.14) and, unlike the first | |
| attempt, the **quantitative-language words recovered** (+0.23), the natural-language word problems taught it | |
| to handle "more / less / total / twice", which symbolic arithmetic alone never did. | |
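
The probe quantities named above (perplexity, flip rate, log-prob gain, entropy change) can all be computed from per-position next-token distributions of a 1-pass and a 2-pass run. The following is a minimal sketch under that assumption; `probe` is a hypothetical helper, and the distributions are made-up toy numbers, not model outputs.

```python
import math

def probe(p1, p2, targets):
    """Compare next-token distributions from a 1-pass (p1) and 2-pass (p2) run.

    p1, p2: lists of per-position probability lists; targets: correct token ids.
    """
    n = len(targets)
    argmax = lambda p: max(range(len(p)), key=p.__getitem__)
    entropy = lambda p: -sum(q * math.log(q) for q in p if q > 0)
    nll1 = -sum(math.log(p[t]) for p, t in zip(p1, targets)) / n
    nll2 = -sum(math.log(p[t]) for p, t in zip(p2, targets)) / n
    return {
        "ppl_1pass": math.exp(nll1),   # perplexity = exp(mean negative log-prob)
        "ppl_2pass": math.exp(nll2),
        "flip_rate": sum(argmax(a) != argmax(b) for a, b in zip(p1, p2)) / n,
        "logprob_gain": nll1 - nll2,   # mean log-prob gain on the correct token
        "entropy_change": sum(entropy(b) - entropy(a) for a, b in zip(p1, p2)) / n,
    }

# Toy distributions over a 3-token vocabulary, four positions.
p1 = [[0.5, 0.3, 0.2], [0.45, 0.35, 0.2], [0.2, 0.5, 0.3], [0.6, 0.2, 0.2]]
p2 = [[0.7, 0.2, 0.1], [0.2, 0.7, 0.1], [0.1, 0.8, 0.1], [0.7, 0.2, 0.1]]
targets = [0, 1, 1, 0]
stats = probe(p1, p2, targets)
print(stats["flip_rate"])  # 0.25: the top token changes at one of four positions
```
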

Activation patching maps the arithmetic circuit causally: operands bind early, the computation resolves around
block 5, the answer is written at block 6, and multi-step problems unroll across depth (step 2 binds deeper
than step 1). Factual recall has a different shape: a single late lookup at block 6 with no early work. The
full circuit atlas is in `circuit.html`.

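
For readers new to the technique, here is a toy sketch of activation patching. It is not the analysis behind `circuit.html`: the 4-layer, 3-slot "network" and its layer indices are invented for illustration. The procedure is the usual one: cache activations from a clean run, re-run a corrupted input, overwrite one activation with its clean value, and measure how much of the output gap is restored.

```python
def run(layers, state, patch=None):
    """Run the toy stack; optionally overwrite one slot after one layer.

    patch = (layer_index, slot, value) replaces state[slot] after that layer.
    Returns the final state and a cache of the state after every layer.
    """
    state = list(state)
    cache = []
    for i, layer in enumerate(layers):
        state = layer(state)
        if patch is not None and patch[0] == i:
            state[patch[1]] = patch[2]
        cache.append(list(state))
    return state, cache

# Toy 3-slot residual stream: [operand_a, operand_b, answer].
layers = [
    lambda s: [s[0], s[1], s[2]],         # layer 0: nothing is computed yet
    lambda s: [s[0], s[1], s[2] + s[0]],  # layer 1: operand a is read into the answer slot
    lambda s: [s[0], s[1], s[2] + s[1]],  # layer 2: operand b is added
    lambda s: [s[0], s[1], s[2]],         # layer 3: the answer is carried to the output
]

clean, corrupt = [5, 9, 0], [2, 9, 0]     # the corrupted run changes operand a only
clean_out, clean_cache = run(layers, clean)
corrupt_out, _ = run(layers, corrupt)

def recovery(layer_idx, slot):
    """Fraction of the clean-vs-corrupt answer gap restored by one patch."""
    value = clean_cache[layer_idx][slot]
    patched, _ = run(layers, corrupt, patch=(layer_idx, slot, value))
    return (patched[2] - corrupt_out[2]) / (clean_out[2] - corrupt_out[2])

effect = {(i, slot): recovery(i, slot) for i in range(len(layers)) for slot in (0, 2)}
print(effect[(0, 0)], effect[(1, 0)], effect[(1, 2)])  # 1.0 0.0 1.0
```

In this toy, patching the operand slot only helps before the layer that reads it, and patching the answer slot only helps after that layer has written to it. That is the kind of pattern a statement such as "operands bind early" summarises.
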
## Evaluation

Zero-shot lm-eval, limit 1000, recurse 2, raw.

| Task | Metric | Reasoning | Chat | v9 base | v6 base |
|---|---|---|---|---|---|
| HellaSwag | acc_norm | 31.9 | 30.1 | 30.1 | 31.8 |
| ARC-Easy | acc_norm | 36.7 | 35.3 | 35.4 | 35.6 |
| ARC-Challenge | acc_norm | 21.2 | 23.2 | 22.2 | 22.4 |
| PIQA | acc | 54.4 | 53.8 | 55.5 | 56.0 |
| ArithMark-2 | acc | 26.4 | 25.8 | 28.4 | 26.4 |
| LogicMark | acc | 43.3 | 48.5 | 44.8 | 44.8 |
| SciQ | acc | 67.4 | n/a | 67.8 | 67.5 |
| Winogrande | acc | 50.4 | n/a | 49.4 | 49.8 |
| **Board avg (÷4)** | | **35.41** | 35.04 | 35.70 | 35.80 |

(Numbers are for the final DPO'd model. The pre-DPO fold scored 35.53; DPO held the board at 35.41, a noise-level
change, while fixing the restraint.)

Board 35.41, level with the base (v6 35.80) and above Chat. Recurrent depth doesn't move the board; that's
expected. What changed is behaviour, which the board can't see:

- **Arithmetic is accurate**: 4-5 of 6 on held-out single-step problems (`5+9=14`, `7×6=42`, `40−13=27`),
  one step, stops cleanly. The earlier version mis-computed and over-reasoned.
- **Word problems translate**: "Sara has 12 apples and buys 7 more" → it sets up `12 + 7` and solves it.
- **Sometimes answers directly**: "capital of France → Paris", "opposite of hot → cold", no `<think>`.

**The restraint fix (DPO).** The fold alone left restraint unstable: it opened a `<think>` and did arithmetic
on ~half of non-math prompts (the 8% answer-only data couldn't settle it). A final DPO pass on synthesized,
verifiable preference pairs fixed it: *mode* pairs (non-math → direct answer ≻ spurious `<think>` math) and
*process* pairs (correct concise chain ≻ wrong/over-reasoned). LR 5e-7, β 0.1, 1 epoch, KL-leashed to the
frozen fold checkpoint. Result: **math-on-non-math dropped from ~4/8 to ~1/8**, board unchanged (35.53 → 35.41).
DPO steered the *behaviour* the model already had; it did not fix the residual 2-digit arithmetic slips
(e.g. 25−9), which are a capability limit rather than a preference one. Fixing them needs more and harder
arithmetic data, not preference tuning.

The arithmetic-compute slips on harder problems (multi-digit carry) remain the honest weak point.

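
The card gives the DPO hyperparameters (β 0.1, frozen fold checkpoint as the reference) but not the loss code. Assuming the standard DPO objective, the per-pair loss can be sketched as below; `dpo_loss` and its argument names are hypothetical, and the log-probabilities in the example are made up.

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one preference pair, from summed sequence log-probs.

    pi_*  : log-probs under the policy being trained
    ref_* : log-probs under the frozen reference (here, the fold checkpoint)
    """
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    z = beta * margin
    # -log(sigmoid(z)), written so that exp() never overflows
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))

# Policy identical to the reference: the loss is log(2), about 0.693.
start = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Policy has moved toward the chosen (direct) answer and away from the rejected one.
better = dpo_loss(-8.0, -14.0, -10.0, -12.0)
print(start > better)  # True
```

Because the loss depends only on how far the policy has moved relative to the reference, a small β keeps the trained model close to the fold checkpoint, which is consistent with the "KL-leashed" description.
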
## Usage

```python
question = "What is 5 + 9?"  # example prompt
ctx = f"<|im_start|>user\n{question}<|im_end|>\n<|im_start|>assistant\n"
# Decode greedily with NO repetition penalty (it breaks the <think> format); stop on <|im_end|>.
```

Load at `recurse=2`. The model emits `<think>` reasoning and then the answer for math, and often answers directly
for simple facts. Trade quality for speed by lowering `recurse` at inference.

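
A small helper pair for the prompt format above might look like the following sketch. `build_prompt` and `split_reply` are hypothetical names, and the parser assumes the reasoning is closed with `</think>`, which the card does not state. The sample replies are hand-written stand-ins, not real generations.

```python
IM_END = "<|im_end|>"

def build_prompt(question):
    """ChatML prompt in the format shown in the Usage snippet."""
    return f"<|im_start|>user\n{question}{IM_END}\n<|im_start|>assistant\n"

def split_reply(raw):
    """Cut at <|im_end|> and separate optional <think> reasoning from the answer.

    Assumption: reasoning, when present, is wrapped as <think>...</think>.
    Returns (reasoning or None, answer).
    """
    text = raw.split(IM_END, 1)[0]
    if text.lstrip().startswith("<think>") and "</think>" in text:
        thought, answer = text.split("</think>", 1)
        return thought.replace("<think>", "", 1).strip(), answer.strip()
    return None, text.strip()

math_reply = "<think>12 + 7 = 19</think>19<|im_end|>ignored"
fact_reply = "Paris<|im_end|>"

print(split_reply(math_reply))  # ('12 + 7 = 19', '19')
print(split_reply(fact_reply))  # (None, 'Paris')
```
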
## Limitations

- ~10M params, English only, research/education. Not for production, facts, or advice.
- DPO fixed most of the over-reasoning, but it still opens a `<think>` on roughly 1 in 8 non-math prompts.
- Thin world knowledge. It answers directly now, but can be wrong on the fact itself.
- Arithmetic is reliable on simple problems and slips on harder multi-digit ones.
- No safety alignment.

## License

Weights open. Data under the respective dataset licenses (smol-smoltalk, GSM8K, Cosmopedia, dolmino-mix
ODC-By, AllenAI QA sets, FineMath).