---
license: apache-2.0
language:
- en
library_name: mlx
pipeline_tag: text-generation
tags:
- rodan
- tiny-language-model
- mlx
- reasoning
- chain-of-thought
- dpo
base_model: bfuzzy1/Rodan-Chat
---

# Rodan-10M-Reasoning

A 10.41M-parameter reasoning model trained on a single Apple M2 with MLX. It stacks on the chat model and adds **recurrent depth**: the same 8 transformer blocks run twice per forward pass, giving the effective depth of a 16-layer network at **zero extra parameters**. The idea is to spend more compute per token on hard problems without growing the model.

> What it is, honestly. The recurrence *mechanism* works, the probes show the second pass doing real compositional computation, and the activation-patching maps a genuine arithmetic circuit. The model does **accurate single-step arithmetic** and reads **natural-language word problems** into the right operation. A final **DPO** pass (verifiable preference pairs, KL-leashed) then fixed its restraint: it now answers simple facts directly instead of doing arithmetic on them (math-on-non-math prompts dropped from ~half to ~1 in 8), at no board cost. On the board it sits at **35.41**, about level with the base (35.80), because recurrent depth doesn't move discrimination benchmarks. The win is in *what it does*, not the board number.
>
> Part of the Rodan-10M series. Lineage: base v6 → v9 (PLE-free) → Chat (instruction fold) → **Reasoning (this model)**. Warm-started from Chat, so it keeps instruction-following and ChatML.

## Architecture

Same as the base/chat stack (dim 320, 8 layers, 8 heads, MQA with 1 KV head, SwiGLU 768, RMSNorm, RoPE base 200k, QK-norm, tied embeddings, value-residual, LRM, no PLE), with two changes:

- **`recurse=2`**: the 8 blocks run twice over the residual stream (16 effective layers, still 10.41M params).
- **ChatML + `` template** for reasoning turns; direct answers for simple ones.

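
The `recurse=2` change can be illustrated with a toy stack. This is a minimal sketch, assuming the blocks are simply reapplied to their own output; `ToyBlock` and `forward` are hypothetical names, and the `x + tanh(Wx)` update is only a stand-in for a real transformer block. The actual MLX implementation may differ, for example in how norms or value-residual state are handled between passes.

```python
import math
import random
from typing import List, Tuple

Vector = List[float]


class ToyBlock:
    """Hypothetical stand-in for one transformer block: x -> x + tanh(W x)."""

    def __init__(self, dim: int, rng: random.Random) -> None:
        self.w = [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(dim)]

    @property
    def n_params(self) -> int:
        return len(self.w) * len(self.w[0])

    def __call__(self, x: Vector) -> Vector:
        update = [math.tanh(sum(wij * xj for wij, xj in zip(row, x))) for row in self.w]
        return [xi + ui for xi, ui in zip(x, update)]  # residual connection


def forward(blocks: List[ToyBlock], x: Vector, recurse: int = 2) -> Tuple[Vector, int]:
    """Run the whole stack `recurse` times over the same residual stream.

    Returns the final stream and the number of block applications
    (the effective depth).
    """
    applications = 0
    for _ in range(recurse):
        for block in blocks:  # same weights on every pass
            x = block(x)
            applications += 1
    return x, applications


rng = random.Random(0)
blocks = [ToyBlock(dim=4, rng=rng) for _ in range(8)]  # 8 blocks, as in the card
x0 = [1.0, -0.5, 0.25, 0.0]

y1, depth1 = forward(blocks, x0, recurse=1)
y2, depth2 = forward(blocks, x0, recurse=2)
n_params = sum(b.n_params for b in blocks)

print(depth1, depth2, n_params)  # effective depth doubles, parameter count does not
```

Because the second pass reuses the same weights, `recurse` is a pure compute knob: it can be changed at inference without touching the checkpoint.
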
Trained in **bfloat16** (~8× faster than fp32 on this M2 at this depth/length), seq 512.

## Training recipe

Warm-started from Chat, then trained at `recurse=2` on a natural-language-reasoning mix. The key lesson from the first attempt: an arithmetic-symbol-heavy fold made the model narrow (it tried to compute *everything*). This version leads with word problems and adds a slice of direct-answer examples to teach restraint.

| share | source | mode |
|---|---|---|
| 24% | natural-language word problems (synthesized) | `` → answer |
| 21% | symbolic arithmetic CoT | `` → answer |
| 8% | answer-only facts | direct, no `` |
| 2% | GSM8K | `` → answer |
| 45% | replay (smol-smoltalk + curated: Cosmopedia / dolmino / FineMath / sci-QA) | mixed |

No web data anywhere; the curated-only lineage has held since v6.

Optimizer: Muon + AdamW, LR 1.8e-3 / Muon 9e-3, seq 512, 7000 steps, bf16.

![Reasoning loss & data mix](loss_datamix.png)

## Does the recursion work?

Measured directly, the same way we probed value-residual and LRM on the base. The second pass earns its keep:

![Recursion probes](reasoning_probes.png)

The model leans hard on the second pass: run it at `recurse=1` and held-out loss is much worse (ppl 5.72 vs 4.29 at `recurse=2`). The second pass flips the predicted token on ~23% of positions and raises the probability of the correct next token almost everywhere (+0.26 log-prob on average). It sharpens digits (entropy drops 0.14) and, unlike the first attempt, the **quantitative-language words recovered** (+0.23): the natural-language word problems taught it to handle "more / less / total / twice", which symbolic arithmetic alone never did.

Activation patching maps the arithmetic circuit causally: operands bind early, the computation resolves around block 5, the answer is written at block 6, and multi-step problems unroll across depth (step 2 binds deeper than step 1). Factual recall has a different shape: a single late lookup at block 6 with no early work. The full circuit atlas is in `circuit.html`.

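
The probe quantities quoted in this section (flip rate, log-prob gain on the correct token, entropy change, perplexity) can all be computed from per-position logits at the two depths. A minimal sketch, assuming you already have logits from a `recurse=1` and a `recurse=2` run over the same tokens; `compare_passes` and the toy logits are hypothetical and are not the author's probe code.

```python
import math
from typing import Dict, List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    m = max(logits)  # subtract the max for numerical stability
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]


def entropy(logp: Sequence[float]) -> float:
    return -sum(math.exp(lp) * lp for lp in logp)


def argmax(values: Sequence[float]) -> int:
    return max(range(len(values)), key=values.__getitem__)


def compare_passes(
    logits_r1: Sequence[Sequence[float]],
    logits_r2: Sequence[Sequence[float]],
    targets: Sequence[int],
) -> Dict[str, float]:
    """Compare next-token predictions at recurse=1 vs recurse=2, position by position."""
    n = len(targets)
    flips = 0
    gain = dent = nll1 = nll2 = 0.0
    for l1, l2, t in zip(logits_r1, logits_r2, targets):
        lp1, lp2 = log_softmax(l1), log_softmax(l2)
        flips += int(argmax(l1) != argmax(l2))  # did the top token change?
        gain += lp2[t] - lp1[t]                 # log-prob gain on the correct token
        dent += entropy(lp2) - entropy(lp1)     # negative means sharper
        nll1 -= lp1[t]
        nll2 -= lp2[t]
    return {
        "flip_rate": flips / n,
        "logprob_gain": gain / n,
        "entropy_change": dent / n,
        "ppl_r1": math.exp(nll1 / n),
        "ppl_r2": math.exp(nll2 / n),
    }


# Toy logits: 2 positions, vocabulary of 3, correct token is id 0 at both positions.
toy_r1 = [[2.0, 1.0, 0.0], [0.0, 1.0, 0.5]]
toy_r2 = [[3.0, 1.0, 0.0], [2.0, 1.0, 0.5]]
stats = compare_passes(toy_r1, toy_r2, targets=[0, 0])
print(stats["flip_rate"])  # 0.5: the top token changes at the second position only
```

On real data the same function would be called once over all held-out positions; restricting `targets` to digit tokens or quantitative words would give the per-category numbers.
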
## Evaluation

Zero-shot lm-eval, limit 1000, recurse 2, raw.

| Task | Metric | Reasoning | Chat | v9 base | v6 base |
|---|---|---|---|---|---|
| HellaSwag | acc_norm | 31.9 | 30.1 | 30.1 | 31.8 |
| ARC-Easy | acc_norm | 36.7 | 35.3 | 35.4 | 35.6 |
| ARC-Challenge | acc_norm | 21.2 | 23.2 | 22.2 | 22.4 |
| PIQA | acc | 54.4 | 53.8 | 55.5 | 56.0 |
| ArithMark-2 | acc | 26.4 | 25.8 | 28.4 | 26.4 |
| LogicMark | acc | 43.3 | 48.5 | 44.8 | 44.8 |
| SciQ | acc | 67.4 | n/a | 67.8 | 67.5 |
| Winogrande | acc | 50.4 | n/a | 49.4 | 49.8 |
| **Board avg (÷4)** | | **35.41** | 35.04 | 35.70 | 35.80 |

(Numbers are for the final DPO'd model. The pre-DPO fold scored 35.53; DPO held the board at 35.41, a noise-level change, while fixing the restraint.)

Board 35.41, level with the base (v6 35.80) and above Chat. Recurrent depth doesn't move the board; that's expected. What changed is behaviour, which the board can't see:

- **Arithmetic is accurate**: 4-5 of 6 on held-out single-step problems (`5+9=14`, `7×6=42`, `40−13=27`). It takes one step and stops cleanly. The earlier version mis-computed and over-reasoned.
- **Word problems translate**: "Sara has 12 apples and buys 7 more" → it sets up `12 + 7` and solves it.
- **Sometimes answers directly**: "capital of France → Paris", "opposite of hot → cold", no ``.

**The restraint fix (DPO).** The fold alone left restraint unstable: it opened a `` and did arithmetic on ~half of non-math prompts (the 8% answer-only data couldn't settle it). A final DPO pass on synthesized, verifiable preference pairs fixed it: *mode* pairs (non-math → direct answer ≻ spurious `` math) and *process* pairs (correct concise chain ≻ wrong/over-reasoned). LR 5e-7, β 0.1, 1 epoch, KL-leashed to the frozen fold checkpoint. Result: **math-on-non-math dropped from ~4/8 to ~1/8**, board unchanged (35.53 → 35.41). DPO steered behaviour the model already had; it did not fix the residual 2-digit arithmetic slips (e.g. 25−9). Those are a capability limit, not a preference one, and need more/harder arithmetic data rather than preference tuning.

![DPO effect, restraint fixed, board held](dpo_effect.png)

The arithmetic-compute slips on harder problems (multi-digit carry) remain the honest weak point.

## Usage

```python
question = "What is 5 + 9?"  # example prompt
# ChatML prompt: one user turn, then an open assistant turn for the model to complete
ctx = f"<|im_start|>user\n{question}<|im_end|>\n<|im_start|>assistant\n"
# decode greedily with NO repetition penalty (it breaks the format); stop on <|im_end|>
```

Load at `recurse=2`. It emits `` reasoning then the answer for math, and often answers directly for simple facts. Trade quality for speed by lowering `recurse` at inference.

## Limitations

- ~10M params, English only, research/education. Not for production, facts, or advice.
- DPO fixed most of the over-reasoning, but it still opens a `` on roughly 1 in 8 non-math prompts.
- Thin world knowledge. It answers directly now, but can be wrong on the fact itself.
- Arithmetic is reliable on simple problems and slips on harder multi-digit ones.
- No safety alignment.

## License

Weights open. Data under the respective dataset licenses (smol-smoltalk, GSM8K, Cosmopedia, dolmino-mix ODC-By, AllenAI QA sets, FineMath).