---
license: apache-2.0
language:
- en
library_name: mlx
pipeline_tag: text-generation
tags:
- rodan
- tiny-language-model
- mlx
- reasoning
- chain-of-thought
- dpo
base_model: bfuzzy1/Rodan-Chat
---

# Rodan-10M-Reasoning

A 10.41M-parameter reasoning model trained on a single Apple M2 with MLX. It stacks on the chat model and
adds **recurrent depth**: the same 8 transformer blocks run twice per forward pass, giving the effective
depth of a 16-layer network at **zero extra parameters**. The idea is to spend more compute per token on
hard problems without growing the model.

> What it is, honestly: the recurrence *mechanism* works, the probes show the second pass doing real
> compositional computation, and activation patching maps a genuine arithmetic circuit. The model does
> **accurate single-step arithmetic** and reads **natural-language word problems** into the right operation.
> A final **DPO** pass (verifiable preference pairs, KL-leashed) then fixed its restraint: it now answers
> simple facts directly instead of doing arithmetic on them (math-on-non-math prompts dropped from ~half to
> ~1 in 8), with no meaningful change on the board. It scores **35.41** there, about level with the v6 base
> (35.80), because recurrent depth doesn't move discrimination benchmarks. The win is in *what it does*, not
> the board number.

> Part of the Rodan-10M series. Lineage: base v6 → v9 (PLE-free) → Chat (instruction fold) → **Reasoning
> (this model)**. Warm-started from Chat, so it keeps instruction-following and ChatML.

## Architecture

Same as the base/chat stack (dim 320, 8 layers, 8 heads, MQA with 1 KV head, SwiGLU 768, RMSNorm, RoPE base
200k, QK-norm, tied embeddings, value-residual, LRM, no PLE), with two changes:

- **`recurse=2`**: the 8 blocks run twice over the residual stream (16 effective layers, still 10.41M params).
- **ChatML + `<think>` template** for reasoning turns; direct answers for simple ones.

Trained in **bfloat16** (~8× faster than fp32 on this M2 at this depth/length), seq 512.
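
A minimal sketch of what `recurse=2` means structurally, assuming each block is a function over the residual stream. The `Block` class, `forward`, and `param_count` are illustrative stand-ins, not the model's MLX code; the only point is that looping over the same block list adds depth without adding parameters.

```python
from typing import List

Vector = List[float]


class Block:
    """Toy stand-in for a transformer block: a residual update with one weight."""

    def __init__(self, weight: float) -> None:
        self.weight = weight  # the block's only "parameter" in this toy

    def __call__(self, x: Vector) -> Vector:
        # Residual form x + f(x), with f reduced to a simple scaling.
        return [v + self.weight * v for v in x]


def forward(blocks: List[Block], x: Vector, recurse: int = 2) -> Vector:
    """Run the same stack of blocks `recurse` times over the residual stream."""
    for _ in range(recurse):
        for block in blocks:
            x = block(x)
    return x


def param_count(blocks: List[Block]) -> int:
    """Parameters depend on the number of distinct blocks, not on `recurse`."""
    return len(blocks)


blocks = [Block(0.1) for _ in range(8)]
effective_depth = len(blocks) * 2            # block applications at recurse=2
one_pass = forward(blocks, [1.0], recurse=1)
two_pass = forward(blocks, [1.0], recurse=2)
```

Here `two_pass` goes through 16 block applications while `param_count(blocks)` stays at 8, which mirrors the "16 effective layers, same parameter count" claim.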

## Training recipe

Warm-started from Chat, then trained at `recurse=2` on a natural-language-reasoning mix. The key lesson from
the first attempt: an arithmetic-symbol-heavy fold made the model narrow (it tried to compute *everything*).
This version leads with word problems and adds a slice of direct-answer examples to teach restraint.

| share | source | mode |
|---|---|---|
| 24% | natural-language word problems (synthesized) | `<think>` → answer |
| 21% | symbolic arithmetic CoT | `<think>` → answer |
| 8% | answer-only facts | direct, no `<think>` |
| 2% | GSM8K | `<think>` → answer |
| 45% | replay (smol-smoltalk + curated: Cosmopedia / dolmino / FineMath / sci-QA) | mixed |

No web data anywhere: the curated-only lineage has held since v6. Optimizer: Muon + AdamW, LR 1.8e-3 / Muon 9e-3,
seq 512, 7000 steps, bf16.
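
One way to realise a mix like this is to draw each training example's source with probability proportional to its share. The sketch below assumes that scheme; the card does not say how the mix was actually sampled, and the source keys and `sample_sources` are hypothetical names.

```python
import random
from collections import Counter

# Shares from the data-mix table, in percent. Keys are illustrative labels.
MIX = {
    "word_problems": 24,
    "symbolic_cot": 21,
    "answer_only": 8,
    "gsm8k": 2,
    "replay": 45,
}


def sample_sources(n: int, seed: int = 0) -> list:
    """Draw `n` source labels with probability proportional to each share."""
    rng = random.Random(seed)
    names = list(MIX)
    weights = [MIX[name] for name in names]
    return rng.choices(names, weights=weights, k=n)


# Empirical counts over a large draw should track the table's shares.
counts = Counter(sample_sources(10_000))
```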

![Reasoning loss & data mix](loss_datamix.png)

## Does the recursion work?

Measured directly, the same way we probed value-residual and LRM on the base. The second pass earns its keep:

![Recursion probes](reasoning_probes.png)

The model leans hard on the second pass: run it at `recurse=1` and held-out loss is much worse (ppl 5.72 vs
4.29 at `recurse=2`). The second pass flips the predicted token on ~23% of positions and raises the
probability of the correct next token almost everywhere (+0.26 log-prob on average). It sharpens digits
(entropy drops 0.14) and, unlike the first attempt, the **quantitative-language words recovered** (+0.23):
the natural-language word problems taught it to handle "more / less / total / twice", which symbolic
arithmetic alone never did.
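
The three probe quantities above (flip rate, log-prob gain on the correct token, entropy drop) can be computed from next-token distributions taken at `recurse=1` and `recurse=2`. This is a sketch under that assumption; `compare_passes` is a hypothetical helper and the numbers fed to it are toy data, not the model's outputs.

```python
import math
from typing import Dict, List, Sequence

Dist = Sequence[float]  # probabilities over the vocabulary at one position


def argmax(p: Dist) -> int:
    return max(range(len(p)), key=lambda i: p[i])


def entropy(p: Dist) -> float:
    return -sum(q * math.log(q) for q in p if q > 0.0)


def compare_passes(pass1: List[Dist], pass2: List[Dist], targets: List[int]) -> Dict[str, float]:
    """Compare per-position distributions from one pass vs two passes."""
    n = len(targets)
    # Positions where the top prediction changes between the two settings.
    flips = sum(argmax(a) != argmax(b) for a, b in zip(pass1, pass2))
    # Change in log-probability assigned to the correct next token.
    gain = sum(math.log(b[t]) - math.log(a[t]) for a, b, t in zip(pass1, pass2, targets))
    # Positive when the second pass makes the distribution sharper.
    drop = sum(entropy(a) - entropy(b) for a, b in zip(pass1, pass2))
    return {
        "flip_rate": flips / n,
        "mean_logprob_gain": gain / n,
        "mean_entropy_drop": drop / n,
    }


# Toy data: two positions, three-token vocabulary.
pass1 = [[0.5, 0.3, 0.2], [0.4, 0.4, 0.2]]
pass2 = [[0.2, 0.7, 0.1], [0.7, 0.2, 0.1]]
stats = compare_passes(pass1, pass2, targets=[1, 0])
```

Restricting the positions to digit tokens or to quantitative words would give per-category versions of the same metrics.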

Activation patching maps the arithmetic circuit causally: operands bind early, the computation resolves around
block 5, the answer is written at block 6, and multi-step problems unroll across depth (step 2 binds deeper
than step 1). Factual recall has a different shape: a single late lookup at block 6 with no early work. The
full circuit atlas is in `circuit.html`.
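
For readers new to the technique, here is a toy illustration of activation patching: run a clean and a corrupted input, copy one clean activation into the corrupted run at a chosen (layer, position), and measure how much of the clean output comes back. The four-layer, two-slot "model" below is invented for the example and has nothing to do with the real network; `run` and `patching_map` are hypothetical names.

```python
from typing import Callable, List, Optional, Tuple

State = List[float]  # [operand_slot, answer_slot]: a two-position toy residual stream

# Toy layers, purely illustrative: layer 1 moves the operand, layer 2 computes with it.
LAYERS: List[Callable[[State], State]] = [
    lambda s: [s[0], s[1]],          # layer 0: no-op
    lambda s: [s[0], s[1] + s[0]],   # layer 1: bind operand into the answer slot
    lambda s: [s[0], s[1] * 2.0],    # layer 2: compute
    lambda s: [s[0], s[1]],          # layer 3: no-op
]


def run(x: State, patch: Optional[Tuple[int, int, float]] = None) -> Tuple[float, List[State]]:
    """Run the toy model; optionally overwrite one slot after one layer.

    `patch` is (layer, slot, value). Returns the output and per-layer activations.
    """
    state, cache = list(x), []
    for i, layer in enumerate(LAYERS):
        state = layer(state)
        if patch is not None and patch[0] == i:
            state[patch[1]] = patch[2]
        cache.append(list(state))
    return state[1], cache


def patching_map(clean: State, corrupt: State) -> List[List[float]]:
    """Fraction of the clean output recovered by patching each (layer, slot)."""
    clean_out, clean_cache = run(clean)
    corrupt_out, _ = run(corrupt)
    grid = []
    for layer in range(len(LAYERS)):
        row = []
        for slot in range(2):
            out, _ = run(corrupt, patch=(layer, slot, clean_cache[layer][slot]))
            row.append((out - corrupt_out) / (clean_out - corrupt_out))
        grid.append(row)
    return grid


recovery = patching_map(clean=[3.0, 0.0], corrupt=[5.0, 0.0])
```

In this toy, patching the operand slot restores the clean answer only before layer 1, and patching the answer slot restores it only from layer 1 onward. That handoff is the signature a patching map uses to say where an operand "binds".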

## Evaluation

Zero-shot lm-eval, limit 1000, recurse 2, raw.

| Task | Metric | Reasoning | Chat | v9 base | v6 base |
|---|---|---|---|---|---|
| HellaSwag | acc_norm | 31.9 | 30.1 | 30.1 | 31.8 |
| ARC-Easy | acc_norm | 36.7 | 35.3 | 35.4 | 35.6 |
| ARC-Challenge | acc_norm | 21.2 | 23.2 | 22.2 | 22.4 |
| PIQA | acc | 54.4 | 53.8 | 55.5 | 56.0 |
| ArithMark-2 | acc | 26.4 | 25.8 | 28.4 | 26.4 |
| LogicMark | acc | 43.3 | 48.5 | 44.8 | 44.8 |
| SciQ | acc | 67.4 | n/a | 67.8 | 67.5 |
| Winogrande | acc | 50.4 | n/a | 49.4 | 49.8 |
| **Board avg (÷4)** | | **35.41** | 35.04 | 35.70 | 35.80 |

(Numbers are the final DPO'd model. The pre-DPO fold scored 35.53; DPO held the board at 35.41, a noise-level
change, while fixing the restraint.)

Board 35.41, level with the base (v6 35.80) and above Chat. Recurrent depth doesn't move the board; that's
expected. What changed is behaviour, which the board can't see:

- **Arithmetic is accurate**: 4-5 of 6 on held-out single-step problems (`5+9=14`, `7×6=42`, `40−13=27`),
  one step, then it stops cleanly. The earlier version mis-computed and over-reasoned.
- **Word problems translate**: "Sara has 12 apples and buys 7 more" → it sets up `12 + 7` and solves it.
- **Sometimes answers directly**: "capital of France → Paris", "opposite of hot → cold", no `<think>`.

**The restraint fix (DPO).** The fold alone left restraint unstable: it opened a `<think>` and did arithmetic
on ~half of non-math prompts (the 8% answer-only data couldn't settle it). A final DPO pass on synthesized,
verifiable preference pairs fixed it: *mode* pairs (non-math → direct answer ≻ spurious `<think>` math) and
*process* pairs (correct concise chain ≻ wrong/over-reasoned). LR 5e-7, β 0.1, 1 epoch, KL-leashed to the
frozen fold checkpoint. Result: **math-on-non-math dropped from ~4/8 to ~1/8**, board unchanged (35.53 → 35.41).
DPO steered behaviour the model already had; it did not fix the residual 2-digit arithmetic slips (e.g. 25−9).
Those are a capability limit, not a preference one, and need more and harder arithmetic data rather than
preference tuning.
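
For reference, a sketch of the per-pair loss, assuming the standard DPO objective with the frozen fold checkpoint as the reference model (the card does not state which DPO variant or implementation was used). Inputs are summed sequence log-probabilities; `dpo_loss` is a hypothetical helper.

```python
import math


def dpo_loss(
    policy_chosen: float,
    policy_rejected: float,
    ref_chosen: float,
    ref_rejected: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for one preference pair, given summed sequence log-probs."""
    # Implicit reward margin: how much more the policy prefers the chosen
    # response over the rejected one than the frozen reference does.
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    z = beta * margin
    # -log(sigmoid(z)), written in a numerically stable form.
    return math.log1p(math.exp(-z)) if z >= 0 else -z + math.log1p(math.exp(z))


# Before any update the policy equals the reference, so the margin is zero.
start = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Raising the chosen response's log-prob relative to the reference lowers the loss.
improved = dpo_loss(-8.0, -12.0, -10.0, -12.0)
```

For a *mode* pair, the chosen sequence would be the direct answer and the rejected one the spurious `<think>` arithmetic; because the loss only rewards movement relative to the reference, a small β keeps the policy close to the fold checkpoint.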

![DPO effect, restraint fixed, board held](dpo_effect.png)

The arithmetic-compute slips on harder problems (multi-digit carry) remain the honest weak point.

## Usage

```python
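question = "What is 5 + 9?"  # example prompt
# ChatML turn; generation continues from the open assistant tag.
ctx = f"<|im_start|>user\n{question}<|im_end|>\n<|im_start|>assistant\n"
# Decode greedily with NO repetition penalty (it breaks the <think> format); stop on <|im_end|>.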
```

Load at `recurse=2`. It emits `<think>` reasoning then the answer for math, and often answers directly for
simple facts. Trade quality for speed by lowering `recurse` at inference.
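
The decoding settings described in this section can be sketched as a plain greedy loop. The `step` callable (token ids in, next-token logits out) is a hypothetical interface standing in for the real model call, and `fake_step` is a stub used only to make the example runnable; neither is part of this repository.

```python
from typing import Callable, List, Sequence


def build_prompt(question: str) -> str:
    """Wrap a user question in the ChatML turn format used by this model."""
    return f"<|im_start|>user\n{question}<|im_end|>\n<|im_start|>assistant\n"


def greedy_decode(
    step: Callable[[List[int]], Sequence[float]],  # hypothetical: ids -> next-token logits
    prompt_ids: List[int],
    eos_id: int,
    max_new_tokens: int = 128,
) -> List[int]:
    """Greedy decoding with no repetition penalty; stops at `eos_id`."""
    ids, out = list(prompt_ids), []
    for _ in range(max_new_tokens):
        logits = step(ids)
        nxt = max(range(len(logits)), key=lambda i: logits[i])  # argmax, no sampling
        if nxt == eos_id:  # in practice, the id of <|im_end|>
            break
        ids.append(nxt)
        out.append(nxt)
    return out


def fake_step(ids: List[int]) -> List[float]:
    """Stub model: emits token 1 until the context has 5 ids, then EOS (id 3)."""
    logits = [0.0, 0.0, 0.0, 0.0]
    logits[1 if len(ids) < 5 else 3] = 1.0
    return logits


generated = greedy_decode(fake_step, prompt_ids=[0, 0, 0], eos_id=3)
```

With the real model, `step` would run the forward pass at the chosen `recurse` value, so lowering `recurse` changes only that call and leaves the loop untouched.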

## Limitations

- ~10M params, English only, research/education. Not for production, facts, or advice.
- DPO fixed most of the over-reasoning, but it still opens a `<think>` on roughly 1 in 8 non-math prompts.
- Thin world knowledge. It answers directly now, but can be wrong on the fact itself.
- Arithmetic is reliable on simple problems and slips on harder multi-digit ones.
- No safety alignment.

## License

Weights open. Data under the respective dataset licenses (smol-smoltalk, GSM8K, Cosmopedia, dolmino-mix
ODC-By, AllenAI QA sets, FineMath).