---
license: apache-2.0
language:
- en
library_name: mlx
pipeline_tag: text-generation
tags:
- rodan
- tiny-language-model
- mlx
- reasoning
- chain-of-thought
- dpo
base_model: bfuzzy1/Rodan-Chat
---
# Rodan-10M-Reasoning
A 10.41M-parameter reasoning model trained on a single Apple M2 with MLX. It stacks on the chat model and
adds **recurrent depth**: the same 8 transformer blocks run twice per forward pass, giving the effective
depth of a 16-layer network at **zero extra parameters**. The idea is to spend more compute per token on
hard problems without growing the model.
> What it is, honestly. The recurrence *mechanism* works, the probes show the second pass doing real
> compositional computation, and the activation-patching maps a genuine arithmetic circuit. The model does
> **accurate single-step arithmetic** and reads **natural-language word problems** into the right operation.
> A final **DPO** pass (verifiable preference pairs, KL-leashed) then fixed its restraint: it now answers
> simple facts directly instead of doing arithmetic on them (math-on-non-math prompts dropped from ~half to
> ~1 in 8), at no board cost. On the board it sits at **35.41**, about level with the base (35.80), because
> recurrent depth doesn't move discrimination benchmarks. The win is in *what it does*, not the board number.
> Part of the Rodan-10M series. Lineage: base v6 → v9 (PLE-free) → Chat (instruction fold) → **Reasoning
> (this model)**. Warm-started from Chat, so it keeps instruction-following and ChatML.
## Architecture
Same as the base/chat stack (dim 320, 8 layers, 8 heads, MQA with 1 KV head, SwiGLU 768, RMSNorm, RoPE base
200k, QK-norm, tied embeddings, value-residual, LRM, no PLE), with two changes:
- **`recurse=2`**: the 8 blocks run twice over the residual stream (16 effective layers, still 10.41M params).
- **ChatML + `<think>` template** for reasoning turns; direct answers for simple ones.
Trained in **bfloat16** (~8× faster than fp32 on this M2 at this depth/length), seq 512.
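
The `recurse=2` change can be pictured with a toy sketch. This is not the model's MLX code: `make_block` and `forward` are hypothetical names, and each "block" is a one-weight residual update standing in for a real transformer block. It only illustrates that looping over the same block list twice doubles the number of block applications without adding weights.

```python
from typing import Callable, List

Vector = List[float]
Block = Callable[[Vector], Vector]


def make_block(scale: float) -> Block:
    """Toy stand-in for a transformer block: a residual update with one weight."""
    def block(h: Vector) -> Vector:
        # Residual form h + f(h); f here is a hypothetical placeholder.
        return [x + scale * x for x in h]
    return block


def forward(blocks: List[Block], h: Vector, recurse: int = 2) -> Vector:
    """Run the same stack of blocks `recurse` times over the residual stream."""
    for _ in range(recurse):
        for block in blocks:
            h = block(h)
    return h


blocks = [make_block(0.01) for _ in range(8)]  # 8 blocks, reused on every pass
effective_depth = len(blocks) * 2              # 16 block applications at recurse=2
one_pass = forward(blocks, [1.0], recurse=1)
two_pass = forward(blocks, [1.0], recurse=2)
```

The block list has length 8 in both calls; only the loop count changes, which is why the parameter count stays put.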
## Training recipe
Warm-started from Chat, then trained at `recurse=2` on a natural-language-reasoning mix. The key lesson from
the first attempt: an arithmetic-symbol-heavy fold made the model narrow (it tried to compute *everything*).
This version leads with word problems and adds a slice of direct-answer examples to teach restraint.
| share | source | mode |
|---|---|---|
| 24% | natural-language word problems (synthesized) | `<think>` → answer |
| 21% | symbolic arithmetic CoT | `<think>` → answer |
| 8% | answer-only facts | direct, no `<think>` |
| 2% | GSM8K | `<think>` → answer |
| 45% | replay (smol-smoltalk + curated: Cosmopedia / dolmino / FineMath / sci-QA) | mixed |
No web data anywhere; the curated-only lineage has held since v6. Optimizer: Muon + AdamW, LR 1.8e-3 / Muon 9e-3,
seq 512, 7000 steps, bf16.
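
As a rough illustration of the mix in the table, the sketch below draws example sources in proportion to the listed shares. The short source keys and the `sample_sources` helper are hypothetical; the actual data loader is not described here and may mix sources differently (for example per batch rather than per example).

```python
import random
from collections import Counter

# Shares copied from the table above; the short keys are hypothetical labels.
MIX = {
    "word_problems": 0.24,
    "symbolic_cot": 0.21,
    "answer_only": 0.08,
    "gsm8k": 0.02,
    "replay": 0.45,
}


def sample_sources(n: int, seed: int = 0) -> list:
    """Draw n source labels with probability proportional to their share."""
    rng = random.Random(seed)
    names = list(MIX)
    weights = [MIX[name] for name in names]
    return rng.choices(names, weights=weights, k=n)


counts = Counter(sample_sources(10_000))
```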
![Reasoning loss & data mix](loss_datamix.png)
## Does the recursion work?
Measured directly, the same way we probed value-residual and LRM on the base. The second pass earns its keep:
![Recursion probes](reasoning_probes.png)
The model leans hard on the second pass: run it at `recurse=1` and held-out loss is much worse (ppl 5.72 vs
4.29). The second pass flips the predicted token on ~23% of positions and raises the probability of the correct
next token almost everywhere (+0.26 log-prob on average). It sharpens digits (entropy drops 0.14) and, unlike
the first attempt, the **quantitative-language words recovered** (+0.23): the natural-language word problems
taught it to handle "more / less / total / twice", which symbolic arithmetic alone never did.
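
The probe quantities above (flip rate, log-prob gain on the correct token, perplexity per pass) can be computed from per-position next-token distributions. The sketch below is a minimal version under stated assumptions: `probe` is a hypothetical helper, the distributions are toy two-token examples, and how the original probes extracted a distribution after the first pass is not specified here.

```python
import math
from typing import Dict, Sequence


def argmax(p: Sequence[float]) -> int:
    return max(range(len(p)), key=p.__getitem__)


def probe(pass1: Sequence[Sequence[float]],
          pass2: Sequence[Sequence[float]],
          targets: Sequence[int]) -> Dict[str, float]:
    """Compare next-token distributions read out after pass 1 and after pass 2."""
    n = len(targets)
    flips = sum(argmax(a) != argmax(b) for a, b in zip(pass1, pass2))
    gain = sum(math.log(b[t]) - math.log(a[t])
               for a, b, t in zip(pass1, pass2, targets))
    nll1 = -sum(math.log(a[t]) for a, t in zip(pass1, targets)) / n
    nll2 = -sum(math.log(b[t]) for b, t in zip(pass2, targets)) / n
    return {
        "flip_rate": flips / n,          # share of positions whose top token changes
        "mean_logprob_gain": gain / n,   # average log-prob change on the correct token
        "ppl_pass1": math.exp(nll1),     # perplexity = exp(mean negative log-likelihood)
        "ppl_pass2": math.exp(nll2),
    }


# Toy data: two positions, vocabulary of two tokens.
stats = probe(
    pass1=[[0.7, 0.3], [0.6, 0.4]],
    pass2=[[0.8, 0.2], [0.3, 0.7]],
    targets=[0, 1],
)
```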
Activation patching maps the arithmetic circuit causally: operands bind early, the computation resolves around
block 5, the answer is written at block 6, and multi-step problems unroll across depth (step 2 binds deeper
than step 1). Factual recall has a different shape: a single late lookup at block 6 with no early work. The
full circuit atlas is in `circuit.html`.
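
For readers new to activation patching, here is a toy sketch of the idea, not the analysis behind `circuit.html`. A three-layer stand-in "model" adds two operands; we run it on a corrupted input, overwrite one activation with its value from the clean run, and measure how much of the clean answer comes back. All names (`LAYERS`, `run`, `recovery`) and the slot layout are hypothetical.

```python
from typing import List, Optional, Tuple

State = List[float]  # slots: [operand_x, operand_y, answer]


def copy_layer(h: State) -> State:
    return list(h)


def add_layer(h: State) -> State:
    out = list(h)
    out[2] = h[0] + h[1]  # the "computation" happens only in this layer
    return out


LAYERS = [copy_layer, add_layer, copy_layer]


def run(h: State,
        patch: Optional[Tuple[int, int, float]] = None,
        cache: Optional[List[State]] = None) -> State:
    """Run the stack; patch=(layer, slot, value) overwrites one activation."""
    h = list(h)
    for i, layer in enumerate(LAYERS):
        h = layer(h)
        if patch is not None and patch[0] == i:
            h[patch[1]] = patch[2]
        if cache is not None:
            cache.append(list(h))
    return h


clean, corrupt = [2.0, 3.0, 0.0], [7.0, 3.0, 0.0]
clean_cache: List[State] = []
clean_out = run(clean, cache=clean_cache)[2]
corrupt_out = run(corrupt)[2]


def recovery(layer: int, slot: int) -> float:
    """Fraction of the clean answer restored by patching one activation."""
    patched = run(corrupt, patch=(layer, slot, clean_cache[layer][slot]))[2]
    return (patched - corrupt_out) / (clean_out - corrupt_out)


effects = {(l, s): recovery(l, s) for l in range(len(LAYERS)) for s in range(3)}
```

In this toy, patching the operand slot restores the answer only before the add layer, and patching the answer slot restores it only from the add layer onward. That is the same kind of "where does it bind, where is it written" reading the section describes.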
## Evaluation
Zero-shot lm-eval, limit 1000, recurse 2, raw.
| Task | Metric | Reasoning | Chat | v9 base | v6 base |
|---|---|---|---|---|---|
| HellaSwag | acc_norm | 31.9 | 30.1 | 30.1 | 31.8 |
| ARC-Easy | acc_norm | 36.7 | 35.3 | 35.4 | 35.6 |
| ARC-Challenge | acc_norm | 21.2 | 23.2 | 22.2 | 22.4 |
| PIQA | acc | 54.4 | 53.8 | 55.5 | 56.0 |
| ArithMark-2 | acc | 26.4 | 25.8 | 28.4 | 26.4 |
| LogicMark | acc | 43.3 | 48.5 | 44.8 | 44.8 |
| SciQ | acc | 67.4 | n/a | 67.8 | 67.5 |
| Winogrande | acc | 50.4 | n/a | 49.4 | 49.8 |
| **Board avg (÷4)** | | **35.41** | 35.04 | 35.70 | 35.80 |
(Numbers are the final DPO'd model. The pre-DPO fold scored 35.53; DPO held the board at 35.41, a noise-level
change, while fixing the restraint.)
Board 35.41, level with the base (v6 35.80) and above Chat. Recurrent depth doesn't move the board; that's
expected. What changed is behaviour, which the board can't see:
- **Arithmetic is accurate**: 4-5 of 6 on held-out single-step problems (`5+9=14`, `7×6=42`, `40−13=27`),
one step, stops cleanly. The earlier version mis-computed and over-reasoned.
- **Word problems translate**: "Sara has 12 apples and buys 7 more" → it sets up `12 + 7` and solves it.
- **Sometimes answers directly**: "capital of France → Paris", "opposite of hot → cold", no `<think>`.
**The restraint fix (DPO).** The fold alone left restraint unstable: it opened a `<think>` and did arithmetic
on ~half of non-math prompts (the 8% answer-only data couldn't settle it). A final DPO pass on synthesized,
verifiable preference pairs fixed it: *mode* pairs (non-math → direct answer ≻ spurious `<think>` math) and
*process* pairs (correct concise chain ≻ wrong/over-reasoned). LR 5e-7, β 0.1, 1 epoch, KL-leashed to the
frozen fold checkpoint. Result: **math-on-non-math dropped from ~4/8 to ~1/8**, board unchanged (35.53 → 35.41).
DPO steered the *behaviour* it had; it did not fix the residual 2-digit arithmetic slips (e.g. 25−9), which are
a capability limit, not a preference one. Fixing those needs more/harder arithmetic data, not preference tuning.
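
For reference, the standard DPO objective for a single preference pair can be sketched as below, with β = 0.1 as quoted above. This is the textbook loss, not the training script used here: `dpo_loss` is a hypothetical helper, the log-probabilities are made-up numbers, and the example *mode* pair is invented for illustration. The reference log-probs play the role of the frozen fold checkpoint; whether an additional explicit KL term was used is not stated.

```python
import math


def dpo_loss(pi_chosen: float, pi_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one pair, from summed sequence log-probs.

    loss = -log sigmoid(beta * [(pi_c - ref_c) - (pi_r - ref_r)])
    """
    z = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # -log(sigmoid(z)) == softplus(-z), written in a numerically stable form.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# Hypothetical "mode" pair: direct answer preferred over spurious <think> math.
mode_pair = {
    "prompt": "What is the capital of France?",
    "chosen": "Paris",
    "rejected": "<think>1 + 1 = 2</think> 2",
}

at_init = dpo_loss(-5.0, -7.0, -5.0, -7.0)    # policy == reference
after_shift = dpo_loss(-4.0, -8.0, -5.0, -7.0)  # policy now favours the chosen reply
```

When the policy equals the reference, the loss is log 2. It falls as the policy raises the chosen reply's log-prob relative to the rejected one, measured against the frozen reference.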
![DPO effect, restraint fixed, board held](dpo_effect.png)
The arithmetic-compute slips on harder problems (multi-digit carry) remain the honest weak point.
## Usage
```python
question = "What is 5 + 9?"  # example prompt; substitute your own
ctx = f"<|im_start|>user\n{question}<|im_end|>\n<|im_start|>assistant\n"
# Decode greedily with NO repetition penalty (it breaks the <think> format); stop on <|im_end|>.
```
Load at `recurse=2`. It emits `<think>` reasoning then the answer for math, and often answers directly for
simple facts. Trade quality for speed by lowering `recurse` at inference.
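
The decoding rules above (greedy, no repetition penalty, stop on `<|im_end|>`) can be sketched independently of the model loader, which is not shown here. In the sketch below, `next_token` is a hypothetical callable standing in for one greedy step of the real model, `scripted` is a stub used only to exercise the loop, and the closing `</think>` tag is an assumption about the template.

```python
import re
from typing import Callable, Tuple


def build_prompt(question: str) -> str:
    return f"<|im_start|>user\n{question}<|im_end|>\n<|im_start|>assistant\n"


def greedy_generate(next_token: Callable[[str], str], prompt: str,
                    max_new_tokens: int = 256) -> str:
    """Append the single most likely token each step; no repetition penalty."""
    out = ""
    for _ in range(max_new_tokens):
        tok = next_token(prompt + out)
        if tok == "<|im_end|>":  # stop token from the ChatML template
            break
        out += tok
    return out


def split_think(completion: str) -> Tuple[str, str]:
    """Separate the reasoning from the answer (assumes a closing </think>)."""
    m = re.match(r"\s*<think>(.*?)</think>\s*(.*)", completion, flags=re.S)
    if m is None:
        return "", completion.strip()  # direct answer, no reasoning block
    return m.group(1).strip(), m.group(2).strip()


def scripted(tokens):
    """Stub 'model' that replays a fixed token sequence, for illustration only."""
    it = iter(tokens)
    return lambda _context: next(it)


reply = greedy_generate(
    scripted(["<think>", "5+9=14", "</think>", "14", "<|im_end|>"]),
    build_prompt("What is 5 + 9?"),
)
thought, answer = split_think(reply)
```

Replacing `scripted(...)` with a real argmax-over-logits step is all that is needed to use this with the loaded model. Direct answers with no `<think>` pass through `split_think` with an empty reasoning string.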
## Limitations
- ~10M params, English only, research/education. Not for production, facts, or advice.
- DPO fixed most of the over-reasoning, but it still opens a `<think>` on roughly 1 in 8 non-math prompts.
- Thin world knowledge. It answers directly now, but can be wrong on the fact itself.
- Arithmetic is reliable on simple problems and slips on harder multi-digit ones.
- No safety alignment.
## License
Weights open. Data under the respective dataset licenses (smol-smoltalk, GSM8K, Cosmopedia, dolmino-mix
ODC-By, AllenAI QA sets, FineMath).