Upload folder using huggingface_hub

b743d9d verified 10 days ago

7.16 kB

	---
	license: apache-2.0
	language:
	- en
	library_name: mlx
	pipeline_tag: text-generation
	tags:
	- rodan
	- tiny-language-model
	- mlx
	- reasoning
	- chain-of-thought
	- dpo
	base_model: bfuzzy1/Rodan-Chat
	---

	# Rodan-10M-Reasoning

	A 10.41M-parameter reasoning model trained on a single Apple M2 with MLX. It stacks on the chat model and
	adds recurrent depth: the same 8 transformer blocks run twice per forward pass, giving the effective
	depth of a 16-layer network at zero extra parameters. The idea is to spend more compute per token on
	hard problems without growing the model.

	> What it is, honestly. The recurrence mechanism works, the probes show the second pass doing real
	> compositional computation, and the activation-patching maps a genuine arithmetic circuit. The model does
	> accurate single-step arithmetic and reads natural-language word problems into the right operation.
	> A final DPO pass (verifiable preference pairs, KL-leashed) then fixed its restraint: it now answers
	> simple facts directly instead of doing arithmetic on them (math-on-non-math prompts dropped from ~half to
	> ~1 in 8), at no board cost. On the board it sits at 35.41, about level with the base (35.80), because
	> recurrent depth doesn't move discrimination benchmarks. The win is in what it does, not the board number.

	> Part of the Rodan-10M series. Lineage: base v6 → v9 (PLE-free) → Chat (instruction fold) → **Reasoning
	> (this model)**. Warm-started from Chat, so it keeps instruction-following and ChatML.

	## Architecture

	Same as the base/chat stack, dim 320, 8 layers, 8 heads, MQA (1 KV head), SwiGLU 768, RMSNorm, RoPE base
	200k, QK-norm, tied embeddings, value-residual, LRM, no PLE, with two changes:

	- `recurse=2`: the 8 blocks run twice over the residual stream (16 effective layers, still 10.41M params).
	- ChatML + `<think>` template for reasoning turns; direct answers for simple ones.

	Trained in bfloat16 (~8× faster than fp32 on this M2 at this depth/length), seq 512.

	## Training recipe

	Warm-started from Chat, then trained at `recurse=2` on a natural-language-reasoning mix. The key lesson from
	the first attempt: an arithmetic-symbol-heavy fold made the model narrow (it tried to compute everything).
	This version leads with word problems and adds a slice of direct-answer examples to teach restraint.

	\| share \| source \| mode \|
	\|---\|---\|---\|
	\| 24% \| natural-language word problems (synthesized) \| `<think>` → answer \|
	\| 21% \| symbolic arithmetic CoT \| `<think>` → answer \|
	\| 8% \| answer-only facts \| direct, no `<think>` \|
	\| 2% \| GSM8K \| `<think>` → answer \|
	\| 45% \| replay (smol-smoltalk + curated: Cosmopedia / dolmino / FineMath / sci-QA) \| mixed \|

	No web data anywhere, the curated-only lineage held since v6. Optimizer: Muon + AdamW, LR 1.8e-3 / Muon 9e-3,
	seq 512, 7000 steps, bf16.

	![Reasoning loss & data mix](loss_datamix.png)

	## Does the recursion work?

	Measured directly, the same way we probed value-residual and LRM on the base. The second pass earns its keep:

	![Recursion probes](reasoning_probes.png)

	The model leans hard on the second pass, run it at recurse 1 and held-out loss is much worse (ppl 5.72 vs
	4.29). It flips the predicted token on ~23% of positions, and raises the probability of the correct next token
	almost everywhere (+0.26 log-prob on average). It sharpens digits (entropy drops 0.14) and, unlike the first
	attempt, the quantitative-language words recovered (+0.23), the natural-language word problems taught it
	to handle "more / less / total / twice", which symbolic arithmetic alone never did.

	Activation patching maps the arithmetic circuit causally: operands bind early, the computation resolves around
	block 5, the answer is written at block 6, and multi-step problems unroll across depth (step 2 binds deeper
	than step 1). Factual recall has a different shape, a single late lookup at block 6 with no early work. The
	full circuit atlas is in `circuit.html`.

	## Evaluation

	Zero-shot lm-eval, limit 1000, recurse 2, raw.

	\| Task \| Metric \| Reasoning \| Chat \| v9 base \| v6 base \|
	\|---\|---\|---\|---\|---\|---\|
	\| HellaSwag \| acc_norm \| 31.9 \| 30.1 \| 30.1 \| 31.8 \|
	\| ARC-Easy \| acc_norm \| 36.7 \| 35.3 \| 35.4 \| 35.6 \|
	\| ARC-Challenge \| acc_norm \| 21.2 \| 23.2 \| 22.2 \| 22.4 \|
	\| PIQA \| acc \| 54.4 \| 53.8 \| 55.5 \| 56.0 \|
	\| ArithMark-2 \| acc \| 26.4 \| 25.8 \| 28.4 \| 26.4 \|
	\| LogicMark \| acc \| 43.3 \| 48.5 \| 44.8 \| 44.8 \|
	\| SciQ \| acc \| 67.4 \| — \| 67.8 \| 67.5 \|
	\| Winogrande \| acc \| 50.4 \| — \| 49.4 \| 49.8 \|
	\| Board avg (÷4) \| \| 35.41 \| 35.04 \| 35.70 \| 35.80 \|

	(Numbers are the final DPO'd model. The pre-DPO fold scored 35.53; DPO held the board at 35.41, a noise-level
	change, while fixing the restraint.)

	Board 35.41, level with the base (v6 35.80) and above Chat. Recurrent depth doesn't move the board; that's
	expected. What changed is behaviour, which the board can't see:

	- Arithmetic is accurate, 4-5 of 6 on held-out single-step problems (`5+9=14`, `7×6=42`, `40−13=27`),
	one step, stops cleanly. The earlier version mis-computed and over-reasoned.
	- Word problems translate, "Sara has 12 apples and buys 7 more" → it sets up `12 + 7` and solves it.
	- Sometimes answers directly, "capital of France → Paris", "opposite of hot → cold", no `<think>`.

	The restraint fix (DPO). The fold alone left restraint unstable, it opened a `<think>` and did arithmetic
	on ~half of non-math prompts (the 8% answer-only data couldn't settle it). A final DPO pass on synthesized,
	verifiable preference pairs fixed it: mode pairs (non-math → direct answer ≻ spurious `<think>` math) and
	process pairs (correct concise chain ≻ wrong/over-reasoned). LR 5e-7, β 0.1, 1 epoch, KL-leashed to the
	frozen fold checkpoint. Result: math-on-non-math dropped from ~4/8 to ~1/8, board unchanged (35.53 → 35.41).
	DPO steered the behaviour it had; it did not fix the residual 2-digit arithmetic slips (e.g. 25−9), which are
	a capability limit, not a preference one, that needs more/harder arithmetic data, not preference tuning.

	![DPO effect, restraint fixed, board held](dpo_effect.png)

	The arithmetic-compute slips on harder problems (multi-digit carry) remain the honest weak point.

	## Usage

	```python
	ctx = f"<\|im_start\|>user\n{question}<\|im_end\|>\n<\|im_start\|>assistant\n"
	# greedy, NO repetition penalty (it breaks the <think> format) ; stop on <\|im_end\|>
	```

	Load at `recurse=2`. It emits `<think>` reasoning then the answer for math, and often answers directly for
	simple facts. Trade quality for speed by lowering `recurse` at inference.

	## Limitations

	- ~10M params, English only, research/education. Not for production, facts, or advice.
	- DPO fixed most of the over-reasoning, but it still opens a `<think>` on roughly 1 in 8 non-math prompts.
	- Thin world knowledge. It answers directly now, but can be wrong on the fact itself.
	- Arithmetic is reliable on simple problems and slips on harder multi-digit ones.
	- No safety alignment.

	## License

	Weights open. Data under the respective dataset licenses (smol-smoltalk, GSM8K, Cosmopedia, dolmino-mix
	ODC-By, AllenAI QA sets, FineMath).