Upload folder using huggingface_hub

a4c1304 verified 11 days ago

5.93 kB

	---
	license: apache-2.0
	language:
	- en
	library_name: mlx
	pipeline_tag: text-generation
	tags:
	- rodan
	- tiny-language-model
	- mlx
	- chat
	- chatml
	base_model: bfuzzy1/Rodan-Base
	---

	# Rodan-10M-Chat

	A 10.41M-parameter chat model trained on one Apple M2 with MLX. It starts from the leaner Rodan-10M v9 base
	(PLE-free) and folds conversational data into a short continued-pretrain stage, which adds chat ability
	without throwing away what the base already knew.

	> Part of the Rodan-10M series. The base model (`Rodan-10M-Base`, v6) lives in its own repo; this is the
	> chat stage built on the v9 checkpoint.

	## What it is

	A ~10M chat model. With the right decoding settings it answers simple instructions coherently and stays on
	topic, which is more than I expected from something this small. It is not a production assistant. Its world
	knowledge is thin, its arithmetic is shaky, and it will make things up. Treat it as a demo of how much chat
	ability you can fit in 10M params, not as something to rely on.

	```
	USER: What is the capital of France?
	RODAN: The capital of France is Paris. It's a city known as the country, which has its own unique
	and cultural significance in Europe.

	USER: Give me a tip for staying healthy.
	RODAN: You can use a regular exercise routine to manage your body and improve overall well-being.
	It's important ... maintaining healthy weight management by burning calories.

	USER: Why do we sleep?
	RODAN: Sleep can be used to improve cognitive function and memory.
	```

	## How it was built

	At 10M params, ordinary supervised finetuning costs you base capability. We saw it in earlier Rodan
	experiments: a masked-SFT stage dropped board avg from about 34 to 32, the SFT tax. So instead of a separate
	masked-SFT stage, Rodan-Chat folds the instruction data into
	a continued-pretrain run mixed with 45% replay of the base's own domains (the approach Falcon used). The
	replay is what keeps the model from forgetting. Chat ability gets added while commonsense, science, and
	arithmetic stay roughly where they were.

	- Warm-start: Rodan-10M v9 (PLE-free, 10.41M). The tied embedding grows 8192→8194 for 2 ChatML tokens.
	- Data (73M tokens): 40M smol-smoltalk conversations in ChatML, plus 33M curated replay, full-sequence LM loss.
	- Optimizer: Muon on the 2D weights, AdamW elsewhere, low LR (1.2e-3, Muon 7e-3, below the base run), cosine, 6000 steps.
	- Result: perplexity dropped 24.9 → 14.6, and the base board avg held at 35.04.

	\| Source \| Share \| Role \|
	\|---\|---\|---\|
	\| smol-smoltalk (ChatML) \| 55% \| instruction / multi-turn chat \|
	\| Cosmopedia (replay) \| 9% \| commonsense anchor \|
	\| dolmino pes2o + StackExchange (replay) \| 9% \| knowledge anchor \|
	\| synthetic arithmetic (replay) \| 9% \| computation anchor \|
	\| FineMath (replay) \| 9% \| math anchor \|
	\| science-QA (replay) \| 9% \| science-MC anchor \|

	![Chat fold loss & data mix](chat_datamix.png)

	## Architecture

	Same as the base: decoder-only, dim 320, 8 layers, 8 heads, MQA with 1 KV head, SwiGLU 768, RMSNorm, RoPE
	base 200k, QK-norm, tied embeddings, value-residual, LRM. No PLE, since the probe on the base showed it was
	dead. Vocab is 8194 (the 8k byte-BPE set plus `<\|im_start\|>` and `<\|im_end\|>`).

	## Evaluation

	The base capability held; there was no SFT-tax collapse. Zero-shot lm-eval, limit 1000, ChatML-wrapped:

	\| Task \| Metric \| Rodan-Chat \| v9 base \| Δ \|
	\|---\|---\|---\|---\|---\|
	\| HellaSwag \| acc_norm \| 31.7 \| 30.1 \| +1.6 \|
	\| ARC-Easy \| acc_norm \| 35.3 \| 35.4 \| ≈ \|
	\| ARC-Challenge \| acc_norm \| 22.4 \| 22.2 \| ≈ \|
	\| PIQA \| acc \| 53.8 \| 55.5 \| −1.7 \|
	\| ArithMark-2 \| acc \| 25.8 \| 28.4 \| −2.6 \|
	\| Board avg (÷4) \| \| 35.04 \| 35.70 \| −0.66 \|

	The 0.66 dip is partly just the ChatML wrapper hurting multiple-choice loglikelihood, and it's nowhere near
	the 34→32 drop a naive finetune would have caused. The replay did its job.

	For instruction following itself, IFEval is close to useless at 10M: it grades strict constraint compliance,
	which really needs a model two or three orders of magnitude larger. So we measured the thing we actually care
	about instead. On 24 instruction prompts, an LLM judge compared Rodan-Chat against the v9 base, both decoded
	with the same repetition penalty. Chat won 14, tied 9, and lost 1, for a 93% win-rate excluding ties. The
	base tended to lose by sliding into code or rambling, while Chat gave coherent on-topic answers, several of
	them correct (Paris, photosynthesis producing glucose, the opposite of hot being cold, sleep helping memory).

	![Chat eval: board held + win-rate](chat_eval.png)

	We skipped a full IFEval score on purpose. It grades strict format compliance, which a 10M model fails
	near-uniformly, so the number carries no signal and isn't worth the long generative eval. The win-rate above
	is the instruction-following metric we trust at this scale.

	## Usage

	Wrap prompts in ChatML and decode with a repetition penalty. Tiny models loop badly under pure greedy
	decoding, and the penalty is the difference between gibberish and readable answers.

	```python
	ctx = f"<\|im_start\|>user\n{question}<\|im_end\|>\n<\|im_start\|>assistant\n"
	# greedy + repetition_penalty 1.3 + no-repeat-3gram ; stop on <\|im_end\|> (8193) or <\|endoftext\|> (0)
	```

	The settings I'd recommend: greedy, `repetition_penalty=1.3`, `no_repeat_ngram=3`, `max_new≈70`, low or zero
	temperature.

	## Limitations

	- ~10M params, English only, for research and teaching. Don't use it in production, for factual queries, or for advice.
	- Thin world knowledge, weak arithmetic, prone to making things up, near chance on abstract reasoning.
	- It needs a repetition penalty to stay coherent; pure greedy decoding loops.
	- No safety alignment. It imitates the shape of a chat reply without being a reliable assistant.

	## License

	Weights are open. Data falls under the respective dataset licenses (smol-smoltalk, Cosmopedia, dolmino-mix
	ODC-By, AllenAI QA sets, FineMath).