sprog-9m / README.md

Clarify metric: verifier@96 self-consistency (single answer), not pass@96

ec3988b verified 1 day ago

5.14 kB

	---
	license: mit
	library_name: mlx
	pipeline_tag: question-answering
	tags:
	- mlx
	- math
	- gsm8k
	- symbolic
	- math-word-problems
	- from-scratch
	datasets:
	- codelion/gsm8k-synth
	---

	# SPROG-9M

	Symbolic PROGram solver — a 9.37M-parameter, from-scratch model that solves
	grade-school math word problems without any LLM at inference time.

	Instead of generating text, SPROG abstracts the numbers in a question to slots
	(`[N0]`, `[N1]`, …) and predicts a postfix program over them, which is then executed
	symbolically. It draws 96 temperature samples and selects a single answer with a **free
	symbolic verifier** (0 trainable parameters) that never sees the ground truth — i.e.
	self-consistency selection, not a pass@k oracle. The whole thing runs on a
	CPU/Apple-Silicon GPU via MLX.

	Trained on [`codelion/gsm8k-synth`](https://huggingface.co/datasets/codelion/gsm8k-synth).

	## Quick start

	```bash
	pip install mlx numpy huggingface_hub
	huggingface-cli download codelion/sprog-9m --local-dir ./sprog-9m
	python sprog-9m/inference.py -q "A baker had 24 muffins. She sold 3/4 of them, then baked 10 more. How many muffins does she have now?"
	# Answer: 16.0
	```

	```python
	from huggingface_hub import snapshot_download
	from pathlib import Path
	import sys

	p = snapshot_download("codelion/sprog-9m"); sys.path.insert(0, p)
	from inference import load_model, solve

	model, stoi, cfg = load_model(Path(p))
	print(solve(model, stoi, "Tom has 15 apples. He buys 27 more, then gives away 12. How many does he have?"))
	# 30.0
	```

	## Results

	Evaluated on the full GSM8K test set (1,319 problems), averaged over 3 training seeds.

	The model commits to one answer per question. It draws 96 temperature samples, then a
	0-parameter symbolic verifier picks a single answer **without ever seeing the gold
	answer. This is a single-answer accuracy (the self-consistency / maj@k family) — not a
	pass@k oracle**.

	\| metric \| GSM8K test \| what it measures \|
	\|---\|---\|---\|
	\| verifier @ 96 (headline) \| 11.8% (best seed 12.6%) \| verifier commits to one answer; gold never used \|
	\| plurality @ 96 \| ≈9.3% \| most-voted answer; gold never used \|
	\| pass@96 (oracle) \| ≈39% \| gold is somewhere in the 96 samples — an upper bound that uses the gold to check \|
	\| trainable parameters \| 9.37M \| — \|
	\| LLM used at inference \| none \| — \|

	So 11.8% is a committed single answer, not pass@96 (that would be the ≈39% oracle). The
	free symbolic verifier adds ≈+2.5 points over majority voting, at 96× the inference cost of a
	single decode. Results are stable across seeds (range 11.1–12.6%).

	Why 96 samples? Recall rises with sample count (≈39% gold-in-pool at 96 → ≈50% at 288),
	but the verifier's conversion peaks around 64–96 then declines — extra samples add
	plausible-but-wrong distractors that hurt selection (measured: 192 → 8.5%, 288 → 8.3%). 96 is
	the sweet spot between recall and selectability.

	## How it was built

	- Number-slot abstraction. The model never sees raw numbers — they become slots, so it
	learns program structure, not arithmetic, and generalizes across values.
	- Symbolic program target. It predicts a postfix program (`[N0] [N1] * [N2] -`) executed
	by a tiny deterministic evaluator.
	- Self-consistency + free verifier. 96 sampled programs are scored by a 0-parameter
	symbolic verifier (number-coverage, magnitude sanity, intermediate-value sanity), tie-broken
	by vote frequency.
	- Data is the main lever. Trained on real GSM8K-train plus ≈117K LLM-generated
	GSM8K-style problems (Claude + Gemini). What mattered most was **matching the real GSM8K
	step-distribution and rigorous decontamination** (0% test overlap), not raw data volume
	or model size — a deeper/bigger model did not help beyond noise.

	## Constraints (by design)

	- ≤10M trainable parameters (9.37M)
	- From scratch — no pretrained weights, no frozen LLM, no linear probe on a foundation model
	- LLM-free at inference — pure MLX + symbolic execution
	- Decontaminated — training data has 0% 8-gram overlap with the GSM8K test set

	## Files

	\| file \| purpose \|
	\|---\|---\|
	\| `model.npz` \| MLX weights (9.37M, d=304, 4 enc + 4 dec layers) \|
	\| `config.json` \| architecture config \|
	\| `src_vocab.json` \| source vocabulary (6,000 tokens) \|
	\| `inference.py` \| self-contained inference (model + slot tokenizer + verifier) \|

	## Why not `mlx-lm` / `AutoModel`?

	SPROG is a custom encoder-decoder seq2seq with a slot tokenizer and a **symbolic
	decode + verify** pipeline — not a standard causal LM. So it ships its own `inference.py`
	rather than loading through `mlx-lm` or `transformers.AutoModel`. The script has no
	dependencies beyond `mlx` and `numpy`.

	## Limitations

	This is a research model demonstrating how far a tiny, LLM-free, from-scratch solver can go
	on GSM8K (≈12%). It handles 1–4 step arithmetic word problems with common operations; it
	misses many multi-step problems that require deeper reading comprehension. It is not a
	general math model and should not be used as one.

	## License

	MIT.