Instructions to use codelion/sprog-9m with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- MLX
How to use codelion/sprog-9m with MLX:
# Download the model from the Hub pip install huggingface_hub[hf_xet] huggingface-cli download --local-dir sprog-9m codelion/sprog-9m
- Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- LM Studio
| license: mit | |
| library_name: mlx | |
| pipeline_tag: question-answering | |
| tags: | |
| - mlx | |
| - math | |
| - gsm8k | |
| - symbolic | |
| - math-word-problems | |
| - from-scratch | |
| datasets: | |
| - codelion/gsm8k-synth | |
| # SPROG-9M | |
| **S**ymbolic **PROG**ram solver — a **9.37M-parameter, from-scratch** model that solves | |
| grade-school math word problems **without any LLM at inference time**. | |
| Instead of generating text, SPROG abstracts the numbers in a question to slots | |
| (`[N0]`, `[N1]`, …) and predicts a **postfix program** over them, which is then executed | |
| symbolically. It draws 96 temperature samples and selects **a single answer** with a **free | |
| symbolic verifier** (0 trainable parameters) that never sees the ground truth — i.e. | |
| self-consistency selection, **not** a pass@k oracle. The whole thing runs on a | |
| CPU/Apple-Silicon GPU via MLX. | |
| Trained on [`codelion/gsm8k-synth`](https://huggingface.co/datasets/codelion/gsm8k-synth). | |
| ## Quick start | |
| ```bash | |
| pip install mlx numpy huggingface_hub | |
| huggingface-cli download codelion/sprog-9m --local-dir ./sprog-9m | |
| python sprog-9m/inference.py -q "A baker had 24 muffins. She sold 3/4 of them, then baked 10 more. How many muffins does she have now?" | |
| # Answer: 16.0 | |
| ``` | |
| ```python | |
| from huggingface_hub import snapshot_download | |
| from pathlib import Path | |
| import sys | |
| p = snapshot_download("codelion/sprog-9m"); sys.path.insert(0, p) | |
| from inference import load_model, solve | |
| model, stoi, cfg = load_model(Path(p)) | |
| print(solve(model, stoi, "Tom has 15 apples. He buys 27 more, then gives away 12. How many does he have?")) | |
| # 30.0 | |
| ``` | |
| ## Results | |
| Evaluated on the **full GSM8K test set** (1,319 problems), averaged over 3 training seeds. | |
| **The model commits to one answer per question.** It draws 96 temperature samples, then a | |
| 0-parameter symbolic verifier picks a **single** answer **without ever seeing the gold | |
| answer**. This is a single-answer accuracy (the self-consistency / maj@k family) — **not a | |
| pass@k oracle**. | |
| | metric | GSM8K test | what it measures | | |
| |---|---|---| | |
| | **verifier @ 96** (headline) | **11.8%** (best seed 12.6%) | verifier commits to one answer; gold never used | | |
| | plurality @ 96 | ≈9.3% | most-voted answer; gold never used | | |
| | pass@96 (oracle) | ≈39% | gold is *somewhere* in the 96 samples — an upper bound that **uses the gold to check** | | |
| | trainable parameters | 9.37M | — | | |
| | LLM used at inference | none | — | | |
| So **11.8% is a committed single answer, not pass@96** (that would be the ≈39% oracle). The | |
| free symbolic verifier adds ≈+2.5 points over majority voting, at 96× the inference cost of a | |
| single decode. Results are stable across seeds (range 11.1–12.6%). | |
| **Why 96 samples?** Recall rises with sample count (≈39% gold-in-pool at 96 → ≈50% at 288), | |
| but the verifier's *conversion* peaks around 64–96 then declines — extra samples add | |
| plausible-but-wrong distractors that hurt selection (measured: 192 → 8.5%, 288 → 8.3%). 96 is | |
| the sweet spot between recall and selectability. | |
| ## How it was built | |
| - **Number-slot abstraction.** The model never sees raw numbers — they become slots, so it | |
| learns program *structure*, not arithmetic, and generalizes across values. | |
| - **Symbolic program target.** It predicts a postfix program (`[N0] [N1] * [N2] -`) executed | |
| by a tiny deterministic evaluator. | |
| - **Self-consistency + free verifier.** 96 sampled programs are scored by a 0-parameter | |
| symbolic verifier (number-coverage, magnitude sanity, intermediate-value sanity), tie-broken | |
| by vote frequency. | |
| - **Data is the main lever.** Trained on real GSM8K-train plus ≈117K LLM-generated | |
| GSM8K-style problems (Claude + Gemini). What mattered most was **matching the real GSM8K | |
| step-distribution** and **rigorous decontamination** (0% test overlap), not raw data volume | |
| or model size — a deeper/bigger model did not help beyond noise. | |
| ## Constraints (by design) | |
| - **≤10M trainable parameters** (9.37M) | |
| - **From scratch** — no pretrained weights, no frozen LLM, no linear probe on a foundation model | |
| - **LLM-free at inference** — pure MLX + symbolic execution | |
| - **Decontaminated** — training data has 0% 8-gram overlap with the GSM8K test set | |
| ## Files | |
| | file | purpose | | |
| |---|---| | |
| | `model.npz` | MLX weights (9.37M, d=304, 4 enc + 4 dec layers) | | |
| | `config.json` | architecture config | | |
| | `src_vocab.json` | source vocabulary (6,000 tokens) | | |
| | `inference.py` | self-contained inference (model + slot tokenizer + verifier) | | |
| ## Why not `mlx-lm` / `AutoModel`? | |
| SPROG is a custom **encoder-decoder seq2seq** with a **slot tokenizer** and a **symbolic | |
| decode + verify** pipeline — not a standard causal LM. So it ships its own `inference.py` | |
| rather than loading through `mlx-lm` or `transformers.AutoModel`. The script has no | |
| dependencies beyond `mlx` and `numpy`. | |
| ## Limitations | |
| This is a research model demonstrating how far a tiny, LLM-free, from-scratch solver can go | |
| on GSM8K (≈12%). It handles 1–4 step arithmetic word problems with common operations; it | |
| misses many multi-step problems that require deeper reading comprehension. It is not a | |
| general math model and should not be used as one. | |
| ## License | |
| MIT. | |