How to use codelion/sprog-9m with MLX:
    # Download the model from the Hub
    pip install huggingface_hub[hf_xet]
    huggingface-cli download --local-dir sprog-9m codelion/sprog-9m
Clarify metric: verifier@96 self-consistency (single answer), not pass@96
README.md
@@ -20,9 +20,10 @@ grade-school math word problems **without any LLM at inference time**.
 
 Instead of generating text, SPROG abstracts the numbers in a question to slots
 (`[N0]`, `[N1]`, …) and predicts a **postfix program** over them, which is then executed
-symbolically. It draws 96 temperature samples and selects
-symbolic verifier** (0 trainable parameters)
-
 
 Trained on [`codelion/gsm8k-synth`](https://huggingface.co/datasets/codelion/gsm8k-synth).
 
@@ -50,17 +51,29 @@ print(solve(model, stoi, "Tom has 15 apples. He buys 27 more, then gives away 12
 
 ## Results
 
-Evaluated on the **full GSM8K test set** (1,319 problems), 3 training seeds
-
-
-
-
-
-
-
-
-
-
 
 ## How it was built
 
 
 Instead of generating text, SPROG abstracts the numbers in a question to slots
 (`[N0]`, `[N1]`, …) and predicts a **postfix program** over them, which is then executed
+symbolically. It draws 96 temperature samples and selects **a single answer** with a **free
+symbolic verifier** (0 trainable parameters) that never sees the ground truth — i.e.
+self-consistency selection, **not** a pass@k oracle. The whole thing runs on a
+CPU/Apple-Silicon GPU via MLX.
 
 Trained on [`codelion/gsm8k-synth`](https://huggingface.co/datasets/codelion/gsm8k-synth).
 
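For readers of this diff, the slot abstraction and symbolic execution described in the README text can be illustrated with a minimal sketch. The README only states that numbers become slots (`[N0]`, `[N1]`, …) and that a postfix program over them is executed symbolically; the regex, the four-operator set, the function names `abstract_numbers` and `run_postfix`, and the example question below are all assumptions made for illustration, not the repository's code.

```python
import re
from fractions import Fraction

# Assumption: numbers are plain integers or decimals.
NUMBER = re.compile(r"\d+(?:\.\d+)?")
SLOT = re.compile(r"\[N(\d+)\]")

# Assumption: the program vocabulary is the four basic operators.
OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a / b,
}


def abstract_numbers(question):
    """Replace each number with a slot token; return (template, values)."""
    values = []

    def to_slot(match):
        values.append(Fraction(match.group()))
        return f"[N{len(values) - 1}]"

    return NUMBER.sub(to_slot, question), values


def run_postfix(tokens, values):
    """Execute a postfix program over slot values; None if it is malformed."""
    stack = []
    for tok in tokens:
        if tok in OPS:
            if len(stack) < 2:
                return None
            b, a = stack.pop(), stack.pop()
            if tok == "/" and b == 0:
                return None
            stack.append(OPS[tok](a, b))
        else:
            m = SLOT.fullmatch(tok)
            if m is None or int(m.group(1)) >= len(values):
                return None
            stack.append(values[int(m.group(1))])
    return stack[0] if len(stack) == 1 else None


# Hypothetical question and program, chosen only to exercise the sketch.
question = "A box holds 4 rows of 6 eggs. 5 eggs break. How many are left?"
template, values = abstract_numbers(question)
answer = run_postfix("[N0] [N1] * [N2] -".split(), values)
print(template)
print(answer)
```

Returning `None` for malformed programs is a design choice of this sketch: with 96 sampled programs per question, invalid samples can simply be dropped before any selection step.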
 
 ## Results
 
+Evaluated on the **full GSM8K test set** (1,319 problems), averaged over 3 training seeds.
+
+**The model commits to one answer per question.** It draws 96 temperature samples, then a
+0-parameter symbolic verifier picks a **single** answer **without ever seeing the gold
+answer**. This is a single-answer accuracy (the self-consistency / maj@k family) — **not a
+pass@k oracle**.
+
+| metric | GSM8K test | what it measures |
+|---|---|---|
+| **verifier @ 96** (headline) | **11.8%** (best seed 12.6%) | verifier commits to one answer; gold never used |
+| plurality @ 96 | ≈9.3% | most-voted answer; gold never used |
+| pass@96 (oracle) | ≈39% | gold is *somewhere* in the 96 samples — an upper bound that **uses the gold to check** |
+| trainable parameters | 9.37M | — |
+| LLM used at inference | none | — |
+
+So **11.8% is a committed single answer, not pass@96** (that would be the ≈39% oracle). The
+free symbolic verifier adds ≈+2.5 points over majority voting, at 96× the inference cost of a
+single decode. Results are stable across seeds (range 11.1–12.6%).
+
+**Why 96 samples?** Recall rises with sample count (≈39% gold-in-pool at 96 → ≈50% at 288),
+but the verifier's *conversion* peaks around 64–96 then declines — extra samples add
+plausible-but-wrong distractors that hurt selection (measured: 192 → 8.5%, 288 → 8.3%). 96 is
+the sweet spot between recall and selectability.
 
 ## How it was built
 
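The distinction this commit draws between a committed single answer and a pass@k oracle can be made concrete with a small sketch. It assumes each problem is a dict with a list of sampled answers and a gold answer; the names `plurality`, `selected_accuracy`, and `pass_at_k`, and the toy data, are hypothetical. The README's symbolic verifier is not specified in this diff, so plurality voting stands in as the selector here.

```python
from collections import Counter


def plurality(samples):
    """Most-voted answer among the samples (the gold answer is never consulted)."""
    return Counter(samples).most_common(1)[0][0]


def selected_accuracy(problems, select):
    """Single-answer accuracy: `select` sees only the samples, then is graded once."""
    hits = sum(select(p["samples"]) == p["gold"] for p in problems)
    return hits / len(problems)


def pass_at_k(problems):
    """Oracle recall: a hit whenever the gold appears anywhere in the samples."""
    hits = sum(p["gold"] in p["samples"] for p in problems)
    return hits / len(problems)


# Made-up toy data with 4 samples per problem, for illustration only.
problems = [
    {"samples": [19, 19, 7, 3], "gold": 19},   # vote correct, gold in pool
    {"samples": [8, 8, 12, 5], "gold": 12},    # vote wrong, gold in pool
    {"samples": [1, 2, 2, 4], "gold": 9},      # vote wrong, gold absent
    {"samples": [6, 6, 10, 3], "gold": 6},     # vote correct, gold in pool
]

plurality_acc = selected_accuracy(problems, plurality)
oracle_recall = pass_at_k(problems)
print(plurality_acc, oracle_recall)
```

Any selector that takes only the samples can be passed as `select`, which is what keeps the resulting number in the single-answer family. If "conversion" is read as selected accuracy divided by oracle recall (an assumption; the README does not define it), the reported figures give roughly 11.8 / 39 ≈ 0.30 at 96 samples and 8.3 / 50 ≈ 0.17 at 288.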