Commit ec3988b (verified) by codelion · 1 Parent(s): bf2f12f

Clarify metric: verifier@96 self-consistency (single answer), not pass@96

  1. README.md +27 -14
README.md CHANGED
@@ -20,9 +20,10 @@ grade-school math word problems **without any LLM at inference time**.
20
 
21
  Instead of generating text, SPROG abstracts the numbers in a question to slots
22
  (`[N0]`, `[N1]`, …) and predicts a **postfix program** over them, which is then executed
23
- symbolically. It draws 96 temperature samples and selects the answer with a **free
24
- symbolic verifier** (0 trainable parameters). The whole thing runs on a CPU/Apple-Silicon
25
- GPU via MLX.
 
26
 
27
  Trained on [`codelion/gsm8k-synth`](https://huggingface.co/datasets/codelion/gsm8k-synth).
28
 
@@ -50,17 +51,29 @@ print(solve(model, stoi, "Tom has 15 apples. He buys 27 more, then gives away 12
50
 
51
  ## Results
52
 
53
- Evaluated on the **full GSM8K test set** (1,319 problems), 3 training seeds:
54
-
55
- | metric | GSM8K test |
56
- |---|---|
57
- | **accuracy (symbolic verifier)** | **11.8%** mean, 12.6% best seed |
58
- | accuracy (plurality vote, no verifier) | ≈9.3% |
59
- | trainable parameters | **9.37M** |
60
- | LLM used at inference | **none** |
61
-
62
- The free symbolic verifier adds **+2.5 points** over plain majority voting. The model is
63
- stable across seeds on the test set (range 11.1–12.6%).
64
 
65
  ## How it was built
66
 
 
20
 
21
  Instead of generating text, SPROG abstracts the numbers in a question to slots
22
  (`[N0]`, `[N1]`, …) and predicts a **postfix program** over them, which is then executed
23
+ symbolically. It draws 96 temperature samples and selects **a single answer** with a **free
24
+ symbolic verifier** (0 trainable parameters) that never sees the ground truth — i.e.
25
+ self-consistency selection, **not** a pass@k oracle. The whole thing runs on a
26
+ CPU/Apple-Silicon GPU via MLX.
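
To make the slot-and-postfix idea concrete, here is a minimal sketch of number abstraction and symbolic execution. It is an illustration under assumptions, not SPROG's actual code: the helper names `abstract_numbers` and `run_postfix`, the four-operator vocabulary, and the sample question are all hypothetical.

```python
import re
from fractions import Fraction

# Assumed operator vocabulary; the real model's program tokens may differ.
OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a / b,
}

def abstract_numbers(question):
    """Replace each number with a slot token; return (template, values)."""
    values = []

    def to_slot(match):
        values.append(Fraction(match.group(0)))
        return f"[N{len(values) - 1}]"

    template = re.sub(r"\d+(?:\.\d+)?", to_slot, question)
    return template, values

def run_postfix(program, values):
    """Execute a whitespace-separated postfix program; None if malformed."""
    stack = []
    for token in program.split():
        if token in OPS:
            if len(stack) < 2:
                return None
            right = stack.pop()
            left = stack.pop()
            try:
                stack.append(OPS[token](left, right))
            except ZeroDivisionError:
                return None
        else:
            slot = re.fullmatch(r"\[N(\d+)\]", token)
            if slot is None or int(slot.group(1)) >= len(values):
                return None
            stack.append(values[int(slot.group(1))])
    # A well-formed program leaves exactly one value on the stack.
    return stack[0] if len(stack) == 1 else None

# Hypothetical question, written for this sketch only.
template, values = abstract_numbers(
    "A box holds 6 pens. Ana buys 4 boxes and loses 3 pens."
)
answer = run_postfix("[N0] [N1] * [N2] -", values)
print(template)
print(answer)
```

Returning `None` for malformed programs keeps invalid samples out of any later vote or verifier step without raising.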
27
 
28
  Trained on [`codelion/gsm8k-synth`](https://huggingface.co/datasets/codelion/gsm8k-synth).
29
 
 
51
 
52
  ## Results
53
 
54
+ Evaluated on the **full GSM8K test set** (1,319 problems), averaged over 3 training seeds.
55
+
56
+ **The model commits to one answer per question.** It draws 96 temperature samples, then a
57
+ 0-parameter symbolic verifier picks a **single** answer **without ever seeing the gold
58
+ answer**. This is a single-answer accuracy (the self-consistency / maj@k family), **not a
59
+ pass@k oracle**.
60
+
61
+ | metric | GSM8K test | what it measures |
62
+ |---|---|---|
63
+ | **verifier @ 96** (headline) | **11.8%** (best seed 12.6%) | verifier commits to one answer; gold never used |
64
+ | plurality @ 96 | ≈9.3% | most-voted answer; gold never used |
65
+ | pass@96 (oracle) | ≈39% | gold is *somewhere* in the 96 samples — an upper bound that **uses the gold to check** |
66
+ | trainable parameters | 9.37M | — |
67
+ | LLM used at inference | none | — |
68
+
69
+ So **11.8% is a committed single answer, not pass@96** (that would be the ≈39% oracle). The
70
+ free symbolic verifier adds ≈+2.5 points over majority voting, at 96× the inference cost of a
71
+ single decode. Results are stable across seeds (range 11.1–12.6%).
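
The difference between the three rows can be sketched in a few lines. This is a toy illustration, not the project's evaluation script: `toy_verifier` is a hypothetical stand-in (the real symbolic verifier's rules are not described here), and the answer pools are made up. The point is only that the two committed metrics choose without the gold answer, while pass@k consults it.

```python
from collections import Counter

def plurality(answers):
    """Most-voted answer among valid samples; gold is never consulted."""
    valid = [a for a in answers if a is not None]
    return Counter(valid).most_common(1)[0][0] if valid else None

def toy_verifier(answers):
    """Hypothetical stand-in for a symbolic verifier: keep only
    non-negative whole-number answers, then vote among those."""
    plausible = [a for a in answers
                 if a is not None and a >= 0 and a == int(a)]
    return plurality(plausible) if plausible else plurality(answers)

def evaluate(pools, golds, select):
    """Return (committed accuracy, pass@k oracle rate)."""
    n = len(golds)
    # Committed: one answer per question, chosen without the gold.
    committed = sum(select(pool) == gold for pool, gold in zip(pools, golds))
    # Oracle: uses the gold to check whether ANY sample matches it.
    oracle = sum(gold in pool for pool, gold in zip(pools, golds))
    return committed / n, oracle / n

# Made-up pools of sampled answers (None marks a failed program).
pools = [
    [-2, -2, -2, 30, 30, 7],
    [5, 5, 8, None],
    [1.5, 1.5, 4],
    [12, 12, 3],
]
golds = [30, 8, 9, 12]

plurality_acc, pass_at_k = evaluate(pools, golds, plurality)
verifier_acc, _ = evaluate(pools, golds, toy_verifier)
print(plurality_acc, verifier_acc, pass_at_k)
```

In this toy data the oracle rate is the highest of the three because it counts a question as solved whenever the gold appears anywhere in the pool, which is why it is an upper bound for any committed selector.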
72
+
73
+ **Why 96 samples?** Recall rises with sample count (≈39% gold-in-pool at 96 → ≈50% at 288),
74
+ but the verifier's *conversion* peaks around 64–96 then declines — extra samples add
75
+ plausible-but-wrong distractors that hurt selection (measured: 192 → 8.5%, 288 → 8.3%). 96 is
76
+ the sweet spot between recall and selectability.
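
One way to read "conversion" is as committed accuracy divided by gold-in-pool recall. That definition is an assumption made for this sketch, and the inputs are the approximate figures quoted above (recall at 192 samples is not reported, so it is left out).

```python
def conversion(accuracy_pct, recall_pct):
    """Share of gold-in-pool questions that the verifier actually selects."""
    return accuracy_pct / recall_pct

# (verifier accuracy %, gold-in-pool recall %) as quoted in this section.
reported = {96: (11.8, 39.0), 288: (8.3, 50.0)}

rates = {k: conversion(acc, rec) for k, (acc, rec) in reported.items()}
for k, rate in rates.items():
    print(f"k={k}: verifier converts about {rate:.1%} of recoverable questions")
```

Under that reading, recall grows from 96 to 288 samples while the converted share roughly halves, which matches the drop in committed accuracy.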
77
 
78
  ## How it was built
79