codelion committed
Commit bf2f12f · verified · 1 parent: 758ab1c

Fix strikethrough rendering (tildes -> approx symbol)
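
Some Markdown renderers treat text between tildes as strikethrough, so a `~` used to mean "approximately" can strike out whatever sits between two such tildes. The sketch below shows one way this kind of replacement could be automated. It is illustrative only: the helper name `tildes_to_approx` and the regular expression are assumptions, not part of this commit, which edited the three lines by hand.

```python
import re

# Hypothetical helper, not taken from the repository.
# Matches a single tilde that directly precedes a digit and is not itself
# preceded by another tilde or a word character, so genuine "~~strike~~"
# markup and tildes inside identifiers are left alone.
_APPROX_TILDE = re.compile(r"(?<![~\w])~(?=\d)")


def tildes_to_approx(markdown: str) -> str:
    """Replace 'approximately' tildes (e.g. "~12%") with the approx symbol."""
    return _APPROX_TILDE.sub("\u2248", markdown)  # U+2248 is the approx symbol


if __name__ == "__main__":
    print(tildes_to_approx("on GSM8K (~12%), trained on ~117K problems"))
```

The lookbehind keeps the rule conservative: it only touches tildes that read as "approximately" in front of a number.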

  1. README.md +3 -3
README.md CHANGED
@@ -55,7 +55,7 @@ Evaluated on the **full GSM8K test set** (1,319 problems), 3 training seeds:
  | metric | GSM8K test |
  |---|---|
  | **accuracy (symbolic verifier)** | **11.8%** mean — 12.6% best seed |
- | accuracy (plurality vote, no verifier) | ~9.3% |
+ | accuracy (plurality vote, no verifier) | ≈9.3% |
  | trainable parameters | **9.37M** |
  | LLM used at inference | **none** |

@@ -71,7 +71,7 @@ stable across seeds on the test set (range 11.1–12.6%).
  - **Self-consistency + free verifier.** 96 sampled programs are scored by a 0-parameter
  symbolic verifier (number-coverage, magnitude sanity, intermediate-value sanity), tie-broken
  by vote frequency.
- - **Data is the main lever.** Trained on real GSM8K-train plus ~117K LLM-generated
+ - **Data is the main lever.** Trained on real GSM8K-train plus ≈117K LLM-generated
  GSM8K-style problems (Claude + Gemini). What mattered most was **matching the real GSM8K
  step-distribution** and **rigorous decontamination** (0% test overlap), not raw data volume
  or model size — a deeper/bigger model did not help beyond noise.
@@ -102,7 +102,7 @@ dependencies beyond `mlx` and `numpy`.
  ## Limitations

  This is a research model demonstrating how far a tiny, LLM-free, from-scratch solver can go
- on GSM8K (~12%). It handles 1–4 step arithmetic word problems with common operations; it
+ on GSM8K (≈12%). It handles 1–4 step arithmetic word problems with common operations; it
  misses many multi-step problems that require deeper reading comprehension. It is not a
  general math model and should not be used as one.

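
The README text in this diff contrasts two ways of choosing a final answer from the sampled programs: plurality vote, and scoring by a symbolic verifier with ties broken by vote frequency. The sketch below illustrates that selection logic only. The function names, the use of the maximum score per distinct answer, and the toy scores are assumptions for illustration; the repository's actual verifier (number-coverage, magnitude sanity, intermediate-value sanity) is not reproduced here.

```python
from collections import Counter
from typing import Sequence


def plurality_vote(answers: Sequence[float]) -> float:
    """Return the most frequent answer among the sampled programs."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    return Counter(answers).most_common(1)[0][0]


def select_with_verifier(answers: Sequence[float], scores: Sequence[float]) -> float:
    """Return the answer with the best verifier score; break ties by vote count.

    `scores[i]` is a hypothetical verifier score for the program that
    produced `answers[i]` (higher is better).
    """
    if not answers or len(answers) != len(scores):
        raise ValueError("answers and scores must be non-empty and equal length")
    votes = Counter(answers)
    # Assumption: a distinct answer is as good as its best-scoring program.
    best_score: dict[float, float] = {}
    for answer, score in zip(answers, scores):
        best_score[answer] = max(score, best_score.get(answer, float("-inf")))
    # Sort key: verifier score first, vote frequency as the tie-breaker.
    return max(best_score, key=lambda a: (best_score[a], votes[a]))


if __name__ == "__main__":
    sampled = [18.0, 18.0, 20.0, 20.0, 20.0, 7.0]
    toy_scores = [3.0, 3.0, 2.0, 2.0, 2.0, 3.0]  # made-up scores
    print(plurality_vote(sampled), select_with_verifier(sampled, toy_scores))
```

In the toy input the two strategies disagree: the most common answer has a lower verifier score, which is the kind of case where verifier-based selection and plurality voting can produce different accuracies.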