nano-g2p

Spells out how a word is pronounced: though → DH OW, tough → T AH F, through → TH R UW, cough → K AA F. English spelling is famously not phonetic (though/tough/through/cough/rough/bough: six words and five different sounds for -ough, since only tough and rough rhyme), so no simple letter-to-sound rule covers it; pronunciation has to come from a lexicon. nano-g2p is a ~1M-parameter (1,016,960) byte-level transformer that learns grapheme→phoneme patterns from such a lexicon and generalises to words it never saw.

Code, benchmark, tests, technical report: https://github.com/vukrosic/nano-g2p
Runs on CPU in milliseconds. No tokenizer file — raw UTF-8 bytes. ARPAbet, stress stripped.
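
To illustrate what "raw UTF-8 bytes" means in practice, here is a minimal sketch of a byte-level encoder. The function names `encode_word` and `decode_bytes` are hypothetical and are not taken from the repo; any separator or special tokens the real model adds around the word are not shown, because the card does not describe them.

```python
# Hypothetical helpers, for illustration only: the actual input format
# (separators, special tokens) used by nano-g2p is not documented here.

def encode_word(word: str) -> list[int]:
    """Map a word to its UTF-8 byte values (each in the range 0..255)."""
    return list(word.encode("utf-8"))


def decode_bytes(ids: list[int]) -> str:
    """Inverse mapping; invalid byte sequences are replaced, not raised."""
    return bytes(ids).decode("utf-8", errors="replace")


ascii_ids = encode_word("though")   # one byte per ASCII letter
accent_ids = encode_word("naïve")   # "ï" takes two bytes in UTF-8
```

Because every possible input is a sequence of values below 256, there is no vocabulary file to ship, which is why no tokenizer file is needed.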

Benchmark (held-out words, N=2000 — unseen in training)

system	exact-match	phoneme-acc
nano-g2p model	72.1%	92.7%
per-letter script	8.4%	54.7%
nearest-word lookup	1.8%	65.6%

The model beats both scripts by a wide margin on words it never saw (72.1% exact-match against 8.4% and 1.8%), which suggests it learned generalisable pronunciation patterns rather than a memorised dictionary or a fuzzy match. It is the cleanest must-beat-a-script result in the portfolio.
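
The two benchmark columns can be sketched as follows. This is an assumption about how they are computed, not the repo's benchmark code: exact-match is taken to mean the whole predicted phoneme sequence equals the reference, and phoneme-acc is taken to be one minus the total phoneme-level edit distance divided by the total reference length. The function names `edit_distance` and `score` are hypothetical.

```python
# Assumed metric definitions; the repo's benchmark script may differ.

def edit_distance(a, b) -> int:
    """Levenshtein distance between two sequences (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution or match
        prev = cur
    return prev[-1]


def score(preds: list[str], refs: list[str]) -> tuple[float, float]:
    """Return (exact-match rate, phoneme accuracy) over paired strings."""
    exact = errors = total = 0
    for p, r in zip(preds, refs):
        p_ph, r_ph = p.split(), r.split()
        exact += p_ph == r_ph
        errors += edit_distance(p_ph, r_ph)
        total += len(r_ph)
    return exact / len(refs), max(0.0, 1.0 - errors / total)


em, pa = score(["DH OW", "T AH F"], ["DH OW", "T AO F"])
```

In this toy call one of two words is fully correct and one of five reference phonemes is wrong, which shows how phoneme accuracy can stay high while exact-match drops.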

Usage

pip install torch safetensors numpy
# grab modeling_nano_g2p.py + config.json from the GitHub repo

from modeling_nano_g2p import load, pronounce
m = load("model.safetensors", "config.json")
pronounce(m, "though")   # -> "DH OW"
pronounce(m, "tough")    # -> "T AH F"

How it was trained

Frozen public lexicon: top common English words (wordfreq) with their CMU pronunciation (cmudict), stress stripped, ~35k words; 2,000 held out as an unseen test set. SFT, prompt masked. ~1M-param byte-level transformer (RMSNorm, RoPE, GQA, SwiGLU), 20k steps, AdamW, cosine LR. Full recipe and reproduction in the GitHub repo.
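
The data preparation described above (strip stress, hold out a test set) might look roughly like the sketch below. CMUdict marks stress with a digit on each vowel, for example `OW1`, so stripping stress amounts to removing those digits. The function names, the seed, and the shuffle-then-slice split are assumptions for illustration; the repo's actual recipe may select the held-out words differently.

```python
import random
import re


def strip_stress(pron: str) -> str:
    """Remove ARPAbet stress digits: 'DH OW1' -> 'DH OW'."""
    return " ".join(re.sub(r"\d", "", ph) for ph in pron.split())


def split_lexicon(lexicon: dict[str, str], n_test: int, seed: int = 0):
    """Hold out n_test words so that no test word appears in training."""
    words = sorted(lexicon)                 # sort first for determinism
    random.Random(seed).shuffle(words)      # seed value is an assumption
    test_words = set(words[:n_test])
    train = {w: p for w, p in lexicon.items() if w not in test_words}
    test = {w: lexicon[w] for w in test_words}
    return train, test


# Toy lexicon standing in for the ~35k-word list.
toy = {"though": "DH OW1", "tough": "T AH1 F", "through": "TH R UW1"}
clean = {w: strip_stress(p) for w, p in toy.items()}
train, test = split_lexicon(clean, n_test=1)
```

Splitting by word rather than by training example is what makes the benchmark a test of generalisation: every held-out word is unseen at training time.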

MIT. Built by Vuk Rosić. Pronunciations derived from CMUdict (BSD-style license).

Safetensors. Model size: 1.02M params. Tensor type: F32.