nano-g2p

Spells out how a word is pronounced: though → DH OW, tough → T AH F, through → TH R UW, cough → K AA F. English spelling is famously not phonetic (though/tough/through/cough/rough/bough: six words, five different sounds for -ough, since only tough and rough rhyme), so no simple letter-to-sound rule does this; the pronunciations have to be learned from a lexicon. nano-g2p is a ~1M-parameter (1,016,960) byte-level transformer that learns grapheme→phoneme patterns and generalises to words it never saw.
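As a rough illustration of what "byte-level" means on the input side, the sketch below turns a word into its UTF-8 byte values. It assumes the model uses raw byte values as token ids; the real vocabulary, special tokens and prompt format live in the repo and are not reproduced here. `word_to_bytes` is a hypothetical helper name, not part of `modeling_nano_g2p.py`.

```python
def word_to_bytes(word: str) -> list[int]:
    """Encode a word as a list of UTF-8 byte values (each in 0..255)."""
    return list(word.encode("utf-8"))


# ASSUMPTION: raw bytes as token ids; the repo may add special tokens.
print(word_to_bytes("though"))  # [116, 104, 111, 117, 103, 104]
```

A byte vocabulary needs no tokenizer training and can represent any input string, which suits a model this small.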

Benchmark (held-out words, N=2000 — unseen in training)

| System | Exact-match | Phoneme acc. |
|---|---|---|
| nano-g2p model | 72.1% | 92.7% |
| per-letter script | 8.4% | 54.7% |
| nearest-word lookup | 1.8% | 65.6% |
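The two metrics can be read as follows: exact-match counts a word as correct only when the whole predicted phoneme sequence equals the reference, while phoneme accuracy gives partial credit. This card does not define phoneme accuracy precisely. The sketch below assumes it is 1 minus the Levenshtein distance divided by the reference length, floored at 0, which is one common choice; the repo's evaluation script may differ. All function names are hypothetical.

```python
def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1]


def score(refs: list[str], hyps: list[str]) -> tuple[float, float]:
    """Return (exact_match, phoneme_acc) for space-separated phoneme strings."""
    exact = 0
    acc_sum = 0.0
    for ref, hyp in zip(refs, hyps):
        r, h = ref.split(), hyp.split()
        exact += int(r == h)
        # ASSUMPTION: accuracy = 1 - distance / reference length, floored at 0.
        acc_sum += max(0.0, 1.0 - edit_distance(r, h) / max(len(r), 1))
    n = len(refs)
    return exact / n, acc_sum / n


exact, acc = score(["DH OW", "T AH F"], ["DH OW", "T AO F"])
print(f"exact-match {exact:.3f}, phoneme-acc {acc:.3f}")
# exact-match 0.500, phoneme-acc 0.833
```

One wrong vowel makes a word count as an exact-match failure while still earning most of its phoneme credit, which is why the two columns can diverge so much.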

The model beats both baseline scripts by a wide margin on words it never saw, which suggests it learned generalisable pronunciation patterns rather than a memorised dictionary or a fuzzy match. It is the cleanest must-beat-a-script result in the portfolio.
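The card does not spell out the two baselines. The following is a minimal sketch of what a per-letter script and a nearest-word lookup might look like, using a tiny made-up letter table and a two-word toy lexicon purely for illustration; the repo's actual scripts may work differently. `LETTER_TO_PHONE`, `LEXICON`, `per_letter` and `nearest_word` are hypothetical names.

```python
import difflib

# ASSUMPTION: toy, incomplete letter-to-phoneme table for illustration only.
LETTER_TO_PHONE = {"d": "D", "t": "T", "o": "OW", "u": "AH",
                   "g": "G", "h": "HH", "r": "R", "f": "F"}

# ASSUMPTION: toy lexicon standing in for the training words.
LEXICON = {"tough": "T AH F", "though": "DH OW"}


def per_letter(word: str) -> str:
    """Map each letter independently; letters missing from the table are skipped."""
    return " ".join(LETTER_TO_PHONE[c] for c in word if c in LETTER_TO_PHONE)


def nearest_word(word: str) -> str:
    """Return the pronunciation of the most similarly spelled known word."""
    match = difflib.get_close_matches(word, list(LEXICON), n=1, cutoff=0.0)
    return LEXICON[match[0]]


print(per_letter("dough"))    # D OW AH G HH
print(nearest_word("dough"))  # T AH F  (copied from "tough")
```

Both outputs are wrong for "dough" (D OW): the per-letter script voices silent letters, and the lookup copies a whole pronunciation from a word that is spelled similarly but sounds different.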

Usage

# shell
pip install torch safetensors numpy

# Python: grab modeling_nano_g2p.py + config.json from the GitHub repo
from modeling_nano_g2p import load, pronounce
m = load("model.safetensors", "config.json")
pronounce(m, "though")   # -> "DH OW"
pronounce(m, "tough")    # -> "T AH F"

How it was trained

Data: a frozen public lexicon of the most common English words (wordfreq) paired with their CMU pronunciations (cmudict), stress markers stripped; about 35k words, with 2,000 held out as an unseen test set. Training: SFT with the prompt masked, so only the phoneme output contributes to the loss. Model: ~1M-param byte-level transformer (RMSNorm, RoPE, GQA, SwiGLU), trained for 20k steps with AdamW and a cosine LR schedule. Full recipe and reproduction in the GitHub repo.
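Three pieces of the recipe above can be sketched in plain Python: stripping CMUdict stress digits, masking the prompt so only the pronunciation is scored, and a cosine learning-rate schedule. This is a sketch under stated assumptions, not the repo's training code: the tab separator, the absence of an end-of-sequence token, the peak learning rate of 1e-3 and the lack of warmup are all made up for illustration, and every function name is hypothetical. The only figure taken from the card is the 20k step count.

```python
import math
import re

IGNORE = -100  # default ignore_index of PyTorch's cross-entropy loss


def strip_stress(pron: str) -> str:
    """Remove CMUdict stress digits, e.g. 'DH OW1' -> 'DH OW'."""
    return re.sub(r"\d", "", pron)


def build_example(word: str, pron: str, sep: bytes = b"\t") -> tuple[list[int], list[int]]:
    """Build byte-level (input_ids, labels) with the prompt positions masked.

    ASSUMPTION: prompt = word bytes + separator; target = phoneme string bytes.
    Shifting labels by one position for next-token prediction is left to the
    training loop.
    """
    prompt = list(word.encode("utf-8") + sep)
    target = list(strip_stress(pron).encode("utf-8"))
    input_ids = prompt + target
    labels = [IGNORE] * len(prompt) + target  # no loss on the prompt bytes
    return input_ids, labels


def cosine_lr(step: int, total: int = 20_000, peak: float = 1e-3) -> float:
    """Cosine decay from `peak` at step 0 to 0 at step `total` (no warmup)."""
    return 0.5 * peak * (1.0 + math.cos(math.pi * step / total))


ids, labels = build_example("though", "DH OW1")
print(len(ids), labels[:7])          # 12 [-100, -100, -100, -100, -100, -100, -100]
print(bytes(labels[7:]).decode())    # DH OW
```

Masking the prompt keeps the model from spending capacity on predicting the spelling it is given, so the loss reflects only the grapheme→phoneme mapping.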

MIT. Built by Vuk Rosić. Pronunciations derived from CMUdict (BSD-style license).
