nano-spell

Corrects a misspelled word to its intended word: recieve → receive, teh → the, freind → friend, thier → their. Which real word a typo most likely meant is information that lives only in a frequency-weighted lexicon, so the obvious script (nearest dictionary word by edit distance) guesses wrong whenever the typo lands at least as close to a different word. The model is a ~1M-parameter (1,016,960) byte-level transformer, trained entirely on code-generated data.

Code, benchmark, tests, technical report: https://github.com/vukrosic/nano-spell
Runs on CPU in milliseconds. No tokenizer file: the model reads raw UTF-8 bytes.
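
As a rough illustration of what "raw UTF-8 bytes" means as model input, the sketch below maps a string to integer IDs in the range 0..255 and back. The helper names `to_byte_ids` and `from_byte_ids` are hypothetical and not taken from the repo; any prompt framing, separators, or padding the real model uses is not shown here.

```python
def to_byte_ids(text: str) -> list[int]:
    """Map a string to its UTF-8 byte values (each in 0..255)."""
    return list(text.encode("utf-8"))


def from_byte_ids(ids: list[int]) -> str:
    """Inverse of to_byte_ids: rebuild the string from byte values."""
    return bytes(ids).decode("utf-8")


print(to_byte_ids("teh"))         # [116, 101, 104]
print(len(to_byte_ids("Rosić")))  # 6: the non-ASCII "ć" takes two bytes
```

Because every possible input is already a sequence of byte values, no vocabulary or merges file has to ship with the weights.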

Benchmark (held-out, seed 987654321, N=4000)

	model	identity	nearest-naive	nearest-freq
overall	86.8%	16.8%	68.0%	73.7%
hard slice (naive wrong, N=1280)	68.7%	—	0.0%	29.8%

Unlike pure lexicon-recovery tasks (where a frequency dictionary ties the model), nano-spell beats both lookup scripts, including the frequency dictionary. On the hard slice (the typos a nearest-dictionary lookup gets wrong) the model recovers 68.7% where the frequency dictionary manages only 29.8%. This is a clean must-beat-a-script result with no asterisk. Out-of-vocab words: 14%. The model's prior is the ~480-word training vocabulary, and this limit is reported, not hidden.
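
For readers who want a feel for the two lookup baselines, here is a minimal sketch. It assumes plain Levenshtein distance, alphabetical tie-breaking for the naive script, and frequency tie-breaking for the frequency script; the repo's actual baselines may differ. The tiny `LEXICON` and its counts are made up for illustration.

```python
# Hypothetical toy lexicon: word -> frequency count (invented numbers).
LEXICON = {"the": 1000, "their": 200, "then": 150, "than": 120, "thin": 10, "thief": 5}


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance (insert, delete, substitute), row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def nearest_naive(typo: str, lexicon: dict) -> str:
    """Closest word by edit distance; ties go to the alphabetically first word."""
    return min(sorted(lexicon), key=lambda w: edit_distance(typo, w))


def nearest_freq(typo: str, lexicon: dict) -> str:
    """Closest word by edit distance; ties go to the most frequent word."""
    return min(lexicon, key=lambda w: (edit_distance(typo, w), -lexicon[w]))


def hard_slice(pairs, lexicon):
    """Keep only the (typo, target) pairs that the naive lookup gets wrong."""
    return [(t, y) for t, y in pairs if nearest_naive(t, lexicon) != y]


print(nearest_naive("thn", LEXICON))  # than
print(nearest_freq("thn", LEXICON))   # the
```

In this toy lexicon "thn" is one edit away from "the", "than", "then", and "thin", so edit distance alone cannot choose between them; frequency breaks the tie, and where frequency also points the wrong way both scripts fail.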

Usage

pip install torch safetensors numpy
# grab modeling_nano_spell.py + config.json from the GitHub repo

from modeling_nano_spell import load, correct
m = load("model.safetensors", "config.json")
correct(m, "recieve")   # -> "receive"
correct(m, "teh")       # -> "the"
correct(m, "world")     # -> "world"  (already correct, left alone)

How it was trained

Code-generated data: sample a real word from a fixed, frequency-ordered ~480-word vocabulary (Zipf-weighted), inject 1–2 realistic typos (delete / insert / keyboard-neighbour / transpose / double), and keep ~15% of examples as identity (the word left unchanged). Labels are correct by construction. Training is SFT with the prompt masked, on a ~1M-param byte-level transformer (RMSNorm, RoPE, GQA, SwiGLU), for 12k steps with AdamW and a cosine LR schedule. Full recipe and reproduction are in the GitHub repo.
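
The data recipe above can be sketched roughly as follows. This is an illustration under stated assumptions, not the repo's generator: `VOCAB` is a five-word stand-in for the real ~480-word list, `KEY_NEIGHBOURS` is a partial, made-up keyboard map, the Zipf weight is taken as 1/rank, and all function names are hypothetical.

```python
import random
import string

# Stand-in vocabulary, most frequent first (the real list has ~480 words).
VOCAB = ["the", "their", "friend", "receive", "world"]
# Partial, illustrative map of adjacent keys; unmapped letters fall back to any letter.
KEY_NEIGHBOURS = {"e": "wrsd", "i": "uojk", "o": "ipkl", "r": "etdf", "t": "rygf"}
LETTERS = string.ascii_lowercase


def zipf_sample(rng: random.Random, vocab: list) -> str:
    """Pick a word with probability proportional to 1 / rank."""
    weights = [1.0 / rank for rank in range(1, len(vocab) + 1)]
    return rng.choices(vocab, weights=weights, k=1)[0]


def inject_typo(rng: random.Random, word: str) -> str:
    """Apply one random edit at a random position (may occasionally be a no-op)."""
    i = rng.randrange(len(word))
    op = rng.choice(["delete", "insert", "neighbour", "transpose", "double"])
    if op == "delete" and len(word) > 1:
        return word[:i] + word[i + 1:]
    if op == "insert":
        return word[:i] + rng.choice(LETTERS) + word[i:]
    if op == "neighbour":
        return word[:i] + rng.choice(KEY_NEIGHBOURS.get(word[i], LETTERS)) + word[i + 1:]
    if op == "transpose" and i + 1 < len(word):
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    if op == "double":
        return word[:i] + word[i] * 2 + word[i + 1:]
    return word  # chosen edit not applicable at this position


def make_example(rng: random.Random, vocab: list, identity_rate: float = 0.15):
    """Return (input, target); the target is the sampled word, so the label is correct by construction."""
    target = zipf_sample(rng, vocab)
    if rng.random() < identity_rate:
        return target, target  # identity example: already-correct word
    typo = target
    for _ in range(rng.randint(1, 2)):  # 1 or 2 typos
        typo = inject_typo(rng, typo)
    return typo, target


rng = random.Random(0)
for _ in range(5):
    print(make_example(rng, VOCAB))
```

Because the clean word is sampled first and corrupted afterwards, no human labelling is needed, and the identity examples are what teach the model to leave correct words alone.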

MIT. Built by Vuk Rosić.

Safetensors. Model size: 1.02M params. Tensor type: F32.