nano-proofread

Fixes the writing errors a spell-checker can't see: their going to win → they're going to win, its raining again → it's raining again, the the cat sat → the cat sat. The mistakes are real words (their, there, and they're are all spelled correctly), so a spell-checker stays silent; which one is right depends on the surrounding words. nano-proofread is a ~1M-parameter (1,016,960) byte-level transformer that reads the context and picks the right form.

Scope (a fixed confusion set, not general grammar): their/there/they're, your/you're, its/it's, then/than, to/too, could have/could of, and doubled words.

Code, benchmark, tests, technical report: https://github.com/vukrosic/nano-proofread
Runs on CPU in milliseconds. No tokenizer file: the model reads raw UTF-8 bytes.
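
To make "raw UTF-8 bytes" concrete, here is a minimal sketch of byte-level encoding. The helper names `encode` and `decode` are hypothetical, and the shipped model may reserve extra ids (for example, a separator or padding id) that this sketch does not model.

```python
def encode(text: str) -> list[int]:
    """Map text to one integer id (0-255) per UTF-8 byte. Hypothetical helper."""
    return list(text.encode("utf-8"))


def decode(ids: list[int]) -> str:
    """Inverse of encode; invalid byte sequences are replaced, not raised."""
    return bytes(ids).decode("utf-8", errors="replace")


ids = encode("it's raining")
print(len(ids), ids[:4])   # 12 [105, 116, 39, 115]
print(decode(ids))         # it's raining
```

Because every possible input maps onto at most 256 byte values, there is no vocabulary file to ship and no out-of-vocabulary case; the cost is longer sequences than a subword tokenizer would produce.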

Benchmark

	model	best context-free script
overall (held-out, N=4000)	100.0%	49.2%
context slice (N=2030)	100.0%	0.0%
out-of-distribution (N=25)	92.0%	36.0%

The script scores 0% on the context slice by construction: it can only emit its default member, which is wrong exactly where context decides. The number that matters is the last row: on 25 natural phrases matching no training template, the model beats the script by 56 points (92.0% vs. 36.0%), evidence that it learned the grammatical cue rather than memorising sentences. (An earlier 14-template version scored 99% on a same-template split but failed on real phrases; the frame-based generator and this OOD test are what keep the result honest.)
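
For intuition about what a context-free script can and cannot do, here is a minimal sketch. It is not the benchmark's actual baseline: the choice of default member per group (the first one listed) and the name `context_free_fix` are assumptions made for illustration.

```python
import re

# Confusion groups from the Scope section. The "default" member is simply the
# first one listed; that choice is an assumption made for this sketch.
GROUPS = [
    ("their", "there", "they're"),
    ("your", "you're"),
    ("its", "it's"),
    ("then", "than"),
    ("to", "too"),
]
DEFAULT = {word: group[0] for group in GROUPS for word in group}

WORD = re.compile(r"[A-Za-z']+")
DOUBLED = re.compile(r"\b(\w+)(?:\s+\1\b)+", flags=re.IGNORECASE)


def context_free_fix(text: str) -> str:
    """Apply fixed rules that never look at the neighbouring words."""
    text = DOUBLED.sub(r"\1", text)                  # the the -> the
    text = text.replace("could of", "could have")    # always safe to rewrite
    # Every confusion word becomes its group's default, whatever the context.
    return WORD.sub(lambda m: DEFAULT.get(m.group(0), m.group(0)), text)


print(context_free_fix("the the cat sat"))      # the cat sat
print(context_free_fix("I could of gone"))      # I could have gone
print(context_free_fix("their going to win"))   # their going to win (still wrong)
```

The last call shows the failure mode: the rule cannot tell that "their" should be "they're" here, because only the following words carry that information. The table's figures are also consistent with this picture: (4000 − 2030) / 4000 = 49.25%, which matches the script's reported 49.2% up to rounding, as would be expected if the script is right on essentially every example outside the context slice and on none inside it.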

Usage

pip install torch safetensors numpy
# grab modeling_nano_proofread.py + config.json from the GitHub repo

from modeling_nano_proofread import load, proofread
m = load("model.safetensors", "config.json")
proofread(m, "their going to win")   # -> "they're going to win"
proofread(m, "its raining again")    # -> "it's raining again"

How it was trained

The training data is 100% code-generated and correct by construction: build a correct phrase from ~65 grammatical frames with rich fillers, then inject one error (swap the confusion word, or double a word); ~15% of examples are identity pairs with no error. Training is SFT with the prompt masked out of the loss. The model is a ~1M-param byte-level transformer (RMSNorm, RoPE, GQA, SwiGLU) trained for 24k steps with AdamW and a cosine LR schedule. Full recipe and reproduction in the GitHub repo.
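
The recipe above can be sketched as follows. This is an illustrative toy, not the repo's generator: the frames, fillers, the even split between swap and doubling errors, the newline separator, and the names `make_pair` and `to_example` are all assumptions.

```python
import random

# Wrong alternatives for each correct confusion word (subset, for illustration).
CONFUSION = {
    "they're": ["their", "there"],
    "their": ["there", "they're"],
    "it's": ["its"],
    "its": ["it's"],
    "than": ["then"],
}
# Hypothetical frames: (template, correct confusion word for the {w} slot).
FRAMES = [
    ("{w} going to {verb}", "they're"),
    ("{w} {noun} is {adj}", "their"),
    ("{w} {adj} outside again", "it's"),
    ("the {noun} lost {w} way", "its"),
    ("my {noun} is more {adj} {w} yours", "than"),
]
FILLERS = {
    "verb": ["win", "leave", "sing"],
    "noun": ["dog", "team", "car"],
    "adj": ["cold", "quiet", "bright"],
}


def make_pair(rng: random.Random, p_identity: float = 0.15) -> tuple[str, str]:
    """Return (source, target); the target is correct by construction."""
    frame, right = rng.choice(FRAMES)
    slots = {name: rng.choice(options) for name, options in FILLERS.items()}
    target = frame.format(w=right, **slots)
    roll = rng.random()
    if roll < p_identity:                      # identity pair: nothing to fix
        return target, target
    if roll < p_identity + (1 - p_identity) / 2:
        # Swap in a wrong member of the same confusion group.
        return frame.format(w=rng.choice(CONFUSION[right]), **slots), target
    words = target.split()                     # otherwise double one word
    i = rng.randrange(len(words))
    words.insert(i, words[i])
    return " ".join(words), target


def to_example(source: str, target: str) -> tuple[list[int], list[int]]:
    """Byte ids plus a loss mask that is 0 on the prompt and 1 on the target."""
    prompt = source.encode("utf-8") + b"\n"    # separator byte is an assumption
    completion = target.encode("utf-8")
    return list(prompt + completion), [0] * len(prompt) + [1] * len(completion)


rng = random.Random(0)
src, tgt = make_pair(rng)
ids, loss_mask = to_example(src, tgt)
print(repr(src), "->", repr(tgt), "| supervised bytes:", sum(loss_mask))
```

Starting from a correct phrase and corrupting it means every label is right without human annotation, and masking the prompt means the loss only rewards producing the corrected text, not reproducing the erroneous input.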

MIT. Built by Vuk Rosić.

Weights: Safetensors, 1.02M params, F32.