nrl-ai/vn-spell-correction-small — Vietnamese spell correction (BARTpho-syllable fine-tune)

Fixes typos, missed accents, and OCR-style char errors in Vietnamese text in one pass: Toi yu Vit Nam → Tôi yêu Việt Nam. Strictly more than diacritic restoration — handles letter-level mistakes, missing / extra characters, and OCR substitutions like o↔0, l↔1, m↔rn.

Fine-tuned from vinai/bartpho-syllable-base on the nrl-ai/vn-spell-correction-train corpus (459K (noisy, clean) Vietnamese pairs synthesized from a register-balanced Wiki+news mix via nom.text.noise).

Adoption gate: ✅ passed (light avg 0.9459, heavy avg 0.9237).

Quick start

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tok = AutoTokenizer.from_pretrained("nrl-ai/vn-spell-correction-small")
model = AutoModelForSeq2SeqLM.from_pretrained("nrl-ai/vn-spell-correction-small").eval()

text = "Toi yu Vit Nam"
out = model.generate(**tok(text, return_tensors="pt"), max_length=256)
print(tok.decode(out[0], skip_special_tokens=True))
# Tôi yêu Việt Nam

For batched inference (recommended for high-throughput pipelines):

from nom.text.diacritic_models import HFDiacriticModel
restorer = HFDiacriticModel(model_id="nrl-ai/vn-spell-correction-small")
fixed = restorer.predict_batch(noisy_sentences, batch_size=16)
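If the nom helper isn't available, the same batching can be sketched with plain transformers (an illustration, not the library's implementation; max_length=256 mirrors the truncation limit noted under Limitations):

```python
def chunked(seq, size):
    """Yield successive size-d slices of seq."""
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

def correct_batch(sentences,
                  model_id="nrl-ai/vn-spell-correction-small",
                  batch_size=16,
                  max_length=256):
    """Batched spell correction with plain transformers."""
    # Imports kept local so chunked() above stays usable without torch installed.
    import torch
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id).eval()
    fixed = []
    for batch in chunked(sentences, batch_size):
        enc = tok(batch, return_tensors="pt", padding=True,
                  truncation=True, max_length=max_length)
        with torch.no_grad():
            out = model.generate(**enc, max_length=max_length)
        fixed.extend(tok.batch_decode(out, skip_special_tokens=True))
    return fixed
```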

Evaluation — 8-split spell-correction grid

Evaluation uses nrl-ai/vn-spell-correction-eval (2,098 pairs across 4 registers x 2 noise levels). Word accuracy after NFC + punctuation normalization on both sides.
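The headline metric can be sketched as follows (the exact normalization is an assumption: NFC via unicodedata, ASCII punctuation stripped on both sides, then position-wise word comparison):

```python
import string
import unicodedata

def normalize(text):
    """NFC-normalize, drop punctuation, and split into words."""
    text = unicodedata.normalize("NFC", text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    return text.split()

def word_accuracy(pred, ref):
    """Fraction of reference word positions the prediction matches."""
    p, r = normalize(pred), normalize(ref)
    if not r:
        return 1.0
    hits = sum(a == b for a, b in zip(p, r))
    return hits / len(r)
```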

| Register | Sents | Word acc | Sent exact | Mean ms/sent |
|---|---:|---:|---:|---:|
| Modern business / news (light) | 44 | 98.74 % | 81.82 % | 69 |
| Formal / legal-prose (light) | 65 | 92.34 % | 78.46 % | 125 |
| Conversational (light) | 179 | 92.42 % | 71.51 % | 50 |
| Classical literary (light) | 608 | 94.86 % | 61.18 % | 79 |
| Modern business / news (heavy / OCR) | 55 | 97.43 % | 69.09 % | 65 |
| Formal / legal-prose (heavy / OCR) | 72 | 89.12 % | 66.67 % | 120 |
| Conversational (heavy / OCR) | 287 | 92.47 % | 66.90 % | 48 |
| Classical literary (heavy / OCR) | 788 | 90.46 % | 44.67 % | 74 |

Each split corresponds to a (register, noise level) combination:

  • light noise — ~5 % char-level edit distance, models a person typing Vietnamese on a keyboard with a few accent slips and the occasional fat-finger.
  • heavy noise — ~15-20 % edit distance, models OCR output of a mid-quality scan with diacritic drops + char confusions.
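The edit-distance ratios above can be checked with a standard Levenshtein implementation (a sketch; the corpus generator's exact definition isn't published here):

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def edit_ratio(noisy, clean):
    """Char-level edit distance relative to the clean string's length."""
    return levenshtein(noisy, clean) / max(len(clean), 1)
```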

How we compare

Where this model sits in the public Vietnamese spell-correction landscape — same 8-split grid for every measured row.

| Model | Family | Params | License | light avg | heavy avg |
|---|---|---|---|---:|---:|
| nrl-ai/vn-spell-correction-small (this model) | ours | ? | Apache 2.0 | 94.59 | 92.37 |
| bmd1905/vietnamese-correction-v2 | public | 400 M | Apache 2.0 | 86.69 | 72.62 |
| iAmHieu2012/vit5-vietnamese-spelling-correction | public | 220 M | MIT | 80.72 | 56.55 |

The two averaged columns:

  • light avg = mean word accuracy across the 4 light-noise splits.
  • heavy avg = mean word accuracy across the 4 heavy-noise splits.
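The averaged columns can be reproduced directly from the 8-split table above:

```python
# Word accuracies from the 8-split grid, in table order.
light = [98.74, 92.34, 92.42, 94.86]   # light-noise splits
heavy = [97.43, 89.12, 92.47, 90.46]   # heavy / OCR splits

light_avg = sum(light) / len(light)    # 94.59
heavy_avg = sum(heavy) / len(heavy)    # 92.37
```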

Training

Intended use

  • Recommended: cleaning up noisy Vietnamese text — OCR output, user-generated text from non-VN-IME keyboards, form data with typos, social-media short-form. Strictly harder than diacritic restoration but covers it as a subset.
  • Not recommended: text generation, classification, sentiment, NER, or any task whose input distribution differs from noisy Vietnamese text.

Limitations

  • Synthetic metrics are in-distribution; real-world input is harder, and we measured the gap. Training and eval both use nom.text.noise, so the synthetic 8-split numbers above measure how well we invert our own noise generator. We also benchmark on a 150-sentence hand-curated OOD eval (6 registers, bootstrap 95 % CI):

    | Slice | this model | Toshiiiii1 (public) | bmd1905 (public) |
    |---|---:|---:|---:|
    | forum_25 | 59.45 % | 60.11 % | 59.02 % |
    | mobile_25 | 95.01 % | 96.95 % | 88.09 % |
    | telex_real_25 | 17.38 % | 18.54 % | 11.58 % |
    | ocr_25 | 93.62 % | 94.22 % | 47.42 % |
    | legal_real_25 | 95.09 % | 93.80 % | 54.90 % |
    | news_real_25 | 96.54 % | 94.07 % | 30.62 % |
    | Aggregate (n=150) | 77.43 % | 77.40 % | 49.21 % |

    Synthetic light_avg is 98.58 %, real-world aggregate is 77.43 %. The 21 pp gap is the cost of training only on light/telex_typo/heavy noise — those capture the surface of typos but not real Telex keystroke artefacts (dduwojc for được) or forum-style abbreviations (ko bt for không biết). On OOD this model ties with Toshiiiii1 (77.43 vs 77.40, within bootstrap CI). v0.2.29 retraining on the v2 multi-source corpus + comprehensive_noise() (which adds telex_grammar_noise() + mobile_noise()) is queued and targets a clear OOD lead, not just a synthetic one.

  • Heavy-noise corner cases. OCR outputs that drop entire words or add hallucinated text are out-of-scope; the noise generator we trained on caps edits per sentence (max 25 % edit ratio).

  • Long sequences truncate at 256 sub-word tokens. Split paragraphs at sentence boundaries before calling.

  • No grammar or stylistic correction. This model fixes character / syllable / diacritic errors but doesn't rewrite phrasing.

  • Confidence intervals on small splits. business_55 (44/55 sents) and formal_72 (65/72 sents) have ±3-4 pp 95 % CI; the larger literary_800 split has ±1 pp. Treat single-pp differences with care.
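For the 256-token truncation limit above, a minimal pre-chunking sketch (the regex sentence splitter and greedy packing are illustrative assumptions, not part of the model's tooling; pass the model's own tokenizer as count_tokens for accurate budgets):

```python
import re

def split_sentences(paragraph):
    """Naive sentence split on ., !, ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]

def chunk_by_tokens(sentences, count_tokens, budget=256):
    """Greedily pack sentences into chunks that stay under the token budget."""
    chunks, cur, cur_len = [], [], 0
    for s in sentences:
        n = count_tokens(s)
        if cur and cur_len + n > budget:
            chunks.append(" ".join(cur))
            cur, cur_len = [], 0
        cur.append(s)
        cur_len += n
    if cur:
        chunks.append(" ".join(cur))
    return chunks
```

Each chunk can then be corrected independently and the outputs re-joined in order.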

License & attribution

Released under Apache 2.0. Cite both this model and the base:

@misc{nom_vn_spell_correction_2026,
  title={Vietnamese Spell Correction — register-balanced fine-tune},
  author={Nguyen, Viet-Anh and {Neural Research Lab}},
  year={2026},
  howpublished={\url{https://huggingface.co/nrl-ai/vn-spell-correction-small}}
}

Training data inherits CC-BY-SA-4.0 (Wikipedia portion) + CC-BY-4.0 (news portion). To be safe, treat output text as CC-BY-SA-4.0.

Downloads last month: 313
Model size: 0.1B params (Safetensors, F32)