# nrl-ai/vn-spell-correction-base — Vietnamese spell correction (ViT5 fine-tune)
Fixes typos, missing accents, and OCR-style character errors in Vietnamese text in one pass: Toi yu Vit Nam → Tôi yêu Việt Nam. Strictly more than diacritic restoration — it also handles letter-level mistakes, missing / extra characters, and OCR substitutions like o↔0, l↔1, m↔rn.
Fine-tuned from `VietAI/vit5-base` on the `nrl-ai/vn-spell-correction-train` corpus (459K (noisy, clean) Vietnamese pairs synthesized from a register-balanced Wiki+news mix via `nom.text.noise`).

Adoption gate: ✅ passed (light avg 0.9832, heavy avg 0.9703).
## Quick start

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tok = AutoTokenizer.from_pretrained("nrl-ai/vn-spell-correction-base")
model = AutoModelForSeq2SeqLM.from_pretrained("nrl-ai/vn-spell-correction-base").eval()

text = "Toi yu Vit Nam"
out = model.generate(**tok(text, return_tensors="pt"), max_length=256)
print(tok.decode(out[0], skip_special_tokens=True))
# Tôi yêu Việt Nam
```
For batched inference (recommended for high-throughput pipelines):

```python
from nom.text.diacritic_models import HFDiacriticModel

restorer = HFDiacriticModel(model_id="nrl-ai/vn-spell-correction-base")
fixed = restorer.predict_batch(noisy_sentences, batch_size=16)
```
## Evaluation — 8-split spell-correction grid

Evaluation uses `nrl-ai/vn-spell-correction-eval` (2,098 pairs across 4 registers × 2 noise levels). Metric: word accuracy after NFC + punctuation normalization on both sides.
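The scoring script isn't included in the card; below is a minimal sketch of the metric as described (NFC normalization plus ASCII-punctuation stripping on both sides, then a position-wise word comparison). The helper names are ours, and a real scorer may use an edit-distance alignment instead of the position-wise zip.

```python
import string
import unicodedata

# Strip ASCII punctuation after Unicode NFC normalization (per the card).
PUNCT = str.maketrans("", "", string.punctuation)

def norm_tokens(text):
    """NFC-normalize, drop ASCII punctuation, whitespace-tokenize."""
    return unicodedata.normalize("NFC", text).translate(PUNCT).split()

def word_accuracy(pred, ref):
    """Fraction of reference words matched position-wise (a simplification)."""
    p, r = norm_tokens(pred), norm_tokens(ref)
    if not r:
        return 1.0
    return sum(a == b for a, b in zip(p, r)) / len(r)

def sent_exact(pred, ref):
    """Sentence-exact match under the same normalization."""
    return norm_tokens(pred) == norm_tokens(ref)
```

Normalizing both sides first means a trailing period or an NFD-encoded reference can't masquerade as a spelling error.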
| Register | Sents | Word acc | Sent exact | Mean ms/sent |
|---|---|---|---|---|
| Modern business / news (light) | 44 | 98.74 % | 84.09 % | 152 |
| Formal / legal-prose (light) | 65 | 99.75 % | 93.85 % | 290 |
| Conversational (light) | 179 | 97.68 % | 81.56 % | 106 |
| Classical literary (light) | 608 | 97.11 % | 73.03 % | 171 |
| Modern business / news (heavy / OCR) | 55 | 98.97 % | 85.45 % | 146 |
| Formal / legal-prose (heavy / OCR) | 72 | 99.05 % | 83.33 % | 273 |
| Conversational (heavy / OCR) | 287 | 95.54 % | 73.52 % | 103 |
| Classical literary (heavy / OCR) | 788 | 94.56 % | 56.85 % | 160 |
Each split corresponds to a (register, noise level) combination:
- light noise — ~5 % char-level edit distance, models a person typing Vietnamese on a keyboard with a few accent slips and the occasional fat-finger.
- heavy noise — ~15-20 % edit distance, models OCR output of a mid-quality scan with diacritic drops + char confusions.
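`nom.text.noise` itself is not reproduced here; as an illustration of the corruption classes described above (accent slips, OCR confusions, fat-finger edits), here is a toy character-level noiser with a target edit rate. This is hypothetical code, not the actual generator, and the confusion tables are deliberately tiny.

```python
import random

# Toy confusion tables modelled after the card's examples (not exhaustive).
ACCENT_DROP = {"ô": "o", "ê": "e", "ệ": "e", "ă": "a", "đ": "d", "ư": "u", "ơ": "o"}
OCR_CONFUSE = {"o": "0", "l": "1", "m": "rn"}

def add_noise(text, edit_rate=0.05, seed=None):
    """Corrupt roughly `edit_rate` of characters (illustrative only).

    Case is not preserved for accent drops; a real generator would handle it.
    """
    rng = random.Random(seed)
    out = []
    for ch in text:
        if rng.random() < edit_rate:
            low = ch.lower()
            if low in ACCENT_DROP:
                out.append(ACCENT_DROP[low])       # accent slip
            elif low in OCR_CONFUSE:
                out.append(OCR_CONFUSE[low])       # OCR substitution
            elif rng.random() < 0.5:
                continue                           # deletion
            else:
                out.append(ch + ch)                # fat-finger duplication
        else:
            out.append(ch)
    return "".join(out)
```

With `edit_rate=0.05` this approximates the light split; `edit_rate=0.15` approximates the heavy one.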
## How we compare

Where this model sits in the public Vietnamese spell-correction landscape — same 8-split grid for every measured row.

| Model | Family | Params | License | light avg | heavy avg |
|---|---|---|---|---|---|
| **this →** `nrl-ai/vn-spell-correction-base` | ours | ? | Apache 2.0 | 98.32 | 97.03 |
| `bmd1905/vietnamese-correction-v2` | public | 400 M | Apache 2.0 | 86.69 | 72.62 |
| `iAmHieu2012/vit5-vietnamese-spelling-correction` | public | 220 M | MIT | 80.72 | 56.55 |
The two averaged columns:
- light avg = mean word accuracy across the 4 light-noise splits.
- heavy avg = mean word accuracy across the 4 heavy-noise splits.
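Both columns are plain unweighted means over the four per-register word accuracies, which you can check directly against the grid above:

```python
# Per-register word accuracies from the 8-split grid.
light = [98.74, 99.75, 97.68, 97.11]  # business, formal, conversational, literary
heavy = [98.97, 99.05, 95.54, 94.56]

light_avg = round(sum(light) / len(light), 2)
heavy_avg = round(sum(heavy) / len(heavy), 2)
print(light_avg, heavy_avg)  # 98.32 97.03
```

Note the means are unweighted by split size, so the 788-sentence literary split counts the same as the 55-sentence business split.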
## Training

- Base: `VietAI/vit5-base` (MIT license)
- Corpus: 545,000 (noisy, clean) pairs from `nrl-ai/vn-spell-correction-train`; eval-leak guarded against `nrl-ai/vn-spell-correction-eval` and `nrl-ai/vn-diacritic-eval`
- Validation: 5,000 held-out pairs
- Epochs: 5
- Effective batch size: 32 (32 per device, grad-accum 1)
- Learning rate: 5e-4 with cosine schedule, 500 warmup steps
- Precision: bf16
- Sequence length: input 256, target 256
- Early stopping: patience=0 on `eval_loss`
- Training time: 215 min on a single NVIDIA RTX 3090 24 GB
- Seed: 42
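The recipe above maps onto `transformers` `Seq2SeqTrainingArguments` roughly as follows. This is a sketch: the actual training script is not published, every value is copied from the bullet list, and the `eval_strategy` argument is named `evaluation_strategy` on older transformers releases.

```python
from transformers import Seq2SeqTrainingArguments

# Sketch reconstructed from the hyperparameter list above (assumption, not
# the published script). Early stopping with patience=0 would be added via
# EarlyStoppingCallback(early_stopping_patience=0) on the Trainer.
args = Seq2SeqTrainingArguments(
    output_dir="vn-spell-correction-base",
    num_train_epochs=5,
    per_device_train_batch_size=32,
    gradient_accumulation_steps=1,      # effective batch size 32
    learning_rate=5e-4,
    lr_scheduler_type="cosine",
    warmup_steps=500,
    bf16=True,
    predict_with_generate=True,
    generation_max_length=256,          # target length 256
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    seed=42,
)
```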
## Intended use

- Recommended: cleaning up noisy Vietnamese text — OCR output, user-generated text typed without a Vietnamese IME, form data with typos, short-form social-media posts. Strictly harder than diacritic restoration, but covers it as a subset.
- Not recommended: text generation, classification, sentiment, NER, or any task whose input distribution doesn't match the training data.
## Limitations

**In-distribution metric; the real world is harder — measured.** Training and eval both use `nom.text.noise`, so the synthetic 8-split numbers above measure how well we invert our own noise generator. We also benchmark on a 150-sentence hand-curated OOD eval (6 registers, bootstrap 95 % CI):

| Slice | this model | Toshiiiii1 (public) | bmd1905 (public) |
|---|---|---|---|
| forum_25 | 59.45 % | 60.11 % | 59.02 % |
| mobile_25 | 95.01 % | 96.95 % | 88.09 % |
| telex_real_25 | 17.38 % | 18.54 % | 11.58 % |
| ocr_25 | 93.62 % | 94.22 % | 47.42 % |
| legal_real_25 | 95.09 % | 93.80 % | 54.90 % |
| news_real_25 | 96.54 % | 94.07 % | 30.62 % |
| Aggregate (n=150) | 77.43 % | 77.40 % | 49.21 % |

Synthetic light_avg is 98.58 %; the real-world aggregate is 77.43 %. That 21 pp gap is the cost of training only on `light`/`telex_typo`/`heavy` noise — those capture the surface of typos but not real Telex keystroke artefacts (`dduwojc` for `được`) or forum-style abbreviations (`ko bt` for `không biết`). On OOD this model ties with `Toshiiiii1` (77.43 vs 77.40, within the bootstrap CI). A v0.2.29 retrain on the v2 multi-source corpus plus `comprehensive_noise()` (which adds `telex_grammar_noise()` and `mobile_noise()`) is queued and targets a clear OOD lead, not just a synthetic one.

**Heavy-noise corner cases.** OCR outputs that drop entire words or add hallucinated text are out of scope; the noise generator we trained on caps edits per sentence (max 25 % edit ratio).
**Long sequences truncate at 256 sub-word tokens.** Split paragraphs at sentence boundaries before calling.
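For example, a naive regex splitter (a proper Vietnamese sentence segmenter would handle abbreviations and quoted speech better):

```python
import re

def split_sentences(paragraph):
    """Split on whitespace that follows sentence-final punctuation,
    keeping the punctuation with its sentence."""
    parts = re.split(r"(?<=[.!?…])\s+", paragraph.strip())
    return [p for p in parts if p]
```

Feed the resulting sentences to the model one by one (or batched), then rejoin with spaces.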
**No grammar or stylistic correction.** This model fixes character / syllable / diacritic errors but doesn't rewrite phrasing.
**Confidence intervals on small splits.** business_55 (44/55 sents) and formal_72 (65/72 sents) carry a ±3-4 pp 95 % CI; the larger literary_800 split is within ±1 pp. Treat single-pp differences with care.
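The intervals quoted above come from resampling per-sentence scores; the exact script isn't published, but a standard percentile bootstrap looks roughly like this:

```python
import random

def bootstrap_ci(per_sent_acc, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for the mean per-sentence accuracy."""
    rng = random.Random(seed)
    n = len(per_sent_acc)
    means = sorted(
        sum(rng.choices(per_sent_acc, k=n)) / n for _ in range(n_boot)
    )
    lo = means[int(alpha / 2 * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

Smaller splits resample fewer sentences, so their intervals widen — which is exactly why the 44-sentence business split carries a ±3-4 pp band.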
## License & attribution

Released under Apache 2.0. Cite both this model and the base:
```bibtex
@misc{nom_vn_spell_correction_2026,
  title={Vietnamese Spell Correction — register-balanced fine-tune},
  author={Nguyen, Viet-Anh and {Neural Research Lab}},
  year={2026},
  howpublished={\url{https://huggingface.co/nrl-ai/vn-spell-correction-base}}
}
```
Training data inherits CC-BY-SA-4.0 (Wikipedia portion) and CC-BY-4.0 (news portion); to be safe, treat model output as CC-BY-SA-4.0.