nrl-ai/vn-spell-correction-base — Vietnamese spell correction (ViT5 fine-tune)

Fixes typos, missed accents, and OCR-style char errors in Vietnamese text in one pass: Toi yu Vit Nam → Tôi yêu Việt Nam. Strictly more than diacritic restoration — handles letter-level mistakes, missing / extra characters, and OCR substitutions like o↔0, l↔1, m↔rn.

Fine-tuned from VietAI/vit5-base on the nrl-ai/vn-spell-correction-train corpus (459K (noisy, clean) Vietnamese pairs synthesized from a register-balanced Wiki+news mix via nom.text.noise).

Adoption gate: ✅ passed (light avg 0.9832, heavy avg 0.9703).

Quick start

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tok = AutoTokenizer.from_pretrained("nrl-ai/vn-spell-correction-base")
model = AutoModelForSeq2SeqLM.from_pretrained("nrl-ai/vn-spell-correction-base").eval()

text = "Toi yu Vit Nam"
out = model.generate(**tok(text, return_tensors="pt"), max_length=256)
print(tok.decode(out[0], skip_special_tokens=True))
# Tôi yêu Việt Nam

For batched inference (recommended for high-throughput pipelines):

from nom.text.diacritic_models import HFDiacriticModel
restorer = HFDiacriticModel(model_id="nrl-ai/vn-spell-correction-base")
fixed = restorer.predict_batch(noisy_sentences, batch_size=16)
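If the nom package isn't available in your environment, the same batched loop can be sketched with plain transformers. The helper below takes the tok and model objects from the quick start; batch size and max length are illustrative defaults, not tuned values:

```python
def correct_batch(sentences, tok, model, batch_size=16, max_length=256):
    """Run seq2seq spell correction over sentences in fixed-size batches."""
    fixed = []
    for i in range(0, len(sentences), batch_size):
        # Pad/truncate the batch so all sequences share one tensor shape
        enc = tok(sentences[i:i + batch_size], return_tensors="pt",
                  padding=True, truncation=True, max_length=max_length)
        out = model.generate(**enc, max_length=max_length)
        fixed.extend(tok.batch_decode(out, skip_special_tokens=True))
    return fixed
```

Wrap the generate call in torch.no_grad() for lower memory use in production pipelines.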

Evaluation — 8-split spell-correction grid

Evaluation uses nrl-ai/vn-spell-correction-eval (2,098 pairs across 4 registers x 2 noise levels). Word accuracy after NFC + punctuation normalization on both sides.

| Register | Sents | Word acc | Sent exact | Mean ms/sent |
|---|---|---|---|---|
| Modern business / news (light) | 44 | 98.74 % | 84.09 % | 152 |
| Formal / legal-prose (light) | 65 | 99.75 % | 93.85 % | 290 |
| Conversational (light) | 179 | 97.68 % | 81.56 % | 106 |
| Classical literary (light) | 608 | 97.11 % | 73.03 % | 171 |
| Modern business / news (heavy / OCR) | 55 | 98.97 % | 85.45 % | 146 |
| Formal / legal-prose (heavy / OCR) | 72 | 99.05 % | 83.33 % | 273 |
| Conversational (heavy / OCR) | 287 | 95.54 % | 73.52 % | 103 |
| Classical literary (heavy / OCR) | 788 | 94.56 % | 56.85 % | 160 |

Each split corresponds to a (register, noise level) combination:

  • light noise — ~5 % char-level edit distance, models a person typing Vietnamese on a keyboard with a few accent slips and the occasional fat-finger.
  • heavy noise — ~15-20 % edit distance, models OCR output of a mid-quality scan with diacritic drops + char confusions.
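As a rough sketch of the scoring described above, assuming word accuracy means the fraction of whitespace tokens that match position-by-position after NFC normalization and ASCII punctuation stripping (the exact alignment used by the eval harness may differ):

```python
import string
import unicodedata

def _norm_tokens(text):
    # NFC-normalize, drop ASCII punctuation, split on whitespace
    text = unicodedata.normalize("NFC", text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    return text.split()

def word_accuracy(prediction, reference):
    pred, ref = _norm_tokens(prediction), _norm_tokens(reference)
    if not ref:
        return 1.0
    hits = sum(p == r for p, r in zip(pred, ref))
    return hits / len(ref)

word_accuracy("Tôi yêu Việt Nam.", "Tôi yêu Việt Nam")  # 1.0
```

Per-split scores are then the mean of word_accuracy over each split's sentence pairs.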

How we compare

Where this model sits in the public Vietnamese spell-correction landscape — same 8-split grid for every measured row.

| Model | Family | Params | License | light avg | heavy avg |
|---|---|---|---|---|---|
| this → nrl-ai/vn-spell-correction-base | ours | ? | Apache 2.0 | 98.32 | 97.03 |
| bmd1905/vietnamese-correction-v2 | public | 400 M | Apache 2.0 | 86.69 | 72.62 |
| iAmHieu2012/vit5-vietnamese-spelling-correction | public | 220 M | MIT | 80.72 | 56.55 |

The two averaged columns:

  • light avg = mean word accuracy across the 4 light-noise splits.
  • heavy avg = mean word accuracy across the 4 heavy-noise splits.
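Both averages can be reproduced directly from the per-split word-accuracy column in the 8-split grid above (plain unweighted means per noise level):

```python
# Word accuracy per split: business, formal, conversational, literary
light = [98.74, 99.75, 97.68, 97.11]
heavy = [98.97, 99.05, 95.54, 94.56]

light_avg = round(sum(light) / len(light), 2)
heavy_avg = round(sum(heavy) / len(heavy), 2)
print(light_avg, heavy_avg)  # 98.32 97.03
```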

Training

  • Base: VietAI/vit5-base (MIT license)
  • Corpus: 545,000 (noisy, clean) pairs from nrl-ai/vn-spell-correction-train. Eval-leak guarded against nrl-ai/vn-spell-correction-eval and nrl-ai/vn-diacritic-eval.
  • Validation: 5,000 held-out pairs.
  • Epochs: 5
  • Effective batch size: 32 (32 per device, grad-accum 1)
  • Learning rate: 0.0005 with cosine schedule, 500 warmup steps
  • Precision: bf16
  • Sequence length: input 256, target 256
  • Early stopping: patience=0 on eval_loss
  • Training time: 215.0 min on a single NVIDIA RTX 3090 24 GB
  • Seed: 42
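For reproducibility, the hyperparameters above map onto standard transformers Seq2SeqTrainingArguments fields roughly as follows. This is a sketch reconstructed from the bullet list, not the original training script:

```python
# Field names follow transformers' Seq2SeqTrainingArguments; the two
# sequence-length entries are tokenizer-side settings, not trainer fields.
training_config = dict(
    num_train_epochs=5,
    per_device_train_batch_size=32,
    gradient_accumulation_steps=1,   # effective batch size = 32
    learning_rate=5e-4,
    lr_scheduler_type="cosine",
    warmup_steps=500,
    bf16=True,
    seed=42,
    max_source_length=256,           # input truncation
    max_target_length=256,           # target truncation
)
```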

Intended use

  • Recommended: cleaning up noisy Vietnamese text — OCR output, user-generated text from non-VN-IME keyboards, form data with typos, social-media short-form. Strictly harder than diacritic restoration but covers it as a subset.
  • Not recommended: text generation, classification, sentiment, NER, or any task whose input distribution doesn't match noisy Vietnamese text.

Limitations

  • In-distribution metrics overstate real-world performance, and we measured the gap. Training and eval both use nom.text.noise, so the synthetic 8-split numbers above measure how well we invert our own noise generator. We also benchmark on a 150-sentence hand-curated OOD eval (6 registers, bootstrap 95 % CI):

    | Slice | this model | Toshiiiii1 (public) | bmd1905 (public) |
    |---|---|---|---|
    | forum_25 | 59.45 % | 60.11 % | 59.02 % |
    | mobile_25 | 95.01 % | 96.95 % | 88.09 % |
    | telex_real_25 | 17.38 % | 18.54 % | 11.58 % |
    | ocr_25 | 93.62 % | 94.22 % | 47.42 % |
    | legal_real_25 | 95.09 % | 93.80 % | 54.90 % |
    | news_real_25 | 96.54 % | 94.07 % | 30.62 % |
    | Aggregate (n=150) | 77.43 % | 77.40 % | 49.21 % |

    Synthetic light_avg is 98.58 %, real-world aggregate is 77.43 %. The 21 pp gap is the cost of training only on light/telex_typo/heavy noise — those capture the surface of typos but not real Telex keystroke artefacts (dduwojc for được) or forum-style abbreviations (ko bt for không biết). On OOD this model ties with Toshiiiii1 (77.43 vs 77.40, within bootstrap CI). v0.2.29 retraining on the v2 multi-source corpus + comprehensive_noise() (which adds telex_grammar_noise() + mobile_noise()) is queued and targets a clear OOD lead, not just a synthetic one.

  • Heavy-noise corner cases. OCR outputs that drop entire words or add hallucinated text are out-of-scope; the noise generator we trained on caps edits per sentence (max 25 % edit ratio).

  • Long sequences truncate at 256 sub-word tokens. Split paragraphs at sentence boundaries before calling.
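A minimal way to pre-split paragraphs before calling the model is naive regex sentence splitting on terminal punctuation; a proper Vietnamese sentence segmenter would handle abbreviations and ellipses better, so treat this as a placeholder:

```python
import re

def split_sentences(paragraph):
    # Split after ., !, or ? followed by whitespace; keeps the terminator
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [p for p in parts if p]

split_sentences("Toi yu Vit Nam. Ban co khoe khong?")
# ['Toi yu Vit Nam.', 'Ban co khoe khong?']
```

Correct each sentence independently, then rejoin with spaces.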

  • No grammar or stylistic correction. This model fixes character / syllable / diacritic errors but doesn't rewrite phrasing.

  • Confidence intervals on small splits. business_55 (44/55 sents) and formal_72 (65/72 sents) have ±3-4 pp 95 % CI; the larger literary_800 split has ±1 pp. Treat single-pp differences with care.

License & attribution

Released under Apache 2.0. Cite both this model and the base:

@misc{nom_vn_spell_correction_2026,
  title={Vietnamese Spell Correction — register-balanced fine-tune},
  author={Nguyen, Viet-Anh and {Neural Research Lab}},
  year={2026},
  howpublished={\url{https://huggingface.co/nrl-ai/vn-spell-correction-base}}
}

Training data inherits CC-BY-SA-4.0 (Wikipedia portion) and CC-BY-4.0 (news portion). To be conservative, treat model output as CC-BY-SA-4.0.
