nrl-ai/vn-diacritic-small — Vietnamese diacritic restoration (BARTpho-syllable fine-tune)

Restores diacritics on Vietnamese text written without them (Toi yeu Viet Nam → Tôi yêu Việt Nam). Fine-tuned from vinai/bartpho-syllable-base on a register-balanced mix of Vietnamese Wikipedia (CC-BY-SA-4.0) and Vietnamese news (tmnam20/Vietnamese-News-dedup, CC-BY-4.0, NFC-normalized).

Adoption gate: ⚠️ did not pass (business_55 word_accuracy 0.9444 < gate 0.9600).

Quick start

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tok = AutoTokenizer.from_pretrained("nrl-ai/vn-diacritic-small")
model = AutoModelForSeq2SeqLM.from_pretrained("nrl-ai/vn-diacritic-small").eval()

text = "Toi yeu Viet Nam"
out = model.generate(**tok(text, return_tensors="pt"), max_length=256)
print(tok.decode(out[0], skip_special_tokens=True))
# Tôi yêu Việt Nam

For the full pipeline (with rule-based + LLM fallbacks), use the nom-vn Python package:

from nom.text.diacritic_models import HFDiacriticModel
restorer = HFDiacriticModel(model_id="nrl-ai/vn-diacritic-small")
restorer("Toi yeu Viet Nam")  # 'Tôi yêu Việt Nam'

Evaluation — 4-register matrix

Measured against Toshiiiii1/Vietnamese_diacritics_restoration_5th (public SOTA at the time of training). Word accuracy is computed after Unicode NFC + punctuation normalization on both sides.

| Register | Sents | Word acc | Δ vs Toshiiiii1 | Mean ms/sent |
| --- | --- | --- | --- | --- |
| Formal / legal-prose (UDHR, public domain) | 72 | 91.51 % | -6.63 pp | 117 |
| Modern business / contracts / news (CC0) | 55 | 94.44 % | -3.37 pp | 66 |
| Conversational (Tatoeba, CC-BY 2.0 FR) | 300 | 90.68 % | -3.26 pp | 48 |
| Classical literary (UD-VTB, CC-BY-SA-4.0) | 800 | 86.33 % | -3.07 pp | 78 |
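
The word-accuracy metric can be sketched as follows. This is a minimal illustration rather than the nom-vn benchmark code: the punctuation table and the position-wise word alignment are assumptions, and `normalize` and `word_accuracy` are hypothetical names.

```python
import unicodedata

# Assumed punctuation folding; the benchmark's actual table may differ.
_PUNCT = str.maketrans({
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u2013": "-", "\u2014": "-",   # en and em dashes
    "\u2026": "...",                # horizontal ellipsis
})


def normalize(text: str) -> str:
    """NFC-compose, fold punctuation variants, collapse whitespace."""
    text = unicodedata.normalize("NFC", text).translate(_PUNCT)
    return " ".join(text.split())


def word_accuracy(predictions, references) -> float:
    """Fraction of reference words reproduced exactly at the same position."""
    correct = total = 0
    for pred, ref in zip(predictions, references):
        pred_words = normalize(pred).split()
        ref_words = normalize(ref).split()
        total += len(ref_words)
        # zip stops at the shorter side, so missing words count as errors.
        correct += sum(p == r for p, r in zip(pred_words, ref_words))
    return correct / total if total else 0.0


if __name__ == "__main__":
    # One of four words is left without its diacritic.
    print(word_accuracy(["Toi yêu Việt Nam"], ["Tôi yêu Việt Nam"]))  # 0.75
```

Normalizing both sides first keeps composed and decomposed spellings of the same syllable from being scored as mismatches.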

Each eval corpus is open-license and reproducible from the nom-vn repo:

  • business_55: benchmarks/data/diacritic_eval_v0.txt (CC0)
  • literary_udvtb: benchmarks/data/ud_vi_vtb/test.conllu (CC-BY-SA-4.0)
  • conversational_300: benchmarks/data/tatoeba_vi/diacritic_eval_300.txt (CC-BY 2.0 FR)
  • formal_udhr: benchmarks/data/udhr_vi/diacritic_eval_udhr.txt (public domain)

How we compare

Where this model sits in the public Vietnamese diacritic-restoration landscape — same 4-register grid for every measured row. Bold = best in column. Cells marked "—" weren't run on that register.

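| Model | Family | Params | License | formal_72 | business_55 | conv_300 | literary_800 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Toshiiiii1/...5th | public | 200 M | Apache 2.0 | 98.14 | 97.81 | 93.94 | 89.40 |
| nrl-ai/vn-diacritic-vit5-base | ours | 220 M | Apache 2.0 | 99.43 | 94.98 | 94.12 | 90.24 |
| nrl-ai/vn-diacritic-small (this model) | ours | 115 M | Apache 2.0 | 91.51 | 94.44 | 90.68 | 86.33 |

The remaining rows were each measured on a single register only:

| Model | Family | Params | License | Word acc (single register) |
| --- | --- | --- | --- | --- |
| qthuan2604/ViT5_Restore_Diacritics_Vietnamese | public | 220 M | MIT | 90.59 |
| qthuan2604/BARTPho_Syllable_Restore_Diacritics_Vietnamese | public | 115 M | MIT | 83.92 |
| OpenAI gpt-4o-mini (cloud) | external | proprietary | proprietary | 95.37 |
| rule-based (nom.text.fix_diacritics) | ours | 0 | Apache 2.0 | 41.06 |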

Training

  • Base: vinai/bartpho-syllable-base (MIT license)
  • Corpus: 500,000 (input, target) pairs from a register-balanced mix of Vietnamese Wikipedia (CC-BY-SA-4.0) and Vietnamese news (tmnam20/Vietnamese-News-dedup, CC-BY-4.0, NFC-normalized). Eval-leak guarded against the held-out nrl-ai/vn-diacritic-eval slices.
  • Validation: 5,000 held-out pairs from the same training mix.
  • Epochs: 5
  • Effective batch size: 32 (32 per device, grad-accum 1)
  • Learning rate: 0.0005 with cosine schedule, 500 warmup steps
  • Precision: bf16
  • Sequence length: input 256, target 256
  • Early stopping: patience=0 on eval_loss
  • Training time: 88.0 min on a single NVIDIA RTX 3090 24 GB
  • Seed: 42
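
For orientation, the step counts and learning-rate curve implied by these settings can be worked out as below. This is a sketch under stated assumptions: that all 5 epochs ran over the full 500,000 pairs, and that the cosine schedule is the common form with linear warmup followed by cosine decay to zero. `lr_at` is a hypothetical helper, not part of the training code.

```python
import math

# Values taken from the training list above.
TRAIN_PAIRS = 500_000
BATCH_SIZE = 32
EPOCHS = 5
BASE_LR = 5e-4
WARMUP_STEPS = 500

steps_per_epoch = math.ceil(TRAIN_PAIRS / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS  # assumes no early stop


def lr_at(step: int) -> float:
    """Learning rate at an optimizer step (assumed warmup + cosine shape)."""
    if step < WARMUP_STEPS:
        # Linear ramp from 0 up to the base learning rate.
        return BASE_LR * step / WARMUP_STEPS
    # Cosine decay from the base learning rate down to 0.
    progress = (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS)
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    print(steps_per_epoch, total_steps)  # 15625 78125
```

Under these assumptions, the 500 warmup steps cover well under 1 % of the run, so almost all training happens on the decaying part of the curve.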

Intended use

  • Recommended: Restoring diacritics in modern Vietnamese text where the input is predominantly diacritic-stripped (e.g. user typed without IME, OCR with no diacritic-aware font, or imported foreign-keyboard data).
  • Not recommended: Spelling correction (the model does not change letters, only adds tone/vowel marks), text generation, classification, or any task whose input distribution differs from the training data (e.g. heavy emoji or mixed-script text).
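
Because the model is meant to add marks without changing base letters, a cheap guard is to strip the marks from the output and compare the result with the stripped input. A minimal sketch; `strip_diacritics` and `same_base_letters` are hypothetical helpers, not part of nom-vn:

```python
import unicodedata

# đ/Đ have no Unicode decomposition, so they are mapped explicitly.
_D_STROKE = str.maketrans({"đ": "d", "Đ": "D"})


def strip_diacritics(text: str) -> str:
    """Remove Vietnamese tone and vowel marks, keeping the base letters."""
    decomposed = unicodedata.normalize("NFD", text)
    # Combining marks (category Mn) carry the tones, circumflex, breve, horn.
    no_marks = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    return unicodedata.normalize("NFC", no_marks.translate(_D_STROKE))


def same_base_letters(source: str, restored: str) -> bool:
    """True if the restored text differs from the source only in diacritics."""
    return strip_diacritics(source) == strip_diacritics(restored)


if __name__ == "__main__":
    print(strip_diacritics("Tôi yêu Việt Nam"))  # Toi yeu Viet Nam
```

An output that fails `same_base_letters` has inserted, dropped, or altered letters and is worth flagging rather than accepting silently.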

Limitations

  • Register skew. Encyclopedic Wikipedia training tilts the model toward formal/literary register. Conversational / dialect text may be slightly worse.
  • Proper noun ambiguity. Multiple plausible diacritisations exist for the same ASCII form (Hung → Hùng / Hưng / Hứng). The model picks the form that was most likely in its training data, which may not match a specific person's name.
  • Long sentences truncate at 256 sub-word tokens. Split paragraphs at sentence boundaries before calling.
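
Splitting before calling the model could look like the sketch below. The regex splitter is a simplistic assumption (it will mis-split abbreviations such as "TP. Hồ Chí Minh"), and `restore_fn` stands for any callable that restores one sentence, for example the `restorer` object from the quick start. Both function names are hypothetical.

```python
import re
from typing import Callable, List

# Split on whitespace that follows sentence-ending punctuation.
_SENTENCE_END = re.compile(r"(?<=[.!?\u2026])\s+")


def split_sentences(paragraph: str) -> List[str]:
    """Naive sentence splitter; keeps the closing punctuation on each piece."""
    return [s for s in _SENTENCE_END.split(paragraph.strip()) if s]


def restore_paragraph(paragraph: str, restore_fn: Callable[[str], str]) -> str:
    """Restore each sentence separately, then rejoin with single spaces."""
    return " ".join(restore_fn(s) for s in split_sentences(paragraph))


if __name__ == "__main__":
    print(split_sentences("Toi yeu Viet Nam. Ban khoe khong? Tot!"))
```

This does not guarantee that every piece fits in 256 sub-word tokens; a very long single sentence would still need a token-count check with the tokenizer.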

License & attribution

This fine-tune is released under Apache 2.0.

Base model: vinai/bartpho-syllable-base by VinAI Research, MIT license. Cite both this fine-tune and the base model:

@inproceedings{bartpho_2022,
  title={BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese},
  author={Nguyen Luong Tran and Duong Minh Le and Dat Quoc Nguyen},
  booktitle={Proceedings of the 23rd Annual Conference of the International Speech Communication Association},
  year={2022}
}

@misc{nom_vn_diacritic_2026,
  title={Vietnamese Diacritic Restoration: register-balanced BARTpho-syllable fine-tune},
  author={Nguyen, Viet-Anh and {Neural Research Lab}},
  year={2026},
  howpublished={\url{https://huggingface.co/nrl-ai/vn-diacritic-small}}
}

Training corpus license: CC-BY-SA-4.0 (Wikipedia). Output text from this model is therefore best treated as CC-BY-SA-4.0 if you want to be safe about derivative-rights propagation, even though the model weights themselves are Apache 2.0.
