nrl-ai/vn-diacritic-small — Vietnamese diacritic restoration (BARTpho-syllable fine-tune)
Restores diacritics on Vietnamese text written without them
(Toi yeu Viet Nam → Tôi yêu Việt Nam). Fine-tuned from
vinai/bartpho-syllable-base on a register-balanced mix of Vietnamese Wikipedia (CC-BY-SA-4.0) and Vietnamese news (tmnam20/Vietnamese-News-dedup, CC-BY-4.0, NFC-normalized).
Adoption gate: ⚠️ did not pass (business_55 word_accuracy 0.9444 < gate 0.9600).
Quick start
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the tokenizer and the fine-tuned seq2seq model in inference mode
tok = AutoTokenizer.from_pretrained("nrl-ai/vn-diacritic-small")
model = AutoModelForSeq2SeqLM.from_pretrained("nrl-ai/vn-diacritic-small").eval()

text = "Toi yeu Viet Nam"
# Tokenize, generate up to 256 tokens, and decode the restored sentence
out = model.generate(**tok(text, return_tensors="pt"), max_length=256)
print(tok.decode(out[0], skip_special_tokens=True))
# Tôi yêu Việt Nam
```
For the full pipeline (with rule-based + LLM fallbacks), use the
nom-vn Python package:
```python
from nom.text.diacritic_models import HFDiacriticModel

restorer = HFDiacriticModel(model_id="nrl-ai/vn-diacritic-small")
restorer("Toi yeu Viet Nam")  # 'Tôi yêu Việt Nam'
```
Evaluation — 4-register matrix
Measured against Toshiiiii1/Vietnamese_diacritics_restoration_5th
(public SOTA at the time of training). Word accuracy is computed after Unicode NFC
and punctuation normalization, applied to both the prediction and the reference.
| Register | Sents | Word acc | Δ vs Toshiiiii1 | Mean ms/sent |
|---|---|---|---|---|
| Formal / legal-prose (UDHR, public domain) | 72 | 91.51 % | -6.63 pp | 117 |
| Modern business / contracts / news (CC0) | 55 | 94.44 % | -3.37 pp | 66 |
| Conversational (Tatoeba, CC-BY 2.0 FR) | 300 | 90.68 % | -3.26 pp | 48 |
| Classical literary (UD-VTB, CC-BY-SA-4.0) | 800 | 86.33 % | -3.07 pp | 78 |
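The word-accuracy figures above can be approximated with a short script. The following is a minimal sketch, not the card's actual evaluation code: the punctuation table, the whitespace tokenization, and the position-wise comparison are all assumptions, and the names `normalize` and `word_accuracy` are hypothetical.

```python
import re
import unicodedata

# Assumption: fold a few typographic punctuation variants to ASCII.
# The card does not publish its exact normalization table.
_PUNCT_MAP = str.maketrans({
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u2013": "-", "\u2014": "-",   # en and em dashes
    "\u2026": "...",                # ellipsis
})


def normalize(text: str) -> str:
    """Apply Unicode NFC, fold punctuation, and collapse whitespace."""
    text = unicodedata.normalize("NFC", text).translate(_PUNCT_MAP)
    return re.sub(r"\s+", " ", text).strip()


def word_accuracy(predictions, references) -> float:
    """Fraction of reference words reproduced exactly at the same position."""
    correct = total = 0
    for pred, ref in zip(predictions, references):
        pred_words = normalize(pred).split()
        ref_words = normalize(ref).split()
        total += len(ref_words)
        # Words missing from the prediction simply never match.
        correct += sum(p == r for p, r in zip(pred_words, ref_words))
    return correct / total if total else 0.0
```

For example, `word_accuracy(["Toi yêu Việt Nam"], ["Tôi yêu Việt Nam"])` returns `0.75`, since one of the four words is wrong.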
Each eval corpus is open-license and reproducible from the
nom-vn repo:
- business_55: `benchmarks/data/diacritic_eval_v0.txt` (CC0)
- literary_udvtb: `benchmarks/data/ud_vi_vtb/test.conllu` (CC-BY-SA-4.0)
- conversational_300: `benchmarks/data/tatoeba_vi/diacritic_eval_300.txt` (CC-BY 2.0 FR)
- formal_udhr: `benchmarks/data/udhr_vi/diacritic_eval_udhr.txt` (public domain)
How we compare
Where this model sits in the public Vietnamese diacritic-restoration landscape, using the same 4-register grid for every measured row. Bold = best in column. Cells marked "n/a" weren't run on that register.
| Model | Family | Params | License | formal_72 | business_55 | conv_300 | literary_800 |
|---|---|---|---|---|---|---|---|
| Toshiiiii1/...5th | public | 200 M | Apache 2.0 | 98.14 | **97.81** | 93.94 | 89.40 |
| nrl-ai/vn-diacritic-vit5-base | ours | 220 M | Apache 2.0 | **99.43** | 94.98 | **94.12** | **90.24** |
| this → nrl-ai/vn-diacritic-small | ours | 115 M | Apache 2.0 | 91.51 | 94.44 | 90.68 | 86.33 |
| qthuan2604/ViT5_Restore_Diacritics_Vietnamese | public | 220 M | MIT | n/a | 90.59 | n/a | n/a |
| qthuan2604/BARTPho_Syllable_Restore_Diacritics_Vietnamese | public | 115 M | MIT | n/a | 83.92 | n/a | n/a |
| OpenAI gpt-4o-mini (cloud) | external | proprietary | proprietary | n/a | 95.37 | n/a | n/a |
| rule-based (nom.text.fix_diacritics) | ours | 0 | Apache 2.0 | n/a | 41.06 | n/a | n/a |
Training
- Base: `vinai/bartpho-syllable-base` (MIT license)
- Corpus: 500,000 (input, target) pairs from a register-balanced mix of Vietnamese Wikipedia (CC-BY-SA-4.0) and Vietnamese news (`tmnam20/Vietnamese-News-dedup`, CC-BY-4.0, NFC-normalized). Eval-leak guarded against the held-out `nrl-ai/vn-diacritic-eval` slices.
- Validation: 5,000 held-out pairs from the same training mix.
- Epochs: 5
- Effective batch size: 32 (32 per device, grad-accum 1)
- Learning rate: 0.0005 with `cosine` schedule, 500 warmup steps
- Precision: bf16
- Sequence length: input 256, target 256
- Early stopping: patience=0 on `eval_loss`
- Training time: 88.0 min on a single NVIDIA RTX 3090 24 GB
- Seed: 42
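The card does not say how the (input, target) pairs were built. A common approach, sketched here as an assumption, is to take diacritized text as the target and strip its marks to produce the input. The helper names `strip_diacritics` and `make_pair` are hypothetical.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove Vietnamese tone and vowel marks, leaving the base letters."""
    # đ/Đ have no canonical decomposition, so they are mapped explicitly.
    text = text.replace("đ", "d").replace("Đ", "D")
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


def make_pair(sentence: str) -> tuple[str, str]:
    """Return (diacritic-free input, NFC-normalized target) for one sentence."""
    target = unicodedata.normalize("NFC", sentence)
    return strip_diacritics(target), target


print(make_pair("Tôi yêu Việt Nam"))
```

The same `strip_diacritics` helper can also turn any diacritized sentence into a test input for the model.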
Intended use
- Recommended: Restoring diacritics in modern Vietnamese text where the input is predominantly diacritic-stripped (e.g. user typed without IME, OCR with no diacritic-aware font, or imported foreign-keyboard data).
- Not recommended: Spelling correction (the model does not change letters, only adds tone/vowel marks), text generation, classification, or any task whose input distribution differs from the training data (e.g. heavy emoji or mixed-script text).
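Because the intended behavior is to add marks without changing letters, callers may want to verify that property on each output. The sketch below is one possible post-check and is not part of the model or the nom-vn package: it keeps an output word only if its base letters match the input word, and otherwise falls back to the input. The function names are hypothetical.

```python
import unicodedata


def _ascii_skeleton(word: str) -> str:
    """Base letters of a word with all tone and vowel marks removed."""
    word = word.replace("đ", "d").replace("Đ", "D")
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def keep_only_diacritic_changes(source: str, output: str) -> str:
    """Accept output words that differ from the source only by diacritics."""
    src_words, out_words = source.split(), output.split()
    if len(src_words) != len(out_words):
        # Words cannot be aligned one-to-one; fall back to the raw input.
        return source
    return " ".join(
        out if _ascii_skeleton(out) == _ascii_skeleton(src) else src
        for src, out in zip(src_words, out_words)
    )
```

A whitespace-based alignment like this is deliberately conservative: any inserted or dropped word causes the whole sentence to fall back to the input.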
Limitations
- Register skew. Encyclopedic Wikipedia training tilts the model toward formal/literary register. Conversational / dialect text may be slightly worse.
- Proper noun ambiguity. Multiple plausible diacritisations exist for the same ASCII form (Hung → Hùng/Hưng/Hứng). The model picks the form it learned as most likely during training, which may not match a specific person's name.
- Long sentences truncate at 256 sub-word tokens. Split paragraphs at sentence boundaries before calling.
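The advice to split paragraphs before calling the model can be sketched as follows. This is a naive illustration under stated assumptions: the regex splits only on `.`, `!`, and `?` followed by whitespace, it does not count tokens, and it will mis-split abbreviations. `restore` stands for any callable that restores one sentence (for example a wrapper around `model.generate`); both function names are hypothetical.

```python
import re
from typing import Callable

# Split after sentence-final punctuation that is followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(paragraph: str) -> list[str]:
    """Break a paragraph into sentences at ., ! or ? boundaries."""
    return [s for s in _SENTENCE_END.split(paragraph.strip()) if s]


def restore_paragraph(paragraph: str, restore: Callable[[str], str]) -> str:
    """Restore each sentence separately, then rejoin with single spaces."""
    return " ".join(restore(sentence) for sentence in split_sentences(paragraph))


print(split_sentences("Toi yeu Viet Nam. Ban khoe khong? Cam on!"))
```

A sentence that still exceeds 256 sub-word tokens after splitting would need a further split, for instance at commas.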
License & attribution
This fine-tune is released under Apache 2.0.
Base model: vinai/bartpho-syllable-base by VinAI Research, MIT
license. Cite both this fine-tune and the base model:
@misc{bartpho_2022,
title={BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese},
author={Nguyen Luong Tran and Duong Minh Le and Dat Quoc Nguyen},
year={2022},
note={INTERSPEECH}
}
@misc{nom_vn_diacritic_2026,
title={Vietnamese Diacritic Restoration: register-balanced BARTpho fine-tune},
author={Nguyen, Viet-Anh and {Neural Research Lab}},
year={2026},
howpublished={\url{https://huggingface.co/nrl-ai/vn-diacritic-small}}
}
Training corpus licenses: CC-BY-SA-4.0 (Wikipedia) and CC-BY-4.0 (news). Output text from this model is therefore best treated as CC-BY-SA-4.0, the more restrictive of the two, if you want to be safe about derivative-rights propagation, even though the model weights themselves are Apache 2.0.