nrl-ai/vn-diacritic-small — Vietnamese diacritic restoration (BARTpho-syllable fine-tune)

Restores diacritics on Vietnamese text written without them (Toi yeu Viet Nam → Tôi yêu Việt Nam). Fine-tuned from vinai/bartpho-syllable-base on a register-balanced mix of Vietnamese Wikipedia (CC-BY-SA-4.0) and Vietnamese news (tmnam20/Vietnamese-News-dedup, CC-BY-4.0, NFC-normalized).

Adoption gate: ⚠️ did not pass (business_55 word_accuracy 0.9444 < gate 0.9600).

Quick start

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tok = AutoTokenizer.from_pretrained("nrl-ai/vn-diacritic-small")
model = AutoModelForSeq2SeqLM.from_pretrained("nrl-ai/vn-diacritic-small").eval()

text = "Toi yeu Viet Nam"
out = model.generate(**tok(text, return_tensors="pt"), max_length=256)
print(tok.decode(out[0], skip_special_tokens=True))
# Tôi yêu Việt Nam
```

For the full pipeline (with rule-based + LLM fallbacks), use the nom-vn Python package:

```python
from nom.text.diacritic_models import HFDiacriticModel

restorer = HFDiacriticModel(model_id="nrl-ai/vn-diacritic-small")
restorer("Toi yeu Viet Nam")  # 'Tôi yêu Việt Nam'
```

Evaluation — 4-register matrix

Measured against Toshiiiii1/Vietnamese_diacritics_restoration_5th (public SOTA at the time of training). Word accuracy is computed after Unicode NFC and punctuation normalization on both sides.

| Register | Sentences | Word acc. | Δ vs Toshiiiii1 | Mean ms/sentence |
| --- | --- | --- | --- | --- |
| Formal / legal-prose (UDHR, public domain) | 72 | 91.51 % | -6.63 pp | 117 |
| Modern business / contracts / news (CC0) | 55 | 94.44 % | -3.37 pp | 66 |
| Conversational (Tatoeba, CC-BY 2.0 FR) | 300 | 90.68 % | -3.26 pp | 48 |
| Classical literary (UD-VTB, CC-BY-SA-4.0) | 800 | 86.33 % | -3.07 pp | 78 |

Each eval corpus is open-license and reproducible from the nom-vn repo:

business_55 — benchmarks/data/diacritic_eval_v0.txt (CC0)
literary_udvtb — benchmarks/data/ud_vi_vtb/test.conllu (CC-BY-SA-4.0)
conversational_300 — benchmarks/data/tatoeba_vi/diacritic_eval_300.txt (CC-BY 2.0 FR)
formal_udhr — benchmarks/data/udhr_vi/diacritic_eval_udhr.txt (public domain)
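
The word-accuracy metric described in this section can be sketched as below. This is a minimal illustration, not the benchmark script from the nom-vn repo: it assumes that word accuracy is the share of reference words reproduced at the same whitespace-separated position, and that "punctuation normalization" means stripping ASCII punctuation from the edges of each token. The names `normalize` and `word_accuracy` are hypothetical.

```python
import string
import unicodedata


def normalize(text: str) -> list[str]:
    """NFC-normalize, then strip surrounding ASCII punctuation from each token."""
    text = unicodedata.normalize("NFC", text)
    tokens = (tok.strip(string.punctuation) for tok in text.split())
    return [tok for tok in tokens if tok]


def word_accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of reference words reproduced exactly at the same position."""
    correct = total = 0
    for pred, ref in zip(predictions, references):
        pred_words, ref_words = normalize(pred), normalize(ref)
        total += len(ref_words)
        correct += sum(p == r for p, r in zip(pred_words, ref_words))
    return correct / total if total else 0.0


# One of four words is wrong ("yeu" instead of "yêu"), so accuracy is 0.75.
score = word_accuracy(["Tôi yeu Việt Nam."], ["Tôi yêu Việt Nam"])
print(f"{score:.2f}")
```

Normalizing both sides to NFC matters because the same Vietnamese syllable can be encoded either as precomposed characters or as a base letter plus combining marks; without it, visually identical words would count as errors.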

How we compare

Where this model sits in the public Vietnamese diacritic-restoration landscape, using the same 4-register grid for every measured row. Values are word accuracy in percent. Bold = best in column. Cells marked "n/a" were not run on that register.

| Model | Family | Params | License | formal_72 | business_55 | conv_300 | literary_800 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| `Toshiiiii1/...5th` | public | 200 M | Apache 2.0 | 98.14 | **97.81** | 93.94 | 89.40 |
| `nrl-ai/vn-diacritic-vit5-base` | ours | 220 M | Apache 2.0 | **99.43** | 94.98 | **94.12** | **90.24** |
| this → `nrl-ai/vn-diacritic-small` | ours | 115 M | Apache 2.0 | 91.51 | 94.44 | 90.68 | 86.33 |
| `qthuan2604/ViT5_Restore_Diacritics_Vietnamese` | public | 220 M | MIT | n/a | 90.59 | n/a | n/a |
| `qthuan2604/BARTPho_Syllable_Restore_Diacritics_Vietnamese` | public | 115 M | MIT | n/a | 83.92 | n/a | n/a |
| OpenAI gpt-4o-mini (cloud) | external | proprietary | proprietary | n/a | 95.37 | n/a | n/a |
| rule-based (`nom.text.fix_diacritics`) | ours | 0 | Apache 2.0 | n/a | 41.06 | n/a | n/a |

Training

Base: vinai/bartpho-syllable-base (MIT license)
Corpus: 500,000 (input, target) pairs from a register-balanced mix of Vietnamese Wikipedia (CC-BY-SA-4.0) and Vietnamese news (tmnam20/Vietnamese-News-dedup, CC-BY-4.0, NFC-normalized). The corpus is guarded against evaluation leakage from the held-out nrl-ai/vn-diacritic-eval slices.
Validation: 5,000 held-out pairs from the same training mix.
Epochs: 5
Effective batch size: 32 (32 per device, grad-accum 1)
Learning rate: 0.0005 with cosine schedule, 500 warmup steps
Precision: bf16
Sequence length: input 256, target 256
Early stopping: patience=0 on eval_loss
Training time: 88.0 min on a single NVIDIA RTX 3090 24 GB
Seed: 42
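
The step count and learning-rate curve implied by these settings can be worked out with a short sketch. It assumes the common schedule shape of linear warmup followed by cosine decay to zero over the full run, and that all 5 epochs are completed; early stopping would shorten the run. The constant and function names are illustrative, not taken from the training script.

```python
import math

NUM_PAIRS = 500_000      # training pairs, from the list above
BATCH_SIZE = 32          # effective batch size
EPOCHS = 5
PEAK_LR = 5e-4
WARMUP_STEPS = 500

STEPS_PER_EPOCH = math.ceil(NUM_PAIRS / BATCH_SIZE)
TOTAL_STEPS = STEPS_PER_EPOCH * EPOCHS


def learning_rate(step: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to zero at TOTAL_STEPS."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))


print(STEPS_PER_EPOCH, TOTAL_STEPS)
for step in (0, 250, 500, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(step, f"{learning_rate(step):.6f}")
```

Under these assumptions the run is 15,625 optimizer steps per epoch and 78,125 steps in total, so the 500 warmup steps cover well under 1 % of training.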

Intended use

Recommended: Restoring diacritics in modern Vietnamese text where the input is predominantly diacritic-stripped (e.g. user typed without IME, OCR with no diacritic-aware font, or imported foreign-keyboard data).
Not recommended: Spelling correction (the model does not change letters, only adds tone/vowel marks), text generation, classification, or any task whose input distribution does not match the training data (e.g. heavy emoji or mixed-script text).
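
Because the model is meant to add marks without changing letters, a simple sanity check on its output is to strip diacritics from both input and output and compare. The sketch below uses only the standard library; `strip_diacritics` and `only_marks_added` are hypothetical helper names, and treating đ/Đ as d/D with a mark is an assumption made for this check.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove Vietnamese tone and vowel marks, keeping the base letters."""
    decomposed = unicodedata.normalize("NFD", text)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # "đ"/"Đ" are separate letters with no decomposition, so map them by hand.
    return no_marks.replace("đ", "d").replace("Đ", "D")


def only_marks_added(source: str, restored: str) -> bool:
    """True if `restored` differs from `source` only by diacritics."""
    return strip_diacritics(restored) == strip_diacritics(source)


print(strip_diacritics("Tôi yêu Việt Nam"))
print(only_marks_added("Toi yeu Viet Nam", "Tôi yêu Việt Nam"))
print(only_marks_added("Toi yeu Viet Nam", "Tôi yêu Việt Nan"))
```

The same `strip_diacritics` helper is also a convenient way to produce diacritic-stripped test inputs from correctly written Vietnamese text.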

Limitations

Register skew. Encyclopedic Wikipedia training tilts the model toward formal, written register. Accuracy varies by register (86.33 % to 94.44 % word accuracy on the evaluation matrix), and conversational or dialect text may be worse.
Proper noun ambiguity. Multiple plausible diacritisations exist for the same ASCII form (Hung → Hùng / Hưng / Hứng). The model picks the form that was most likely in its training data, which may not match a specific person's name.
Long sentences truncate at 256 sub-word tokens. Split paragraphs at sentence boundaries before calling.
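
One way to follow the advice on long inputs is to split a paragraph into sentences, restore each one, and join the results. The sketch below is a naive splitter on terminal punctuation, offered as an assumption rather than the nom-vn pipeline's behaviour: it will also split after abbreviations such as "TP.", and a single very long sentence can still exceed 256 tokens. `restore` stands for any callable that maps a string to a string, such as a wrapper around the model.

```python
import re

# Split after ., !, ? or … when followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?…])\s+")


def split_sentences(paragraph: str) -> list[str]:
    """Split a paragraph into sentences on terminal punctuation."""
    return [s for s in _SENTENCE_END.split(paragraph.strip()) if s]


def restore_paragraph(paragraph: str, restore) -> str:
    """Apply `restore` (any str -> str callable) sentence by sentence."""
    return " ".join(restore(sentence) for sentence in split_sentences(paragraph))


sentences = split_sentences("Toi yeu Viet Nam. Ban co khoe khong? Cam on ban!")
print(sentences)
```

Joining with a single space discards the original whitespace between sentences; keep the separators if exact layout matters.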

License & attribution

This fine-tune is released under Apache 2.0.

Base model: vinai/bartpho-syllable-base by VinAI Research, MIT license. Cite both this fine-tune and the base model:

```bibtex
@misc{bartpho_2022,
  title={BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese},
  author={Nguyen Luong Tran and Duong Minh Le and Dat Quoc Nguyen},
  year={2022},
  note={INTERSPEECH}
}

@misc{nom_vn_diacritic_2026,
  title={Vietnamese Diacritic Restoration: register-balanced BARTpho-syllable fine-tune},
  author={Nguyen, Viet-Anh and {Neural Research Lab}},
  year={2026},
  howpublished={\url{https://huggingface.co/nrl-ai/vn-diacritic-small}}
}
```

Training corpus licenses: CC-BY-SA-4.0 (Wikipedia) and CC-BY-4.0 (news). Output text from this model is therefore best treated as CC-BY-SA-4.0 if you want to be safe about derivative-rights propagation, even though the model weights themselves are Apache 2.0.
