# nrl-ai/vn-spell-correction-base — Vietnamese spell correction (ViT5 fine-tune)
Fixes typos, missing accents, and OCR-style character errors in Vietnamese text in one pass: Toi yu Vit Nam → Tôi yêu Việt Nam. Strictly more than diacritic restoration — it also handles letter-level mistakes, missing / extra characters, and OCR substitutions like o↔0, l↔1, m↔rn.
Fine-tuned from `VietAI/vit5-base` on the `nrl-ai/vn-spell-correction-train` corpus (459K (noisy, clean) Vietnamese pairs synthesized from a register-balanced Wiki+news mix via `nom.text.noise`).

Adoption gate: ✅ passed (light avg 0.9832, heavy avg 0.9703).
## Quick start

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tok = AutoTokenizer.from_pretrained("nrl-ai/vn-spell-correction-base")
model = AutoModelForSeq2SeqLM.from_pretrained("nrl-ai/vn-spell-correction-base").eval()

text = "Toi yu Vit Nam"
out = model.generate(**tok(text, return_tensors="pt"), max_length=256)
print(tok.decode(out[0], skip_special_tokens=True))
# Tôi yêu Việt Nam
```
For batched inference (recommended for high-throughput pipelines):

```python
from nom.text.diacritic_models import HFDiacriticModel

restorer = HFDiacriticModel(model_id="nrl-ai/vn-spell-correction-base")
fixed = restorer.predict_batch(noisy_sentences, batch_size=16)
```
## Evaluation — 8-split spell-correction grid

Evaluation uses `nrl-ai/vn-spell-correction-eval` (2,098 pairs across 4 registers × 2 noise levels). Metric: word accuracy after NFC + punctuation normalization on both sides.
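The scoring script isn't included in the card; below is a minimal sketch of the metric as described (NFC normalization plus ASCII-punctuation stripping on both sides, then a position-wise word comparison). The helper names are ours, and a real scorer may use an edit-distance alignment instead of the position-wise zip.

```python
import string
import unicodedata

# Strip ASCII punctuation after Unicode NFC normalization (per the card).
PUNCT = str.maketrans("", "", string.punctuation)

def norm_tokens(text):
    """NFC-normalize, drop ASCII punctuation, whitespace-tokenize."""
    return unicodedata.normalize("NFC", text).translate(PUNCT).split()

def word_accuracy(pred, ref):
    """Fraction of reference words matched position-wise (a simplification)."""
    p, r = norm_tokens(pred), norm_tokens(ref)
    if not r:
        return 1.0
    return sum(a == b for a, b in zip(p, r)) / len(r)

def sent_exact(pred, ref):
    """Sentence-exact match under the same normalization."""
    return norm_tokens(pred) == norm_tokens(ref)
```

Normalizing both sides first means a trailing period or an NFD-encoded reference can't masquerade as a spelling error.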
| Register | Sents | Word acc | Sent exact | Mean ms/sent |
|---|---|---|---|---|
| Modern business / news (light) | 44 | 98.74 % | 84.09 % | 152 |
| Formal / legal-prose (light) | 65 | 99.75 % | 93.85 % | 290 |
| Conversational (light) | 179 | 97.68 % | 81.56 % | 106 |
| Classical literary (light) | 608 | 97.11 % | 73.03 % | 171 |
| Modern business / news (heavy / OCR) | 55 | 98.97 % | 85.45 % | 146 |
| Formal / legal-prose (heavy / OCR) | 72 | 99.05 % | 83.33 % | 273 |
| Conversational (heavy / OCR) | 287 | 95.54 % | 73.52 % | 103 |
| Classical literary (heavy / OCR) | 788 | 94.56 % | 56.85 % | 160 |
Each split corresponds to a (register, noise level) combination:
- light noise — ~5 % char-level edit distance, models a person typing Vietnamese on a keyboard with a few accent slips and the occasional fat-finger.
- heavy noise — ~15-20 % edit distance, models OCR output of a mid-quality scan with diacritic drops + char confusions.
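`nom.text.noise` itself is not reproduced here; as an illustration of the corruption classes described above (accent slips, OCR confusions, fat-finger edits), here is a toy character-level noiser with a target edit rate. This is hypothetical code, not the actual generator, and the confusion tables are deliberately tiny.

```python
import random

# Toy confusion tables modelled after the card's examples (not exhaustive).
ACCENT_DROP = {"ô": "o", "ê": "e", "ệ": "e", "ă": "a", "đ": "d", "ư": "u", "ơ": "o"}
OCR_CONFUSE = {"o": "0", "l": "1", "m": "rn"}

def add_noise(text, edit_rate=0.05, seed=None):
    """Corrupt roughly `edit_rate` of characters (illustrative only).

    Case is not preserved for accent drops; a real generator would handle it.
    """
    rng = random.Random(seed)
    out = []
    for ch in text:
        if rng.random() < edit_rate:
            low = ch.lower()
            if low in ACCENT_DROP:
                out.append(ACCENT_DROP[low])       # accent slip
            elif low in OCR_CONFUSE:
                out.append(OCR_CONFUSE[low])       # OCR substitution
            elif rng.random() < 0.5:
                continue                           # deletion
            else:
                out.append(ch + ch)                # fat-finger duplication
        else:
            out.append(ch)
    return "".join(out)
```

With `edit_rate=0.05` this approximates the light split; `edit_rate=0.15` approximates the heavy one.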
## How we compare

Where this model sits in the public Vietnamese spell-correction landscape — same 8-split grid for every measured row.

| Model | Family | Params | License | light avg | heavy avg |
|---|---|---|---|---|---|
| **this →** `nrl-ai/vn-spell-correction-base` | ours | ? | Apache 2.0 | 98.32 | 97.03 |
| `bmd1905/vietnamese-correction-v2` | public | 400 M | Apache 2.0 | 86.69 | 72.62 |
| `iAmHieu2012/vit5-vietnamese-spelling-correction` | public | 220 M | MIT | 80.72 | 56.55 |
The two averaged columns:
- light avg = mean word accuracy across the 4 light-noise splits.
- heavy avg = mean word accuracy across the 4 heavy-noise splits.
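Both columns are plain unweighted means over the four per-register word accuracies, which you can check directly against the grid above:

```python
# Per-register word accuracies from the 8-split grid.
light = [98.74, 99.75, 97.68, 97.11]  # business, formal, conversational, literary
heavy = [98.97, 99.05, 95.54, 94.56]

light_avg = round(sum(light) / len(light), 2)
heavy_avg = round(sum(heavy) / len(heavy), 2)
print(light_avg, heavy_avg)  # 98.32 97.03
```

Note the means are unweighted by split size, so the 788-sentence literary split counts the same as the 55-sentence business split.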
## Training

- Base: `VietAI/vit5-base` (MIT license)
- Corpus: 545,000 (noisy, clean) pairs from `nrl-ai/vn-spell-correction-train`; eval-leak guarded against `nrl-ai/vn-spell-correction-eval` and `nrl-ai/vn-diacritic-eval`
- Validation: 5,000 held-out pairs
- Epochs: 5
- Effective batch size: 32 (32 per device, grad-accum 1)
- Learning rate: 5e-4 with cosine schedule, 500 warmup steps
- Precision: bf16
- Sequence length: input 256, target 256
- Early stopping: patience=0 on `eval_loss`
- Training time: 215 min on a single NVIDIA RTX 3090 24 GB
- Seed: 42
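The recipe above maps onto `transformers` `Seq2SeqTrainingArguments` roughly as follows. This is a sketch: the actual training script is not published, every value is copied from the bullet list, and the `eval_strategy` argument is named `evaluation_strategy` on older transformers releases.

```python
from transformers import Seq2SeqTrainingArguments

# Sketch reconstructed from the hyperparameter list above (assumption, not
# the published script). Early stopping with patience=0 would be added via
# EarlyStoppingCallback(early_stopping_patience=0) on the Trainer.
args = Seq2SeqTrainingArguments(
    output_dir="vn-spell-correction-base",
    num_train_epochs=5,
    per_device_train_batch_size=32,
    gradient_accumulation_steps=1,      # effective batch size 32
    learning_rate=5e-4,
    lr_scheduler_type="cosine",
    warmup_steps=500,
    bf16=True,
    predict_with_generate=True,
    generation_max_length=256,          # target length 256
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    seed=42,
)
```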
## Intended use

- Recommended: cleaning up noisy Vietnamese text — OCR output, user-generated text typed without a Vietnamese IME, form data with typos, short-form social-media posts. Strictly harder than diacritic restoration, but covers it as a subset.
- Not recommended: text generation, classification, sentiment, NER, or any task whose input distribution doesn't match the training data.
## Limitations

**In-distribution metric; the real world is harder — measured.** Training and eval both use `nom.text.noise`, so the synthetic 8-split numbers above measure how well we invert our own noise generator. We also benchmark on a 150-sentence hand-curated OOD eval (6 registers, bootstrap 95 % CI):

| Slice | this model | Toshiiiii1 (public) | bmd1905 (public) |
|---|---|---|---|
| forum_25 | 59.45 % | 60.11 % | 59.02 % |
| mobile_25 | 95.01 % | 96.95 % | 88.09 % |
| telex_real_25 | 17.38 % | 18.54 % | 11.58 % |
| ocr_25 | 93.62 % | 94.22 % | 47.42 % |
| legal_real_25 | 95.09 % | 93.80 % | 54.90 % |
| news_real_25 | 96.54 % | 94.07 % | 30.62 % |
| Aggregate (n=150) | 77.43 % | 77.40 % | 49.21 % |

Synthetic light_avg is 98.58 %; the real-world aggregate is 77.43 %. That 21 pp gap is the cost of training only on `light`/`telex_typo`/`heavy` noise — those capture the surface of typos but not real Telex keystroke artefacts (`dduwojc` for `được`) or forum-style abbreviations (`ko bt` for `không biết`). On OOD this model ties with `Toshiiiii1` (77.43 vs 77.40, within the bootstrap CI). A v0.2.29 retrain on the v2 multi-source corpus plus `comprehensive_noise()` (which adds `telex_grammar_noise()` and `mobile_noise()`) is queued and targets a clear OOD lead, not just a synthetic one.

**Heavy-noise corner cases.** OCR outputs that drop entire words or add hallucinated text are out of scope; the noise generator we trained on caps edits per sentence (max 25 % edit ratio).
**Long sequences truncate at 256 sub-word tokens.** Split paragraphs at sentence boundaries before calling.
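For example, a naive regex splitter (a proper Vietnamese sentence segmenter would handle abbreviations and quoted speech better):

```python
import re

def split_sentences(paragraph):
    """Split on whitespace that follows sentence-final punctuation,
    keeping the punctuation with its sentence."""
    parts = re.split(r"(?<=[.!?…])\s+", paragraph.strip())
    return [p for p in parts if p]
```

Feed the resulting sentences to the model one by one (or batched), then rejoin with spaces.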
**No grammar or stylistic correction.** This model fixes character / syllable / diacritic errors but doesn't rewrite phrasing.
**Confidence intervals on small splits.** business_55 (44/55 sents) and formal_72 (65/72 sents) carry a ±3-4 pp 95 % CI; the larger literary_800 split is within ±1 pp. Treat single-pp differences with care.
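The intervals quoted above come from resampling per-sentence scores; the exact script isn't published, but a standard percentile bootstrap looks roughly like this:

```python
import random

def bootstrap_ci(per_sent_acc, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for the mean per-sentence accuracy."""
    rng = random.Random(seed)
    n = len(per_sent_acc)
    means = sorted(
        sum(rng.choices(per_sent_acc, k=n)) / n for _ in range(n_boot)
    )
    lo = means[int(alpha / 2 * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

Smaller splits resample fewer sentences, so their intervals widen — which is exactly why the 44-sentence business split carries a ±3-4 pp band.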
## License & attribution

Released under Apache 2.0. Cite both this model and the base:
```bibtex
@misc{nom_vn_spell_correction_2026,
  title={Vietnamese Spell Correction — register-balanced fine-tune},
  author={Nguyen, Viet-Anh and {Neural Research Lab}},
  year={2026},
  howpublished={\url{https://huggingface.co/nrl-ai/vn-spell-correction-base}}
}
```
Training data inherits CC-BY-SA-4.0 (Wikipedia portion) and CC-BY-4.0 (news portion); to be safe, treat model output as CC-BY-SA-4.0.