# ByT5-small — Sindhi Spelling Correction
A fine-tuned version of google/byt5-small for automatic spelling correction in Sindhi (Arabic script). The model takes a misspelled Sindhi sentence as input and outputs the corrected version.
## Model Details
| Field | Details |
|---|---|
| Base model | google/byt5-small |
| Model type | Encoder-Decoder (Seq2Seq) |
| Language | Sindhi (sd) — Arabic script |
| Task | Spelling correction (text2text-generation) |
| Parameters | 299,637,760 |
| License | Apache 2.0 |
## Why ByT5?
ByT5 operates directly on raw UTF-8 bytes with no tokenizer vocabulary. This makes it ideal for Sindhi because:
- No out-of-vocabulary issues with rare characters or diacritics
- Naturally handles Arabic script variants (zabar, zer, pesh, shadda)
- Character-level corrections are learned directly from bytes
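The byte-level claim above can be made concrete. ByT5's tokenizer has no learned vocabulary: it maps each UTF-8 byte to a fixed id (byte value + 3, with ids 0–2 reserved for pad/eos/unk), so the encoding can be reproduced in plain Python without loading the model:

```python
# ByT5 tokenization = raw UTF-8 bytes, offset by 3 for the special tokens.
# Any Sindhi character, however rare, is therefore always representable.
word = "سنڌي"  # "Sindhi" in Arabic script

raw_bytes = list(word.encode("utf-8"))
token_ids = [b + 3 for b in raw_bytes]

print(len(word))       # 4 characters
print(len(raw_bytes))  # 8 bytes -> 8 tokens (~2 bytes per Arabic-script char)
print(token_ids)
```

Note that each Arabic-script character costs about two bytes, which is why the 128-byte input budget holds fewer Sindhi characters than it would Latin ones.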
## Training Details
### Dataset
- Source: fahadqazi/Sindhi-Misspelled-Sentences
- Training rows: 60,000 pairs
- Validation rows: 3,000 pairs
- Test rows: 3,000 pairs
- Format: incorrect sentence → correct sentence pairs
### Hyperparameters
| Parameter | Value |
|---|---|
| Max input length | 128 bytes |
| Max target length | 128 bytes |
| Batch size (effective) | 32 (2 × 16 grad accum) |
| Epochs | 5 |
| Optimizer | Adafactor |
| Warmup steps | 500 |
| Gradient clipping | 1.0 |
| Precision | bf16 |
| Gradient checkpointing | Yes |
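As a sanity check on the table above, a few lines of arithmetic recover the schedule it implies (assuming the 60,000 training pairs listed in the Dataset section):

```python
# Derive the training schedule implied by the hyperparameter table.
train_rows = 60_000
per_device_batch = 2
grad_accum = 16
epochs = 5
warmup_steps = 500

effective_batch = per_device_batch * grad_accum  # 32
steps_per_epoch = train_rows // effective_batch  # 1875
total_steps = steps_per_epoch * epochs           # 9375
warmup_fraction = warmup_steps / total_steps     # ~5.3% of training

print(effective_batch, steps_per_epoch, total_steps, round(warmup_fraction, 3))
```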
### Training Environment
- GPU: NVIDIA RTX (8GB VRAM)
- Framework: PyTorch + HuggingFace Transformers
- Training time: ~3 hours
## Evaluation Results
Evaluated on a held-out test set of 3,000 Sindhi sentence pairs with Unicode NFC normalization applied before scoring.
| Metric | Score |
|---|---|
| Character Error Rate (CER) | 0.0447 |
| Exact Match | 0.2897 |
### What these numbers mean
- CER 0.0447 — on average only ~4.5% of output characters differ from the reference, i.e. ~95.5% character-level accuracy.
- Exact Match 0.29 — 29% of sentences are corrected perfectly (every character matches). This metric is strict — even one wrong diacritic in a long sentence counts as failure.
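For reference, both metrics can be reproduced with a plain edit-distance implementation. This is a minimal sketch (not the exact evaluation script), applying the NFC normalization mentioned above before scoring:

```python
import unicodedata

def levenshtein(a: str, b: str) -> int:
    # Standard dynamic-programming edit distance over characters.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def cer(pred: str, ref: str) -> float:
    pred = unicodedata.normalize("NFC", pred)
    ref = unicodedata.normalize("NFC", ref)
    return levenshtein(pred, ref) / max(len(ref), 1)

def exact_match(pred: str, ref: str) -> bool:
    return unicodedata.normalize("NFC", pred) == unicodedata.normalize("NFC", ref)

# Toy example: one substituted character out of four.
print(cer("سنڌي", "سندي"))        # 0.25
print(exact_match("سنڌي", "سنڌي"))  # True
```

Corpus-level CER is normally computed as total edits over total reference characters rather than a mean of per-sentence scores; the per-sentence version here is just for illustration.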
### Per-epoch training curve
| Epoch | CER | Exact Match |
|---|---|---|
| 1 | 0.0601 | 0.1797 |
| 2 | 0.0518 | 0.2383 |
| 3 | 0.0479 | 0.2600 |
| 4 | 0.0463 | 0.2747 |
| 5 | 0.0456 | 0.2807 |
| Test | 0.0447 | 0.2897 |
## How to Use
### Basic inference
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "your-username/byt5-sindhi-spell-correction"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def correct_sindhi(text: str) -> str:
    inputs = tokenizer(
        text,
        return_tensors="pt",
        max_length=128,
        truncation=True,
    )
    outputs = model.generate(
        **inputs,
        max_length=128,
        num_beams=4,  # beam search for better quality
        early_stopping=True,
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# example
misspelled = "اسن يور ۽ اتر امريع ۾ بعد ۾ سروص سروس سيںٽر قائم ڪا آحں"
corrected = correct_sindhi(misspelled)
print("Input    :", misspelled)
print("Corrected:", corrected)
```
### Batch inference
```python
import torch

def correct_batch(texts: list[str], batch_size: int = 8) -> list[str]:
    results = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i : i + batch_size]
        inputs = tokenizer(
            batch,
            return_tensors="pt",
            max_length=128,
            truncation=True,
            padding=True,
        )
        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_length=128,
                num_beams=4,
            )
        decoded = tokenizer.batch_decode(outputs, skip_special_tokens=True)
        results.extend(decoded)
    return results
```
## Limitations
- Max input length is 128 bytes (~40–50 Sindhi characters). Very long sentences will be truncated.
- Exact match is 29% — the model corrects most characters correctly but may not produce a perfectly identical sentence every time.
- Trained on synthetic errors — real-world human typos may differ from the error patterns in the training data.
- Diacritics are hard — Sindhi has many optional diacritical marks; the model may inconsistently add or remove them.
- Not a language model — the model corrects spelling at the character/byte level and does not understand sentence meaning.
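One possible workaround for the 128-byte limit is to split long input on word boundaries before correction. The helper below is a hypothetical sketch, not part of the model or its training code:

```python
def split_by_byte_budget(text: str, max_bytes: int = 128) -> list[str]:
    # Greedily pack whitespace-separated words into chunks whose UTF-8
    # encoding fits the model's 128-byte input budget. Splitting on word
    # boundaries avoids cutting a multi-byte character in half.
    chunks, current, used = [], [], 0
    for word in text.split():
        need = len(word.encode("utf-8")) + (1 if current else 0)  # +1 for the joining space
        if current and used + need > max_bytes:
            chunks.append(" ".join(current))
            current, used = [], 0
            need = len(word.encode("utf-8"))
        current.append(word)
        used += need
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Each chunk could then be passed through a correction call like the `correct_sindhi` function above and the outputs rejoined with spaces. A single word longer than the budget is still emitted over-length and would be truncated by the model, and corrections that depend on context across a chunk boundary may suffer.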
## Intended Use
- Sindhi text preprocessing pipelines
- OCR post-correction for Sindhi documents
- Input correction for Sindhi keyboard/typing tools
- Data cleaning for Sindhi NLP datasets
## Out-of-Scope Use
- Other languages (model is Sindhi-specific)
- Grammatical error correction (this is spelling only)
- Sentences longer than ~50 characters (truncation applies)
## Citation
If you use this model in your research, please cite:
```bibtex
@misc{byt5-sindhi-spell-correction,
  title     = {ByT5-small Fine-tuned for Sindhi Spelling Correction},
  author    = {Danish},
  year      = {2025},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/your-username/byt5-sindhi-spell-correction}
}
```
## Acknowledgements
- Base model: google/byt5-small
- Dataset: fahadqazi/Sindhi-Misspelled-Sentences
- Training framework: HuggingFace Transformers