ByT5-small — Sindhi Spelling Correction

A fine-tuned version of google/byt5-small for automatic spelling correction in Sindhi (Arabic script). The model takes a misspelled Sindhi sentence as input and outputs the corrected version.


Model Details

| Field | Details |
|---|---|
| Base model | google/byt5-small |
| Model type | Encoder-decoder (seq2seq) |
| Language | Sindhi (sd), Arabic script |
| Task | Spelling correction (text2text-generation) |
| Parameters | 299,637,760 |
| License | Apache 2.0 |

Why ByT5?

ByT5 operates directly on raw UTF-8 bytes with no tokenizer vocabulary. This makes it ideal for Sindhi because:

  • No out-of-vocabulary issues with rare characters or diacritics
  • Naturally handles Arabic script variants (zabar, zer, pesh, shadda)
  • Character-level corrections are learned directly from bytes
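The byte-level behavior above can be illustrated without loading the model: ByT5 has no learned vocabulary, and its token id for any byte is simply the byte value offset by three reserved special tokens (pad=0, eos=1, unk=2). A minimal sketch:

```python
# ByT5 tokenization sketch: token id = UTF-8 byte value + 3
# (ids 0-2 are reserved for <pad>, </s>, <unk>).
def byt5_ids(text: str) -> list[int]:
    return [b + 3 for b in text.encode("utf-8")]

word = "سنڌي"  # "Sindhi" in Arabic script
ids = byt5_ids(word)
print(len(word), "characters ->", len(ids), "byte tokens")
# Each Arabic-script character occupies 2 bytes in UTF-8,
# so 4 characters become 8 byte tokens -- no OOV possible.
```

Because every possible byte has an id, rare characters and diacritics can never fall outside the vocabulary.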

Training Details

Dataset

Hyperparameters

| Parameter | Value |
|---|---|
| Max input length | 128 bytes |
| Max target length | 128 bytes |
| Batch size (effective) | 32 (2 per device × 16 gradient accumulation) |
| Epochs | 5 |
| Optimizer | Adafactor |
| Warmup steps | 500 |
| Gradient clipping | 1.0 |
| Precision | bf16 |
| Gradient checkpointing | Yes |
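The hyperparameters above roughly correspond to the following Seq2SeqTrainingArguments. This is a hedged reconstruction, not the actual training script: values not listed in the table (learning rate, save strategy, `output_dir`) are unknown or placeholders.

```python
from transformers import Seq2SeqTrainingArguments

# Sketch of training arguments matching the reported hyperparameters.
args = Seq2SeqTrainingArguments(
    output_dir="byt5-sindhi-spell",     # placeholder
    per_device_train_batch_size=2,
    gradient_accumulation_steps=16,     # effective batch size 32
    num_train_epochs=5,
    optim="adafactor",
    warmup_steps=500,
    max_grad_norm=1.0,                  # gradient clipping
    bf16=True,
    gradient_checkpointing=True,
    predict_with_generate=True,
)
```

Gradient checkpointing plus a small per-device batch with accumulation is what makes a 300M-parameter seq2seq model trainable on 8 GB of VRAM.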

Training Environment

  • GPU: NVIDIA RTX (8GB VRAM)
  • Framework: PyTorch + HuggingFace Transformers
  • Training time: ~3 hours

Evaluation Results

Evaluated on a held-out test set of 3,000 Sindhi sentence pairs with Unicode NFC normalization applied before scoring.

| Metric | Score |
|---|---|
| Character Error Rate (CER) | 0.0447 |
| Exact Match | 0.2897 |

What these numbers mean

  • CER 0.0447 — the model makes errors on only ~4.5% of characters. A 10-word Sindhi sentence is corrected with ~95.5% character accuracy.
  • Exact Match 0.29 — 29% of sentences are corrected perfectly (every character matches). This metric is strict — even one wrong diacritic in a long sentence counts as failure.
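For reference, CER here is the character-level Levenshtein (edit) distance divided by the reference length, computed after NFC normalization. A minimal sketch of the metric (not the exact evaluation script used for the numbers above):

```python
import unicodedata

def levenshtein(a: str, b: str) -> int:
    # Standard dynamic-programming edit distance, one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

def cer(prediction: str, reference: str) -> float:
    # Normalize both sides so composed/decomposed diacritics compare equal.
    pred = unicodedata.normalize("NFC", prediction)
    ref = unicodedata.normalize("NFC", reference)
    return levenshtein(pred, ref) / len(ref)

# One wrong character out of five -> CER 0.2
print(cer("ڪتاب!", "ڪتاب."))  # 0.2
```

NFC normalization matters for Arabic script: the same visible character can be stored as precomposed or base-plus-combining codepoints, and scoring without normalization would count those as errors.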

Per-epoch training curve

| Epoch | CER | Exact Match |
|---|---|---|
| 1 | 0.0601 | 0.1797 |
| 2 | 0.0518 | 0.2383 |
| 3 | 0.0479 | 0.2600 |
| 4 | 0.0463 | 0.2747 |
| 5 | 0.0456 | 0.2807 |
| Test | 0.0447 | 0.2897 |

How to Use

Basic inference

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "DanishMahdi/snd_spell_corrector"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model     = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def correct_sindhi(text: str) -> str:
    inputs = tokenizer(
        text,
        return_tensors="pt",
        max_length=128,
        truncation=True,
    )
    outputs = model.generate(
        **inputs,
        max_length=128,
        num_beams=4,          # beam search for better quality
        early_stopping=True,
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# example
misspelled = "اسن يور ۽ اتر امريع ۾ بعد ۾ سروص سروس سيںٽر قائم ڪا آحں"
corrected  = correct_sindhi(misspelled)
print("Input    :", misspelled)
print("Corrected:", corrected)

Batch inference

import torch

def correct_batch(texts: list[str], batch_size: int = 8) -> list[str]:
    results = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i : i + batch_size]
        inputs = tokenizer(
            batch,
            return_tensors="pt",
            max_length=128,
            truncation=True,
            padding=True,
        )
        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_length=128,
                num_beams=4,
            )
        decoded = tokenizer.batch_decode(outputs, skip_special_tokens=True)
        results.extend(decoded)
    return results

Limitations

  • Max input length is 128 bytes (~40–50 Sindhi characters). Very long sentences will be truncated.
  • Exact match is 29% — the model corrects most characters correctly but may not produce a perfectly identical sentence every time.
  • Trained on synthetic errors — real-world human typos may differ from the error patterns in the training data.
  • Diacritics are hard — Sindhi has many optional diacritical marks; the model may inconsistently add or remove them.
  • Not a language model — the model corrects spelling at the character/byte level and does not understand sentence meaning.

Intended Use

  • Sindhi text preprocessing pipelines
  • OCR post-correction for Sindhi documents
  • Input correction for Sindhi keyboard/typing tools
  • Data cleaning for Sindhi NLP datasets

Out-of-Scope Use

  • Other languages (model is Sindhi-specific)
  • Grammatical error correction (this is spelling only)
  • Sentences longer than ~50 characters (truncation applies)

Citation

If you use this model in your research, please cite:

@misc{byt5-sindhi-spell-correction,
  title     = {ByT5-small Fine-tuned for Sindhi Spelling Correction},
  author    = {Danish},
  year      = {2025},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/DanishMahdi/snd_spell_corrector}
}

Acknowledgements
