SindhiLM Tokenizer v3 (Unigram)

This tokenizer was built and empirically tested for the Sindhi language. Version 3 replaces the earlier Byte-Level BPE model with a Unigram model and Whitespace pre-tokenization, in order to respect Sindhi morphological boundaries and prevent the shattering of Perso-Arabic text.
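
To illustrate how a Unigram model picks a segmentation, here is a minimal pure-Python sketch of the Viterbi search over a piece vocabulary. It is not this tokenizer's implementation: the function names, the toy log-probabilities, and the single-character fallback score are all assumptions, and `str.split()` merely stands in for the Whitespace pre-tokenizer.

```python
import math

def unigram_segment(word, log_probs, unk_log_prob=-20.0):
    """Return the highest-scoring split of `word` under a unigram piece model."""
    n = len(word)
    best = [-math.inf] * (n + 1)  # best[i]: best score for word[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last piece
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in log_probs:
                score = log_probs[piece]
            elif end - start == 1:
                score = unk_log_prob  # assumed fallback for unseen characters
            else:
                continue
            if best[start] + score > best[end]:
                best[end] = best[start] + score
                back[end] = start
    # Walk the back-pointers to recover the pieces in order.
    pieces = []
    i = n
    while i > 0:
        pieces.append(word[back[i]:i])
        i = back[i]
    return pieces[::-1]

def tokenize(text, log_probs):
    """Split on whitespace first, so no piece crosses a word boundary."""
    out = []
    for word in text.split():
        out.extend(unigram_segment(word, log_probs))
    return out

# Hypothetical toy vocabulary: one stem, one suffix, and single characters.
toy_log_probs = {"گھر": -2.0, "ن": -3.0, "گ": -6.0, "ھ": -6.0, "ر": -6.0}
word = "گھرن"
pieces = tokenize(word, toy_log_probs)  # stem + suffix beats four single characters
```

With these toy scores the stem-plus-suffix path totals -5.0, while the character-by-character path totals -21.0, so the search keeps the digraph inside a single piece. A trained model behaves this way only if its vocabulary actually contains the stem with a high enough probability.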

Corpus Cleaning

The tokenizer was trained on 200,000 documents from a newly filtered version of the Sindhi corpus. Filtering was upgraded from a heuristic script-ratio check to a fastText language-identification pass (lid.176) to aggressively remove Farsi, Urdu, and Pashto contamination.
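
A language-ID filter of this kind could be sketched as follows. This is a hedged illustration, not the script used for this corpus: the function name `keep_sindhi`, the injected `predict` callable, and the 0.5 confidence threshold are assumptions. The `__label__sd` target follows the label format of the lid.176 models.

```python
def keep_sindhi(docs, predict, target="__label__sd", min_conf=0.5):
    """Keep documents whose top predicted language is `target`.

    `predict` is any callable mapping a single line of text to a
    (label, confidence) pair; the threshold is an assumed value.
    """
    kept = []
    for doc in docs:
        # fastText predicts one line at a time, so newlines are flattened first.
        label, conf = predict(doc.replace("\n", " "))
        if label == target and conf >= min_conf:
            kept.append(doc)
    return kept

# With fastText installed and lid.176.bin downloaded, `predict` could wrap the model:
#   import fasttext
#   model = fasttext.load_model("lid.176.bin")
#   def predict(text):
#       labels, probs = model.predict(text, k=1)
#       return labels[0], float(probs[0])
```

Passing the predictor in as a callable keeps the filter testable without the model file, and makes it easy to compare against the older script-ratio heuristic on the same documents.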

Empirical Verification Results

  • Vocabulary Size: 32000
  • Algorithm: Unigram
  • Average Fertility Rate: 1.619 tokens per word (tested on fresh, held-out documents).
  • Ground-Truth Morpheme Split Test: 1/9 exact matches against native-speaker evaluations.
  • Digraph Integrity: The گھ (gh) aspirated digraph is sometimes split by this tokenizer.
  • Izafat Case (ءَ / ء): Words containing the Sindhi genitive construction previously shattered into 5-6 fragments under BPE. Under this Unigram model, they tokenize into 3 or fewer logical pieces (passed verification).
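
For readers who want to reproduce checks of this kind, the sketch below shows one way to compute fertility and to test whether a digraph survives tokenization. It assumes that "words" are whitespace-delimited and that the pieces of a word concatenate back to the word. The helper names and the `tokenize` callable are hypothetical, not part of this repository.

```python
def fertility(docs, tokenize):
    """Average number of tokens per whitespace-delimited word."""
    total_tokens = 0
    total_words = 0
    for doc in docs:
        for word in doc.split():
            total_words += 1
            total_tokens += len(tokenize(word))
    return total_tokens / total_words if total_words else 0.0

def digraph_intact(word, pieces, digraph="\u06AF\u06BE"):
    """True if no piece boundary falls inside an occurrence of `digraph`.

    The default digraph is GAF + HEH DOACHASHMEE (the aspirated gh).
    Assumes "".join(pieces) == word.
    """
    # Collect the character offsets at which the word is cut.
    cuts = set()
    pos = 0
    for piece in pieces[:-1]:
        pos += len(piece)
        cuts.add(pos)
    i = word.find(digraph)
    while i != -1:
        # A cut strictly inside the digraph's span splits it.
        if any(i < c < i + len(digraph) for c in cuts):
            return False
        i = word.find(digraph, i + 1)
    return True
```

Here `tokenize` would be a function that returns the list of pieces for one word, for example a thin wrapper around the trained tokenizer.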

Known Limitations

The tokenizer failed to reproduce the following hand-corrected morpheme boundary splits during testing. These failures suggest ambiguity that requires part-of-speech context, which is unavailable to a pure subword tokenizer:

  • گھرن: Expected ['گھر', 'ن'], but the tokenizer split into ['گ', 'ھ', 'ر', 'ن'].
  • ڇوڪري: Expected ['ڇوڪر', 'ي'], but the tokenizer split into ['ڇوڪري'].
  • استادن: Expected ['استاد', 'ان'], but the tokenizer split into ['استادن'].
  • لکھندو: Expected ['لکھ', 'ندو'], but the tokenizer split into ['ل', 'ک', 'ھندو'].
  • پڙهيل: Expected ['پڙھ', 'يل'], but the tokenizer split into ['پڙهيل'].
  • وڃي: Expected ['وڃ', 'ي'], but the tokenizer split into ['وڃي'].
  • ننڍڙو: Expected ['ننڍ', 'ڙو'], but the tokenizer split into ['ننڍڙو'].
  • لکندڙ: Expected ['لکھ', 'ند', 'ڙ'], but the tokenizer split into ['لکندڙ'].
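
A comparison like the one above can be scored mechanically. The sketch below is an assumed scoring helper, not the evaluation script behind the reported numbers. It counts exact matches and labels each miss by whether the tokenizer produced more or fewer pieces than the reference split. The three sample pairs are taken from the list above.

```python
from collections import Counter

def score_splits(cases):
    """Tally outcomes for (expected_pieces, actual_pieces) pairs."""
    outcome = Counter()
    for expected, actual in cases:
        if actual == expected:
            outcome["exact"] += 1
        elif len(actual) > len(expected):
            outcome["over_segmented"] += 1
        elif len(actual) < len(expected):
            outcome["under_segmented"] += 1
        else:
            outcome["different_boundaries"] += 1
    return outcome

# Three of the listed failures, as (expected, actual) pairs.
listed_failures = [
    (["گھر", "ن"], ["گ", "ھ", "ر", "ن"]),
    (["ڇوڪر", "ي"], ["ڇوڪري"]),
    (["وڃ", "ي"], ["وڃي"]),
]
summary = score_splits(listed_failures)
```

Separating over-segmentation from under-segmentation matters when reading the failures: a word kept whole is a missed morpheme boundary, while a digraph broken into single characters is a vocabulary coverage problem.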