SindhiLM Tokenizer v3 (Unigram)
This tokenizer was built and empirically tested specifically for the Sindhi language. Version 3 moves from a Byte-Level BPE model to a Unigram model with Whitespace pre-tokenization, in order to respect Sindhi morphological boundaries and prevent the shattering of Perso-Arabic text.
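As a rough illustration of why a byte-level model can fragment Perso-Arabic text, the sketch below counts the UTF-8 bytes behind a short word. The word is only an illustrative string (three Arabic-block code points) and is not taken from the test set. Before any merges are learned, a byte-level model starts from the bytes, not the letters.

```python
# Illustrative word only: U+06AF U+06BE U+0631 (three Arabic-block code points).
word = "\u06af\u06be\u0631"

utf8_bytes = word.encode("utf-8")

# Every code point in the Arabic block (U+0600-U+06FF) takes 2 bytes in UTF-8,
# so a byte-level model starts from twice as many symbols as there are letters.
print(len(word))                     # number of code points
print(len(utf8_bytes))               # number of bytes
print([hex(b) for b in utf8_bytes])  # the raw byte values
```

A Unigram model over Unicode characters starts from whole letters instead, so a rare word can at worst fall back to letters and not to partial byte sequences.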
Corpus Cleaning
The tokenizer was trained on 200,000 documents from a newly filtered version of the Sindhi corpus. Filtering was upgraded from a heuristic script-ratio check to a fastText language-identification pass (lid.176) to aggressively remove Farsi, Urdu, and Pashto contamination.
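The two filtering approaches can be sketched as follows. This is a minimal sketch and not the actual cleaning script: the function names, the `min_prob` threshold, and the model path `lid.176.bin` are assumptions made for illustration. The script-ratio check shows why a heuristic struggles here, since Farsi, Urdu, and Pashto share the Arabic script with Sindhi, while the language-ID filter keeps only documents whose top predicted label is Sindhi (`__label__sd` in fastText's labeling scheme).

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical adapter type: returns (top label, probability) for one line of text.
Predictor = Callable[[str], Tuple[str, float]]


def arabic_script_ratio(text: str) -> float:
    """Share of non-space characters in the Arabic Unicode block (U+0600-U+06FF).

    This cannot separate Sindhi from Farsi, Urdu, or Pashto, which use the same block.
    """
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum("\u0600" <= c <= "\u06ff" for c in chars) / len(chars)


def filter_sindhi(
    docs: Iterable[str],
    predict: Predictor,
    min_prob: float = 0.5,          # assumed threshold, not from the source
    target: str = "__label__sd",
) -> List[str]:
    """Keep documents whose top language-ID label is the target language."""
    kept = []
    for doc in docs:
        # fastText predicts on a single line, so newlines are flattened first.
        label, prob = predict(doc.replace("\n", " "))
        if label == target and prob >= min_prob:
            kept.append(doc)
    return kept


def fasttext_predictor(model_path: str = "lid.176.bin") -> Predictor:
    """Wrap a fastText LangID model; requires the `fasttext` package and model file."""
    import fasttext  # imported lazily so the rest of the sketch runs without it

    model = fasttext.load_model(model_path)

    def predict(text: str) -> Tuple[str, float]:
        labels, probs = model.predict(text, k=1)
        return labels[0], float(probs[0])

    return predict
```

In use, `filter_sindhi(docs, fasttext_predictor())` would run the language-ID pass, while any callable with the same shape can stand in for the model during testing.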
Empirical Verification Results
- Vocabulary Size: 32000
- Algorithm: Unigram
- Average Fertility Rate: 1.619 tokens per word (tested on fresh, held-out documents).
- Ground-Truth Morpheme Split Test: 1/9 exact matches against native-speaker evaluations.
- Digraph Integrity: The ฺฏฺพ (gh) aspirated digraph is sometimes split by this tokenizer.
- Izafat Case (ุกู / ุก): Words containing the Sindhi genitive construction previously shattered into 5-6 fragments under BPE. Under this Unigram model, they tokenize into 3 or fewer logical pieces (passed verification).
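The fertility figure above can be reproduced along these lines. This is a sketch under one assumption: fertility is taken to mean total tokens divided by total whitespace-separated words, which may differ in detail from the evaluation script actually used. The `tokenize` argument is a hypothetical callable that maps one word to its list of tokens.

```python
from typing import Callable, Iterable, List


def fertility(docs: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens per whitespace-separated word."""
    total_tokens = 0
    total_words = 0
    for doc in docs:
        for word in doc.split():
            total_words += 1
            total_tokens += len(tokenize(word))
    if total_words == 0:
        raise ValueError("no words found in the evaluation documents")
    return total_tokens / total_words


# Toy tokenizer for demonstration only: cuts each word into 2-character chunks.
def toy_tokenize(word: str) -> List[str]:
    return [word[i:i + 2] for i in range(0, len(word), 2)]


print(fertility(["abcd ef"], toy_tokenize))
```

A value close to 1.0 means most words survive as a single token; higher values mean more fragmentation per word.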
Known Limitations
The following hand-corrected morpheme boundary splits failed during testing (8 of the 9 test words). These failures suggest ambiguity that can only be resolved with part-of-speech context, which a pure subword tokenizer does not have:
- ฺฏฺพุฑู: Expected ['ฺฏฺพุฑ', 'ู'], but the tokenizer split into ['ฺฏ', 'ฺพ', 'ุฑ', 'ู'].
- ฺฺูชุฑู: Expected ['ฺฺูชุฑ', 'ู'], but the tokenizer split into ['ฺฺูชุฑู'].
- ุงุณุชุงุฏู: Expected ['ุงุณุชุงุฏ', 'ุงู'], but the tokenizer split into ['ุงุณุชุงุฏู'].
- ฺูฉฺพูุฏู: Expected ['ฺูฉฺพ', 'ูุฏู'], but the tokenizer split into ['ู', 'ฺฉ', 'ฺพูุฏู'].
- ูพฺููู: Expected ['ูพฺฺพ', 'ูู'], but the tokenizer split into ['ูพฺููู'].
- ฺูู: Expected ['ฺู', 'ู'], but the tokenizer split into ['ฺูู'].
- ฺฺููู: Expected ['ฺูู', 'ฺู'], but the tokenizer split into ['ฺฺููู'].
- ฺูฉูุฏฺ: Expected ['ฺูฉฺพ', 'ูุฏ', 'ฺ'], but the tokenizer split into ['ฺูฉูุฏฺ'].
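An exact-match check of the kind reported above could look like the sketch below. The function name and the under/over-split labels are illustrative assumptions, and the demo cases are English placeholders, not the Sindhi test set. With a real tokenizer, any word-boundary marker the model adds to tokens would need to be stripped before comparison.

```python
from typing import Callable, List, Sequence, Tuple

Case = Tuple[str, List[str]]  # (word, expected morpheme split)


def exact_match_report(
    cases: Sequence[Case],
    tokenize: Callable[[str], List[str]],
) -> Tuple[int, List[Tuple[str, List[str], List[str], str]]]:
    """Count exact matches and collect failures with a coarse failure type."""
    failures = []
    for word, expected in cases:
        got = list(tokenize(word))
        if got == list(expected):
            continue
        if len(got) < len(expected):
            kind = "under-split"   # fewer pieces than the reference
        elif len(got) > len(expected):
            kind = "over-split"    # more pieces than the reference
        else:
            kind = "wrong-boundary"
        failures.append((word, list(expected), got, kind))
    return len(cases) - len(failures), failures


# Placeholder data and toy tokenizer, for demonstration only.
demo_cases = [("walked", ["walk", "ed"]), ("cats", ["cat", "s"]), ("unhappy", ["un", "happy"])]
demo_vocab = {"walked": ["walk", "ed"], "cats": ["cats"], "unhappy": ["un", "hap", "py"]}

matches, failed = exact_match_report(demo_cases, lambda w: demo_vocab[w])
print(f"{matches}/{len(demo_cases)} exact matches")
for word, expected, got, kind in failed:
    print(word, expected, got, kind)
```

Separating under-splits from over-splits is useful here because the failures listed above are of both kinds: some words come out as a single token, while others are broken into more pieces than the reference.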