SindhiLM Tokenizer v3 (Unigram)

This tokenizer was built and empirically tested for the Sindhi language. Version 3 moves from a Byte-Level BPE model to a Unigram model with Whitespace pre-tokenization, with the goal of respecting Sindhi morphological boundaries and preventing Perso-Arabic text from shattering into fragments.

Corpus Cleaning

The tokenizer was trained on 200,000 documents from a newly filtered version of the Sindhi corpus. Filtering was upgraded from a heuristic script-ratio check to a fastText LangID (lid.176) pass that aggressively removes Farsi, Urdu, and Pashto contamination.
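A LangID filtering pass of this kind could look roughly like the sketch below. It is not the script used for this corpus: the function names, the 0.5 confidence threshold, and the model path argument are all assumptions made for illustration. The only fastText calls used are fasttext.load_model and model.predict, and the target label "sd" is the Sindhi code in lid.176.

```python
from typing import Callable, Iterable, Iterator, Tuple

# A predictor maps a document to (language_code, confidence).
Predictor = Callable[[str], Tuple[str, float]]


def load_fasttext_predictor(model_path: str) -> Predictor:
    """Wrap a fastText LangID model (e.g. lid.176) as a Predictor.

    Needs the `fasttext` package and a model file downloaded by the caller.
    """
    import fasttext  # lazy import: the filter itself has no hard dependency

    model = fasttext.load_model(model_path)

    def predict(text: str) -> Tuple[str, float]:
        # fastText predicts one line at a time, so newlines are flattened.
        labels, probs = model.predict(text.replace("\n", " "), k=1)
        return labels[0].replace("__label__", ""), float(probs[0])

    return predict


def keep_sindhi(
    docs: Iterable[str],
    predict: Predictor,
    target: str = "sd",           # lid.176 code for Sindhi
    min_confidence: float = 0.5,  # hypothetical threshold, not from this card
) -> Iterator[str]:
    """Yield only documents identified as `target` with enough confidence."""
    for doc in docs:
        if not doc.strip():
            continue  # skip empty documents before calling the model
        label, confidence = predict(doc)
        if label == target and confidence >= min_confidence:
            yield doc
```

Passing the predictor in as a plain callable keeps the filter easy to test with a stub and makes it simple to swap in a different language identifier.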

Empirical Verification Results

Vocabulary Size: 32000
Algorithm: Unigram
Average Fertility Rate: 1.619 tokens per word (tested on fresh, held-out documents).
Ground-Truth Morpheme Split Test: 1/9 exact matches against native-speaker evaluations (the 8 failures are listed under Known Limitations).
Digraph Integrity: The aspirated digraph گھ (gh) is sometimes split by this tokenizer.
Izafat Case (ءَ / ء): Words containing the Sindhi genitive construction previously shattered into 5-6 fragments under BPE. Under this Unigram model they tokenize into 3 or fewer logical pieces (passed verification).
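For readers who want to reproduce metrics like these, the sketch below shows one way to compute a fertility rate and to detect a split digraph. It assumes that a "word" is a whitespace-delimited string and that the tokenizer's pieces concatenate back to the original word; neither assumption is confirmed by this card. The tokenize callable is a stand-in for whatever wraps the real tokenizer.

```python
from typing import Callable, Iterable, List

# Stand-in for a function that maps one word to its subword pieces.
Tokenize = Callable[[str], List[str]]


def fertility(docs: Iterable[str], tokenize: Tokenize) -> float:
    """Average number of subword tokens per whitespace-delimited word."""
    n_words = 0
    n_tokens = 0
    for doc in docs:
        for word in doc.split():
            n_words += 1
            n_tokens += len(tokenize(word))
    if n_words == 0:
        raise ValueError("no words found in the evaluation documents")
    return n_tokens / n_words


def splits_digraph(pieces: List[str], digraph: str = "گھ") -> bool:
    """True if an occurrence of `digraph` straddles a piece boundary.

    Compares how often the digraph occurs in the rejoined word with how
    often it occurs wholly inside a single piece.
    """
    in_word = "".join(pieces).count(digraph)
    intact = sum(piece.count(digraph) for piece in pieces)
    return in_word > intact
```

Applied to the pieces reported for گھرن in this card, splits_digraph(['گ', 'ھ', 'ر', 'ن']) returns True, while the expected split ['گھر', 'ن'] keeps the digraph intact and returns False.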

Known Limitations

The tokenizer failed to reproduce the following hand-corrected morpheme boundary splits during testing. These failures suggest that the boundaries are ambiguous without part-of-speech (POS) context, which a pure subword tokenizer does not have:

گھرن: Expected ['گھر', 'ن'], but the tokenizer split into ['گ', 'ھ', 'ر', 'ن'].
ڇوڪري: Expected ['ڇوڪر', 'ي'], but the tokenizer split into ['ڇوڪري'].
استادن: Expected ['استاد', 'ان'], but the tokenizer split into ['استادن'].
لکھندو: Expected ['لکھ', 'ندو'], but the tokenizer split into ['ل', 'ک', 'ھندو'].
پڙهيل: Expected ['پڙھ', 'يل'], but the tokenizer split into ['پڙهيل'].
وڃي: Expected ['وڃ', 'ي'], but the tokenizer split into ['وڃي'].
ننڍڙو: Expected ['ننڍ', 'ڙو'], but the tokenizer split into ['ننڍڙو'].
لکندڙ: Expected ['لکھ', 'ند', 'ڙ'], but the tokenizer split into ['لکندڙ'].
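The comparison behind a list like the one above can be sketched as an exact-match check between gold splits and tokenizer output. The function name and data layout are illustrative assumptions, not the evaluation script used for this model. The example data replays two entries from the list, with a stub standing in for the real tokenizer.

```python
from typing import Callable, Dict, List, Tuple

Tokenize = Callable[[str], List[str]]
Failure = Tuple[str, List[str], List[str]]  # (word, expected, actual)


def exact_match_report(
    gold: Dict[str, List[str]], tokenize: Tokenize
) -> Tuple[int, List[Failure]]:
    """Count words whose pieces equal the gold split exactly; collect the rest."""
    matches = 0
    failures: List[Failure] = []
    for word, expected in gold.items():
        actual = tokenize(word)
        if actual == expected:
            matches += 1
        else:
            failures.append((word, expected, actual))
    return matches, failures


# Two gold splits copied from the list above.
GOLD = {
    "گھرن": ["گھر", "ن"],
    "وڃي": ["وڃ", "ي"],
}

# Tokenizer outputs as reported in this card, replayed by a stub.
RECORDED = {
    "گھرن": ["گ", "ھ", "ر", "ن"],
    "وڃي": ["وڃي"],
}

if __name__ == "__main__":
    matches, failures = exact_match_report(GOLD, lambda w: RECORDED[w])
    print(f"{matches}/{len(GOLD)} exact matches")
    for word, expected, actual in failures:
        print(f"{word}: expected {expected}, got {actual}")
```

Exact match is a strict criterion: a split that differs from the gold split by a single boundary counts as a full failure, which is worth keeping in mind when reading the 1/9 score.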
