# SindhiLM-Tokenizer-v2

Morpheme-boundary-aware BPE tokenizer for Sindhi, merged into Qwen2.5-0.5B-Instruct.
## Key Improvements over v1
| Feature | v1 | v2 |
|---|---|---|
| Root integrity (ڪاوڙ) | Broken (ڪاو\|ڙ) | Intact (ڪاوڙ\|يندڙ) |
| Byte ghosts (avg) | 21–27 per sentence | 2–8 per sentence |
| Arabic comma (،) | Rejected as noise | Preserved |
| Context efficiency vs Qwen | 1.52x | 1.47x |
| Sindhi tokens added | 7,978 | 4,571 (cleaner) |
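The "byte ghosts" row refers to byte-fallback fragments left behind when a byte-level BPE split lands inside a multi-byte UTF-8 character. A minimal illustration (not v1's actual merges, just the failure mode):

```python
# Every Sindhi letter is a 2-byte UTF-8 sequence; a byte-level split that
# lands mid-character leaves fragments that cannot decode on their own.
word = "ڪاوڙ"
raw = word.encode("utf-8")      # 8 bytes: 4 letters x 2 bytes each
left, right = raw[:5], raw[5:]  # a split inside the third letter
print(left.decode("utf-8", errors="replace"))   # trailing U+FFFD "ghost"
print(right.decode("utf-8", errors="replace"))  # leading U+FFFD "ghost"
```

The two halves only decode cleanly when rejoined, which is why splits that respect character (and, in v2, morpheme) boundaries cut the ghost count.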
## Innovations
- V3 Sindhi Pre-Tokenizer — Regex pattern that keeps aspirated digraphs intact (e.g. گھ in گھر, جھ in جھيل)
- SindhiNLTK Morpheme Pre-Segmentation — Corpus pre-segmented at morpheme boundaries before BPE training
- Fixed Noise Filter — Arabic comma (U+060C) no longer rejected; single-char tokens excluded
- 32K Vocab — Tighter vocabulary budget forces smarter merges than v1's wasteful 40K
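To show what digraph-aware pre-tokenization means in practice, here is an illustrative sketch; the actual V3 regex is not published in this card, and the consonant set below is a hypothetical subset:

```python
import re

# Illustrative only: U+06BE (ھ, heh doachashmee) after a consonant forms an
# aspirated digraph that must stay a single BPE atom. The consonant class
# here is a hypothetical subset, not the real V3 pattern.
DIGRAPH = re.compile(r"[گجڪٻڏ]ھ|.", re.DOTALL)

def split_atoms(word):
    """Split a word into initial BPE atoms, keeping aspirated digraphs whole."""
    return DIGRAPH.findall(word)

print(split_atoms("گھر"))   # the digraph گھ stays one atom
print(split_atoms("جھيل"))  # likewise جھ
```

Because BPE merges only within atoms the pre-tokenizer emits, keeping گھ as one atom guarantees no merge ever strands the ھ with a neighboring character.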
## Usage

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("aakashMeghwar01/SindhiLM-Tokenizer-v2")
tokens = tok.tokenize("ڪاوڙيندڙ ماڻهو گھر ۾ مسئلا پيدا ڪندو آهي")
print(tokens)
```
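The sentence above opens with ڪاوڙيندڙ, the word from the comparison table: v2 keeps the root ڪاوڙ intact because the training corpus was pre-segmented at morpheme boundaries. A toy stand-in for that pre-segmentation step (SindhiNLTK's real suffix inventory is not reproduced here; the list below is hypothetical):

```python
# Toy stand-in for SindhiNLTK morpheme pre-segmentation (illustrative only).
SUFFIXES = ["يندڙ", "ندو"]  # hypothetical suffix list

def pre_segment(word):
    """Split a known suffix off the root so BPE trains on morpheme pieces."""
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) > len(suf):
            return [word[: -len(suf)], suf]
    return [word]

print(pre_segment("ڪاوڙيندڙ"))  # root + suffix, matching the table's split
```

Training BPE on a corpus segmented this way makes ڪاوڙ and يندڙ frequent standalone units, so the learned merges naturally respect the boundary.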
## Training Data
Trained on sindhi-corpus-505m (742K docs, ~505M tokens).
## Author
Aakash Meghwar — HuggingFace · GitHub