# SindhiLM-Tokenizer (Upgraded Unigram Architecture)
SindhiLM-Tokenizer is a morphology-aware subword tokenizer engineered specifically for Sindhi, a morphologically rich language written in a right-to-left Arabic-derived script.
In this major architectural upgrade, we moved away from standard Byte-Pair Encoding (BPE) to a custom Unigram language model paired with Metaspace pre-tokenization. This design helps the tokenization process respect Sindhi morpheme boundaries, laying the foundation for Morpheme-Boundary-Aware Attention (MBAA) in Large Language Models.
## Architectural Breakthroughs
Traditional tokenizers (such as WordPiece or BPE) often fail on Sindhi: they arbitrarily split root words and suffixes, destroying linguistic meaning and inflating vocabulary size. This model addresses those issues through three core innovations:
- **The Metaspace Fix (`prepend_scheme="always"`)**: Standard pre-tokenizers often emit the space before a Sindhi word as a separate token (`[' ', 'ڪتاب']`), roughly doubling sequence length and wasting a Transformer's context window. Metaspace pre-tokenization instead fuses the space onto the beginning of the word as `▁` (U+2581), yielding `['▁ڪتاب']` and preserving context length.
- **Vocabulary Compression (32,000 Limit)**: Previous iterations suffered from vocabulary bloat (>150,000 tokens), causing severe sparsity in the embedding matrix. Under a strict 32,000-token limit, the Unigram algorithm can no longer memorize whole inflected forms such as `ڪتابن` ("books") and is forced to isolate the root `ڪتاب` ("book") from its plural suffix `ن` (-n).
- **Morpheme-Boundary Awareness**: Because suffixes are isolated, tokens align with Sindhi's true morphological structure, discouraging the model from predicting grammatically invalid subwords.
## Evaluation Metrics
This tokenizer was evaluated against standard multilingual models on complex Sindhi corpora containing dense diacritics and poly-morphemic structures:
| Metric | Target | Achieved | Notes |
|---|---|---|---|
| Algorithm | Unigram | Unigram | Optimized for morphologically rich languages. |
| Vocab Size | 32k - 48k | 32,000 | Prevents sparsity in the Embedding Layer. |
| Fertility Rate | 1.20 - 1.50 | ~1.35 | Balances sequence length against vocabulary and embedding-table size. |
| UNK Ratio | < 0.1% | 0.000% | Character-level fallback eliminated unknown tokens on the evaluation corpora. |
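For reference, fertility is the average number of subword tokens produced per whitespace-delimited word. A minimal sketch of the computation, using a toy stand-in tokenizer (the suffix rule below is purely illustrative, not the real Unigram model):

```python
def fertility(tokenize, sentences):
    """Average number of subword tokens per whitespace word."""
    total_tokens = sum(len(tokenize(s)) for s in sentences)
    total_words = sum(len(s.split()) for s in sentences)
    return total_tokens / total_words

# Toy tokenizer: mimics splitting the plural suffix 'ن' off the root,
# as the Unigram model is described to do for words like ڪتابن.
def toy_tokenize(text):
    tokens = []
    for word in text.split():
        if word.endswith("ن") and len(word) > 3:
            tokens += ["▁" + word[:-1], "ن"]   # root + suffix
        else:
            tokens.append("▁" + word)          # whole word
    return tokens

# 4 words become 6 tokens: a fertility of 1.5, inside the target band.
print(fertility(toy_tokenize, ["شاگردن کي ڪتابن مان"]))  # → 1.5
```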
## Intended Use
This tokenizer is the foundational pre-processing layer for SindhiFormer and other custom Sindhi LLMs. It is designed for:
- Pre-training base language models from scratch on raw Sindhi text.
- Academic NLP Research focusing on right-to-left (RTL) Arabic-script languages.
- Integration into Sindhi NLP libraries for tasks like lemmatization, POS tagging, and sentiment analysis.
## Usage in Python

Load the tokenizer with the Hugging Face `transformers` library:
```python
from transformers import AutoTokenizer

# Load the upgraded tokenizer
tokenizer = AutoTokenizer.from_pretrained("aakashMeghwar01/SindhiLM-Tokenizer-v1")

# Test morphological splitting
text = "شاگردن کي ڪتابن مان گهڻو ڪجهه سکڻ گهرجي."
tokens = tokenizer.tokenize(text)
print(tokens)
# Note how word boundaries are marked with '▁' (U+2581) and morpheme
# boundaries are respected.
```