SindhiLM-Tokenizer (Upgraded Unigram Architecture)

SindhiLM-Tokenizer is a morphology-aware subword tokenizer engineered specifically for Sindhi, a morphologically rich language written in a Perso-Arabic script.

In this major architectural upgrade, we transitioned from standard Byte-Pair Encoding (BPE) to a Unigram language model paired with Metaspace pre-tokenization. This ensures that tokenization explicitly respects Sindhi morpheme boundaries, laying the foundation for Morpheme-Boundary-Aware Attention (MBAA) in large language models.
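As a sketch of how such a tokenizer can be assembled, the following uses the Hugging Face tokenizers library to pair a Unigram model with Metaspace pre-tokenization. The tiny corpus and trainer settings here are illustrative assumptions, not the actual training recipe (and prepend_scheme requires a recent tokenizers release):

```python
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

# Illustrative corpus; the real tokenizer was trained on a large Sindhi corpus.
corpus = ["شاگردن کي ڪتابن مان گهڻو ڪجهه سکڻ گهرجي."] * 100

# Unigram model + Metaspace pre-tokenization, mirroring the architecture above.
tokenizer = Tokenizer(models.Unigram())
tokenizer.pre_tokenizer = pre_tokenizers.Metaspace(prepend_scheme="always")
tokenizer.decoder = decoders.Metaspace()

# 32,000 is the production vocabulary limit; a toy corpus like this one
# will simply converge to far fewer pieces.
trainer = trainers.UnigramTrainer(
    vocab_size=32000,
    special_tokens=["<unk>"],
    unk_token="<unk>",  # guarantees character-level fallback
)
tokenizer.train_from_iterator(corpus, trainer=trainer)

print(tokenizer.encode("ڪتابن").tokens)
```

Because every word is fused with a leading "▁" marker, no standalone space tokens appear in the output.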

Architectural Breakthroughs

Traditional tokenizers (like WordPiece or BPE) often fail on Sindhi by arbitrarily splitting root words and suffixes, destroying linguistic meaning and inflating vocabulary size. This model addresses those issues through three core innovations:

  1. The Metaspace Fix (prepend_scheme="always"): Standard tokenizers often emit the space before a Sindhi word as a standalone token (e.g. [' ', 'ڪتاب']), which can nearly double sequence length and shrink the usable context window. We use Metaspace pre-tokenization, which fuses the space onto the start of the word as a '▁' (U+2581) marker (['▁ڪتاب']), preserving context length.
  2. Vocabulary Compression (32,000 Limit): Previous iterations suffered from vocabulary bloat (>150,000 tokens), causing severe sparsity in the embedding matrix. Enforcing a strict 32,000-token limit forces the Unigram algorithm to stop memorizing whole inflected forms (like ڪتابن, "books") and instead isolate the root (ڪتاب, "book") from its inflectional suffix (ن, -n).
  3. Morpheme-Boundary Awareness: Because suffixes are isolated as their own tokens, token boundaries align with Sindhi's true morphological structure, discouraging the model from predicting grammatically invalid subwords.
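To make item 1 concrete, here is a minimal, self-contained sketch of Metaspace-style pre-tokenization. This toy function is an illustration of the idea, not the library's actual implementation:

```python
# Toy illustration of Metaspace pre-tokenization: every word carries its
# preceding space as a fused "▁" (U+2581) marker, so whitespace never
# becomes a standalone token.
def metaspace_pretokenize(text, replacement="\u2581", prepend_scheme="always"):
    pieces = []
    for i, word in enumerate(text.split(" ")):
        if not word:
            continue
        if i == 0 and prepend_scheme != "always":
            pieces.append(word)  # leave the first word bare
        else:
            pieces.append(replacement + word)  # fuse the space into the word
    return pieces

print(metaspace_pretokenize("ڪتاب پڙهو"))  # ['▁ڪتاب', '▁پڙهو']
```

With prepend_scheme="always" even the first word is marked, so the subword model sees word boundaries uniformly across the whole sequence.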

Evaluation Metrics

This tokenizer was evaluated against standard multilingual models on complex Sindhi corpora containing dense diacritics and poly-morphemic structures:

| Metric | Target | Achieved | Notes |
| --- | --- | --- | --- |
| Algorithm | Unigram | Unigram | Optimized for morphologically rich languages. |
| Vocab Size | 32k - 48k | 32,000 | Prevents sparsity in the embedding layer. |
| Fertility Rate | 1.20 - 1.50 | ~1.35 | Balances sequence length against vocabulary memory. |
| UNK Ratio | < 0.1% | 0.000% | Character-level fallback prevents unknown-token errors. |
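The fertility rate in the table is the average number of subword tokens produced per whitespace-separated word. A small sketch of how such a number can be measured (the tokenize callable and sample corpus here are placeholders, not the evaluation setup):

```python
def fertility_rate(tokenize, lines):
    """Average number of subword tokens produced per whitespace-separated word."""
    total_tokens = 0
    total_words = 0
    for line in lines:
        total_words += len(line.split())
        total_tokens += len(tokenize(line))
    return total_tokens / total_words

# Placeholder tokenizer that splits every word in half, for demonstration only.
halve = lambda s: [p for w in s.split()
                   for p in (w[: len(w) // 2], w[len(w) // 2 :]) if p]

print(fertility_rate(halve, ["ڪتابن مان سکو"]))  # → 2.0
```

A fertility near 1.0 means the vocabulary memorizes whole words; a high fertility means long sequences. The 1.20-1.50 target band trades a modest sequence-length increase for a compact embedding matrix.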

Intended Use

This tokenizer is the foundational pre-processing layer for SindhiFormer and other custom Sindhi LLMs. It is designed for:

  • Pre-training base language models from scratch on raw Sindhi text.
  • Academic NLP Research focusing on right-to-left (RTL) Arabic-script languages.
  • Integration into Sindhi NLP libraries for tasks like lemmatization, POS tagging, and sentiment analysis.

Usage in Python

You can load this tokenizer instantly using the Hugging Face transformers library:

from transformers import AutoTokenizer

# Load the upgraded tokenizer
tokenizer = AutoTokenizer.from_pretrained("aakashMeghwar01/SindhiLM-Tokenizer-v1")

# Inspect morphological splitting
text = "شاگردن کي ڪتابن مان گهڻو ڪجهه سکڻ گهرجي."
tokens = tokenizer.tokenize(text)

print(tokens)
# Notice how words are prefixed with '▁' (U+2581) and morpheme boundaries are respected.