# SindhiLM-Tokenizer-v2

Morpheme-boundary-aware BPE tokenizer for Sindhi, merged into Qwen2.5-0.5B-Instruct.

## Key Improvements over v1

| Feature | v1 | v2 |
|---|---|---|
| Root integrity (ڪاوڙ) | Broken (ڪاو\|ڙ) | Intact (ڪاوڙ\|يندڙ) |
| Byte ghosts (avg) | 21-27 per sentence | 2-8 per sentence |
| Arabic comma (،) | Rejected as noise | Preserved |
| Context efficiency vs Qwen | 1.52x | 1.47x |
| Sindhi tokens added | 7,978 | 4,571 (cleaner) |
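A "byte ghost" is a token that covers only a fragment of a multi-byte UTF-8 character, so on its own it decodes to the replacement character U+FFFD. A minimal, self-contained way to count such fragments in decoded token text (an illustrative sketch, not necessarily the exact metric behind the table):

```python
def count_byte_ghosts(decoded_tokens):
    # A byte ghost decodes to text containing U+FFFD, meaning the tokenizer
    # split a multi-byte UTF-8 character (every Sindhi letter is 2 bytes).
    return sum(1 for t in decoded_tokens if "\ufffd" in t)

# Example: splitting the 2-byte letter ڪ (U+06AA) mid-character
raw = "ڪ".encode("utf-8")          # b'\xda\xaa'
halves = [raw[:1].decode("utf-8", errors="replace"),
          raw[1:].decode("utf-8", errors="replace")]
print(count_byte_ghosts(halves))   # both halves are ghosts
```

An intact token such as گھر contains no U+FFFD and counts as zero ghosts.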

## Innovations

  1. **V3 Sindhi Pre-Tokenizer** — regex pattern that keeps aspirated digraphs (گھر, جھيل) intact
  2. **SindhiNLTK Morpheme Pre-Segmentation** — corpus pre-segmented at morpheme boundaries before BPE training
  3. **Fixed Noise Filter** — the Arabic comma (U+060C) is no longer rejected as noise; single-character tokens are excluded
  4. **32K Vocab** — a tighter budget forces smarter merges than v1's wasteful 40K
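The digraph-preserving idea behind the pre-tokenizer can be sketched with a small regex that treats a consonant followed by heh doachashmee (ھ, U+06BE) as one unit, so BPE never sees گ and ھ as separate pieces. This pattern is purely illustrative; the actual V3 pattern is not reproduced here:

```python
import re

# Illustrative pattern (not the real V3 regex): match an Arabic-block letter
# plus U+06BE as one aspirated digraph, otherwise fall back to single
# characters, runs of whitespace, or any other symbol.
ASPIRATED = re.compile(r"[\u0600-\u06FF]\u06BE|[\u0600-\u06FF]|\s+|\S")

def pre_tokenize(text):
    return ASPIRATED.findall(text)

print(pre_tokenize("گھر"))   # the digraph گھ stays intact
```

Because گھ survives pre-tokenization as a single unit, BPE merges can only build on top of it, never split it.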

## Usage

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("aakashMeghwar01/SindhiLM-Tokenizer-v2")
tokens = tok.tokenize("ڪاوڙيندڙ ماڻهو گھر ۾ مسئلا پيدا ڪندو آهي")
print(tokens)
```
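The context-efficiency figure in the table above is simply the ratio of base-model tokens to v2 tokens for the same text; a value above 1 means v2 fits more Sindhi into the same context window. A minimal sketch (the counts below are illustrative, not measured):

```python
def context_efficiency(base_token_count, v2_token_count):
    # >1.0 means the new tokenizer needs fewer tokens for the same text
    return base_token_count / v2_token_count

# Illustrative counts only: e.g. 147 base-Qwen tokens vs 100 v2 tokens
print(context_efficiency(147, 100))  # 1.47
```

To measure it for real, tokenize the same Sindhi text with both `Qwen/Qwen2.5-0.5B-Instruct` and this tokenizer and compare `len(...)` of the two token lists.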

## Training Data

Trained on `sindhi-corpus-505m` (742K docs, ~505M tokens).

## Author

Aakash Meghwar · HuggingFace · GitHub
