---
language:
- sd
license: apache-2.0
tags:
- sindhi
- tokenizer
- bpe
- morphology
- arabic-script
- low-resource
base_model: Qwen/Qwen2.5-0.5B-Instruct
---

# SindhiLM-Tokenizer-v2

A morpheme-boundary-aware BPE tokenizer for Sindhi, merged into Qwen2.5-0.5B-Instruct.

## Key Improvements over v1

| Feature | v1 | v2 |
|---------|----|----|
| Root integrity (`ڪاوڙ`) | Broken (`ڪاو\|ڙ`) | Intact (`ڪاوڙ\|يندڙ`) |
| Byte ghosts (avg) | 21-27 per sentence | 2-8 per sentence |
| Arabic comma `،` | Rejected as noise | Preserved |
| Context efficiency vs Qwen | 1.52x | 1.47x |
| Sindhi tokens added | 7,978 | 4,571 (cleaner) |
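
The "byte ghosts" row can be checked mechanically. Assuming a byte ghost means a byte-level BPE token whose raw bytes are not valid stand-alone UTF-8 (that is, a character split across tokens; this definition is our assumption, not necessarily the exact metric used here), a counter can be sketched as follows, where `bytes_to_unicode` is the standard GPT-2/Qwen byte-to-character table:

```python
def bytes_to_unicode():
    # Standard GPT-2-style byte-level BPE table: every raw byte is rendered
    # as a printable Unicode character in the token strings.
    bs = list(range(33, 127)) + list(range(161, 173)) + list(range(174, 256))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))

UNI2BYTE = {c: b for b, c in bytes_to_unicode().items()}

def is_byte_ghost(token: str) -> bool:
    # A token is a "ghost" (our assumed definition) if its underlying raw
    # bytes are not valid stand-alone UTF-8, i.e. a multi-byte character
    # was split across token boundaries.
    raw = bytes(UNI2BYTE.get(ch, ord(" ")) for ch in token)
    try:
        raw.decode("utf-8")
        return False
    except UnicodeDecodeError:
        return True

def count_byte_ghosts(tokens: list[str]) -> int:
    return sum(is_byte_ghost(t) for t in tokens)
```

With real tokenizer output, `count_byte_ghosts(tok.tokenize(sentence))` gives a per-sentence count comparable to the figures in the table.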

## Innovations

1. **V3 Sindhi Pre-Tokenizer** — Regex pattern that keeps aspirated digraphs (گھر, جھيل) intact
2. **SindhiNLTK Morpheme Pre-Segmentation** — Corpus pre-segmented at morpheme boundaries before BPE training
3. **Fixed Noise Filter** — Arabic comma (U+060C) no longer rejected as noise; single-character tokens excluded
4. **32K Vocab** — A tighter budget forces smarter merges than v1's wasteful 40K
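
The digraph-preserving idea in innovation 1 can be sketched in a few lines. The pattern below is illustrative only, not the actual V3 regex: it refuses to split between a base Arabic-block letter and a following heh doachashmee (U+06BE, the aspiration mark), so digraphs such as گھ and جھ reach BPE as single units.

```python
import re

# Illustrative only; not the actual V3 pattern. A base letter followed by
# U+06BE (heh doachashmee) is consumed as one unit before the single-letter
# alternative can fire, so aspirated digraphs are never split.
ASPIRATED = re.compile(r"[\u0600-\u06FF]\u06BE|[\u0600-\u06FF]|\s+|\S")

def pre_tokenize(text: str) -> list[str]:
    return ASPIRATED.findall(text)

print(pre_tokenize("گھر"))  # the digraph گھ stays in one piece
```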

## Usage

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("aakashMeghwar01/SindhiLM-Tokenizer-v2")
tokens = tok.tokenize("ڪاوڙيندڙ ماڻهو گھر ۾ مسئلا پيدا ڪندو آهي")
print(tokens)
```

## Training Data

Trained on [sindhi-corpus-505m](https://huggingface.co/datasets/aakashMeghwar01/sindhi-corpus-505m) (742K docs, ~505M tokens).

## Author

**Aakash Meghwar** — [HuggingFace](https://huggingface.co/aakashMeghwar01) · [GitHub](https://github.com/AakashKumarMissrani)