aakashMeghwar01 committed
Commit 13b28c9 · verified · 1 Parent(s): 68cbe63

Add tokenizer v2 model card

Files changed (1)
  1. README.md +52 -0
README.md ADDED
@@ -0,0 +1,52 @@
+ ---
+ language:
+ - sd
+ license: apache-2.0
+ tags:
+ - sindhi
+ - tokenizer
+ - bpe
+ - morphology
+ - arabic-script
+ - low-resource
+ base_model: Qwen/Qwen2.5-0.5B-Instruct
+ ---
+
+ # SindhiLM-Tokenizer-v2
+
+ Morpheme-boundary-aware BPE tokenizer for Sindhi, merged into Qwen2.5-0.5B-Instruct.
+
+ ## Key Improvements over v1
+
+ | Feature | v1 | v2 |
+ |---------|----|----|
+ | Root integrity (`ڪاوڙ`) | Broken (`ڪاو\|ڙ`) | Intact (`ڪاوڙ\|يندڙ`) |
+ | Byte ghosts (avg) | 21-27 per sentence | 2-8 per sentence |
+ | Arabic comma `،` | Rejected as noise | Preserved |
+ | Context efficiency vs Qwen | 1.52x | 1.47x |
+ | Sindhi tokens added | 7,978 | 4,571 (cleaner) |
+
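The "Context efficiency vs Qwen" row can be read as a token-count ratio. The card does not define the metric, so the sketch below assumes it means "tokens produced by the base tokenizer divided by tokens produced by this tokenizer" over the same sentences. The function name `context_efficiency` and the toy tokenizers are hypothetical, used only for illustration.

```python
from typing import Callable, Iterable, List


def context_efficiency(
    sentences: Iterable[str],
    base_tokenize: Callable[[str], List[str]],
    new_tokenize: Callable[[str], List[str]],
) -> float:
    """Return base-token count divided by new-token count (assumed definition)."""
    sentences = list(sentences)
    base_total = sum(len(base_tokenize(s)) for s in sentences)
    new_total = sum(len(new_tokenize(s)) for s in sentences)
    if new_total == 0:
        raise ValueError("new tokenizer produced no tokens")
    return base_total / new_total


# Toy stand-ins: character-level splitting vs. whitespace splitting.
# With real tokenizers, pass each tokenizer's `tokenize` method instead.
ratio = context_efficiency(["a b c"], base_tokenize=list, new_tokenize=str.split)
print(f"{ratio:.2f}x")
```

A value above 1.0 under this definition means the same text fits in fewer tokens than with the base tokenizer.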
+ ## Innovations
+
+ 1. **V3 Sindhi Pre-Tokenizer**: A regex pattern that keeps aspirated digraphs (گھر, جھيل) intact
+ 2. **SindhiNLTK Morpheme Pre-Segmentation**: The corpus is pre-segmented at morpheme boundaries before BPE training
+ 3. **Fixed Noise Filter**: The Arabic comma (U+060C) is no longer rejected; single-character tokens are excluded
+ 4. **32K Vocab**: A tighter budget than 40K forces more selective merges
+
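To illustrate what "keeping aspirated digraphs intact" can mean, here is a minimal sketch that treats a letter followed by the aspiration marker U+06BE as one unit. The card does not show the actual V3 pattern, so `SINDHI_UNIT` and `split_units` are hypothetical, and the real pre-tokenizer may work quite differently.

```python
import re
from typing import List

# Hypothetical pattern, NOT the card's actual V3 regex.
ASPIRATION = "\u06BE"  # ARABIC LETTER HEH DOACHASHMEE

SINDHI_UNIT = re.compile(
    # An Arabic-script letter (other than the marker itself) plus the marker
    rf"(?!{ASPIRATION})[\u0621-\u06D3]{ASPIRATION}"
    # Otherwise, any single character
    r"|[\s\S]"
)


def split_units(word: str) -> List[str]:
    """Split a word into units, keeping letter+U+06BE digraphs together."""
    return SINDHI_UNIT.findall(word)


# The two example words from the list above, written as code points:
ghar = "\u06AF\u06BE\u0631"        # gaf + heh doachashmee + reh
jheel = "\u062C\u06BE\u064A\u0644"  # jeem + heh doachashmee + yeh + lam

print(split_units(ghar))
print(split_units(jheel))
```

Each word's first unit is the two-character digraph, so a later merge step never sees the consonant separated from its aspiration marker.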
+ ## Usage
+
+ ```python
+ from transformers import AutoTokenizer
+
+ # Load the merged tokenizer from the Hugging Face Hub
+ tok = AutoTokenizer.from_pretrained("aakashMeghwar01/SindhiLM-Tokenizer-v2")
+ # Split a Sindhi sentence into subword tokens
+ tokens = tok.tokenize("ڪاوڙيندڙ ماڻهو گھر ۾ مسئلا پيدا ڪندو آهي")
+ print(tokens)
+ ```
+
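A quick way to check root integrity on tokenizer output is sketched below. It assumes the tokenizer is byte-level like its Qwen base, in which case `tokenize()` may print byte-mapped symbols rather than readable Sindhi, so each id is decoded separately. The helpers `readable_pieces` and `root_is_intact` are hypothetical names, and "intact" is taken to mean that the root sits entirely inside one piece.

```python
from typing import List


def readable_pieces(tok, text: str) -> List[str]:
    """Decode each token id on its own so pieces display as text."""
    ids = tok.encode(text, add_special_tokens=False)
    return [tok.decode([i]) for i in ids]


def root_is_intact(pieces: List[str], root: str) -> bool:
    """True if the whole root appears inside a single piece (assumed definition)."""
    return any(root in piece for piece in pieces)


# The two segmentations reported in the comparison table:
v1_pieces = ["ڪاو", "ڙ"]
v2_pieces = ["ڪاوڙ", "يندڙ"]

print(root_is_intact(v1_pieces, "ڪاوڙ"))
print(root_is_intact(v2_pieces, "ڪاوڙ"))
```

With a loaded tokenizer, `root_is_intact(readable_pieces(tok, "ڪاوڙيندڙ"), "ڪاوڙ")` runs the same check on real output. Pieces that cover only part of a multi-byte character may decode to replacement characters.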
+ ## Training Data
+
+ Trained on [sindhi-corpus-505m](https://huggingface.co/datasets/aakashMeghwar01/sindhi-corpus-505m) (742K docs, ~505M tokens).
+
+ ## Author
+
+ **Aakash Meghwar**: [HuggingFace](https://huggingface.co/aakashMeghwar01) · [GitHub](https://github.com/AakashKumarMissrani)