Commit cd8ac28 by Omarrran · verified · 1 Parent(s): 2d492a8

Initial commit: KashTok tokenizer

Files changed (4)
  1. README.md +93 -0
  2. special_tokens_map.json +7 -0
  3. tokenizer.json +0 -0
  4. tokenizer_config.json +10 -0
README.md ADDED
@@ -0,0 +1,93 @@
+ ---
+ language:
+ - ks
+ - kas
+ license: apache-2.0
+ tags:
+ - tokenizer
+ - kashmiri
+ - low-resource
+ - perso-arabic
+ - word
+ library_name: transformers
+ ---
+
+ # Word Tokenizer for Kashmiri
+
+ A word-level tokenizer for Kashmiri (ISO 639-3: `kas`) trained on
+ **KS-LIT-3M**, a 3.1M-word literary corpus. Released as part of the
+ **KashTok** study (Malik et al., 2026), the first systematic
+ linguist-verified tokenization comparison for Kashmiri.
+
+ ## Quick Start
+
+ ```python
+ from transformers import AutoTokenizer
+ tokenizer = AutoTokenizer.from_pretrained("Omarrran/Kashmiri_Word_Tokenizer")
+
+ text = "کٲشِر زَبان چھِیہٕ خٲص زَبان"
+ encoding = tokenizer(text, return_tensors="pt")  # "pt" requires PyTorch
+ print(tokenizer.tokenize(text))  # token strings
+ print(encoding.input_ids)  # tensor of vocabulary IDs
+ ```
+
+ ## Model Details
+
+ | Property | Value |
+ |---|---|
+ | Tokenizer type | Word |
+ | Vocabulary size | 50,000 |
+ | Training corpus | KS-LIT-3M training split (2.47M words, 129,672 segments) |
+ | Special tokens | `[PAD]` `[UNK]` `[CLS]` `[SEP]` `[MASK]` |
+ | Max sequence length | 512 |
+ | Pre-tokenization | NFC + KS_CHAR_MAP normalization |
+
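The `Pre-tokenization` row above names two steps, Unicode NFC composition and a character map called `KS_CHAR_MAP`, but the map itself is not part of this commit. As a rough sketch, assuming NFC is applied first and the map is a one-to-one character substitution, the normalization might look like the following. `EXAMPLE_CHAR_MAP` and `normalize_ks` are hypothetical names, and the two mappings are common Perso-Arabic unifications chosen for illustration, not the study's actual table.

```python
import unicodedata

# Hypothetical stand-in for KS_CHAR_MAP; the real mapping is not shown in this commit.
EXAMPLE_CHAR_MAP = {
    "\u064A": "\u06CC",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
    "\u0643": "\u06A9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
}


def normalize_ks(text: str, char_map: dict = EXAMPLE_CHAR_MAP) -> str:
    """Apply NFC composition, then substitute characters via char_map."""
    composed = unicodedata.normalize("NFC", text)
    return composed.translate(str.maketrans(char_map))


# ALEF + combining MADDAH ABOVE composes to a single code point under NFC.
print(len(normalize_ks("\u0627\u0653")))
```

Applying the same function at training and inference time keeps visually identical strings from ending up as different vocabulary entries.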
+ ## Evaluation Metrics
+
+ Computed on **16,209** held-out test segments from KS-LIT-3M that were not seen during training:
+
+ | Metric | Value |
+ |---|---|
+ | Composite Quality Score (CQS) | 0.5121 |
+ | Fertility (↓ better) | 1.0004 |
+ | Diacritic Preservation Score (↑ better) | 0.9612 |
+ | Morphological Boundary Alignment (↑ better) | N/A |
+ | Out-of-Vocabulary rate (↓ better) | 0.0573 |
+ | Reconstruction (char-level, ↑ better) | 0.4433 |
+
+ See [the paper](https://arxiv.org/) for the full evaluation methodology and
+ the linguist-verified gold morpheme reference.
+
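The README does not spell out how the metrics are computed, so the following is only a sketch under commonly used definitions: fertility as tokens per whitespace-delimited word, and OOV rate as the share of emitted tokens equal to `[UNK]`. The function names and the toy data are hypothetical, and the paper's definitions may differ.

```python
from typing import List, Sequence

UNK_TOKEN = "[UNK]"  # matches unk_token in special_tokens_map.json


def fertility(segments: Sequence[str], tokenized: Sequence[List[str]]) -> float:
    """Average number of tokens per whitespace-delimited word."""
    n_words = sum(len(seg.split()) for seg in segments)
    n_tokens = sum(len(toks) for toks in tokenized)
    return n_tokens / n_words if n_words else 0.0


def oov_rate(tokenized: Sequence[List[str]], unk: str = UNK_TOKEN) -> float:
    """Fraction of emitted tokens that are the unknown token."""
    n_tokens = sum(len(toks) for toks in tokenized)
    n_unk = sum(tok == unk for toks in tokenized for tok in toks)
    return n_unk / n_tokens if n_tokens else 0.0


# Toy data: two segments whose out-of-vocabulary words became [UNK].
segments = ["a b c", "a d"]
tokenized = [["a", "b", "[UNK]"], ["a", "[UNK]"]]
print(fertility(segments, tokenized))
print(oov_rate(tokenized))
```

Under these definitions a whole-word tokenizer has a fertility close to 1 by construction, which is consistent with the 1.0004 reported in the table.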
+ ## Recommended Use
+
+ **Primary use case:** Vocabulary lookup and bag-of-words baselines (not recommended for production).
+
+ The 50,000-entry vocabulary matches whole words only. Warning: the test-set OOV rate is 5.73%.
+
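To illustrate whole-word matching and why the OOV rate matters for bag-of-words use, here is a minimal sketch with a toy vocabulary built from the Quick Start sentence. `lookup_tokenize` and `bag_of_words` are hypothetical helpers; the released tokenizer is loaded through `AutoTokenizer`, and its internals may differ.

```python
from collections import Counter
from typing import List, Set

UNK_TOKEN = "[UNK]"


def lookup_tokenize(text: str, vocab: Set[str]) -> List[str]:
    """Split on whitespace and replace any word missing from vocab with [UNK]."""
    return [word if word in vocab else UNK_TOKEN for word in text.split()]


def bag_of_words(text: str, vocab: Set[str]) -> Counter:
    """Count token occurrences, ignoring word order."""
    return Counter(lookup_tokenize(text, vocab))


text = "کٲشِر زَبان چھِیہٕ خٲص زَبان"  # sentence from Quick Start
words = text.split()
toy_vocab = set(words) - {words[2]}  # pretend the third word is out of vocabulary
counts = bag_of_words(text, toy_vocab)
print(counts[UNK_TOKEN])
```

Every unknown word collapses into the same `[UNK]` bucket, so distinctions between rare words are lost in the resulting counts.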
+ ## Companion Repositories
+
+ All five KashTok tokenizers, including this one, are available for direct comparison:
+
+ - [Kashmiri_Char_Tokenizer](https://huggingface.co/Omarrran/Kashmiri_Char_Tokenizer)
+ - [Kashmiri_Word_Tokenizer](https://huggingface.co/Omarrran/Kashmiri_Word_Tokenizer)
+ - [Kashmiri_WordPiece_Tokenizer](https://huggingface.co/Omarrran/Kashmiri_WordPiece_Tokenizer)
+ - [Kashmiri_BPE_Tokenizer](https://huggingface.co/Omarrran/Kashmiri_BPE_Tokenizer)
+ - [Kashmiri_Unigram_Tokenizer](https://huggingface.co/Omarrran/Kashmiri_Unigram_Tokenizer)
+
+ ## Citation
+
+ ```bibtex
+ @article{malik2026kashtok,
+   title  = {KashTok: Tokenizing Kashmiri at Scale with Novel
+             Diacritic- and Morphology-Aware Metrics},
+   author = {Malik, Haq Nawaz and Nissar, Nahfid and others},
+   year   = {2026}
+ }
+ ```
+
+ ## Linguistic Verification
+
+ Every Kashmiri character, diacritic, and morpheme split used to
+ evaluate this tokenizer was confirmed by native-speaker linguistic
+ review (40 consonants, 7 vowels, 11 diacritics, and 26 gold
+ morpheme splits).
special_tokens_map.json ADDED
@@ -0,0 +1,7 @@
+ {
+   "pad_token": "[PAD]",
+   "unk_token": "[UNK]",
+   "cls_token": "[CLS]",
+   "sep_token": "[SEP]",
+   "mask_token": "[MASK]"
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,10 @@
+ {
+   "backend": "tokenizers",
+   "cls_token": "[CLS]",
+   "mask_token": "[MASK]",
+   "model_max_length": 512,
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "tokenizer_class": "TokenizersBackend",
+   "unk_token": "[UNK]"
+ }
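Both `special_tokens_map.json` and `tokenizer_config.json` declare the same five special tokens. A small consistency check, sketched below with the file contents inlined as strings, can catch the two files drifting apart in later commits. `special_token_mismatches` is a hypothetical helper, not part of any library.

```python
import json

# Contents copied from the two JSON files added in this commit.
SPECIAL_TOKENS_MAP = json.loads("""
{"pad_token": "[PAD]", "unk_token": "[UNK]", "cls_token": "[CLS]",
 "sep_token": "[SEP]", "mask_token": "[MASK]"}
""")
TOKENIZER_CONFIG = json.loads("""
{"backend": "tokenizers", "cls_token": "[CLS]", "mask_token": "[MASK]",
 "model_max_length": 512, "pad_token": "[PAD]", "sep_token": "[SEP]",
 "tokenizer_class": "TokenizersBackend", "unk_token": "[UNK]"}
""")


def special_token_mismatches(special_map: dict, config: dict) -> dict:
    """Return {name: (map_value, config_value)} for entries that disagree."""
    return {
        name: (value, config.get(name))
        for name, value in special_map.items()
        if config.get(name) != value
    }


print(special_token_mismatches(SPECIAL_TOKENS_MAP, TOKENIZER_CONFIG))  # prints {}
```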