Initial commit: KashTok tokenizer

Files changed (4)

README.md +93 -0
special_tokens_map.json +7 -0
tokenizer.json +0 -0
tokenizer_config.json +10 -0

README.md ADDED

	@@ -0,0 +1,93 @@

+---
+language:
+- ks
+- kas
+license: apache-2.0
+tags:
+- tokenizer
+- kashmiri
+- low-resource
+- perso-arabic
+- word
+library_name: transformers
+---
+# Word Tokenizer for Kashmiri
+A word-level tokenizer for Kashmiri (ISO 639-3: `kas`) trained on
+**KS-LIT-3M**, a 3.1M-word literary corpus. Released as part of the
+**KashTok** study (Malik et al., 2026), the first systematic
+linguist-verified tokenization comparison for Kashmiri.
+## Quick Start
+```python
+from transformers import AutoTokenizer
+# Download the tokenizer files from the Hugging Face Hub.
+tokenizer = AutoTokenizer.from_pretrained("Omarrran/Kashmiri_Word_Tokenizer")
+text = "کٲشِر زَبان چھِیہٕ خٲص زَبان"
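+# return_tensors="pt" returns PyTorch tensors, so PyTorch must be installed.
+encoding = tokenizer(text, return_tensors="pt")
+print(tokenizer.tokenize(text))  # token strings
+print(encoding.input_ids)  # token IDs as a tensor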
+```
+## Model Details
+| Property | Value |
+|---|---|
+| Tokenizer type | Word |
+| Vocabulary size | 50,000 |
+| Training corpus | KS-LIT-3M train split (2.47M words, 129,672 segments) |
+| Special tokens | `[PAD]` `[UNK]` `[CLS]` `[SEP]` `[MASK]` |
+| Max sequence length | 512 |
+| Text normalization | NFC + `KS_CHAR_MAP` normalization |
+## Evaluation Metrics
+Computed on **16,209** held-out test segments from KS-LIT-3M that were not seen during training:
+| Metric | Value |
+|---|---|
+| Composite Quality Score (CQS) | 0.5121 |
+| Fertility (↓ better) | 1.0004 |
+| Diacritic Preservation Score (↑) | 0.9612 |
+| Morphological Boundary Alignment (↑) | N/A |
+| Out-of-Vocabulary rate (↓) | 0.0573 |
+| Reconstruction (char-level, ↑) | 0.4433 |
+See [the paper](https://arxiv.org/) for full evaluation methodology and
+the linguist-verified gold morpheme reference.
+## Recommended Use
+**Primary use case:** vocabulary lookup and bag-of-words baselines. Not recommended for production.
+The 50K vocabulary matches whole words only, so any word outside it maps to `[UNK]`. Warning: the OOV rate on the test set is 5.73%.
+## Companion Repositories
+All five KashTok tokenizers, including this one, are available for direct comparison:
+- [Kashmiri_Char_Tokenizer](https://huggingface.co/Omarrran/Kashmiri_Char_Tokenizer)
+- [Kashmiri_Word_Tokenizer](https://huggingface.co/Omarrran/Kashmiri_Word_Tokenizer)
+- [Kashmiri_WordPiece_Tokenizer](https://huggingface.co/Omarrran/Kashmiri_WordPiece_Tokenizer)
+- [Kashmiri_BPE_Tokenizer](https://huggingface.co/Omarrran/Kashmiri_BPE_Tokenizer)
+- [Kashmiri_Unigram_Tokenizer](https://huggingface.co/Omarrran/Kashmiri_Unigram_Tokenizer)
+## Citation
+```bibtex
+@article{malik2026kashtok,
+  title  = {KashTok: Tokenizing Kashmiri at Scale with Novel
+            Diacritic- and Morphology-Aware Metrics},
+  author = {Malik, Haq Nawaz and Nissar, Nahfid and others},
+  year   = {2026}
+}
+```
+## Linguistic Verification
+Every Kashmiri character, diacritic, and morpheme split used to
+evaluate this tokenizer was confirmed in a native-speaker
+linguistic review (40 consonants, 7 vowels, 11 diacritics, 26 gold
+morpheme splits).
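
The fertility and out-of-vocabulary figures in the README's Evaluation Metrics table can be illustrated with a small, self-contained sketch. It assumes the common definitions (fertility as tokens per whitespace-separated word, OOV rate as the share of tokens that fall back to `[UNK]`); the paper's exact formulas are not shown in this commit, and the vocabulary and sentences below are made-up placeholders, not KS-LIT-3M data.

```python
# Toy word-level tokenizer: whitespace split plus vocabulary lookup.
# Assumption: this mirrors a generic word tokenizer, not KashTok's exact pipeline.
UNK = "[UNK]"


def word_tokenize(text, vocab):
    """Keep each whitespace-separated word, or replace it with [UNK] if unseen."""
    return [w if w in vocab else UNK for w in text.split()]


def fertility(texts, vocab):
    """Average number of tokens produced per whitespace-separated word."""
    n_tokens = sum(len(word_tokenize(t, vocab)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return n_tokens / n_words


def oov_rate(texts, vocab):
    """Fraction of produced tokens that fell back to [UNK]."""
    tokens = [tok for t in texts for tok in word_tokenize(t, vocab)]
    return tokens.count(UNK) / len(tokens)


# Placeholder data, chosen only to make the arithmetic easy to follow.
vocab = {"alpha", "beta", "gamma"}
held_out = ["alpha beta delta", "gamma alpha"]

print(fertility(held_out, vocab))  # 1.0 (5 tokens for 5 words)
print(oov_rate(held_out, vocab))   # 0.2 (1 of 5 tokens is [UNK])
```

Under these definitions a pure whitespace tokenizer always has fertility exactly 1. The reported 1.0004 suggests the released tokenizer occasionally splits a whitespace-delimited word, for example at punctuation, but that reading is a guess rather than something the commit states.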

special_tokens_map.json ADDED

	@@ -0,0 +1,7 @@

+{
+  "pad_token": "[PAD]",
+  "unk_token": "[UNK]",
+  "cls_token": "[CLS]",
+  "sep_token": "[SEP]",
+  "mask_token": "[MASK]"
+}
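
As a rough illustration of how the five special tokens above are typically used, the sketch below encodes a sentence in a BERT-style layout: `[CLS]` first, `[SEP]` last, `[UNK]` for unseen words, and `[PAD]` up to a fixed length. This layout is an assumption; the post-processing actually configured in `tokenizer.json` is not visible in this diff, and `encode_words` and the toy vocabulary are hypothetical names introduced only for the example.

```python
# Special tokens copied from special_tokens_map.json.
SPECIAL = {
    "pad_token": "[PAD]",
    "unk_token": "[UNK]",
    "cls_token": "[CLS]",
    "sep_token": "[SEP]",
    "mask_token": "[MASK]",
}


def encode_words(text, vocab, max_len):
    """Hypothetical helper: BERT-style framing of a word-level token sequence."""
    # Unseen words fall back to the unknown token.
    words = [w if w in vocab else SPECIAL["unk_token"] for w in text.split()]
    # Reserve two positions for [CLS] and [SEP], then truncate.
    words = words[: max_len - 2]
    tokens = [SPECIAL["cls_token"], *words, SPECIAL["sep_token"]]
    # Right-pad to the fixed length.
    tokens += [SPECIAL["pad_token"]] * (max_len - len(tokens))
    return tokens


vocab = {"alpha", "beta"}  # placeholder vocabulary
print(encode_words("alpha delta beta", vocab, max_len=8))
# ['[CLS]', 'alpha', '[UNK]', 'beta', '[SEP]', '[PAD]', '[PAD]', '[PAD]']
```

`[MASK]` is not used here; it usually matters only for masked-language-model training.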

tokenizer.json ADDED

The diff for this file is too large to render. See raw diff

tokenizer_config.json ADDED

	@@ -0,0 +1,10 @@

+{
+  "backend": "tokenizers",
+  "cls_token": "[CLS]",
+  "mask_token": "[MASK]",
+  "model_max_length": 512,
+  "pad_token": "[PAD]",
+  "sep_token": "[SEP]",
+  "tokenizer_class": "TokenizersBackend",
+  "unk_token": "[UNK]"
+}
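
Because the special tokens are declared in both `special_tokens_map.json` and `tokenizer_config.json`, a quick consistency check can catch drift between the two files. The sketch below inlines the JSON exactly as added in this commit so that it runs without network access; in a real checkout the two files would presumably be read from disk instead.

```python
import json

# Contents copied from the two files added in this commit.
special_tokens_map = json.loads("""
{
  "pad_token": "[PAD]",
  "unk_token": "[UNK]",
  "cls_token": "[CLS]",
  "sep_token": "[SEP]",
  "mask_token": "[MASK]"
}
""")

tokenizer_config = json.loads("""
{
  "backend": "tokenizers",
  "cls_token": "[CLS]",
  "mask_token": "[MASK]",
  "model_max_length": 512,
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "tokenizer_class": "TokenizersBackend",
  "unk_token": "[UNK]"
}
""")

# Collect every special token whose value differs between the two files.
mismatches = {
    key: (value, tokenizer_config.get(key))
    for key, value in special_tokens_map.items()
    if tokenizer_config.get(key) != value
}

print(mismatches)                            # {}
print(tokenizer_config["model_max_length"])  # 512
```

An empty `mismatches` dictionary means the two files agree, and the `model_max_length` value matches the 512 maximum sequence length stated in the README.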