Word Tokenizer for Kashmiri
A word-level tokenizer for Kashmiri (ISO 639-3: kas), trained on
KS-LIT-3M, a 3.1M-word literary corpus. It was released as part of the
KashTok study (Malik et al., 2026), the first systematic,
linguist-verified comparison of tokenizers for Kashmiri.
Quick Start
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("Omarrran/Kashmiri_Word_Tokenizer")
text = "کٲشِر زَبان چھِیہٕ خٲص زَبان"
encoding = tokenizer(text, return_tensors="pt")
print(tokenizer.tokenize(text))
print(encoding.input_ids)
Model Details
| Property | Value |
|---|---|
| Tokenizer type | Word |
| Vocabulary size | 50,000 |
| Training corpus | KS-LIT-3M training split (2.47M words, 129,672 segments) |
| Special tokens | [PAD] [UNK] [CLS] [SEP] [MASK] |
| Max sequence length | 512 |
| Pre-tokenization | NFC + KS_CHAR_MAP normalization |
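The pre-tokenization row can be illustrated with a minimal sketch. It assumes NFC composition runs first and the character map second; the card does not state the order. `EXAMPLE_CHAR_MAP` and `normalize` are hypothetical names, and the single mapping shown is illustrative only, not the contents of the real `KS_CHAR_MAP`.

```python
import unicodedata

# Hypothetical stand-in for KS_CHAR_MAP. The real table is not reproduced in
# this card; this one entry (Arabic kaf -> keheh) is for illustration only.
EXAMPLE_CHAR_MAP = {"\u0643": "\u06A9"}


def normalize(text: str, char_map: dict = EXAMPLE_CHAR_MAP) -> str:
    """Compose characters with NFC, then apply per-character substitutions.

    The NFC-then-map order is an assumption, not a documented detail.
    """
    composed = unicodedata.normalize("NFC", text)
    return composed.translate(str.maketrans(char_map))


# Alef followed by a combining maddah composes into one code point under NFC.
print(len("\u0627\u0653"), len(normalize("\u0627\u0653")))
```

Normalizing before vocabulary lookup matters for a whole-word tokenizer, because two visually identical words with different code point sequences would otherwise count as different vocabulary entries.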
Evaluation Metrics
Computed on 16,209 held-out test segments from KS-LIT-3M, none of which were seen during training:
| Metric | Value |
|---|---|
| Composite Quality Score (CQS) | 0.5121 |
| Fertility (↓ better) | 1.0004 |
| Diacritic Preservation Score (↑ better) | 0.9612 |
| Morphological Boundary Alignment (↑ better) | N/A |
| Out-of-Vocabulary rate (↓ better) | 0.0573 |
| Reconstruction, char-level (↑ better) | 0.4433 |
See the paper for full evaluation methodology and the linguist-verified gold morpheme reference.
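For readers unfamiliar with two of the metrics above, here is a rough sketch using common definitions: fertility as tokens per whitespace-delimited word, and OOV rate as the share of tokens mapped to `[UNK]`. The paper's exact definitions may differ. `make_word_tokenizer`, `fertility`, and `oov_rate` are hypothetical helpers, and the toy vocabulary stands in for the real 50,000-entry one.

```python
from typing import Callable, List, Set

UNK = "[UNK]"


def make_word_tokenizer(vocab: Set[str]) -> Callable[[str], List[str]]:
    """Build a toy whole-word tokenizer: unknown words become [UNK]."""
    def tokenize(text: str) -> List[str]:
        return [w if w in vocab else UNK for w in text.split()]
    return tokenize


def fertility(segments: List[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens produced per whitespace-delimited word."""
    n_words = sum(len(s.split()) for s in segments)
    n_tokens = sum(len(tokenize(s)) for s in segments)
    return n_tokens / n_words if n_words else 0.0


def oov_rate(segments: List[str], tokenize: Callable[[str], List[str]]) -> float:
    """Fraction of produced tokens that are [UNK]."""
    tokens = [t for s in segments for t in tokenize(s)]
    return tokens.count(UNK) / len(tokens) if tokens else 0.0


tok = make_word_tokenizer({"a", "b"})
segments = ["a b c", "a"]
print(fertility(segments, tok), oov_rate(segments, tok))
```

A pure whitespace tokenizer like this toy one always has a fertility of exactly 1.0, which is why a value close to 1 is expected for a word-level tokenizer.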
Recommended Use
Primary use case: vocabulary lookup and bag-of-words baselines. Not recommended for production use.
The 50,000-entry vocabulary matches whole words only. Warning: the out-of-vocabulary rate on the held-out test set is 5.73%.
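A bag-of-words baseline over a whole-word vocabulary can be sketched in a few lines. This is an illustration under stated assumptions, not the released tokenizer: `bag_of_words` is a hypothetical helper, the vocabulary is a toy set, and words are split on whitespace.

```python
from collections import Counter
from typing import Set

UNK = "[UNK]"


def bag_of_words(text: str, vocab: Set[str]) -> Counter:
    """Count whole-word occurrences, folding unknown words into [UNK]."""
    return Counter(w if w in vocab else UNK for w in text.split())


counts = bag_of_words("a b a z", {"a", "b"})
print(counts)
```

Every unseen word lands in the same `[UNK]` bucket, so the OOV rate directly limits how much information a baseline of this kind can retain.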
Companion Repositories
This is one of five KashTok tokenizers. The other four are available for direct comparison:
- Kashmiri_Char_Tokenizer
- Kashmiri_WordPiece_Tokenizer
- Kashmiri_BPE_Tokenizer
- Kashmiri_Unigram_Tokenizer
Citation
@article{malik2026kashtok,
title = {KashTok: Tokenizing Kashmiri at Scale with Novel
Diacritic- and Morphology-Aware Metrics},
author = {Malik, Haq Nawaz and Nissar, Nahfid and others},
year = {2026}
}
Linguistic Verification
Every Kashmiri character, diacritic, and morpheme split used to evaluate this tokenizer was confirmed in a native-speaker linguistic review (40 consonants, 7 vowels, 11 diacritics, 26 gold morpheme splits).