Word Tokenizer for Kashmiri

A word-level tokenizer for Kashmiri (ISO 639-3: kas) trained on KS-LIT-3M, a 3.1M-word literary corpus. It was released as part of the KashTok study (Malik et al., 2026), the first systematic, linguist-verified comparison of tokenizers for Kashmiri.

Quick Start

from transformers import AutoTokenizer

# Load the tokenizer from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("Omarrran/Kashmiri_Word_Tokenizer")

text = "کٲشِر زَبان چھِیہٕ خٲص زَبان"
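# return_tensors="pt" requires PyTorch; omit it to get plain Python lists.
encoding = tokenizer(text, return_tensors="pt")
print(tokenizer.tokenize(text))  # token strings
print(encoding.input_ids)        # token ids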

Model Details

Property	Value
Tokenizer type	Word
Vocabulary size	50,000
Training corpus	KS-LIT-3M (2.47M words, 129,672 train segments)
Special tokens	`[PAD]` `[UNK]` `[CLS]` `[SEP]` `[MASK]`
Max sequence length	512
Pre-tokenization	NFC + KS_CHAR_MAP normalization
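
The pre-tokenization row can be illustrated with a minimal sketch. NFC normalization uses Python's standard `unicodedata` module. The contents of `KS_CHAR_MAP` are not listed in this card, so the two mappings below (Arabic Yeh to Farsi Yeh, Arabic Kaf to Keheh) are hypothetical placeholders, and applying NFC before the character map is likewise an assumption.

```python
import unicodedata

# Hypothetical stand-in for KS_CHAR_MAP: the real table is not given in this card.
EXAMPLE_CHAR_MAP = {
    "\u064A": "\u06CC",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH (assumed)
    "\u0643": "\u06A9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH (assumed)
}


def normalize(text, char_map=EXAMPLE_CHAR_MAP):
    """Apply NFC composition, then map variant code points to one canonical form."""
    composed = unicodedata.normalize("NFC", text)
    return composed.translate(str.maketrans(char_map))
```

Normalizing before vocabulary lookup matters for a whole-word tokenizer, because two visually identical words encoded with different code points would otherwise be treated as different vocabulary entries.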

Evaluation Metrics

Computed on 16,209 held-out test segments from KS-LIT-3M that were not seen during training:

Metric	Value
Composite Quality Score (CQS)	0.5121
Fertility (↓ better)	1.0004
Diacritic Preservation Score (↑)	0.9612
Morphological Boundary Alignment (↑)	N/A
Out-of-Vocabulary rate (↓)	0.0573
Reconstruction (char-level, ↑)	0.4433

See the paper for full evaluation methodology and the linguist-verified gold morpheme reference.
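
For orientation, fertility and OOV rate are commonly computed along the lines of the sketch below. The paper's exact definitions may differ. The function names, the whitespace-based word count, and the token-level OOV count are assumptions made for illustration.

```python
def fertility(token_lists, texts):
    """Average number of tokens per whitespace-delimited word."""
    n_tokens = sum(len(tokens) for tokens in token_lists)
    n_words = sum(len(text.split()) for text in texts)
    return n_tokens / n_words if n_words else 0.0


def oov_rate(token_lists, unk_token="[UNK]"):
    """Fraction of emitted tokens equal to the unknown-token symbol."""
    n_tokens = sum(len(tokens) for tokens in token_lists)
    n_unk = sum(tokens.count(unk_token) for tokens in token_lists)
    return n_unk / n_tokens if n_tokens else 0.0
```

Under this common definition, a fertility close to 1.0 corresponds to roughly one token per word, which is what a word-level tokenizer would be expected to produce.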

Recommended Use

Primary use case: lookup and bag-of-words baselines. This tokenizer is NOT recommended for production.

The 50K vocabulary supports whole-word matching only. WARNING: the out-of-vocabulary rate on the test set is 5.73%.
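
A bag-of-words baseline of the kind mentioned above can be sketched in a few lines. This is an illustration under stated assumptions, not the tokenizer's actual implementation: `bag_of_words` and `toy_vocab` are hypothetical names, splitting is done on whitespace, and the toy vocabulary stands in for the real 50,000-entry one.

```python
from collections import Counter


def bag_of_words(text, vocab, unk_token="[UNK]"):
    """Count whole-word matches; words outside the vocabulary collapse into unk_token."""
    return Counter(word if word in vocab else unk_token for word in text.split())


# Toy vocabulary for illustration only; the released tokenizer has 50,000 entries.
toy_vocab = {"زَبان", "کٲشِر"}
counts = bag_of_words("کٲشِر زَبان چھِیہٕ خٲص زَبان", toy_vocab)
```

Every word missing from the vocabulary ends up in the single `[UNK]` bucket, so its identity is lost. That is why the test-set OOV rate is the main limitation of this tokenizer.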

Companion Repositories

The other four KashTok tokenizers are also available for direct comparison.

Citation

@article{malik2026kashtok,
  title  = {KashTok: Tokenizing Kashmiri at Scale with Novel
            Diacritic- and Morphology-Aware Metrics},
  author = {Malik, Haq Nawaz and Nissar, Nahfid and others},
  year   = {2026}
}

Linguistic Verification

Every Kashmiri character, diacritic, and morpheme split used in evaluating this tokenizer was confirmed in a native-speaker linguistic review (40 consonants, 7 vowels, 11 diacritics, 26 gold morpheme splits).
