Word Tokenizer for Kashmiri

A word-level tokenizer for Kashmiri (ISO 639-3: kas) trained on KS-LIT-3M, a 3.1M-word literary corpus. It was released as part of the KashTok study (Malik et al., 2026), the first systematic, linguist-verified comparison of tokenizers for Kashmiri.

Quick Start

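```python
from transformers import AutoTokenizer

# Load the tokenizer from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("Omarrran/Kashmiri_Word_Tokenizer")

text = "کٲشِر زَبان چھِیہٕ خٲص زَبان"

# return_tensors="pt" requires PyTorch to be installed.
encoding = tokenizer(text, return_tensors="pt")

print(tokenizer.tokenize(text))  # word-level tokens
print(encoding.input_ids)        # corresponding vocabulary ids
```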

Model Details

| Property | Value |
|---|---|
| Tokenizer type | Word |
| Vocabulary size | 50,000 |
| Training corpus | KS-LIT-3M (2.47M words, 129,672 train segments) |
| Special tokens | [PAD], [UNK], [CLS], [SEP], [MASK] |
| Max sequence length | 512 |
| Pre-tokenization | NFC + KS_CHAR_MAP normalization |
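
The pre-tokenization step listed above can be pictured with a minimal sketch. It assumes that NFC composition runs first and that a character map then folds variant code points into a single form. The actual contents and ordering of KS_CHAR_MAP are not given in this card, so `EXAMPLE_CHAR_MAP` and `normalize` below are hypothetical stand-ins for illustration only.

```python
import unicodedata

# Hypothetical stand-in for KS_CHAR_MAP. The real mapping is not
# reproduced in this card; these two entries only illustrate the idea
# of folding look-alike Arabic-script code points into one form.
EXAMPLE_CHAR_MAP = {
    "\u0643": "\u06A9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
    "\u064A": "\u06CC",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
}


def normalize(text: str, char_map: dict = EXAMPLE_CHAR_MAP) -> str:
    """Apply Unicode NFC composition, then map variant code points."""
    composed = unicodedata.normalize("NFC", text)
    return composed.translate(str.maketrans(char_map))


# ALEF followed by combining MADDAH ABOVE composes to a single code point.
print(len(normalize("\u0627\u0653")))
```

Normalizing before whole-word lookup matters for a word-level vocabulary, because two visually identical spellings with different code points would otherwise count as different words.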

Evaluation Metrics

Computed on 16,209 held-out test segments from KS-LIT-3M that were not seen during training:

| Metric | Value |
|---|---|
| Composite Quality Score (CQS) | 0.5121 |
| Fertility (↓ better) | 1.0004 |
| Diacritic Preservation Score (↑ better) | 0.9612 |
| Morphological Boundary Alignment (↑ better) | N/A |
| Out-of-Vocabulary rate (↓ better) | 0.0573 |
| Reconstruction (char-level, ↑ better) | 0.4433 |

See the paper for full evaluation methodology and the linguist-verified gold morpheme reference.
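
As a rough illustration of two of the metrics above, the sketch below computes fertility and OOV rate under common definitions: fertility as tokens emitted per whitespace-delimited word, and OOV rate as the share of emitted tokens equal to `[UNK]`. These definitions are assumptions. The paper may define either metric differently, and the helper names `fertility` and `oov_rate` are hypothetical.

```python
def fertility(words_per_segment, tokens_per_segment):
    """Average number of tokens emitted per whitespace word (assumed definition)."""
    n_words = sum(len(seg) for seg in words_per_segment)
    n_tokens = sum(len(seg) for seg in tokens_per_segment)
    if n_words == 0:
        raise ValueError("no words to evaluate")
    return n_tokens / n_words


def oov_rate(tokens_per_segment, unk="[UNK]"):
    """Fraction of emitted tokens that are the unknown token (assumed definition)."""
    n_tokens = sum(len(seg) for seg in tokens_per_segment)
    if n_tokens == 0:
        raise ValueError("no tokens to evaluate")
    n_unk = sum(tok == unk for seg in tokens_per_segment for tok in seg)
    return n_unk / n_tokens


# Toy data: two segments, one word of four falls outside the vocabulary.
words = [["a", "b", "x"], ["c"]]
tokens = [["a", "b", "[UNK]"], ["c"]]
print(fertility(words, tokens), oov_rate(tokens))
```

Under these definitions a word-level tokenizer emits one token per word, which is consistent with the reported fertility being very close to 1.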

Recommended Use

Primary use case: lookup and bag-of-words baselines. This tokenizer is NOT recommended for production use.

The 50K vocabulary matches whole words only, so any word outside it maps to [UNK]. WARNING: the OOV rate on the test set is 5.73%.
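
A minimal sketch of the bag-of-words use case follows, assuming whitespace splitting and a plain set as the vocabulary. It does not call the released tokenizer. `encode_words`, `bag_of_words`, and the two-word `toy_vocab` are hypothetical names introduced only to show how out-of-vocabulary words collapse into a single `[UNK]` count.

```python
from collections import Counter


def encode_words(text, vocab, unk="[UNK]"):
    """Whole-word lookup: keep in-vocabulary words, replace the rest with unk."""
    return [word if word in vocab else unk for word in text.split()]


def bag_of_words(text, vocab, unk="[UNK]"):
    """Count tokens after whole-word lookup."""
    return Counter(encode_words(text, vocab, unk))


text = "کٲشِر زَبان چھِیہٕ خٲص زَبان"
words = text.split()
toy_vocab = {words[0], words[1]}  # hypothetical two-word vocabulary
print(bag_of_words(text, toy_vocab))
```

Every unknown word lands in the same `[UNK]` bucket, so its identity cannot be recovered from the counts. This is the main cost of whole-word matching and the reason the OOV rate matters for downstream use.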

Companion Repositories

The other four KashTok tokenizers are also available for direct comparison.

Citation

@article{malik2026kashtok,
  title  = {KashTok: Tokenizing Kashmiri at Scale with Novel
            Diacritic- and Morphology-Aware Metrics},
  author = {Malik, Haq Nawaz and Nissar, Nahfid and others},
  year   = {2026}
}

Linguistic Verification

Every Kashmiri character, diacritic, and morpheme split used to evaluate this tokenizer was confirmed through native-speaker linguistic review (40 consonants, 7 vowels, 11 diacritics, 26 gold morpheme splits).
