Character Tokenizer for Kashmiri

A character-level tokenizer for Kashmiri (ISO 639-3: kas) trained on KS-LIT-3M, a 3.1M-word literary corpus. It was released as part of the KashTok study (Malik et al., 2026), the first systematic, linguist-verified comparison of tokenizers for Kashmiri.

Quick Start

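from transformers import AutoTokenizer

# Load the tokenizer from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("Omarrran/Kashmiri_Char_Tokenizer")

text = "کٲشِر زَبان چھِیہٕ خٲص زَبان"

# return_tensors="pt" requires PyTorch; omit it to get plain Python lists.
encoding = tokenizer(text, return_tensors="pt")

print(tokenizer.tokenize(text))  # token strings
print(encoding.input_ids)        # token ids, including special tokens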

Model Details

| Property | Value |
|---|---|
| Tokenizer type | Character |
| Vocabulary size | 133 |
| Training corpus | KS-LIT-3M (2.47M words, 129,672 train segments) |
| Special tokens | [PAD] [UNK] [CLS] [SEP] [MASK] |
| Max sequence length | 512 |
| Pre-tokenization | NFC + KS_CHAR_MAP normalization |
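The pre-tokenization step (NFC followed by a character map) can be sketched roughly as below. This is only an illustration: `EXAMPLE_CHAR_MAP` and `normalize_example` are hypothetical names, and the two mappings shown are assumed examples of the kind of substitution such a table might contain, not the contents of the released KS_CHAR_MAP.

```python
import unicodedata

# Hypothetical stand-in for KS_CHAR_MAP; the real table ships with the tokenizer
# and may contain different or additional mappings.
EXAMPLE_CHAR_MAP = str.maketrans({
    "\u0643": "\u06A9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
    "\u064A": "\u06CC",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
})


def normalize_example(text: str) -> str:
    """Apply Unicode NFC composition, then a character-level substitution map."""
    composed = unicodedata.normalize("NFC", text)
    return composed.translate(EXAMPLE_CHAR_MAP)


# NFC composes ALEF + MADDAH ABOVE into the single code point ALEF WITH MADDA ABOVE.
print(len("\u0627\u0653"), len(normalize_example("\u0627\u0653")))
```

Running normalization before tokenization matters for a character vocabulary this small, because visually identical strings with different code point sequences would otherwise map to different token ids or to [UNK].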

Evaluation Metrics

Computed on 16,209 held-out test segments from KS-LIT-3M that were not seen during training:

| Metric | Value |
|---|---|
| Composite Quality Score (CQS) | 0.3107 |
| Fertility (↓ better) | 5.2453 |
| Diacritic Preservation Score (↑ better) | 0.0000 |
| Morphological Boundary Alignment (↑ better) | 0.2104 |
| Out-of-Vocabulary rate (↓ better) | 0.0000 |
| Reconstruction (char-level, ↑ better) | N/A |

See the paper for full evaluation methodology and the linguist-verified gold morpheme reference.
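As a rough guide to reading the fertility figure, the sketch below computes fertility under the common definition of average tokens per whitespace-delimited word. That definition is an assumption here; the paper's exact formulation may differ. `fertility` and `codepoint_tokenize` are hypothetical helper names, and `codepoint_tokenize` only approximates a character tokenizer by splitting a word into code points.

```python
from typing import Callable, Iterable, List


def codepoint_tokenize(word: str) -> List[str]:
    """Approximate a character tokenizer: one token per Unicode code point."""
    return list(word)


def fertility(segments: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens per whitespace-delimited word (assumed definition)."""
    total_tokens = 0
    total_words = 0
    for segment in segments:
        for word in segment.split():
            total_tokens += len(tokenize(word))
            total_words += 1
    if total_words == 0:
        raise ValueError("no words found in the given segments")
    return total_tokens / total_words


# Two words with 2 and 3 code points: (2 + 3) / 2 = 2.5 tokens per word.
print(fertility(["ab cde"], codepoint_tokenize))
```

Under this definition, the fertility of a character tokenizer is simply the mean word length in code points, diacritics included.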

Recommended Use

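Primary use cases: ASR/OCR post-correction, character-level models, error analysis

The vocabulary has 133 entries, one token per Unicode code point, which is why the out-of-vocabulary rate on the test set is 0. The trade-off is long sequences: fertility is 5.25, so inputs approach the 512-token maximum sequence length much sooner than with a subword tokenizer.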
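To illustrate what one token per code point means for diacritics, the sketch below splits a single word into code points and reports which tokens are combining marks. It uses only the standard library; `codepoint_tokens` and `combining_marks` are hypothetical helpers, and this is not the paper's Diacritic Preservation Score, whose definition is given in the paper.

```python
import unicodedata
from typing import List


def codepoint_tokens(word: str) -> List[str]:
    """Split a word into one token per Unicode code point."""
    return list(word)


def combining_marks(tokens: List[str]) -> List[str]:
    """Return the single-code-point tokens that are combining marks."""
    return [t for t in tokens if len(t) == 1 and unicodedata.combining(t) != 0]


# ZAIN, FATHA, BEH, ALEF, NOON: the word "zabaan" written with one vowel mark.
word = "\u0632\u064E\u0628\u0627\u0646"
tokens = codepoint_tokens(word)
marks = combining_marks(tokens)

print(len(tokens), "tokens,", len(marks), "of them a standalone combining mark")
```

The fatha ends up as its own token, detached from the zain it modifies. A downstream model therefore has to learn the base-plus-mark association itself.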

Companion Repositories

The other four KashTok tokenizers are also available for direct comparison:

Citation

@article{malik2026kashtok,
  title  = {KashTok: Tokenizing Kashmiri at Scale with Novel
            Diacritic- and Morphology-Aware Metrics},
  author = {Malik, Haq Nawaz and Nissar, Nahfid and others},
  year   = {2026}
}

Linguistic Verification

Every Kashmiri character, diacritic, and morpheme split used to evaluate this tokenizer was confirmed through a linguistic review by a native Kashmiri speaker (40 consonants, 7 vowels, 11 diacritics, 26 gold morpheme splits).
