Upload kashmiri_unigram_tokenizer/README.md with huggingface_hub

kashmiri_unigram_tokenizer/README.md +65 -0

	@@ -0,0 +1,65 @@

+---
+language: ks
+license: apache-2.0
+tags:
+  - tokenizer
+  - kashmiri
+  - nlp
+  - low-resource
+  - arabic-script
+  - dardic
+datasets:
+  - Omarrran/KS-LIT-3M
+---
+# Kashmiri Unigram LM Tokenizer
+> 🏔️ **First systematic tokenizer for Kashmiri (كٲشُر زَبان) — ISO 639-3: kas**
+## Model Description
+| Property | Value |
+|----------|-------|
+| Architecture | Unigram LM |
+| Language | Kashmiri (ks / kas) |
+| Script | Perso-Arabic (Nastaliq) |
+| Vocabulary Size | 32,000 |
+| Training Corpus | KS-LIT-3M (3,091,180 words) |
+| License | Apache-2.0 |
+## 📊 Evaluation Metrics
+| Metric | Value | Description |
+|--------|-------|-------------|
+| Fertility | 1.2000 | Tokens per word (lower = better) |
+| Diacritic Preservation Score (DPS) | 0.9859 | Novel KS-specific metric (1.0 = perfect) |
+| Morphological Boundary Alignment (MBA) | 0.4467 | IoU with gold morpheme boundaries |
+| OOV Rate (held-out) | 0.0000 | Tested on unseen text |
+| Composite Quality Score (CQS) | 0.8848 | Weighted combination |
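For readers who want to reproduce the two metrics whose definitions the table states outright, here is a minimal sketch. It assumes fertility is total tokens divided by whitespace-delimited words, and that MBA compares sets of character-offset boundaries by intersection over union; the README does not spell out either detail, so treat both as assumptions. The names `fertility`, `boundary_iou`, and `toy_tokenize` are illustrative and not part of this repository.

```python
from typing import Callable, Iterable, List, Set


def fertility(texts: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens per whitespace-delimited word."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_words += len(text.split())
        total_tokens += len(tokenize(text))
    if total_words == 0:
        raise ValueError("no words to measure")
    return total_tokens / total_words


def boundary_iou(predicted: Set[int], gold: Set[int]) -> float:
    """Intersection over union of two sets of boundary offsets."""
    union = predicted | gold
    if not union:
        return 1.0  # no boundaries on either side: treated as full agreement
    return len(predicted & gold) / len(union)


def toy_tokenize(text: str) -> List[str]:
    """Hypothetical stand-in tokenizer: splits words longer than 4 characters in two."""
    pieces: List[str] = []
    for word in text.split():
        if len(word) > 4:
            pieces.extend([word[:4], word[4:]])
        else:
            pieces.append(word)
    return pieces


print(fertility(["abcdef gh ij kl mn"], toy_tokenize))  # 6 tokens / 5 words = 1.2
print(boundary_iou({2, 5, 7}, {2, 7, 9}))  # 2 shared / 4 total = 0.5
```

With the real tokenizer, something like `lambda t: tokenizer.encode(t).tokens` could be passed as `tokenize`.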
+## 🎯 Recommended Use Cases
+Unigram LM is a probabilistic subword tokenizer, best suited to multilingual models and neural machine translation (NMT).
+## 💻 Usage
+```python
+from tokenizers import Tokenizer
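+# Load the serialized tokenizer from a local tokenizer.json file
+tokenizer = Tokenizer.from_file("tokenizer.json")
+# Encode a sample Kashmiri sentence into subword tokens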
+encoded = tokenizer.encode("كٲشِر زَبان چھِیہٕ بُہُت خٲص")
+print("Tokens:", encoded.tokens)
+# Decode the token ids back into text
+decoded = tokenizer.decode(encoded.ids)
+print("Decoded:", decoded)
+```
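The encode/decode round trip above is also a convenient place to check that diacritics survive tokenization. The sketch below is a rough proxy only: it counts Unicode nonspacing marks (category `Mn`) before and after a round trip. It is not the DPS metric reported in the table, whose definition this README does not give. The helper names and the lossy `strip_kasra` function are hypothetical.

```python
import unicodedata
from collections import Counter
from typing import Callable


def combining_marks(text: str) -> Counter:
    """Count each Unicode nonspacing mark (category Mn) in the text."""
    return Counter(ch for ch in text if unicodedata.category(ch) == "Mn")


def mark_retention(text: str, roundtrip: Callable[[str], str]) -> float:
    """Fraction of the input's combining marks still present after a round trip."""
    before = combining_marks(text)
    total = sum(before.values())
    if total == 0:
        return 1.0  # no marks in the input, so nothing can be lost
    after = combining_marks(roundtrip(text))
    kept = sum((before & after).values())  # multiset intersection
    return kept / total


# Beh + fatha, beh + kasra, written as escapes so the marks stay visible.
sample = "\u0628\u064e\u0628\u0650"


def strip_kasra(text: str) -> str:
    """Hypothetical lossy round trip that drops every kasra (U+0650)."""
    return text.replace("\u0650", "")


print(mark_retention(sample, lambda s: s))  # identity round trip: 1.0
print(mark_retention(sample, strip_kasra))  # one of two marks lost: 0.5
```

With the tokenizer loaded as in the usage snippet, `lambda s: tokenizer.decode(tokenizer.encode(s).ids)` could serve as `roundtrip`.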
+## 📚 Citation
+```bibtex
+@misc{malik2025kashmiritokenizer,
+  title   = {A Comprehensive Tokenization Study for Kashmiri},
+  author  = {Malik, Haq Nawaz},
+  year    = {2025},
+  url     = {https://huggingface.co/Omarrran/kashmiri-unigram-tokenizer},
+  note    = {Trained on KS-LIT-3M corpus}
+}
+```