Omarrran committed · Commit 0cab5d5 · verified · 1 Parent(s): 5980847

Upload kashmiri_unigram_tokenizer/README.md with huggingface_hub

kashmiri_unigram_tokenizer/README.md ADDED
@@ -0,0 +1,65 @@
---
language: ks
license: apache-2.0
tags:
- tokenizer
- kashmiri
- nlp
- low-resource
- arabic-script
- dardic
datasets:
- Omarrran/KS-LIT-3M
---

# Kashmiri Unigram LM Tokenizer

> 🏔️ **First systematic tokenizer for Kashmiri (كٲشُر زَبان), ISO 639-3: kas**

## Model Description

| Property | Value |
|----------|-------|
| Architecture | Unigram LM |
| Language | Kashmiri (ks / kas) |
| Script | Perso-Arabic (Nastaliq) |
| Vocabulary Size | 32,000 |
| Training Corpus | KS-LIT-3M (3,091,180 words) |
| License | Apache-2.0 |

## 📊 Evaluation Metrics

| Metric | Value | Description |
|--------|-------|-------------|
| Fertility | 1.2000 | Tokens per word (lower = better) |
| Diacritic Preservation Score (DPS) | 0.9859 | Novel KS-specific metric (1.0 = perfect) |
| Morphological Boundary Alignment (MBA) | 0.4467 | IoU with gold morpheme boundaries |
| OOV Rate (held-out) | 0.0000 | Tested on unseen text |
| Composite Quality Score (CQS) | 0.8848 | Weighted combination |
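
Two of the metrics above have simple textbook definitions: fertility (tokens per word) and intersection over union (IoU) of boundary positions. The sketch below illustrates those generic definitions only. It is not the evaluation script behind the reported numbers, and `fertility`, `boundary_iou`, and `toy_tokenize` are hypothetical names introduced for illustration. Words are assumed to be whitespace-separated, and boundaries are assumed to be character offsets.

```python
def fertility(texts, tokenize):
    """Average number of tokens per whitespace-separated word."""
    n_words = sum(len(t.split()) for t in texts)
    n_tokens = sum(len(tokenize(t)) for t in texts)
    return n_tokens / n_words if n_words else 0.0


def boundary_iou(predicted, gold):
    """Intersection over union of two collections of boundary offsets."""
    predicted, gold = set(predicted), set(gold)
    union = predicted | gold
    # Two empty boundary sets agree perfectly by convention (an assumption).
    return len(predicted & gold) / len(union) if union else 1.0


def toy_tokenize(text):
    """Toy stand-in tokenizer: splits every word longer than 3 characters in two."""
    pieces = []
    for word in text.split():
        if len(word) > 3:
            pieces += [word[:3], word[3:]]
        else:
            pieces.append(word)
    return pieces


sample = ["abc defgh ij"]
print(fertility(sample, toy_tokenize))   # 4 tokens over 3 words
print(boundary_iou({2, 5}, {2, 4, 5}))   # 2 shared boundaries out of 3 distinct
```

To try this with the real tokenizer, one option is to pass `lambda t: tokenizer.encode(t).tokens` as `tokenize`.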

## 🎯 Recommended Use Cases

A probabilistic subword tokenizer (Unigram LM). Best suited to multilingual models and neural machine translation (NMT).
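
To make "probabilistic subword" concrete: a Unigram LM assigns each vocabulary piece a probability and picks the segmentation whose pieces have the highest joint probability, typically found with a Viterbi search. The sketch below shows that idea on a toy Latin-script vocabulary with made-up probabilities. `viterbi_segment` and `toy_vocab` are hypothetical, and this is not the trained Kashmiri vocabulary or the library's actual implementation.

```python
import math


def viterbi_segment(word, log_probs):
    """Return the highest-scoring segmentation of `word` and its log-probability."""
    n = len(word)
    best = [-math.inf] * (n + 1)   # best[i]: best score for word[:i]
    back = [0] * (n + 1)           # back[i]: start index of the last piece in word[:i]
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in log_probs and best[start] + log_probs[piece] > best[end]:
                best[end] = best[start] + log_probs[piece]
                back[end] = start
    if best[n] == -math.inf:
        raise ValueError("word cannot be segmented with this vocabulary")
    # Follow the back-pointers from the end of the word to recover the pieces.
    pieces, end = [], n
    while end > 0:
        pieces.append(word[back[end]:end])
        end = back[end]
    return pieces[::-1], best[n]


# Made-up piece probabilities, for illustration only.
toy_vocab = {
    "un": math.log(0.10), "igram": math.log(0.05),
    "uni": math.log(0.04), "gram": math.log(0.08),
    "u": math.log(0.01), "n": math.log(0.01), "i": math.log(0.01),
    "g": math.log(0.01), "r": math.log(0.01), "a": math.log(0.01),
    "m": math.log(0.01),
}

pieces, score = viterbi_segment("unigram", toy_vocab)
print(pieces)  # → ['un', 'igram']
```

Here "un" + "igram" (0.10 × 0.05) beats "uni" + "gram" (0.04 × 0.08), so the first segmentation is chosen.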

## 💻 Usage

```python
from tokenizers import Tokenizer
# Load the trained tokenizer from its serialized JSON file.
tokenizer = Tokenizer.from_file("tokenizer.json")
# Encode a Kashmiri sentence and inspect the subword tokens.
encoded = tokenizer.encode("كٲشِر زَبان چھِیہٕ بُہُت خٲص")
print("Tokens:", encoded.tokens)
# Decode the token IDs back to text.
decoded = tokenizer.decode(encoded.ids)
print("Decoded:", decoded)
```
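
After an encode/decode round trip like the one above, it can be useful to check that Arabic-script diacritics survived. The sketch below counts Unicode combining marks before and after. It is only a rough sanity check under stated assumptions: `combining_marks` and `mark_retention` are hypothetical helpers, and this is not the Diacritic Preservation Score reported in the metrics table, whose definition is not given here.

```python
import unicodedata
from collections import Counter


def combining_marks(text):
    """Return the combining marks (e.g. vowel signs) in NFC-normalized text."""
    normalized = unicodedata.normalize("NFC", text)
    return [ch for ch in normalized if unicodedata.combining(ch)]


def mark_retention(original, round_tripped):
    """Fraction of combining marks in `original` still present after a round trip."""
    before = Counter(combining_marks(original))
    after = Counter(combining_marks(round_tripped))
    total = sum(before.values())
    if total == 0:
        return 1.0  # nothing to lose (a convention chosen for this sketch)
    kept = sum(min(count, after[mark]) for mark, count in before.items())
    return kept / total


# "zabaan" written with a fatha (U+064E) on the first letter, and without it.
with_mark = "\u0632\u064E\u0628\u0627\u0646"
without_mark = "\u0632\u0628\u0627\u0646"

print(mark_retention(with_mark, with_mark))     # identical text keeps every mark
print(mark_retention(with_mark, without_mark))  # the only mark was dropped
```

With the tokenizer loaded as in the usage snippet, `mark_retention(text, tokenizer.decode(tokenizer.encode(text).ids))` would apply the same check to a real round trip.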

## 📚 Citation

```bibtex
@misc{malik2025kashmiritokenizer,
  title = {A Comprehensive Tokenization Study for Kashmiri},
  author = {Malik, Haq Nawaz},
  year = {2025},
  url = {https://huggingface.co/Omarrran/kashmiri-unigram-tokenizer},
  note = {Trained on KS-LIT-3M corpus}
}
```