---
library_name: transformers
language: ["khm"]
license: mit
tags: ["tokenizer", "khmer", "Unigram", "general-purpose", "sentencepiece"]
---

# πŸ‡°πŸ‡­ KM Improved 22K Tokenizer

A **general-purpose Khmer tokenizer** optimized for both accuracy and speed.  
It provides a stable foundation for Khmer NLP applications such as text classification,  
question answering, translation, and summarization.

---

## 🧠 Model Details

### Model Description
- **Developer:** Sok Meas (@Msok99)  
- **Model type:** SentencePiece Unigram Tokenizer  
- **Language:** Khmer (khm)  
- **License:** MIT  
- **Finetuned from:** None (trained from scratch)

### Model Sources
- **Repository:** [https://huggingface.co/Msok99/km-improved-22k](https://huggingface.co/Msok99/km-improved-22k)

---

## βš™οΈ Uses

### Direct Use
- Tokenizing Khmer text for downstream NLP models  
- Preparing training data for transformer-based fine-tuning  
- Segmenting sentences for analysis or embedding generation  
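For fine-tuning preparation, the tokenizer can be called directly on a batch of sentences. A minimal sketch of that pattern (the `prepare_batch` helper name and the `max_length=128` default are our illustration choices, not part of this repository):

```python
def prepare_batch(tokenizer, texts, max_length=128):
    """Tokenize a list of sentences into padded, fixed-length batches
    ready for transformer fine-tuning.

    Works with any Hugging Face tokenizer loaded via AutoTokenizer,
    including this one.
    """
    return tokenizer(
        texts,
        padding="max_length",   # pad every sequence to max_length
        truncation=True,        # cut sequences longer than max_length
        max_length=max_length,
        return_tensors="pt",    # return PyTorch tensors
    )
```

For example, `prepare_batch(AutoTokenizer.from_pretrained("Msok99/km-improved-22k"), khmer_sentences)` yields `input_ids` and an `attention_mask` suitable for a `Trainer` data pipeline.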

### Downstream Use
- Integration into Khmer LLMs or chatbots  
- Pre- and post-processing for summarization or translation systems  

### Out-of-Scope Use
- Not designed for English or heavily mixed Khmer–English content  
- Not an inference or generation model itself  

---

## βš–οΈ Bias, Risks & Limitations
- Very long or compound words may still split into several sub-tokens  
- Limited exposure to informal slang or non-standard Khmer orthography  
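One way to gauge how much a given corpus is affected by sub-token splitting is to measure fertility as tokens per character (Khmer script has no word-delimiting spaces, so a per-character ratio is a practical proxy). A minimal sketch; the `fertility` helper is ours, not part of the repository:

```python
def fertility(tokenizer, sentences):
    """Average number of sub-tokens produced per input character.

    A rough proxy for how aggressively a tokenizer splits text:
    lower values leave more of the sequence-length budget for content.
    """
    total_tokens = sum(len(tokenizer.tokenize(s)) for s in sentences)
    total_chars = sum(len(s) for s in sentences)
    return total_tokens / total_chars
```

Running this over a sample of your own domain text (e.g. with `AutoTokenizer.from_pretrained("Msok99/km-improved-22k")`) shows quickly whether long compound words in that domain blow up sequence lengths.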

### Recommendations
For code-switched text (Khmer + English), use the merged model  
[`Msok99/lfm2-khmer-merged-18k`](https://huggingface.co/Msok99/lfm2-khmer-merged-18k).

---

## πŸš€ How to Get Started

```python
from transformers import AutoTokenizer

# Load the tokenizer directly from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("Msok99/km-improved-22k")

# "In 2025, Cambodia will develop new technology."
text = "αž€αŸ’αž“αž»αž„αž†αŸ’αž“αžΆαŸ†αŸ’αŸ αŸ’αŸ₯ αž€αž˜αŸ’αž–αž»αž‡αžΆαž“αžΉαž„αž’αž—αž·αžœαžŒαŸ’αžαž“αŸαž”αž…αŸ’αž…αŸαž€αžœαž·αž‘αŸ’αž™αžΆαžαŸ’αž˜αžΈαŸ”"

# Segment into sub-word tokens
tokens = tokenizer.tokenize(text)
print(tokens)

# Round-trip check: encode to IDs, then decode back to text
print(tokenizer.decode(tokenizer.encode(text)))
```