---
library_name: transformers
language: ["khm"]
license: mit
tags: ["tokenizer", "khmer", "Unigram", "general-purpose", "sentencepiece"]
---
# KM Improved 22K Tokenizer
A **general-purpose Khmer tokenizer** optimized for both accuracy and speed.
It provides a stable backbone for Khmer NLP applications such as classification,
question answering, translation, and summarization.
---
## Model Details
### Model Description
- **Developer:** Sok Meas (@Msok99)
- **Model type:** SentencePiece Unigram Tokenizer
- **Language:** Khmer (khm)
- **License:** MIT
- **Finetuned from:** None (trained from scratch)
### Model Sources
- **Repository:** [https://huggingface.co/Msok99/km-improved-22k](https://huggingface.co/Msok99/km-improved-22k)
---
## Uses
### Direct Use
- Tokenizing Khmer text for downstream NLP models
- Preparing training data for transformer-based fine-tuning (see the sketch after this list)
- Segmenting sentences for analysis or embedding generation
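
As a minimal sketch of the training-data case, the snippet below batch-encodes a few Khmer sentences; the sample sentences, the `max_length` value, and the fallback padding token are illustrative assumptions and not part of this repository:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Msok99/km-improved-22k")

# Illustrative Khmer sentences; substitute your own corpus.
sentences = [
    "ខ្ញុំស្រឡាញ់ភាសាខ្មែរ",  # "I love the Khmer language"
    "សួស្តី",                  # "Hello"
]

# Assumption: if the vocabulary ships without a dedicated padding token,
# reuse the unknown token so batches can be padded.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.unk_token

# Pad/truncate to a fixed length so the batch can feed a model directly.
batch = tokenizer(
    sentences,
    padding="max_length",
    truncation=True,
    max_length=64,
    return_tensors="pt",
)
print(batch["input_ids"].shape)  # (2, 64)
```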
### Downstream Use
- Integration into Khmer LLMs or chatbots (see the sketch after this list)
- Pre- and post-processing for summarization or translation systems
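
For downstream integration, the sketch below pairs the tokenizer with a small masked-language model configured from scratch; the choice of `BertConfig`/`BertForMaskedLM` and all hyperparameters are assumptions for illustration, not a model shipped with this repository:

```python
from transformers import AutoTokenizer, BertConfig, BertForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("Msok99/km-improved-22k")

# Size the model's vocabulary to match the tokenizer.
config = BertConfig(
    vocab_size=len(tokenizer),
    hidden_size=256,          # small, illustrative hyperparameters
    num_hidden_layers=4,
    num_attention_heads=4,
    intermediate_size=1024,
)
model = BertForMaskedLM(config)

# The (untrained) model already accepts ids produced by this tokenizer.
inputs = tokenizer("ខ្ញុំស្រឡាញ់ភាសាខ្មែរ", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)  # (1, sequence_length, vocab_size)
```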
### Out-of-Scope Use
- Not designed for English or heavily mixed Khmer-English content
- Not an inference or generation model itself
---
## Bias, Risks & Limitations
- Very long or compound words may still split into several sub-tokens
- Limited exposure to informal slang or non-standard Khmer orthography
### Recommendations
For code-switched text (Khmer + English), use the merged model
[`Msok99/lfm2-khmer-merged-18k`](https://huggingface.co/Msok99/lfm2-khmer-merged-18k).
---
## How to Get Started
```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("Msok99/km-improved-22k")
text = "αααα»αααααΆαα’α α’α₯ ααααα»ααΆααΉαα’αα·αααααααα
αα
αααα·ααααΆααααΈα"
tokens = tokenizer.tokenize(text)
print(tokens)
print(tokenizer.decode(tokenizer.encode(text)))
```