---
library_name: transformers
language: ["khm"]
license: mit
tags: ["tokenizer", "khmer", "Unigram", "general-purpose", "sentencepiece"]
---
# πŸ‡°πŸ‡­ KM Improved 22K Tokenizer
A **general-purpose Khmer tokenizer** optimized for both accuracy and speed.
It provides a stable backbone for Khmer NLP applications such as classification,
question answering, translation, and summarization.
---
## 🧠 Model Details
### Model Description
- **Developer:** Sok Meas (@Msok99)
- **Model type:** SentencePiece Unigram Tokenizer
- **Language:** Khmer (khm)
- **License:** MIT
- **Finetuned from:** None (trained from scratch)
### Model Sources
- **Repository:** [https://huggingface.co/Msok99/km-improved-22k](https://huggingface.co/Msok99/km-improved-22k)
---
## βš™οΈ Uses
### Direct Use
- Tokenizing Khmer text for downstream NLP models
- Preparing training data for transformer-based fine-tuning
- Segmenting sentences for analysis or embedding generation
### Downstream Use
- Integration into Khmer LLMs or chatbots
- Pre- and post-processing for summarization or translation systems
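For downstream fine-tuning, the typical pattern is to batch-encode Khmer sentences with the tokenizer's `__call__` interface. A minimal sketch (the sentences below are hypothetical examples; truncation length is an arbitrary choice, and padding is left off since it requires a pad token to be configured):

```python
from transformers import AutoTokenizer

# Load the tokenizer from the Hub (repo name from this model card)
tokenizer = AutoTokenizer.from_pretrained("Msok99/km-improved-22k")

# Hypothetical mini-batch of Khmer sentences
sentences = [
    "ααŸ’αž‰αž»αŸ†αžŸαŸ’αžšαž›αžΆαž‰αŸ‹αž—αžΆαžŸαžΆαžαŸ’αž˜αŸ‚αžš",  # "I love the Khmer language"
    "αžŸαž½αžŸαŸ’αžαžΈ",                     # "Hello"
]

# Encode the batch, capping each sequence at 64 subword ids
batch = tokenizer(sentences, truncation=True, max_length=64)
print(batch["input_ids"])
```

Each entry of `batch["input_ids"]` is a list of subword ids ready to be padded and fed to a transformer model.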
### Out-of-Scope Use
- Not designed for English or heavily mixed Khmer–English content
- Not an inference or generation model itself
---
## βš–οΈ Bias, Risks & Limitations
- Very long or compound words may still split into several sub-tokens
- Limited exposure to informal slang or non-standard Khmer orthography
### Recommendations
For code-switched text (Khmer + English), use the merged model
[`Msok99/lfm2-khmer-merged-18k`](https://huggingface.co/Msok99/lfm2-khmer-merged-18k).
---
## πŸš€ How to Get Started
```python
from transformers import AutoTokenizer

# Load the tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("Msok99/km-improved-22k")

# "In 2025, Cambodia will develop new technology."
text = "αž€αŸ’αž“αž»αž„αž†αŸ’αž“αžΆαŸ†αŸ’αŸ αŸ’αŸ₯ αž€αž˜αŸ’αž–αž»αž‡αžΆαž“αžΉαž„αž’αž—αž·αžœαžŒαŸ’αžαž“αŸαž”αž…αŸ’αž…αŸαž€αžœαž·αž‘αŸ’αž™αžΆαžαŸ’αž˜αžΈαŸ”"

# Inspect the subword segmentation
tokens = tokenizer.tokenize(text)
print(tokens)

# Round-trip: encode to ids, then decode back to text
print(tokenizer.decode(tokenizer.encode(text)))
```