---
library_name: transformers
language: ["khm"]
license: mit
tags: ["tokenizer", "khmer", "unigram", "general-purpose", "sentencepiece"]
---

# 🇰🇭 KM Improved 22K Tokenizer

A **general-purpose Khmer tokenizer** optimized for both accuracy and speed. It provides a stable backbone for Khmer NLP applications such as classification, question answering, translation, and summarization.

---

## 🧠 Model Details

### Model Description

- **Developer:** Sok Meas (@Msok99)
- **Model type:** SentencePiece Unigram tokenizer
- **Language:** Khmer (khm)
- **License:** MIT
- **Fine-tuned from:** None (trained from scratch)
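
A minimal sketch to confirm these details after loading, assuming the repository ships a standard tokenizer loadable via `AutoTokenizer` (the ~22K vocabulary figure comes from the model name):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Msok99/km-improved-22k")

# Unigram vocabulary size (roughly 22K pieces, per the repo name)
print(tokenizer.vocab_size)

# Special tokens the tokenizer was configured with
print(tokenizer.all_special_tokens)
```
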
### Model Sources

- **Repository:** [https://huggingface.co/Msok99/km-improved-22k](https://huggingface.co/Msok99/km-improved-22k)

---

## ⚙️ Uses

### Direct Use

- Tokenizing Khmer text for downstream NLP models
- Preparing training data for transformer-based fine-tuning (see the batching sketch after this list)
- Segmenting sentences for analysis or embedding generation
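
Since the tokenizer follows the standard `transformers` API, batching for fine-tuning is the usual call. A minimal sketch, assuming the tokenizer defines a padding token (the Khmer sentences, "hello" and "thank you", are illustrative):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Msok99/km-improved-22k")

# Illustrative Khmer inputs: "hello", "thank you"
texts = ["សួស្តី", "អរគុណ"]

# Pad and truncate to a fixed length; returns PyTorch tensors
# ready to feed into a transformer model for fine-tuning.
batch = tokenizer(texts, padding=True, truncation=True, max_length=64, return_tensors="pt")
print(batch["input_ids"].shape)
```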

### Downstream Use

- Integration into Khmer LLMs or chatbots
- Pre- and post-processing for summarization or translation systems

### Out-of-Scope Use

- Not designed for English or heavily mixed Khmer-English content
- Not an inference or generation model itself

---

## ⚖️ Bias, Risks & Limitations

- Very long or compound words may still split into several sub-tokens
- Limited exposure to informal slang or non-standard Khmer orthography

### Recommendations

For code-switched text (Khmer + English), use the merged model [`Msok99/lfm2-khmer-merged-18k`](https://huggingface.co/Msok99/lfm2-khmer-merged-18k).
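
A minimal sketch of that swap, assuming the merged repository also loads through `AutoTokenizer` (the mixed sentence is illustrative):

```python
from transformers import AutoTokenizer

# For Khmer-English code-switched input, load the merged tokenizer instead
mixed_tokenizer = AutoTokenizer.from_pretrained("Msok99/lfm2-khmer-merged-18k")

# Illustrative mixed input: English "Hello" + Khmer "hello"
print(mixed_tokenizer.tokenize("Hello សួស្តី"))
```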

---

## 🚀 How to Get Started

```python
from transformers import AutoTokenizer

# Load the tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("Msok99/km-improved-22k")

# Example Khmer sentence: "I love the Khmer language"
text = "ខ្ញុំស្រឡាញ់ភាសាខ្មែរ"

# Inspect the subword pieces
tokens = tokenizer.tokenize(text)
print(tokens)

# Round-trip check: encode to ids, then decode back to text
print(tokenizer.decode(tokenizer.encode(text)))
```
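
If the decoded string matches the original input (up to whitespace normalization), the tokenizer round-trips cleanly, which is a quick sanity check before wiring it into a training or inference pipeline.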