--- |
|
|
library_name: transformers |
|
|
language: ["khm"] |
|
|
license: mit |
|
|
tags: ["tokenizer", "khmer", "unigram", "sentencepiece", "large", "coverage", "high-resource"] |
|
|
--- |
|
|
|
|
|
# KM Improved 32K Tokenizer
|
|
|
|
|
The **KM Improved 32K** is a high-capacity **Khmer tokenizer** designed to maximize word coverage |
|
|
across diverse domains including technical, cultural, historical, and academic texts. |
|
|
It aims to reduce subword fragmentation and improve contextual understanding for large-scale |
|
|
Khmer and multilingual language models. |
|
|
|
|
|
--- |
|
|
|
|
|
## Model Details
|
|
|
|
|
### Model Description |
|
|
- **Developer:** Sok Meas (@Msok99) |
|
|
- **Model Type:** SentencePiece Unigram |
|
|
- **Language:** Khmer (khm) |
|
|
- **License:** MIT |
|
|
- **Base Version:** [`Msok99/km-improved-22k-v4`](https://huggingface.co/Msok99/km-improved-22k-v4) |
|
|
- **Vocabulary Size:** 32,000 |
|
|
- **Goal:** Maximize coverage and minimize over-segmentation |
|
|
|
|
|
### Model Sources |
|
|
- **Repository:** [https://huggingface.co/Msok99/km-improved-32k](https://huggingface.co/Msok99/km-improved-32k) |
|
|
|
|
|
--- |
|
|
|
|
|
## Key Features
|
|
|
|
|
| Feature | Description |
|---------|-------------|
| **Extended Vocabulary** | 32,000 tokens for higher domain coverage |
| **Improved Context Retention** | Keeps compound and rare words intact |
| **Reduced Fragmentation** | Fewer subword splits across long sentences |
| **Perfect Decode Fidelity** | 100% reversible encoding/decoding |
| **Broad Domain Corpus** | Includes academic, scientific, literary, and technical texts |
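The decode-fidelity claim can be verified mechanically. A minimal sketch, assuming any tokenizer object that exposes Hugging Face-style `encode`/`decode` methods (such as the one loaded in the usage section):

```python
def is_lossless(tokenizer, text: str) -> bool:
    """Check that encoding then decoding reproduces the input exactly."""
    ids = tokenizer.encode(text, add_special_tokens=False)
    return tokenizer.decode(ids) == text
```

Running this over a sample of your corpus is a quick way to confirm that no characters are dropped or altered by normalization.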
|
|
|
|
|
--- |
|
|
|
|
|
## Performance Overview
|
|
|
|
|
| Category | Avg Tokens | Chars/Token |
|----------|------------|-------------|
| **Formal News** | 13.6 | 4.19 |
| **Technology / Scientific** | 10.8 | 5.32 |
| **Culture / History** | 11.0 | 4.58 |
| **Education / Academic** | 9.4 | 5.44 |
| **Mixed Texts** | 12.2 | 3.86 |
| **Overall Efficiency** | — | **≈4.0 chars/token** |
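The chars-per-token figures above can be reproduced with a small helper. A sketch, assuming `tokens` is the list returned by `tokenizer.tokenize(text)`:

```python
def chars_per_token(text: str, tokens: list[str]) -> float:
    """Average number of input characters covered by each token.

    Higher values indicate less fragmentation: each token spans
    a larger portion of the original text.
    """
    if not tokens:
        raise ValueError("token list must not be empty")
    return len(text) / len(tokens)
```

For example, a 48-character sentence split into 12 tokens yields 4.0 chars/token, matching the overall efficiency reported above.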
|
|
|
|
|
--- |
|
|
|
|
|
## Use Cases
|
|
|
|
|
### Direct Use |
|
|
- Pretraining and fine-tuning Khmer LLMs |
|
|
- Large-scale corpus tokenization for RAG or embedding generation |
|
|
- Tokenization of mixed Khmer–English datasets (with limited English coverage)
|
|
|
|
|
### Downstream Use |
|
|
- RAG systems and document retrieval |
|
|
- Knowledge base construction and summarization pipelines |
|
|
- Academic and research-oriented text analysis |
|
|
|
|
|
### Out-of-Scope Use |
|
|
- Mobile or latency-sensitive applications (consider `18k` or `22k` models) |
|
|
- Tokenizing purely English text |
|
|
|
|
|
--- |
|
|
|
|
|
## Bias, Risks, and Limitations
|
|
- The larger vocabulary slightly increases model size (~5–8% more embedding parameters)
|
|
- Some rare or domain-specific words might be underrepresented in informal text |
|
|
- Heavier memory usage during training and inference |
|
|
|
|
|
### Recommendations |
|
|
For smaller models or chatbots prioritizing speed, use |
|
|
[`Msok99/km-improved-22k-v4`](https://huggingface.co/Msok99/km-improved-22k-v4). |
|
|
For mixed KhmerβEnglish systems, use |
|
|
[`Msok99/lfm2-khmer-merged-18k`](https://huggingface.co/Msok99/lfm2-khmer-merged-18k). |
|
|
|
|
|
--- |
|
|
|
|
|
## How to Get Started
|
|
|
|
|
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Msok99/km-improved-32k")

text = "សួស្តី"  # example Khmer text ("Hello")

# Split the text into subword tokens
tokens = tokenizer.tokenize(text)
print(tokens)

# Round-trip: decoding the encoded IDs should reproduce the input
print(tokenizer.decode(tokenizer.encode(text)))
```