---
library_name: transformers
language: ["khm"]
license: mit
tags: ["tokenizer", "khmer", "unigram", "sentencepiece", "large", "coverage", "high-resource"]
---
# πŸ‡°πŸ‡­ KM Improved 32K Tokenizer
**KM Improved 32K** is a high-capacity **Khmer tokenizer** designed to maximize word coverage
across diverse domains, including technical, cultural, historical, and academic texts.
It aims to reduce subword fragmentation and improve contextual understanding for large-scale
Khmer and multilingual language models.
---
## 🧠 Model Details
### Model Description
- **Developer:** Sok Meas (@Msok99)
- **Model Type:** SentencePiece Unigram
- **Language:** Khmer (khm)
- **License:** MIT
- **Base Version:** [`Msok99/km-improved-22k-v4`](https://huggingface.co/Msok99/km-improved-22k-v4)
- **Vocabulary Size:** 32,000
- **Goal:** Maximize coverage and minimize over-segmentation
### Model Sources
- **Repository:** [https://huggingface.co/Msok99/km-improved-32k](https://huggingface.co/Msok99/km-improved-32k)
---
## βš™οΈ Key Features
| Feature | Description |
|----------|-------------|
| **Extended Vocabulary** | 32,000 tokens for higher domain coverage |
| **Improved Context Retention** | Keeps compound and rare words intact |
| **Reduced Fragmentation** | Fewer subword splits across long sentences |
| **Perfect Decode Fidelity** | 100% reversible encoding/decoding |
| **Broad Domain Corpus** | Includes academic, scientific, literary, and technical texts |
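The decode-fidelity claim is easy to spot-check against your own data. Below is a minimal round-trip sketch, assuming the tokenizer loads via `AutoTokenizer` as in the quickstart; the sample sentences are placeholders, and special tokens are excluded since they would otherwise appear in the decoded string:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Msok99/km-improved-32k")

# Placeholder samples; swap in lines from your own corpus.
samples = [
    "αžŸαž½αžŸαŸ’αžαžΈ αž–αž·αž—αž–αž›αŸ„αž€αŸ”",
    "αž—αžΆαžŸαžΆαžαŸ’αž˜αŸ‚αžšαž˜αžΆαž“αž”αŸ’αžšαžœαžαŸ’αžαž·αž™αžΌαžšαž’αž„αŸ’αžœαŸ‚αž„αŸ”",
]

for s in samples:
    # Encode without special tokens so decode() can reproduce the input verbatim.
    ids = tokenizer.encode(s, add_special_tokens=False)
    assert tokenizer.decode(ids) == s, f"round-trip mismatch: {s!r}"
print("all samples decoded losslessly")
```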
---
## πŸ“Š Performance Overview
| Category | Avg Tokens | Chars/Token |
|-----------|-------------|-------------|
| **Formal News** | 13.6 | 4.19 |
| **Technology / Scientific** | 10.8 | 5.32 |
| **Culture / History** | 11.0 | 4.58 |
| **Education / Academic** | 9.4 | 5.44 |
| **Mixed Texts** | 12.2 | 3.86 |
| **Overall Efficiency** | β€” | **β‰ˆ4.0 chars/token** |
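To estimate a comparable chars/token figure for your own domain, divide total characters by total tokens over a text sample. This is a minimal sketch under that assumption; the sample texts are placeholders, and the table above may have been computed with a different corpus or weighting:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Msok99/km-improved-32k")

# Placeholder sample; substitute documents from your target domain.
texts = [
    "αžŸαž„αŸ’αž‚αŸ’αžšαžΆαž˜αžαŸ’αžšαž‡αžΆαž€αŸ‹αž˜αžΆαž“αž₯αž‘αŸ’αž’αž·αž–αž›αžŠαž›αŸ‹αž“αž™αŸ„αž”αžΆαž™αž–αž·αž—αž–αž›αŸ„αž€αŸ”",
    "αž—αžΆαžŸαžΆαžαŸ’αž˜αŸ‚αžšαž˜αžΆαž“αž”αŸ’αžšαžœαžαŸ’αžαž·αž™αžΌαžšαž’αž„αŸ’αžœαŸ‚αž„αŸ”",
]

total_chars = sum(len(t) for t in texts)
total_tokens = sum(len(tokenizer.tokenize(t)) for t in texts)
print(f"β‰ˆ{total_chars / total_tokens:.2f} chars/token")
```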
---
## 🧩 Use Cases
### Direct Use
- Pretraining and fine-tuning Khmer LLMs
- Large-scale corpus tokenization for RAG or embedding generation (see the batching sketch after this list)
- Tokenization for Khmer–English mixed datasets (with limited English words)
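For the corpus-tokenization use case, batch encoding with padding and truncation is the usual pattern before embedding or indexing. A minimal sketch; the documents and the 512-token cap are placeholder assumptions:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Msok99/km-improved-32k")

# Placeholder documents; replace with your corpus.
docs = [
    "αžŸαž„αŸ’αž‚αŸ’αžšαžΆαž˜αžαŸ’αžšαž‡αžΆαž€αŸ‹αž˜αžΆαž“αž₯αž‘αŸ’αž’αž·αž–αž›αžŠαž›αŸ‹αž“αž™αŸ„αž”αžΆαž™αž–αž·αž—αž–αž›αŸ„αž€αŸ”",
    "αž—αžΆαžŸαžΆαžαŸ’αž˜αŸ‚αžšαž˜αžΆαž“αž”αŸ’αžšαžœαžαŸ’αžαž·αž™αžΌαžšαž’αž„αŸ’αžœαŸ‚αž„αŸ”",
]

# Pad to the longest document in the batch and cap sequence length at 512 tokens.
batch = tokenizer(docs, padding=True, truncation=True, max_length=512)
print([len(ids) for ids in batch["input_ids"]])
```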
### Downstream Use
- RAG systems and document retrieval
- Knowledge base construction and summarization pipelines
- Academic and research-oriented text analysis
### Out-of-Scope Use
- Mobile or latency-sensitive applications (consider `18k` or `22k` models)
- Tokenizing purely English text
---
## βš–οΈ Bias, Risks, and Limitations
- Larger vocabulary may increase model size slightly (~5–8%)
- Some rare or domain-specific words might be underrepresented in informal text
- Heavier memory usage during training and inference
### Recommendations
For smaller models or chatbots prioritizing speed, use
[`Msok99/km-improved-22k-v4`](https://huggingface.co/Msok99/km-improved-22k-v4).
For mixed Khmer–English systems, use
[`Msok99/lfm2-khmer-merged-18k`](https://huggingface.co/Msok99/lfm2-khmer-merged-18k).
---
## πŸš€ How to Get Started
```python
from transformers import AutoTokenizer

# Load the tokenizer from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("Msok99/km-improved-32k")

# "The Cold War had an influence on world politics."
text = "αžŸαž„αŸ’αž‚αŸ’αžšαžΆαž˜αžαŸ’αžšαž‡αžΆαž€αŸ‹αž˜αžΆαž“αž₯αž‘αŸ’αž’αž·αž–αž›αžŠαž›αŸ‹αž“αž™αŸ„αž”αžΆαž™αž–αž·αž—αž–αž›αŸ„αž€αŸ”"

# Inspect the subword segmentation, then verify the round trip.
tokens = tokenizer.tokenize(text)
print(tokens)
print(tokenizer.decode(tokenizer.encode(text)))
```
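Continuing from the snippet above, you can also inspect how mixed Khmer–English input segments. English coverage is intentionally limited, so expect English words to split into more pieces; the sentence below is a hypothetical sample:

```python
# "I am studying machine learning at university."
mixed = "αžαŸ’αž‰αž»αŸ†αž€αŸ†αž–αž»αž„αžšαŸ€αž“ machine learning αž“αŸ…αžŸαžΆαž€αž›αžœαž·αž‘αŸ’αž™αžΆαž›αŸ័αž™αŸ”"
pieces = tokenizer.tokenize(mixed)
print(pieces)
print(len(pieces), "tokens")
```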