---
library_name: transformers
language: ["khm"]
license: mit
tags: ["tokenizer", "khmer", "unigram", "sentencepiece", "large", "coverage", "high-resource"]
---

# 🇰🇭 KM Improved 32K Tokenizer

**KM Improved 32K** is a high-capacity **Khmer tokenizer** designed to maximize word coverage across diverse domains, including technical, cultural, historical, and academic texts. It aims to reduce subword fragmentation and improve contextual understanding for large-scale Khmer and multilingual language models.

---

## 🧠 Model Details

### Model Description

- **Developer:** Sok Meas (@Msok99)
- **Model Type:** SentencePiece Unigram
- **Language:** Khmer (khm)
- **License:** MIT
- **Base Version:** [`Msok99/km-improved-22k-v4`](https://huggingface.co/Msok99/km-improved-22k-v4)
- **Vocabulary Size:** 32,000
- **Goal:** Maximize coverage and minimize over-segmentation

### Model Sources

- **Repository:** [https://huggingface.co/Msok99/km-improved-32k](https://huggingface.co/Msok99/km-improved-32k)

---

## ⚙️ Key Features

| Feature | Description |
|----------|-------------|
| **Extended Vocabulary** | 32,000 tokens for higher domain coverage |
| **Improved Context Retention** | Keeps compound and rare words intact |
| **Reduced Fragmentation** | Fewer subword splits across long sentences |
| **Perfect Decode Fidelity** | 100% reversible encoding/decoding (see the round-trip check below) |
| **Broad Domain Corpus** | Includes academic, scientific, literary, and technical texts |

---

## 📊 Performance Overview

| Category | Avg Tokens | Chars/Token |
|-----------|-------------|-------------|
| **Formal News** | 13.6 | 4.19 |
| **Technology / Scientific** | 10.8 | 5.32 |
| **Culture / History** | 11.0 | 4.58 |
| **Education / Academic** | 9.4 | 5.44 |
| **Mixed Texts** | 12.2 | 3.86 |
| **Overall Efficiency** | – | **≈4.0 chars/token** |

Higher chars/token indicates lower fragmentation: fewer tokens are needed to represent the same text.

---

## 🧩 Use Cases

### Direct Use

- Pretraining and fine-tuning Khmer LLMs
- Large-scale corpus tokenization for RAG or embedding generation (see the batching sketch below)
- Tokenization of Khmer–English mixed datasets (English coverage is limited)

### Downstream Use

- RAG systems and document retrieval
- Knowledge-base construction and summarization pipelines
- Academic and research-oriented text analysis

### Out-of-Scope Use

- Mobile or latency-sensitive applications (consider the `18k` or `22k` models)
- Tokenizing purely English text

---

## ⚖️ Bias, Risks, and Limitations

- The larger vocabulary may slightly increase downstream model size (~5–8%)
- Some rare or domain-specific words may be underrepresented in informal text
- Heavier memory usage during training and inference

### Recommendations

For smaller models or chatbots that prioritize speed, use [`Msok99/km-improved-22k-v4`](https://huggingface.co/Msok99/km-improved-22k-v4). For mixed Khmer–English systems, use [`Msok99/lfm2-khmer-merged-18k`](https://huggingface.co/Msok99/lfm2-khmer-merged-18k).

---

## 🚀 How to Get Started

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Msok99/km-improved-32k")

# "The Cold War influenced world politics."
text = "សង្គ្រាមត្រជាក់មានឥទ្ធិពលដល់នយោបាយពិភពលោក។"

# Segment the sentence into subword tokens.
tokens = tokenizer.tokenize(text)
print(tokens)

# Encode to ids and decode back to text.
print(tokenizer.decode(tokenizer.encode(text)))
```
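
The decode-fidelity claim above can be checked directly. The snippet below is a minimal sketch: it encodes without special tokens (whether this tokenizer adds any is an assumption worth verifying) and asserts that decoding reproduces the input exactly.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Msok99/km-improved-32k")

text = "សង្គ្រាមត្រជាក់មានឥទ្ធិពលដល់នយោបាយពិភពលោក។"

# Encode without special tokens so the comparison is on content alone.
ids = tokenizer.encode(text, add_special_tokens=False)

# A fully reversible tokenizer reproduces the original string exactly.
assert tokenizer.decode(ids) == text
print("round trip OK:", len(ids), "tokens")
```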
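
The chars/token figures in the performance table can be approximated on your own corpus with a script like the following. This is a sketch only: the sample sentence is illustrative, not the card's benchmark set, and the exact counting methodology behind the published numbers is not documented here.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Msok99/km-improved-32k")

# Replace with sentences from your own domain; this one is illustrative only.
samples = [
    "សង្គ្រាមត្រជាក់មានឥទ្ធិពលដល់នយោបាយពិភពលោក។",
]

total_chars = sum(len(s) for s in samples)
total_tokens = sum(len(tokenizer.tokenize(s)) for s in samples)

# Higher chars/token means less fragmentation (the card reports ~4.0 overall).
print(f"chars/token: {total_chars / total_tokens:.2f}")
```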
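
For the corpus-tokenization use case listed under Direct Use, documents are typically processed in batches. A minimal sketch follows; the `max_length` value is an illustrative default, not a limit prescribed by this card.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Msok99/km-improved-32k")

docs = [
    "សង្គ្រាមត្រជាក់មានឥទ្ធិពលដល់នយោបាយពិភពលោក។",
    # ... more Khmer documents
]

# Batch-encode for a downstream embedding or RAG model; add padding=True
# if the tokenizer defines a pad token and your pipeline needs fixed shapes.
batch = tokenizer(docs, truncation=True, max_length=512)
print(batch["input_ids"])
```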