---
library_name: transformers
language: ["khm"]
license: mit
tags: ["tokenizer", "khmer", "unigram", "sentencepiece", "large", "coverage", "high-resource"]
---

# 🇰🇭 KM Improved 32K Tokenizer

**KM Improved 32K** is a high-capacity **Khmer tokenizer** designed to maximize word coverage across diverse domains, including technical, cultural, historical, and academic texts. It aims to reduce subword fragmentation and improve contextual understanding for large-scale Khmer and multilingual language models.

---

## 🧠 Model Details

### Model Description

- **Developer:** Sok Meas (@Msok99)
- **Model Type:** SentencePiece Unigram
- **Language:** Khmer (khm)
- **License:** MIT
- **Base Version:** [`Msok99/km-improved-22k-v4`](https://huggingface.co/Msok99/km-improved-22k-v4)
- **Vocabulary Size:** 32,000
- **Goal:** Maximize coverage and minimize over-segmentation

### Model Sources

- **Repository:** [https://huggingface.co/Msok99/km-improved-32k](https://huggingface.co/Msok99/km-improved-32k)

---

## ⚙️ Key Features

| Feature | Description |
|----------|-------------|
| **Extended Vocabulary** | 32,000 tokens for higher domain coverage |
| **Improved Context Retention** | Keeps compound and rare words intact |
| **Reduced Fragmentation** | Fewer subword splits across long sentences |
| **Perfect Decode Fidelity** | 100% reversible encoding/decoding (see the round-trip check below) |
| **Broad Domain Corpus** | Includes academic, scientific, literary, and technical texts |

---

## 📊 Performance Overview

| Category | Avg Tokens | Chars/Token |
|-----------|-------------|-------------|
| **Formal News** | 13.6 | 4.19 |
| **Technology / Scientific** | 10.8 | 5.32 |
| **Culture / History** | 11.0 | 4.58 |
| **Education / Academic** | 9.4 | 5.44 |
| **Mixed Texts** | 12.2 | 3.86 |
| **Overall Efficiency** | – | **≈4.0 chars/token** |

Higher chars/token indicates lower fragmentation: fewer tokens are needed to represent the same text.

---

## 🧩 Use Cases

### Direct Use

- Pretraining and fine-tuning Khmer LLMs
- Large-scale corpus tokenization for RAG or embedding generation (see the batching sketch below)
- Tokenization of Khmer–English mixed datasets (English coverage is limited)

### Downstream Use

- RAG systems and document retrieval
- Knowledge-base construction and summarization pipelines
- Academic and research-oriented text analysis

### Out-of-Scope Use

- Mobile or latency-sensitive applications (consider the `18k` or `22k` models)
- Tokenizing purely English text

---

## ⚖️ Bias, Risks, and Limitations

- The larger vocabulary may slightly increase downstream model size (~5–8%)
- Some rare or domain-specific words may be underrepresented in informal text
- Heavier memory usage during training and inference

### Recommendations

For smaller models or chatbots that prioritize speed, use [`Msok99/km-improved-22k-v4`](https://huggingface.co/Msok99/km-improved-22k-v4). For mixed Khmer–English systems, use [`Msok99/lfm2-khmer-merged-18k`](https://huggingface.co/Msok99/lfm2-khmer-merged-18k).

---

## 🚀 How to Get Started

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Msok99/km-improved-32k")

# "The Cold War influenced world politics."
text = "សង្គ្រាមត្រជាក់មានឥទ្ធិពលដល់នយោបាយពិភពលោក។"

# Segment the sentence into subword tokens.
tokens = tokenizer.tokenize(text)
print(tokens)

# Encode to ids and decode back to text.
print(tokenizer.decode(tokenizer.encode(text)))
```
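
The decode-fidelity claim above can be checked directly. The snippet below is a minimal sketch: it encodes without special tokens (whether this tokenizer adds any is an assumption worth verifying) and asserts that decoding reproduces the input exactly.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Msok99/km-improved-32k")

text = "សង្គ្រាមត្រជាក់មានឥទ្ធិពលដល់នយោបាយពិភពលោក។"

# Encode without special tokens so the comparison is on content alone.
ids = tokenizer.encode(text, add_special_tokens=False)

# A fully reversible tokenizer reproduces the original string exactly.
assert tokenizer.decode(ids) == text
print("round trip OK:", len(ids), "tokens")
```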
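
The chars/token figures in the performance table can be approximated on your own corpus with a script like the following. This is a sketch only: the sample sentence is illustrative, not the card's benchmark set, and the exact counting methodology behind the published numbers is not documented here.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Msok99/km-improved-32k")

# Replace with sentences from your own domain; this one is illustrative only.
samples = [
    "សង្គ្រាមត្រជាក់មានឥទ្ធិពលដល់នយោបាយពិភពលោក។",
]

total_chars = sum(len(s) for s in samples)
total_tokens = sum(len(tokenizer.tokenize(s)) for s in samples)

# Higher chars/token means less fragmentation (the card reports ~4.0 overall).
print(f"chars/token: {total_chars / total_tokens:.2f}")
```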
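
For the corpus-tokenization use case listed under Direct Use, documents are typically processed in batches. A minimal sketch follows; the `max_length` value is an illustrative default, not a limit prescribed by this card.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Msok99/km-improved-32k")

docs = [
    "សង្គ្រាមត្រជាក់មានឥទ្ធិពលដល់នយោបាយពិភពលោក។",
    # ... more Khmer documents
]

# Batch-encode for a downstream embedding or RAG model; add padding=True
# if the tokenizer defines a pad token and your pipeline needs fixed shapes.
batch = tokenizer(docs, truncation=True, max_length=512)
print(batch["input_ids"])
```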