---
library_name: transformers
language: ["khm"]
license: mit
tags: ["tokenizer", "khmer", "Unigram", "general-purpose", "sentencepiece"]
---

# πŸ‡°πŸ‡­ KM Improved 22K Tokenizer

A **general-purpose Khmer tokenizer** optimized for both accuracy and speed. It provides a stable backbone for Khmer NLP applications such as classification, question answering, translation, and summarization.

---

## 🧠 Model Details

### Model Description

- **Developer:** Sok Meas (@Msok99)
- **Model type:** SentencePiece Unigram Tokenizer
- **Language:** Khmer (khm)
- **License:** MIT
- **Finetuned from:** None (trained from scratch)

### Model Sources

- **Repository:** [https://huggingface.co/Msok99/km-improved-22k](https://huggingface.co/Msok99/km-improved-22k)

---

## βš™οΈ Uses

### Direct Use

- Tokenizing Khmer text for downstream NLP models
- Preparing training data for transformer-based fine-tuning
- Segmenting sentences for analysis or embedding generation

### Downstream Use

- Integration into Khmer LLMs or chatbots
- Pre- and post-processing for summarization or translation systems

### Out-of-Scope Use

- Not designed for English or heavily mixed Khmer–English content
- Not an inference or generation model itself

---

## βš–οΈ Bias, Risks & Limitations

- Very long or compound words may still split into several sub-tokens
- Limited exposure to informal slang or non-standard Khmer orthography

### Recommendations

For code-switched text (Khmer + English), use the merged model [`Msok99/lfm2-khmer-merged-18k`](https://huggingface.co/Msok99/lfm2-khmer-merged-18k).

---

## πŸš€ How to Get Started

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Msok99/km-improved-22k")

text = "αž€αŸ’αž“αž»αž„αž†αŸ’αž“αžΆαŸ†αŸ’αŸ αŸ’αŸ₯ αž€αž˜αŸ’αž–αž»αž‡αžΆαž“αžΉαž„αž’αž—αž·αžœαžŒαŸ’αžαž“αŸαž”αž…αŸ’αž…αŸαž€αžœαž·αž‘αŸ’αž™αžΆαžαŸ’αž˜αžΈαŸ”"
tokens = tokenizer.tokenize(text)
print(tokens)
print(tokenizer.decode(tokenizer.encode(text)))
```
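
---

## 🧩 How Unigram Segmentation Works

Since this is a SentencePiece **Unigram** tokenizer, segmentation is chosen by a Viterbi search for the split with the highest total piece log-probability. The sketch below illustrates that search on a hypothetical toy vocabulary with made-up log-probabilities; the real 22K vocabulary and its probabilities are learned from the Khmer training corpus.

```python
import math

def unigram_segment(text, vocab):
    """Viterbi search for the highest-probability segmentation
    under a unigram model (vocab maps piece -> log-probability)."""
    n = len(text)
    best = [float("-inf")] * (n + 1)  # best score for text[:i]
    best[0] = 0.0
    back = [0] * (n + 1)              # backpointer: start of last piece
    for i in range(1, n + 1):
        for j in range(max(0, i - 10), i):  # cap piece length at 10 chars
            piece = text[j:i]
            if piece in vocab and best[j] + vocab[piece] > best[i]:
                best[i] = best[j] + vocab[piece]
                back[i] = j
    # Walk backpointers to recover the winning segmentation
    pieces, i = [], n
    while i > 0:
        pieces.append(text[back[i]:i])
        i = back[i]
    return pieces[::-1]

# Hypothetical toy vocabulary: frequent multi-char pieces score much
# higher than single-character fallbacks, so the search prefers them.
vocab = {c: math.log(0.001) for c in "tokenizer"}
vocab.update({"token": math.log(0.1), "izer": math.log(0.05),
              "ize": math.log(0.04)})

print(unigram_segment("tokenizer", vocab))  # β†’ ['token', 'izer']
```

The same principle applies to Khmer text: because Khmer script has no whitespace word boundaries, the learned piece probabilities are what decide where one sub-token ends and the next begins.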