---
language:
- mnw
library_name: transformers
license: mit
tags:
- tokenizer
- mon
- mnw
- myanmar
- sentencepiece
- llama
pipeline_tag: text-generation
widget:
- text: "ဘာသာမန် ပရူပရာတံဂှ်"
  example_title: "Mon Language Example"
---

# Mon Language Tokenizer

A SentencePiece tokenizer for the Mon language (mnw) with a 4,000-token vocabulary, compatible with Hugging Face Transformers via the Llama tokenizer architecture.

## Model Details

- **Language**: Mon (mnw)
- **Vocabulary Size**: 4,000 tokens
- **Algorithm**: SentencePiece (Unigram Language Model)
- **Tokenizer Type**: LlamaTokenizerFast
- **Special Tokens**: `<s>`, `</s>`, `<unk>`, `<pad>`
- **Context Length**: 4,096 tokens
- **Updated**: August 31, 2025

## Usage

```python
from transformers import AutoTokenizer

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("janakhpon/mon_tokenizer")

# Tokenize Mon text
text = "ဘာသာမန် ပရူပရာတံဂှ် ကၠောန်ဗဒှ်လဝ်ရ။"
tokens = tokenizer(text, return_tensors="pt")

# Decode token IDs back to text
decoded = tokenizer.decode(tokens["input_ids"][0], skip_special_tokens=True)
print(decoded)  # ဘာသာမန် ပရူပရာတံဂှ် ကၠောန်ဗဒှ်လဝ်ရ။
```

## Technical Specifications

- **Tokenizer Class**: `LlamaTokenizerFast`
- **Vocabulary Type**: Subword tokenization using SentencePiece
- **Training Algorithm**: Unigram Language Model
- **OOV Handling**: `<unk>` token for unknown words
- **Legacy Mode**: Enabled for maximum compatibility
- **Fast Tokenizer**: Ships with `tokenizer.json` for fast Rust-backed tokenization

## Training Data

The tokenizer was trained on a Mon language corpus including:

- Wikipedia articles in Mon
- News articles and publications
- Literary works and traditional texts
- Modern digital content

Total training data: Not specified

## Performance

- **Coverage**: High coverage of Mon language vocabulary
- **Efficiency**: Segmentation tuned to Mon language morphology
- **Compatibility**: Full compatibility with Transformers 4.x
- **Speed**: Fast (Rust-backed) tokenizer for improved throughput

## License

This tokenizer is released under the MIT License.

## Citation

If you use this tokenizer in your research, please cite:

```bibtex
@misc{mon_tokenizer_2025,
  title={Mon Language Tokenizer for Hugging Face Transformers},
  author={Mon Language Project},
  year={2025},
  url={https://huggingface.co/janakhpon/mon_tokenizer}
}
```

## Contact

For questions or issues, please open an issue on the repository or contact the maintainers.
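## Appendix: How Unigram Segmentation Works

The Unigram Language Model algorithm mentioned under Technical Specifications picks, for each input string, the subword split with the highest product of piece probabilities. The toy sketch below illustrates the idea with a hand-made ASCII vocabulary and a Viterbi search; the pieces and probabilities are invented for illustration and are not the actual mon_tokenizer vocabulary.

```python
import math

# Toy unigram vocabulary: subword piece -> probability.
# Illustrative only; not the real mon_tokenizer vocabulary.
VOCAB = {
    "token": 0.05, "tok": 0.02, "en": 0.04,
    "izer": 0.03, "ize": 0.01, "r": 0.005,
    "t": 0.001, "o": 0.001, "k": 0.001, "e": 0.001,
    "n": 0.001, "i": 0.001, "z": 0.001,
}

def segment(text):
    """Viterbi search for the most probable subword segmentation."""
    n = len(text)
    # best[i] = (best log-probability of text[:i], split point that achieved it)
    best = [(-math.inf, -1)] * (n + 1)
    best[0] = (0.0, -1)
    for i in range(1, n + 1):
        for j in range(max(0, i - 10), i):  # cap piece length at 10 chars
            piece = text[j:i]
            if piece in VOCAB and best[j][0] > -math.inf:
                score = best[j][0] + math.log(VOCAB[piece])
                if score > best[i][0]:
                    best[i] = (score, j)
    # Backtrack from the end of the string to recover the chosen pieces.
    pieces, i = [], n
    while i > 0:
        j = best[i][1]
        pieces.append(text[j:i])
        i = j
    return list(reversed(pieces))

print(segment("tokenizer"))  # -> ['token', 'izer']
```

Here `"token" + "izer"` beats `"tok" + "en" + "izer"` because two probable pieces accumulate less log-probability penalty than three; the real tokenizer applies the same principle over its learned 4,000-piece vocabulary.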