---
language:
- mnw
library_name: transformers
license: mit
tags:
- tokenizer
- mon
- mnw
- myanmar
- sentencepiece
- llama
pipeline_tag: text-generation
widget:
- text: "ဘာသာမန် ပရူပရာတံဂှ်"
example_title: "Mon Language Example"
---
# Mon Language Tokenizer
A SentencePiece tokenizer for the Mon language (mnw) with a 4,000-token vocabulary,
compatible with Hugging Face Transformers via the Llama tokenizer architecture.
## Model Details
- **Language**: Mon (mnw)
- **Vocabulary Size**: 4,000 tokens
- **Algorithm**: SentencePiece (Unigram Language Model)
- **Tokenizer Type**: LlamaTokenizerFast
- **Special Tokens**: `<s>`, `</s>`, `<unk>`, `<pad>`
- **Context Length**: 4,096 tokens
- **Updated**: August 31, 2025
## Usage
```python
from transformers import AutoTokenizer
# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("janakhpon/mon_tokenizer")
# Tokenize Mon text
text = "ဘာသာမန် ပရူပရာတံဂှ် ကၠောန်ဗဒှ်လဝ်ရ။"
tokens = tokenizer(text, return_tensors="pt")
# Decode tokens back to text
decoded = tokenizer.decode(tokens["input_ids"][0], skip_special_tokens=True)
print(decoded) # ဘာသာမန် ပရူပရာတံဂှ် ကၠောန်ဗဒှ်လဝ်ရ။
```
## Technical Specifications
- **Tokenizer Class**: `LlamaTokenizerFast`
- **Vocabulary Type**: Subword tokenization using SentencePiece
- **Training Algorithm**: Unigram Language Model
- **OOV Handling**: `<unk>` token for out-of-vocabulary words
- **Legacy Mode**: Enabled for maximum compatibility
- **Fast Tokenizer**: Includes tokenizer.json for optimal performance
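The Unigram Language Model listed above segments text by choosing the most probable sequence of vocabulary pieces. The toy sketch below illustrates the idea with a Viterbi dynamic program over an invented vocabulary with made-up log-probability scores; a real SentencePiece model learns thousands of pieces and their scores from the training corpus.

```python
import math

# Toy Unigram-LM segmentation: given vocabulary pieces with log-probability
# scores, find the most probable segmentation of a word via Viterbi dynamic
# programming. The pieces and scores here are invented for illustration.
VOCAB = {
    "to": -3.0, "ken": -3.5, "token": -4.0,
    "iz": -4.5, "er": -3.0, "izer": -5.5,
    "t": -6.0, "o": -6.0, "k": -6.0, "e": -6.0,
    "n": -6.0, "i": -6.0, "z": -6.0, "r": -6.0,
}

def segment(word):
    """Return the maximum-probability segmentation of `word` under VOCAB."""
    n = len(word)
    # best[i] = (best score for word[:i], start index of the last piece)
    best = [(-math.inf, None)] * (n + 1)
    best[0] = (0.0, None)
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in VOCAB:
                score = best[start][0] + VOCAB[piece]
                if score > best[end][0]:
                    best[end] = (score, start)
    # Follow backpointers to recover the winning pieces.
    pieces, pos = [], n
    while pos > 0:
        start = best[pos][1]
        pieces.append(word[start:pos])
        pos = start
    return list(reversed(pieces))

print(segment("tokenizer"))  # ['token', 'izer']
```

Because longer pieces with good scores beat chains of single characters, the search prefers `token` + `izer` over letter-by-letter splits, which is how a Unigram tokenizer keeps common Mon words as single tokens while still covering rare strings.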
## Training Data
The tokenizer was trained on a comprehensive Mon language corpus including:
- Wikipedia articles in Mon language
- News articles and publications
- Literary works and traditional texts
- Modern digital content
The total size of the training corpus is not specified.
## Performance
- **Coverage**: High coverage of Mon language vocabulary
- **Efficiency**: Optimized for Mon language morphology
- **Compatibility**: Full compatibility with Transformers 4.x
- **Speed**: Fast tokenizer for improved performance
## License
This tokenizer is released under the MIT License.
## Citation
If you use this tokenizer in your research, please cite:
```bibtex
@misc{mon_tokenizer_2025,
  title={Mon Language Tokenizer for Hugging Face Transformers},
  author={Mon Language Project},
  year={2025},
  url={https://huggingface.co/janakhpon/mon_tokenizer}
}
```
## Contact
For questions or issues, please open an issue on the repository or contact the maintainers.