---
language:
- mnw
library_name: transformers
license: mit
tags:
- tokenizer
- mon
- mnw
- myanmar
- sentencepiece
- llama
pipeline_tag: text-generation
widget:
- text: "ဘာသာမန် ပရူပရာတံဂှ်"
example_title: "Mon Language Example"
---
# Mon Language Tokenizer
A SentencePiece tokenizer for the Mon language (mnw) with a 4,000-token vocabulary,
compatible with Hugging Face Transformers and the Llama tokenizer architecture.
## Model Details
- **Language**: Mon (mnw)
- **Vocabulary Size**: 4,000 tokens
- **Algorithm**: SentencePiece (Unigram Language Model)
- **Tokenizer Type**: LlamaTokenizerFast
- **Special Tokens**: `<s>`, `</s>`, `<unk>`, `<pad>`
- **Context Length**: 4,096 tokens
- **Updated**: August 31, 2025
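These details can be checked directly against the loaded tokenizer. A minimal sketch (the expected values in the comments come from this card, not from running the code; whether `pad_token` is set depends on the shipped `tokenizer_config.json`):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("janakhpon/mon_tokenizer")

# Vocabulary size and context length as advertised above
print(len(tokenizer))              # expected: 4000
print(tokenizer.model_max_length)  # expected: 4096

# Special tokens listed above
print(tokenizer.bos_token, tokenizer.eos_token,
      tokenizer.unk_token, tokenizer.pad_token)
```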
## Usage
```python
from transformers import AutoTokenizer

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("janakhpon/mon_tokenizer")

# Tokenize Mon text
text = "ဘာသာမန် ပရူပရာတံဂှ် ကၠောန်ဗဒှ်လဝ်ရ။"
tokens = tokenizer(text, return_tensors="pt")

# Decode tokens back to text
decoded = tokenizer.decode(tokens["input_ids"][0], skip_special_tokens=True)
print(decoded)  # ဘာသာမန် ပရူပရာတံဂှ် ကၠောန်ဗဒှ်လဝ်ရ။
```
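Continuing from the snippet above, you can inspect how Mon text is segmented into subword pieces, or encode a batch for model input. A short sketch (the exact pieces printed depend on the trained vocabulary; `padding=True` assumes a pad token is configured):

```python
# Show the raw SentencePiece pieces for a Mon phrase
pieces = tokenizer.tokenize("ဘာသာမန် ပရူပရာတံဂှ်")
print(pieces)

# Batch-encode two sentences, padded to equal length
batch = tokenizer(
    ["ဘာသာမန်", "ပရူပရာတံဂှ် ကၠောန်ဗဒှ်လဝ်ရ။"],
    padding=True,
    return_tensors="pt",
)
print(batch["input_ids"].shape)
print(batch["attention_mask"])
```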
## Technical Specifications
- **Tokenizer Class**: `LlamaTokenizerFast`
- **Vocabulary Type**: Subword tokenization using SentencePiece
- **Training Algorithm**: Unigram Language Model
- **OOV Handling**: `<unk>` token for unknown words
- **Legacy Mode**: Enabled for maximum compatibility
- **Fast Tokenizer**: Includes tokenizer.json for optimal performance
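Two of these points are easy to probe at runtime, continuing from the usage snippet above. `is_fast` confirms the Rust-backed tokenizer built from `tokenizer.json` loaded, and encoding a character outside the Mon script shows the OOV behaviour (depending on how the SentencePiece model was trained, e.g. with byte fallback, this may yield byte pieces rather than `<unk>`):

```python
# True when the fast (Rust) backend built from tokenizer.json is in use
print(tokenizer.is_fast)

# Encode a character unlikely to appear in the Mon training data
ids = tokenizer.encode("☃", add_special_tokens=False)
print(tokenizer.convert_ids_to_tokens(ids))  # <unk> or byte-fallback pieces
```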
## Training Data
The tokenizer was trained on a Mon-language corpus drawn from several sources:
- Wikipedia articles in Mon language
- News articles and publications
- Literary works and traditional texts
- Modern digital content
The total size of the training corpus is not specified.
## Performance
- **Coverage**: Broad coverage of Mon vocabulary via subword segmentation
- **Efficiency**: Segmentation tuned to Mon morphology
- **Compatibility**: Works with the Transformers 4.x API
- **Speed**: Fast (Rust-backed) tokenizer for high-throughput encoding
## License
This tokenizer is released under the MIT License.
## Citation
If you use this tokenizer in your research, please cite:
```bibtex
@misc{mon_tokenizer_2025,
  title={Mon Language Tokenizer for Hugging Face Transformers},
  author={Mon Language Project},
  year={2025},
  url={https://huggingface.co/janakhpon/mon_tokenizer}
}
```
## Contact
For questions or issues, please open an issue on the repository or contact the maintainers.