---
language:
- mnw
library_name: transformers
license: mit
tags:
- tokenizer
- mon
- mnw
- myanmar
- sentencepiece
- llama
pipeline_tag: text-generation
widget:
- text: "ဘာသာမန် ပရူပရာတံဂှ်"
  example_title: "Mon Language Example"
---

# Mon Language Tokenizer

A SentencePiece tokenizer for the Mon language (mnw) with a 4,000-token vocabulary, compatible with Hugging Face Transformers and the Llama tokenizer architecture.

## Model Details

- **Language**: Mon (mnw)
- **Vocabulary Size**: 4,000 tokens
- **Algorithm**: SentencePiece (Unigram Language Model)
- **Tokenizer Type**: LlamaTokenizerFast
- **Special Tokens**: `<s>`, `</s>`, `<unk>`, `<pad>`
- **Context Length**: 4,096 tokens
- **Updated**: August 31, 2025

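The values above can be checked directly against the loaded tokenizer. A minimal sanity-check sketch, using the repository id from the Usage section below:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("janakhpon/mon_tokenizer")

# Vocabulary size: should report 4000
print(tokenizer.vocab_size)

# Special tokens: <s>, </s>, <unk>, <pad>
print(tokenizer.bos_token, tokenizer.eos_token, tokenizer.unk_token, tokenizer.pad_token)

# Configured maximum context length: 4096
print(tokenizer.model_max_length)
```
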
## Usage

```python
from transformers import AutoTokenizer

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("janakhpon/mon_tokenizer")

# Tokenize Mon text
text = "ဘာသာမန် ပရူပရာတံဂှ် ကၠောန်ဗဒှ်လဝ်ရ။"
tokens = tokenizer(text, return_tensors="pt")

# Decode tokens back to text
decoded = tokenizer.decode(tokens["input_ids"][0], skip_special_tokens=True)
print(decoded)  # ဘာသာမန် ပရူပရာတံဂှ် ကၠောန်ဗဒှ်လဝ်ရ။
```

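Beyond round-tripping text, it can be useful to inspect the subword pieces themselves. A short sketch using the standard `tokenize` and `convert_tokens_to_ids` helpers:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("janakhpon/mon_tokenizer")
text = "ဘာသာမန် ပရူပရာတံဂှ် ကၠောန်ဗဒှ်လဝ်ရ။"

# View the SentencePiece segmentation of the sentence
pieces = tokenizer.tokenize(text)
print(pieces)

# Map each piece to its id in the 4,000-entry vocabulary
ids = tokenizer.convert_tokens_to_ids(pieces)
print(ids)
```
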
## Technical Specifications

- **Tokenizer Class**: `LlamaTokenizerFast`
- **Vocabulary Type**: Subword tokenization using SentencePiece
- **Training Algorithm**: Unigram Language Model
- **OOV Handling**: `<unk>` token for out-of-vocabulary pieces (see the sketch below)
- **Legacy Mode**: Enabled for maximum compatibility
- **Fast Tokenizer**: Ships a `tokenizer.json` for the fast, Rust-backed implementation

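To see the OOV path in practice, text containing symbols outside the vocabulary can be encoded and inspected. Depending on how the SentencePiece model was trained (for example, whether byte fallback was enabled), such pieces map to `<unk>` or to byte-level tokens; a minimal sketch:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("janakhpon/mon_tokenizer")

# Symbols unlikely to appear in a Mon-only training corpus
oov_text = "Ω≈√"
ids = tokenizer(oov_text)["input_ids"]

# Unknown pieces surface as <unk> (or as byte tokens, if byte fallback was trained in)
print(tokenizer.convert_ids_to_tokens(ids))
print("unk id:", tokenizer.unk_token_id)
```
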
## Training Data

The tokenizer was trained on a Mon language corpus including:

- Wikipedia articles in Mon
- News articles and publications
- Literary works and traditional texts
- Modern digital content

Total training data size: not specified.

## Performance

- **Coverage**: High coverage of Mon language vocabulary
- **Efficiency**: Subword segmentation suited to Mon morphology
- **Compatibility**: Works with Transformers 4.x
- **Speed**: Fast, Rust-backed tokenizer implementation

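These are qualitative claims; one simple way to quantify segmentation efficiency on your own data is fertility, the average number of subword tokens per whitespace-delimited unit. A rough sketch, where the sample sentence stands in for a real evaluation corpus (note that Mon script does not mark word boundaries consistently, so whitespace splitting is only a proxy):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("janakhpon/mon_tokenizer")

# Replace with a representative Mon evaluation corpus
samples = ["ဘာသာမန် ပရူပရာတံဂှ် ကၠောန်ဗဒှ်လဝ်ရ။"]

total_tokens = sum(len(tokenizer.tokenize(s)) for s in samples)
total_units = sum(len(s.split()) for s in samples)

# Lower fertility means fewer subwords per unit, i.e. more efficient segmentation
print(f"fertility: {total_tokens / total_units:.2f}")
```
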
## License

This tokenizer is released under the MIT License.

## Citation

If you use this tokenizer in your research, please cite:

```bibtex
@misc{mon_tokenizer_2025,
  title={Mon Language Tokenizer for Hugging Face Transformers},
  author={Mon Language Project},
  year={2025},
  url={https://huggingface.co/janakhpon/mon_tokenizer}
}
```

## Contact

For questions or issues, please open an issue on the repository or contact the maintainers.
|