---
language:
- mnw
library_name: transformers
license: mit
tags:
- tokenizer
- mon
- mnw
- myanmar
- sentencepiece
- llama
pipeline_tag: text-generation
widget:
- text: "ဘာသာမန် ပရူပရာတံဂှ်"
  example_title: "Mon Language Example"
---

# Mon Language Tokenizer

A SentencePiece tokenizer for the Mon language (mnw) with a 4,000-token vocabulary, compatible with Hugging Face Transformers via the Llama tokenizer architecture.

## Model Details

- **Language**: Mon (mnw)
- **Vocabulary Size**: 4,000 tokens
- **Algorithm**: SentencePiece (Unigram Language Model)
- **Tokenizer Type**: LlamaTokenizerFast
- **Special Tokens**: `<s>`, `</s>`, `<unk>`, `<pad>`
- **Context Length**: 4,096 tokens
- **Updated**: August 31, 2025

## Usage

```python
from transformers import AutoTokenizer

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("janakhpon/mon_tokenizer")

# Tokenize Mon text
text = "ဘာသာမန် ပရူပရာတံဂှ် ကၠောန်ဗဒှ်လဝ်ရ။"
tokens = tokenizer(text, return_tensors="pt")

# Decode token IDs back to text
decoded = tokenizer.decode(tokens["input_ids"][0], skip_special_tokens=True)
print(decoded)  # ဘာသာမန် ပရူပရာတံဂှ် ကၠောန်ဗဒှ်လဝ်ရ။
```

## Technical Specifications

- **Tokenizer Class**: `LlamaTokenizerFast`
- **Vocabulary Type**: Subword tokenization using SentencePiece
- **Training Algorithm**: Unigram Language Model
- **OOV Handling**: `<unk>` token for unknown words
- **Legacy Mode**: Enabled for maximum compatibility
- **Fast Tokenizer**: Ships with `tokenizer.json` for fast Rust-backed tokenization

## Training Data

The tokenizer was trained on a Mon language corpus including:

- Wikipedia articles in Mon
- News articles and publications
- Literary works and traditional texts
- Modern digital content

Total training data: Not specified

## Performance

- **Coverage**: High coverage of Mon language vocabulary
- **Efficiency**: Segmentation tuned to Mon language morphology
- **Compatibility**: Full compatibility with Transformers 4.x
- **Speed**: Fast (Rust-backed) tokenizer for improved throughput

## License

This tokenizer is released under the MIT License.

## Citation

If you use this tokenizer in your research, please cite:

```bibtex
@misc{mon_tokenizer_2025,
  title={Mon Language Tokenizer for Hugging Face Transformers},
  author={Mon Language Project},
  year={2025},
  url={https://huggingface.co/janakhpon/mon_tokenizer}
}
```

## Contact

For questions or issues, please open an issue on the repository or contact the maintainers.
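## Appendix: How Unigram Segmentation Works

The Unigram Language Model algorithm mentioned under Technical Specifications picks, for each input string, the subword split with the highest product of piece probabilities. The toy sketch below illustrates the idea with a hand-made ASCII vocabulary and a Viterbi search; the pieces and probabilities are invented for illustration and are not the actual mon_tokenizer vocabulary.

```python
import math

# Toy unigram vocabulary: subword piece -> probability.
# Illustrative only; not the real mon_tokenizer vocabulary.
VOCAB = {
    "token": 0.05, "tok": 0.02, "en": 0.04,
    "izer": 0.03, "ize": 0.01, "r": 0.005,
    "t": 0.001, "o": 0.001, "k": 0.001, "e": 0.001,
    "n": 0.001, "i": 0.001, "z": 0.001,
}

def segment(text):
    """Viterbi search for the most probable subword segmentation."""
    n = len(text)
    # best[i] = (best log-probability of text[:i], split point that achieved it)
    best = [(-math.inf, -1)] * (n + 1)
    best[0] = (0.0, -1)
    for i in range(1, n + 1):
        for j in range(max(0, i - 10), i):  # cap piece length at 10 chars
            piece = text[j:i]
            if piece in VOCAB and best[j][0] > -math.inf:
                score = best[j][0] + math.log(VOCAB[piece])
                if score > best[i][0]:
                    best[i] = (score, j)
    # Backtrack from the end of the string to recover the chosen pieces.
    pieces, i = [], n
    while i > 0:
        j = best[i][1]
        pieces.append(text[j:i])
        i = j
    return list(reversed(pieces))

print(segment("tokenizer"))  # -> ['token', 'izer']
```

Here `"token" + "izer"` beats `"tok" + "en" + "izer"` because two probable pieces accumulate less log-probability penalty than three; the real tokenizer applies the same principle over its learned 4,000-piece vocabulary.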