---
language:
  - mnw
library_name: transformers
license: mit
tags:
  - tokenizer
  - mon
  - mnw
  - myanmar
  - sentencepiece
  - llama
pipeline_tag: text-generation
widget:
  - text: ဘာသာမန် ပရူပရာတံဂှ်
    example_title: Mon Language Example
---

# Mon Language Tokenizer

A high-quality SentencePiece tokenizer for the Mon language (mnw) with 4,000 tokens, compatible with Hugging Face Transformers and the Llama tokenizer architecture.

## Model Details

- **Language:** Mon (mnw)
- **Vocabulary Size:** 4,000 tokens
- **Algorithm:** SentencePiece (unigram language model)
- **Tokenizer Type:** `LlamaTokenizerFast`
- **Special Tokens:** `<s>`, `</s>`, `<unk>`, `<pad>`
- **Context Length:** 4,096 tokens
- **Updated:** August 31, 2025

## Usage

```python
from transformers import AutoTokenizer

# Load the tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("janakhpon/mon_tokenizer")

# Tokenize Mon text
text = "ဘာသာမန် ပရူပရာတံဂှ် ကၠောန်ဗဒှ်လဝ်ရ။"
tokens = tokenizer(text, return_tensors="pt")

# Decode token IDs back to text
decoded = tokenizer.decode(tokens["input_ids"][0], skip_special_tokens=True)
print(decoded)  # ဘာသာမန် ပရူပရာတံဂှ် ကၠောန်ဗဒှ်လဝ်ရ။
```
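Decoding works because SentencePiece stores whitespace inside the pieces themselves: a leading `▁` (U+2581) marks a word boundary. The following minimal, self-contained sketch (not this tokenizer's actual implementation, just an illustration of the convention) shows how pieces are rejoined into text:

```python
def decode_pieces(pieces):
    """Join SentencePiece-style subword pieces back into plain text.

    "\u2581" (the visible underscore-like marker) stands for a word
    boundary, i.e. a leading space, so joining and replacing it with a
    space reconstructs the original string.
    """
    text = "".join(pieces)
    return text.replace("\u2581", " ").strip()


# Example: three pieces, two of which start a new word
print(decode_pieces(["\u2581Hello", "\u2581wor", "ld"]))  # Hello world
```

This is why decoded output has no stray spaces inside words: only pieces carrying the `▁` marker introduce one.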

## Technical Specifications

- **Tokenizer Class:** `LlamaTokenizerFast`
- **Vocabulary Type:** Subword tokenization via SentencePiece
- **Training Algorithm:** Unigram language model
- **OOV Handling:** Unknown pieces map to the `<unk>` token
- **Legacy Mode:** Enabled for maximum compatibility
- **Fast Tokenizer:** Ships a `tokenizer.json` for the fast (Rust-backed) tokenizer
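To give an intuition for the unigram language model used here, the sketch below implements the core idea in miniature: each vocabulary piece has a log-probability, and Viterbi search picks the segmentation with the highest total score, falling back to `<unk>` for characters outside the vocabulary. The toy vocabulary and scores are invented for illustration and are unrelated to this tokenizer's actual 4,000-piece vocabulary:

```python
import math

def viterbi_segment(text, logprobs, unk_logprob=-20.0):
    """Most-probable segmentation of `text` under a unigram piece model.

    best[i] holds (score, backpointer) for the best segmentation of the
    first i characters; single out-of-vocabulary characters become <unk>.
    """
    n = len(text)
    best = [(-math.inf, None)] * (n + 1)
    best[0] = (0.0, None)
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            lp = logprobs.get(piece)
            if lp is None:
                if end - start > 1:
                    continue  # only single characters fall back to <unk>
                lp, piece = unk_logprob, "<unk>"
            score = best[start][0] + lp
            if score > best[end][0]:
                best[end] = (score, (start, piece))
    # Backtrack from the end to recover the chosen pieces
    pieces, end = [], n
    while end > 0:
        start, piece = best[end][1]
        pieces.append(piece)
        end = start
    return list(reversed(pieces))


# Toy vocabulary: the whole-word entry is scored low enough that the
# three-piece segmentation wins.
vocab = {"un": -3.0, "know": -4.0, "n": -2.5, "unknown": -12.0, "k": -2.0}
print(viterbi_segment("unknown", vocab))  # ['un', 'know', 'n']
print(viterbi_segment("xn", vocab))       # ['<unk>', 'n']
```

The real trainer also iteratively prunes low-probability pieces until the target vocabulary size (here 4,000) is reached; the sketch only covers the inference step.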

## Training Data

The tokenizer was trained on a comprehensive Mon language corpus including:

- Wikipedia articles in the Mon language
- News articles and publications
- Literary works and traditional texts
- Modern digital content

**Total training data:** not specified.

## Performance

- **Coverage:** High coverage of Mon language vocabulary
- **Efficiency:** Segmentation tuned to Mon language morphology
- **Compatibility:** Full compatibility with Transformers 4.x
- **Speed:** Fast (Rust-backed) tokenizer for improved throughput
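Coverage and efficiency claims like these are commonly quantified with two simple metrics: fertility (average subword tokens per whitespace word; lower means more efficient segmentation) and the `<unk>` rate (fraction of tokens falling outside the vocabulary; lower means better coverage). A minimal sketch, computed from already-tokenized output; the helper names and the numbers below are illustrative, not measurements of this tokenizer:

```python
def fertility(tokens_per_sentence, words_per_sentence):
    """Average number of subword tokens produced per whitespace word."""
    return sum(tokens_per_sentence) / sum(words_per_sentence)

def unk_rate(token_ids, unk_id=0):
    """Fraction of token IDs equal to the <unk> ID (0 in Llama-style vocabs)."""
    return sum(1 for t in token_ids if t == unk_id) / len(token_ids)


# Illustrative values: two sentences producing 5 and 8 tokens
# from 3 and 4 words respectively, and one ID sequence with one <unk>.
print(fertility([5, 8], [3, 4]))      # ~1.857 tokens per word
print(unk_rate([0, 17, 254, 9]))      # 0.25
```

Running these over a held-out Mon corpus would turn the qualitative bullets above into reportable numbers.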

## License

This tokenizer is released under the MIT License.

## Citation

If you use this tokenizer in your research, please cite:

```bibtex
@misc{mon_tokenizer_2025,
  title  = {Mon Language Tokenizer for Hugging Face Transformers},
  author = {Mon Language Project},
  year   = {2025},
  url    = {https://huggingface.co/janakhpon/mon_tokenizer}
}
```

## Contact

For questions or issues, please open an issue on the repository or contact the maintainers.