---
language:
  - mnw
library_name: transformers
license: mit
tags:
  - tokenizer
  - mon
  - mnw
  - myanmar
  - sentencepiece
  - llama
pipeline_tag: text-generation
widget:
  - text: ဘာသာမန် ပရူပရာတံဂှ်
    example_title: Mon Language Example
---

# Mon Language Tokenizer

A high-quality SentencePiece tokenizer for the Mon language (mnw) with 4,000 tokens, compatible with Hugging Face Transformers and the Llama tokenizer architecture.

## Model Details

- **Language:** Mon (mnw)
- **Vocabulary Size:** 4,000 tokens
- **Algorithm:** SentencePiece (unigram language model)
- **Tokenizer Type:** `LlamaTokenizerFast`
- **Special Tokens:** `<s>`, `</s>`, `<unk>`, `<pad>`
- **Context Length:** 4,096 tokens
- **Updated:** August 31, 2025

## Usage

```python
from transformers import AutoTokenizer

# Load the tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("janakhpon/mon_tokenizer")

# Tokenize Mon text
text = "ဘာသာမန် ပရူပရာတံဂှ် ကၠောန်ဗဒှ်လဝ်ရ။"
tokens = tokenizer(text, return_tensors="pt")

# Decode token IDs back to text
decoded = tokenizer.decode(tokens["input_ids"][0], skip_special_tokens=True)
print(decoded)  # ဘာသာမန် ပရူပရာတံဂှ် ကၠောန်ဗဒှ်လဝ်ရ။
```
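Decoding works because SentencePiece stores whitespace inside the pieces themselves: a leading `▁` (U+2581) marks a word boundary. The following minimal, self-contained sketch (not this tokenizer's actual implementation, just an illustration of the convention) shows how pieces are rejoined into text:

```python
def decode_pieces(pieces):
    """Join SentencePiece-style subword pieces back into plain text.

    "\u2581" (the visible underscore-like marker) stands for a word
    boundary, i.e. a leading space, so joining and replacing it with a
    space reconstructs the original string.
    """
    text = "".join(pieces)
    return text.replace("\u2581", " ").strip()


# Example: three pieces, two of which start a new word
print(decode_pieces(["\u2581Hello", "\u2581wor", "ld"]))  # Hello world
```

This is why decoded output has no stray spaces inside words: only pieces carrying the `▁` marker introduce one.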

## Technical Specifications

- **Tokenizer Class:** `LlamaTokenizerFast`
- **Vocabulary Type:** Subword tokenization via SentencePiece
- **Training Algorithm:** Unigram language model
- **OOV Handling:** Unknown pieces map to the `<unk>` token
- **Legacy Mode:** Enabled for maximum compatibility
- **Fast Tokenizer:** Ships a `tokenizer.json` for the fast (Rust-backed) tokenizer
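To give an intuition for the unigram language model used here, the sketch below implements the core idea in miniature: each vocabulary piece has a log-probability, and Viterbi search picks the segmentation with the highest total score, falling back to `<unk>` for characters outside the vocabulary. The toy vocabulary and scores are invented for illustration and are unrelated to this tokenizer's actual 4,000-piece vocabulary:

```python
import math

def viterbi_segment(text, logprobs, unk_logprob=-20.0):
    """Most-probable segmentation of `text` under a unigram piece model.

    best[i] holds (score, backpointer) for the best segmentation of the
    first i characters; single out-of-vocabulary characters become <unk>.
    """
    n = len(text)
    best = [(-math.inf, None)] * (n + 1)
    best[0] = (0.0, None)
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            lp = logprobs.get(piece)
            if lp is None:
                if end - start > 1:
                    continue  # only single characters fall back to <unk>
                lp, piece = unk_logprob, "<unk>"
            score = best[start][0] + lp
            if score > best[end][0]:
                best[end] = (score, (start, piece))
    # Backtrack from the end to recover the chosen pieces
    pieces, end = [], n
    while end > 0:
        start, piece = best[end][1]
        pieces.append(piece)
        end = start
    return list(reversed(pieces))


# Toy vocabulary: the whole-word entry is scored low enough that the
# three-piece segmentation wins.
vocab = {"un": -3.0, "know": -4.0, "n": -2.5, "unknown": -12.0, "k": -2.0}
print(viterbi_segment("unknown", vocab))  # ['un', 'know', 'n']
print(viterbi_segment("xn", vocab))       # ['<unk>', 'n']
```

The real trainer also iteratively prunes low-probability pieces until the target vocabulary size (here 4,000) is reached; the sketch only covers the inference step.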

## Training Data

The tokenizer was trained on a comprehensive Mon language corpus including:

- Wikipedia articles in the Mon language
- News articles and publications
- Literary works and traditional texts
- Modern digital content

**Total training data:** not specified.

## Performance

- **Coverage:** High coverage of Mon language vocabulary
- **Efficiency:** Segmentation tuned to Mon language morphology
- **Compatibility:** Full compatibility with Transformers 4.x
- **Speed:** Fast (Rust-backed) tokenizer for improved throughput
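Coverage and efficiency claims like these are commonly quantified with two simple metrics: fertility (average subword tokens per whitespace word; lower means more efficient segmentation) and the `<unk>` rate (fraction of tokens falling outside the vocabulary; lower means better coverage). A minimal sketch, computed from already-tokenized output; the helper names and the numbers below are illustrative, not measurements of this tokenizer:

```python
def fertility(tokens_per_sentence, words_per_sentence):
    """Average number of subword tokens produced per whitespace word."""
    return sum(tokens_per_sentence) / sum(words_per_sentence)

def unk_rate(token_ids, unk_id=0):
    """Fraction of token IDs equal to the <unk> ID (0 in Llama-style vocabs)."""
    return sum(1 for t in token_ids if t == unk_id) / len(token_ids)


# Illustrative values: two sentences producing 5 and 8 tokens
# from 3 and 4 words respectively, and one ID sequence with one <unk>.
print(fertility([5, 8], [3, 4]))      # ~1.857 tokens per word
print(unk_rate([0, 17, 254, 9]))      # 0.25
```

Running these over a held-out Mon corpus would turn the qualitative bullets above into reportable numbers.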

## License

This tokenizer is released under the MIT License.

## Citation

If you use this tokenizer in your research, please cite:

```bibtex
@misc{mon_tokenizer_2025,
  title  = {Mon Language Tokenizer for Hugging Face Transformers},
  author = {Mon Language Project},
  year   = {2025},
  url    = {https://huggingface.co/janakhpon/mon_tokenizer}
}
```

## Contact

For questions or issues, please open an issue on the repository or contact the maintainers.