---
language:
- mnw
library_name: transformers
license: mit
tags:
- tokenizer
- mon
- mnw
- myanmar
- sentencepiece
- llama
pipeline_tag: text-generation
widget:
- text: "ဘာသာမန် ပရူပရာတံဂှ်"
  example_title: "Mon Language Example"
---

# Mon Language Tokenizer

A SentencePiece tokenizer for the Mon language (mnw) with a 4,000-token vocabulary, compatible with Hugging Face Transformers and the Llama tokenizer architecture.

## Model Details

- **Language**: Mon (mnw)
- **Vocabulary Size**: 4,000 tokens
- **Algorithm**: SentencePiece (Unigram Language Model)
- **Tokenizer Type**: LlamaTokenizerFast
- **Special Tokens**: `<s>`, `</s>`, `<unk>`, `<pad>`
- **Context Length**: 4,096 tokens
- **Updated**: August 31, 2025

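The values above can be checked directly against the loaded tokenizer. A minimal sanity-check sketch, using the repository id from the Usage section below:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("janakhpon/mon_tokenizer")

# Vocabulary size: should report 4000
print(tokenizer.vocab_size)

# Special tokens: <s>, </s>, <unk>, <pad>
print(tokenizer.bos_token, tokenizer.eos_token, tokenizer.unk_token, tokenizer.pad_token)

# Configured maximum context length: 4096
print(tokenizer.model_max_length)
```
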
## Usage

```python
from transformers import AutoTokenizer

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("janakhpon/mon_tokenizer")

# Tokenize Mon text
text = "ဘာသာမန် ပရူပရာတံဂှ် ကၠောန်ဗဒှ်လဝ်ရ။"
tokens = tokenizer(text, return_tensors="pt")

# Decode tokens back to text
decoded = tokenizer.decode(tokens["input_ids"][0], skip_special_tokens=True)
print(decoded)  # ဘာသာမန် ပရူပရာတံဂှ် ကၠောန်ဗဒှ်လဝ်ရ။
```

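Beyond round-tripping text, it can be useful to inspect the subword pieces themselves. A short sketch using the standard `tokenize` and `convert_tokens_to_ids` helpers:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("janakhpon/mon_tokenizer")
text = "ဘာသာမန် ပရူပရာတံဂှ် ကၠောန်ဗဒှ်လဝ်ရ။"

# View the SentencePiece segmentation of the sentence
pieces = tokenizer.tokenize(text)
print(pieces)

# Map each piece to its id in the 4,000-entry vocabulary
ids = tokenizer.convert_tokens_to_ids(pieces)
print(ids)
```
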
## Technical Specifications

- **Tokenizer Class**: `LlamaTokenizerFast`
- **Vocabulary Type**: Subword tokenization using SentencePiece
- **Training Algorithm**: Unigram Language Model
- **OOV Handling**: `<unk>` token for out-of-vocabulary pieces (see the sketch below)
- **Legacy Mode**: Enabled for maximum compatibility
- **Fast Tokenizer**: Ships a `tokenizer.json` for the fast, Rust-backed implementation

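To see the OOV path in practice, text containing symbols outside the vocabulary can be encoded and inspected. Depending on how the SentencePiece model was trained (for example, whether byte fallback was enabled), such pieces map to `<unk>` or to byte-level tokens; a minimal sketch:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("janakhpon/mon_tokenizer")

# Symbols unlikely to appear in a Mon-only training corpus
oov_text = "Ω≈√"
ids = tokenizer(oov_text)["input_ids"]

# Unknown pieces surface as <unk> (or as byte tokens, if byte fallback was trained in)
print(tokenizer.convert_ids_to_tokens(ids))
print("unk id:", tokenizer.unk_token_id)
```
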
## Training Data

The tokenizer was trained on a Mon language corpus including:

- Wikipedia articles in Mon
- News articles and publications
- Literary works and traditional texts
- Modern digital content

Total training data size: not specified.

## Performance

- **Coverage**: High coverage of Mon language vocabulary
- **Efficiency**: Subword segmentation suited to Mon morphology
- **Compatibility**: Works with Transformers 4.x
- **Speed**: Fast, Rust-backed tokenizer implementation

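These are qualitative claims; one simple way to quantify segmentation efficiency on your own data is fertility, the average number of subword tokens per whitespace-delimited unit. A rough sketch, where the sample sentence stands in for a real evaluation corpus (note that Mon script does not mark word boundaries consistently, so whitespace splitting is only a proxy):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("janakhpon/mon_tokenizer")

# Replace with a representative Mon evaluation corpus
samples = ["ဘာသာမန် ပရူပရာတံဂှ် ကၠောန်ဗဒှ်လဝ်ရ။"]

total_tokens = sum(len(tokenizer.tokenize(s)) for s in samples)
total_units = sum(len(s.split()) for s in samples)

# Lower fertility means fewer subwords per unit, i.e. more efficient segmentation
print(f"fertility: {total_tokens / total_units:.2f}")
```
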
## License

This tokenizer is released under the MIT License.

## Citation

If you use this tokenizer in your research, please cite:

```bibtex
@misc{mon_tokenizer_2025,
  title={Mon Language Tokenizer for Hugging Face Transformers},
  author={Mon Language Project},
  year={2025},
  url={https://huggingface.co/janakhpon/mon_tokenizer}
}
```

## Contact

For questions or issues, please open an issue on the repository or contact the maintainers.
|