Mon Language Tokenizer

A SentencePiece Unigram tokenizer for the Mon language (mnw).

Performance

Trained on 41.4M Mon-related characters (from a raw corpus of 92.8M characters / 176.7M bytes) across 8,841 documents.

Metric               Result
Vocabulary size      32,000
Avg. compression     5.22 chars/token
Round-trip accuracy  100%
Byte-fallback rate   0.00%
Model size           977 KB

The previous version (v0.1: 4k vocab, trained on 2.4M chars) achieved ~1.99 chars/token, making this release about 2.6× more efficient.
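
The compression and round-trip numbers can be reproduced along the following lines. This is a minimal sketch: the single sample sentence (taken from Usage below) stands in for a held-out Mon evaluation set, and add_special_tokens=False keeps BOS/EOS out of the token count.

from transformers import LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("janakhpon/mon_tokenizer")

# Placeholder evaluation set; a real measurement would use held-out Mon text.
sample_texts = ["မန်တဢဂှ် ကၠောန်ဗဒှ်လဝ်ရ။"]

chars = tokens = round_trips = 0
for text in sample_texts:
    ids = tokenizer.encode(text, add_special_tokens=False)
    chars += len(text)
    tokens += len(ids)
    round_trips += int(tokenizer.decode(ids) == text)

print(f"avg compression: {chars / tokens:.2f} chars/token")
print(f"round-trip accuracy: {round_trips / len(sample_texts):.0%}")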

Usage

from transformers import LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("janakhpon/mon_tokenizer")

text = "မန်တဢဂှ် ကၠောန်ဗဒှ်လဝ်ရ။"
tokens = tokenizer.encode(text)
decoded = tokenizer.decode(tokens, skip_special_tokens=True)
print(decoded)  # မန်တဢဂှ် ကၠောန်ဗဒှ်လဝ်ရ။

Or via the standalone Python package:

pip install mon-tokenizer

from mon_tokenizer import MonTokenizer

tok = MonTokenizer()
result = tok.encode("မန်တဢဂှ် ကၠောန်ဗဒှ်လဝ်ရ။")
print(result["pieces"])

Special tokens

Token   ID
<unk>   0
<s>     1
</s>    2
<mask>  3
<sep>   4
<cls>   5
<pad>   32000
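
Note that <pad> sits at ID 32000, one slot past the 32,000-entry base vocabulary. A quick way to verify these IDs, using the repo id from Usage above:

from transformers import LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("janakhpon/mon_tokenizer")

# Check each special token's ID against the table above.
for token in ["<unk>", "<s>", "</s>", "<mask>", "<sep>", "<cls>", "<pad>"]:
    print(token, tokenizer.convert_tokens_to_ids(token))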

Design

  • Algorithm: SentencePiece Unigram, which suits Mon's agglutinative morphology better than BPE
  • Grapheme atomicity: Myanmar syllable components are injected as user_defined_symbols so the tokenizer never splits a syllable mid-stack
  • Normalization: NFC is applied before training; SentencePiece itself uses identity normalization to avoid double processing
  • Corpus quality: Mon-ratio filter (≥30% Myanmar characters) plus exact-duplicate removal; see the sketch after this list
  • Integration: used as the base for vocabulary expansion in mon-lm
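
A rough sketch of the filtering and training recipe described above. The file names, the user_defined_symbols list, and character_coverage are illustrative assumptions; only the Unigram model type, the 32k vocabulary, NFC pre-normalization with identity normalization in SentencePiece, and the ≥30% Mon-ratio filter come from this card.

import unicodedata

import sentencepiece as spm

def myanmar_ratio(text: str) -> float:
    """Fraction of characters in the Myanmar block (U+1000-U+109F)."""
    return sum("\u1000" <= ch <= "\u109f" for ch in text) / max(len(text), 1)

# NFC-normalize, keep lines that are >=30% Myanmar characters, drop exact duplicates.
seen = set()
with open("raw_corpus.txt", encoding="utf-8") as src, \
        open("train_corpus.txt", "w", encoding="utf-8") as dst:
    for line in src:
        line = unicodedata.normalize("NFC", line.strip())
        if line and myanmar_ratio(line) >= 0.30 and line not in seen:
            seen.add(line)
            dst.write(line + "\n")

# Unigram training; identity normalization because NFC was already applied above.
spm.SentencePieceTrainer.train(
    input="train_corpus.txt",
    model_prefix="mon_unigram",
    model_type="unigram",
    vocab_size=32000,
    normalization_rule_name="identity",
    byte_fallback=True,       # assumed, consistent with the byte-fallback metric above
    character_coverage=1.0,   # assumed: retain every Myanmar character
    user_defined_symbols=["\u1031", "\u103c", "\u103e"],  # illustrative; full list unpublished
)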

License

MIT
