Mon Language Tokenizer

A SentencePiece Unigram tokenizer for the Mon language (mnw).

Performance

Trained on 41.4M Mon-related characters (from a raw corpus of 92.8M characters / 176.7M bytes) across 8,841 documents.

Metric               Result
Vocabulary size      32,000
Avg. compression     5.22 chars/token
Round-trip accuracy  100%
Byte-fallback rate   0.00%
Model size           977 KB

The previous version (v0.1: 4k vocab, trained on 2.4M chars) achieved ~1.99 chars/token, making this release about 2.6× more efficient.
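
The compression and round-trip numbers can be reproduced along the following lines. This is a minimal sketch: the single sample sentence (taken from Usage below) stands in for a held-out Mon evaluation set, and add_special_tokens=False keeps BOS/EOS out of the token count.

from transformers import LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("janakhpon/mon_tokenizer")

# Placeholder evaluation set; a real measurement would use held-out Mon text.
sample_texts = ["မန်တဢဂှ် ကၠောန်ဗဒှ်လဝ်ရ။"]

chars = tokens = round_trips = 0
for text in sample_texts:
    ids = tokenizer.encode(text, add_special_tokens=False)
    chars += len(text)
    tokens += len(ids)
    round_trips += int(tokenizer.decode(ids) == text)

print(f"avg compression: {chars / tokens:.2f} chars/token")
print(f"round-trip accuracy: {round_trips / len(sample_texts):.0%}")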

Usage

from transformers import LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("janakhpon/mon_tokenizer")

text = "မန်တဢဂှ် ကၠောန်ဗဒှ်လဝ်ရ။"
tokens = tokenizer.encode(text)
decoded = tokenizer.decode(tokens, skip_special_tokens=True)
print(decoded)  # မန်တဢဂှ် ကၠောန်ဗဒှ်လဝ်ရ။

Or via the standalone Python package:

pip install mon-tokenizer

from mon_tokenizer import MonTokenizer

tok = MonTokenizer()
result = tok.encode("မန်တဢဂှ် ကၠောန်ဗဒှ်လဝ်ရ။")
print(result["pieces"])

Special tokens

Token   ID
<unk>   0
<s>     1
</s>    2
<mask>  3
<sep>   4
<cls>   5
<pad>   32000
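
Note that <pad> sits at ID 32000, one slot past the 32,000-entry base vocabulary. A quick way to verify these IDs, using the repo id from Usage above:

from transformers import LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("janakhpon/mon_tokenizer")

# Check each special token's ID against the table above.
for token in ["<unk>", "<s>", "</s>", "<mask>", "<sep>", "<cls>", "<pad>"]:
    print(token, tokenizer.convert_tokens_to_ids(token))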

Design

  • Algorithm: SentencePiece Unigram, which suits Mon's agglutinative morphology better than BPE
  • Grapheme atomicity: Myanmar syllable components are injected as user_defined_symbols so the tokenizer never splits a syllable mid-stack
  • Normalization: NFC is applied before training; SentencePiece itself uses identity normalization to avoid double processing
  • Corpus quality: Mon-ratio filter (≥30% Myanmar characters) plus exact-duplicate removal; see the sketch after this list
  • Integration: used as the base for vocabulary expansion in mon-lm
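
A rough sketch of the filtering and training recipe described above. The file names, the user_defined_symbols list, and character_coverage are illustrative assumptions; only the Unigram model type, the 32k vocabulary, NFC pre-normalization with identity normalization in SentencePiece, and the ≥30% Mon-ratio filter come from this card.

import unicodedata

import sentencepiece as spm

def myanmar_ratio(text: str) -> float:
    """Fraction of characters in the Myanmar block (U+1000-U+109F)."""
    return sum("\u1000" <= ch <= "\u109f" for ch in text) / max(len(text), 1)

# NFC-normalize, keep lines that are >=30% Myanmar characters, drop exact duplicates.
seen = set()
with open("raw_corpus.txt", encoding="utf-8") as src, \
        open("train_corpus.txt", "w", encoding="utf-8") as dst:
    for line in src:
        line = unicodedata.normalize("NFC", line.strip())
        if line and myanmar_ratio(line) >= 0.30 and line not in seen:
            seen.add(line)
            dst.write(line + "\n")

# Unigram training; identity normalization because NFC was already applied above.
spm.SentencePieceTrainer.train(
    input="train_corpus.txt",
    model_prefix="mon_unigram",
    model_type="unigram",
    vocab_size=32000,
    normalization_rule_name="identity",
    byte_fallback=True,       # assumed, consistent with the byte-fallback metric above
    character_coverage=1.0,   # assumed: retain every Myanmar character
    user_defined_symbols=["\u1031", "\u103c", "\u103e"],  # illustrative; full list unpublished
)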

License

MIT
