# Mon Language Tokenizer
A SentencePiece Unigram tokenizer for the Mon language (mnw).
## Performance
Trained on 41.4M Mon-related characters (within a 92.8M total character / 176.7M byte raw corpus) across 8,841 documents.
| Metric | Result |
|---|---|
| Vocabulary size | 32,000 |
| Avg compression | 5.22 chars/token |
| Round-trip accuracy | 100% |
| Byte-fallback rate | 0.00% |
| Model size | 977 KB |
The previous version (v0.1: 4k vocab, trained on 2.4M chars) averaged ~1.99 chars/token, so this release is roughly 2.6× more efficient.
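The compression and round-trip figures can be spot-checked with a short script. The following is a minimal sketch; the `samples` list is a stand-in for a held-out Mon evaluation set, not the data used for the table above:

```python
from transformers import LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("janakhpon/mon_tokenizer")

# Stand-in for a held-out list of Mon sentences.
samples = ["ဘာသာမန်"]

total_chars = total_tokens = round_trip_ok = 0
for text in samples:
    ids = tokenizer.encode(text, add_special_tokens=False)
    total_chars += len(text)
    total_tokens += len(ids)
    # Round trip: decoding the ids should reproduce the input exactly.
    round_trip_ok += int(tokenizer.decode(ids, skip_special_tokens=True) == text)

print(f"chars/token: {total_chars / total_tokens:.2f}")
print(f"round-trip:  {round_trip_ok / len(samples):.0%}")
```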
## Usage

```python
from transformers import LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("janakhpon/mon_tokenizer")

text = "ဘာသာမန်"  # sample Mon-script text
tokens = tokenizer.encode(text)
decoded = tokenizer.decode(tokens, skip_special_tokens=True)
print(decoded)  # ဘာသာမန်
```
Or via the standalone Python package:
```bash
pip install mon-tokenizer
```
```python
from mon_tokenizer import MonTokenizer

tok = MonTokenizer()
result = tok.encode("ဘာသာမန်")  # sample Mon-script text
print(result["pieces"])
```
## Special tokens
| Token | ID |
|---|---|
| `<unk>` | 0 |
| `<s>` | 1 |
| `</s>` | 2 |
| `<mask>` | 3 |
| `<sep>` | 4 |
| `<cls>` | 5 |
| `<pad>` | 32000 |
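With the Hugging Face tokenizer, these IDs should be reachable through the standard special-token attributes; the check below is a sketch assuming the tokenizer config registers them as shown in the table:

```python
from transformers import LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("janakhpon/mon_tokenizer")

# Standard attributes map to the IDs in the table above.
print(tokenizer.unk_token_id)  # expected: 0
print(tokenizer.bos_token_id)  # expected: 1
print(tokenizer.eos_token_id)  # expected: 2

# Tokens without a dedicated attribute can be looked up directly.
print(tokenizer.convert_tokens_to_ids("<mask>"))  # expected: 3
```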
## Design
- Algorithm: SentencePiece Unigram, which suits Mon's agglutinative morphology better than BPE
- Grapheme atomicity: Myanmar syllable components are injected as `user_defined_symbols` so the tokenizer never splits a syllable mid-stack (see the training sketch below)
- Normalization: NFC applied before training; SentencePiece uses identity normalization to avoid re-processing
- Corpus quality: Mon-ratio filter (≥30% Myanmar chars), exact-duplicate removal
- Integration: used as the base for vocabulary expansion in mon-lm
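A minimal sketch of how such a pipeline could look with the `sentencepiece` trainer. The file names, the three-symbol `user_defined_symbols` list, and the exact flag values are illustrative assumptions, not the release's actual training script:

```python
import unicodedata
import sentencepiece as spm

def mon_ratio(text: str) -> float:
    """Fraction of characters in the main Myanmar Unicode block (U+1000-U+109F)."""
    if not text:
        return 0.0
    return sum(0x1000 <= ord(c) <= 0x109F for c in text) / len(text)

# Corpus-quality pass: NFC-normalize, keep lines that are at least 30%
# Myanmar script, and drop exact duplicates while preserving order.
seen = set()
with open("corpus_raw.txt", encoding="utf-8") as src, \
     open("corpus_filtered.txt", "w", encoding="utf-8") as dst:
    for line in src:
        line = unicodedata.normalize("NFC", line)
        if mon_ratio(line.strip()) >= 0.30 and line not in seen:
            seen.add(line)
            dst.write(line)

# Unigram training with syllable components kept atomic and identity
# normalization (NFC was already applied above).
spm.SentencePieceTrainer.train(
    input="corpus_filtered.txt",
    model_prefix="mon_tokenizer",
    model_type="unigram",
    vocab_size=32000,
    normalization_rule_name="identity",
    user_defined_symbols=["\u103A", "\u1031", "\u102C"],  # illustrative subset
    byte_fallback=True,
)
```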
## License
MIT