---
language:
- mnw
library_name: transformers
license: mit
tags:
- tokenizer
- mon
- mnw
- myanmar
- sentencepiece
- llama
pipeline_tag: text-generation
widget:
- text: "ဘာသာမန် ပရူပရာတံဂှ်"
  example_title: "Mon Language Example"
---

# Mon Language Tokenizer

A SentencePiece tokenizer for the Mon language (mnw) with a 4,000-token vocabulary,
compatible with Hugging Face Transformers through the Llama tokenizer architecture.

## Model Details

- **Language**: Mon (mnw)
- **Vocabulary Size**: 4,000 tokens
- **Algorithm**: SentencePiece (Unigram Language Model)
- **Tokenizer Type**: LlamaTokenizerFast
- **Special Tokens**: `<s>`, `</s>`, `<unk>`, `<pad>`
- **Context Length**: 4,096 tokens
- **Updated**: August 31, 2025
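
The special tokens and context length above correspond to a handful of fields in the tokenizer's `tokenizer_config.json`. The fragment below is an illustrative sketch of what such a configuration typically looks like, not a copy of the repository's actual file:

```json
{
  "tokenizer_class": "LlamaTokenizerFast",
  "bos_token": "<s>",
  "eos_token": "</s>",
  "unk_token": "<unk>",
  "pad_token": "<pad>",
  "model_max_length": 4096,
  "legacy": true
}
```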

## Usage

```python
from transformers import AutoTokenizer

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("janakhpon/mon_tokenizer")

# Tokenize Mon text
text = "ဘာသာမန် ပရူပရာတံဂှ် ကၠောန်ဗဒှ်လဝ်ရ။"
tokens = tokenizer(text, return_tensors="pt")

# Decode tokens back to text
decoded = tokenizer.decode(tokens["input_ids"][0], skip_special_tokens=True)
print(decoded)  # ဘာသာမန် ပရူပရာတံဂှ် ကၠောန်ဗဒှ်လဝ်ရ။
```

## Technical Specifications

- **Tokenizer Class**: `LlamaTokenizerFast`
- **Vocabulary Type**: Subword tokenization using SentencePiece
- **Training Algorithm**: Unigram Language Model
- **OOV Handling**: `<unk>` token for characters and pieces not in the vocabulary
- **Legacy Mode**: Enabled for maximum compatibility
- **Fast Tokenizer**: Includes tokenizer.json for optimal performance

## Training Data

The tokenizer was trained on a comprehensive Mon language corpus including:

- Wikipedia articles in Mon language
- News articles and publications
- Literary works and traditional texts
- Modern digital content

Total training data: Not specified

## Performance

- **Coverage**: High coverage of Mon language vocabulary
- **Efficiency**: Optimized for Mon language morphology
- **Compatibility**: Full compatibility with Transformers 4.x
- **Speed**: Fast tokenizer for improved performance

## License

This tokenizer is released under the MIT License.

## Citation

If you use this tokenizer in your research, please cite:

```bibtex
@misc{mon_tokenizer_2025,
  title={Mon Language Tokenizer for Hugging Face Transformers},
  author={Mon Language Project},
  year={2025},
  url={https://huggingface.co/janakhpon/mon_tokenizer}
}
```

## Contact

For questions or issues, please open an issue on the repository or contact the maintainers.