feat: simplify Mon tokenizer in HF format, update tags, resolve the legacy issue
- README.md +7 -4
- convert_to_hf.py +7 -4

README.md CHANGED

@@ -26,9 +26,10 @@ compatible with Hugging Face Transformers and the Llama tokenizer architecture.
 - **Language**: Mon (mnw)
 - **Vocabulary Size**: 4,000 tokens
 - **Algorithm**: SentencePiece (Unigram Language Model)
-- **Tokenizer Type**:
+- **Tokenizer Type**: LlamaTokenizerFast
 - **Special Tokens**: `<s>`, `</s>`, `<unk>`, `<pad>`
 - **Context Length**: 4,096 tokens
+- **Updated**: August 31, 2025
 
 ## Usage
 
@@ -49,11 +50,12 @@ print(decoded)  # ဘာသာမန် ပရူပရာတံဂှ် ကၠ
 
 ## Technical Specifications
 
-- **Tokenizer Class**: `
+- **Tokenizer Class**: `LlamaTokenizerFast`
 - **Vocabulary Type**: Subword tokenization using SentencePiece
 - **Training Algorithm**: Unigram Language Model
 - **OOV Handling**: `<unk>` token for unknown words
 - **Legacy Mode**: Enabled for maximum compatibility
+- **Fast Tokenizer**: Includes tokenizer.json for optimal performance
 
 ## Training Data
 
@@ -71,6 +73,7 @@ Total training data: Not specified
 - **Coverage**: High coverage of Mon language vocabulary
 - **Efficiency**: Optimized for Mon language morphology
 - **Compatibility**: Full compatibility with Transformers 4.x
+- **Speed**: Fast tokenizer for improved performance
 
 ## License
 
@@ -81,10 +84,10 @@ This tokenizer is released under the MIT License.
 If you use this tokenizer in your research, please cite:
 
 ```bibtex
-@misc{
+@misc{mon_tokenizer_2025,
   title={Mon Language Tokenizer for Hugging Face Transformers},
   author={Mon Language Project},
-  year={
+  year={2025},
   url={https://huggingface.co/janakhpon/mon_tokenizer}
 }
 ```
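The README spec list above (`LlamaTokenizerFast`, the four special tokens, a 4,096-token context, legacy mode enabled) maps onto a Hugging Face `tokenizer_config.json`. A minimal sketch of such a config, assuming conventional field names; the exact file the conversion script emits may differ, so treat the values as illustrative:

```python
import json

# Illustrative tokenizer_config.json matching the specs above.
# Field names follow common Hugging Face conventions; values are
# taken from the README bullets, not read from the real repo.
config = {
    "tokenizer_class": "LlamaTokenizerFast",
    "bos_token": "<s>",
    "eos_token": "</s>",
    "unk_token": "<unk>",
    "pad_token": "<pad>",
    "model_max_length": 4096,
    "legacy": True,  # "Legacy Mode: Enabled for maximum compatibility"
}

print(json.dumps(config, indent=2))
```

Keeping `legacy: true` alongside a shipped `tokenizer.json` is what lets both the slow SentencePiece path and the fast Rust tokenizer load the same repo.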
convert_to_hf.py CHANGED

@@ -262,9 +262,10 @@ compatible with Hugging Face Transformers and the Llama tokenizer architecture.
 - **Language**: Mon (mnw)
 - **Vocabulary Size**: {analysis["vocab_size"]:,} tokens
 - **Algorithm**: SentencePiece (Unigram Language Model)
-- **Tokenizer Type**:
+- **Tokenizer Type**: LlamaTokenizerFast
 - **Special Tokens**: `{analysis["bos_token"]}`, `{analysis["eos_token"]}`, `{analysis["unk_token"]}`, `{analysis["pad_token"]}`
 - **Context Length**: 4,096 tokens
+- **Updated**: August 31, 2025
 
 ## Usage
 
@@ -285,11 +286,12 @@ print(decoded)  # ဘာသာမန် ပရူပရာတံဂှ် ကၠ
 
 ## Technical Specifications
 
-- **Tokenizer Class**: `
+- **Tokenizer Class**: `LlamaTokenizerFast`
 - **Vocabulary Type**: Subword tokenization using SentencePiece
 - **Training Algorithm**: Unigram Language Model
 - **OOV Handling**: `{analysis["unk_token"]}` token for unknown words
 - **Legacy Mode**: Enabled for maximum compatibility
+- **Fast Tokenizer**: Includes tokenizer.json for optimal performance
 
 ## Training Data
 
@@ -307,6 +309,7 @@ Total training data: {training_data_info.get('total_size', 'Not specified')}
 - **Coverage**: High coverage of Mon language vocabulary
 - **Efficiency**: Optimized for Mon language morphology
 - **Compatibility**: Full compatibility with Transformers 4.x
+- **Speed**: Fast tokenizer for improved performance
 
 ## License
 
@@ -317,10 +320,10 @@ This tokenizer is released under the MIT License.
 If you use this tokenizer in your research, please cite:
 
 ```bibtex
-@misc{{
+@misc{{mon_tokenizer_2025,
   title={{Mon Language Tokenizer for Hugging Face Transformers}},
   author={{Mon Language Project}},
-  year={{
+  year={{2025}},
   url={{https://huggingface.co/janakhpon/mon_tokenizer}}
 }}
 ```
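Both READMEs name "Unigram Language Model" as the training algorithm: at tokenization time it picks, for each input, the segmentation whose pieces maximize the product of independent piece probabilities. A toy sketch of that selection via Viterbi over string positions (the vocabulary and probabilities here are invented for illustration, not drawn from the real 4,000-token Mon model):

```python
import math

# Toy unigram segmenter: among all ways to split `text` into vocabulary
# pieces, pick the one with the highest total log-probability.
# Vocabulary and probabilities are made up for this example.
vocab = {"un": 0.1, "i": 0.05, "gram": 0.1, "unigram": 0.2, "g": 0.01, "ram": 0.05}

def segment(text: str) -> list[str]:
    n = len(text)
    # best[pos] = (best score reaching pos, start index of the last piece)
    best = [(-math.inf, -1)] * (n + 1)
    best[0] = (0.0, -1)
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in vocab and best[start][0] > -math.inf:
                score = best[start][0] + math.log(vocab[piece])
                if score > best[end][0]:
                    best[end] = (score, start)
    # Walk back through the chosen split points.
    pieces, pos = [], n
    while pos > 0:
        start = best[pos][1]
        pieces.append(text[start:pos])
        pos = start
    return pieces[::-1]

print(segment("unigram"))  # the single piece "unigram" outscores "un"+"i"+"gram"
```

The real SentencePiece model does the same maximization over its learned 4,000-piece vocabulary, falling back to `<unk>` when no covering segmentation exists.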