janakhpon committed
Commit 9d74fcf · 1 Parent(s): 1f1b899

feat: simplified mon tokenizer in hf format, updated tags, resolved the legacy issue

Files changed (2):

1. README.md (+7 -4)
2. convert_to_hf.py (+7 -4)
README.md CHANGED

````diff
@@ -26,9 +26,10 @@ compatible with Hugging Face Transformers and the Llama tokenizer architecture.
 - **Language**: Mon (mnw)
 - **Vocabulary Size**: 4,000 tokens
 - **Algorithm**: SentencePiece (Unigram Language Model)
-- **Tokenizer Type**: LlamaTokenizer
+- **Tokenizer Type**: LlamaTokenizerFast
 - **Special Tokens**: `<s>`, `</s>`, `<unk>`, `<pad>`
 - **Context Length**: 4,096 tokens
+- **Updated**: August 31, 2025
 
 ## Usage
 
@@ -49,11 +50,12 @@ print(decoded) # ဘာသာမန် ပရူပရာတံဂှ် ကၠ
 
 ## Technical Specifications
 
-- **Tokenizer Class**: `LlamaTokenizer`
+- **Tokenizer Class**: `LlamaTokenizerFast`
 - **Vocabulary Type**: Subword tokenization using SentencePiece
 - **Training Algorithm**: Unigram Language Model
 - **OOV Handling**: `<unk>` token for unknown words
 - **Legacy Mode**: Enabled for maximum compatibility
+- **Fast Tokenizer**: Includes tokenizer.json for optimal performance
 
 ## Training Data
 
@@ -71,6 +73,7 @@ Total training data: Not specified
 - **Coverage**: High coverage of Mon language vocabulary
 - **Efficiency**: Optimized for Mon language morphology
 - **Compatibility**: Full compatibility with Transformers 4.x
+- **Speed**: Fast tokenizer for improved performance
 
 ## License
 
@@ -81,10 +84,10 @@ This tokenizer is released under the MIT License.
 If you use this tokenizer in your research, please cite:
 
 ```bibtex
-@misc{mon_tokenizer_2024,
+@misc{mon_tokenizer_2025,
   title={Mon Language Tokenizer for Hugging Face Transformers},
   author={Mon Language Project},
-  year={2024},
+  year={2025},
   url={https://huggingface.co/janakhpon/mon_tokenizer}
 }
 ```
````
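For context, a minimal sketch of what the updated README's usage pattern amounts to. It assumes the hub id `janakhpon/mon_tokenizer` from the citation URL; with `tokenizer.json` now in the repo, `AutoTokenizer` should resolve to `LlamaTokenizerFast`:

```python
# Minimal usage sketch (hub id assumed from the citation URL above).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("janakhpon/mon_tokenizer")
print(type(tokenizer).__name__)  # expected: LlamaTokenizerFast

text = "ဘာသာမန်"  # Mon-script sample; any Mon text works
ids = tokenizer.encode(text)  # subword ids from the 4,000-token vocab
print(tokenizer.decode(ids, skip_special_tokens=True))  # round-trips to the input
```

The practical benefit of the switch is load time and throughput: the Rust-backed fast tokenizer is loaded directly from tokenizer.json instead of re-converting the SentencePiece model on every `from_pretrained` call.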
convert_to_hf.py CHANGED

````diff
@@ -262,9 +262,10 @@ compatible with Hugging Face Transformers and the Llama tokenizer architecture.
 - **Language**: Mon (mnw)
 - **Vocabulary Size**: {analysis["vocab_size"]:,} tokens
 - **Algorithm**: SentencePiece (Unigram Language Model)
-- **Tokenizer Type**: LlamaTokenizer
+- **Tokenizer Type**: LlamaTokenizerFast
 - **Special Tokens**: `{analysis["bos_token"]}`, `{analysis["eos_token"]}`, `{analysis["unk_token"]}`, `{analysis["pad_token"]}`
 - **Context Length**: 4,096 tokens
+- **Updated**: August 31, 2025
 
 ## Usage
 
@@ -285,11 +286,12 @@ print(decoded) # ဘာသာမန် ပရူပရာတံဂှ် ကၠ
 
 ## Technical Specifications
 
-- **Tokenizer Class**: `LlamaTokenizer`
+- **Tokenizer Class**: `LlamaTokenizerFast`
 - **Vocabulary Type**: Subword tokenization using SentencePiece
 - **Training Algorithm**: Unigram Language Model
 - **OOV Handling**: `{analysis["unk_token"]}` token for unknown words
 - **Legacy Mode**: Enabled for maximum compatibility
+- **Fast Tokenizer**: Includes tokenizer.json for optimal performance
 
 ## Training Data
 
@@ -307,6 +309,7 @@ Total training data: {training_data_info.get('total_size', 'Not specified')}
 - **Coverage**: High coverage of Mon language vocabulary
 - **Efficiency**: Optimized for Mon language morphology
 - **Compatibility**: Full compatibility with Transformers 4.x
+- **Speed**: Fast tokenizer for improved performance
 
 ## License
 
@@ -317,10 +320,10 @@ This tokenizer is released under the MIT License.
 If you use this tokenizer in your research, please cite:
 
 ```bibtex
-@misc{{mon_tokenizer_2024,
+@misc{{mon_tokenizer_2025,
   title={{Mon Language Tokenizer for Hugging Face Transformers}},
   author={{Mon Language Project}},
-  year={{2024}},
+  year={{2025}},
   url={{https://huggingface.co/janakhpon/mon_tokenizer}}
 }}
 ```
````
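Only the README-template portion of convert_to_hf.py is visible in this diff, but the gist of the change is that conversion now targets `LlamaTokenizerFast`, so `save_pretrained` emits a tokenizer.json. A hedged sketch of that step, assuming a trained SentencePiece file at the hypothetical path `mon.model`; the class name, special tokens, legacy mode, and context length all come from the generated README above:

```python
# Sketch only: "mon.model" and the output directory are hypothetical;
# the tokenizer class, special tokens, legacy flag, and context length
# mirror the generated README in the diff above.
from transformers import LlamaTokenizerFast

tok = LlamaTokenizerFast(
    vocab_file="mon.model",  # trained SentencePiece (Unigram) model
    legacy=True,             # "Legacy Mode: Enabled for maximum compatibility"
    bos_token="<s>",
    eos_token="</s>",
    unk_token="<unk>",
    pad_token="<pad>",
)
tok.model_max_length = 4096  # "Context Length: 4,096 tokens"

# Saving writes tokenizer.json (plus tokenizer_config.json and
# special_tokens_map.json), which is what lets AutoTokenizer load the
# repo as a fast tokenizer without re-converting the .model file.
tok.save_pretrained("mon_tokenizer_hf")
```

Passing `vocab_file` to the fast class triggers the one-time slow-to-fast conversion (this requires the sentencepiece and protobuf packages installed); afterwards the saved tokenizer.json is loaded directly.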