tspersian committed on
Commit d52ee03 · 1 Parent(s): 406daa9
Files changed (3)
  1. README.md +39 -0
  2. special_tokens_map.json +6 -0
  3. tokenizer_config.json +9 -0
README.md ADDED
@@ -0,0 +1,39 @@
# Mana Tokenizer

The Mana Tokenizer is a custom SentencePiece tokenizer for Persian, trained on a combination of the Persian Wikipedia and Ganjoor datasets. It uses the Unigram model type and is optimized for the characteristics of Persian text.

## Special Tokens

- **UNK Token:** `<unk>`
- **BOS Token:** `<s>`
- **EOS Token:** `</s>`
- **PAD Token:** `<pad>`

## Usage

You can load this tokenizer using the Hugging Face `transformers` library as follows:

```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("your-username/mana_tokenizer")

text = "این یک تست است."
encoded = tokenizer(text)
print(f"Encoded: {encoded}")

decoded = tokenizer.decode(encoded['input_ids'])
print(f"Decoded: {decoded}")
```
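
Because a `<pad>` token is defined (see Special Tokens above) and the tokenizer config sets `max_length` to 512, batch encoding with padding and truncation should work roughly as in the minimal sketch below. It assumes the tokenizer was loaded as shown above and that the pad token is registered on the loaded tokenizer; if it is not, set `tokenizer.pad_token = "<pad>"` first.

```python
# Sketch: batch-encode two Persian sentences with padding and truncation.
# Assumes `tokenizer` was loaded with PreTrainedTokenizerFast.from_pretrained(...)
# as in the example above; the pad token may need to be set explicitly.
sentences = ["این یک تست است.", "سلام"]

batch = tokenizer(
    sentences,
    padding=True,      # pad the shorter sentence with <pad>
    truncation=True,   # drop anything beyond max_length
    max_length=512,    # matches tokenizer_config.json below
)

for ids in batch["input_ids"]:
    print(ids)
    print(tokenizer.decode(ids, skip_special_tokens=True))
```
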
## Statistics

- **Vocabulary Size:** 199,997
- **Character Coverage:** 99.9%
- **Total Number of Text Samples:** 1,022,675

## License

This tokenizer is licensed under the MIT License.
special_tokens_map.json ADDED
@@ -0,0 +1,6 @@
{
  "unk_token": "<unk>",
  "bos_token": "<s>",
  "eos_token": "</s>",
  "pad_token": "<pad>"
}
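
Assuming the tokenizer is loaded as in the README's usage example (the repository id below is the same placeholder), these entries should surface as attributes on the tokenizer object, roughly as sketched here:

```python
# Sketch: the special tokens declared in special_tokens_map.json are exposed
# as attributes once the tokenizer is loaded with from_pretrained(...).
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("your-username/mana_tokenizer")

print(tokenizer.unk_token)  # <unk>
print(tokenizer.bos_token)  # <s>
print(tokenizer.eos_token)  # </s>
print(tokenizer.pad_token)  # <pad>
```
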
tokenizer_config.json ADDED
@@ -0,0 +1,9 @@
{
  "model_type": "unigram",
  "bos_token_id": 1,
  "eos_token_id": 2,
  "unk_token_id": 0,
  "pad_token_id": 3,
  "do_lower_case": false,
  "max_length": 512
}
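
As a rough sketch of how these settings come into play at encoding time: `max_length` caps the sequence length when truncation is requested, and the `*_token_id` fields declare where the special tokens are expected to sit in the vocabulary. Whether `convert_tokens_to_ids` actually returns those values depends on the trained vocabulary, so the expected ids in the comments below are an assumption taken from this config.

```python
# Sketch: truncating to the configured max_length and checking special-token ids.
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("your-username/mana_tokenizer")

# Long input is cut down to at most 512 tokens, per "max_length": 512.
long_text = "این یک تست است. " * 500
encoded = tokenizer(long_text, truncation=True, max_length=512)
print(len(encoded["input_ids"]))  # <= 512

# Ids declared in this config: <unk>=0, <s>=1, </s>=2, <pad>=3 (expected, not verified here).
print(tokenizer.convert_tokens_to_ids(["<unk>", "<s>", "</s>", "<pad>"]))
```
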