---
license: mit
language:
- fa
- en
- ar
---

# Mana Tokenizer

The Mana Tokenizer is a custom-trained BPE tokenizer designed for Persian text. It is trained on a large Persian corpus and built with high character coverage to handle diverse Persian text.

## Quick Start
You can encode and decode text with the Mana Tokenizer like this:
```python
from mana_tokenizer import ManaTokenizer
tokenizer = ManaTokenizer()
text = "سلام من یک متن تست برای تست این تست هستم."
print(tokenizer.encode(text))
print(tokenizer.decode(tokenizer.encode(text)))
```
For comparison, this is the raw UTF-8 byte encoding of the same text (one integer per byte):
```
[216, 179, 217, 132, 216, 167, 217, 133, 32, 217, 133, 217, 134, 32, 219, 140, 218, 169, 32, 217, 133, 216, 170, 217, 134, 32, 216, 170, 216, 179, 216, 170, 32, 216, 168, 216, 177, 216, 167, 219, 140, 32, 216, 170, 216, 179, 216, 170, 32, 216, 167, 219, 140, 217, 134, 32, 216, 170, 216, 179, 216, 170, 32, 217, 135, 216, 179, 216, 170, 217, 133, 46]
سلام من یک متن تست برای تست این تست هستم.
```
and here is what the Mana Tokenizer generates:
```
[30318, 377, 363, 4340, 5828, 513, 5828, 378, 5828, 14471, 46]
سلام من یک متن تست برای تست این تست هستم.
```
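The byte list above can be reproduced without the tokenizer, since the "raw" encoding is simply the UTF-8 byte sequence of the string:

```python
# Reproduce the raw UTF-8 byte encoding shown above (no tokenizer needed).
text = "سلام من یک متن تست برای تست این تست هستم."
raw_bytes = list(text.encode("utf-8"))

print(raw_bytes[:6])   # [216, 179, 217, 132, 216, 167] — the bytes of "سلا"
print(len(raw_bytes))  # 72 bytes vs. 11 Mana tokens for the same text
```

This makes the compression visible: 72 raw bytes reduce to 11 Mana tokens.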

You can also add special tokens:
```python
tokenizer.register_special_tokens({"</s>": 100269})
```

You can also encode multiple texts at once:
```python
tokenizer.batch_encode(["یک متن طولانی"])
```

## Benchmark

- **Benchmark DateTime:** 2024-11-06 16:12:50
- **Mana Batch Encode Time:** 0.107 seconds
- **Mana Batch Encode Memory Usage:** 13.2 KB
- **Total characters in benchmark:** 131,000
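A rough throughput figure can be derived from the numbers above (results will vary by machine):

```python
# Back-of-the-envelope throughput from the reported benchmark figures.
chars = 131_000        # total characters in the benchmark
seconds = 0.107        # reported batch-encode time
throughput = chars / seconds

print(f"{throughput:,.0f} chars/sec")  # on the order of 1.2 million chars/sec
```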

## Special Tokens

- **user Token:** `<|user|>`
- **assistant Token:** `<|assistant|>`
- **end Token:** `<|end|>`
- **system Token:** `<|system|>`
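These tokens follow a common chat-markup convention. As a minimal sketch, a prompt could be assembled with plain string formatting; the exact template below is an assumption, since the README does not specify one:

```python
# Assemble a chat-style prompt from the special tokens listed above.
# NOTE: the overall template is an assumption for illustration only.
def build_prompt(system: str, user: str) -> str:
    return (
        f"<|system|>{system}<|end|>"
        f"<|user|>{user}<|end|>"
        f"<|assistant|>"
    )

prompt = build_prompt("You are a helpful assistant.", "سلام!")
print(prompt)
```

For the model to treat these as single tokens, they would need to be registered via `register_special_tokens` as shown in the Quick Start.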

## Statistics

- **Model Type:** BPE
- **Vocabulary Size:** 265,703
- **Character Coverage:** 99.9%
- **Total Number of Text Samples:** 1,147,036
- **Total Number of Tokens:** 1,490,338
- **Average Token Length:** 4.51
- **Corpus Size (in bytes):** 1,792,210,410

## Training Details

- **Training Data:** Mana Persian corpus
- **Training Script:** Mana Trainer
- **Script Version:** 1.2

## License

The Mana Tokenizer is licensed under the MIT License.