---
license: mit
language:
- fa
- en
- ar
---
# Mana Tokenizer
The Mana Tokenizer is a custom-trained byte-pair encoding (BPE) tokenizer designed for Persian text. It was trained on a large combined Persian corpus and built with high character coverage to handle diverse Persian text.
## Quick Start
You can encode/decode your data using Mana Tokenizer like this:
```python
from mana_tokenizer import ManaTokenizer
tokenizer = ManaTokenizer()
text = "سلام من یک متن تست برای تست این تست هستم."
print(tokenizer.encode(text))
print(tokenizer.decode(tokenizer.encode(text)))
```
For comparison, this is the raw UTF-8 byte encoding of the same text:
```
[216, 179, 217, 132, 216, 167, 217, 133, 32, 217, 133, 217, 134, 32, 219, 140, 218, 169, 32, 217, 133, 216, 170, 217, 134, 32, 216, 170, 216, 179, 216, 170, 32, 216, 168, 216, 177, 216, 167, 219, 140, 32, 216, 170, 216, 179, 216, 170, 32, 216, 167, 219, 140, 217, 134, 32, 216, 170, 216, 179, 216, 170, 32, 217, 135, 216, 179, 216, 170, 217, 133, 46]
سلام من یک متن تست برای تست این تست هستم.
```
and here is what the Mana tokenizer generates:
```
[30318, 377, 363, 4340, 5828, 513, 5828, 378, 5828, 14471, 46]
سلام من یک متن تست برای تست این تست هستم.
```
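The first listing is simply the UTF-8 byte sequence of the sentence, i.e. what a byte-level tokenizer with no learned merges would emit. A quick sketch in plain Python shows this, and makes the compression visible: the BPE merges shrink the same sentence from 72 byte-level ids to the 11 Mana tokens shown above.

```python
# The "raw" encoding above is just the UTF-8 bytes of the text.
text = "سلام من یک متن تست برای تست این تست هستم."
byte_ids = list(text.encode("utf-8"))

print(len(byte_ids))   # 72 byte-level ids
print(byte_ids[:8])    # [216, 179, 217, 132, 216, 167, 217, 133] -> "سلام"
```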
You can also add special tokens:
```python
tokenizer.register_special_tokens({"</s>": 100269})
```
Encode a batch of texts:
```python
tokenizer.batch_encode(["یک متن طولانی"])
```
## Benchmark
- **Benchmark DateTime:** 2024-11-06 16:12:50
- **Mana Batch Encode Time:** 0.107 seconds
- **Mana Batch Encode Memory Usage:** 13.20 KB
- **Total characters in benchmark:** 131,000
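Figures like these can be reproduced with the standard library alone. A minimal sketch, using a hypothetical stand-in function in place of `ManaTokenizer().batch_encode` (substitute the real tokenizer if the package is installed):

```python
import time
import tracemalloc

def batch_encode(texts):
    # Stand-in for ManaTokenizer().batch_encode: encodes each text
    # to its raw UTF-8 byte ids so the sketch runs without the package.
    return [list(t.encode("utf-8")) for t in texts]

texts = ["یک متن طولانی"] * 1000

tracemalloc.start()                      # track Python memory allocations
start = time.perf_counter()
batches = batch_encode(texts)
elapsed = time.perf_counter() - start
_, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"encode time: {elapsed:.6f} s, peak memory: {peak / 1024:.2f} KB")
```

`time.perf_counter` gives a monotonic high-resolution timer, and `tracemalloc` reports peak allocation in bytes, which matches the units reported in the benchmark above once divided by 1024.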
## Special Tokens
- **user Token:** `<|user|>`
- **assistant Token:** `<|assistant|>`
- **end Token:** `<|end|>`
- **system Token:** `<|system|>`
## Statistics
- **Model Type:** BPE
- **Vocabulary Size:** 265,703
- **Character Coverage:** 99.9%
- **Total Number of Text Samples:** 1,147,036
- **Total Number of Tokens:** 1,490,338
- **Average Token Length:** 4.51
- **Corpus Size (in bytes):** 1,792,210,410
## Training Details
- **Training Data:** Mana Persian corpus
- **Training Script:** Mana Trainer
- **Script Version:** 1.2
## License
The Mana Tokenizer is licensed under the MIT License.