---
license: mit
language:
- fa
- en
- ar
---
# Mana Tokenizer
The Mana Tokenizer is a custom-trained BPE (byte-pair encoding) tokenizer designed for Persian text. It is trained on a large Persian corpus and uses high character coverage to handle diverse Persian text.
## Quick Start
You can encode and decode text with the Mana Tokenizer like this:
```python
from mana_tokenizer import ManaTokenizer

tokenizer = ManaTokenizer()
text = "سلام من یک متن تست برای تست این تست هستم."

# Encode once, then decode the same IDs to check the round trip.
ids = tokenizer.encode(text)
print(ids)
print(tokenizer.decode(ids))
```
For comparison, this is the raw UTF-8 byte encoding of the text (one ID per byte):
```
[216, 179, 217, 132, 216, 167, 217, 133, 32, 217, 133, 217, 134, 32, 219, 140, 218, 169, 32, 217, 133, 216, 170, 217, 134, 32, 216, 170, 216, 179, 216, 170, 32, 216, 168, 216, 177, 216, 167, 219, 140, 32, 216, 170, 216, 179, 216, 170, 32, 216, 167, 219, 140, 217, 134, 32, 216, 170, 216, 179, 216, 170, 32, 217, 135, 216, 179, 216, 170, 217, 133, 46]
سلام من یک متن تست برای تست این تست هستم.
```
And here is what the Mana Tokenizer generates:
```
[30318, 377, 363, 4340, 5828, 513, 5828, 378, 5828, 14471, 46]
سلام من یک متن تست برای تست این تست هستم.
```
You can also add special tokens:
```python
tokenizer.register_special_tokens({"</s>": 100269})
```
Batch encode:
```python
tokenizer.batch_encode(["یک متن طولانی"])
```
## Benchmark
- **Benchmark DateTime:** 2024-11-06 16:12:50
- **Mana Batch Encode Time:** 0.10711932182312012 seconds
- **Mana Batch Encode Memory Usage:** 13.203125 KB
- **Total characters in benchmark:** 131,000
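
This README does not say how these figures were measured. As a rough sketch, a comparable measurement could be taken with `time.perf_counter` and `tracemalloc` from the standard library. The helper `benchmark_batch_encode` and the stand-in encoder `utf8_batch_encode` are hypothetical names introduced here so the example runs without the package; passing `ManaTokenizer().batch_encode` instead would measure the real tokenizer.

```python
import time
import tracemalloc
from typing import Callable, Sequence


def benchmark_batch_encode(
    batch_encode: Callable[[Sequence[str]], list],
    texts: Sequence[str],
) -> dict:
    """Time one batch_encode call and record its peak traced memory."""
    total_chars = sum(len(t) for t in texts)

    tracemalloc.start()
    start = time.perf_counter()
    batch_encode(texts)
    elapsed = time.perf_counter() - start
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    return {
        "seconds": elapsed,
        "peak_kb": peak_bytes / 1024,
        "total_chars": total_chars,
        "chars_per_second": total_chars / elapsed if elapsed > 0 else float("inf"),
    }


# Stand-in encoder (raw UTF-8 bytes), an assumption so the sketch is runnable.
def utf8_batch_encode(texts):
    return [list(t.encode("utf-8")) for t in texts]


sample = ["یک متن طولانی"] * 1000
result = benchmark_batch_encode(utf8_batch_encode, sample)
print(result)
```

Taken at face value, the figures reported above (131,000 characters in about 0.107 seconds) work out to roughly 1.2 million characters per second.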
## Special Tokens
- **User Token:** `<|user|>`
- **Assistant Token:** `<|assistant|>`
- **End Token:** `<|end|>`
- **System Token:** `<|system|>`
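
These tokens suggest a chat-style prompt format, although this README does not define a template. The sketch below shows one possible way to assemble a prompt from them. The layout (role token, then content, then `<|end|>`) and the helper `format_chat` are assumptions made for illustration, not the tokenizer's documented behavior.

```python
# Special tokens as listed in this README.
ROLE_TOKENS = {
    "system": "<|system|>",
    "user": "<|user|>",
    "assistant": "<|assistant|>",
}
END_TOKEN = "<|end|>"


def format_chat(messages):
    """Join messages as <role token><content><|end|> (hypothetical layout)."""
    parts = []
    for message in messages:
        role = message["role"]
        if role not in ROLE_TOKENS:
            raise ValueError(f"unknown role: {role!r}")
        parts.append(f"{ROLE_TOKENS[role]}{message['content']}{END_TOKEN}")
    return "".join(parts)


prompt = format_chat(
    [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "سلام"},
    ]
)
print(prompt)
```

The resulting string could then be passed to `tokenizer.encode`. Whether the special tokens are mapped to single IDs depends on how the tokenizer registers them.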
## Statistics
- **Model Type:** BPE
- **Vocabulary Size:** 265,703
- **Character Coverage:** 99.9%
- **Total Number of Text Samples:** 1,147,036
- **Total Number of Tokens:** 1,490,338
- **Average Token Length:** 4.51
- **Corpus Size (in bytes):** 1,792,210,410
## Training Details
- **Training Data:** Mana Persian corpus
- **Training Script:** Mana Trainer
- **Script Version:** 1.2
## License
The Mana Tokenizer is licensed under the MIT License.