|
|
--- |
|
|
license: mit |
|
|
language: |
|
|
- fa |
|
|
- en |
|
|
- ar |
|
|
--- |
|
|
|
|
|
# Mana Tokenizer |
|
|
|
|
|
The Mana Tokenizer is a custom-trained BPE tokenizer designed for Persian text. It was trained on a large combined Persian corpus using byte-pair encoding (BPE) with high character coverage, so it can handle diverse Persian text.
|
|
|
|
|
## Quick Start |
|
|
You can encode and decode text with the Mana Tokenizer like this:
|
|
```python |
|
|
from mana_tokenizer import ManaTokenizer |
|
|
tokenizer = ManaTokenizer() |
|
|
text = "سلام من یک متن تست برای تست این تست هستم." |
|
|
print(tokenizer.encode(text)) |
|
|
print(tokenizer.decode(tokenizer.encode(text))) |
|
|
``` |
|
|
For comparison, this is the plain byte-level (UTF-8) encoding of the same text:
|
|
``` |
|
|
[216, 179, 217, 132, 216, 167, 217, 133, 32, 217, 133, 217, 134, 32, 219, 140, 218, 169, 32, 217, 133, 216, 170, 217, 134, 32, 216, 170, 216, 179, 216, 170, 32, 216, 168, 216, 177, 216, 167, 219, 140, 32, 216, 170, 216, 179, 216, 170, 32, 216, 167, 219, 140, 217, 134, 32, 216, 170, 216, 179, 216, 170, 32, 217, 135, 216, 179, 216, 170, 217, 133, 46] |
|
|
سلام من یک متن تست برای تست این تست هستم. |
|
|
``` |
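The byte sequence above is simply the UTF-8 encoding of the string, so it can be reproduced with plain Python and no tokenizer at all:

```python
# Reproduce the raw byte-level encoding shown above using only the
# standard library; each Persian letter occupies two UTF-8 bytes.
text = "سلام من یک متن تست برای تست این تست هستم."
raw_bytes = list(text.encode("utf-8"))

print(raw_bytes[:8])   # [216, 179, 217, 132, 216, 167, 217, 133] -> "سلام"
print(len(raw_bytes))  # 72 byte values in total
```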
|
|
And here is what the Mana Tokenizer generates:
|
|
``` |
|
|
[30318, 377, 363, 4340, 5828, 513, 5828, 378, 5828, 14471, 46] |
|
|
سلام من یک متن تست برای تست این تست هستم. |
|
|
``` |
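Comparing the two outputs gives a rough sense of the compression: the raw UTF-8 encoding needs 72 byte values, while the Mana Tokenizer covers the same sentence with 11 token IDs. The token list below is copied from the example output above:

```python
# Compression ratio of the example sentence, using the outputs shown above.
text = "سلام من یک متن تست برای تست این تست هستم."
raw_bytes = list(text.encode("utf-8"))
mana_tokens = [30318, 377, 363, 4340, 5828, 513, 5828, 378, 5828, 14471, 46]

ratio = len(raw_bytes) / len(mana_tokens)
print(f"{len(raw_bytes)} bytes -> {len(mana_tokens)} tokens "
      f"({ratio:.1f}x fewer IDs)")  # 72 bytes -> 11 tokens (6.5x fewer IDs)
```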
|
|
|
|
|
You can also add special tokens: |
|
|
```python |
|
|
tokenizer.register_special_tokens({"</s>": 100269}) |
|
|
``` |
|
|
|
|
|
Batch encode a list of texts:
|
|
```python |
|
|
tokenizer.batch_encode(["یک متن طولانی"]) |
|
|
``` |
|
|
|
|
|
## Benchmark |
|
|
|
|
|
- **Benchmark DateTime:** 2024-11-06 16:12:50 |
|
|
- **Mana Batch Encode Time:** 0.107 seconds
|
|
- **Mana Batch Encode Memory Usage:** 13.20 KB
|
|
- **Total characters in benchmark:** 131,000 |
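From the figures above, a rough back-of-the-envelope throughput estimate for the batch-encode benchmark:

```python
# Approximate batch-encode throughput from the benchmark figures above
# (time rounded to ~0.107 s for 131,000 characters).
total_chars = 131_000
encode_seconds = 0.107

throughput = total_chars / encode_seconds
print(f"~{throughput:,.0f} characters/second")  # roughly 1.22 million chars/s
```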
|
|
|
|
|
## Special Tokens |
|
|
|
|
|
- **user Token:** `<|user|>` |
|
|
- **assistant Token:** `<|assistant|>` |
|
|
- **end Token:** `<|end|>` |
|
|
- **system Token:** `<|system|>` |
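These tokens can be combined into a chat-style prompt. The exact template below is an assumption for illustration only; check the expected format of the model you pair the tokenizer with:

```python
# Hypothetical chat prompt built from the special tokens listed above.
# The ordering and lack of separators are assumptions, not a documented format.
prompt = (
    "<|system|>You are a helpful assistant.<|end|>"
    "<|user|>سلام! حالت چطوره؟<|end|>"   # "Hi! How are you?"
    "<|assistant|>"
)
print(prompt)
```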
|
|
|
|
|
## Statistics |
|
|
|
|
|
- **Model Type:** BPE |
|
|
- **Vocabulary Size:** 265,703 |
|
|
- **Character Coverage:** 99.9% |
|
|
- **Total Number of Text Samples:** 1,147,036 |
|
|
- **Total Number of Tokens:** 1,490,338 |
|
|
- **Average Token Length:** 4.51 |
|
|
- **Corpus Size (in bytes):** 1,792,210,410 |
|
|
|
|
|
## Training Details |
|
|
|
|
|
- **Training Data:** Mana Persian corpus |
|
|
- **Training Script:** Mana Trainer |
|
|
- **Script Version:** 1.2 |
|
|
|
|
|
## License |
|
|
|
|
|
The Mana Tokenizer is licensed under the MIT License.