Benchmark
README.md
CHANGED
|
@@ -40,6 +40,13 @@ Batch encode:
|
|
| 40 |
tokenizer.batch_encode(["یک متن طولانی"])
|
| 41 |
```
|
| 42 |
|
| 43 |
## Special Tokens
|
| 44 |
|
| 45 |
- **user Token:** `<|user|>`
|
|
@@ -52,16 +59,16 @@ tokenizer.batch_encode(["یک متن طولانی"])
|
|
| 52 |
- **Model Type:** BPE
|
| 53 |
- **Vocabulary Size:** 265,703
|
| 54 |
- **Character Coverage:** 99.9%
|
| 55 |
-
- **Total Number of Text Samples
|
| 56 |
-
- **Total Number of Tokens
|
| 57 |
-
- **Average Token Length
|
| 58 |
-
- **Corpus Size (in bytes)
|
| 59 |
|
| 60 |
## Training Details
|
| 61 |
|
| 62 |
-
- **Training Data
|
| 63 |
-
- **Training Script
|
| 64 |
-
- **Script Version
|
| 65 |
|
| 66 |
## License
|
| 67 |
|
|
|
|
| 40 |
tokenizer.batch_encode(["یک متن طولانی"])
|
| 41 |
```
|
| 42 |
|
| 43 |
+
## Benchmark
|
| 44 |
+
|
| 45 |
+
- **Current Date and Time:** 2024-11-06 16:12:50
|
| 46 |
+
- **Mana Batch Encode Time:** 0.10711932182312012 seconds
|
| 47 |
+
- **Mana Batch Encode Memory Usage:** 13.203125 KB
|
| 48 |
+
- **Total characters in large_texts:** 131000
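The README does not show how the figures above were produced. A minimal sketch of one way to collect comparable numbers, assuming wall-clock time from `time.perf_counter` and peak allocation from `tracemalloc`, might look like this. The helper `benchmark_batch_encode` and the stand-in `whitespace_batch_encode` are hypothetical and are not part of the Mana tokenizer; in practice the tokenizer's own `batch_encode` would be passed in instead.

```python
import time
import tracemalloc


def benchmark_batch_encode(batch_encode, texts):
    """Time one batch-encode call and record peak traced memory.

    `batch_encode` is any callable that accepts a list of strings.
    This is a hypothetical helper, not the script used for the README.
    """
    tracemalloc.start()
    start = time.perf_counter()
    batch_encode(texts)
    elapsed = time.perf_counter() - start
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return {
        "seconds": elapsed,
        "peak_kb": peak_bytes / 1024,
        "total_chars": sum(len(t) for t in texts),
    }


def whitespace_batch_encode(texts):
    # Stand-in tokenizer (assumption): splits on whitespace only.
    return [t.split() for t in texts]


if __name__ == "__main__":
    large_texts = ["یک متن طولانی " * 10] * 1000
    result = benchmark_batch_encode(whitespace_batch_encode, large_texts)
    print(f"Batch Encode Time: {result['seconds']} seconds")
    print(f"Batch Encode Memory Usage: {result['peak_kb']} KB")
    print(f"Total characters in large_texts: {result['total_chars']}")
```

Note that `tracemalloc` only tracks allocations made through Python's allocator, so memory used inside a native extension may not be reflected in the reported value.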
|
| 49 |
+
|
| 50 |
## Special Tokens
|
| 51 |
|
| 52 |
- **user Token:** `<|user|>`
|
|
|
|
| 59 |
- **Model Type:** BPE
|
| 60 |
- **Vocabulary Size:** 265,703
|
| 61 |
- **Character Coverage:** 99.9%
|
| 62 |
+
- **Total Number of Text Samples:** 1,147,036
|
| 63 |
+
- **Total Number of Tokens:** 1,490,338
|
| 64 |
+
- **Average Token Length:** 4.51
|
| 65 |
+
- **Corpus Size (in bytes):** 1,792,210,410
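The README does not define how these statistics are computed. As a rough sketch, assuming that token length is counted in characters and corpus size in UTF-8 bytes, they could be derived along these lines. The function `corpus_statistics` and its `tokenize` parameter are hypothetical names introduced for illustration.

```python
def corpus_statistics(samples, tokenize):
    """Summarize a corpus given a tokenize(text) -> list[str] callable.

    Assumptions: token length is measured in characters and corpus
    size in UTF-8 bytes. The README may define these differently.
    """
    total_tokens = 0
    total_token_chars = 0
    total_bytes = 0
    for text in samples:
        tokens = tokenize(text)
        total_tokens += len(tokens)
        total_token_chars += sum(len(tok) for tok in tokens)
        total_bytes += len(text.encode("utf-8"))
    average = total_token_chars / total_tokens if total_tokens else 0.0
    return {
        "samples": len(samples),
        "tokens": total_tokens,
        "average_token_length": average,
        "corpus_bytes": total_bytes,
    }


if __name__ == "__main__":
    # str.split is a stand-in for a real tokenizer.
    stats = corpus_statistics(["یک متن طولانی", "متن دیگر"], str.split)
    for name, value in stats.items():
        print(f"{name}: {value}")
```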
|
| 66 |
|
| 67 |
## Training Details
|
| 68 |
|
| 69 |
+
- **Training Data:** Mana Persian corpus
|
| 70 |
+
- **Training Script:** Mana Trainer
|
| 71 |
+
- **Script Version:** 1.2
|
| 72 |
|
| 73 |
## License
|
| 74 |
|