add stats

README.md

```
print(tokens)
number_of_tokens = len(enc['input_ids'])
print("Number of tokens:", number_of_tokens)
```

## Computing number of tokens

The following values can be used to approximate the number of tokens given the number of input characters:

```
approx_number_of_tokens = len(input_text) / ratio
```

E.g. for English, `approx_number_of_tokens = len(input_text) / 4.8`.

| Language | Avg. characters per token |
| --- | :---: |
| ar | 3.6 |
| de | 4.6 |
| en | 4.8 |
| es | 4.6 |
| fr | 4.4 |
| hi | 3.8 |
| it | 4.5 |
| ja | 1.3 |
| ko | 2.0 |
| zh | 1.1 |

These values have been computed on the first 10,000 paragraphs from [Wikipedia](https://huggingface.co/datasets/Cohere/wikipedia-22-12). For other datasets, these values might differ.
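The per-language ratios above can be wrapped in a small helper. This is only an illustrative sketch: the `CHARS_PER_TOKEN` dictionary and the `approx_number_of_tokens` function below are not part of this repository, and the estimate is a rough average, not an exact token count.

```python
# Average characters per token, taken from the table above.
# Keys are ISO 639-1 language codes; values are illustrative averages.
CHARS_PER_TOKEN = {
    "ar": 3.6, "de": 4.6, "en": 4.8, "es": 4.6, "fr": 4.4,
    "hi": 3.8, "it": 4.5, "ja": 1.3, "ko": 2.0, "zh": 1.1,
}

def approx_number_of_tokens(input_text: str, language: str = "en") -> int:
    """Roughly estimate the token count of `input_text` for a language code."""
    ratio = CHARS_PER_TOKEN[language]
    return round(len(input_text) / ratio)
```

For example, a 48-character English string is estimated at 48 / 4.8 = 10 tokens; the same character count in Chinese (ratio 1.1) would be estimated at far more tokens, which matches the table's intuition that CJK text packs fewer characters into each token.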