---
license: apache-2.0
---
# Cohere `multilingual-22-12` tokenizer
This is the tokenizer for the Cohere `multilingual-22-12` embedding model: [Cohere Multilingual Embeddings](https://docs.cohere.ai/docs/multilingual-language-models)
You can load it with the transformers library like this:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Cohere/multilingual-22-12")

text = "Hellö World, this is my input string!"
enc = tokenizer(text)
print("Encoded input:")
print(enc)

# Map the token ids back to their token strings
tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"])
print("Tokens:")
print(tokens)

number_of_tokens = len(enc["input_ids"])
print("Number of tokens:", number_of_tokens)
```
## Computing number of tokens
The following values can be used to approximate the number of tokens given the number of input characters:
```
approx_number_of_tokens = len(input_text) / ratio
```
E.g. for English, `approx_number_of_tokens = len(input_text) / 4.8`.
| Language | Avg. characters per token |
| --- | :---: |
| ar | 3.6 |
| de | 4.6 |
| en | 4.8 |
| es | 4.6 |
| fr | 4.4 |
| hi | 3.8 |
| it | 4.5 |
| ja | 1.3 |
| ko | 2.0 |
| zh | 1.1 |
These values were computed on the first 10,000 paragraphs of [Wikipedia](https://huggingface.co/datasets/Cohere/wikipedia-22-12). For other datasets, these values may differ.
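The table above can be turned into a small helper for estimating token counts. This is a minimal sketch; the `ratios` dictionary simply transcribes the "Avg. characters per token" column, and the function name `approx_number_of_tokens` is our own (it is not part of any Cohere or transformers API):

```python
# Average characters per token, transcribed from the table above
ratios = {
    "ar": 3.6, "de": 4.6, "en": 4.8, "es": 4.6, "fr": 4.4,
    "hi": 3.8, "it": 4.5, "ja": 1.3, "ko": 2.0, "zh": 1.1,
}

def approx_number_of_tokens(input_text: str, lang: str = "en") -> int:
    """Estimate the token count from the character count and a per-language ratio."""
    return round(len(input_text) / ratios[lang])

# 37 characters / 4.8 characters per token ≈ 8 tokens
print(approx_number_of_tokens("Hello World, this is my input string!"))
```

Keep in mind this is only a rough estimate; for exact counts, tokenize the text with the tokenizer itself as shown earlier.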