---
license: apache-2.0
---

# Cohere `multilingual-22-12` tokenizer

This is the tokenizer for the Cohere `multilingual-22-12` embedding model: [Cohere Multilingual Embeddings](https://docs.cohere.ai/docs/multilingual-language-models).

You can load it with the transformers library like this:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Cohere/multilingual-22-12")

text = "Hellö World, this is my input string!"
enc = tokenizer(text)
print("Encoded input:")
print(enc)

# Map the token ids in the encoding back to their token strings
tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"])
print("Tokens:")
print(tokens)

number_of_tokens = len(enc["input_ids"])
print("Number of tokens:", number_of_tokens)
```

## Computing number of tokens

The following values can be used to approximate the number of tokens given the number of input characters:

```
approx_number_of_tokens = len(input_text) / ratio
```

E.g. for English, `approx_number_of_tokens = len(input_text) / 4.8`.

| Language | Avg. characters per token |
| --- | :---: |
| ar | 3.6 |
| de | 4.6 |
| en | 4.8 |
| es | 4.6 |
| fr | 4.4 |
| hi | 3.8 |
| it | 4.5 |
| ja | 1.3 |
| ko | 2.0 |
| zh | 1.1 |

These values were computed on the first 10,000 paragraphs from [Wikipedia](https://huggingface.co/datasets/Cohere/wikipedia-22-12). For other datasets, these values may differ.
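The ratios above can be wrapped in a small helper for quick estimates. This is only an illustrative sketch: the `CHARS_PER_TOKEN` dictionary and `approx_token_count` function are hypothetical names, not part of any Cohere or transformers API, and the result is an approximation, not an exact token count.

```python
# Average characters per token, taken from the table above (illustrative helper).
CHARS_PER_TOKEN = {
    "ar": 3.6, "de": 4.6, "en": 4.8, "es": 4.6, "fr": 4.4,
    "hi": 3.8, "it": 4.5, "ja": 1.3, "ko": 2.0, "zh": 1.1,
}

def approx_token_count(input_text: str, language: str = "en") -> int:
    """Estimate the token count from the character count for a given language."""
    ratio = CHARS_PER_TOKEN[language]
    return round(len(input_text) / ratio)

# 37 characters / 4.8 chars-per-token ≈ 8 tokens for English text
print(approx_token_count("Hello World, this is my input string!"))
```

For an exact count, tokenize the text with the code in the first section and take `len(enc["input_ids"])`.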