---
license: apache-2.0
---

# Cohere `multilingual-22-12` tokenizer

This is the tokenizer for the Cohere `multilingual-22-12` embedding model: [Cohere Multilingual Embeddings](https://docs.cohere.ai/docs/multilingual-language-models)

You can load it with the Transformers library like this:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("CohereLabs/multilingual-22-12")
text = "Hellö World, this is my input string!"
enc = tokenizer(text)
print("Encoded input:")
print(enc)

# Map the token IDs back to their string tokens
tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"])
print("Tokens:")
print(tokens)

number_of_tokens = len(enc["input_ids"])
print("Number of tokens:", number_of_tokens)
```

## Computing the number of tokens

The following values can be used to approximate the number of tokens from the number of input characters:
```
approx_number_of_tokens = len(input_text) / ratio
```

E.g. for English, `approx_number_of_tokens = len(input_text) / 4.8`.

| | Language | Avg. characters per token | |
| | --- | :---: | |
| | ar | 3.6 | |
| | de | 4.6 | |
| | en | 4.8 | |
| | es | 4.6 | |
| | fr | 4.4 | |
| | hi | 3.8 | |
| | it | 4.5 | |
| | ja | 1.3 | |
| | ko | 2.0 | |
| | zh | 1.1 | |
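
The per-language ratios above can be wrapped in a small helper. This is a minimal sketch; the `CHARS_PER_TOKEN` dict and the `approx_number_of_tokens` function are illustrative names, not part of any Cohere or Transformers API:

```python
# Characters-per-token ratios from the table above.
CHARS_PER_TOKEN = {
    "ar": 3.6, "de": 4.6, "en": 4.8, "es": 4.6, "fr": 4.4,
    "hi": 3.8, "it": 4.5, "ja": 1.3, "ko": 2.0, "zh": 1.1,
}

def approx_number_of_tokens(input_text: str, language: str) -> int:
    """Estimate the token count of `input_text` for the given language code."""
    ratio = CHARS_PER_TOKEN[language]
    return round(len(input_text) / ratio)

print(approx_number_of_tokens("Hello World, this is my input string!", "en"))
print(approx_number_of_tokens("こんにちは", "ja"))
```

For an exact count, tokenize the text with the tokenizer itself as shown above; this helper is only for quick estimates, e.g. when budgeting input lengths.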

These values were computed on the first 10,000 paragraphs of [Wikipedia](https://huggingface.co/datasets/Cohere/wikipedia-22-12). For other datasets, these values may differ.