---
license: apache-2.0
---

# Cohere `multilingual-22-12` tokenizer

This is the tokenizer for the Cohere `multilingual-22-12` embedding model; see [Cohere Multilingual Embeddings](https://docs.cohere.ai/docs/multilingual-language-models) for details on the model.

You can load it with the transformers library like this:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Cohere/multilingual-22-12")
text = "Hellö World, this is my input string!"
enc = tokenizer(text)
print("Encoded input:")
print(enc)

# Map token ids back to token strings via an inverse vocabulary
inv_vocab = {v: k for k, v in tokenizer.vocab.items()}
tokens = [inv_vocab[token_id] for token_id in enc['input_ids']]
print("Tokens:")
print(tokens)

number_of_tokens = len(enc['input_ids'])
print("Number of tokens:", number_of_tokens)
```

## Computing number of tokens

The following values can be used to approximate the number of tokens given the number of input characters:
```
approx_number_of_tokens = len(input_text) / ratio
```

E.g. for English, `approx_number_of_tokens = len(input_text) / 4.8`.

| Language | Avg. characters per token |
| --- | :---: |
| ar | 3.6 |
| de | 4.6 |
| en | 4.8 |
| es | 4.6 |
| fr | 4.4 |
| hi | 3.8 |
| it | 4.5 |
| ja | 1.3 |
| ko | 2.0 |
| zh | 1.1 |


These values have been computed on the first 10,000 paragraphs from [Wikipedia](https://huggingface.co/datasets/Cohere/wikipedia-22-12). For other datasets, these values might differ.
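
The approximation above can be sketched as a small helper. This is an illustrative snippet, not part of the library; the function name and the `lang` parameter are assumptions, and the ratios are taken directly from the table above:

```python
# Characters-per-token ratios from the table above
# (computed on Wikipedia paragraphs; treat the result as an estimate).
CHARS_PER_TOKEN = {
    "ar": 3.6, "de": 4.6, "en": 4.8, "es": 4.6, "fr": 4.4,
    "hi": 3.8, "it": 4.5, "ja": 1.3, "ko": 2.0, "zh": 1.1,
}

def approx_number_of_tokens(input_text: str, lang: str = "en") -> int:
    """Estimate the token count of `input_text` for language `lang`."""
    ratio = CHARS_PER_TOKEN[lang]
    return round(len(input_text) / ratio)

print(approx_number_of_tokens("a" * 48, "en"))
```

For an exact count, tokenize the text with the snippet above and use `len(enc['input_ids'])`.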