---
license: apache-2.0
---

# Cohere `multilingual-22-12` tokenizer

This is the tokenizer for the Cohere `multilingual-22-12` embedding model: [Cohere Multilingual Embeddings](https://docs.cohere.ai/docs/multilingual-language-models)

You can load it with the Transformers library like this:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("CohereLabs/multilingual-22-12")
text = "Hellö World, this is my input string!"
enc = tokenizer(text)
print("Encoded input:")
print(enc)

# Map the token IDs back to their string tokens
tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"])
print("Tokens:")
print(tokens)

number_of_tokens = len(enc["input_ids"])
print("Number of tokens:", number_of_tokens)
```

## Computing the number of tokens

The following values can be used to approximate the number of tokens from the number of input characters:
```
approx_number_of_tokens = len(input_text) / ratio
```

E.g. for English, `approx_number_of_tokens = len(input_text) / 4.8`.

| | Language | Avg. characters per token | |
| | --- | :---: | |
| | ar | 3.6 | |
| | de | 4.6 | |
| | en | 4.8 | |
| | es | 4.6 | |
| | fr | 4.4 | |
| | hi | 3.8 | |
| | it | 4.5 | |
| | ja | 1.3 | |
| | ko | 2.0 | |
| | zh | 1.1 | |
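
The per-language ratios above can be wrapped in a small helper. This is a minimal sketch; the `CHARS_PER_TOKEN` dict and the `approx_number_of_tokens` function are illustrative names, not part of any Cohere or Transformers API:

```python
# Characters-per-token ratios from the table above.
CHARS_PER_TOKEN = {
    "ar": 3.6, "de": 4.6, "en": 4.8, "es": 4.6, "fr": 4.4,
    "hi": 3.8, "it": 4.5, "ja": 1.3, "ko": 2.0, "zh": 1.1,
}

def approx_number_of_tokens(input_text: str, language: str) -> int:
    """Estimate the token count of `input_text` for the given language code."""
    ratio = CHARS_PER_TOKEN[language]
    return round(len(input_text) / ratio)

print(approx_number_of_tokens("Hello World, this is my input string!", "en"))
print(approx_number_of_tokens("こんにちは", "ja"))
```

For an exact count, tokenize the text with the tokenizer itself as shown above; this helper is only for quick estimates, e.g. when budgeting input lengths.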

These values were computed on the first 10,000 paragraphs of [Wikipedia](https://huggingface.co/datasets/Cohere/wikipedia-22-12). For other datasets, these values may differ.