
	---
	library_name: transformers
	language:
	- ru
	- uk
	- kk
	- be
	---

	## About model creation

	This is a smaller version of intfloat/multilingual-e5-base in which only some Russian (Cyrillic in general) tokens and a smaller number of English tokens, along with their embeddings, are kept.

	The model was created in a way similar to the one described in this post: https://medium.com/m/global-identity-2?redirectUrl=https%3A%2F%2Ftowardsdatascience.com%2Fhow-to-adapt-a-multilingual-t5-model-for-a-single-language-b9f94f3d9c90

	The CulturaX dataset was used to find the required tokens. As a result, only 69,382 of the original model's roughly 250k tokens were kept.
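
The token search described above can be sketched roughly as follows. This is a minimal illustration, not the author's actual script: `select_token_ids`, `toy_encode`, `toy_vocab`, and the `min_count` threshold are hypothetical names introduced here, and a toy whitespace tokenizer stands in for the real subword tokenizer and the CulturaX corpus.

```python
from collections import Counter
from typing import Callable, Iterable, List, Set


def select_token_ids(
    texts: Iterable[str],
    encode: Callable[[str], List[int]],
    always_keep: Iterable[int] = (),
    min_count: int = 1,
) -> List[int]:
    """Return sorted ids of tokens seen at least `min_count` times, plus `always_keep`."""
    counts: Counter = Counter()
    for text in texts:
        counts.update(encode(text))
    kept: Set[int] = set(always_keep)  # e.g. special tokens, kept unconditionally
    kept.update(tok for tok, n in counts.items() if n >= min_count)
    return sorted(kept)


# Toy stand-in for a real subword tokenizer: one id per whitespace-separated word.
toy_vocab = {"<s>": 0, "</s>": 1, "привет": 2, "мир": 3, "hello": 4, "bonjour": 5}


def toy_encode(text: str) -> List[int]:
    return [toy_vocab[word] for word in text.split() if word in toy_vocab]


kept_ids = select_token_ids(["привет мир", "hello мир"], toy_encode, always_keep=[0, 1])
print(kept_ids)  # ids of the tokens that would survive pruning
```

With a real tokenizer, `encode` would be the tokenizer's own encoding function, and `texts` would be a stream of corpus documents in the target languages.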

	## Was the model trained in any way?

	No. Only the tokenizer was modified, and the resulting changes in token identifiers were compensated for by moving the corresponding embeddings in the model's word_embeddings module to their new positions. The quality of this model on Cyrillic (and English) text is therefore exactly the same as that of the original model.
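
To see why no retraining is needed, here is a small sketch of the embedding move in plain Python. It assumes the embedding matrix is a list of rows indexed by token id; `remap_embeddings` and `old_to_new` are hypothetical helpers for illustration only. In a real model the same row selection would be done on the weight tensor of the word_embeddings module.

```python
from typing import Dict, List, Sequence


def remap_embeddings(
    old_rows: Sequence[Sequence[float]], kept_ids: Sequence[int]
) -> List[List[float]]:
    """Build the smaller matrix: new row i is a copy of old row kept_ids[i]."""
    return [list(old_rows[old_id]) for old_id in kept_ids]


def old_to_new(kept_ids: Sequence[int]) -> Dict[int, int]:
    """Map each surviving old token id to its new, compact id."""
    return {old_id: new_id for new_id, old_id in enumerate(kept_ids)}


# Toy 6-token vocabulary with 2-dimensional embeddings.
old_rows = [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1], [3.0, 3.1], [4.0, 4.1], [5.0, 5.1]]
kept_ids = [0, 1, 3, 5]

new_rows = remap_embeddings(old_rows, kept_ids)
id_map = old_to_new(kept_ids)

# Every surviving token still looks up exactly the same vector as before.
for old_id in kept_ids:
    assert new_rows[id_map[old_id]] == old_rows[old_id]
```

Because each kept token maps to an unchanged vector and the rest of the network is untouched, any text made up only of kept tokens produces the same outputs as before.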

	## Why do we need this?

	This lets you use significantly less memory during training and also greatly reduces the size of the model.
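
A back-of-the-envelope estimate of the saving, using the token counts given above. The hidden size of 768 and float32 storage are assumptions made for this sketch (they are not stated in this card), and 250,000 is the rounded "250k" figure rather than an exact vocabulary size.

```python
ORIGINAL_VOCAB = 250_000  # "250k" from the text above; the exact count may differ slightly
PRUNED_VOCAB = 69_382     # from the text above
HIDDEN_SIZE = 768         # assumption: embedding width of a base-sized encoder
BYTES_PER_PARAM = 4       # assumption: float32 weights


def embedding_bytes(vocab_size, hidden_size=HIDDEN_SIZE, bytes_per_param=BYTES_PER_PARAM):
    """Size in bytes of a vocab_size x hidden_size embedding matrix."""
    return vocab_size * hidden_size * bytes_per_param


saved_params = (ORIGINAL_VOCAB - PRUNED_VOCAB) * HIDDEN_SIZE
saved_bytes = embedding_bytes(ORIGINAL_VOCAB) - embedding_bytes(PRUNED_VOCAB)

print(f"embedding parameters removed: {saved_params:,}")
print(f"float32 size saved: {saved_bytes / 1e6:.1f} MB")
```

During training the saving is typically larger than the weight size alone, since gradients and optimizer state are also kept for each embedding parameter.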

	## Authors
	- Sergei Bratchikov (https://t.me/nlpwanderer)