Nehc/e5-large-ru

Nehc committed on Mar 2, 2024

Commit 8e6542b · verified · 1 Parent(s): ea1659f

Update README.md

Files changed (1)

README.md +73 -0


@@ -1,3 +1,76 @@
 ---
 license: mit
+language:
+- ru
+- en
+pipeline_tag: sentence-similarity
+tags:
+- mteb
+- Sentence Transformers
+- sentence-similarity
+- feature-extraction
+- sentence-transformers
 ---
+# E5-large-ru
+A modification of [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large).
+The tokenizer was shrunk to 32K tokens (ru+en) following David Dale's [manual](https://towardsdatascience.com/how-to-adapt-a-multilingual-t5-model-for-a-single-language-b9f94f3d9c90), with his invaluable assistance!
+Thank you, David! 🥰
+## Support for Sentence Transformers
+Below is a usage example with `sentence_transformers`.
+```python
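+from sentence_transformers import SentenceTransformer
+# Load the ru+en model from this repository.
+model = SentenceTransformer('Nehc/e5-large-ru')
+# Every input carries a "query: " or "passage: " prefix (see the FAQ).
+input_texts = ["passage: This is an example sentence", "passage: Каждый охотник желает знать.", "query: Где сидит фазан?"]
+# L2-normalize the embeddings so that dot products equal cosine similarities.
+embeddings = model.encode(input_texts, normalize_embeddings=True)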
+```
+Package requirements
+`pip install sentence_transformers~=2.2.2`
+Contributors: [michaelfeil](https://huggingface.co/michaelfeil)
+## FAQ
+**1. Do I need to add the prefix "query: " and "passage: " to input texts?**
+Yes, this is how the model was trained; without the prefixes, performance will degrade.
+Here are some rules of thumb:
+- Use "query: " and "passage: " respectively for asymmetric tasks such as passage retrieval in open QA and ad-hoc information retrieval.
+- Use the "query: " prefix for symmetric tasks such as semantic similarity, bitext mining, and paraphrase retrieval.
+- Use the "query: " prefix if you want to use embeddings as features, for example in linear probing classification or clustering.
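The rules of thumb above can be sketched as a small helper. This is only an illustration: `add_e5_prefix` is a hypothetical name, not part of `sentence_transformers` or of this model's code, and it assumes the only two valid prefixes are the "query: " and "passage: " strings named in the FAQ.

```python
from typing import List


def add_e5_prefix(texts: List[str], role: str = "query") -> List[str]:
    """Prepend the E5 role prefix ("query: " or "passage: ") to each text.

    Hypothetical helper for illustration; not part of any library.
    """
    if role not in ("query", "passage"):
        raise ValueError(f"role must be 'query' or 'passage', got {role!r}")
    prefix = f"{role}: "
    # Leave texts that already carry the prefix untouched to avoid doubling it.
    return [t if t.startswith(prefix) else prefix + t for t in texts]


# Asymmetric retrieval: queries and passages get different prefixes.
queries = add_e5_prefix(["Где сидит фазан?"], role="query")
passages = add_e5_prefix(["Каждый охотник желает знать."], role="passage")

# Symmetric tasks and feature extraction: "query: " on every input.
features = add_e5_prefix(["This is an example sentence"])
```

The resulting lists can be passed directly to `model.encode(...)`.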
+**2. Why are my reproduced results slightly different from those reported in the model card?**
+Different versions of `transformers` and `pytorch` could cause negligible but non-zero performance differences.
+**3. Why do the cosine similarity scores fall between 0.7 and 1.0?**
+This is known and expected behavior, as we use a low temperature of 0.01 for the InfoNCE contrastive loss.
+For text embedding tasks like text retrieval or semantic similarity,
+what matters is the relative order of the scores rather than their absolute values,
+so this should not be an issue.
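A minimal sketch of why the narrow score range is harmless. The similarity values below are made up for illustration, and the two helper functions are hypothetical names, not the training code. Dividing scores by a temperature of 0.01 before a softmax, as an InfoNCE-style loss does, turns small gaps into large ones, while the ranking depends only on the relative order.

```python
import math
from typing import List


def softmax_with_temperature(scores: List[float], temperature: float = 0.01) -> List[float]:
    """Softmax over similarity scores scaled by 1/temperature."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def rank_by_score(scores: List[float]) -> List[int]:
    """Return candidate indices ordered from most to least similar."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Invented cosine similarities, all inside the narrow 0.7 to 1.0 band.
scores = [0.78, 0.86, 0.81]
ranking = rank_by_score(scores)           # best candidate first
probs = softmax_with_temperature(scores)  # small gaps become large after scaling
```

Here a gap of only 0.05 between the top two scores already concentrates almost all of the softmax mass on the best candidate, so the absolute values carry little information beyond the ordering.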
+## Citation
+If you find our paper or models helpful, please consider citing them as follows:
+```
+@article{wang2024multilingual,
+  title={Multilingual E5 Text Embeddings: A Technical Report},
+  author={Wang, Liang and Yang, Nan and Huang, Xiaolong and Yang, Linjun and Majumder, Rangan and Wei, Furu},
+  journal={arXiv preprint arXiv:2402.05672},
+  year={2024}
+}
+```
+## Limitations
+Long texts will be truncated to at most 512 tokens.
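One possible way to work around the truncation is to split long documents into overlapping chunks before encoding. The sketch below is an assumption-laden illustration, not a feature of the model: `split_into_chunks` is a hypothetical helper, it counts whitespace-separated words rather than real tokens, and the default `max_words=200` is a conservative guess meant to stay under 512 tokens, not a measured value.

```python
from typing import List


def split_into_chunks(text: str, max_words: int = 200, overlap: int = 20) -> List[str]:
    """Split text into overlapping word windows (hypothetical helper).

    Words are only a rough proxy for tokens; use the model's tokenizer
    for an exact count.
    """
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("require max_words > 0 and 0 <= overlap < max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once the current window already reaches the end of the text.
        if start + max_words >= len(words):
            break
    return chunks


# A synthetic 450-word document, prefixed per chunk as a passage.
long_text = " ".join(f"w{i}" for i in range(450))
chunks = ["passage: " + c for c in split_into_chunks(long_text)]
```

Each chunk can then be encoded separately, and the per-chunk embeddings compared against the query individually.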