---
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
- information-retrieval
language: pl
license: apache-2.0
widget:
- source_sentence: "query: Jak dożyć 100 lat?"
  sentences:
  - "passage: Trzeba zdrowo się odżywiać i uprawiać sport."
  - "passage: Trzeba pić alkohol, imprezować i jeździć szybkimi autami."
  - "passage: Gdy trwała kampania politycy zapewniali, że rozprawią się z zakazem niedzielnego handlu."
---

<h1 align="center">MMLW-retrieval-e5-large</h1>

MMLW (muszę mieć lepszą wiadomość, "I must have a better message") are neural text encoders for Polish.
This model is optimized for information retrieval tasks. It can transform queries and passages into 1024-dimensional vectors.
The model was developed using a two-step procedure:
- In the first step, it was initialized with a multilingual E5 checkpoint and then trained with the [multilingual knowledge distillation method](https://aclanthology.org/2020.emnlp-main.365/) on a diverse corpus of 60 million Polish-English text pairs. We utilised [English FlagEmbeddings (BGE)](https://huggingface.co/BAAI/bge-large-en) as teacher models for distillation; a sketch of this objective follows the list.
- The second step involved fine-tuning the obtained models with a contrastive loss on the [Polish MS MARCO](https://huggingface.co/datasets/clarin-knext/msmarco-pl) training split. To improve the efficiency of contrastive training, we used large batch sizes: 1152 for small, 768 for base, and 288 for large models. Fine-tuning was conducted on a cluster of 12 A100 GPUs.
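
The distillation step can be illustrated with a minimal sketch. This is a simplified illustration of the cited method, not the actual MMLW training code: `teacher_encode` and `student_encode` are hypothetical callables that map a list of texts to a `(batch, dim)` embedding tensor, with gradients enabled on the student side.

```python
import torch
import torch.nn.functional as F

def distillation_loss(teacher_encode, student_encode,
                      en_batch: list[str], pl_batch: list[str]) -> torch.Tensor:
    """Multilingual knowledge distillation (Reimers & Gurevych, 2020)."""
    # The frozen English teacher (here, BGE) embeds only the English side of each pair.
    with torch.no_grad():
        target = teacher_encode(en_batch)
    # The student learns to map both the English sentence and its Polish
    # translation onto the teacher's embedding of the English sentence.
    loss = F.mse_loss(student_encode(en_batch), target)
    loss = loss + F.mse_loss(student_encode(pl_batch), target)
    return loss
```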

⚠️ **2023-12-26:** We have updated the model to a new version with improved results. You can still download the previous version using the **v1** tag: `AutoModel.from_pretrained("sdadas/mmlw-retrieval-e5-large", revision="v1")` ⚠️
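
Recent versions of sentence-transformers also expose a `revision` argument on `SentenceTransformer`, so the old weights can be pinned without dropping down to `AutoModel`; if your installed version predates this argument, use the `AutoModel` call above:

```python
from sentence_transformers import SentenceTransformer

# Load the previous (v1) weights instead of the current revision.
model_v1 = SentenceTransformer("sdadas/mmlw-retrieval-e5-large", revision="v1")
```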

## Usage (Sentence-Transformers)

⚠️ Our dense retrievers require the use of specific prefixes and suffixes when encoding texts. For this model, queries should be prefixed with **"query: "** and passages with **"passage: "**. ⚠️

You can use the model like this with [sentence-transformers](https://www.SBERT.net):

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

# Queries and passages must carry the prefixes expected by the model.
query_prefix = "query: "
answer_prefix = "passage: "
queries = [query_prefix + "Jak dożyć 100 lat?"]  # "How to live to be 100?"
answers = [
    answer_prefix + "Trzeba zdrowo się odżywiać i uprawiać sport.",
    answer_prefix + "Trzeba pić alkohol, imprezować i jeździć szybkimi autami.",
    answer_prefix + "Gdy trwała kampania politycy zapewniali, że rozprawią się z zakazem niedzielnego handlu."
]
model = SentenceTransformer("sdadas/mmlw-retrieval-e5-large")

# Encode both sides as tensors so they can be compared directly.
queries_emb = model.encode(queries, convert_to_tensor=True, show_progress_bar=False)
answers_emb = model.encode(answers, convert_to_tensor=True, show_progress_bar=False)

# Select the passage with the highest cosine similarity to the query.
best_answer = cos_sim(queries_emb, answers_emb).argmax().item()
print(answers[best_answer])
# passage: Trzeba zdrowo się odżywiać i uprawiać sport.
```
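
For ranking a larger pool of passages, `sentence_transformers.util.semantic_search` performs the same cosine-similarity ranking in chunks and returns the top-k hits per query. A short usage sketch, reusing the embeddings from the snippet above:

```python
from sentence_transformers.util import semantic_search

# `hits` holds one list per query; each entry is a dict
# with "corpus_id" and "score" keys, sorted by score.
hits = semantic_search(queries_emb, answers_emb, top_k=2)
for hit in hits[0]:
    print(f"{hit['score']:.4f}", answers[hit['corpus_id']])
```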

## Evaluation Results
The model achieves an **NDCG@10** of **58.30** on the Polish Information Retrieval Benchmark. See the [PIRB Leaderboard](https://huggingface.co/spaces/sdadas/pirb) for detailed results.
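
For reference, NDCG@10 normalizes the discounted cumulative gain of the top 10 retrieved passages by that of an ideal ranking. A standard formulation (not specific to PIRB) is:

```latex
\mathrm{NDCG@10} = \frac{\mathrm{DCG@10}}{\mathrm{IDCG@10}},
\qquad
\mathrm{DCG@10} = \sum_{i=1}^{10} \frac{2^{\mathrm{rel}_i} - 1}{\log_2(i + 1)}
```

where rel_i is the graded relevance of the passage at rank i, and IDCG@10 is the DCG@10 of the best possible ordering.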

## Acknowledgements
This model was trained with the support of an A100 GPU cluster delivered by the Gdansk University of Technology within the TASK center initiative.

## Citation

```bibtex
@inproceedings{dadas2024pirb,
  title={PIRB: A Comprehensive Benchmark of Polish Dense and Hybrid Text Retrieval Methods},
  author={Dadas, Slawomir and Pere{\l}kiewicz, Micha{\l} and Po{\'s}wiata, Rafa{\l}},
  booktitle={Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)},
  pages={12761--12774},
  year={2024}
}
```