---
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
- information-retrieval
language: pl
license: apache-2.0
widget:
- source_sentence: "query: Jak dożyć 100 lat?"
  sentences:
  - "passage: Trzeba zdrowo się odżywiać i uprawiać sport."
  - "passage: Trzeba pić alkohol, imprezować i jeździć szybkimi autami."
  - "passage: Gdy trwała kampania politycy zapewniali, że rozprawią się z zakazem niedzielnego handlu."
---

<h1 align="center">MMLW-retrieval-e5-small</h1>

MMLW (muszę mieć lepszą wiadomość, Polish for "I must have a better message") are neural text encoders for Polish.
This model is optimized for information retrieval tasks. It can transform queries and passages into 384-dimensional vectors.
The model was developed using a two-step procedure:
- In the first step, it was initialized with a multilingual E5 checkpoint and then trained with the [multilingual knowledge distillation method](https://aclanthology.org/2020.emnlp-main.365/) on a diverse corpus of 60 million Polish-English text pairs. We utilised [English FlagEmbeddings (BGE)](https://huggingface.co/BAAI/bge-small-en) as teacher models for distillation.
- The second step involved fine-tuning the obtained models with a contrastive loss on the [Polish MS MARCO](https://huggingface.co/datasets/clarin-knext/msmarco-pl) training split. In order to improve the efficiency of contrastive training, we used large batch sizes: 1152 for small, 768 for base, and 288 for large models. Fine-tuning was conducted on a cluster of 12 A100 GPUs. A minimal sketch of this step is shown below.
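
Below is a simplified, illustrative sketch of the contrastive fine-tuning step using the sentence-transformers training API. The example pair, batch size, and hyperparameters are placeholders rather than the configuration actually used for this model, and `MultipleNegativesRankingLoss` is shown only as a common contrastive objective with in-batch negatives; the exact loss used by the authors may differ.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# In practice this would be the distilled checkpoint from step 1;
# here we start directly from the multilingual E5 base model.
model = SentenceTransformer("intfloat/multilingual-e5-small")

# Placeholder (query, positive passage) pairs; the real training data is the
# Polish MS MARCO training split (clarin-knext/msmarco-pl).
train_examples = [
    InputExample(texts=[
        "query: Jak dożyć 100 lat?",
        "passage: Trzeba zdrowo się odżywiać i uprawiać sport.",
    ]),
]

# In-batch negatives: every other passage in the batch acts as a negative for a
# given query, which is why large batch sizes (1152 for the small model) help.
train_loader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_loader, train_loss)], epochs=1, warmup_steps=100)
```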

⚠️ **2023-12-26:** We have updated the model to a new version with improved results. You can still download the previous version using the **v1** tag: `AutoModel.from_pretrained("sdadas/mmlw-retrieval-e5-small", revision="v1")` ⚠️

## Usage (Sentence-Transformers)

⚠️ Our dense retrievers require the use of specific prefixes when encoding texts. For this model, queries should be prefixed with **"query: "** and passages with **"passage: "**. ⚠️

You can use the model like this with [sentence-transformers](https://www.SBERT.net):

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

# Queries and passages must be prepended with the appropriate prefixes.
query_prefix = "query: "
answer_prefix = "passage: "
queries = [query_prefix + "Jak dożyć 100 lat?"]
answers = [
    answer_prefix + "Trzeba zdrowo się odżywiać i uprawiać sport.",
    answer_prefix + "Trzeba pić alkohol, imprezować i jeździć szybkimi autami.",
    answer_prefix + "Gdy trwała kampania politycy zapewniali, że rozprawią się z zakazem niedzielnego handlu."
]
model = SentenceTransformer("sdadas/mmlw-retrieval-e5-small")
queries_emb = model.encode(queries, convert_to_tensor=True, show_progress_bar=False)
answers_emb = model.encode(answers, convert_to_tensor=True, show_progress_bar=False)

# Pick the passage with the highest cosine similarity to the query.
best_answer = cos_sim(queries_emb, answers_emb).argmax().item()
print(answers[best_answer])
# Trzeba zdrowo się odżywiać i uprawiać sport.
```
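
If you prefer to encode texts with the plain transformers library, the sketch below is one possible way to do it. It assumes mean pooling over token embeddings followed by L2 normalization, as in the multilingual E5 family; please verify this against the pooling configuration shipped with the model before relying on it.

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

def mean_pool(last_hidden_state, attention_mask):
    # Average the token embeddings, ignoring padding positions.
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    return (last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

tokenizer = AutoTokenizer.from_pretrained("sdadas/mmlw-retrieval-e5-small")
model = AutoModel.from_pretrained("sdadas/mmlw-retrieval-e5-small")

texts = [
    "query: Jak dożyć 100 lat?",
    "passage: Trzeba zdrowo się odżywiać i uprawiać sport.",
]
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    output = model(**batch)

embeddings = F.normalize(mean_pool(output.last_hidden_state, batch["attention_mask"]), p=2, dim=-1)
print(embeddings[0] @ embeddings[1])  # cosine similarity between query and passage
```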

## Evaluation Results

The model achieves an **NDCG@10** of **52.34** on the Polish Information Retrieval Benchmark. See the [PIRB Leaderboard](https://huggingface.co/spaces/sdadas/pirb) for detailed results.
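
For reference, the snippet below is a minimal sketch of how NDCG@10 can be computed for a single query under one common formulation with binary relevance labels. It only illustrates the metric and is not the PIRB evaluation code.

```python
import math

def ndcg_at_10(ranked_relevances):
    """ranked_relevances: relevance labels of retrieved passages, in rank order."""
    top = ranked_relevances[:10]
    dcg = sum(rel / math.log2(rank + 2) for rank, rel in enumerate(top))
    ideal = sorted(ranked_relevances, reverse=True)[:10]
    idcg = sum(rel / math.log2(rank + 2) for rank, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Example: the single relevant passage was retrieved at rank 2.
print(ndcg_at_10([0, 1, 0, 0]))  # ≈ 0.63
```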

## Acknowledgements
This model was trained with the support of the A100 GPU cluster provided by the Gdansk University of Technology within the TASK center initiative.

## Citation

```bibtex
@article{dadas2024pirb,
  title={{PIRB}: A Comprehensive Benchmark of Polish Dense and Hybrid Text Retrieval Methods},
  author={Sławomir Dadas and Michał Perełkiewicz and Rafał Poświata},
  year={2024},
  eprint={2402.13350},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```