---
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
- information-retrieval
language: pl
license: apache-2.0
widget:
- source_sentence: "query: Jak dożyć 100 lat?"
  sentences:
  - "passage: Trzeba zdrowo się odżywiać i uprawiać sport."
  - "passage: Trzeba pić alkohol, imprezować i jeździć szybkimi autami."
  - "passage: Gdy trwała kampania politycy zapewniali, że rozprawią się z zakazem niedzielnego handlu."
---

<h1 align="center">MMLW-retrieval-e5-large</h1>

MMLW (muszę mieć lepszą wiadomość, "I must have a better message") are neural text encoders for Polish.
This model is optimized for information retrieval tasks. It can transform queries and passages into 1024-dimensional vectors.
The model was developed using a two-step procedure:
- In the first step, it was initialized with a multilingual E5 checkpoint and then trained with the [multilingual knowledge distillation method](https://aclanthology.org/2020.emnlp-main.365/) on a diverse corpus of 60 million Polish-English text pairs. We utilised [English FlagEmbeddings (BGE)](https://huggingface.co/BAAI/bge-large-en) as teacher models for distillation; a sketch of this objective follows the list.
- The second step involved fine-tuning the obtained models with a contrastive loss on the [Polish MS MARCO](https://huggingface.co/datasets/clarin-knext/msmarco-pl) training split. To improve the efficiency of contrastive training, we used large batch sizes: 1152 for small, 768 for base, and 288 for large models. Fine-tuning was conducted on a cluster of 12 A100 GPUs.
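
The distillation step can be illustrated with a minimal sketch. This is a simplified illustration of the cited method, not the actual MMLW training code: `teacher_encode` and `student_encode` are hypothetical callables that map a list of texts to a `(batch, dim)` embedding tensor, with gradients enabled on the student side.

```python
import torch
import torch.nn.functional as F

def distillation_loss(teacher_encode, student_encode,
                      en_batch: list[str], pl_batch: list[str]) -> torch.Tensor:
    """Multilingual knowledge distillation (Reimers & Gurevych, 2020)."""
    # The frozen English teacher (here, BGE) embeds only the English side of each pair.
    with torch.no_grad():
        target = teacher_encode(en_batch)
    # The student learns to map both the English sentence and its Polish
    # translation onto the teacher's embedding of the English sentence.
    loss = F.mse_loss(student_encode(en_batch), target)
    loss = loss + F.mse_loss(student_encode(pl_batch), target)
    return loss
```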

⚠️ **2023-12-26:** We have updated the model to a new version with improved results. You can still download the previous version using the **v1** tag: `AutoModel.from_pretrained("sdadas/mmlw-retrieval-e5-large", revision="v1")` ⚠️
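
Recent versions of sentence-transformers also expose a `revision` argument on `SentenceTransformer`, so the old weights can be pinned without dropping down to `AutoModel`; if your installed version predates this argument, use the `AutoModel` call above:

```python
from sentence_transformers import SentenceTransformer

# Load the previous (v1) weights instead of the current revision.
model_v1 = SentenceTransformer("sdadas/mmlw-retrieval-e5-large", revision="v1")
```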

## Usage (Sentence-Transformers)

⚠️ Our dense retrievers require the use of specific prefixes and suffixes when encoding texts. For this model, queries should be prefixed with **"query: "** and passages with **"passage: "**. ⚠️

You can use the model like this with [sentence-transformers](https://www.SBERT.net):

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

# Queries and passages must carry the prefixes expected by the model.
query_prefix = "query: "
answer_prefix = "passage: "
queries = [query_prefix + "Jak dożyć 100 lat?"]  # "How to live to be 100?"
answers = [
    answer_prefix + "Trzeba zdrowo się odżywiać i uprawiać sport.",
    answer_prefix + "Trzeba pić alkohol, imprezować i jeździć szybkimi autami.",
    answer_prefix + "Gdy trwała kampania politycy zapewniali, że rozprawią się z zakazem niedzielnego handlu."
]
model = SentenceTransformer("sdadas/mmlw-retrieval-e5-large")

# Encode both sides as tensors so they can be compared directly.
queries_emb = model.encode(queries, convert_to_tensor=True, show_progress_bar=False)
answers_emb = model.encode(answers, convert_to_tensor=True, show_progress_bar=False)

# Select the passage with the highest cosine similarity to the query.
best_answer = cos_sim(queries_emb, answers_emb).argmax().item()
print(answers[best_answer])
# passage: Trzeba zdrowo się odżywiać i uprawiać sport.
```
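
For ranking a larger pool of passages, `sentence_transformers.util.semantic_search` performs the same cosine-similarity ranking in chunks and returns the top-k hits per query. A short usage sketch, reusing the embeddings from the snippet above:

```python
from sentence_transformers.util import semantic_search

# `hits` holds one list per query; each entry is a dict
# with "corpus_id" and "score" keys, sorted by score.
hits = semantic_search(queries_emb, answers_emb, top_k=2)
for hit in hits[0]:
    print(f"{hit['score']:.4f}", answers[hit['corpus_id']])
```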

## Evaluation Results
The model achieves an **NDCG@10** of **58.30** on the Polish Information Retrieval Benchmark. See the [PIRB Leaderboard](https://huggingface.co/spaces/sdadas/pirb) for detailed results.
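
For reference, NDCG@10 normalizes the discounted cumulative gain of the top 10 retrieved passages by that of an ideal ranking. A standard formulation (not specific to PIRB) is:

```latex
\mathrm{NDCG@10} = \frac{\mathrm{DCG@10}}{\mathrm{IDCG@10}},
\qquad
\mathrm{DCG@10} = \sum_{i=1}^{10} \frac{2^{\mathrm{rel}_i} - 1}{\log_2(i + 1)}
```

where rel_i is the graded relevance of the passage at rank i, and IDCG@10 is the DCG@10 of the best possible ordering.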

## Acknowledgements
This model was trained with the support of an A100 GPU cluster delivered by the Gdansk University of Technology within the TASK center initiative.

## Citation

```bibtex
@inproceedings{dadas2024pirb,
  title={PIRB: A Comprehensive Benchmark of Polish Dense and Hybrid Text Retrieval Methods},
  author={Dadas, Slawomir and Pere{\l}kiewicz, Micha{\l} and Po{\'s}wiata, Rafa{\l}},
  booktitle={Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)},
  pages={12761--12774},
  year={2024}
}
```