	---
	pipeline_tag: sentence-similarity
	tags:
	- sentence-transformers
	- feature-extraction
	- sentence-similarity
	- transformers
	- information-retrieval
	language: pl
	license: gemma
	widget:
	- source_sentence: "[query]: Jak dożyć 100 lat?"
	  sentences:
	  - "Trzeba zdrowo się odżywiać i uprawiać sport."
	  - "Trzeba pić alkohol, imprezować i jeździć szybkimi autami."
	  - "Gdy trwała kampania politycy zapewniali, że rozprawią się z zakazem niedzielnego handlu."

	---

	<h1 align="center">MMLW-retrieval-roberta-large-v2</h1>

	MMLW (muszę mieć lepszą wiadomość, Polish for "I must have a better message") are neural text encoders for Polish. This second version is built on the same foundation model as the first ([polish-roberta-large-v2](https://huggingface.co/sdadas/polish-roberta-large-v2)), but its training process incorporated modern LLM-based English retrievers and rerankers, which led to improved results.
	This model is optimized for information retrieval tasks. It maps queries and passages to 1024-dimensional vectors.
	The model was developed using a two-step procedure:
	- In the first step, it was initialized with the Polish RoBERTa checkpoint and then trained with the [multilingual knowledge distillation method](https://aclanthology.org/2020.emnlp-main.365/) on a diverse corpus of 20 million Polish-English text pairs. We used [stella_en_1.5B_v5](https://huggingface.co/NovaSearch/stella_en_1.5B_v5) as the teacher model for distillation.
	- The second step involved fine-tuning the model with a contrastive loss on a dataset of over 4 million queries. Positive and negative passages for each query were selected with the help of the [BAAI/bge-reranker-v2.5-gemma2-lightweight](https://huggingface.co/BAAI/bge-reranker-v2.5-gemma2-lightweight) reranker.
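
The two training objectives can be illustrated with a minimal pure-Python sketch. This is not the actual training code: the card does not state the exact loss formulation, temperature, or batch composition, so `mse_distillation_loss`, `info_nce_loss`, and the `temperature` default are hypothetical names and values chosen for illustration. The first function follows the general idea of the cited distillation method (the student's embeddings are pulled toward the teacher's embeddings with a mean squared error), and the second is a common InfoNCE-style form of contrastive loss.

```python
import math


def mse_distillation_loss(student_vecs, teacher_vecs):
    """Mean squared error between student and teacher embeddings.

    Assumption: each student vector (for a Polish or English text) is paired
    with the teacher's vector for the corresponding English text.
    """
    total, count = 0.0, 0
    for s, t in zip(student_vecs, teacher_vecs):
        for a, b in zip(s, t):
            total += (a - b) ** 2
            count += 1
    return total / count


def _cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def info_nce_loss(query, positive, negatives, temperature=0.05):
    """InfoNCE-style contrastive loss for one query.

    The loss is small when the query is closer to the positive passage
    than to any of the negatives. The temperature is a hypothetical value.
    """
    logits = [_cosine(query, positive) / temperature]
    logits += [_cosine(query, n) / temperature for n in negatives]
    # Numerically stable negative log-softmax of the positive (index 0).
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[0]
```

In this sketch the reranker's role from the second step would correspond to deciding which passages are passed in as `positive` and `negatives`.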


	## Usage (Sentence-Transformers)


	The model supports both information retrieval and semantic textual similarity. For retrieval, queries should be prefixed with "[query]: ". For symmetric tasks such as semantic similarity, both texts should be prefixed with "[sts]: ".
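
As a small illustration of the prefix convention, the helpers below prepare inputs for the two task types. The function and constant names are hypothetical and not part of the model's API; only the two prefix strings come from this card. Passages are left unprefixed in the retrieval case, matching the usage example in this card.

```python
QUERY_PREFIX = "[query]: "
STS_PREFIX = "[sts]: "


def prepare_retrieval_inputs(queries, passages):
    """Prefix queries only; passages are returned unchanged."""
    return [QUERY_PREFIX + q for q in queries], list(passages)


def prepare_sts_inputs(texts):
    """Prefix every text for symmetric tasks such as semantic similarity."""
    return [STS_PREFIX + t for t in texts]
```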

	Please note that the model uses a custom implementation, so you should pass the `trust_remote_code=True` argument when loading it.
	It is also recommended to use Flash Attention 2, which can be enabled with the `attn_implementation` argument.
	You can use the model like this with [sentence-transformers](https://www.SBERT.net):

	```python
	from sentence_transformers import SentenceTransformer
	from sentence_transformers.util import cos_sim

	model = SentenceTransformer(
	    "sdadas/mmlw-retrieval-roberta-large-v2",
	    trust_remote_code=True,
	    device="cuda",
	    model_kwargs={"attn_implementation": "flash_attention_2", "trust_remote_code": True}
	)
	# Flash-Attention works only in 16-bit mode, so we need to cast the model to float16 or bfloat16
	model.bfloat16()

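	# Retrieval example: only the query gets the "[query]: " prefix
	query_prefix = "[query]: "
	queries = [query_prefix + "Jak dożyć 100 lat?"]  # "How to live to 100?"
	answers = [
	    # "You need to eat healthily and exercise."
	    "Trzeba zdrowo się odżywiać i uprawiać sport.",
	    # "You need to drink alcohol, party, and drive fast cars."
	    "Trzeba pić alkohol, imprezować i jeździć szybkimi autami.",
	    # "During the campaign, politicians promised to deal with the Sunday trading ban."
	    "Gdy trwała kampania politycy zapewniali, że rozprawią się z zakazem niedzielnego handlu."
	]
	queries_emb = model.encode(queries, convert_to_tensor=True, show_progress_bar=False)
	answers_emb = model.encode(answers, convert_to_tensor=True, show_progress_bar=False)
	# cos_sim returns a 1 x 3 similarity matrix; argmax picks the most similar answer
	best_answer = cos_sim(queries_emb, answers_emb).argmax().item()
	print(answers[best_answer])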

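	# Semantic similarity example: every sentence gets the "[sts]: " prefix
	sim_prefix = "[sts]: "
	sentences = [
	    # "You need to eat healthily and exercise."
	    sim_prefix + "Trzeba zdrowo się odżywiać i uprawiać sport.",
	    # "It is worth leading a healthy lifestyle that includes physical activity and diet."
	    sim_prefix + "Warto jest prowadzić zdrowy tryb życia, uwzględniający aktywność fizyczną i dietę.",
	    sim_prefix + "One should eat healthy and engage in sports.",
	    # "You confirm purchases with a PIN that you set securely during activation."
	    sim_prefix + "Zakupy potwierdzasz PINem, który bezpiecznie ustalisz podczas aktywacji."
	]
	emb = model.encode(sentences, convert_to_tensor=True, show_progress_bar=False)
	# Pairwise 4 x 4 cosine similarity matrix
	print(cos_sim(emb, emb))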

	```
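
The retrieval example above prints only the single best answer. To rank all candidates instead, the same cosine-similarity idea can be sketched in plain Python. This is an illustrative sketch, not part of the library: `cosine_similarity` and `rank_passages` are hypothetical helpers, and the toy 3-dimensional vectors stand in for the model's 1024-dimensional embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def rank_passages(query_vec, passage_vecs, top_k=None):
    """Return (index, score) pairs sorted from most to least similar."""
    scored = [(i, cosine_similarity(query_vec, p)) for i, p in enumerate(passage_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored if top_k is None else scored[:top_k]


# Toy vectors (assumption: real embeddings would come from model.encode).
query_vec = [0.9, 0.1, 0.0]
passage_vecs = [[0.8, 0.2, 0.1], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
ranking = rank_passages(query_vec, passage_vecs)
print([index for index, _ in ranking])
```

With real embeddings, the vectors returned by `model.encode` would be converted to lists (or the sort would be done directly on the `cos_sim` output) before ranking.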

	## Evaluation Results

	The model achieves an NDCG@10 of 60.71 on the Polish Information Retrieval Benchmark. See the [PIRB Leaderboard](https://huggingface.co/spaces/sdadas/pirb) for detailed results.
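
For readers unfamiliar with the metric, here is a minimal sketch of how NDCG@10 can be computed for a single query. It uses the common linear-gain formulation; the exact evaluation code used by the benchmark is not described in this card, so treat the function names and details as assumptions. The reported score appears to be on a 0-100 scale, that is, the per-query value below averaged over queries and multiplied by 100.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain: each gain is divided by log2(rank + 1)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, judged_relevances, k=10):
    """NDCG@k for one query.

    ranked_relevances: relevance labels of the retrieved passages, in ranked order.
    judged_relevances: relevance labels of all judged passages for the query,
                       used to build the ideal ranking.
    """
    ideal_dcg = dcg_at_k(sorted(judged_relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0
    return dcg_at_k(ranked_relevances, k) / ideal_dcg
```

For example, a query with one relevant passage retrieved at rank 2 scores `1 / log2(3)`, roughly 0.63, while retrieving it at rank 1 scores 1.0.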


	## Citation

	```bibtex
	@inproceedings{dadas2024pirb,
	  title={PIRB: A Comprehensive Benchmark of Polish Dense and Hybrid Text Retrieval Methods},
	  author={Dadas, Slawomir and Pere{\l}kiewicz, Micha{\l} and Po{\'s}wiata, Rafa{\l}},
	  booktitle={Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)},
	  pages={12761--12774},
	  year={2024}
	}
	```