How to use JakubJanusz/roberta_large_v2_ownRep with sentence-transformers:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("JakubJanusz/roberta_large_v2_ownRep", trust_remote_code=True)

sentences = [
    "[query]: Jak dożyć 100 lat?",
    "Trzeba zdrowo się odżywiać i uprawiać sport.",
    "Trzeba pić alkohol, imprezować i jeździć szybkimi autami.",
    "Gdy trwała kampania politycy zapewniali, że rozprawią się z zakazem niedzielnego handlu."
]
embeddings = model.encode(sentences)

similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [4, 4]
```

How to use JakubJanusz/roberta_large_v2_ownRep with Transformers:

```python
# Load model directly
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("JakubJanusz/roberta_large_v2_ownRep", trust_remote_code=True)
model = AutoModel.from_pretrained("JakubJanusz/roberta_large_v2_ownRep", trust_remote_code=True)
```

pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
- information-retrieval
language: pl
license: gemma
widget:
- source_sentence: "[query]: Jak dożyć 100 lat?"
  sentences:
  - "Trzeba zdrowo się odżywiać i uprawiać sport."
  - "Trzeba pić alkohol, imprezować i jeździć szybkimi autami."
  - "Gdy trwała kampania politycy zapewniali, że rozprawią się z zakazem niedzielnego handlu."

<h1 align="center">MMLW-retrieval-roberta-large-v2</h1>

MMLW (muszę mieć lepszą wiadomość, "I must have a better message") models are neural text encoders for Polish. This second version is based on the same foundation model as the first ([polish-roberta-large-v2](https://huggingface.co/sdadas/polish-roberta-large-v2)), but its training incorporated modern LLM-based English retrievers and rerankers, which improved results.

This model is optimized for information retrieval. It transforms queries and passages into 1024-dimensional vectors.

The model was developed using a two-step procedure:
- In the first step, it was initialized with the Polish RoBERTa checkpoint and then trained with the [multilingual knowledge distillation method](https://aclanthology.org/2020.emnlp-main.365/) on a diverse corpus of 20 million Polish-English text pairs. We used [stella_en_1.5B_v5](https://huggingface.co/NovaSearch/stella_en_1.5B_v5) as the teacher model for distillation.
- In the second step, the model was fine-tuned with a contrastive loss on a dataset of over 4 million queries. Positive and negative passages for each query were selected with the help of the [BAAI/bge-reranker-v2.5-gemma2-lightweight](https://huggingface.co/BAAI/bge-reranker-v2.5-gemma2-lightweight) reranker.

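
The two training objectives above can be illustrated with a small, dependency-free sketch. It is not the training code used for this model. In the cited distillation method, the student is trained to reproduce the teacher's embedding of the English sentence, which is shown here as a mean squared error. The card does not say which contrastive loss variant was used, so the second function assumes a common InfoNCE-style form. The function names, the temperature value, and the toy vectors are all hypothetical.

```python
import math

def distillation_mse(student_vec, teacher_vec):
    """Mean squared error between a student embedding and the teacher embedding."""
    assert len(student_vec) == len(teacher_vec)
    return sum((s - t) ** 2 for s, t in zip(student_vec, teacher_vec)) / len(student_vec)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def info_nce(query_vec, positive_vec, negative_vecs, temperature=0.05):
    """Cross-entropy over scaled cosine similarities; the positive passage is the target.

    The temperature of 0.05 is an assumed illustrative value, not taken from the card.
    """
    logits = [cosine(query_vec, positive_vec) / temperature]
    logits += [cosine(query_vec, n) / temperature for n in negative_vecs]
    # Numerically stable log-sum-exp.
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_sum_exp - logits[0]

# Toy 3-dimensional vectors (the real model produces 1024-dimensional ones).
teacher_en = [0.2, 0.1, 0.9]  # teacher embedding of an English sentence
student_pl = [0.1, 0.1, 0.8]  # student embedding of its Polish translation
step1_loss = distillation_mse(student_pl, teacher_en)

query = [1.0, 0.0, 0.0]
positive = [0.9, 0.1, 0.0]
negatives = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
step2_loss = info_nce(query, positive, negatives)

print(step1_loss, step2_loss)
```

The contrastive loss is small when the query is closer to its positive passage than to any negative, and grows when a negative scores higher.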
## Usage (Sentence-Transformers)

The model supports both information retrieval and semantic textual similarity. For retrieval, prefix queries with **"[query]: "**. For symmetric tasks such as semantic similarity, prefix both texts with **"[sts]: "**.

Please note that the model uses a custom implementation, so you should pass the `trust_remote_code=True` argument when loading it.

It is also recommended to use Flash Attention 2, which can be enabled with the `attn_implementation` argument.

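
Flash Attention 2 needs a supported GPU and 16-bit weights, so it can help to decide the loading arguments in one place. The sketch below only builds the keyword arguments. `build_load_kwargs` is a hypothetical helper, and it assumes the model also loads with its default attention implementation when `attn_implementation` is omitted, which the card implies by calling Flash Attention "recommended" but does not state outright.

```python
def build_load_kwargs(use_flash_attention: bool, device: str = "cuda") -> dict:
    """Return keyword arguments for SentenceTransformer(...) (hypothetical helper)."""
    kwargs = {"trust_remote_code": True, "device": device}
    if use_flash_attention:
        # Same model_kwargs as in the card's own loading example.
        kwargs["model_kwargs"] = {
            "attn_implementation": "flash_attention_2",
            "trust_remote_code": True,
        }
    return kwargs

gpu_kwargs = build_load_kwargs(use_flash_attention=True)
cpu_kwargs = build_load_kwargs(use_flash_attention=False, device="cpu")

# Usage (requires sentence-transformers, not executed here):
# model = SentenceTransformer("sdadas/mmlw-retrieval-roberta-large-v2", **gpu_kwargs)
```
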

You can use the model like this with [sentence-transformers](https://www.SBERT.net):

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer(
    "sdadas/mmlw-retrieval-roberta-large-v2",
    trust_remote_code=True,
    device="cuda",
    model_kwargs={"attn_implementation": "flash_attention_2", "trust_remote_code": True}
)
# Flash-Attention works only in 16-bit mode, so we need to cast the model to float16 or bfloat16
model.bfloat16()

# Retrieval example
query_prefix = "[query]: "
queries = [query_prefix + "Jak dożyć 100 lat?"]
answers = [
    "Trzeba zdrowo się odżywiać i uprawiać sport.",
    "Trzeba pić alkohol, imprezować i jeździć szybkimi autami.",
    "Gdy trwała kampania politycy zapewniali, że rozprawią się z zakazem niedzielnego handlu."
]
queries_emb = model.encode(queries, convert_to_tensor=True, show_progress_bar=False)
answers_emb = model.encode(answers, convert_to_tensor=True, show_progress_bar=False)
best_answer = cos_sim(queries_emb, answers_emb).argmax().item()
print(answers[best_answer])

# Semantic similarity example
sim_prefix = "[sts]: "
sentences = [
    sim_prefix + "Trzeba zdrowo się odżywiać i uprawiać sport.",
    sim_prefix + "Warto jest prowadzić zdrowy tryb życia, uwzględniający aktywność fizyczną i dietę.",
    sim_prefix + "One should eat healthy and engage in sports.",
    sim_prefix + "Zakupy potwierdzasz PINem, który bezpiecznie ustalisz podczas aktywacji."
]
emb = model.encode(sentences, convert_to_tensor=True, show_progress_bar=False)
print(cos_sim(emb, emb))
```

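
The retrieval example above returns only the best answer. A minimal sketch of applying the prefixes consistently and ranking every passage is shown below. It uses plain Python lists in place of real embeddings, and `add_prefix` and `rank_by_cosine` are hypothetical helper names, not part of the model or of sentence-transformers.

```python
import math

# Prefixes described in this card: retrieval queries get "[query]: ",
# passages get no prefix, and both texts in a symmetric task get "[sts]: ".
PREFIXES = {"query": "[query]: ", "passage": "", "sts": "[sts]: "}

def add_prefix(texts, kind):
    """Prepend the prefix for the given kind of input to every text."""
    return [PREFIXES[kind] + text for text in texts]

def rank_by_cosine(query_vec, passage_vecs):
    """Return passage indices ordered from most to least similar to the query."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm
    scores = [cosine(query_vec, p) for p in passage_vecs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

queries = add_prefix(["Jak dożyć 100 lat?"], "query")
print(queries[0])

# Toy 2-dimensional vectors standing in for model.encode(...) output.
toy_query = [1.0, 0.0]
toy_passages = [[0.0, 1.0], [1.0, 0.1], [0.5, 0.5]]
order = rank_by_cosine(toy_query, toy_passages)
print(order)
```

With real embeddings, the same ordering can be obtained by sorting the scores returned by `cos_sim` instead of taking only the `argmax`.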
## Evaluation Results

The model achieves an **NDCG@10** of **60.71** on the Polish Information Retrieval Benchmark (PIRB). See the [PIRB Leaderboard](https://huggingface.co/spaces/sdadas/pirb) for detailed results.

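
For readers unfamiliar with the metric, the sketch below computes NDCG@10 for a single query using the standard definition with linear gain. It assumes the input list covers the relevance labels of all judged passages in ranked order. PIRB's exact evaluation settings are not described in this card, and the function names are illustrative. Benchmark scores such as 60.71 are typically the per-query values averaged and multiplied by 100.

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k relevance labels."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(ranked_relevances, k=10):
    """DCG of the given ranking divided by the DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0

# Relevance labels of retrieved passages in ranked order (1 = relevant).
ranking = [0, 1, 0, 1]
score = ndcg_at_k(ranking)
print(round(score, 4))
```
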
## Citation

```bibtex
@inproceedings{dadas2024pirb,
  title={PIRB: A Comprehensive Benchmark of Polish Dense and Hybrid Text Retrieval Methods},
  author={Dadas, Slawomir and Pere{\l}kiewicz, Micha{\l} and Po{\'s}wiata, Rafa{\l}},
  booktitle={Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)},
  pages={12761--12774},
  year={2024}
}
```