Instructions for using sdadas/polish-reranker-roberta-v2 with libraries, inference providers, notebooks, and local apps. The sections below show how to get started.
- Libraries
- sentence-transformers
How to use sdadas/polish-reranker-roberta-v2 with sentence-transformers:
from sentence_transformers import CrossEncoder

model = CrossEncoder("sdadas/polish-reranker-roberta-v2")

query = "Which planet is known as the Red Planet?"
passages = [
    "Venus is often called Earth's twin because of its similar size and proximity.",
    "Mars, known for its reddish appearance, is often referred to as the Red Planet.",
    "Jupiter, the largest planet in our solar system, has a prominent red spot.",
    "Saturn, famous for its rings, is sometimes mistaken for the Red Planet."
]

scores = model.predict([(query, passage) for passage in passages])
print(scores)
- Transformers
How to use sdadas/polish-reranker-roberta-v2 with Transformers:
# Load model directly
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("sdadas/polish-reranker-roberta-v2")
model = AutoModelForSequenceClassification.from_pretrained("sdadas/polish-reranker-roberta-v2")
- Notebooks
- Google Colab
- Kaggle
polish-reranker-roberta-v2
This is an improved reranker based on sdadas/polish-roberta-large-v2, trained with the RankNet loss on a large dataset of text pairs. The model was trained in the same way and on the same data as sdadas/polish-reranker-large-ranknet, but predictions from BAAI/bge-reranker-v2.5-gemma2-lightweight were used as the distillation teacher instead of unicamp-dl/mt5-13b-mmarco-100k.
Our reranker achieves results close to BAAI/bge-reranker-v2.5-gemma2-lightweight on the PIRB benchmark, even outperforming it on some datasets. At the same time, it is over 21 times smaller (435M vs. 9.24B parameters).
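For readers unfamiliar with the objective: RankNet is a pairwise loss, so distillation here means teaching the student to reproduce the teacher's pairwise ordering of passages rather than its raw scores. Below is a minimal PyTorch sketch of that idea, not the actual training code; the function name and per-query score tensors are illustrative.

import torch
import torch.nn.functional as F

def ranknet_distillation_loss(student_scores: torch.Tensor, teacher_scores: torch.Tensor) -> torch.Tensor:
    # student_scores, teacher_scores: shape (N,), one score per candidate passage for a single query.
    # Pairwise score differences s_i - s_j for the student and the teacher.
    student_diff = student_scores.unsqueeze(1) - student_scores.unsqueeze(0)  # (N, N)
    teacher_diff = teacher_scores.unsqueeze(1) - teacher_scores.unsqueeze(0)  # (N, N)
    # Keep only the pairs (i, j) that the teacher orders as "i more relevant than j".
    pair_mask = teacher_diff > 0
    if not pair_mask.any():
        return student_scores.new_zeros(())
    # RankNet: binary cross-entropy on the student's pairwise margins, with target 1,
    # i.e. the passage preferred by the teacher should receive the higher student score.
    logits = student_diff[pair_mask]
    return F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))

In this setup, teacher_scores would come from BAAI/bge-reranker-v2.5-gemma2-lightweight and student_scores from the RoBERTa cross-encoder being trained.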
Usage (Hugging Face Transformers)
The model can be used with Hugging Face Transformers in the following way:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import numpy as np
query = "Jak dożyć 100 lat?"  # "How to live to be 100?"
answers = [
    "Trzeba zdrowo się odżywiać i uprawiać sport.",  # "You need to eat healthily and do sports."
    "Trzeba pić alkohol, imprezować i jeździć szybkimi autami.",  # "You need to drink alcohol, party and drive fast cars."
    "Gdy trwała kampania politycy zapewniali, że rozprawią się z zakazem niedzielnego handlu."  # "During the campaign, politicians assured they would deal with the Sunday trading ban."
]
model_name = "sdadas/polish-reranker-roberta-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    dtype=torch.bfloat16,  # on older transformers releases, use torch_dtype= instead of dtype=
    device_map="cuda"
)
texts = [f"{query}</s></s>{answer}" for answer in answers]  # join each query-answer pair with RoBERTa's separator tokens
tokens = tokenizer(texts, padding="longest", max_length=512, truncation=True, return_tensors="pt").to("cuda")
output = model(**tokens)
results = output.logits.detach().cpu().float().numpy()
results = np.squeeze(results)
print(results.tolist())
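The model outputs one logit per query-answer pair; as with typical cross-encoder rerankers, a higher score indicates a better match. A quick way to order the candidates by these scores (an illustrative follow-up, not part of the original example):

# Sort the answers from most to least relevant according to the reranker scores.
for score, answer in sorted(zip(results.tolist(), answers), key=lambda pair: pair[0], reverse=True):
    print(f"{score:.2f}  {answer}")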
Evaluation Results
The model achieves NDCG@10 of 65.30 in the Rerankers category of the Polish Information Retrieval Benchmark (PIRB). See the PIRB Leaderboard for detailed results.
Citation
@article{dadas2024assessing,
title={Assessing generalization capability of text ranking models in Polish},
author={Sławomir Dadas and Małgorzata Grębowiec},
year={2024},
eprint={2402.14318},
archivePrefix={arXiv},
primaryClass={cs.CL}
}