How to use cointegrated/rubert-base-cased-dp-paraphrase-detection with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline
pipe = pipeline("text-classification", model="cointegrated/rubert-base-cased-dp-paraphrase-detection")
# Load model directly
from transformers import AutoTokenizer, AutoModelForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained("cointegrated/rubert-base-cased-dp-paraphrase-detection")
model = AutoModelForSequenceClassification.from_pretrained("cointegrated/rubert-base-cased-dp-paraphrase-detection")
This is a version of the paraphrase detector by DeepPavlov (details in the documentation), ported to the Transformers format.
All credit goes to the authors of DeepPavlov.
The model has been trained on the dataset from http://paraphraser.ru/.
It classifies pairs of texts as paraphrases (class 1) or non-paraphrases (class 0).
import torch
from transformers import AutoModelForSequenceClassification, BertTokenizer
model_name = 'cointegrated/rubert-base-cased-dp-paraphrase-detection'
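# Use the GPU when one is available; otherwise fall back to the CPU
model = AutoModelForSequenceClassification.from_pretrained(model_name).to('cuda' if torch.cuda.is_available() else 'cpu')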
tokenizer = BertTokenizer.from_pretrained(model_name)
def compare_texts(text1, text2):
    batch = tokenizer(text1, text2, return_tensors='pt').to(model.device)
    with torch.inference_mode():
        proba = torch.softmax(model(**batch).logits, -1).cpu().numpy()
    return proba[0]  # p(non-paraphrase), p(paraphrase)
# "The weather outside is nice today" vs. "The weather outside is awful today"
print(compare_texts('Сегодня на улице хорошая погода', 'Сегодня на улице отвратительная погода'))
# [0.7056226 0.2943774]
# "The weather outside is nice today" vs. "Great weather we're having today"
print(compare_texts('Сегодня на улице хорошая погода', 'Отличная погодка сегодня выдалась'))
# [0.16524374 0.8347562 ]
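The probability pairs printed above can be reduced to a class label (1 for paraphrase, 0 for non-paraphrase). The sketch below is one way to do that; the helper name `paraphrase_label` and the 0.5 threshold are assumptions for illustration and are not part of the original model card.

```python
def paraphrase_label(proba, threshold=0.5):
    """Return 1 (paraphrase) if p(paraphrase) reaches the threshold, else 0.

    `proba` is a pair (p(non-paraphrase), p(paraphrase)), as returned by
    compare_texts. The 0.5 threshold is an assumed default, not a tuned value.
    """
    p_paraphrase = float(proba[1])
    return int(p_paraphrase >= threshold)


# Probabilities copied from the two example outputs above
print(paraphrase_label([0.7056226, 0.2943774]))   # 0
print(paraphrase_label([0.16524374, 0.8347562]))  # 1
```

A stricter threshold trades recall for precision; a suitable value would have to be chosen on held-out data.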
P.S. In the DeepPavlov repository, the tokenizer uses max_seq_length=64.
The tokenizer of this model, however, uses model_max_length=512.
Results on long texts may therefore be unreliable.
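One possible way to stay closer to the DeepPavlov setting is to truncate each text pair to 64 tokens at encoding time. The sketch below uses the standard `truncation` and `max_length` arguments of Transformers tokenizers; the helper name `encode_pair_truncated` is hypothetical, and the source does not confirm that truncation actually improves results on long texts.

```python
def encode_pair_truncated(tokenizer, text1, text2, max_length=64):
    """Encode a text pair, truncating it to at most `max_length` tokens.

    `tokenizer` is expected to be a Transformers tokenizer such as the
    BertTokenizer loaded above. The default of 64 mirrors the
    max_seq_length quoted for DeepPavlov and is an assumption here.
    """
    return tokenizer(
        text1,
        text2,
        truncation=True,        # cut the pair down when it is too long
        max_length=max_length,  # total token budget for the whole pair
        return_tensors='pt',
    )
```

Inside `compare_texts`, the tokenizer call could then become `batch = encode_pair_truncated(tokenizer, text1, text2).to(model.device)`.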