Euphemism Detector (V1, English)

An updated multilingual version is available: hasancanbiyik/euphemism-detector-multilingual, fine-tuned on 7 languages (EN/TR/ZH/ES/YO/PL/UK) with 0.808 macro-F1 and zero-shot transfer to 22 additional languages.

Fine-tuned XLM-RoBERTa-base for euphemism disambiguation on English PETs (Potentially Euphemistic Terms). Given a sentence with a marked phrase, the model predicts whether the phrase is used euphemistically or literally.

This model was fine-tuned on the English PETs dataset created by the NLP Lab at Montclair State University, USA.

Performance (English)

| Class       | Precision | Recall | F1   |
|-------------|-----------|--------|------|
| Literal     | 0.81      | 0.83   | 0.82 |
| Euphemistic | 0.88      | 0.86   | 0.87 |
| Macro avg   | 0.84      | 0.84   | 0.84 |

Usage

The model expects input text with [PET_BOUNDARY] tokens marking the target phrase:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import torch.nn.functional as F

tokenizer = AutoTokenizer.from_pretrained("hasancanbiyik/euphemism-detector")
model = AutoModelForSequenceClassification.from_pretrained("hasancanbiyik/euphemism-detector")
model.eval()

text = "My grandmother [PET_BOUNDARY]passed away[PET_BOUNDARY] last Tuesday."
inputs = tokenizer(text, return_tensors="pt", max_length=256, truncation=True)

with torch.no_grad():
    probs = F.softmax(model(**inputs).logits, dim=1).squeeze()

print(f"Euphemistic: {probs[1].item():.1%}")
print(f"Literal:     {probs[0].item():.1%}")
```
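For classifying arbitrary sentence/phrase pairs, the boundary tokens can be inserted programmatically and the two class probabilities reduced to a single label with argmax. A minimal sketch; `mark_pet` and `label_from_logits` are hypothetical helpers (not part of the model's API), and the label order (0 = literal, 1 = euphemistic) matches the example above:

```python
import torch

LABELS = {0: "literal", 1: "euphemistic"}  # label order used in the example above


def mark_pet(sentence: str, phrase: str) -> str:
    """Wrap the first occurrence of `phrase` in [PET_BOUNDARY] tokens (hypothetical helper)."""
    if phrase not in sentence:
        raise ValueError(f"phrase {phrase!r} not found in sentence")
    return sentence.replace(phrase, f"[PET_BOUNDARY]{phrase}[PET_BOUNDARY]", 1)


def label_from_logits(logits: torch.Tensor) -> str:
    """Map a (1, 2) logits tensor to its predicted class name."""
    return LABELS[int(logits.argmax(dim=-1).item())]


marked = mark_pet("He let her go after ten years of service.", "let her go")
print(marked)

# Dummy logits stand in for model(**inputs).logits here.
print(label_from_logits(torch.tensor([[-0.4, 1.3]])))
```

The marked string can then be passed to the tokenizer exactly as in the example above.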

Updated Version

For multilingual support (7 training languages + zero-shot transfer to 22 additional languages), batch prediction, and improved performance, see the V2 model:

hasancanbiyik/euphemism-detector-multilingual

Research Context

  • Biyik, H. C., Lee, P., & Feldman, A. (2024). Turkish Delights: A Dataset on Turkish Euphemisms. SIGTURK at ACL 2024. arXiv:2407.13040
  • Biyik, H. C., Barak, L., Peng, J., & Feldman, A. (2026). When Semantic Overlap Is Not Enough: Cross-Lingual Euphemism Transfer Between Turkish and English. SIGTURK at EACL 2026. arXiv:2602.16957

License

MIT
