# Euphemism Detector – Multilingual
Fine-tuned XLM-RoBERTa-base for euphemism disambiguation across 7 languages. Given a sentence with a marked phrase, the model predicts whether the phrase is used euphemistically (e.g., "passed away" meaning death) or literally (e.g., "the ball passed away from the goalkeeper").
- Live demo: HuggingFace Spaces
- GitHub: hasancanbiyik/euphemism-detector
## Model Description
Euphemisms are context-dependent: the same phrase can be euphemistic in one context and literal in another. This model detects that distinction by learning contextual signals around Potentially Euphemistic Terms (PETs), which are marked with special `[PET_BOUNDARY]` tokens in the input.
The model was fine-tuned on 19,490 labeled examples across 7 languages, with class-weighted loss to handle label imbalance, fp16 mixed-precision training, and early stopping.
## Training Languages
| Language | Examples | Euph/Lit Ratio | Macro-F1 |
|---|---|---|---|
| English | 3,098 | 1.5:1 | 0.800 |
| Turkish | 2,436 | 1.5:1 | 0.760 |
| Chinese (Mandarin) | 3,211 | 2.2:1 | 0.834 |
| Spanish | 2,952 | 2.0:1 | 0.828 |
| Yoruba | 2,598 | 1.9:1 | 0.840 |
| Polish | 2,439 | 1.0:1 | 0.810 |
| Ukrainian | 2,776 | 3.3:1 | 0.777 |
| Overall | 19,490 |  | 0.808 |
## Zero-Shot Cross-Lingual Transfer
The model was evaluated on 22 unseen languages across 13 language families using curated minimal-pair benchmarks (synthetic, LLM-generated; see Limitations). Results demonstrate broad cross-lingual transfer, with 14 of 22 languages exceeding 0.70 macro-F1 and 6 exceeding 0.85.
| Language | Family | F1 | n |
|---|---|---|---|
| Portuguese | Romance | 0.906 | 11 |
| Indonesian | Austronesian | 0.899 | 10 |
| Swedish | Germanic | 0.899 | 10 |
| Hebrew | Semitic | 0.899 | 10 |
| Danish | Germanic | 0.883 | 9 |
| Hindi | Indo-Aryan | 0.862 | 9 |
| German | Germanic | 0.844 | 14 |
| Italian | Romance | 0.829 | 12 |
| Korean | Koreanic | 0.804 | 10 |
| Hungarian | Uralic | 0.800 | 10 |
| Romanian | Romance | 0.800 | 9 |
| Arabic | Semitic | 0.792 | 10 |
| Armenian | Armenian | 0.733 | 4 |
| French | Romance | 0.708 | 14 |
| Vietnamese | Austroasiatic | 0.697 | 10 |
| Dutch | Germanic | 0.697 | 10 |
| Japanese | Japonic | 0.694 | 11 |
| Czech | Slavic | 0.670 | 10 |
| Russian | Slavic | 0.670 | 10 |
| Greek | Hellenic | 0.600 | 10 |
| Swahili | Bantu | 0.600 | 10 |
| Finnish | Uralic | 0.550 | 9 |
Key finding: Transfer strength correlates more with euphemistic semantic category than with language family alone. Appearance euphemisms (F1: 1.00) and death euphemisms (F1: 0.79) transfer best across all language families. Strong transfer was observed to typologically distant languages (Arabic/Semitic: 0.79, Korean/Koreanic: 0.80, Indonesian/Austronesian: 0.90), while some typologically close languages showed weaker transfer (Czech/Slavic: 0.67 despite Polish and Ukrainian in training).
## How to Use
### Input Format
The model expects input text with `[PET_BOUNDARY]` tokens marking the potentially euphemistic term:

```
My grandmother [PET_BOUNDARY]passed away[PET_BOUNDARY] last Tuesday.
```
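If your data stores the sentence and the phrase separately (as the deployed API does), the markers can be inserted programmatically. A minimal sketch; `mark_pet` is a hypothetical helper, not part of any released package:

```python
def mark_pet(sentence: str, phrase: str) -> str:
    """Wrap the first occurrence of `phrase` in [PET_BOUNDARY] tokens."""
    # Case-insensitive search so "Let Go" matches the phrase "let go".
    idx = sentence.lower().find(phrase.lower())
    if idx == -1:
        raise ValueError(f"phrase {phrase!r} not found in sentence")
    end = idx + len(phrase)
    return (sentence[:idx] + "[PET_BOUNDARY]" + sentence[idx:end]
            + "[PET_BOUNDARY]" + sentence[end:])

print(mark_pet("My grandmother passed away last Tuesday.", "passed away"))
# My grandmother [PET_BOUNDARY]passed away[PET_BOUNDARY] last Tuesday.
```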
### Python

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import torch.nn.functional as F

model_name = "hasancanbiyik/euphemism-detector-multilingual"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

text = "My grandmother [PET_BOUNDARY]passed away[PET_BOUNDARY] last Tuesday."
inputs = tokenizer(text, return_tensors="pt", max_length=256, truncation=True)
with torch.no_grad():
    outputs = model(**inputs)

probs = F.softmax(outputs.logits, dim=1).squeeze()
print(f"Euphemistic: {probs[1].item():.1%}")
print(f"Literal: {probs[0].item():.1%}")
# Output: Euphemistic: 95.2%, Literal: 4.8%
```
### API (via deployed Space)

```bash
curl -X POST https://hasancanbiyik-euphemism-detector.hf.space/predict \
  -H "Content-Type: application/json" \
  -d '{"sentence": "He was let go from the company.", "phrase": "let go"}'
```
### Batch Prediction

The deployed API supports CSV batch prediction:

```bash
curl -X POST https://hasancanbiyik-euphemism-detector.hf.space/batch/predict \
  -F "file=@examples.csv"
```

CSV format: `sentence` and `phrase` columns.
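A batch file with those two columns can be produced with the standard library. A sketch; the example rows are illustrative, not from the dataset:

```python
import csv

# Each row pairs a sentence with the phrase to be tested in context.
rows = [
    {"sentence": "He was let go from the company.", "phrase": "let go"},
    {"sentence": "Let go of the rope before it burns you.", "phrase": "let go"},
]
with open("examples.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["sentence", "phrase"])
    writer.writeheader()
    writer.writerows(rows)
```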
## Training Details

### Training Procedure
- Base model: `xlm-roberta-base` (Conneau et al., 2020)
- Task: Binary classification (euphemistic vs. literal)
- Special tokens: `[PET_BOUNDARY]` added to vocabulary (vocab size: 250,003)
- Loss: Cross-entropy with class weights (literal: 1.392, euphemistic: 0.780)
- Optimizer: AdamW
- Learning rate: 1e-5
- Batch size: 32
- Max epochs: 30 (early stopped at epoch 15)
- Early stopping patience: 5 (on validation macro-F1)
- Mixed precision: fp16
- Max sequence length: 256 tokens
- Train/Val/Test split: 80/10/10, stratified by language and label
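The class weights above are consistent with the standard "balanced" scheme, w_c = N / (K · n_c). In the sketch below the per-class counts are inferred back from the published weights (an assumption, not figures from the dataset release):

```python
# Balanced class weights: w_c = N / (K * n_c), with K = number of classes.
# Counts inferred from the published weights (assumption): they sum to 19,490.
n_literal, n_euphemistic = 7_001, 12_489
N, K = n_literal + n_euphemistic, 2

w_literal = N / (K * n_literal)
w_euphemistic = N / (K * n_euphemistic)
print(f"literal: {w_literal:.3f}, euphemistic: {w_euphemistic:.3f}")
# literal: 1.392, euphemistic: 0.780
```

In a PyTorch training loop these would typically be passed as `torch.nn.CrossEntropyLoss(weight=torch.tensor([w_literal, w_euphemistic]))`, so the rarer literal class contributes more to the loss.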
### Data Preprocessing
Seven datasets in three different schemas were unified into a single training format:

- English, Chinese, Spanish, Yoruba: Already in `text,label` format with `[PET_BOUNDARY]` markers
- Turkish: Required `[PET BOUNDARY]` → `[PET_BOUNDARY]` normalization
- Polish: Three-column context format (left/sentence/right) with a separate PET column → required concatenation and PET boundary insertion; 58 unrecoverable rows dropped
- Ukrainian: Angle-bracket `<PET>` markers converted to `[PET_BOUNDARY]`; emojis stripped; "war" category downsampled from 4,743 to 500 to prevent category dominance
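The Turkish and Ukrainian marker fixes can be sketched as a small normalization pass. The exact Ukrainian markup (paired `<PET>…</PET>` vs. two bare `<PET>` tokens) is an assumption here:

```python
import re

def normalize_pet_markers(text: str) -> str:
    """Map legacy PET marker variants onto the canonical [PET_BOUNDARY] token."""
    text = text.replace("[PET BOUNDARY]", "[PET_BOUNDARY]")  # Turkish: missing underscore
    text = re.sub(r"</?PET>", "[PET_BOUNDARY]", text)        # Ukrainian: angle brackets
    return text

print(normalize_pet_markers("He <PET>passed away</PET> last year."))
# He [PET_BOUNDARY]passed away[PET_BOUNDARY] last year.
```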
## Behavioral Testing
A 26-test behavioral QA suite validates model robustness:
- Known euphemistic/literal pairs: 12 tests (all pass)
- Negation robustness: 2 tests (expected failure: negation context overwhelms the euphemistic signal; documented as a known limitation)
- Boundary token edge cases: 4 tests (all pass)
- Cross-lingual consistency: 2 tests (all pass)
- Confidence calibration: 2 tests (all pass)
- Surface invariance (case, punctuation, whitespace): 4 tests (all pass)
Result: 23 passed, 3 xfail (documented limitations)
## Limitations
- Zero-shot evaluation uses synthetic test data: The 22-language cross-lingual benchmark was curated using LLM-generated examples (Gemini), not native-speaker-validated data. Distributional overlap between the evaluation data and XLM-R's pretraining corpus may inflate zero-shot performance estimates. Native-speaker validation is required before deployment claims can be made for unseen languages.
- Small zero-shot sample sizes: 9β14 examples per unseen language. Per-language F1 scores have wide confidence intervals and should be interpreted as preliminary estimates.
- Negation sensitivity: The model tends to classify negated euphemisms as literal (e.g., "He didn't pass away"). Negation provides strong literal-context signals that overwhelm the euphemistic sense of the marked phrase.
- Ukrainian class imbalance: Ukrainian data has a 3.3:1 euphemistic/literal ratio even after downsampling, which may affect per-language calibration.
- Culture-specific euphemisms: The model performs best on universal euphemistic categories (death, appearance) and may underperform on culture-specific euphemisms without cross-lingual parallels in the training data.
- Low-frequency PETs: Rare or archaic euphemisms (e.g., "powder her nose") may be classified with low confidence or incorrectly.
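The "wide confidence intervals" caveat can be made concrete. Treating per-example correctness on a 10-item benchmark as a binomial proportion (a rough proxy for F1, an approximation), a Wilson score interval shows how little a 9/10 result actually pins down:

```python
from math import sqrt

def wilson_ci(p: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a proportion p over n items."""
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

lo, hi = wilson_ci(0.9, 10)  # e.g. 9 of 10 benchmark items correct
print(f"95% CI: [{lo:.2f}, {hi:.2f}]")
# 95% CI: [0.60, 0.98]
```

With n around 10, per-language scores in the zero-shot table are compatible with true performance anywhere from mediocre to near-perfect, which is why they should be read as preliminary estimates.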
## Research Context
This model was developed as part of ongoing NLP research on cross-lingual euphemism detection:
Biyik, H. C., Barak, L., Peng, J., & Feldman, A. (2026). When Semantic Overlap Is Not Enough: Cross-Lingual Euphemism Transfer Between Turkish and English. SIGTURK at EACL 2026, Rabat, Morocco. arXiv:2602.16957
Biyik, H. C., Lee, P., & Feldman, A. (2024). Turkish Delights: A Dataset on Turkish Euphemisms. SIGTURK at ACL 2024, Bangkok, Thailand. arXiv:2407.13040
Lee, P., et al. (2024). MEDs for PETs: Multilingual Euphemism Disambiguation for Potentially Euphemistic Terms. Findings of EACL 2024. ACL Anthology
The zero-shot cross-lingual evaluation extends Section 6 ("Future Work") of Lee et al. (2024), which called for testing additional languages from diverse language families.
## Citation
```bibtex
@inproceedings{biyik2026semantic,
  title={When Semantic Overlap Is Not Enough: Cross-Lingual Euphemism Transfer Between Turkish and English},
  author={Biyik, Hasan Can and Barak, Libby and Peng, Jing and Feldman, Anna},
  booktitle={Proceedings of SIGTURK at EACL 2026},
  year={2026},
  address={Rabat, Morocco}
}
```
## License
MIT