# Euphemism Detector – Multilingual
Fine-tuned XLM-RoBERTa-base for euphemism disambiguation across 7 languages. Given a sentence with a marked phrase, the model predicts whether the phrase is used euphemistically (e.g., "passed away" meaning death) or literally (e.g., "the ball passed away from the goalkeeper").
- Live demo: HuggingFace Spaces
- GitHub: hasancanbiyik/euphemism-detector
## Model Description
Euphemisms are context-dependent: the same phrase can be euphemistic in one context and literal in another. This model detects that distinction by learning contextual signals around Potentially Euphemistic Terms (PETs), which are marked with special `[PET_BOUNDARY]` tokens in the input.
The model was fine-tuned on 19,490 labeled examples across 7 languages, with class-weighted loss to handle label imbalance, fp16 mixed-precision training, and early stopping.
## Training Languages
| Language | Examples | Euph/Lit Ratio | Macro-F1 |
|---|---|---|---|
| English | 3,098 | 1.5:1 | 0.800 |
| Turkish | 2,436 | 1.5:1 | 0.760 |
| Chinese (Mandarin) | 3,211 | 2.2:1 | 0.834 |
| Spanish | 2,952 | 2.0:1 | 0.828 |
| Yoruba | 2,598 | 1.9:1 | 0.840 |
| Polish | 2,439 | 1.0:1 | 0.810 |
| Ukrainian | 2,776 | 3.3:1 | 0.777 |
| Overall | 19,490 |  | 0.808 |
## Zero-Shot Cross-Lingual Transfer
The model was evaluated on 22 unseen languages across 13 language families using curated minimal-pair benchmarks (synthetic, LLM-generated; see Limitations). Results demonstrate broad cross-lingual transfer, with 14 of 22 languages exceeding 0.70 macro-F1 and 6 exceeding 0.85.
| Language | Family | F1 | n |
|---|---|---|---|
| Portuguese | Romance | 0.906 | 11 |
| Indonesian | Austronesian | 0.899 | 10 |
| Swedish | Germanic | 0.899 | 10 |
| Hebrew | Semitic | 0.899 | 10 |
| Danish | Germanic | 0.883 | 9 |
| Hindi | Indo-Aryan | 0.862 | 9 |
| German | Germanic | 0.844 | 14 |
| Italian | Romance | 0.829 | 12 |
| Korean | Koreanic | 0.804 | 10 |
| Hungarian | Uralic | 0.800 | 10 |
| Romanian | Romance | 0.800 | 9 |
| Arabic | Semitic | 0.792 | 10 |
| Armenian | Armenian | 0.733 | 4 |
| French | Romance | 0.708 | 14 |
| Vietnamese | Austroasiatic | 0.697 | 10 |
| Dutch | Germanic | 0.697 | 10 |
| Japanese | Japonic | 0.694 | 11 |
| Czech | Slavic | 0.670 | 10 |
| Russian | Slavic | 0.670 | 10 |
| Greek | Hellenic | 0.600 | 10 |
| Swahili | Bantu | 0.600 | 10 |
| Finnish | Uralic | 0.550 | 9 |
Key finding: Transfer strength correlates more with euphemistic semantic category than with language family alone. Appearance euphemisms (F1: 1.00) and death euphemisms (F1: 0.79) transfer best across all language families. Strong transfer was observed to typologically distant languages (Arabic/Semitic: 0.79, Korean/Koreanic: 0.80, Indonesian/Austronesian: 0.90), while some typologically close languages showed weaker transfer (Czech/Slavic: 0.67 despite Polish and Ukrainian in training).
## How to Use
### Input Format
The model expects input text with `[PET_BOUNDARY]` tokens marking the potentially euphemistic term:

```
My grandmother [PET_BOUNDARY]passed away[PET_BOUNDARY] last Tuesday.
```
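If your data stores the sentence and the phrase separately (as the deployed API does), the markers can be inserted programmatically. A minimal sketch; `mark_pet` is a hypothetical helper, not part of any released package:

```python
def mark_pet(sentence: str, phrase: str) -> str:
    """Wrap the first occurrence of `phrase` in [PET_BOUNDARY] tokens."""
    # Case-insensitive search so "Let Go" matches the phrase "let go".
    idx = sentence.lower().find(phrase.lower())
    if idx == -1:
        raise ValueError(f"phrase {phrase!r} not found in sentence")
    end = idx + len(phrase)
    return (sentence[:idx] + "[PET_BOUNDARY]" + sentence[idx:end]
            + "[PET_BOUNDARY]" + sentence[end:])

print(mark_pet("My grandmother passed away last Tuesday.", "passed away"))
# My grandmother [PET_BOUNDARY]passed away[PET_BOUNDARY] last Tuesday.
```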
### Python

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import torch.nn.functional as F

model_name = "hasancanbiyik/euphemism-detector-multilingual"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

text = "My grandmother [PET_BOUNDARY]passed away[PET_BOUNDARY] last Tuesday."
inputs = tokenizer(text, return_tensors="pt", max_length=256, truncation=True)
with torch.no_grad():
    outputs = model(**inputs)

probs = F.softmax(outputs.logits, dim=1).squeeze()
print(f"Euphemistic: {probs[1].item():.1%}")
print(f"Literal: {probs[0].item():.1%}")
# Output: Euphemistic: 95.2%, Literal: 4.8%
```
### API (via deployed Space)

```bash
curl -X POST https://hasancanbiyik-euphemism-detector.hf.space/predict \
  -H "Content-Type: application/json" \
  -d '{"sentence": "He was let go from the company.", "phrase": "let go"}'
```
### Batch Prediction

The deployed API supports CSV batch prediction:

```bash
curl -X POST https://hasancanbiyik-euphemism-detector.hf.space/batch/predict \
  -F "file=@examples.csv"
```

CSV format: `sentence` and `phrase` columns.
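A batch file with those two columns can be produced with the standard library. A sketch; the example rows are illustrative, not from the dataset:

```python
import csv

# Each row pairs a sentence with the phrase to be tested in context.
rows = [
    {"sentence": "He was let go from the company.", "phrase": "let go"},
    {"sentence": "Let go of the rope before it burns you.", "phrase": "let go"},
]
with open("examples.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["sentence", "phrase"])
    writer.writeheader()
    writer.writerows(rows)
```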
## Training Details

### Training Procedure
- Base model: `xlm-roberta-base` (Conneau et al., 2020)
- Task: Binary classification (euphemistic vs. literal)
- Special tokens: `[PET_BOUNDARY]` added to vocabulary (vocab size: 250,003)
- Loss: Cross-entropy with class weights (literal: 1.392, euphemistic: 0.780)
- Optimizer: AdamW
- Learning rate: 1e-5
- Batch size: 32
- Max epochs: 30 (early stopped at epoch 15)
- Early stopping patience: 5 (on validation macro-F1)
- Mixed precision: fp16
- Max sequence length: 256 tokens
- Train/Val/Test split: 80/10/10, stratified by language and label
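The class weights above are consistent with the standard "balanced" scheme, w_c = N / (K · n_c). In the sketch below the per-class counts are inferred back from the published weights (an assumption, not figures from the dataset release):

```python
# Balanced class weights: w_c = N / (K * n_c), with K = number of classes.
# Counts inferred from the published weights (assumption): they sum to 19,490.
n_literal, n_euphemistic = 7_001, 12_489
N, K = n_literal + n_euphemistic, 2

w_literal = N / (K * n_literal)
w_euphemistic = N / (K * n_euphemistic)
print(f"literal: {w_literal:.3f}, euphemistic: {w_euphemistic:.3f}")
# literal: 1.392, euphemistic: 0.780
```

In a PyTorch training loop these would typically be passed as `torch.nn.CrossEntropyLoss(weight=torch.tensor([w_literal, w_euphemistic]))`, so the rarer literal class contributes more to the loss.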
### Data Preprocessing
Seven datasets in three different schemas were unified into a single training format:

- English, Chinese, Spanish, Yoruba: Already in `text,label` format with `[PET_BOUNDARY]` markers
- Turkish: Required `[PET BOUNDARY]` → `[PET_BOUNDARY]` normalization
- Polish: Three-column context format (left/sentence/right) with a separate PET column → required concatenation and PET boundary insertion; 58 unrecoverable rows dropped
- Ukrainian: Angle-bracket `<PET>` markers converted to `[PET_BOUNDARY]`; emojis stripped; "war" category downsampled from 4,743 to 500 to prevent category dominance
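The Turkish and Ukrainian marker fixes can be sketched as a small normalization pass. The exact Ukrainian markup (paired `<PET>…</PET>` vs. two bare `<PET>` tokens) is an assumption here:

```python
import re

def normalize_pet_markers(text: str) -> str:
    """Map legacy PET marker variants onto the canonical [PET_BOUNDARY] token."""
    text = text.replace("[PET BOUNDARY]", "[PET_BOUNDARY]")  # Turkish: missing underscore
    text = re.sub(r"</?PET>", "[PET_BOUNDARY]", text)        # Ukrainian: angle brackets
    return text

print(normalize_pet_markers("He <PET>passed away</PET> last year."))
# He [PET_BOUNDARY]passed away[PET_BOUNDARY] last year.
```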
## Behavioral Testing
A 26-test behavioral QA suite validates model robustness:
- Known euphemistic/literal pairs: 12 tests (all pass)
- Negation robustness: 2 tests (expected failure: negation context overwhelms the euphemistic signal; documented as a known limitation)
- Boundary token edge cases: 4 tests (all pass)
- Cross-lingual consistency: 2 tests (all pass)
- Confidence calibration: 2 tests (all pass)
- Surface invariance (case, punctuation, whitespace): 4 tests (all pass)
Result: 23 passed, 3 xfail (documented limitations)
## Limitations
- Zero-shot evaluation uses synthetic test data: The 22-language cross-lingual benchmark was curated using LLM-generated examples (Gemini), not native-speaker-validated data. Distributional overlap between the evaluation data and XLM-R's pretraining corpus may inflate zero-shot performance estimates. Native-speaker validation is required before deployment claims can be made for unseen languages.
- Small zero-shot sample sizes: 9β14 examples per unseen language. Per-language F1 scores have wide confidence intervals and should be interpreted as preliminary estimates.
- Negation sensitivity: The model tends to classify negated euphemisms as literal (e.g., "He didn't pass away"). Negation provides strong literal-context signals that overwhelm the euphemistic sense of the marked phrase.
- Ukrainian class imbalance: Ukrainian data has a 3.3:1 euphemistic/literal ratio even after downsampling, which may affect per-language calibration.
- Culture-specific euphemisms: The model performs best on universal euphemistic categories (death, appearance) and may underperform on culture-specific euphemisms without cross-lingual parallels in the training data.
- Low-frequency PETs: Rare or archaic euphemisms (e.g., "powder her nose") may be classified with low confidence or incorrectly.
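The "wide confidence intervals" caveat can be made concrete. Treating per-example correctness on a 10-item benchmark as a binomial proportion (a rough proxy for F1, an approximation), a Wilson score interval shows how little a 9/10 result actually pins down:

```python
from math import sqrt

def wilson_ci(p: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a proportion p over n items."""
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

lo, hi = wilson_ci(0.9, 10)  # e.g. 9 of 10 benchmark items correct
print(f"95% CI: [{lo:.2f}, {hi:.2f}]")
# 95% CI: [0.60, 0.98]
```

With n around 10, per-language scores in the zero-shot table are compatible with true performance anywhere from mediocre to near-perfect, which is why they should be read as preliminary estimates.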
## Research Context
This model was developed as part of ongoing NLP research on cross-lingual euphemism detection:
Biyik, H. C., Barak, L., Peng, J., & Feldman, A. (2026). When Semantic Overlap Is Not Enough: Cross-Lingual Euphemism Transfer Between Turkish and English. SIGTURK at EACL 2026, Rabat, Morocco. arXiv:2602.16957
Biyik, H. C., Lee, P., & Feldman, A. (2024). Turkish Delights: A Dataset on Turkish Euphemisms. SIGTURK at ACL 2024, Bangkok, Thailand. arXiv:2407.13040
Lee, P., et al. (2024). MEDs for PETs: Multilingual Euphemism Disambiguation for Potentially Euphemistic Terms. Findings of EACL 2024. ACL Anthology
The zero-shot cross-lingual evaluation extends Section 6 ("Future Work") of Lee et al. (2024), which called for testing additional languages from diverse language families.
## Citation
```bibtex
@inproceedings{biyik2026semantic,
  title={When Semantic Overlap Is Not Enough: Cross-Lingual Euphemism Transfer Between Turkish and English},
  author={Biyik, Hasan Can and Barak, Libby and Peng, Jing and Feldman, Anna},
  booktitle={Proceedings of SIGTURK at EACL 2026},
  year={2026},
  address={Rabat, Morocco}
}
```
## License
MIT