Euphemism Detector – Multilingual

Fine-tuned XLM-RoBERTa-base for euphemism disambiguation across 7 languages. Given a sentence with a marked phrase, the model predicts whether the phrase is used euphemistically (e.g., "passed away" meaning death) or literally (e.g., "the ball passed away from the goalkeeper").

Live demo: HuggingFace Spaces

GitHub: hasancanbiyik/euphemism-detector


Model Description

Euphemisms are context-dependent: the same phrase can be euphemistic in one context and literal in another. This model detects that distinction by learning contextual signals around Potentially Euphemistic Terms (PETs), which are marked with special [PET_BOUNDARY] tokens in the input.

The model was fine-tuned on 19,490 labeled examples across 7 languages, with class-weighted loss to handle label imbalance, fp16 mixed-precision training, and early stopping.

Training Languages

| Language | Examples | Euph/Lit Ratio | Macro-F1 |
|---|---|---|---|
| English | 3,098 | 1.5:1 | 0.800 |
| Turkish | 2,436 | 1.5:1 | 0.760 |
| Chinese (Mandarin) | 3,211 | 2.2:1 | 0.834 |
| Spanish | 2,952 | 2.0:1 | 0.828 |
| Yoruba | 2,598 | 1.9:1 | 0.840 |
| Polish | 2,439 | 1.0:1 | 0.810 |
| Ukrainian | 2,776 | 3.3:1 | 0.777 |
| **Overall** | 19,490 | – | 0.808 |

Zero-Shot Cross-Lingual Transfer

The model was evaluated on 22 unseen languages across 12 language families using curated minimal-pair benchmarks (synthetic, LLM-generated β€” see Limitations). Results demonstrate broad cross-lingual transfer, with 15/22 languages exceeding 0.70 macro-F1 and 7 exceeding 0.85.

| Language | Family | F1 | n |
|---|---|---|---|
| Portuguese | Romance | 0.906 | 11 |
| Indonesian | Austronesian | 0.899 | 10 |
| Swedish | Germanic | 0.899 | 10 |
| Hebrew | Semitic | 0.899 | 10 |
| Danish | Germanic | 0.883 | 9 |
| Hindi | Indo-Aryan | 0.862 | 9 |
| German | Germanic | 0.844 | 14 |
| Italian | Romance | 0.829 | 12 |
| Korean | Koreanic | 0.804 | 10 |
| Hungarian | Uralic | 0.800 | 10 |
| Romanian | Romance | 0.800 | 9 |
| Arabic | Semitic | 0.792 | 10 |
| Armenian | Armenian | 0.733 | 4 |
| French | Romance | 0.708 | 14 |
| Vietnamese | Austroasiatic | 0.697 | 10 |
| Dutch | Germanic | 0.697 | 10 |
| Japanese | Japonic | 0.694 | 11 |
| Czech | Slavic | 0.670 | 10 |
| Russian | Slavic | 0.670 | 10 |
| Greek | Hellenic | 0.600 | 10 |
| Swahili | Bantu | 0.600 | 10 |
| Finnish | Uralic | 0.550 | 9 |

Key finding: Transfer strength correlates more with euphemistic semantic category than with language family alone. Appearance euphemisms (F1: 1.00) and death euphemisms (F1: 0.79) transfer best across all language families. Strong transfer was observed to typologically distant languages (Arabic/Semitic: 0.79, Korean/Koreanic: 0.80, Indonesian/Austronesian: 0.90), while some typologically close languages showed weaker transfer (Czech/Slavic: 0.67 despite Polish and Ukrainian in training).


How to Use

Input Format

The model expects input text with [PET_BOUNDARY] tokens marking the potentially euphemistic term:

```
My grandmother [PET_BOUNDARY]passed away[PET_BOUNDARY] last Tuesday.
```
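If your data stores the sentence and phrase separately (as the API below does), the boundary tokens can be inserted with a small helper. This is a sketch; `mark_pet` is a hypothetical name, not part of any released package:

```python
def mark_pet(sentence: str, phrase: str) -> str:
    """Wrap the first occurrence of `phrase` in [PET_BOUNDARY] tokens,
    matching case-insensitively but preserving the original surface form."""
    start = sentence.lower().find(phrase.lower())
    if start == -1:
        raise ValueError(f"phrase {phrase!r} not found in sentence")
    end = start + len(phrase)
    return (sentence[:start]
            + "[PET_BOUNDARY]" + sentence[start:end] + "[PET_BOUNDARY]"
            + sentence[end:])

print(mark_pet("He was let go from the company.", "let go"))
# He was [PET_BOUNDARY]let go[PET_BOUNDARY] from the company.
```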

Python

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import torch.nn.functional as F

model_name = "hasancanbiyik/euphemism-detector-multilingual"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

text = "My grandmother [PET_BOUNDARY]passed away[PET_BOUNDARY] last Tuesday."
inputs = tokenizer(text, return_tensors="pt", max_length=256, truncation=True)

with torch.no_grad():
    outputs = model(**inputs)
    probs = F.softmax(outputs.logits, dim=1).squeeze()

print(f"Euphemistic: {probs[1].item():.1%}")
print(f"Literal:     {probs[0].item():.1%}")
# Output: Euphemistic: 95.2%, Literal: 4.8%
```

API (via deployed Space)

```shell
curl -X POST https://hasancanbiyik-euphemism-detector.hf.space/predict \
  -H "Content-Type: application/json" \
  -d '{"sentence": "He was let go from the company.", "phrase": "let go"}'
```

Batch Prediction

The deployed API supports CSV batch prediction:

```shell
curl -X POST https://hasancanbiyik-euphemism-detector.hf.space/batch/predict \
  -F "file=@examples.csv"
```

CSV format: a header row with `sentence` and `phrase` columns.
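A compatible CSV can be produced with the Python standard library; the rows below are illustrative:

```python
import csv

# Example rows in the sentence,phrase schema expected by the batch endpoint.
rows = [
    {"sentence": "My grandmother passed away last Tuesday.", "phrase": "passed away"},
    {"sentence": "He was let go from the company.", "phrase": "let go"},
]

with open("examples.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["sentence", "phrase"])
    writer.writeheader()
    writer.writerows(rows)
```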


Training Details

Training Procedure

  • Base model: xlm-roberta-base (Conneau et al., 2020)
  • Task: Binary classification (euphemistic vs. literal)
  • Special tokens: [PET_BOUNDARY] added to vocabulary (vocab size: 250,003)
  • Loss: Cross-entropy with class weights (literal: 1.392, euphemistic: 0.780)
  • Optimizer: AdamW
  • Learning rate: 1e-5
  • Batch size: 32
  • Max epochs: 30 (early stopped at epoch 15)
  • Early stopping patience: 5 (on validation macro-F1)
  • Mixed precision: fp16
  • Max sequence length: 256 tokens
  • Train/Val/Test split: 80/10/10, stratified by language and label
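The class weights above are consistent with the standard balanced scheme w_c = N / (K * n_c), where the rarer class receives the larger weight. The per-class counts below are assumptions chosen to reproduce the published weights, not numbers taken from this card:

```python
# Illustrative label counts (assumed; the card does not publish per-class totals).
counts = {"literal": 7_000, "euphemistic": 12_490}
n_total = sum(counts.values())
n_classes = len(counts)

# Balanced weighting: w_c = N / (K * n_c), so the minority class is upweighted.
weights = {label: n_total / (n_classes * n) for label, n in counts.items()}
print({label: round(w, 3) for label, w in weights.items()})
# {'literal': 1.392, 'euphemistic': 0.78}
```

During fine-tuning, such weights would typically be passed to `torch.nn.CrossEntropyLoss(weight=...)`.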

Data Preprocessing

Seven datasets in three different schemas were unified into a single training format:

  • English, Chinese, Spanish, Yoruba: Already in `text,label` format with [PET_BOUNDARY] markers
  • Turkish: Required `[PET BOUNDARY]` → `[PET_BOUNDARY]` normalization
  • Polish: Three-column context format (left/sentence/right) with a separate PET column; required concatenation and PET boundary insertion; 58 unrecoverable rows dropped
  • Ukrainian: Angle-bracket `<PET>` markers converted to `[PET_BOUNDARY]`; emojis stripped; "war" category downsampled from 4,743 to 500 examples to prevent category dominance
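A minimal sketch of the marker normalization, assuming the marker forms described above (the exact Ukrainian angle-bracket syntax is an assumption here):

```python
import re

def normalize_markers(text: str) -> str:
    """Map dataset-specific PET markers onto the unified [PET_BOUNDARY] token."""
    # Turkish: boundary token written with a space instead of an underscore.
    text = text.replace("[PET BOUNDARY]", "[PET_BOUNDARY]")
    # Ukrainian: angle-bracket markers such as <PET> ... </PET> (form assumed).
    text = re.sub(r"</?PET>", "[PET_BOUNDARY]", text)
    return text

print(normalize_markers("He [PET BOUNDARY]passed away[PET BOUNDARY] recently."))
# He [PET_BOUNDARY]passed away[PET_BOUNDARY] recently.
```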

Behavioral Testing

A 26-test behavioral QA suite validates model robustness:

  • Known euphemistic/literal pairs: 12 tests (all pass)
  • Negation robustness: 2 tests (expected failure: negation context overwhelms the euphemistic signal; documented as a known limitation)
  • Boundary token edge cases: 4 tests (all pass)
  • Cross-lingual consistency: 2 tests (all pass)
  • Confidence calibration: 2 tests (all pass)
  • Surface invariance (case, punctuation, whitespace): 4 tests (all pass)

Result: 23 passed, 3 xfail (documented limitations)


Limitations

  • Zero-shot evaluation uses synthetic test data: The 22-language cross-lingual benchmark was curated using LLM-generated examples (Gemini), not native-speaker-validated data. Distributional overlap between the evaluation data and XLM-R's pretraining corpus may inflate zero-shot performance estimates. Native-speaker validation is required before deployment claims can be made for unseen languages.
  • Small zero-shot sample sizes: 9–14 examples per unseen language. Per-language F1 scores have wide confidence intervals and should be interpreted as preliminary estimates.
  • Negation sensitivity: The model tends to classify negated euphemisms as literal (e.g., "He didn't pass away"). Negation provides strong literal-context signals that overwhelm the euphemistic sense of the marked phrase.
  • Ukrainian class imbalance: Ukrainian data has a 3.3:1 euphemistic/literal ratio even after downsampling, which may affect per-language calibration.
  • Culture-specific euphemisms: The model performs best on universal euphemistic categories (death, appearance) and may underperform on culture-specific euphemisms without cross-lingual parallels in the training data.
  • Low-frequency PETs: Rare or archaic euphemisms (e.g., "powder her nose") may be classified with low confidence or incorrectly.
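To illustrate how wide those intervals are at n = 10, a percentile bootstrap over one made-up language (the labels and predictions below are hypothetical, with 1 = euphemistic) shows the spread around a point estimate near 0.70:

```python
import random

def binary_macro_f1(y_true, y_pred):
    """Macro-F1 for binary labels; degenerate classes score 0."""
    f1s = []
    for cls in (0, 1):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / len(f1s)

# Hypothetical outcomes for a 10-example zero-shot language.
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 0, 0, 0, 0, 1, 1]

# Percentile bootstrap: resample the 10 examples with replacement.
rng = random.Random(0)
scores = []
for _ in range(2000):
    idx = [rng.randrange(len(y_true)) for _ in y_true]
    scores.append(binary_macro_f1([y_true[i] for i in idx],
                                  [y_pred[i] for i in idx]))
scores.sort()
lo, hi = scores[int(0.025 * len(scores))], scores[int(0.975 * len(scores))]
print(f"point={binary_macro_f1(y_true, y_pred):.2f}  95% CI=[{lo:.2f}, {hi:.2f}]")
```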

Research Context

This model was developed as part of ongoing NLP research on cross-lingual euphemism detection:

  • Biyik, H. C., Barak, L., Peng, J., & Feldman, A. (2026). When Semantic Overlap Is Not Enough: Cross-Lingual Euphemism Transfer Between Turkish and English. SIGTURK at EACL 2026, Rabat, Morocco. arXiv:2602.16957

  • Biyik, H. C., Lee, P., & Feldman, A. (2024). Turkish Delights: A Dataset on Turkish Euphemisms. SIGTURK at ACL 2024, Bangkok, Thailand. arXiv:2407.13040

  • Lee, P., et al. (2024). MEDs for PETs: Multilingual Euphemism Disambiguation for Potentially Euphemistic Terms. Findings of EACL 2024. ACL Anthology

The zero-shot cross-lingual evaluation extends Section 6 ("Future Work") of Lee et al. (2024), which called for testing additional languages from diverse language families.


Citation

```bibtex
@inproceedings{biyik2026semantic,
  title={When Semantic Overlap Is Not Enough: Cross-Lingual Euphemism Transfer Between Turkish and English},
  author={Biyik, Hasan Can and Barak, Libby and Peng, Jing and Feldman, Anna},
  booktitle={Proceedings of SIGTURK at EACL 2026},
  year={2026},
  address={Rabat, Morocco}
}
```

License

MIT
