---
language:
- en
license: mit
tags:
- euphemism-detection
- xlm-roberta
- text-classification
datasets:
- custom
metrics:
- f1
pipeline_tag: text-classification
---

# Euphemism Detector (V1 — English)

> **An updated multilingual version is available:** [hasancanbiyik/euphemism-detector-multilingual](https://huggingface.co/hasancanbiyik/euphemism-detector-multilingual) — fine-tuned on 7 languages (EN/TR/ZH/ES/YO/PL/UK) with 0.808 macro-F1 and zero-shot transfer to 22 additional languages.

Fine-tuned XLM-RoBERTa-base for euphemism disambiguation on English PETs (Potentially Euphemistic Terms). Given a sentence with a marked phrase, the model predicts whether the phrase is used euphemistically or literally.

This model was fine-tuned on the English PETs dataset created by the NLP Lab at Montclair State University, U.S.A.

## Performance (English)

| Class | Precision | Recall | F1 |
|-------|-----------|--------|----|
| Literal | 0.81 | 0.83 | 0.82 |
| Euphemistic | 0.88 | 0.86 | 0.87 |
| **Macro avg** | **0.84** | **0.84** | **0.84** |
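The macro average weights both classes equally rather than weighting by class frequency, which matters here because the two classes are not necessarily balanced. A minimal sketch of the arithmetic behind the reported figure:

```python
# Macro-averaged F1: the unweighted mean of the per-class F1 scores,
# so each class counts equally regardless of its number of examples.
f1_literal = 0.82
f1_euphemistic = 0.87

macro_f1 = (f1_literal + f1_euphemistic) / 2  # ~0.845, reported as 0.84
print(round(macro_f1, 3))
```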
## Usage

The model expects input text with `[PET_BOUNDARY]` tokens marking the target phrase:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import torch.nn.functional as F

tokenizer = AutoTokenizer.from_pretrained("hasancanbiyik/euphemism-detector")
model = AutoModelForSequenceClassification.from_pretrained("hasancanbiyik/euphemism-detector")
model.eval()

text = "My grandmother [PET_BOUNDARY]passed away[PET_BOUNDARY] last Tuesday."
inputs = tokenizer(text, return_tensors="pt", max_length=256, truncation=True)

with torch.no_grad():
    probs = F.softmax(model(**inputs).logits, dim=1).squeeze()

print(f"Euphemistic: {probs[1].item():.1%}")
print(f"Literal: {probs[0].item():.1%}")
```
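To score several sentences at once, the same model accepts a padded batch via the standard `transformers` tokenizer call. A sketch under that assumption (the second example sentence is made up for illustration):

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import torch.nn.functional as F

tokenizer = AutoTokenizer.from_pretrained("hasancanbiyik/euphemism-detector")
model = AutoModelForSequenceClassification.from_pretrained("hasancanbiyik/euphemism-detector")
model.eval()

texts = [
    "My grandmother [PET_BOUNDARY]passed away[PET_BOUNDARY] last Tuesday.",  # euphemistic use
    "He [PET_BOUNDARY]let go[PET_BOUNDARY] of the rope before it snapped.",  # literal use (hypothetical example)
]

# padding=True pads shorter sentences so the batch forms one tensor
inputs = tokenizer(texts, return_tensors="pt", padding=True, max_length=256, truncation=True)

with torch.no_grad():
    probs = F.softmax(model(**inputs).logits, dim=1)  # shape: (batch_size, 2)

for text, p in zip(texts, probs):
    print(f"Euphemistic {p[1].item():.1%} | {text}")
```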
## Updated Version

For multilingual support (7 training languages + zero-shot transfer to 22 additional languages), batch prediction, and improved performance, see the V2 model:

**[hasancanbiyik/euphemism-detector-multilingual](https://huggingface.co/hasancanbiyik/euphemism-detector-multilingual)**

## Research Context

- **Biyik, H. C.**, Lee, P., & Feldman, A. (2024). *Turkish Delights: A Dataset on Turkish Euphemisms.* SIGTURK at ACL 2024. [arXiv:2407.13040](https://arxiv.org/abs/2407.13040)
- **Biyik, H. C.**, Barak, L., Peng, J., & Feldman, A. (2026). *When Semantic Overlap Is Not Enough: Cross-Lingual Euphemism Transfer Between Turkish and English.* SIGTURK at EACL 2026. [arXiv:2602.16957](https://arxiv.org/abs/2602.16957)

## License

MIT