---
language:
- en
license: mit
tags:
- euphemism-detection
- xlm-roberta
- text-classification
datasets:
- custom
metrics:
- f1
pipeline_tag: text-classification
---

# Euphemism Detector (V1 — English)

> **An updated multilingual version is available:** [hasancanbiyik/euphemism-detector-multilingual](https://huggingface.co/hasancanbiyik/euphemism-detector-multilingual) — fine-tuned on 7 languages (EN/TR/ZH/ES/YO/PL/UK) with 0.808 macro-F1 and zero-shot transfer to 22 additional languages.

Fine-tuned XLM-RoBERTa-base for euphemism disambiguation on English PETs (Potentially Euphemistic Terms). Given a sentence with a marked phrase, the model predicts whether the phrase is used euphemistically or literally.

This model was fine-tuned on the English PETs dataset created by the NLP Lab at Montclair State University, U.S.A.

## Performance (English)

| Class | Precision | Recall | F1 |
|-------|-----------|--------|----|
| Literal | 0.81 | 0.83 | 0.82 |
| Euphemistic | 0.88 | 0.86 | 0.87 |
| **Macro avg** | **0.84** | **0.84** | **0.84** |
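The macro average weights both classes equally rather than weighting by class frequency, which matters here because the two classes are not necessarily balanced. A minimal sketch of the arithmetic behind the reported figure:

```python
# Macro-averaged F1: the unweighted mean of the per-class F1 scores,
# so each class counts equally regardless of its number of examples.
f1_literal = 0.82
f1_euphemistic = 0.87

macro_f1 = (f1_literal + f1_euphemistic) / 2  # ~0.845, reported as 0.84
print(round(macro_f1, 3))
```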
## Usage

The model expects input text with `[PET_BOUNDARY]` tokens marking the target phrase:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import torch.nn.functional as F

tokenizer = AutoTokenizer.from_pretrained("hasancanbiyik/euphemism-detector")
model = AutoModelForSequenceClassification.from_pretrained("hasancanbiyik/euphemism-detector")
model.eval()

text = "My grandmother [PET_BOUNDARY]passed away[PET_BOUNDARY] last Tuesday."
inputs = tokenizer(text, return_tensors="pt", max_length=256, truncation=True)

with torch.no_grad():
    probs = F.softmax(model(**inputs).logits, dim=1).squeeze()

print(f"Euphemistic: {probs[1].item():.1%}")
print(f"Literal: {probs[0].item():.1%}")
```
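To score several sentences at once, the same model accepts a padded batch via the standard `transformers` tokenizer call. A sketch under that assumption (the second example sentence is made up for illustration):

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import torch.nn.functional as F

tokenizer = AutoTokenizer.from_pretrained("hasancanbiyik/euphemism-detector")
model = AutoModelForSequenceClassification.from_pretrained("hasancanbiyik/euphemism-detector")
model.eval()

texts = [
    "My grandmother [PET_BOUNDARY]passed away[PET_BOUNDARY] last Tuesday.",  # euphemistic use
    "He [PET_BOUNDARY]let go[PET_BOUNDARY] of the rope before it snapped.",  # literal use (hypothetical example)
]

# padding=True pads shorter sentences so the batch forms one tensor
inputs = tokenizer(texts, return_tensors="pt", padding=True, max_length=256, truncation=True)

with torch.no_grad():
    probs = F.softmax(model(**inputs).logits, dim=1)  # shape: (batch_size, 2)

for text, p in zip(texts, probs):
    print(f"Euphemistic {p[1].item():.1%} | {text}")
```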
## Updated Version

For multilingual support (7 training languages + zero-shot transfer to 22 additional languages), batch prediction, and improved performance, see the V2 model:

**[hasancanbiyik/euphemism-detector-multilingual](https://huggingface.co/hasancanbiyik/euphemism-detector-multilingual)**

## Research Context

- **Biyik, H. C.**, Lee, P., & Feldman, A. (2024). *Turkish Delights: A Dataset on Turkish Euphemisms.* SIGTURK at ACL 2024. [arXiv:2407.13040](https://arxiv.org/abs/2407.13040)
- **Biyik, H. C.**, Barak, L., Peng, J., & Feldman, A. (2026). *When Semantic Overlap Is Not Enough: Cross-Lingual Euphemism Transfer Between Turkish and English.* SIGTURK at EACL 2026. [arXiv:2602.16957](https://arxiv.org/abs/2602.16957)

## License

MIT