hasancanbiyik committed
Commit abc60a7 (verified) · 1 Parent(s): fd72ce6

Update README.md

Files changed (1): README.md (+64 −1)
README.md CHANGED
@@ -1,5 +1,68 @@
  ---
  license: mit
  ---
 
- This model was fine-tuned with the English PETs dataset created by the NLP Lab at the Montclair State University, U.S.A..
  ---
+ language:
+ - en
  license: mit
+ tags:
+ - euphemism-detection
+ - xlm-roberta
+ - text-classification
+ datasets:
+ - custom
+ metrics:
+ - f1
+ pipeline_tag: text-classification
  ---
 
+ # Euphemism Detector (V1 English)
+
+ > **An updated multilingual version is available:** [hasancanbiyik/euphemism-detector-multilingual](https://huggingface.co/hasancanbiyik/euphemism-detector-multilingual) — fine-tuned on 7 languages (EN/TR/ZH/ES/YO/PL/UK) with 0.808 macro-F1 and zero-shot transfer to 22 additional languages.
+
+ Fine-tuned XLM-RoBERTa-base for euphemism disambiguation on English PETs (Potentially Euphemistic Terms). Given a sentence with a marked phrase, the model predicts whether the phrase is used euphemistically or literally.
+
+ This model was fine-tuned on the English PETs dataset created by the NLP Lab at Montclair State University, U.S.A.
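The model consumes sentences with the target phrase wrapped in `[PET_BOUNDARY]` markers (see Usage below). When the phrase is identified by character offsets rather than pre-marked text, a small helper can insert the markers; this is a minimal sketch, and the `mark_pet` name is illustrative, not part of this repository:

```python
def mark_pet(sentence: str, start: int, end: int) -> str:
    """Wrap the character span sentence[start:end] in [PET_BOUNDARY] markers."""
    return (
        sentence[:start]
        + "[PET_BOUNDARY]" + sentence[start:end] + "[PET_BOUNDARY]"
        + sentence[end:]
    )

print(mark_pet("My grandmother passed away last Tuesday.", 15, 26))
# → My grandmother [PET_BOUNDARY]passed away[PET_BOUNDARY] last Tuesday.
```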
+
+ ## Performance (English)
+
+ | Class | Precision | Recall | F1 |
+ |-------|-----------|--------|----|
+ | Literal | 0.81 | 0.83 | 0.82 |
+ | Euphemistic | 0.88 | 0.86 | 0.87 |
+ | **Macro avg** | **0.84** | **0.84** | **0.84** |
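The macro average is the unweighted mean of the two per-class scores. A quick check against the rounded per-class numbers in the table (each mean comes out to 0.845, consistent with the reported 0.84 within rounding of the underlying unrounded values):

```python
literal = {"precision": 0.81, "recall": 0.83, "f1": 0.82}
euphemistic = {"precision": 0.88, "recall": 0.86, "f1": 0.87}

# Macro average: unweighted mean over the two classes for each metric.
macro = {m: (literal[m] + euphemistic[m]) / 2 for m in literal}
print(macro)  # each value is 0.845
```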
+
+ ## Usage
+
+ The model expects input text with `[PET_BOUNDARY]` tokens marking the target phrase:
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
+ import torch
+ import torch.nn.functional as F
+
+ tokenizer = AutoTokenizer.from_pretrained("hasancanbiyik/euphemism-detector")
+ model = AutoModelForSequenceClassification.from_pretrained("hasancanbiyik/euphemism-detector")
+ model.eval()
+
+ text = "My grandmother [PET_BOUNDARY]passed away[PET_BOUNDARY] last Tuesday."
+ inputs = tokenizer(text, return_tensors="pt", max_length=256, truncation=True)
+
+ with torch.no_grad():
+     probs = F.softmax(model(**inputs).logits, dim=1).squeeze()
+
+ print(f"Euphemistic: {probs[1].item():.1%}")
+ print(f"Literal: {probs[0].item():.1%}")
+ ```
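For multiple sentences, the same pattern extends to a single padded forward pass. This is a sketch under the assumption that class ids 0/1 map to literal/euphemistic as in the snippet above; the `predict_batch` helper is illustrative, not part of this repository:

```python
import torch
import torch.nn.functional as F

ID2LABEL = {0: "literal", 1: "euphemistic"}  # assumed id-to-label mapping, as above

def predict_batch(model, tokenizer, texts, max_length=256):
    """Tokenize marked sentences, run one batched forward pass,
    and return a (label, confidence) pair per input sentence."""
    inputs = tokenizer(texts, return_tensors="pt", padding=True,
                       truncation=True, max_length=max_length)
    with torch.no_grad():
        probs = F.softmax(model(**inputs).logits, dim=1)
    conf, ids = probs.max(dim=1)
    return [(ID2LABEL[i.item()], c.item()) for i, c in zip(ids, conf)]
```

With the tokenizer and model loaded as in the snippet above, `predict_batch(model, tokenizer, [text1, text2])` returns one labeled prediction per sentence.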
+
+ ## Updated Version
+
+ For multilingual support (7 training languages + zero-shot transfer to 22 additional languages), batch prediction, and improved performance, see the V2 model:
+
+ **[hasancanbiyik/euphemism-detector-multilingual](https://huggingface.co/hasancanbiyik/euphemism-detector-multilingual)**
+
+ ## Research Context
+
+ - **Biyik, H. C.**, Lee, P., & Feldman, A. (2024). *Turkish Delights: A Dataset on Turkish Euphemisms.* SIGTURK at ACL 2024. [arXiv:2407.13040](https://arxiv.org/abs/2407.13040)
+ - **Biyik, H. C.**, Barak, L., Peng, J., & Feldman, A. (2026). *When Semantic Overlap Is Not Enough: Cross-Lingual Euphemism Transfer Between Turkish and English.* SIGTURK at EACL 2026. [arXiv:2602.16957](https://arxiv.org/abs/2602.16957)
+
+ ## License
+
+ MIT