---
language:
- en
base_model:
- FacebookAI/roberta-large
pipeline_tag: text-classification
---
# Graded Word Sense Disambiguation (WSD) Model

## Model Summary
This model is a **fine-tuned version of RoBERTa-Large** for **Graded Word Sense Disambiguation (WSD)**. It predicts the **degree of applicability** of a word sense in context and is trained on **large-scale sense-annotated corpora**. The model is based on the work described in:

**Reference Paper:**
Pierluigi Cassotti, Nina Tahmasebi (2025). Sense-specific Historical Word Usage Generation.

Because it is trained for **graded WSD**, the model produces **continuous-valued predictions** rather than hard classifications, making it useful for nuanced applications in lexicography, computational linguistics, and historical text analysis.

---

## Model Details
- **Base Model:** `roberta-large`
- **Task:** Graded Word Sense Disambiguation (WSD)
- **Fine-tuning Dataset:** Oxford English Dictionary (OED) sense-annotated corpus
- **Training Steps:**
  - Tokenizer augmented with special tokens (`<t>`, `</t>`) for marking target words in context.
  - Dataset preprocessed with **sense annotations** and **word offsets**.
  - Sentences containing sense-annotated words were split into **train (90%)** and **validation (10%)** sets.
- **Objective:** Predict a continuous label representing the applicability of a sense.
- **Evaluation Metric:** Root Mean Squared Error (RMSE)
- **Batch Size:** 32
- **Learning Rate:** 2e-5
- **Epochs:** 1
- **Optimizer:** AdamW with a weight decay of 0.01
- **Evaluation Strategy:** Step-based, evaluating every 10% of training steps
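
For concreteness, the hyperparameters above map onto a `Trainer` setup roughly like the sketch below. This is an illustration rather than the authors' published script: `output_dir` and `eval_steps` are placeholders, since the exact step count depends on dataset size.

```python
import numpy as np
from transformers import TrainingArguments

def compute_metrics(eval_pred):
    # RMSE = sqrt(mean((prediction - label)^2)), the metric reported above.
    preds, labels = eval_pred
    return {"rmse": float(np.sqrt(np.mean((preds.squeeze() - labels) ** 2)))}

training_args = TrainingArguments(
    output_dir="graded-wsd",          # placeholder
    per_device_train_batch_size=32,
    learning_rate=2e-5,
    num_train_epochs=1,
    weight_decay=0.01,                # AdamW is the Trainer's default optimizer
    evaluation_strategy="steps",
    eval_steps=500,                   # placeholder: roughly 10% of total training steps
)
```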

---

## Training & Fine-Tuning
Fine-tuning was performed using the **Hugging Face `Trainer` API** with a **custom dataset loader**. The dataset was processed as follows (a sketch of one training instance follows the list):

1. **Preprocessing**
   - Example sentences were extracted from the OED and augmented with **definitions**.
   - The target word was **highlighted** with special tokens (`<t>`, `</t>`).
   - Each instance was labeled with a **graded similarity score**.

2. **Tokenization & Encoding**
   - Tokenized with `AutoTokenizer.from_pretrained("roberta-large")`.
   - Definitions were concatenated using the `</s></s>` separator for **cross-sentence representation**.

3. **Training Pipeline**
   - The model was fine-tuned on the **regression task** with a single **linear output head**.
   - Trained with **Mean Squared Error (MSE) loss**.
   - Evaluated on the validation set using **Root Mean Squared Error (RMSE)**.
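
A minimal sketch of how one training instance is assembled under this pipeline; the sentence, character offsets, and score below are illustrative, and the authors' exact preprocessing code is not included in this card:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-large")
# Step 1: register the target-word markers used during preprocessing.
tokenizer.add_tokens(["<t>", "</t>"])

sentence = "They sat on the bank and watched the river."
start, end = 16, 20              # character offsets of "bank" (illustrative)
definition = "The land alongside a river or a stream."
label = 0.87                     # graded applicability score (illustrative)

# Highlight the target word in context.
marked = sentence[:start] + "<t> " + sentence[start:end] + " </t>" + sentence[end:]

# Step 2: pair the marked sentence with the sense definition.
encoding = tokenizer(f"{marked} </s></s> {definition}", truncation=True)

# Step 3: a single-output regression head; with num_labels=1 and float
# labels, the model's forward pass uses MSE loss.
model = AutoModelForSequenceClassification.from_pretrained("roberta-large", num_labels=1)
model.resize_token_embeddings(len(tokenizer))  # account for the added tokens
```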

---

## Usage
### Example Code
The snippet below scores one sense definition against a usage. Note that the target word is wrapped in the `<t>`/`</t>` markers the model was trained with:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("ChangeIsKey/graded-wsd")
model = AutoModelForSequenceClassification.from_pretrained("ChangeIsKey/graded-wsd")
model.eval()

sentence = "The bank of the river was eroding due to the storm."
target_word = "bank"
definition = "The land alongside a river or a stream."

# Mark the target word with the special tokens used during fine-tuning.
marked_sentence = sentence.replace(target_word, f"<t> {target_word} </t>", 1)

# Pair the marked sentence with the sense definition, as in training.
tokenized_input = tokenizer(f"{marked_sentence} </s></s> {definition}", truncation=True, return_tensors="pt")

with torch.no_grad():
    output = model(**tokenized_input)
    score = output.logits.item()

print(f"Graded Sense Score: {score}")
```
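
To disambiguate among several candidate senses, score each definition and rank them. The snippet below continues from the example above; the second definition is illustrative:

```python
definitions = [
    "The land alongside a river or a stream.",
    "A financial establishment that invests money deposited by customers.",
]

scores = []
for d in definitions:
    inputs = tokenizer(f"{marked_sentence} </s></s> {d}", truncation=True, return_tensors="pt")
    with torch.no_grad():
        scores.append(model(**inputs).logits.item())

# The highest-scoring definition is the best-fitting sense.
for d, s in sorted(zip(definitions, scores), key=lambda p: -p[1]):
    print(f"{s:.3f}  {d}")
```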

### Input Format
- **Sentence:** the contextual usage of the word.
- **Target Word:** the word to be disambiguated, wrapped in `<t>`/`</t>`.
- **Definition:** the dictionary definition of the intended sense.

### Output
- **A continuous score** between 0 and 1 indicating the degree to which the given definition applies to the target word in its context.

---

## Citation
If you use this model, please cite the following paper:

```bibtex
@article{cassotti2025,
  title={Sense-specific Historical Word Usage Generation},
  author={Cassotti, Pierluigi and Tahmasebi, Nina},
  journal={Transactions of the Association for Computational Linguistics},
  year={2025}
}
```