---
language:
- en
base_model:
- FacebookAI/roberta-large
pipeline_tag: text-classification
---
# Graded Word Sense Disambiguation (WSD) Model

## Model Summary
This model is a **fine-tuned version of RoBERTa-Large** for **Graded Word Sense Disambiguation (WSD)**. It predicts the **degree of applicability** of a word sense in context and is trained on **large-scale sense-annotated corpora**. The model is based on the work described in:

**Reference Paper:**
Pierluigi Cassotti, Nina Tahmasebi (2025). Sense-specific Historical Word Usage Generation.

Because it is trained for **graded WSD**, the model produces **continuous-valued predictions** rather than hard classifications, making it useful for nuanced applications in lexicography, computational linguistics, and historical text analysis.

---

## Model Details
- **Base Model:** `roberta-large`
- **Task:** Graded Word Sense Disambiguation (WSD)
- **Fine-tuning Dataset:** Oxford English Dictionary (OED) sense-annotated corpus
- **Training Steps:**
  - Tokenizer augmented with special tokens (`<t>`, `</t>`) for marking target words in context.
  - Dataset preprocessed with **sense annotations** and **word offsets**.
  - Sentences containing sense-annotated words were split into **train (90%)** and **validation (10%)** sets.
- **Objective:** Predict a continuous label representing the applicability of a sense.
- **Evaluation Metric:** Root Mean Squared Error (RMSE)
- **Batch Size:** 32
- **Learning Rate:** 2e-5
- **Epochs:** 1
- **Optimizer:** AdamW with a weight decay of 0.01
- **Evaluation Strategy:** Step-based, evaluating every 10% of training steps
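
For concreteness, the hyperparameters above map onto a `Trainer` setup roughly like the sketch below. This is an illustration rather than the authors' published script: `output_dir` and `eval_steps` are placeholders, since the exact step count depends on dataset size.

```python
import numpy as np
from transformers import TrainingArguments

def compute_metrics(eval_pred):
    # RMSE = sqrt(mean((prediction - label)^2)), the metric reported above.
    preds, labels = eval_pred
    return {"rmse": float(np.sqrt(np.mean((preds.squeeze() - labels) ** 2)))}

training_args = TrainingArguments(
    output_dir="graded-wsd",          # placeholder
    per_device_train_batch_size=32,
    learning_rate=2e-5,
    num_train_epochs=1,
    weight_decay=0.01,                # AdamW is the Trainer's default optimizer
    evaluation_strategy="steps",
    eval_steps=500,                   # placeholder: roughly 10% of total training steps
)
```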

---

## Training & Fine-Tuning
Fine-tuning was performed using the **Hugging Face `Trainer` API** with a **custom dataset loader**. The dataset was processed as follows (a sketch of one training instance follows the list):

1. **Preprocessing**
   - Example sentences were extracted from the OED and augmented with **definitions**.
   - The target word was **highlighted** with special tokens (`<t>`, `</t>`).
   - Each instance was labeled with a **graded similarity score**.

2. **Tokenization & Encoding**
   - Tokenized with `AutoTokenizer.from_pretrained("roberta-large")`.
   - Definitions were concatenated using the `</s></s>` separator for **cross-sentence representation**.

3. **Training Pipeline**
   - The model was fine-tuned on the **regression task** with a single **linear output head**.
   - Trained with **Mean Squared Error (MSE) loss**.
   - Evaluated on the validation set using **Root Mean Squared Error (RMSE)**.
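
A minimal sketch of how one training instance is assembled under this pipeline; the sentence, character offsets, and score below are illustrative, and the authors' exact preprocessing code is not included in this card:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-large")
# Step 1: register the target-word markers used during preprocessing.
tokenizer.add_tokens(["<t>", "</t>"])

sentence = "They sat on the bank and watched the river."
start, end = 16, 20              # character offsets of "bank" (illustrative)
definition = "The land alongside a river or a stream."
label = 0.87                     # graded applicability score (illustrative)

# Highlight the target word in context.
marked = sentence[:start] + "<t> " + sentence[start:end] + " </t>" + sentence[end:]

# Step 2: pair the marked sentence with the sense definition.
encoding = tokenizer(f"{marked} </s></s> {definition}", truncation=True)

# Step 3: a single-output regression head; with num_labels=1 and float
# labels, the model's forward pass uses MSE loss.
model = AutoModelForSequenceClassification.from_pretrained("roberta-large", num_labels=1)
model.resize_token_embeddings(len(tokenizer))  # account for the added tokens
```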

---

## Usage
### Example Code
The snippet below scores one sense definition against a usage. Note that the target word is wrapped in the `<t>`/`</t>` markers the model was trained with:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("ChangeIsKey/graded-wsd")
model = AutoModelForSequenceClassification.from_pretrained("ChangeIsKey/graded-wsd")
model.eval()

sentence = "The bank of the river was eroding due to the storm."
target_word = "bank"
definition = "The land alongside a river or a stream."

# Mark the target word with the special tokens used during fine-tuning.
marked_sentence = sentence.replace(target_word, f"<t> {target_word} </t>", 1)

# Pair the marked sentence with the sense definition, as in training.
tokenized_input = tokenizer(f"{marked_sentence} </s></s> {definition}", truncation=True, return_tensors="pt")

with torch.no_grad():
    output = model(**tokenized_input)
    score = output.logits.item()

print(f"Graded Sense Score: {score}")
```
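
To disambiguate among several candidate senses, score each definition and rank them. The snippet below continues from the example above; the second definition is illustrative:

```python
definitions = [
    "The land alongside a river or a stream.",
    "A financial establishment that invests money deposited by customers.",
]

scores = []
for d in definitions:
    inputs = tokenizer(f"{marked_sentence} </s></s> {d}", truncation=True, return_tensors="pt")
    with torch.no_grad():
        scores.append(model(**inputs).logits.item())

# The highest-scoring definition is the best-fitting sense.
for d, s in sorted(zip(definitions, scores), key=lambda p: -p[1]):
    print(f"{s:.3f}  {d}")
```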

### Input Format
- **Sentence:** the contextual usage of the word.
- **Target Word:** the word to be disambiguated, wrapped in `<t>`/`</t>`.
- **Definition:** the dictionary definition of the intended sense.

### Output
- **A continuous score** between 0 and 1 indicating the degree to which the given definition applies to the target word in its context.

---

## Citation
If you use this model, please cite the following paper:

```bibtex
@article{cassotti2025,
  title={Sense-specific Historical Word Usage Generation},
  author={Cassotti, Pierluigi and Tahmasebi, Nina},
  journal={Transactions of the Association for Computational Linguistics},
  year={2025}
}
```