## Model Description

**roberta-large-entity-linking** is a [RoBERTa large model](https://huggingface.co/FacebookAI/roberta-large) fine-tuned as a [bi-encoder](https://arxiv.org/pdf/1811.08008) for [entity linking](https://en.wikipedia.org/wiki/Entity_linking) tasks. The model separately embeds mentions-in-context and entity descriptions to enable semantic matching between text mentions and knowledge base entities.

## Intended Uses

### Primary Use Cases

- **Entity Linking:** Link Wikipedia concepts mentioned in text to their corresponding Wikipedia pages. [Wikimedia](https://huggingface.co/wikimedia) makes this easy with [this dataset](https://huggingface.co/datasets/wikimedia/structured-wikipedia): you can embed the entries in the "abstract" column (you may need to do some cleanup to filter out irrelevant entries).
- **Zero-shot Entity Linking:** Link entities to knowledge bases without task-specific training
- **Knowledge Base Construction:** Build and reference new knowledge bases using the model's strong generalization capabilities
- **Note:** You may use the model as a top-k retriever and perform the final disambiguation with a more powerful classification model

### Recommended Preprocessing

- Use `[ENT]` tokens to mark an entity mention: `left context [ENT] mention [ENT] right context`
- Consider using an NER model to identify candidate mentions
- For non-standard entities (e.g., "daytime"), you might extract noun phrases (with NLTK or spaCy, for example) to locate candidate mentions
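The `[ENT]` marking convention above is easy to apply once you have a mention's character span. A minimal sketch (the `mark_mention` helper is ours, for illustration, not part of the model's API):

```python
def mark_mention(text: str, start: int, end: int, marker: str = "[ENT]") -> str:
    """Wrap the mention at text[start:end] in [ENT] markers, producing
    'left context [ENT] mention [ENT] right context'."""
    return f"{text[:start]}{marker} {text[start:end]} {marker}{text[end:]}".strip()

sentence = "Tim Cook, president of Apple, is a guy who lives in California."
start = sentence.index("president")
print(mark_mention(sentence, start, start + len("president")))
# Tim Cook, [ENT] president [ENT] of Apple, is a guy who lives in California.
```

An NER model or noun-phrase chunker would supply the `start`/`end` offsets in practice.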

## Code Example

```python
# Verify the special token is there
print('[ENT]' in tokenizer.get_added_vocab())

context = "Tim Cook, [ENT] president [ENT] of Apple, is a guy who lives in California."

definitions = [
    # entity descriptions to compare against (elided in this excerpt)
]
```
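Once the mention and the candidate definitions are embedded, linking reduces to nearest-neighbor search over the entity vectors. A minimal NumPy sketch of that step (toy 2-d vectors stand in for real model outputs; the function name is illustrative):

```python
import numpy as np

def top_k_entities(mention_emb, entity_embs, k=3):
    """Rank entity embeddings by cosine similarity to a mention embedding
    and return (index, score) pairs, best match first."""
    m = mention_emb / np.linalg.norm(mention_emb)
    e = entity_embs / np.linalg.norm(entity_embs, axis=1, keepdims=True)
    scores = e @ m                    # cosine similarities, shape (n_entities,)
    order = np.argsort(-scores)[:k]   # indices of the k highest scores
    return [(int(i), float(scores[i])) for i in order]

# Toy example: entity 1 points the same way as the mention
mention = np.array([1.0, 0.0])
entities = np.array([[0.0, 1.0], [2.0, 0.0], [1.0, 1.0]])
print(top_k_entities(mention, entities, k=2))
```

For large knowledge bases you would typically replace the brute-force scan with an approximate nearest-neighbor index.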

### Training Data

- **Dataset:** 3 million pairs of Wikipedia anchor text links and Wikipedia page abstracts, derived from [this dataset](https://huggingface.co/datasets/wikimedia/structured-wikipedia)
- **Special Token:** `[ENT]` token added to the vocabulary to mark entity mentions

### Training Details

- **Hardware:** Single 80GB H100 GPU
- **Batch Size:** 80
- **Learning Rate:** 1e-5 with cosine scheduler
- **Loss Function:** Batch hard triplet loss (margin=0.4)
- **Max Sequence Length:** 256 tokens (both mentions and descriptions)
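Batch hard triplet loss picks, for each anchor in the batch, its hardest (farthest) positive and hardest (closest) negative, then applies a hinge with the margin. An illustrative NumPy version — the squared-Euclidean distance here is our assumption, not a statement of the actual training code:

```python
import numpy as np

def batch_hard_triplet_loss(embeddings, labels, margin=0.4):
    """Mean over anchors of max(0, hardest_pos_dist - hardest_neg_dist + margin)."""
    # Pairwise squared Euclidean distances, shape (n, n)
    diff = embeddings[:, None, :] - embeddings[None, :, :]
    dists = np.sum(diff ** 2, axis=-1)

    same = labels[:, None] == labels[None, :]
    n = len(labels)
    losses = []
    for i in range(n):
        pos = dists[i][same[i] & (np.arange(n) != i)]  # other items, same label
        neg = dists[i][~same[i]]                       # items with a different label
        if len(pos) == 0 or len(neg) == 0:
            continue  # anchor has no valid triplet in this batch
        losses.append(max(0.0, pos.max() - neg.min() + margin))
    return float(np.mean(losses))
```

With well-separated classes every hinge term is zero; with collapsed embeddings the loss approaches the margin.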

## Performance

### Benchmark Results

- **Dataset:** Zero-Shot Entity Linking [(Logeswaran et al., 2019)](https://arxiv.org/abs/1906.07348)
- **Metric:** Recall@64
- **Score:** 80.29%
- **Comparison:** [Meta AI's BLINK](https://arxiv.org/pdf/1911.03814) achieves 82.06% on the same test set, slightly higher than ours; however, their model was trained on the benchmark's training set, whereas ours was not.
- **Conclusion:** Our model has strong zero-shot performance
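Recall@64 counts a mention as a hit when its gold entity appears anywhere in the top 64 retrieved candidates. A small sketch of the metric (entity names below are made up for illustration):

```python
def recall_at_k(ranked_candidates, gold_entities, k=64):
    """Fraction of mentions whose gold entity appears in its top-k candidate list."""
    hits = sum(
        gold in ranked[:k]
        for ranked, gold in zip(ranked_candidates, gold_entities)
    )
    return hits / len(gold_entities)

# Two of three mentions have their gold entity in the top 2
ranked = [["apple_inc", "apple_fruit"], ["paris_tx", "paris_fr"], ["java_lang", "java_island"]]
gold = ["apple_inc", "paris_fr", "kotlin"]
print(recall_at_k(ranked, gold, k=2))
```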

### Usage Recommendations

- **Similarity Threshold:** If using our model as a classifier, 0.7 appears to be a reasonable similarity threshold for positive matches
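Applying the threshold above means accepting the best candidate only when its score clears 0.7, and abstaining otherwise. A sketch (function name and behavior on rejection are our choices, not part of the model):

```python
def link_or_abstain(best_score, best_entity, threshold=0.7):
    """Accept the top candidate only when its similarity clears the threshold."""
    return best_entity if best_score >= threshold else None

print(link_or_abstain(0.83, "Apple_Inc."))  # Apple_Inc.
print(link_or_abstain(0.41, "Apple_Inc."))  # None
```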

## Citation