zdanGL committed (verified)
Commit 95bd3b1 · 1 Parent(s): b974b37

Update README.md

Files changed (1)
  1. README.md +13 -15
README.md CHANGED
@@ -10,20 +10,20 @@ base_model:
 
 ## Model Description
 
- **roberta-large-entity-linking** is a [RoBERTa large model](https://huggingface.co/FacebookAI/roberta-large) fine-tuned as a bi-encoder for entity linking tasks. The model separately embeds mentions-in-context and entity descriptions to enable semantic matching between text mentions and knowledge base entities.
+ **roberta-large-entity-linking** is a [RoBERTa large model](https://huggingface.co/FacebookAI/roberta-large) fine-tuned as a [bi-encoder](https://arxiv.org/pdf/1811.08008) for [entity linking](https://en.wikipedia.org/wiki/Entity_linking) tasks. The model separately embeds mentions-in-context and entity descriptions to enable semantic matching between text mentions and knowledge base entities.
 
 ## Intended Uses
 
 ### Primary Use Cases
- - **Entity Linking:** Link Wikipedia concepts mentioned in text to their corresponding Wikipedia pages. With [this dataset](https://huggingface.co/datasets/wikimedia/structured-wikipedia) [Wikimedia](https://huggingface.co/wikimedia) makes it easy, you can embed their entries in the "abstract" column (you may need to do some cleanup to filter out irrelevant entries).
+ - **Entity Linking:** Link Wikipedia concepts mentioned in text to their corresponding Wikipedia pages. [Wikimedia](https://huggingface.co/wikimedia) makes this easy with [this dataset](https://huggingface.co/datasets/wikimedia/structured-wikipedia): embed the entries in its "abstract" column (you may need some cleanup to filter out irrelevant entries).
 - **Zero-shot Entity Linking:** Link entities to knowledge bases without task-specific training
 - **Knowledge Base Construction:** Build and reference new knowledge bases using the model's strong generalization capabilities
+ - **Note:** You may use the model as a top-k retriever and perform the final disambiguation with a more powerful classification model
 
 ### Recommended Preprocessing
- - Use `[ENT]` tokens to mark entity mentions: `left context [ENT] mention [ENT] right context`
- - Consider using NER models to identify candidate mentions
- - For non-standard entities (e.g., "daytime"), extract noun phrases using NLTK or spaCy
- - Clean and filter knowledge base entries to remove irrelevant concepts
+ - Use `[ENT]` tokens to mark an entity mention (see the sketch after this hunk): `left context [ENT] mention [ENT] right context`
+ - Consider using an NER model to identify candidate mentions
+ - For non-standard entities (e.g., "daytime"), you might extract noun phrases with NLTK or spaCy to locate candidate mentions
 
 ## Code Example
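A minimal sketch of the `[ENT]` marking convention from the updated preprocessing list, using spaCy noun chunks to locate candidate mentions. The `en_core_web_sm` pipeline, the `mark_mention` helper, and the example sentence are illustrative assumptions, not part of the model card:

```python
import spacy

# Assumes the small English pipeline is installed:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def mark_mention(text: str, start: int, end: int) -> str:
    # Produces: left context [ENT] mention [ENT] right context
    return f"{text[:start]}[ENT] {text[start:end]} [ENT]{text[end:]}"

text = "Tim Cook, president of Apple, is a guy who lives in California."

# Each noun chunk is one candidate mention to link.
for chunk in nlp(text).noun_chunks:
    print(mark_mention(text, chunk.start_char, chunk.end_char))
```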
 
@@ -52,7 +52,6 @@ model.to(device)
 # Verify the special token is there
 print('[ENT]' in tokenizer.get_added_vocab())
 
-
 context = "Tim Cook, [ENT] president [ENT] of Apple, is a guy who lives in California."
 
 definitions = [
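The hunk above shows only fragments of the README's code example. For orientation, here is a self-contained sketch of the bi-encoder flow it builds toward; the checkpoint id `zdanGL/roberta-large-entity-linking` and mean pooling are assumptions (the card's full example may pool differently):

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

# Repo id assumed from the model card; adjust if different.
tokenizer = AutoTokenizer.from_pretrained("zdanGL/roberta-large-entity-linking")
model = AutoModel.from_pretrained("zdanGL/roberta-large-entity-linking")
model.eval()

def embed(texts):
    # Mean-pool the last hidden state over non-padding tokens.
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=256, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state
    mask = batch["attention_mask"].unsqueeze(-1)
    return (hidden * mask).sum(1) / mask.sum(1)

context = "Tim Cook, [ENT] president [ENT] of Apple, is a guy who lives in California."
definitions = ["President: the leader of a company or organization.",
               "Apple: an American technology company."]

# Highest cosine similarity should point at the matching definition.
print(F.cosine_similarity(embed([context]), embed(definitions)))
```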
@@ -108,29 +107,28 @@ for i, definition in enumerate(definitions):
 
 
 ### Training Data
- - **Dataset:** 3 million pairs of Wikipedia anchor text links and Wikipedia page descriptions
- - **Source:** Wikipedia anchor links paired with first few hundred words of target pages
- - **Special Token:** `[ENT]` token added to mark entity mentions
- - **Max Sequence Length:** 256 tokens (both mentions and descriptions)
+ - **Dataset:** 3 million pairs of Wikipedia anchor text links and Wikipedia page abstracts, derived from [this dataset](https://huggingface.co/datasets/wikimedia/structured-wikipedia)
+ - **Special Token:** `[ENT]` token added to the vocabulary to mark entity mentions
+
 
 ### Training Details
 - **Hardware:** Single 80GB H100 GPU
 - **Batch Size:** 80
 - **Learning Rate:** 1e-5 with cosine scheduler
 - **Loss Function:** Batch hard triplet loss (margin=0.4)
- - **Inspiration:** Meta AI's BLINK and Google's "Learning Dense Representations for Entity Retrieval"
+ - **Max Sequence Length:** 256 tokens (both mentions and descriptions)
 
 ## Performance
 
 ### Benchmark Results
- - **Dataset:** Zero-Shot Entity Linking (Logeswaran et al., 2019)
+ - **Dataset:** Zero-Shot Entity Linking [(Logeswaran et al., 2019)](https://arxiv.org/abs/1906.07348)
 - **Metric:** Recall@64
 - **Score:** 80.29%
- - **Comparison:** Meta AI's BLINK achieves 82.06% on the same test set - slightly higher than ours, however, their model was trained on the training set but ours was not.
+ - **Comparison:** [Meta AI's BLINK](https://arxiv.org/pdf/1911.03814) achieves 82.06% on the same test set. That is slightly higher, but BLINK was trained on the benchmark's training set while our model was not.
 - **Conclusion:** Our model has strong zero-shot performance
 
 ### Usage Recommendations
- - **Similarity Threshold:** 0.7 for positive matches (based on empirical testing)
+ - **Similarity Threshold:** If using the model as a classifier, 0.7 appears to be a reasonable threshold for positive matches
 
 
 ## Citation
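The training details above name batch hard triplet loss with margin 0.4. A minimal sketch of that loss in its usual form (hardest positive and hardest negative per anchor within the batch); cosine distance is an assumption here, since the card does not state the distance metric:

```python
import torch
import torch.nn.functional as F

def batch_hard_triplet_loss(embeddings: torch.Tensor,
                            labels: torch.Tensor,
                            margin: float = 0.4) -> torch.Tensor:
    """For each anchor, take its hardest (farthest) positive and
    hardest (closest) negative in the batch, then apply a hinge."""
    normed = F.normalize(embeddings, dim=1)
    dist = 1.0 - normed @ normed.T  # pairwise cosine distance

    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    eye = torch.eye(len(labels), dtype=torch.bool, device=labels.device)
    pos_mask = same & ~eye  # same label, excluding self

    hardest_pos = (dist * pos_mask).max(dim=1).values
    # Mask out self and positives before taking the closest negative.
    hardest_neg = dist.masked_fill(same, float("inf")).min(dim=1).values

    return F.relu(hardest_pos - hardest_neg + margin).mean()
```

In the bi-encoder setting, mention and description embeddings can share one batch, with matching pair ids as labels, so each mention's gold description is its positive and every other description is a negative.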
 
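Recall@64, the metric in the benchmark results, measures how often the gold entity appears among the top 64 scored candidates. A small helper with hypothetical random inputs (the names and sizes are illustrative):

```python
import torch

def recall_at_k(scores: torch.Tensor, gold: torch.Tensor, k: int = 64) -> float:
    """scores: (mentions, entities) similarity matrix;
    gold: (mentions,) index of the correct entity per mention."""
    topk = scores.topk(k, dim=1).indices            # (mentions, k)
    hits = (topk == gold.unsqueeze(1)).any(dim=1)   # gold inside top-k?
    return hits.float().mean().item()

# Hypothetical: 1,000 mentions scored against 10,000 entities.
scores = torch.randn(1000, 10_000)
gold = torch.randint(0, 10_000, (1000,))
print(recall_at_k(scores, gold))  # ~0.0064 expected for random scores
```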
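Finally, the usage recommendations suggest a 0.7 similarity threshold, and the use-case note suggests top-k retrieval followed by heavier disambiguation. A sketch of that retrieval step over placeholder embeddings; real inputs would come from the model, and with random vectors the accepted list will usually be empty:

```python
import torch
import torch.nn.functional as F

# Placeholders standing in for model outputs: one mention embedding and a
# precomputed knowledge-base matrix (RoBERTa-large hidden size is 1024).
mention = F.normalize(torch.randn(1, 1024), dim=1)
kb = F.normalize(torch.randn(10_000, 1024), dim=1)

# On normalized vectors, cosine similarity is a dot product.
scores = (mention @ kb.T).squeeze(0)
top = scores.topk(64)

# Keep candidates above the suggested 0.7 threshold; survivors can be
# re-ranked by a more powerful disambiguation model.
accepted = [(idx.item(), score.item())
            for score, idx in zip(top.values, top.indices)
            if score >= 0.7]
print(accepted)
```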