Sentence Similarity · Safetensors · roberta

zdanGL committed · Commit b974b37 · verified · Parent(s): 0d6ac39

Update README.md

Files changed (1): README.md +24 −46

README.md CHANGED

@@ -20,39 +20,11 @@ base_model:
  - **Knowledge Base Construction:** Build and reference new knowledge bases using the model's strong generalization capabilities
 
  ### Recommended Preprocessing
- - Use `[ENT]` tokens to mark entity mentions: `[ENT] mention [ENT]`
+ - Use `[ENT]` tokens to mark entity mentions: `left context [ENT] mention [ENT] right context`
  - Consider using NER models to identify candidate mentions
  - For non-standard entities (e.g., "daytime"), extract noun phrases using NLTK or spaCy
  - Clean and filter knowledge base entries to remove irrelevant concepts
 
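To make the marking and candidate-extraction steps above concrete, here is a minimal sketch, assuming spaCy and its `en_core_web_sm` model are installed; the `mark_mention` helper and the example sentence are illustrative, not part of this repository:

```python
# Illustrative preprocessing sketch (not part of this repo): wrap candidate
# mentions in [ENT] markers, using spaCy noun chunks as candidates.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes this spaCy model is installed

def mark_mention(text: str, start: int, end: int) -> str:
    """Wrap the character span text[start:end] in [ENT] markers."""
    return f"{text[:start]}[ENT] {text[start:end]} [ENT]{text[end:]}"

text = "The festival takes place during the daytime in Lisbon."
for chunk in nlp(text).noun_chunks:  # candidates, incl. non-named entities
    print(mark_mention(text, chunk.start_char, chunk.end_char))
```

Noun chunks cover non-named mentions such as "the daytime" that an off-the-shelf NER model would miss.
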
- ## Model Details
-
- ### Training Data
- - **Dataset:** 3 million pairs of Wikipedia anchor-text links and Wikipedia page descriptions
- - **Source:** Wikipedia anchor links paired with the first few hundred words of the target pages
- - **Special Token:** `[ENT]` token added to mark entity mentions
- - **Max Sequence Length:** 256 tokens (both mentions and descriptions)
-
- ### Training Details
- - **Hardware:** Single 80 GB H100 GPU
- - **Batch Size:** 80
- - **Learning Rate:** 1e-5 with a cosine scheduler
- - **Loss Function:** Batch-hard triplet loss (margin = 0.4)
- - **Inspiration:** Meta AI's BLINK and Google's "Learning Dense Representations for Entity Retrieval"
-
- ## Performance
-
- ### Benchmark Results
- - **Dataset:** Zero-Shot Entity Linking (Logeswaran et al., 2019)
- - **Metric:** Recall@64
- - **Score:** 80.29%
- - **Comparison:** Meta AI's BLINK achieves 82.06% on the same test set, slightly higher than ours; however, BLINK was trained on the dataset's training split, while our model was not.
- - **Conclusion:** Our model shows strong zero-shot performance
-
- ### Usage Recommendations
- - **Similarity Threshold:** 0.7 for positive matches (based on empirical testing)
-
  ## Code Example
 
  ```python
@@ -132,34 +104,40 @@ for i, definition in enumerate(definitions):
  print(f"Similarity: {sim_value:.4f}\n")
  ```
 
- ## Input Format
-
- ### Mention Context
- - Mark target mentions with `[ENT]` tokens: `"Text with [ENT] entity mention [ENT] in context"`
- - Maximum length: 256 tokens
-
- ### Entity Descriptions
- - Provide entity descriptions (e.g., Wikipedia abstracts)
- - Maximum length: 256 tokens
-
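As a sketch of how inputs might be encoded under this 256-token limit: the repo id below is a placeholder, and we assume `[ENT]` is already registered as a special token in the checkpoint's tokenizer.

```python
# Illustrative encoding sketch; "your-org/roberta-large-entity-linking" is a
# placeholder repo id, and we assume [ENT] is already in the tokenizer vocab.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("your-org/roberta-large-entity-linking")

mention = "Text with [ENT] entity mention [ENT] in context"
description = "A Wikipedia-style abstract describing the candidate entity."

# Both inputs are truncated to the model's 256-token limit.
enc_mention = tokenizer(mention, max_length=256, truncation=True, return_tensors="pt")
enc_description = tokenizer(description, max_length=256, truncation=True, return_tensors="pt")
print(enc_mention["input_ids"].shape, enc_description["input_ids"].shape)
```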
 
- ## Limitations and Biases
-
- - **Language:** English only
- - **Domain:** Primarily trained on Wikipedia data
- - **Bias:** May inherit biases present in Wikipedia content
- - **Performance:** Slightly lower than supervised models on in-domain tasks
-
- ## References
-
- - Logeswaran et al. (2019). [Zero-Shot Entity Linking by Reading Entity Descriptions](https://arxiv.org/pdf/1906.07348)
- - Meta AI BLINK: [GitHub Repository](https://github.com/facebookresearch/BLINK)
- - Gillick et al. (2019). "Learning Dense Representations for Entity Retrieval" (Google)
-
+ ## Model Details
+
+ ### Training Data
+ - **Dataset:** 3 million pairs of Wikipedia anchor-text links and Wikipedia page descriptions
+ - **Source:** Wikipedia anchor links paired with the first few hundred words of the target pages
+ - **Special Token:** `[ENT]` token added to mark entity mentions
+ - **Max Sequence Length:** 256 tokens (both mentions and descriptions)
+
+ ### Training Details
+ - **Hardware:** Single 80 GB H100 GPU
+ - **Batch Size:** 80
+ - **Learning Rate:** 1e-5 with a cosine scheduler
+ - **Loss Function:** Batch-hard triplet loss (margin = 0.4)
+ - **Inspiration:** Meta AI's BLINK and Google's "Learning Dense Representations for Entity Retrieval"
+
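For readers unfamiliar with the loss named above, the following is a minimal PyTorch sketch of batch-hard triplet mining over paired mention/description embeddings; it is a generic illustration with assumed shapes, not this repository's actual training code:

```python
# Generic sketch of batch-hard triplet loss (assumed, not the repo's training
# code). Row i of `mentions` and `descriptions` is a positive pair, so with
# one positive per anchor, mining reduces to the hardest (most similar)
# in-batch negative description for each mention.
import torch
import torch.nn.functional as F

def batch_hard_triplet_loss(mentions: torch.Tensor,
                            descriptions: torch.Tensor,
                            margin: float = 0.4) -> torch.Tensor:
    m = F.normalize(mentions, dim=1)        # (B, D) mention embeddings
    d = F.normalize(descriptions, dim=1)    # (B, D) description embeddings
    sims = m @ d.T                          # (B, B) cosine similarities
    pos = sims.diag()                       # similarity to the paired description
    diag = torch.eye(len(m), dtype=torch.bool, device=m.device)
    hardest_neg = sims.masked_fill(diag, float("-inf")).max(dim=1).values
    return F.relu(margin - pos + hardest_neg).mean()

# Shapes mirror the card: batch size 80, roberta-large hidden size 1024.
loss = batch_hard_triplet_loss(torch.randn(80, 1024), torch.randn(80, 1024))
```
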
+ ## Performance
+
+ ### Benchmark Results
+ - **Dataset:** Zero-Shot Entity Linking (Logeswaran et al., 2019)
+ - **Metric:** Recall@64
+ - **Score:** 80.29%
+ - **Comparison:** Meta AI's BLINK achieves 82.06% on the same test set, slightly higher than ours; however, BLINK was trained on the dataset's training split, while our model was not.
+ - **Conclusion:** Our model shows strong zero-shot performance
+
+ ### Usage Recommendations
+ - **Similarity Threshold:** 0.7 for positive matches (based on empirical testing)
+
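A short sketch of how this 0.7 threshold might be applied when scoring a mention embedding against candidate entity embeddings; the `link_mention` helper is hypothetical:

```python
# Illustrative application of the 0.7 cutoff (the helper is hypothetical).
import torch
import torch.nn.functional as F

THRESHOLD = 0.7  # empirical cutoff for a positive match, per the card

def link_mention(mention_emb: torch.Tensor, entity_embs: torch.Tensor):
    """Return (best_index, best_sim), or (None, best_sim) when nothing passes."""
    sims = F.cosine_similarity(mention_emb.unsqueeze(0), entity_embs)  # (N,)
    best = int(sims.argmax())
    best_sim = float(sims[best])
    return (best if best_sim >= THRESHOLD else None, best_sim)

idx, sim = link_mention(torch.randn(1024), torch.randn(5, 1024))
print(idx, f"{sim:.4f}")  # idx of None means "no match in the knowledge base"
```
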
  ## Citation
 
  ```bibtex
  @misc{roberta-large-entity-linking,
- author = {[Your Name/Organization]},
+ author = {[Glass, Lewis & Co.]},
  title = {RoBERTa Large Entity Linking},
  year = {2024},
  publisher = {Hugging Face},