## Model Description

**roberta-large-entity-linking** is a [RoBERTa large model](https://huggingface.co/FacebookAI/roberta-large) fine-tuned as a [bi-encoder](https://arxiv.org/pdf/1811.08008) for [entity linking](https://en.wikipedia.org/wiki/Entity_linking) tasks. The model separately embeds mentions-in-context and entity descriptions to enable semantic matching between text mentions and knowledge base entities.

## Intended Uses

### Primary Use Cases

- **Entity Linking:** Link Wikipedia concepts mentioned in text to their corresponding Wikipedia pages. [Wikimedia](https://huggingface.co/wikimedia) makes this easy with [this dataset](https://huggingface.co/datasets/wikimedia/structured-wikipedia): you can embed the entries in the "abstract" column (you may need to do some cleanup to filter out irrelevant entries).
- **Zero-shot Entity Linking:** Link entities to knowledge bases without task-specific training
- **Knowledge Base Construction:** Build and reference new knowledge bases using the model's strong generalization capabilities
- **Note:** You may use the model as a top-k retriever and perform the final disambiguation with a more powerful classification model

### Recommended Preprocessing

- Use `[ENT]` tokens to mark an entity mention: `left context [ENT] mention [ENT] right context`
- Consider using an NER model to identify candidate mentions
- For non-standard entities (e.g., "daytime"), you might extract noun phrases (with NLTK or spaCy, for example) to locate candidate mentions
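The `[ENT]` marking convention above is easy to apply once you have a mention's character span. A minimal sketch (the `mark_mention` helper is ours, for illustration, not part of the model's API):

```python
def mark_mention(text: str, start: int, end: int, marker: str = "[ENT]") -> str:
    """Wrap the mention at text[start:end] in [ENT] markers, producing
    'left context [ENT] mention [ENT] right context'."""
    return f"{text[:start]}{marker} {text[start:end]} {marker}{text[end:]}".strip()

sentence = "Tim Cook, president of Apple, is a guy who lives in California."
start = sentence.index("president")
print(mark_mention(sentence, start, start + len("president")))
# Tim Cook, [ENT] president [ENT] of Apple, is a guy who lives in California.
```

An NER model or noun-phrase chunker would supply the `start`/`end` offsets in practice.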

## Code Example

```python
# Verify the special token is there
print('[ENT]' in tokenizer.get_added_vocab())

context = "Tim Cook, [ENT] president [ENT] of Apple, is a guy who lives in California."

definitions = [
    # entity descriptions to compare against (elided in this excerpt)
]
```
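Once the mention and the candidate definitions are embedded, linking reduces to nearest-neighbor search over the entity vectors. A minimal NumPy sketch of that step (toy 2-d vectors stand in for real model outputs; the function name is illustrative):

```python
import numpy as np

def top_k_entities(mention_emb, entity_embs, k=3):
    """Rank entity embeddings by cosine similarity to a mention embedding
    and return (index, score) pairs, best match first."""
    m = mention_emb / np.linalg.norm(mention_emb)
    e = entity_embs / np.linalg.norm(entity_embs, axis=1, keepdims=True)
    scores = e @ m                    # cosine similarities, shape (n_entities,)
    order = np.argsort(-scores)[:k]   # indices of the k highest scores
    return [(int(i), float(scores[i])) for i in order]

# Toy example: entity 1 points the same way as the mention
mention = np.array([1.0, 0.0])
entities = np.array([[0.0, 1.0], [2.0, 0.0], [1.0, 1.0]])
print(top_k_entities(mention, entities, k=2))
```

For large knowledge bases you would typically replace the brute-force scan with an approximate nearest-neighbor index.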

### Training Data

- **Dataset:** 3 million pairs of Wikipedia anchor text links and Wikipedia page abstracts, derived from [this dataset](https://huggingface.co/datasets/wikimedia/structured-wikipedia)
- **Special Token:** `[ENT]` token added to the vocabulary to mark entity mentions

### Training Details

- **Hardware:** Single 80GB H100 GPU
- **Batch Size:** 80
- **Learning Rate:** 1e-5 with cosine scheduler
- **Loss Function:** Batch hard triplet loss (margin=0.4)
- **Max Sequence Length:** 256 tokens (both mentions and descriptions)
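Batch hard triplet loss picks, for each anchor in the batch, its hardest (farthest) positive and hardest (closest) negative, then applies a hinge with the margin. An illustrative NumPy version — the squared-Euclidean distance here is our assumption, not a statement of the actual training code:

```python
import numpy as np

def batch_hard_triplet_loss(embeddings, labels, margin=0.4):
    """Mean over anchors of max(0, hardest_pos_dist - hardest_neg_dist + margin)."""
    # Pairwise squared Euclidean distances, shape (n, n)
    diff = embeddings[:, None, :] - embeddings[None, :, :]
    dists = np.sum(diff ** 2, axis=-1)

    same = labels[:, None] == labels[None, :]
    n = len(labels)
    losses = []
    for i in range(n):
        pos = dists[i][same[i] & (np.arange(n) != i)]  # other items, same label
        neg = dists[i][~same[i]]                       # items with a different label
        if len(pos) == 0 or len(neg) == 0:
            continue  # anchor has no valid triplet in this batch
        losses.append(max(0.0, pos.max() - neg.min() + margin))
    return float(np.mean(losses))
```

With well-separated classes every hinge term is zero; with collapsed embeddings the loss approaches the margin.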

## Performance

### Benchmark Results

- **Dataset:** Zero-Shot Entity Linking [(Logeswaran et al., 2019)](https://arxiv.org/abs/1906.07348)
- **Metric:** Recall@64
- **Score:** 80.29%
- **Comparison:** [Meta AI's BLINK](https://arxiv.org/pdf/1911.03814) achieves 82.06% on the same test set, slightly higher than ours; however, their model was trained on the benchmark's training set, whereas ours was not.
- **Conclusion:** Our model has strong zero-shot performance
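Recall@64 counts a mention as a hit when its gold entity appears anywhere in the top 64 retrieved candidates. A small sketch of the metric (entity names below are made up for illustration):

```python
def recall_at_k(ranked_candidates, gold_entities, k=64):
    """Fraction of mentions whose gold entity appears in its top-k candidate list."""
    hits = sum(
        gold in ranked[:k]
        for ranked, gold in zip(ranked_candidates, gold_entities)
    )
    return hits / len(gold_entities)

# Two of three mentions have their gold entity in the top 2
ranked = [["apple_inc", "apple_fruit"], ["paris_tx", "paris_fr"], ["java_lang", "java_island"]]
gold = ["apple_inc", "paris_fr", "kotlin"]
print(recall_at_k(ranked, gold, k=2))
```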

### Usage Recommendations

- **Similarity Threshold:** If using our model as a classifier, 0.7 appears to be a reasonable similarity threshold for positive matches
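Applying the threshold above means accepting the best candidate only when its score clears 0.7, and abstaining otherwise. A sketch (function name and behavior on rejection are our choices, not part of the model):

```python
def link_or_abstain(best_score, best_entity, threshold=0.7):
    """Accept the top candidate only when its similarity clears the threshold."""
    return best_entity if best_score >= threshold else None

print(link_or_abstain(0.83, "Apple_Inc."))  # Apple_Inc.
print(link_or_abstain(0.41, "Apple_Inc."))  # None
```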

## Citation