Update README.md
README.md
datasets:
- allenai/scirepeval
---

## SPECTER2

<!-- Provide a quick summary of what the model is/does. -->

SPECTER2 is a family of models that succeeds [SPECTER](https://huggingface.co/allenai/specter) and is capable of generating task-specific embeddings for scientific tasks when paired with [adapters](https://huggingface.co/models?search=allenai/specter-2_).

Given the combination of the title and abstract of a scientific paper, or a short textual query, the model can be used to generate effective embeddings for downstream applications.

**Note: For general embedding purposes, please use [allenai/specter2](https://huggingface.co/allenai/specter2).**

**To get the best performance on a downstream task type, please load the associated adapter with the base model as in the example below.**
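For illustration, the input the model expects is the title and abstract joined by the tokenizer's separator token; a minimal sketch with a hypothetical paper record, assuming a BERT-style `[SEP]` separator (in practice, use `tokenizer.sep_token` rather than hard-coding it):

```python
# hypothetical paper record; the separator token is an assumption
# based on the bert-base-uncased vocabulary underlying the model
paper = {"title": "BERT",
         "abstract": "We introduce a new language representation model called BERT"}
sep_token = "[SEP]"

# join title and abstract; fall back to an empty string if the abstract is missing
text = paper["title"] + sep_token + (paper.get("abstract") or "")
print(text)  # BERT[SEP]We introduce a new language representation model called BERT
```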
**Dec 2023 Update:**

Model usage updated to be compatible with the latest versions of the transformers and adapters libraries.

**However, for benchmarking purposes, please continue using the current version.**

# Adapter `allenai/specter2_adhoc_query` for allenai/specter2_base

An [adapter](https://adapterhub.ml) for the [`allenai/specter2_base`](https://huggingface.co/allenai/specter2_base) model that was trained on the [allenai/scirepeval](https://huggingface.co/datasets/allenai/scirepeval/) dataset.

This adapter was created for usage with the **[adapters](https://github.com/adapter-hub/adapters)** library.

## Adapter Usage

First, install `adapters`:

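A minimal install step, assuming the library is available on PyPI under the package name `adapters`:

```shell
pip install adapters
```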
It builds on the work done in [SciRepEval: A Multi-Format Benchmark for Scientific Document Representations](https://api.semanticscholar.org/CorpusID:254018137) and we evaluate the trained model on this benchmark as well.

- **Developed by:** Amanpreet Singh, Mike D'Arcy, Arman Cohan, Doug Downey, Sergey Feldman
- **Shared by:** Allen AI
- **Model type:** bert-base-uncased + adapters

```python
from typing import List

import torch
from transformers import AutoTokenizer
from adapters import AutoAdapterModel
from sklearn.metrics.pairwise import euclidean_distances


def embed_input(text_batch: List[str]):
    # preprocess the input
    inputs = tokenizer(text_batch, padding=True, truncation=True,
                       return_tensors="pt", return_token_type_ids=False, max_length=512)
    with torch.no_grad():
        output = model(**inputs)
    # take the first token in the batch as the embedding
    embeddings = output.last_hidden_state[:, 0, :]
    return embeddings


# load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('allenai/specter2_base')

# load base model
model = AutoAdapterModel.from_pretrained('allenai/specter2_base')

# load the query adapter, provide an identifier for it in the load_as argument and activate it
model.load_adapter("allenai/specter2_adhoc_query", source="hf", load_as="specter2_adhoc_query", set_active=True)
query = ["Bidirectional transformers"]
query_embedding = embed_input(query)

# load the proximity adapter, provide an identifier for it in the load_as argument and activate it
model.load_adapter("allenai/specter2_proximity", source="hf", load_as="specter2", set_active=True)
papers = [{'title': 'BERT', 'abstract': 'We introduce a new language representation model called BERT'},
          {'title': 'Attention is all you need', 'abstract': 'The dominant sequence transduction models are based on complex recurrent or convolutional neural networks'}]

# concatenate title and abstract with the tokenizer's separator token
text_papers_batch = [d['title'] + tokenizer.sep_token + (d.get('abstract') or '') for d in papers]
paper_embeddings = embed_input(text_papers_batch)

# calculate the L2 distance between the query and each paper embedding
l2_distance = euclidean_distances(paper_embeddings.numpy(), query_embedding.numpy()).flatten()
```
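The resulting distances can then be used to rank the candidate papers for the query, with a smaller L2 distance meaning a closer match. A minimal sketch using hypothetical distance values and titles:

```python
import numpy as np

# hypothetical L2 distances between one query and three candidate papers
l2_distance = np.array([1.7, 0.9, 2.4])
titles = ["BERT", "Attention is all you need", "Longformer"]

# argsort ascending: the closest paper comes first
ranking = [titles[i] for i in np.argsort(l2_distance)]
print(ranking)  # ['Attention is all you need', 'BERT', 'Longformer']
```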
## Downstream Use