Update README.md
README.md
datasets:
- allenai/scirepeval
---

## SPECTER2

<!-- Provide a quick summary of what the model is/does. -->

SPECTER2 is a family of models that succeeds [SPECTER](https://huggingface.co/allenai/specter) and is capable of generating task-specific embeddings for scientific tasks when paired with [adapters](https://huggingface.co/models?search=allenai/specter-2_).

Given the combination of the title and abstract of a scientific paper, or a short textual query, the model can be used to generate effective embeddings for downstream applications.

**Note: For general embedding purposes, please use [allenai/specter2](https://huggingface.co/allenai/specter2).**

**To get the best performance on a downstream task type, please load the associated adapter with the base model as in the example below.**
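For illustration, the input the model expects is the title and abstract joined by the tokenizer's separator token; a minimal sketch with a hypothetical paper record, assuming a BERT-style `[SEP]` separator (in practice, use `tokenizer.sep_token` rather than hard-coding it):

```python
# hypothetical paper record; the separator token is an assumption
# based on the bert-base-uncased vocabulary underlying the model
paper = {"title": "BERT",
         "abstract": "We introduce a new language representation model called BERT"}
sep_token = "[SEP]"

# join title and abstract; fall back to an empty string if the abstract is missing
text = paper["title"] + sep_token + (paper.get("abstract") or "")
print(text)  # BERT[SEP]We introduce a new language representation model called BERT
```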
**Dec 2023 Update:**

Model usage updated to be compatible with the latest versions of the transformers and adapters libraries.

**However, for benchmarking purposes, please continue using the current version.**

# Adapter `allenai/specter2_adhoc_query` for allenai/specter2_base

An [adapter](https://adapterhub.ml) for the [`allenai/specter2_base`](https://huggingface.co/allenai/specter2_base) model that was trained on the [allenai/scirepeval](https://huggingface.co/datasets/allenai/scirepeval/) dataset.

This adapter was created for usage with the **[adapters](https://github.com/adapter-hub/adapters)** library.

## Adapter Usage

First, install `adapters`:

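A minimal install step, assuming the library is available on PyPI under the package name `adapters`:

```shell
pip install adapters
```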
It builds on the work done in [SciRepEval: A Multi-Format Benchmark for Scientific Document Representations](https://api.semanticscholar.org/CorpusID:254018137) and we evaluate the trained model on this benchmark as well.

- **Developed by:** Amanpreet Singh, Mike D'Arcy, Arman Cohan, Doug Downey, Sergey Feldman
- **Shared by:** Allen AI
- **Model type:** bert-base-uncased + adapters

```python
from typing import List

import torch
from transformers import AutoTokenizer
from adapters import AutoAdapterModel
from sklearn.metrics.pairwise import euclidean_distances


def embed_input(text_batch: List[str]):
    # preprocess the input
    inputs = tokenizer(text_batch, padding=True, truncation=True,
                       return_tensors="pt", return_token_type_ids=False, max_length=512)
    with torch.no_grad():
        output = model(**inputs)
    # take the first token in the batch as the embedding
    embeddings = output.last_hidden_state[:, 0, :]
    return embeddings


# load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('allenai/specter2_base')

# load base model
model = AutoAdapterModel.from_pretrained('allenai/specter2_base')

# load the query adapter, provide an identifier for it in the load_as argument and activate it
model.load_adapter("allenai/specter2_adhoc_query", source="hf", load_as="specter2_adhoc_query", set_active=True)
query = ["Bidirectional transformers"]
query_embedding = embed_input(query)

# load the proximity adapter, provide an identifier for it in the load_as argument and activate it
model.load_adapter("allenai/specter2_proximity", source="hf", load_as="specter2", set_active=True)
papers = [{'title': 'BERT', 'abstract': 'We introduce a new language representation model called BERT'},
          {'title': 'Attention is all you need', 'abstract': 'The dominant sequence transduction models are based on complex recurrent or convolutional neural networks'}]

# concatenate title and abstract with the tokenizer's separator token
text_papers_batch = [d['title'] + tokenizer.sep_token + (d.get('abstract') or '') for d in papers]
paper_embeddings = embed_input(text_papers_batch)

# calculate the L2 distance between the query and each paper embedding
l2_distance = euclidean_distances(paper_embeddings.numpy(), query_embedding.numpy()).flatten()
```
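The resulting distances can then be used to rank the candidate papers for the query, with a smaller L2 distance meaning a closer match. A minimal sketch using hypothetical distance values and titles:

```python
import numpy as np

# hypothetical L2 distances between one query and three candidate papers
l2_distance = np.array([1.7, 0.9, 2.4])
titles = ["BERT", "Attention is all you need", "Longformer"]

# argsort ascending: the closest paper comes first
ranking = [titles[i] for i in np.argsort(l2_distance)]
print(ranking)  # ['Attention is all you need', 'BERT', 'Longformer']
```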
## Downstream Use