| | --- |
| | tags: |
| | - adapter-transformers |
| | - bert |
| | datasets: |
| | - allenai/scirepeval |
| | --- |
| | |
| | ## SPECTER2 |
| |
|
| | <!-- Provide a quick summary of what the model is/does. --> |
| |
|
| | SPECTER2 is a family of models that succeeds [SPECTER](https://huggingface.co/allenai/specter) and can generate task-specific embeddings for scientific tasks when paired with [adapters](https://huggingface.co/models?search=allenai/specter-2_). |
| | Given the title and abstract of a scientific paper, or a short textual query, the model generates embeddings for use in downstream applications. |
| |
|
| | **Note: For general embedding purposes, please use [allenai/specter2](https://huggingface.co/allenai/specter2).** |
| |
|
| | **To get the best performance on a downstream task type, please load the associated adapter with the base model, as in the example below.** |
| |
|
| | **Dec 2023 Update:** |
| |
|
| | Model usage has been updated to be compatible with the latest versions of the transformers and adapters libraries (adapters is the newly released update to adapter-transformers). |
| |
|
| |
|
| | **Aug 2023 Update:** |
| | 1. **The SPECTER2 Base and proximity adapter models have been renamed in Hugging Face based upon usage patterns as follows:** |
| |
|
| | |Old Name|New Name| |
| | |--|--| |
| | |allenai/specter2|[allenai/specter2_base](https://huggingface.co/allenai/specter2_base)| |
| | |allenai/specter2_proximity|[allenai/specter2](https://huggingface.co/allenai/specter2)| |
| | |
| | 2. **We have a parallel version (termed [aug2023refresh](https://huggingface.co/allenai/specter2_aug2023refresh)) where the base transformer encoder version is pre-trained on a collection of newer papers (published after 2018). |
| | However, for benchmarking purposes, please continue using the current version.** |
| | |
| | |
| | # Adapter `allenai/specter2_regression` for allenai/specter2_base |
| | |
| | An [adapter](https://adapterhub.ml) for the [`allenai/specter2_base`](https://huggingface.co/allenai/specter2_base) model that was trained on the [allenai/scirepeval](https://huggingface.co/datasets/allenai/scirepeval/) dataset. |
| | |
| | This adapter was created for use with the **[adapters](https://github.com/adapter-hub/adapters)** library (the successor to adapter-transformers). |
| | |
| | ## Adapter Usage |
| | |
| | First, install `adapters`: |
| | |
| | ``` |
| | pip install -U adapters |
| | ``` |
| | _Note: adapters is built as an add-on to transformers and acts as a drop-in replacement with adapter support. [More](https://docs.adapterhub.ml)_ |
| |
|
| | Now, the adapter can be loaded and activated like this: |
| |
|
| | ```python |
| | from adapters import AutoAdapterModel |
| | |
| | model = AutoAdapterModel.from_pretrained("allenai/specter2_base") |
| | adapter_name = model.load_adapter("allenai/specter2_regression", source="hf", set_active=True) |
| | ``` |
| |
|
| |
|
| | # Model Details |
| |
|
| | ## Model Description |
| |
|
| | SPECTER2 has been trained on over 6M triplets of scientific paper citations, which are available [here](https://huggingface.co/datasets/allenai/scirepeval/viewer/cite_prediction_new/evaluation). |
| | After that, it is further trained with additional task-format-specific adapter modules on all the [SciRepEval](https://huggingface.co/datasets/allenai/scirepeval) training tasks. |
| |
|
| | Task Formats trained on: |
| | - Classification |
| | - Regression |
| | - Proximity (Retrieval) |
| | - Adhoc Search |
| |
|
| | **This is the regression-specific adapter. It generates embeddings that can be used as input to downstream regression models, such as SVRs, to predict a continuous value.** |
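To make the downstream step concrete, here is a minimal sketch of fitting a regressor on top of fixed embeddings. It is illustrative only: the 2-dimensional vectors stand in for real SPECTER2 embeddings, the targets are made up, and `fit_linear_regressor` and `predict` are hypothetical helpers. In practice one would more likely pass the real embeddings to an off-the-shelf regressor such as an SVR.

```python
from typing import List, Sequence, Tuple


def fit_linear_regressor(
    features: Sequence[Sequence[float]],
    targets: Sequence[float],
    lr: float = 0.1,
    epochs: int = 2000,
) -> Tuple[List[float], float]:
    """Fit y ~ w.x + b by batch gradient descent on mean squared error."""
    dim = len(features[0])
    n = len(features)
    weights = [0.0] * dim
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(features, targets):
            err = sum(w * xi for w, xi in zip(weights, x)) + bias - y
            for i, xi in enumerate(x):
                grad_w[i] += 2.0 * err * xi / n
            grad_b += 2.0 * err / n
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b
    return weights, bias


def predict(weights: Sequence[float], bias: float, x: Sequence[float]) -> float:
    """Apply the fitted linear model to one embedding."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


# Stand-in "embeddings" (assumption: real ones are much higher-dimensional)
# paired with made-up continuous targets generated as 2*x0 + 3*x1 + 1.
stand_in_embeddings = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
stand_in_targets = [1.0, 3.0, 4.0, 6.0]

weights, bias = fit_linear_regressor(stand_in_embeddings, stand_in_targets)
prediction = predict(weights, bias, [1.0, 1.0])
```

The encoder stays frozen in this setup: only the small regressor on top of the embeddings is trained.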
| |
|
| | |
| | It builds on the work done in [SciRepEval: A Multi-Format Benchmark for Scientific Document Representations](https://api.semanticscholar.org/CorpusID:254018137) and we evaluate the trained model on this benchmark as well. |
| |
|
| |
|
| |
|
| | - **Developed by:** Amanpreet Singh, Mike D'Arcy, Arman Cohan, Doug Downey, Sergey Feldman |
| | - **Shared by:** Allen AI |
| | - **Model type:** bert-base-uncased + adapters |
| | - **License:** Apache 2.0 |
| | - **Finetuned from model:** [allenai/scibert](https://huggingface.co/allenai/scibert_scivocab_uncased). |
| |
|
| | ## Model Sources |
| |
|
| | <!-- Provide the basic links for the model. --> |
| |
|
| | - **Repository:** [https://github.com/allenai/SPECTER2](https://github.com/allenai/SPECTER2) |
| | - **Paper:** [https://api.semanticscholar.org/CorpusID:254018137](https://api.semanticscholar.org/CorpusID:254018137) |
| | - **Demo:** [Usage](https://github.com/allenai/SPECTER2/blob/main/README.md) |
| |
|
| | # Uses |
| |
|
| | <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. --> |
| |
|
| | ## Direct Use |
| |
|
| | |Model|Name and HF link|Description| |
| | |--|--|--| |
| | |Proximity*|[allenai/specter2](https://huggingface.co/allenai/specter2)|Encode papers as queries and candidates, e.g., Link Prediction, Nearest Neighbor Search| |
| | |Adhoc Query|[allenai/specter2_adhoc_query](https://huggingface.co/allenai/specter2_adhoc_query)|Encode short raw text queries for search tasks. (Candidate papers can be encoded with the proximity adapter)| |
| | |Classification|[allenai/specter2_classification](https://huggingface.co/allenai/specter2_classification)|Encode papers to feed into linear classifiers as features| |
| | |Regression|[allenai/specter2_regression](https://huggingface.co/allenai/specter2_regression)|Encode papers to feed into linear regressors as features| |
| | |
| | *The proximity model should suffice for downstream task types not mentioned above. |
| |
|
| | ```python |
| | from transformers import AutoTokenizer |
| | from adapters import AutoAdapterModel |
| | |
| | # load model and tokenizer |
| | tokenizer = AutoTokenizer.from_pretrained('allenai/specter2_base') |
| | |
| | #load base model |
| | model = AutoAdapterModel.from_pretrained('allenai/specter2_base') |
| | |
| | #load the adapter(s) as per the required task, provide an identifier for the adapter in load_as argument and activate it |
| | model.load_adapter("allenai/specter2_regression", source="hf", load_as="regression", set_active=True) |
| | |
| | papers = [{'title': 'BERT', 'abstract': 'We introduce a new language representation model called BERT'}, |
| | {'title': 'Attention is all you need', 'abstract': ' The dominant sequence transduction models are based on complex recurrent or convolutional neural networks'}] |
| | |
| | # concatenate title and abstract |
| | text_batch = [d['title'] + tokenizer.sep_token + (d.get('abstract') or '') for d in papers] |
| | # preprocess the input |
| | inputs = tokenizer(text_batch, padding=True, truncation=True, |
| | return_tensors="pt", return_token_type_ids=False, max_length=512) |
| | output = model(**inputs) |
| | # take the first token ([CLS]) of each sequence as the embedding |
| | embeddings = output.last_hidden_state[:, 0, :] |
| | ``` |
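The input formatting step in the example above can be isolated into a small helper. This is a minimal sketch: `format_paper` is a hypothetical name, and the default `"[SEP]"` separator is an assumption about the tokenizer (pass `tokenizer.sep_token` to use the real value).

```python
from typing import List, Mapping, Optional


def format_paper(paper: Mapping[str, Optional[str]], sep_token: str = "[SEP]") -> str:
    """Join title and abstract with the separator token.

    A missing or None abstract is treated as an empty string, mirroring the
    `d.get('abstract') or ''` idiom used in the example above.
    """
    title = paper.get("title") or ""
    abstract = paper.get("abstract") or ""
    return title + sep_token + abstract


papers = [
    {"title": "BERT", "abstract": "We introduce a new language representation model called BERT"},
    {"title": "Attention is all you need", "abstract": None},
]

# One string per paper, ready to be passed to the tokenizer as a batch.
text_batch: List[str] = [format_paper(p) for p in papers]
```

Handling the missing-abstract case in one place keeps the batch construction from failing on papers that only have a title.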
| |
|
| | ## Downstream Use |
| |
|
| | <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app --> |
| |
|
| | For evaluation and downstream usage, please refer to [https://github.com/allenai/scirepeval/blob/main/evaluation/INFERENCE.md](https://github.com/allenai/scirepeval/blob/main/evaluation/INFERENCE.md). |
| |
|
| | # Training Details |
| |
|
| | ## Training Data |
| |
|
| | <!-- This should link to a Data Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. --> |
| |
|
| | The base model is trained on citation links between papers, and the adapters are trained on 8 large-scale tasks across the four formats. |
| | All the data is part of the SciRepEval benchmark and is available [here](https://huggingface.co/datasets/allenai/scirepeval). |
| |
|
| | The citation links are triplets of the form: |
| |
|
| | ```json |
| | {"query": {"title": ..., "abstract": ...}, "pos": {"title": ..., "abstract": ...}, "neg": {"title": ..., "abstract": ...}} |
| | ``` |
| |
|
| | Each triplet consists of a query paper, a positive (cited) paper, and a negative paper. The negative can come from the same or a different field of study as the query, or be a citation of a citation. |
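Triplets of this shape are typically used with a margin-based objective that pulls the positive closer to the query than the negative. The sketch below is illustrative only: this card defers to the SPECTER paper for the actual procedure, so the L2 distance, the margin of 1.0, the toy 2-dimensional vectors, and the function names are all assumptions.

```python
import math
from typing import Sequence


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two embedding vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_margin_loss(
    query: Sequence[float],
    pos: Sequence[float],
    neg: Sequence[float],
    margin: float = 1.0,  # assumed value, for illustration
) -> float:
    """max(d(query, pos) - d(query, neg) + margin, 0)."""
    return max(l2_distance(query, pos) - l2_distance(query, neg) + margin, 0.0)


# Toy embeddings: the positive is close to the query, the negative is far.
query_emb = [0.0, 0.0]
pos_emb = [0.0, 1.0]   # distance 1 from the query
neg_emb = [3.0, 4.0]   # distance 5 from the query

satisfied_loss = triplet_margin_loss(query_emb, pos_emb, neg_emb)  # 1 - 5 + 1 < 0, clipped to 0
violated_loss = triplet_margin_loss(query_emb, neg_emb, pos_emb)   # 5 - 1 + 1 = 5
```

The loss is zero once the negative is farther from the query than the positive by at least the margin, so only triplets that violate this ordering contribute gradient.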
| |
|
| | ## Training Procedure |
| |
|
| | Please refer to the [SPECTER paper](https://api.semanticscholar.org/CorpusID:215768677). |
| |
|
| |
|
| | ### Training Hyperparameters |
| |
|
| |
|
| | The model is trained in two stages using [SciRepEval](https://github.com/allenai/scirepeval/blob/main/training/TRAINING.md): |
| | - Base Model: First, a base model is trained on the above citation triplets. |
| | ``` batch size = 1024, max input length = 512, learning rate = 2e-5, epochs = 2, warmup steps = 10%, fp16``` |
| | - Adapters: Thereafter, task-format-specific adapters are trained on the SciRepEval training tasks, where 600K triplets are sampled from above and added to the training data as well. |
| | ``` batch size = 256, max input length = 512, learning rate = 1e-4, epochs = 6, warmup = 1000 steps, fp16``` |
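As a rough back-of-the-envelope check on the base-model settings above, the sketch below converts them into optimizer step counts. It assumes exactly 6,000,000 triplets (the card says "over 6M"), no gradient accumulation, and a linear warmup shape. None of these are confirmed by the card, and the helper names are hypothetical.

```python
import math
from typing import Tuple


def schedule_steps(
    num_examples: int, batch_size: int, epochs: int, warmup_fraction: float
) -> Tuple[int, int]:
    """Return (total optimizer steps, warmup steps) for a simple schedule."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    total_steps = steps_per_epoch * epochs
    warmup_steps = round(total_steps * warmup_fraction)
    return total_steps, warmup_steps


def warmup_lr(step: int, base_lr: float, warmup_steps: int) -> float:
    """Learning rate under linear warmup (assumed shape), constant afterwards."""
    return base_lr * min(1.0, step / warmup_steps)


# Base model stage: batch size 1024, 2 epochs, 10% warmup, learning rate 2e-5.
total_steps, warmup_steps = schedule_steps(
    num_examples=6_000_000,  # assumption: the card only says "over 6M"
    batch_size=1024,
    epochs=2,
    warmup_fraction=0.10,
)
halfway_lr = warmup_lr(warmup_steps // 2, base_lr=2e-5, warmup_steps=warmup_steps)
```

Under these assumptions the base stage runs for roughly 11.7K steps, with the first 1.2K or so spent warming up.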
| |
|
| |
|
| | # Evaluation |
| |
|
| | We evaluate the model on [SciRepEval](https://github.com/allenai/scirepeval), a large-scale evaluation benchmark for scientific embedding tasks, which has SciDocs as a subset. |
| | We also evaluate and establish a new SoTA on [MDCR](https://github.com/zoranmedic/mdcr), a large scale citation recommendation benchmark. |
| |
|
| | |Model|SciRepEval In-Train|SciRepEval Out-of-Train|SciRepEval Avg|MDCR(MAP, Recall@5)| |
| | |--|--|--|--|--| |
| | |[BM-25](https://api.semanticscholar.org/CorpusID:252199740)|n/a|n/a|n/a|(33.7, 28.5)| |
| | |[SPECTER](https://huggingface.co/allenai/specter)|54.7|72.0|67.5|(30.6, 25.5)| |
| | |[SciNCL](https://huggingface.co/malteos/scincl)|55.6|73.4|68.8|(32.6, 27.3)| |
| | |[SciRepEval-Adapters](https://huggingface.co/models?search=scirepeval)|61.9|73.8|70.7|(35.3, 29.6)| |
| | |[SPECTER2 Base](https://huggingface.co/allenai/specter2_base)|56.3|73.6|69.1|(38.0, 32.4)| |
| | |[SPECTER2-Adapters](https://huggingface.co/models?search=allenai/specter-2)|**62.3**|**74.1**|**71.1**|**(38.4, 33.0)**| |
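For readers unfamiliar with the MDCR column, the sketch below shows the textbook definitions of average precision (MAP is its mean over queries) and Recall@5 for a single ranked list. The MDCR benchmark's own evaluation scripts may differ in details such as candidate pools or tie handling. The paper IDs and function names here are made up for illustration.

```python
from typing import Iterable, Sequence


def average_precision(ranked_ids: Sequence[str], relevant_ids: Iterable[str]) -> float:
    """Mean of precision@rank taken at each rank where a relevant item appears."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Iterable[str], k: int = 5) -> float:
    """Fraction of relevant items that appear in the top-k of the ranking."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(relevant.intersection(ranked_ids[:k])) / len(relevant)


# Hypothetical ranking of candidate papers for one query paper.
ranking = ["p3", "p1", "p7", "p2", "p9", "p4"]
cited = {"p1", "p2", "p4"}

ap = average_precision(ranking, cited)   # (1/2 + 2/4 + 3/6) / 3
r_at_5 = recall_at_k(ranking, cited, 5)  # p1 and p2 are in the top 5
```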
| |
|
| | Please cite the following works if you end up using SPECTER2: |
| |
|
| | [SciRepEval paper](https://api.semanticscholar.org/CorpusID:254018137) |
| | ```bibtex |
| | @inproceedings{Singh2022SciRepEvalAM, |
| | title={SciRepEval: A Multi-Format Benchmark for Scientific Document Representations}, |
| | author={Amanpreet Singh and Mike D'Arcy and Arman Cohan and Doug Downey and Sergey Feldman}, |
| | booktitle={Conference on Empirical Methods in Natural Language Processing}, |
| | year={2022}, |
| | url={https://api.semanticscholar.org/CorpusID:254018137} |
| | } |
| | ``` |
| | |
| | |
| | |