---
license: mit
tags:
- feature-extraction
language: en
---

# PubMedNCL

A pretrained language model for document representations of biomedical papers.
PubMedNCL is based on [PubMedBERT](https://huggingface.co/microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext), a BERT model pretrained on abstracts and full texts from PubMed Central, and is fine-tuned via citation-neighborhood contrastive learning, as introduced by [SciNCL](https://huggingface.co/malteos/scincl).
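For intuition, the citation-neighborhood contrastive objective can be sketched as a triplet margin loss over paper embeddings, where positives are sampled from a paper's citation neighborhood and negatives from outside it. The snippet below is a minimal illustration, not the original training code; the function name and margin value are assumptions.

```python
import torch
import torch.nn.functional as F

def citation_triplet_loss(query_emb, pos_emb, neg_emb, margin=1.0):
    """Illustrative triplet margin loss over paper embeddings.

    query_emb: embeddings of anchor papers
    pos_emb:   embeddings of papers from the anchors' citation neighborhoods
    neg_emb:   embeddings of papers outside those neighborhoods
    """
    pos_dist = F.pairwise_distance(query_emb, pos_emb)  # pull neighbors together
    neg_dist = F.pairwise_distance(query_emb, neg_emb)  # push non-neighbors apart
    return torch.clamp(pos_dist - neg_dist + margin, min=0).mean()
```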
## How to use the pretrained model

```python
from transformers import AutoTokenizer, AutoModel

# load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('malteos/PubMedNCL')
model = AutoModel.from_pretrained('malteos/PubMedNCL')

papers = [{'title': 'BERT', 'abstract': 'We introduce a new language representation model called BERT'},
          {'title': 'Attention is all you need', 'abstract': 'The dominant sequence transduction models are based on complex recurrent or convolutional neural networks'}]

# concatenate title and abstract with the [SEP] token
title_abs = [d['title'] + tokenizer.sep_token + (d.get('abstract') or '') for d in papers]

# preprocess the input
inputs = tokenizer(title_abs, padding=True, truncation=True, return_tensors="pt", max_length=512)

# inference
result = model(**inputs)

# take the first token ([CLS] token) of each sequence as the document embedding
embeddings = result.last_hidden_state[:, 0, :]
```
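As a quick follow-up (not part of the original card), the resulting embeddings can be compared with cosine similarity to gauge how related two papers are:

```python
import torch.nn.functional as F

# cosine similarity between the two example papers' [CLS] embeddings
similarity = F.cosine_similarity(embeddings[0], embeddings[1], dim=0)
print(f"similarity: {similarity.item():.3f}")
```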
## Citation

- [Neighborhood Contrastive Learning for Scientific Document Representations with Citation Embeddings (EMNLP 2022)](https://arxiv.org/abs/2202.06671).
- [Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing](https://arxiv.org/abs/2007.15779).
## License

MIT