init

Files changed (7)

README.md +44 -0
config.json +24 -0
pytorch_model.bin +3 -0
special_tokens_map.json +1 -0
tokenizer_config.json +1 -0
trainer_state.json +0 -0
vocab.txt +0 -0

README.md CHANGED

@@ -1,3 +1,47 @@
 ---
 license: mit
+tags:
+- feature-extraction
+language: en
 ---
+# PubMedNCL
+A pretrained language model for document representations of biomedical papers.
+PubMedNCL is based on [PubMedBERT](https://huggingface.co/microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext), a BERT model pretrained on abstracts and full-text articles from PubMed Central. PubMedNCL fine-tunes PubMedBERT via citation neighborhood contrastive learning, as introduced by [SciNCL](https://huggingface.co/malteos/scincl).
+## How to use the pretrained model
+```python
+from transformers import AutoTokenizer, AutoModel
+# load model and tokenizer
+tokenizer = AutoTokenizer.from_pretrained('malteos/PubMedNCL')
+model = AutoModel.from_pretrained('malteos/PubMedNCL')
+papers = [{'title': 'BERT', 'abstract': 'We introduce a new language representation model called BERT'},
+          {'title': 'Attention is all you need', 'abstract': 'The dominant sequence transduction models are based on complex recurrent or convolutional neural networks'}]
+# concatenate title and abstract with [SEP] token
+title_abs = [d['title'] + tokenizer.sep_token + (d.get('abstract') or '') for d in papers]
+# preprocess the input
+inputs = tokenizer(title_abs, padding=True, truncation=True, return_tensors="pt", max_length=512)
+# inference
+result = model(**inputs)
+# take the first token ([CLS] token) of each sequence in the batch as the document embedding
+embeddings = result.last_hidden_state[:, 0, :]
+```
+## Citation
+- [Neighborhood Contrastive Learning for Scientific Document Representations with Citation Embeddings (EMNLP 2022 paper)](https://arxiv.org/abs/2202.06671).
+- [Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing](https://arxiv.org/abs/2007.15779).
+## License
+MIT
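
Reviewer note: the README stops once `embeddings` is computed. A common next step is to compare documents by the similarity of their embeddings. The sketch below assumes cosine similarity as the metric, which the model card does not specify, and uses toy 3-dimensional vectors in place of the 768-dimensional `[CLS]` embeddings. The helper names `cosine_similarity` and `rank_by_similarity` are hypothetical and not part of this repository.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def rank_by_similarity(query, candidates):
    """Return candidate indices ordered from most to least similar to the query."""
    scores = [cosine_similarity(query, c) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)


# Toy vectors standing in for real embeddings (assumption for illustration).
# With the README snippet, `embeddings.detach().tolist()` would yield real rows.
toy = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
order = rank_by_similarity(toy[0], toy[1:])
print(order)
```

For larger collections, a vectorized or approximate nearest-neighbor search would likely be a better fit than this pure-Python loop.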

config.json ADDED

	@@ -0,0 +1,24 @@

+{
+  "_name_or_path": "data/s2orc_with_specter_without_scidocs/specter/corpus_seed_0/seed_0_ep5knn20-25_en3random_without_knn_hn2knn3998-4000/model_BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext",
+  "architectures": [
+    "BertModel"
+  ],
+  "attention_probs_dropout_prob": 0.1,
+  "gradient_checkpointing": false,
+  "hidden_act": "gelu",
+  "hidden_dropout_prob": 0.1,
+  "hidden_size": 768,
+  "initializer_range": 0.02,
+  "intermediate_size": 3072,
+  "layer_norm_eps": 1e-12,
+  "max_position_embeddings": 512,
+  "model_type": "bert",
+  "num_attention_heads": 12,
+  "num_hidden_layers": 12,
+  "pad_token_id": 0,
+  "position_embedding_type": "absolute",
+  "transformers_version": "4.5.1",
+  "type_vocab_size": 2,
+  "use_cache": true,
+  "vocab_size": 30522
+}
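
Reviewer note: the values in this config can be sanity-checked against the checkpoint size recorded for `pytorch_model.bin`. The sketch below counts the parameters of a standard BERT encoder from the config fields. It assumes the usual BERT layout (embeddings with LayerNorm, per-layer attention and feed-forward blocks with biases, and a pooler) and float32 storage, neither of which the diff states explicitly. The function name `bert_parameter_count` is hypothetical.

```python
def bert_parameter_count(cfg, with_pooler=True):
    """Estimate the parameter count of a standard BERT encoder from its config."""
    h = cfg["hidden_size"]
    i = cfg["intermediate_size"]
    layer_norm = 2 * h  # weight and bias

    # Word, position, and token-type embeddings, followed by one LayerNorm.
    embeddings = (
        cfg["vocab_size"] + cfg["max_position_embeddings"] + cfg["type_vocab_size"]
    ) * h + layer_norm

    # Query, key, value, and output projections (with biases), plus LayerNorm.
    attention = 4 * (h * h + h) + layer_norm
    # Up-projection, down-projection (with biases), plus LayerNorm.
    feed_forward = (h * i + i) + (i * h + h) + layer_norm

    encoder = cfg["num_hidden_layers"] * (attention + feed_forward)
    pooler = (h * h + h) if with_pooler else 0
    return embeddings + encoder + pooler


# Only the fields needed for the estimate, copied from config.json above.
config = {
    "hidden_size": 768,
    "intermediate_size": 3072,
    "max_position_embeddings": 512,
    "num_attention_heads": 12,
    "num_hidden_layers": 12,
    "type_vocab_size": 2,
    "vocab_size": 30522,
}

params = bert_parameter_count(config)
head_dim = config["hidden_size"] // config["num_attention_heads"]
float32_bytes = params * 4
print(params, head_dim, float32_bytes)
```

Under these assumptions the estimate is about 109.5M parameters, or roughly 437.9 MB in float32, which is in line with the 438012727-byte checkpoint; the small remainder would be serialization overhead and non-parameter buffers.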

pytorch_model.bin ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:ee39908b91b5dbf93aa8859ca9e140f7b087f3c09ae05250b45628301dec191b
+size 438012727
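
Reviewer note: the three lines above are a Git LFS pointer, not the weights themselves. The pointer records the SHA-256 digest and byte size of the real file, so a downloaded checkpoint can be checked against it. The sketch below uses only the Python standard library; the function names and the local path passed to `matches_pointer` are hypothetical.

```python
import hashlib
import os


def parse_lfs_pointer(text):
    """Parse the 'key value' lines of a Git LFS pointer into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if line:
            key, _, value = line.partition(" ")
            fields[key] = value
    return fields


def matches_pointer(path, pointer_text, chunk_size=1 << 20):
    """Return True if the file at `path` has the size and SHA-256 named by the pointer."""
    fields = parse_lfs_pointer(pointer_text)
    algorithm, _, expected_digest = fields["oid"].partition(":")
    if algorithm != "sha256":
        raise ValueError(f"unsupported oid algorithm: {algorithm}")
    # Cheap size check first, so a mismatch avoids hashing a large file.
    if os.path.getsize(path) != int(fields["size"]):
        return False
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_digest


# Pointer contents copied from the diff above.
POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:ee39908b91b5dbf93aa8859ca9e140f7b087f3c09ae05250b45628301dec191b
size 438012727
"""
```

Usage would look like `matches_pointer("pytorch_model.bin", POINTER)`, where the path points at a locally downloaded copy of the checkpoint.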

special_tokens_map.json ADDED

	@@ -0,0 +1 @@


+ {"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"}

tokenizer_config.json ADDED

	@@ -0,0 +1 @@

+ {"do_lower_case": true, "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]", "tokenize_chinese_chars": true, "strip_accents": null, "special_tokens_map_file": null, "name_or_path": "data/s2orc_with_specter_without_scidocs/specter/corpus_seed_0/seed_0_ep5knn20-25_en3random_without_knn_hn2knn3998-4000/model_BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext", "do_basic_tokenize": true, "never_split": null}
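
Reviewer note: `"do_lower_case": true` together with `"strip_accents": null` means the tokenizer is uncased. As far as I understand the BERT tokenizer in `transformers`, a null `strip_accents` follows `do_lower_case`, so accents are removed as well. The sketch below approximates only that text normalization step with the standard library; it is not the tokenizer's actual implementation, it skips punctuation splitting and WordPiece segmentation, and the function name is hypothetical.

```python
import unicodedata


def approximate_uncased_normalize(text):
    """Lowercase the text and drop combining accent marks (rough approximation)."""
    lowered = text.lower()
    # NFD splits accented characters into a base character plus combining marks.
    decomposed = unicodedata.normalize("NFD", lowered)
    # Category "Mn" (nonspacing mark) covers the combining accents.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


print(approximate_uncased_normalize("Épidémiologie of BERT"))
```

The practical consequence is that inputs differing only in case or accents should map to the same token IDs, so there is no need to lowercase titles or abstracts before calling the tokenizer.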

trainer_state.json ADDED

The diff for this file is too large to render. See raw diff

vocab.txt ADDED

The diff for this file is too large to render. See raw diff