Upload folder using huggingface_hub
- .DS_Store +0 -0
- README.md +82 -0
- config.json +20 -0
- flax_model.msgpack +3 -0
- gitattributes +10 -0
- model.safetensors +3 -0
- pytorch_model.bin +3 -0
- special_tokens_map.json +1 -0
- tf_model.h5 +3 -0
- tokenizer_config.json +1 -0
- vocab.txt +0 -0
.DS_Store
ADDED
Binary file (6.15 kB).
README.md
ADDED
@@ -0,0 +1,82 @@
---
license: apache-2.0
language:
- en
tags:
- biomedical
- lexical semantics
- bionlp
- biology
- science
- embedding
- entity linking
datasets:
- UMLS
---

**[news]** A cross-lingual extension of SapBERT will appear in the main conference of **ACL 2021**! <br>
**[news]** SapBERT will appear in the conference proceedings of **NAACL 2021**!

### SapBERT-PubMedBERT
SapBERT by [Liu et al. (2020)](https://arxiv.org/pdf/2010.11784.pdf), trained with [UMLS](https://www.nlm.nih.gov/research/umls/licensedcontent/umlsknowledgesources.html) 2020AA (English only), using [microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext](https://huggingface.co/microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext) as the base model.

### Expected input and output
The input should be a string of biomedical entity names, e.g., "covid infection" or "Hydroxychloroquine". The [CLS] embedding of the last layer is regarded as the output.

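As a minimal sketch of this input/output contract (not part of the original card; it runs on CPU, so no `.cuda()` call is needed):

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("cambridgeltl/SapBERT-from-PubMedBERT-fulltext")
model = AutoModel.from_pretrained("cambridgeltl/SapBERT-from-PubMedBERT-fulltext")

# one entity name in, one last-layer [CLS] vector out
inputs = tokenizer("covid infection", return_tensors="pt")
with torch.no_grad():
    cls_emb = model(**inputs).last_hidden_state[:, 0, :]
print(cls_emb.shape)  # torch.Size([1, 768])
```
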
#### Extracting embeddings from SapBERT

The following script converts a list of strings (entity names) into embeddings.
```python
import numpy as np
import torch
from tqdm.auto import tqdm
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("cambridgeltl/SapBERT-from-PubMedBERT-fulltext")
model = AutoModel.from_pretrained("cambridgeltl/SapBERT-from-PubMedBERT-fulltext").cuda()

# replace with your own list of entity names
all_names = ["covid-19", "Coronavirus infection", "high fever", "Tumor of posterior wall of oropharynx"]

bs = 128  # batch size during inference
all_embs = []
for i in tqdm(np.arange(0, len(all_names), bs)):
    toks = tokenizer.batch_encode_plus(all_names[i:i+bs],
                                       padding="max_length",
                                       max_length=25,
                                       truncation=True,
                                       return_tensors="pt")
    toks_cuda = {}
    for k, v in toks.items():
        toks_cuda[k] = v.cuda()
    cls_rep = model(**toks_cuda)[0][:, 0, :]  # use CLS representation as the embedding
    all_embs.append(cls_rep.cpu().detach().numpy())

all_embs = np.concatenate(all_embs, axis=0)
```
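
Since SapBERT is aimed at medical entity linking, the embeddings are typically compared by nearest-neighbour search. The ranking below is an illustrative sketch (the query/dictionary split is hypothetical, not part of the original script); it reuses `all_names` and `all_embs` from the script above:

```python
# treat the first name as a query mention and the rest as a dictionary
query_emb = all_embs[0]
dict_names = all_names[1:]
dict_embs = all_embs[1:]

# cosine similarity = dot product of L2-normalised vectors
query_emb = query_emb / np.linalg.norm(query_emb)
dict_embs = dict_embs / np.linalg.norm(dict_embs, axis=1, keepdims=True)
scores = dict_embs @ query_emb

# print dictionary names from most to least similar to the query
for rank in np.argsort(-scores):
    print(f"{dict_names[rank]}\t{scores[rank]:.4f}")
```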

For more details about training and eval, see the SapBERT [GitHub repo](https://github.com/cambridgeltl/sapbert).

### Citation
```bibtex
@inproceedings{liu-etal-2021-self,
    title = "Self-Alignment Pretraining for Biomedical Entity Representations",
    author = "Liu, Fangyu  and
      Shareghi, Ehsan  and
      Meng, Zaiqiao  and
      Basaldella, Marco  and
      Collier, Nigel",
    booktitle = "Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
    month = jun,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2021.naacl-main.334",
    pages = "4228--4238",
    abstract = "Despite the widespread success of self-supervised learning via masked language models (MLM), accurately capturing fine-grained semantic relationships in the biomedical domain remains a challenge. This is of paramount importance for entity-level tasks such as entity linking where the ability to model entity relations (especially synonymy) is pivotal. To address this challenge, we propose SapBERT, a pretraining scheme that self-aligns the representation space of biomedical entities. We design a scalable metric learning framework that can leverage UMLS, a massive collection of biomedical ontologies with 4M+ concepts. In contrast with previous pipeline-based hybrid systems, SapBERT offers an elegant one-model-for-all solution to the problem of medical entity linking (MEL), achieving a new state-of-the-art (SOTA) on six MEL benchmarking datasets. In the scientific domain, we achieve SOTA even without task-specific supervision. With substantial improvement over various domain-specific pretrained MLMs such as BioBERT, SciBERT and PubMedBERT, our pretraining scheme proves to be both effective and robust.",
}
```
config.json
ADDED
@@ -0,0 +1,20 @@
{
  "architectures": [
    "BertModel"
  ],
  "attention_probs_dropout_prob": 0.1,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "type_vocab_size": 2,
  "vocab_size": 30522
}
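
This is a standard BERT-base configuration (12 layers, 12 attention heads, hidden size 768, 30,522-token vocabulary). As a quick sanity check (a sketch, not part of the upload), the config can be loaded and inspected with transformers:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("cambridgeltl/SapBERT-from-PubMedBERT-fulltext")
print(config.model_type)               # bert
print(config.hidden_size)              # 768: dimensionality of the [CLS] embedding
print(config.max_position_embeddings)  # 512: maximum input length in tokens
```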
flax_model.msgpack
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:c5b447ad22a7d0819184ca80f3bc49fad6299f0bcd2c17d2f3a9754ad182ab2d
size 437936109
gitattributes
ADDED
@@ -0,0 +1,10 @@
*.bin.* filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tar.gz filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
model.safetensors filter=lfs diff=lfs merge=lfs -text
model.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:a4696930afef9aab296196d3d2142216c44cba24f21b4f285ceca7af21025614
size 437955508
pytorch_model.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:07f7672c7ac852d8efff83e4a7a63985bf50c03d5b57f6a7909c11fe66532137
size 438012727
special_tokens_map.json
ADDED
@@ -0,0 +1 @@
{"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"}
tf_model.h5
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:18d43f8c7805227e0fdde72cb32acbd90a24b9cb44d4908f2a141aff09c551c3
size 438190872
tokenizer_config.json
ADDED
@@ -0,0 +1 @@
{"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]", "special_tokens_map_file": null, "full_tokenizer_file": null, "tokenizer_file": null}
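
`special_tokens_map.json` and `tokenizer_config.json` together define the WordPiece tokenizer's special tokens. As a sketch (not part of the upload), one can verify they are picked up when loading the tokenizer:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("cambridgeltl/SapBERT-from-PubMedBERT-fulltext")
print(tokenizer.cls_token, tokenizer.sep_token, tokenizer.pad_token)  # [CLS] [SEP] [PAD]
print(tokenizer.vocab_size)  # 30522, matching vocab_size in config.json
```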
vocab.txt
ADDED
The diff for this file is too large to render.