param2004 committed on
Commit 0bc10d1 · verified · 1 Parent(s): 016f89f

Upload folder using huggingface_hub

.DS_Store ADDED
Binary file (6.15 kB).
README.md ADDED
@@ -0,0 +1,82 @@
+ ---
+ license: apache-2.0
+ language:
+ - en
+ tags:
+ - biomedical
+ - lexical semantics
+ - bionlp
+ - biology
+ - science
+ - embedding
+ - entity linking
+ datasets:
+ - UMLS
+ ---
+
+ **[news]** A cross-lingual extension of SapBERT will appear in the main conference of **ACL 2021**! <br>
+ **[news]** SapBERT will appear in the conference proceedings of **NAACL 2021**!
+
+ ### SapBERT-PubMedBERT
+ SapBERT by [Liu et al. (2020)](https://arxiv.org/pdf/2010.11784.pdf), trained with [UMLS](https://www.nlm.nih.gov/research/umls/licensedcontent/umlsknowledgesources.html) 2020AA (English only), using [microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext](https://huggingface.co/microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext) as the base model.
+
+ ### Expected input and output
+ The input should be a string of biomedical entity names, e.g., "covid infection" or "Hydroxychloroquine". The [CLS] embedding of the last layer is regarded as the output.
+
+ #### Extracting embeddings from SapBERT
+
+ The following script converts a list of strings (entity names) into embeddings.
+ ```python
+ import numpy as np
+ import torch
+ from tqdm.auto import tqdm
+ from transformers import AutoTokenizer, AutoModel
+
+ tokenizer = AutoTokenizer.from_pretrained("cambridgeltl/SapBERT-from-PubMedBERT-fulltext")
+ model = AutoModel.from_pretrained("cambridgeltl/SapBERT-from-PubMedBERT-fulltext").cuda()
+
+ # replace with your own list of entity names
+ all_names = ["covid-19", "Coronavirus infection", "high fever", "Tumor of posterior wall of oropharynx"]
+
+ bs = 128  # batch size during inference
+ all_embs = []
+ for i in tqdm(np.arange(0, len(all_names), bs)):
+     toks = tokenizer.batch_encode_plus(all_names[i:i+bs],
+                                        padding="max_length",
+                                        max_length=25,
+                                        truncation=True,
+                                        return_tensors="pt")
+     toks_cuda = {k: v.cuda() for k, v in toks.items()}
+     cls_rep = model(**toks_cuda)[0][:, 0, :]  # use CLS representation as the embedding
+     all_embs.append(cls_rep.cpu().detach().numpy())
+
+ all_embs = np.concatenate(all_embs, axis=0)
+ ```
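Once embeddings are extracted, entity linking reduces to nearest-neighbour search over a dictionary of pre-embedded concept names. A minimal cosine-similarity sketch, using synthetic random vectors in place of real SapBERT outputs (the `link_entities` helper is illustrative, not part of this repo):

```python
import numpy as np

def link_entities(query_embs, dict_embs, dict_names):
    # L2-normalise so the dot product equals cosine similarity
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    d = dict_embs / np.linalg.norm(dict_embs, axis=1, keepdims=True)
    sims = q @ d.T                 # (num_queries, num_dict_entries)
    best = sims.argmax(axis=1)     # closest dictionary entry per query
    return [dict_names[i] for i in best]

# synthetic stand-ins for SapBERT [CLS] embeddings
rng = np.random.default_rng(0)
dict_names = ["covid-19", "fever", "oropharyngeal tumour"]
dict_embs = rng.normal(size=(3, 768))
query_embs = dict_embs[[1]] + 0.01 * rng.normal(size=(1, 768))  # a noisy "fever"
print(link_entities(query_embs, dict_embs, dict_names))  # ['fever']
```

For a real dictionary of millions of UMLS names, the brute-force matrix product would be replaced by an approximate nearest-neighbour index.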
+
+ For more details about training and eval, see the SapBERT [github repo](https://github.com/cambridgeltl/sapbert).
+
+ ### Citation
+ ```bibtex
+ @inproceedings{liu-etal-2021-self,
+     title = "Self-Alignment Pretraining for Biomedical Entity Representations",
+     author = "Liu, Fangyu and
+       Shareghi, Ehsan and
+       Meng, Zaiqiao and
+       Basaldella, Marco and
+       Collier, Nigel",
+     booktitle = "Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
+     month = jun,
+     year = "2021",
+     address = "Online",
+     publisher = "Association for Computational Linguistics",
+     url = "https://www.aclweb.org/anthology/2021.naacl-main.334",
+     pages = "4228--4238",
+     abstract = "Despite the widespread success of self-supervised learning via masked language models (MLM), accurately capturing fine-grained semantic relationships in the biomedical domain remains a challenge. This is of paramount importance for entity-level tasks such as entity linking where the ability to model entity relations (especially synonymy) is pivotal. To address this challenge, we propose SapBERT, a pretraining scheme that self-aligns the representation space of biomedical entities. We design a scalable metric learning framework that can leverage UMLS, a massive collection of biomedical ontologies with 4M+ concepts. In contrast with previous pipeline-based hybrid systems, SapBERT offers an elegant one-model-for-all solution to the problem of medical entity linking (MEL), achieving a new state-of-the-art (SOTA) on six MEL benchmarking datasets. In the scientific domain, we achieve SOTA even without task-specific supervision. With substantial improvement over various domain-specific pretrained MLMs such as BioBERT, SciBERT and PubMedBERT, our pretraining scheme proves to be both effective and robust.",
+ }
+ ```
config.json ADDED
@@ -0,0 +1,20 @@
+ {
+ "architectures": [
+ "BertModel"
+ ],
+ "attention_probs_dropout_prob": 0.1,
+ "gradient_checkpointing": false,
+ "hidden_act": "gelu",
+ "hidden_dropout_prob": 0.1,
+ "hidden_size": 768,
+ "initializer_range": 0.02,
+ "intermediate_size": 3072,
+ "layer_norm_eps": 1e-12,
+ "max_position_embeddings": 512,
+ "model_type": "bert",
+ "num_attention_heads": 12,
+ "num_hidden_layers": 12,
+ "pad_token_id": 0,
+ "type_vocab_size": 2,
+ "vocab_size": 30522
+ }
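The config describes a standard BERT-base encoder: 12 layers, 12 attention heads, hidden size 768, and the 30,522-token WordPiece vocabulary. A quick sanity check of that geometry, parsing a copy of the relevant fields from the JSON above:

```python
import json

# subset of the config.json fields shown above
config_text = """
{
  "architectures": ["BertModel"],
  "hidden_size": 768,
  "intermediate_size": 3072,
  "max_position_embeddings": 512,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "vocab_size": 30522
}
"""
cfg = json.loads(config_text)

# per-head dimension: hidden_size split evenly across attention heads
head_dim = cfg["hidden_size"] // cfg["num_attention_heads"]
print(head_dim)  # 64, the usual BERT-base head width

# the feed-forward expansion is the conventional 4x hidden size
assert cfg["intermediate_size"] == 4 * cfg["hidden_size"]
```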
flax_model.msgpack ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c5b447ad22a7d0819184ca80f3bc49fad6299f0bcd2c17d2f3a9754ad182ab2d
+ size 437936109
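The weight files in this commit are tracked with Git LFS, so what the repository stores is a small pointer in the three-line key-value format shown (version, oid, size) rather than the weights themselves. A hypothetical helper for reading such a pointer:

```python
# the flax_model.msgpack pointer from above
POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:c5b447ad22a7d0819184ca80f3bc49fad6299f0bcd2c17d2f3a9754ad182ab2d
size 437936109"""

def parse_lfs_pointer(text):
    # each line is "key value"; the oid carries the hash algorithm as a prefix
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {"version": fields["version"], "algo": algo,
            "digest": digest, "size": int(fields["size"])}

info = parse_lfs_pointer(POINTER)
print(info["algo"], info["size"])  # sha256 437936109
```

The ~438 MB size is consistent across the msgpack, safetensors, bin, and h5 files below: the same ~110M fp32 parameters in four serialization formats.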
gitattributes ADDED
@@ -0,0 +1,10 @@
+ *.bin.* filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tar.gz filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ model.safetensors filter=lfs diff=lfs merge=lfs -text
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a4696930afef9aab296196d3d2142216c44cba24f21b4f285ceca7af21025614
+ size 437955508
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:07f7672c7ac852d8efff83e4a7a63985bf50c03d5b57f6a7909c11fe66532137
+ size 438012727
special_tokens_map.json ADDED
@@ -0,0 +1 @@
+ {"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"}
tf_model.h5 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:18d43f8c7805227e0fdde72cb32acbd90a24b9cb44d4908f2a141aff09c551c3
+ size 438190872
tokenizer_config.json ADDED
@@ -0,0 +1 @@
+ {"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]", "special_tokens_map_file": null, "full_tokenizer_file": null, "tokenizer_file": null}
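tokenizer_config.json repeats the five special tokens declared in special_tokens_map.json and must agree with it. A small consistency check over copies of the two JSON payloads shown above:

```python
import json

# special_tokens_map.json content
special_tokens_map = json.loads(
    '{"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", '
    '"cls_token": "[CLS]", "mask_token": "[MASK]"}'
)

# tokenizer_config.json content
tokenizer_config = json.loads(
    '{"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", '
    '"cls_token": "[CLS]", "mask_token": "[MASK]", '
    '"special_tokens_map_file": null, "full_tokenizer_file": null, '
    '"tokenizer_file": null}'
)

# every token declared in the map must match the tokenizer config
shared = {k: tokenizer_config[k] for k in special_tokens_map}
print(shared == special_tokens_map)  # True
```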
vocab.txt ADDED
The diff for this file is too large to render.