---
id: sap_umls_MedRoBERTa.nl
name: sap_umls_MedRoBERTa.nl
description: >-
  MedRoBERTa.nl continued pre-training on hard medical term pairs from the UMLS
  ontology, using the multi-similarity loss function
license: gpl-3.0
language: nl
tags:
- embedding
- bionlp
- biology
- science
- entity linking
- lexical semantic
- biomedical
pipeline_tag: feature-extraction
base_model:
- CLTL/MedRoBERTa.nl
---

# Model Card for sap_umls_MedRoBERTa.nl

The model was trained on medical entity triplets (anchor, term, synonym) mined from the Dutch UMLS, using the multi-similarity loss.

### Training specifics

```yaml
epochs : 2
batch_size : 64
learning_rate : 5e-6
weight_decay : 1e-4
max_length : 30
loss : ms_loss
pairwise : true
type_of_triplets : all
agg_mode : CLS
```
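
Here `ms_loss` refers to the multi-similarity loss; the SapBERT codebase implements it with [pytorch-metric-learning](https://github.com/KevinMusgrave/pytorch-metric-learning), where `type_of_triplets: all` is a miner setting. A minimal sketch of one loss computation under that assumption (the `alpha`/`beta`/`base`/`margin` values shown are that library's defaults, not confirmed settings for this model):

```python
import torch
from pytorch_metric_learning import losses, miners

# Mine all valid (anchor, positive, negative) triplets within a batch,
# where terms sharing a CUI count as positives.
miner = miners.TripletMarginMiner(margin=0.2, type_of_triplets="all")
loss_func = losses.MultiSimilarityLoss(alpha=2, beta=50, base=0.5)

def ms_loss_step(cls_embeddings: torch.Tensor, cui_labels: torch.Tensor) -> torch.Tensor:
    """cls_embeddings: (batch, hidden) CLS vectors; cui_labels: integer CUI id per term."""
    triplets = miner(cls_embeddings, cui_labels)
    return loss_func(cls_embeddings, cui_labels, triplets)
```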

### Expected input and output
The input should be a biomedical entity name (a short string), e.g. "covid infection" or "Hydroxychloroquine". The `[CLS]` embedding of the last layer is taken as the output.

#### Extracting embeddings from sap_umls_MedRoBERTa.nl

The following script converts a list of strings (entity names) into embeddings.
```python
import numpy as np
import torch
from tqdm.auto import tqdm
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("UMCU/sap_umls_MedRoBERTa.nl")
model = AutoModel.from_pretrained("UMCU/sap_umls_MedRoBERTa.nl").cuda()
model.eval()  # inference only: make dropout behaviour explicit

# replace with your own list of entity names
all_names = ["covid-19", "Coronavirus infection", "high fever", "Tumor of posterior wall of oropharynx"]

bs = 128  # batch size during inference
all_embs = []
for i in tqdm(np.arange(0, len(all_names), bs)):
    toks = tokenizer.batch_encode_plus(all_names[i:i+bs],
                                       padding="max_length",
                                       max_length=25,
                                       truncation=True,
                                       return_tensors="pt")
    toks_cuda = {k: v.cuda() for k, v in toks.items()}
    with torch.no_grad():  # no gradients needed for feature extraction
        cls_rep = model(**toks_cuda)[0][:, 0, :]  # use the CLS representation as the embedding
    all_embs.append(cls_rep.cpu().numpy())

all_embs = np.concatenate(all_embs, axis=0)
```
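
A typical downstream use of these embeddings is nearest-neighbour entity linking: L2-normalise the vectors and rank candidates by cosine similarity. A small continuation of the script above (the query/candidate split is illustrative):

```python
# L2-normalise so the dot product equals cosine similarity
normed = all_embs / np.linalg.norm(all_embs, axis=1, keepdims=True)
sims = normed @ normed.T

query_idx = 0  # "covid-19"
ranking = np.argsort(-sims[query_idx])
for j in ranking[1:3]:  # skip the query itself at rank 0
    print(f"{all_names[j]}: {sims[query_idx, j]:.3f}")
```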

# Wrapping it in SBERT

The same checkpoint can be wrapped as a `sentence-transformers` model, which makes encoding a one-liner:

```python
from sentence_transformers import SentenceTransformer, models

# 1) Define the transformer module pointing at your checkpoint
word_embedding_model = models.Transformer(
    "UMCU/sap_umls_MedRoBERTa.nl",
    max_seq_length=25
)

# 2) Pooling: use the [CLS] token representation
pooling_model = models.Pooling(
    word_embedding_model.get_word_embedding_dimension(),
    pooling_mode_cls_token=True,
    pooling_mode_mean_tokens=False,
    pooling_mode_max_tokens=False
)

# 3) Build the SentenceTransformer
sbert_model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
sbert_model.cuda()  # move to GPU if available

# 4) Save it as an SBERT model
save_path = "./sap_umls_sbert"
sbert_model.save(save_path)

# Now you can encode your list of phrases directly;
# .encode handles batching/padding/truncation internally:
all_names = [
    "covid-19",
    "Coronavirus infection",
    "high fever",
    "Tumor of posterior wall of oropharynx",
    # …etc.
]

all_embs = sbert_model.encode(
    all_names,
    batch_size=128,
    show_progress_bar=True,
    convert_to_numpy=True,
    normalize_embeddings=False   # or True if you want unit-norm embeddings
)
```

# Data description

Hard Dutch UMLS synonym pairs (terms referring to the same CUI). The Dutch part of the UMLS was extended with matching Dutch SNOMED CT terms, and English medication names were included.


# Acknowledgement

This is part of the [DT4H project](https://www.datatools4heart.eu/).

# DOI and reference

...


For more details about training and evaluation, see the SapBERT [github repo](https://github.com/cambridgeltl/sapbert).


### Citation
```bibtex
@inproceedings{liu-etal-2021-self,
    title = "Self-Alignment Pretraining for Biomedical Entity Representations",
    author = "Liu, Fangyu  and
      Shareghi, Ehsan  and
      Meng, Zaiqiao  and
      Basaldella, Marco  and
      Collier, Nigel",
    booktitle = "Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
    month = jun,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2021.naacl-main.334",
    pages = "4228--4238",
    abstract = "Despite the widespread success of self-supervised learning via masked language models (MLM), accurately capturing fine-grained semantic relationships in the biomedical domain remains a challenge. This is of paramount importance for entity-level tasks such as entity linking where the ability to model entity relations (especially synonymy) is pivotal. To address this challenge, we propose SapBERT, a pretraining scheme that self-aligns the representation space of biomedical entities. We design a scalable metric learning framework that can leverage UMLS, a massive collection of biomedical ontologies with 4M+ concepts. In contrast with previous pipeline-based hybrid systems, SapBERT offers an elegant one-model-for-all solution to the problem of medical entity linking (MEL), achieving a new state-of-the-art (SOTA) on six MEL benchmarking datasets. In the scientific domain, we achieve SOTA even without task-specific supervision. With substantial improvement over various domain-specific pretrained MLMs such as BioBERT, SciBERT and PubMedBERT, our pretraining scheme proves to be both effective and robust.",
}
```

For more details about training/eval and other scripts, see the CardioNER [github repo](https://github.com/DataTools4Heart/CardioNER),
and for more information on the background, see the Datatools4Heart [Huggingface organisation](https://huggingface.co/DT4H) and [website](https://www.datatools4heart.eu/).