---
language:
- nl
tags:
- Biomedical entity linking
- sapBERT
- bioNLP
- embeddings
- representation learning
---
## Dutch Biomedical Entity Linking

### Summary
- RoBERTa-based base model trained from scratch on Dutch hospital notes ([medRoBERTa.nl](https://huggingface.co/CLTL/MedRoBERTa.nl)).
- Second-phase pretrained using [self-alignment](https://doi.org/10.48550/arXiv.2010.11784) on a UMLS-derived Dutch biomedical ontology.
- Fine-tuned on an automatically generated, weakly labelled corpus from Wikipedia.
- Evaluation results on the [Mantra GSC](https://doi.org/10.1093/jamia/ocv037) corpus can be found in the [report](https://github.com/fonshartendorp/dutch_biomedical_entity_linking/blob/main/report/report.pdf).

All code for generating the training data, training the model, and evaluating it can be found in the [GitHub](https://github.com/fonshartendorp/dutch_biomedical_entity_linking) repository.

### Usage

The following script (reused from the original [sapBERT repository](https://huggingface.co/cambridgeltl/SapBERT-from-PubMedBERT-fulltext?text=kidney)) computes embeddings for a list of input entities (strings):

```
import numpy as np
import torch
from tqdm.auto import tqdm
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("fonshartendorp/dutch_biomedical_entity_linking")
model = AutoModel.from_pretrained("fonshartendorp/dutch_biomedical_entity_linking").cuda()

# replace with your own list of entity names
dutch_biomedical_entities = ["versnelde ademhaling", "Coronavirus infectie", "aandachtstekort/hyperactiviteitstoornis", "hartaanval"]

bs = 128  # batch size during inference
all_embs = []
for i in tqdm(np.arange(0, len(dutch_biomedical_entities), bs)):
    toks = tokenizer.batch_encode_plus(dutch_biomedical_entities[i:i+bs],
                                       padding="max_length",
                                       max_length=25,
                                       truncation=True,
                                       return_tensors="pt")
    toks_cuda = {}
    for k, v in toks.items():
        toks_cuda[k] = v.cuda()
    cls_rep = model(**toks_cuda)[0][:, 0, :]  # use the [CLS] representation as the embedding
    all_embs.append(cls_rep.cpu().detach().numpy())

all_embs = np.concatenate(all_embs, axis=0)
```
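
As a quick sanity check (not part of the original script), the resulting embeddings can be compared directly with cosine similarity; semantically related mentions should receive higher scores:

```
# Pairwise cosine similarity between the example entity embeddings
normalized = all_embs / np.linalg.norm(all_embs, axis=1, keepdims=True)
similarity = normalized @ normalized.T
print(np.round(similarity, 3))  # 4 x 4 similarity matrix for the four example mentions
```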

For (Dutch) biomedical entity linking, the following steps should be performed (a minimal sketch of steps 2–4 is given after the list):

1. Request a UMLS (and SNOMED NL) license.
2. Precompute embeddings for all entities in the UMLS with the fine-tuned model.
3. Compute the embedding of the new, unseen mention with the fine-tuned model.
4. Perform nearest-neighbour search (or query a FAISS index) to link the embedding of the new mention to its most similar embedding from the UMLS.
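
A minimal sketch of steps 2–4, reusing the tokenizer, model, and `all_embs` from the script above. Since the UMLS terms themselves require a license, `all_embs` stands in for the precomputed UMLS embeddings, and the mention string and variable names are illustrative only:

```
import faiss
import numpy as np

# Step 2 (stand-in): use the example embeddings above in place of the
# precomputed embeddings of all UMLS entity names.
ontology_embs = np.ascontiguousarray(all_embs, dtype=np.float32)
faiss.normalize_L2(ontology_embs)                  # normalise so inner product == cosine similarity

index = faiss.IndexFlatIP(ontology_embs.shape[1])  # exact inner-product index
index.add(ontology_embs)

# Step 3: embed a new, unseen mention (illustrative example) with the fine-tuned model.
toks = tokenizer(["myocardinfarct"], padding="max_length", max_length=25,
                 truncation=True, return_tensors="pt")
mention_emb = model(**{k: v.cuda() for k, v in toks.items()})[0][:, 0, :]
mention_emb = np.ascontiguousarray(mention_emb.cpu().detach().numpy(), dtype=np.float32)
faiss.normalize_L2(mention_emb)

# Step 4: nearest-neighbour search; with real UMLS embeddings the returned
# indices would map back to UMLS concepts (CUIs) rather than the example strings.
scores, ids = index.search(mention_emb, 3)
for score, idx in zip(scores[0], ids[0]):
    print(dutch_biomedical_entities[idx], float(score))
```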
|