---
id: CardioNER.nl_128xtokenWindow
name: CardioNER.nl_128xtokenWindow
description: >-
  CardioBERTa.nl_clinical finetuned for multilabel NER task with tokenwindow of
  128
license: gpl-3.0
language: nl
tags:
- lexical semantic
- span classification
- science
- biology
- clinical ner
- biomedical
- ner,medical
- bionlp
base_model: UMCU/CardioBERTa.nl_clinical
pipeline_tag: token-classification
datasets:
- DT4H/CardioCCC
- UMCU/cardioccc_dutch
---

# Model Card for CardioNER.nl 128

This is a UMCU/CardioBERTa.nl_clinical base model fine-tuned for span classification. We used
IOB tagging, which facilitates the aggregation of predictions over sequences. This specific
model was trained on a batch of about 500 span-labeled documents.

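To illustrate why IOB tags make aggregation straightforward, here is a minimal sketch that collapses a tag sequence into spans. The tag strings (`B-disease`, `I-disease`, ...) and the helper name `iob_to_spans` are assumptions for illustration; the exact label names used by this model may differ.

```python
def iob_to_spans(tags):
    """Collapse a sequence of IOB tags into (label, start, end) spans; end is exclusive."""
    spans = []
    label, start = None, None
    for i, tag in enumerate(tags):
        prefix, _, name = tag.partition("-")
        # A span continues only on an I- tag carrying the same label as the open span.
        if prefix == "I" and name == label:
            continue
        # Any other tag closes the open span, if there is one.
        if label is not None:
            spans.append((label, start, i))
            label, start = None, None
        # B- opens a new span; a stray I- is treated as a span start as well.
        if prefix in ("B", "I"):
            label, start = name, i
    if label is not None:
        spans.append((label, start, len(tags)))
    return spans


# Hypothetical tag sequence, one tag per token.
tags = ["O", "B-disease", "I-disease", "O", "B-medication"]
print(iob_to_spans(tags))
```

Because every span starts with an explicit `B-` tag, two adjacent entities of the same class stay separate when token-level predictions are merged.
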
This version was trained with context windows of 128 tokens. For the chunking we used a paragraph-based splitter.

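The splitter used for training is not described here, so the following is only a sketch of what paragraph-based chunking under a token budget can look like. It assumes paragraphs are separated by blank lines and counts whitespace-separated words as a stand-in for the model tokenizer; `chunk_paragraphs` and `count_tokens` are hypothetical names.

```python
def chunk_paragraphs(text, max_tokens=128, count_tokens=lambda s: len(s.split())):
    """Pack consecutive paragraphs into chunks of at most max_tokens tokens."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current, current_len = [], [], 0
    for para in paragraphs:
        n = count_tokens(para)
        # Start a new chunk when adding this paragraph would exceed the budget.
        if current and current_len + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(para)
        current_len += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks


text = "a b c\n\nd e\n\nf g h i"
print(chunk_paragraphs(text, max_tokens=5))
```

A single paragraph longer than `max_tokens` is kept whole in this sketch and would still need to be split further, for example by sentence.
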
Training was performed with 10-fold cross-validation (CV), with weight averaging of the best epochs per fold.

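The exact averaging procedure is not spelled out here. As one plausible reading, the best checkpoint of each fold is averaged parameter by parameter. The sketch below shows that element-wise mean on plain Python lists, which stand in for the tensors of a real state dict; `average_weights` and the parameter names are hypothetical.

```python
def average_weights(state_dicts):
    """Element-wise mean of parameter dicts that share the same keys and shapes."""
    n = len(state_dicts)
    averaged = {}
    for name in state_dicts[0]:
        # Line up the i-th value of this parameter across all checkpoints.
        columns = zip(*(sd[name] for sd in state_dicts))
        averaged[name] = [sum(values) / n for values in columns]
    return averaged


# Toy "checkpoints" for two folds; real ones would hold tensors, not lists.
fold_weights = [
    {"classifier.weight": [0.0, 2.0], "classifier.bias": [1.0]},
    {"classifier.weight": [2.0, 4.0], "classifier.bias": [3.0]},
]
print(average_weights(fold_weights))
```
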
### Expected input and output

The input should be a string with **Dutch** clinical text related to **cardiology**.

CardioNER.nl_128 is a multiclass span classification model.
The classes that can be predicted are

* **procedure**,
* **medication**,
* **disease**,
* **symptom**.

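With an aggregated `transformers` token-classification pipeline, each prediction is a dict with the keys `entity_group`, `score`, `word`, `start` and `end`. The sketch below groups such predictions by class. The example predictions are hand-written for illustration, not real model output, and the lowercase label names are an assumption.

```python
from collections import defaultdict


def group_by_class(entities):
    """Group pipeline predictions into {class: [(start, end, word), ...]}."""
    grouped = defaultdict(list)
    for ent in entities:
        grouped[ent["entity_group"].lower()].append((ent["start"], ent["end"], ent["word"]))
    return dict(grouped)


# Hand-written predictions for: "Patiënt met hartfalen gebruikt metoprolol."
example_output = [
    {"entity_group": "disease", "score": 0.98, "word": "hartfalen", "start": 12, "end": 21},
    {"entity_group": "medication", "score": 0.95, "word": "metoprolol", "start": 31, "end": 41},
]
print(group_by_class(example_output))
```
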
#### Extracting span classification from CardioNER.nl_128xtokenWindow

The following script converts a string of fewer than 128 tokens to a list of span predictions.

```python
from transformers import pipeline

# Placeholder: replace with the Hub repo id or local path of this model
model = "CardioNER.nl_128xtokenWindow"
# Example input: Dutch cardiology text of fewer than 128 tokens
SOME_TEXT = "Patiënt met hartfalen gebruikt metoprolol."

le_pipe = pipeline('ner',
                   model=model,
                   tokenizer=model,
                   aggregation_strategy="simple",
                   device=-1)  # -1 runs on CPU

named_ents = le_pipe(SOME_TEXT)
```

To process a string of *arbitrary length*, you can split it into sentences or paragraphs
using e.g. pysbd or spaCy (sentencizer) and iteratively parse the resulting list with the span-classification pipe.
You can also use the stride option built into the transformers pipeline, although this is limited to non-overlapping strides, requires a fast tokenizer, and does not work with `aggregation_strategy=None`:

```python
named_ents = le_pipe(SOME_TEXT, stride=256)
```

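The split-and-iterate approach mentioned above can be sketched as follows. The example splits on blank lines with a regular expression rather than pysbd or spaCy, and shifts the chunk-relative `start`/`end` offsets back to document coordinates. `predict_long_text` and `fake_pipe` are hypothetical names; `fake_pipe` is a stand-in so the example runs without downloading a model, and in practice you would pass `le_pipe`.

```python
import re


def predict_long_text(text, pipe):
    """Run pipe on each blank-line-separated paragraph and restore document offsets."""
    entities = []
    # Each match is a run of consecutive non-empty lines, i.e. one paragraph.
    for para in re.finditer(r"[^\n]+(?:\n[^\n]+)*", text):
        for ent in pipe(para.group()):
            # Offsets returned by the pipe are relative to the paragraph.
            shifted = dict(ent, start=ent["start"] + para.start(),
                           end=ent["end"] + para.start())
            entities.append(shifted)
    return entities


def fake_pipe(chunk):
    """Stand-in for le_pipe: tags every occurrence of 'metoprolol' as medication."""
    return [{"entity_group": "medication", "word": m.group(),
             "start": m.start(), "end": m.end()}
            for m in re.finditer("metoprolol", chunk)]


doc = "Pijn op de borst.\n\nStart metoprolol."
ents = predict_long_text(doc, fake_pipe)
print(ents)
```

Paragraphs that are themselves longer than the 128-token window would still need to be split further, for example by sentence.
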
# Data description

CardioCCC: manually labeled cardiology discharge letters, annotated with the classes procedure, medication, disease, and symptom.

# Acknowledgement

This is part of the [DT4H project](https://www.datatools4heart.eu/).

# DOI and reference

For more details about training/evaluation and other scripts, see the CardioNER [GitHub repo](https://github.com/DataTools4Heart/CardioNER).
For more information on the background, see DataTools4Heart on [Hugging Face](https://huggingface.co/DT4H) or the project [website](https://www.datatools4heart.eu/).