---
id: CardioNER.nl_128xtokenWindow
name: CardioNER.nl_128xtokenWindow
description: >-
  CardioBERTa.nl_clinical finetuned for multilabel NER task with tokenwindow of
  128
license: gpl-3.0
language: nl
tags:
- lexical semantic
- span classification
- science
- biology
- clinical ner
- biomedical
- ner,medical
- bionlp
base_model: UMCU/CardioBERTa.nl_clinical
pipeline_tag: token-classification
datasets:
- DT4H/CardioCCC
- UMCU/cardioccc_dutch
---
# Model Card for CardioNER.nl_128
This is a UMCU/CardioBERTa.nl_clinical base model fine-tuned for span classification. For this model
we used IOB tagging, because the IOB schema makes it easier to aggregate token-level predictions
into spans over a sequence. This specific model was trained on a batch of about 500 span-labeled documents.
This version was trained with context windows of 128 tokens. For the chunking we used a paragraph-based splitter.
Training was performed with 10-fold cross-validation, with weight averaging of the best epochs per fold.
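
To illustrate why IOB tags are easy to aggregate, here is a minimal sketch that merges token-level tags into spans. It is not the project's own aggregation code; the function name `iob_to_spans` and the tag strings such as `B-disease` are assumptions made for the example.

```python
from typing import List, Tuple


def iob_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Merge token-level IOB tags into (label, start_token, end_token_exclusive) spans."""
    spans = []
    label, start = None, None
    for i, tag in enumerate(tags):
        # The open span continues only on an "I-" tag carrying the same label.
        continues = tag.startswith("I-") and label == tag[2:]
        if label is not None and not continues:
            spans.append((label, start, i))
            label, start = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and label is None):
            # A stray "I-" without a preceding "B-" is treated as a span start.
            label, start = tag[2:], i
    if label is not None:
        # Close a span that runs to the end of the sequence.
        spans.append((label, start, len(tags)))
    return spans
```

For example, `iob_to_spans(["O", "B-disease", "I-disease", "O"])` yields one `disease` span covering tokens 1 and 2.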
### Expected input and output
The input should be a string with **Dutch** clinical text related to **cardiology**.
CardioNER.nl_128 is a multiclass span classification model.
The classes that can be predicted are:
* **procedure**,
* **medication**,
* **disease**,
* **symptom**.
#### Extracting span classification from CardioNER.nl_128xtokenWindow
The following script converts a string of fewer than 128 tokens to a list of span predictions.
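```python
from transformers import pipeline

# Assumed Hub repo id for this model; adjust it or point to a local path.
model = "UMCU/CardioNER.nl_128"
SOME_TEXT = "Patiënt heeft pijn op de borst en gebruikt metoprolol."  # illustrative input

le_pipe = pipeline('ner',
                   model=model,
                   tokenizer=model, aggregation_strategy="simple",
                   device=-1)  # device=-1 runs on CPU
named_ents = le_pipe(SOME_TEXT)
```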
To process a string of *arbitrary length* you can split it into sentences or paragraphs,
using e.g. pysbd or spaCy (sentencizer), and iteratively pass the resulting list to the span-classification pipe.
You can also use the `stride` option built into the transformers pipeline, although this is limited to non-overlapping strides, requires a fast tokenizer, and does not work with `aggregation_strategy=None`:
```python
named_ents = le_pipe(SOME_TEXT, stride=256)
```
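
The split-and-iterate approach can be sketched as follows. This is a minimal example, not the project's own chunker: it assumes paragraphs are separated by blank lines and that the NER callable returns dicts with character offsets under `start` and `end`, as the pipeline does with `aggregation_strategy="simple"`. The helper name `predict_by_paragraph` is hypothetical.

```python
import re
from typing import Callable, Dict, List


def predict_by_paragraph(text: str, ner_fn: Callable[[str], List[Dict]]) -> List[Dict]:
    """Run ner_fn on each paragraph and shift span offsets back to the full text."""
    results = []
    pos = 0  # character offset of the current chunk within the full text
    # The capturing group makes re.split return the separators too,
    # so the chunk lengths always add up to the original text.
    for chunk in re.split(r"(\n\s*\n)", text):
        if chunk.strip():
            for ent in ner_fn(chunk):
                shifted = dict(ent)
                shifted["start"] = ent["start"] + pos
                shifted["end"] = ent["end"] + pos
                results.append(shifted)
        pos += len(chunk)
    return results
```

With the pipeline defined earlier this would be called as `predict_by_paragraph(SOME_TEXT, le_pipe)`. A single paragraph can still exceed 128 tokens, so very long paragraphs may need a further sentence-level split.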
# Data description
CardioCCC: manually labeled cardiology discharge letters, with the span classes procedure, medication, disease and symptom.
# Acknowledgement
This is part of the [DT4H project](https://www.datatools4heart.eu/).
# DOI and reference
For more details about training/eval and other scripts, see the CardioNER [GitHub repo](https://github.com/DataTools4Heart/CardioNER).
For more information on the background, see DataTools4Heart on [Hugging Face](https://huggingface.co/DT4H) or the project [website](https://www.datatools4heart.eu/).