---
license: cc-by-4.0
language:
- es
pipeline_tag: token-classification
---

# Model Card for Model ID

These models aim to recognise occupation mentions (NER) in Spanish clinical notes and to identify to whom each occupation belongs.

## Model Details

<style type="text/css">
.tg {border-collapse:collapse;border-spacing:0;}
.tg td{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
  overflow:hidden;padding:10px 5px;word-break:normal;}
.tg th{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
  font-weight:normal;overflow:hidden;padding:10px 5px;word-break:normal;}
.tg .tg-c3ow{border-color:inherit;text-align:center;vertical-align:top}
</style>
<table class="tg">
<thead>
  <tr>
    <th class="tg-c3ow">PLM Model</th>
    <th class="tg-c3ow">Learning<br>rate</th>
    <th class="tg-c3ow">Batch size</th>
    <th class="tg-c3ow">Epochs</th>
    <th class="tg-c3ow">Max<br>length</th>
    <th class="tg-c3ow">Optimizer</th>
    <th class="tg-c3ow">Max clip<br>grad norm</th>
    <th class="tg-c3ow">Epsilon</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td class="tg-c3ow">PlanTL-GOB-ES/<br>roberta-base-biomedical-es<br></td>
    <td class="tg-c3ow">2e-05</td>
    <td class="tg-c3ow">8</td>
    <td class="tg-c3ow">10</td>
    <td class="tg-c3ow">510</td>
    <td class="tg-c3ow">AdamW</td>
    <td class="tg-c3ow">1</td>
    <td class="tg-c3ow">1e-08</td>
  </tr>
</tbody>
</table>

### Model Description

The PlanTL-GOB-ES/roberta-base-biomedical-es model was fine-tuned on the MEDDOPROF corpus (Salvador Lima-López, Eulàlia Farré-Maduell, Antonio Miranda-Escalada, Vicent Briva-Iglesias, & Martin Krallinger. (2022). MEDDOPROF corpus: complete gold standard annotations for occupation detection in medical documents in Spanish [Data set]. Zenodo. https://doi.org/10.5281/zenodo.7116201).

Two models were built: a model for occupation recognition (`MEDDO_FINAL_ROBERTA_ner_sentencia_510_8_10_2e-05_1e-08`) and a model to detect to whom the occupation belongs (`MEDDO_FINAL_ROBERTA_class_sentencia_510_8_10_2e-05_1e-08`).
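
The two model directory names appear to encode the fine-tuning hyperparameters listed in the table above (maximum length, batch size, epochs, learning rate and epsilon). The following is a minimal sketch that reads them back out, assuming that naming convention holds; `RunConfig` and `parse_model_name` are hypothetical helpers introduced here for illustration and are not part of the released code.

```python
from typing import NamedTuple


class RunConfig(NamedTuple):
    task: str
    max_length: int
    batch_size: int
    epochs: int
    learning_rate: float
    epsilon: float


def parse_model_name(name: str) -> RunConfig:
    # Assumed layout (not confirmed by the authors):
    # MEDDO_FINAL_ROBERTA_<task>_sentencia_<max_len>_<batch>_<epochs>_<lr>_<eps>
    parts = name.split("_")
    task = parts[3]
    max_length, batch_size, epochs = (int(p) for p in parts[5:8])
    learning_rate, epsilon = (float(p) for p in parts[8:10])
    return RunConfig(task, max_length, batch_size, epochs, learning_rate, epsilon)


ner_config = parse_model_name("MEDDO_FINAL_ROBERTA_ner_sentencia_510_8_10_2e-05_1e-08")
class_config = parse_model_name("MEDDO_FINAL_ROBERTA_class_sentencia_510_8_10_2e-05_1e-08")
print(ner_config)
print(class_config)
```

Under this reading, both models share the same hyperparameters and differ only in the task field (`ner` versus `class`).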

More details can be found in the MEDDOPROF shared task overview:

Lima-López, S., Farré-Maduell, E., Miranda-Escalada, A., Briva-Iglesias, V., & Krallinger, M. (2021). NLP applied to occupational health: MEDDOPROF shared task at IberLEF 2021 on automatic recognition, classification and normalization of professions and occupations from medical texts. Procesamiento del Lenguaje Natural, 67, 243-256.

- **Developed by:** Alfredo Madrid
- **Language(s) (NLP):** Spanish
- **License:** CC BY-SA 4.0
- **Finetuned from model:** PlanTL-GOB-ES/roberta-base-biomedical-es

### Model Sources

- **Repository:** https://huggingface.co/HCSCRheuma/Occupations
- **Paper:** Madrid García, A. (2023). Recognition of professions in medical documentation.

## Uses

**Model 1**

```
import torch
import pandas as pd
import numpy as np
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Load the occupation-recognition (NER) model and its tokenizer from the model directory
model = AutoModelForTokenClassification.from_pretrained("MEDDO_FINAL_ROBERTA_ner_sentencia_510_8_10_2e-05_1e-08")
tokenizer = AutoTokenizer.from_pretrained("MEDDO_FINAL_ROBERTA_ner_sentencia_510_8_10_2e-05_1e-08")
```

```
note = "El paciente trabaja en una empresa de construccion los jueves"
tokenized_sentence = tokenizer.encode(note, truncation=True)
tokenized_words_ids = tokenizer(note, truncation=True)
# word_ids(i) maps each token of sequence i to the index of its source word
word_ids = tokenized_words_ids.word_ids
input_ids = torch.tensor([tokenized_sentence])
with torch.no_grad():
    output = model(input_ids)
# Pick the highest-scoring label id for every token
label_indices = np.argmax(output[0].to('cpu').numpy(), axis=2)
tokens = tokenizer.convert_ids_to_tokens(input_ids.numpy()[0])
label_indices
```

```
# Column names follow the original script: "labels" holds the token text,
# "tokens" holds the predicted tag, and "relation" holds the word index.
df = pd.DataFrame(zip(tokens, label_indices[0], word_ids(0)), columns=["labels", "tokens", "relation"])
df['labels'] = df['labels'].str.replace('##', '')
# Map predicted label ids to tag names
df['tokens'] = df['tokens'].map({0: 'B-PROFESION', 1: 'B-SITUACION_LABORAL', 2: 'I-SITUACION_LABORAL', 3: 'I-ACTIVIDAD', 4: 'I-PROFESION', 5: 'O', 6: 'B-ACTIVIDAD', 7: 'PAD'})
# Drop the special start and end tokens, which have no word index
df = df[1:-1].copy()
df['relation'] = df['relation'].astype('int')
# Join the sub-word pieces of each word and keep the tag of its first piece
df['labels'] = df.groupby('relation')['labels'].transform(lambda x: ''.join(x))
df = df.groupby('relation').first()
df
```

**Output**

| relation | labels        | tokens      |
|:--------:|:-------------:|:-----------:|
| 0        | ĠEl           | O           |
| 1        | Ġpaciente     | O           |
| 2        | Ġtrabaja      | B-PROFESION |
| 3        | Ġen           | I-PROFESION |
| 4        | Ġuna          | I-PROFESION |
| 5        | Ġempresa      | I-PROFESION |
| 6        | Ġde           | I-PROFESION |
| 7        | Ġconstruccion | I-PROFESION |
| 8        | Ġlos          | O           |
| 9        | Ġjueves       | O           |
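
The word-level BIO tags in the table above can be merged into whole occupation mentions. The following is a minimal sketch of that post-processing step, not part of the original script; `bio_to_spans` is a hypothetical helper, and the word and tag lists are copied by hand from the output table rather than produced by the model here.

```python
from typing import List, Optional, Tuple


def bio_to_spans(words: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Merge word-level BIO tags into (entity_type, text) spans."""
    spans: List[Tuple[str, str]] = []
    current_type: Optional[str] = None
    current_words: List[str] = []
    for word, tag in zip(words, tags):
        word = word.replace("Ġ", "")  # strip the RoBERTa word-start marker
        prefix, _, entity_type = tag.partition("-")
        is_entity = prefix in ("B", "I")
        # A span ends on a non-entity tag, on any B- tag, or when the entity type changes
        starts_new = prefix == "B" or (prefix == "I" and entity_type != current_type)
        if not is_entity or starts_new:
            if current_words:
                spans.append((current_type, " ".join(current_words)))
            current_type, current_words = None, []
        if is_entity:
            current_type = entity_type
            current_words.append(word)
    if current_words:  # close a span that runs to the end of the sentence
        spans.append((current_type, " ".join(current_words)))
    return spans


words = ["ĠEl", "Ġpaciente", "Ġtrabaja", "Ġen", "Ġuna", "Ġempresa",
         "Ġde", "Ġconstruccion", "Ġlos", "Ġjueves"]
tags = ["O", "O", "B-PROFESION", "I-PROFESION", "I-PROFESION", "I-PROFESION",
        "I-PROFESION", "I-PROFESION", "O", "O"]
spans = bio_to_spans(words, tags)
print(spans)  # [('PROFESION', 'trabaja en una empresa de construccion')]
```
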

**Model 2**

```
import torch
import pandas as pd
import numpy as np
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Load the model that detects to whom the occupation belongs, and its tokenizer
model = AutoModelForTokenClassification.from_pretrained("MEDDO_FINAL_ROBERTA_class_sentencia_510_8_10_2e-05_1e-08")
tokenizer = AutoTokenizer.from_pretrained("MEDDO_FINAL_ROBERTA_class_sentencia_510_8_10_2e-05_1e-08")
```

```
note = "El paciente trabaja en una empresa de construccion los jueves"
tokenized_sentence = tokenizer.encode(note, truncation=True)
tokenized_words_ids = tokenizer(note, truncation=True)
# word_ids(i) maps each token of sequence i to the index of its source word
word_ids = tokenized_words_ids.word_ids
input_ids = torch.tensor([tokenized_sentence])
with torch.no_grad():
    output = model(input_ids)
# Pick the highest-scoring label id for every token
label_indices = np.argmax(output[0].to('cpu').numpy(), axis=2)
tokens = tokenizer.convert_ids_to_tokens(input_ids.to('cpu').numpy()[0])
label_indices
```

```
# Column names follow the original script: "labels" holds the token text,
# "tokens" holds the predicted tag, and "relation" holds the word index.
df = pd.DataFrame(zip(tokens, label_indices[0], word_ids(0)), columns=["labels", "tokens", "relation"])
df['labels'] = df['labels'].str.replace('##', '')
# Map predicted label ids to tag names
df['tokens'] = df['tokens'].map({0: 'B-FAMILIAR', 1: 'I-PACIENTE', 2: 'I-OTROS', 3: 'B-SANITARIO', 4: 'B-PACIENTE', 5: 'I-FAMILIAR', 6: 'O', 7: 'B-OTROS', 8: 'I-SANITARIO', 9: 'PAD'})
# Drop the special start and end tokens, which have no word index
df = df[1:-1].copy()
df['relation'] = df['relation'].astype('int')
# Join the sub-word pieces of each word and keep the tag of its first piece
df['labels'] = df.groupby('relation')['labels'].transform(lambda x: ''.join(x))
df = df.groupby('relation').first()
df
```

**Output**

| relation | labels        | tokens      |
|:--------:|:-------------:|:-----------:|
| 0        | ĠEl           | O           |
| 1        | Ġpaciente     | O           |
| 2        | Ġtrabaja      | B-PACIENTE  |
| 3        | Ġen           | I-PACIENTE  |
| 4        | Ġuna          | I-PACIENTE  |
| 5        | Ġempresa      | I-PACIENTE  |
| 6        | Ġde           | I-PACIENTE  |
| 7        | Ġconstruccion | I-PACIENTE  |
| 8        | Ġlos          | O           |
| 9        | Ġjueves       | O           |
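
Because both models tag the same words, their outputs can be aligned by word index to attach a holder (patient, relative, health worker, other) to each occupation mention. The following is a minimal sketch of one way to do this, assuming a majority vote over the holder tags inside each mention; `attach_holders` is a hypothetical helper, the tag lists are copied by hand from the two output tables, and, as a simplification, a span continues on any `I-` tag regardless of entity type.

```python
from collections import Counter
from typing import List, Optional, Tuple


def attach_holders(
    words: List[str], ner_tags: List[str], holder_tags: List[str]
) -> List[Tuple[str, str, Optional[str]]]:
    """Return (mention text, occupation type, holder) for each Model 1 span."""
    results = []
    start = None
    # A trailing "O" sentinel closes a span that reaches the end of the sentence
    for i, tag in enumerate(list(ner_tags) + ["O"]):
        continues = start is not None and tag.startswith("I-")
        if start is not None and not continues:
            text = " ".join(w.replace("Ġ", "") for w in words[start:i])
            entity_type = ner_tags[start].split("-", 1)[1]
            # Majority vote over the Model 2 tags covering the same words (assumed rule)
            votes = Counter(t.split("-", 1)[1] for t in holder_tags[start:i] if "-" in t)
            holder = votes.most_common(1)[0][0] if votes else None
            results.append((text, entity_type, holder))
            start = None
        if tag.startswith(("B-", "I-")) and start is None:
            start = i
    return results


words = ["ĠEl", "Ġpaciente", "Ġtrabaja", "Ġen", "Ġuna", "Ġempresa",
         "Ġde", "Ġconstruccion", "Ġlos", "Ġjueves"]
ner_tags = ["O", "O", "B-PROFESION", "I-PROFESION", "I-PROFESION", "I-PROFESION",
            "I-PROFESION", "I-PROFESION", "O", "O"]
holder_tags = ["O", "O", "B-PACIENTE", "I-PACIENTE", "I-PACIENTE", "I-PACIENTE",
               "I-PACIENTE", "I-PACIENTE", "O", "O"]
pairs = attach_holders(words, ner_tags, holder_tags)
print(pairs)  # [('trabaja en una empresa de construccion', 'PROFESION', 'PACIENTE')]
```

A mention whose words receive no holder tag from Model 2 gets `None` as its holder in this sketch, which leaves the decision to the caller.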