---
license: cc-by-4.0
language:
- es
pipeline_tag: token-classification
---

# Model Card for HCSCRheuma/Occupations

<!-- Provide a quick summary of what the model is/does. -->
These models recognise occupation mentions (NER) in Spanish clinical notes and identify to whom each occupation belongs.

## Model Details

<style type="text/css">
.tg  {border-collapse:collapse;border-spacing:0;}
.tg td{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
  overflow:hidden;padding:10px 5px;word-break:normal;}
.tg th{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
  font-weight:normal;overflow:hidden;padding:10px 5px;word-break:normal;}
.tg .tg-c3ow{border-color:inherit;text-align:center;vertical-align:top}
</style>
<table class="tg">
<thead>
  <tr>
    <th class="tg-c3ow">PLM Model</th>
    <th class="tg-c3ow">Learning<br>rate</th>
    <th class="tg-c3ow">Batch size</th>
    <th class="tg-c3ow">Epochs</th>
    <th class="tg-c3ow">Max<br>length</th>
    <th class="tg-c3ow">Optimizer</th>
    <th class="tg-c3ow">Max clip<br>grad norm</th>
    <th class="tg-c3ow">Epsilon</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td class="tg-c3ow">PlanTL-GOB-ES/<br>roberta-base-biomedical-es<br></td>
    <td class="tg-c3ow">2e-05</td>
    <td class="tg-c3ow">8</td>
    <td class="tg-c3ow">10</td>
    <td class="tg-c3ow">510</td>
    <td class="tg-c3ow">AdamW</td>
    <td class="tg-c3ow">1</td>
    <td class="tg-c3ow">1e-08</td>
  </tr>
</tbody>
</table>
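
For reference, the hyperparameters in the table above collected into a single dictionary (a convenience sketch; the original training script is not published, so the key names are illustrative):

```python
# Fine-tuning hyperparameters from the table above (key names are illustrative)
training_config = {
    "base_model": "PlanTL-GOB-ES/roberta-base-biomedical-es",
    "learning_rate": 2e-05,
    "batch_size": 8,
    "epochs": 10,
    "max_length": 510,
    "optimizer": "AdamW",
    "max_grad_norm": 1.0,
    "adam_epsilon": 1e-08,
}
```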

### Model Description

The PlanTL-GOB-ES/roberta-base-biomedical-es model was fine-tuned on the MEDDOPROF corpus (Salvador Lima-López, Eulàlia Farré-Maduell, Antonio Miranda-Escalada, Vicent Briva-Iglesias, & Martin Krallinger. (2022). MEDDOPROF corpus: complete gold standard annotations for occupation detection in medical documents in Spanish [Data set]. Zenodo. https://doi.org/10.5281/zenodo.7116201).

Two models were built: a model for occupation recognition (MEDDO_FINAL_ROBERTA_ner_sentencia_510_8_10_2e-05_1e-08) and a model that detects to whom the profession belongs (MEDDO_FINAL_ROBERTA_class_sentencia_510_8_10_2e-05_1e-08).

More details can be found in the MEDDOPROF shared task overview:
Lima-López, S., Farré-Maduell, E., Miranda-Escalada, A., Brivá-Iglesias, V., & Krallinger, M. (2021). NLP applied to occupational health: MEDDOPROF shared task at IberLEF 2021 on automatic recognition, classification and normalization of professions and occupations from medical texts. Procesamiento del Lenguaje Natural, 67, 243-256.

- **Developed by:** Alfredo Madrid
- **Language(s) (NLP):** Spanish
- **License:** CC BY-SA 4.0
- **Finetuned from model [optional]:** PlanTL-GOB-ES/roberta-base-biomedical-es

### Model Sources

<!-- Provide the basic links for the model. -->

- **Repository:** https://huggingface.co/HCSCRheuma/Occupations
- **Paper [optional]:** Madrid García, A. (2023). Recognition of professions in medical documentation.

## Uses

**Model 1: occupation recognition (NER)**

```python
import torch
import pandas as pd
import numpy as np
from transformers import AutoTokenizer, AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained("MEDDO_FINAL_ROBERTA_ner_sentencia_510_8_10_2e-05_1e-08")
tokenizer = AutoTokenizer.from_pretrained("MEDDO_FINAL_ROBERTA_ner_sentencia_510_8_10_2e-05_1e-08")
```

```python
note = "El paciente trabaja en una empresa de construccion los jueves"
encoded = tokenizer(note, truncation=True)
word_ids = encoded.word_ids()  # word index for each subword token (None for special tokens)
input_ids = torch.tensor([encoded["input_ids"]])
with torch.no_grad():
    output = model(input_ids)
label_indices = np.argmax(output.logits.cpu().numpy(), axis=2)
tokens = tokenizer.convert_ids_to_tokens(input_ids.numpy()[0])
```

```python
df = pd.DataFrame(zip(tokens, label_indices[0], word_ids), columns=["tokens", "labels", "relation"])
df['labels'] = df['labels'].map({0: 'B-PROFESION', 1: 'B-SITUACION_LABORAL', 2: 'I-SITUACION_LABORAL', 3: 'I-ACTIVIDAD', 4: 'I-PROFESION', 5: 'O', 6: 'B-ACTIVIDAD', 7: 'PAD'})
df = df[1:-1]  # drop the <s> and </s> special tokens
df['relation'] = df['relation'].astype('int')
df['tokens'] = df.groupby('relation')['tokens'].transform(lambda x: ''.join(x))  # merge subword pieces per word
df = df.groupby('relation').first()
df
```
**Output**
| relation |     tokens    |    labels   |
|:--------:|:-------------:|:-----------:|
|     0    |      ĠEl      |      O      |
|     1    |   Ġpaciente   |      O      |
|     2    |    Ġtrabaja   | B-PROFESION |
|     3    |      Ġen      | I-PROFESION |
|     4    |      Ġuna     | I-PROFESION |
|     5    |    Ġempresa   | I-PROFESION |
|     6    |      Ġde      | I-PROFESION |
|     7    | Ġconstruccion | I-PROFESION |
|     8    |      Ġlos     |      O      |
|     9    |    Ġjueves    |      O      |
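
The word-level BIO tags above can be merged into entity spans with a small helper. A minimal sketch (the function name is illustrative, not part of this repository):

```python
def bio_to_spans(words, tags):
    """Merge word-level BIO tags into (entity_type, text) spans."""
    spans, current = [], None
    for word, tag in zip(words, tags):
        if tag.startswith("B-"):
            if current:
                spans.append(current)
            current = (tag[2:], [word])  # start a new span
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(word)      # continue the open span
        else:
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [(label, " ".join(ws)) for label, ws in spans]

words = ["El", "paciente", "trabaja", "en", "una", "empresa", "de", "construccion", "los", "jueves"]
tags = ["O", "O", "B-PROFESION", "I-PROFESION", "I-PROFESION", "I-PROFESION", "I-PROFESION", "I-PROFESION", "O", "O"]
print(bio_to_spans(words, tags))
# → [('PROFESION', 'trabaja en una empresa de construccion')]
```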


**Model 2: to whom the occupation belongs**

```python
import torch
import pandas as pd
import numpy as np
from transformers import AutoTokenizer, AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained("MEDDO_FINAL_ROBERTA_class_sentencia_510_8_10_2e-05_1e-08")
tokenizer = AutoTokenizer.from_pretrained("MEDDO_FINAL_ROBERTA_class_sentencia_510_8_10_2e-05_1e-08")
```

```python
note = "El paciente trabaja en una empresa de construccion los jueves"
encoded = tokenizer(note, truncation=True)
word_ids = encoded.word_ids()  # word index for each subword token (None for special tokens)
input_ids = torch.tensor([encoded["input_ids"]])
with torch.no_grad():
    output = model(input_ids)
label_indices = np.argmax(output.logits.cpu().numpy(), axis=2)
tokens = tokenizer.convert_ids_to_tokens(input_ids.numpy()[0])
```

```python
df = pd.DataFrame(zip(tokens, label_indices[0], word_ids), columns=["tokens", "labels", "relation"])
df['labels'] = df['labels'].map({0: 'B-FAMILIAR', 1: 'I-PACIENTE', 2: 'I-OTROS', 3: 'B-SANITARIO', 4: 'B-PACIENTE', 5: 'I-FAMILIAR', 6: 'O', 7: 'B-OTROS', 8: 'I-SANITARIO', 9: 'PAD'})
df = df[1:-1]  # drop the <s> and </s> special tokens
df['relation'] = df['relation'].astype('int')
df['tokens'] = df.groupby('relation')['tokens'].transform(lambda x: ''.join(x))  # merge subword pieces per word
df = df.groupby('relation').first()
df
```

**Output**

| relation |     tokens    |    labels    |
|:--------:|:-------------:|:------------:|
|     0    |      ĠEl      |      O       |
|     1    |   Ġpaciente   |      O       |
|     2    |    Ġtrabaja   |  B-PACIENTE  |
|     3    |      Ġen      |  I-PACIENTE  |
|     4    |      Ġuna     |  I-PACIENTE  |
|     5    |    Ġempresa   |  I-PACIENTE  |
|     6    |      Ġde      |  I-PACIENTE  |
|     7    | Ġconstruccion |  I-PACIENTE  |
|     8    |      Ġlos     |      O       |
|     9    |    Ġjueves    |      O       |
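
Since both models tag the same word positions, their predictions can be combined to attach each occupation span to the person it refers to. A minimal sketch over the example above (function and variable names are illustrative; real code would take the word-level tags from the two models' outputs):

```python
def combine_predictions(words, ner_tags, holder_tags):
    """Pair each occupation span (model 1) with the person category
    predicted for the same words (model 2)."""
    mentions = []
    for word, ner, holder in zip(words, ner_tags, holder_tags):
        if ner.startswith("B-"):
            # open a new mention; the holder is taken from the same position
            mentions.append({"type": ner[2:], "text": [word],
                             "holder": holder[2:] if holder != "O" else None})
        elif ner.startswith("I-") and mentions:
            mentions[-1]["text"].append(word)  # extend the open mention
    return [{**m, "text": " ".join(m["text"])} for m in mentions]

words = ["El", "paciente", "trabaja", "en", "una", "empresa", "de", "construccion", "los", "jueves"]
ner_tags = ["O", "O", "B-PROFESION", "I-PROFESION", "I-PROFESION", "I-PROFESION", "I-PROFESION", "I-PROFESION", "O", "O"]
holder_tags = ["O", "O", "B-PACIENTE", "I-PACIENTE", "I-PACIENTE", "I-PACIENTE", "I-PACIENTE", "I-PACIENTE", "O", "O"]
print(combine_predictions(words, ner_tags, holder_tags))
# → [{'type': 'PROFESION', 'text': 'trabaja en una empresa de construccion', 'holder': 'PACIENTE'}]
```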