	---
	license: cc-by-4.0
	language:
	- es
	pipeline_tag: token-classification
	---

	# Model Card for HCSCRheuma/Occupations

	<!-- Provide a quick summary of what the model is/does. -->
	These models aim to recognise occupation mentions (NER) in Spanish clinical notes and to identify to whom each occupation belongs.

	## Model Details

	<style type="text/css">
	.tg {border-collapse:collapse;border-spacing:0;}
	.tg td{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
	overflow:hidden;padding:10px 5px;word-break:normal;}
	.tg th{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
	font-weight:normal;overflow:hidden;padding:10px 5px;word-break:normal;}
	.tg .tg-c3ow{border-color:inherit;text-align:center;vertical-align:top}
	</style>
	<table class="tg">
	<thead>
	<tr>
	<th class="tg-c3ow">PLM Model</th>
	<th class="tg-c3ow">Learning<br>rate</th>
	<th class="tg-c3ow">Batch size</th>
	<th class="tg-c3ow">Epochs</th>
	<th class="tg-c3ow">Max<br>length</th>
	<th class="tg-c3ow">Optimizer</th>
	<th class="tg-c3ow">Max clip<br>grad norm</th>
	<th class="tg-c3ow">Epsilon</th>
	</tr>
	</thead>
	<tbody>
	<tr>
	<td class="tg-c3ow">PlanTL-GOB-ES/<br>roberta-base-biomedical-es<br></td>
	<td class="tg-c3ow">2e-05</td>
	<td class="tg-c3ow">8</td>
	<td class="tg-c3ow">10</td>
	<td class="tg-c3ow">510</td>
	<td class="tg-c3ow">AdamW</td>
	<td class="tg-c3ow">1</td>
	<td class="tg-c3ow">1e-08</td>
	</tr>
	</tbody>
	</table>
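
The hyperparameters in the table map onto a standard PyTorch fine-tuning loop. The following is a minimal sketch, not the author's training script: the linear layer and the random tensors are hypothetical stand-ins for the RoBERTa model and the MEDDOPROF data, and it assumes that "Max clip grad norm" refers to global gradient-norm clipping and "Epsilon" to the AdamW `eps` term.

```python
import torch
from torch import nn

# Values copied from the hyperparameter table above.
LEARNING_RATE = 2e-5
BATCH_SIZE = 8
EPOCHS = 10
MAX_GRAD_NORM = 1.0
EPSILON = 1e-8

torch.manual_seed(0)

# Hypothetical stand-in for the token-classification model: 16 features -> 8 labels.
model = nn.Linear(16, 8)
optimizer = torch.optim.AdamW(model.parameters(), lr=LEARNING_RATE, eps=EPSILON)
loss_fn = nn.CrossEntropyLoss()

# Random toy data in place of MEDDOPROF sentences.
features = torch.randn(32, 16)
labels = torch.randint(0, 8, (32,))


def train(model, features, labels):
    """Run EPOCHS passes of mini-batch AdamW updates with gradient clipping."""
    steps = 0
    for _ in range(EPOCHS):
        for start in range(0, len(features), BATCH_SIZE):
            batch_x = features[start:start + BATCH_SIZE]
            batch_y = labels[start:start + BATCH_SIZE]
            optimizer.zero_grad()
            loss = loss_fn(model(batch_x), batch_y)
            loss.backward()
            # Rescale gradients so their global L2 norm is at most MAX_GRAD_NORM.
            nn.utils.clip_grad_norm_(model.parameters(), MAX_GRAD_NORM)
            optimizer.step()
            steps += 1
    return steps


total_steps = train(model, features, labels)
print(total_steps)  # 32 examples / 8 per batch * 10 epochs = 40 updates
```

In a real run the stand-in model would be replaced by the pretrained checkpoint with a token-classification head, and inputs would be truncated to the maximum length listed in the table.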

	### Model Description

	The PlanTL-GOB-ES/roberta-base-biomedical-es model was fine-tuned on the MEDDOPROF corpus (Salvador Lima-López, Eulàlia Farré-Maduell, Antonio Miranda-Escalada, Vicent Briva-Iglesias, & Martin Krallinger. (2022). MEDDOPROF corpus: complete gold standard annotations for occupation detection in medical documents in Spanish [Data set]. Zenodo. https://doi.org/10.5281/zenodo.7116201).

	Two models were built: one that recognises occupation mentions (MEDDO_FINAL_ROBERTA_ner_sentencia_510_8_10_2e-05_1e-08) and one that detects to whom each occupation belongs (MEDDO_FINAL_ROBERTA_class_sentencia_510_8_10_2e-05_1e-08).

	More details can be found in the MEDDOPROF shared task overview:
	Lima-López, S., Farré-Maduell, E., Miranda-Escalada, A., Briva-Iglesias, V., & Krallinger, M. (2021). NLP applied to occupational health: MEDDOPROF shared task at IberLEF 2021 on automatic recognition, classification and normalization of professions and occupations from medical texts. Procesamiento del Lenguaje Natural, 67, 243-256.

	- Developed by: Alfredo Madrid
	- Language(s) (NLP): Spanish
	- License: CC BY-SA 4.0
	- Finetuned from model: PlanTL-GOB-ES/roberta-base-biomedical-es

	### Model Sources

	<!-- Provide the basic links for the model. -->

	- Repository: https://huggingface.co/HCSCRheuma/Occupations
	- Paper: Madrid García, A. (2023). Recognition of professions in medical documentation.

	## Uses

	Model 1: occupation recognition (NER)

	```
	import numpy as np
	import pandas as pd
	import torch
	from transformers import AutoModelForTokenClassification, AutoTokenizer

	# Load the fine-tuned NER checkpoint and its tokenizer.
	model = AutoModelForTokenClassification.from_pretrained("MEDDO_FINAL_ROBERTA_ner_sentencia_510_8_10_2e-05_1e-08")
	tokenizer = AutoTokenizer.from_pretrained("MEDDO_FINAL_ROBERTA_ner_sentencia_510_8_10_2e-05_1e-08")
	```

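	```
	note = "El paciente trabaja en una empresa de construccion los jueves"

	# Tokenize twice: once for the input IDs, once to keep the token-to-word alignment.
	tokenized_sentence = tokenizer.encode(note, truncation=True)
	tokenized_words_ids = tokenizer(note, truncation=True)
	word_ids = tokenized_words_ids.word_ids  # bound method; word_ids(0) gives the word index of each token
	input_ids = torch.tensor([tokenized_sentence])

	# Run inference without tracking gradients.
	with torch.no_grad():
	    output = model(input_ids)

	# output[0] holds the logits; keep the highest-scoring label ID for each token.
	label_indices = np.argmax(output[0].to('cpu').numpy(), axis=2)
	tokens = tokenizer.convert_ids_to_tokens(input_ids.numpy()[0])
	label_indices
	```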

	```
	# One row per sub-word token: token string, predicted label ID, and index of the word it belongs to.
	df = pd.DataFrame(zip(tokens, label_indices[0], word_ids(0)), columns=["tokens", "labels", "relation"])
	# Strip WordPiece continuation markers, if any (the byte-level BPE tokens below use a Ġ prefix instead).
	df['tokens'] = df['tokens'].str.replace('##', '', regex=False)
	# Map label IDs to their tag names.
	df['labels'] = df['labels'].map({0: 'B-PROFESION', 1: 'B-SITUACION_LABORAL', 2: 'I-SITUACION_LABORAL', 3: 'I-ACTIVIDAD', 4: 'I-PROFESION', 5: 'O', 6: 'B-ACTIVIDAD', 7: 'PAD'})
	# Drop the special tokens at the start and end of the sequence (their word index is None).
	df = df.iloc[1:-1].copy()
	df['relation'] = df['relation'].astype('int')
	# Rebuild each word from its sub-word tokens and keep the label of its first token.
	df['tokens'] = df.groupby('relation')['tokens'].transform(lambda x: ''.join(x))
	df = df.groupby('relation').first()
	df
	```

	Output

	| relation | tokens | labels |
	|:--------:|:-------------:|:-----------:|
	| 0 | ĠEl | O |
	| 1 | Ġpaciente | O |
	| 2 | Ġtrabaja | B-PROFESION |
	| 3 | Ġen | I-PROFESION |
	| 4 | Ġuna | I-PROFESION |
	| 5 | Ġempresa | I-PROFESION |
	| 6 | Ġde | I-PROFESION |
	| 7 | Ġconstruccion | I-PROFESION |
	| 8 | Ġlos | O |
	| 9 | Ġjueves | O |
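
Word-level B-/I- tags like those in the table above are usually collapsed into entity spans before further use. The helper below is a minimal sketch of that step; `bio_to_spans` is a hypothetical name and not part of this repository, and the word and tag lists are copied from the example output.

```python
def bio_to_spans(words, tags):
    """Group word-level BIO tags into (text, entity_type) spans."""
    spans = []
    current_words, current_type = [], None

    def flush():
        # Close the span that is currently open, if any.
        nonlocal current_words, current_type
        if current_words:
            spans.append((" ".join(current_words), current_type))
        current_words, current_type = [], None

    for word, tag in zip(words, tags):
        clean = word.lstrip("Ġ")  # drop the byte-level BPE word-start marker
        prefix, _, entity_type = tag.partition("-")
        if prefix == "B":
            flush()
            current_words, current_type = [clean], entity_type
        elif prefix == "I" and entity_type == current_type:
            current_words.append(clean)
        else:
            # "O", "PAD", or an I- tag with no matching open span ends the span.
            flush()
    flush()
    return spans


words = ["ĠEl", "Ġpaciente", "Ġtrabaja", "Ġen", "Ġuna", "Ġempresa",
         "Ġde", "Ġconstruccion", "Ġlos", "Ġjueves"]
tags = ["O", "O", "B-PROFESION", "I-PROFESION", "I-PROFESION", "I-PROFESION",
        "I-PROFESION", "I-PROFESION", "O", "O"]

spans = bio_to_spans(words, tags)
print(spans)  # [('trabaja en una empresa de construccion', 'PROFESION')]
```

Stray I- tags that do not continue an open span are ignored here; a different policy, such as treating them as span starts, may suit some evaluations better.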


	Model 2: occupation holder classification

	```
	import numpy as np
	import pandas as pd
	import torch
	from transformers import AutoModelForTokenClassification, AutoTokenizer

	# Load the fine-tuned holder-classification checkpoint and its tokenizer.
	model = AutoModelForTokenClassification.from_pretrained("MEDDO_FINAL_ROBERTA_class_sentencia_510_8_10_2e-05_1e-08")
	tokenizer = AutoTokenizer.from_pretrained("MEDDO_FINAL_ROBERTA_class_sentencia_510_8_10_2e-05_1e-08")
	```

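	```
	note = "El paciente trabaja en una empresa de construccion los jueves"

	# Tokenize twice: once for the input IDs, once to keep the token-to-word alignment.
	tokenized_sentence = tokenizer.encode(note, truncation=True)
	tokenized_words_ids = tokenizer(note, truncation=True)
	word_ids = tokenized_words_ids.word_ids  # bound method; word_ids(0) gives the word index of each token
	input_ids = torch.tensor([tokenized_sentence])

	# Run inference without tracking gradients.
	with torch.no_grad():
	    output = model(input_ids)

	# output[0] holds the logits; keep the highest-scoring label ID for each token.
	label_indices = np.argmax(output[0].to('cpu').numpy(), axis=2)
	tokens = tokenizer.convert_ids_to_tokens(input_ids.to('cpu').numpy()[0])
	label_indices
	```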

	```
	# One row per sub-word token: token string, predicted label ID, and index of the word it belongs to.
	df = pd.DataFrame(zip(tokens, label_indices[0], word_ids(0)), columns=["tokens", "labels", "relation"])
	# Strip WordPiece continuation markers, if any (the byte-level BPE tokens below use a Ġ prefix instead).
	df['tokens'] = df['tokens'].str.replace('##', '', regex=False)
	# Map label IDs to the tag naming who holds the occupation.
	df['labels'] = df['labels'].map({0: 'B-FAMILIAR', 1: 'I-PACIENTE', 2: 'I-OTROS', 3: 'B-SANITARIO', 4: 'B-PACIENTE', 5: 'I-FAMILIAR', 6: 'O', 7: 'B-OTROS', 8: 'I-SANITARIO', 9: 'PAD'})
	# Drop the special tokens at the start and end of the sequence (their word index is None).
	df = df.iloc[1:-1].copy()
	df['relation'] = df['relation'].astype('int')
	# Rebuild each word from its sub-word tokens and keep the label of its first token.
	df['tokens'] = df.groupby('relation')['tokens'].transform(lambda x: ''.join(x))
	df = df.groupby('relation').first()
	df
	```

	Output

	| relation | tokens | labels |
	|:--------:|:-------------:|:-----------:|
	| 0 | ĠEl | O |
	| 1 | Ġpaciente | O |
	| 2 | Ġtrabaja | B-PACIENTE |
	| 3 | Ġen | I-PACIENTE |
	| 4 | Ġuna | I-PACIENTE |
	| 5 | Ġempresa | I-PACIENTE |
	| 6 | Ġde | I-PACIENTE |
	| 7 | Ġconstruccion | I-PACIENTE |
	| 8 | Ġlos | O |
	| 9 | Ġjueves | O |
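
Because both models tag the same words, their outputs can be aligned by word position to attach a holder to each occupation mention. The sketch below is one possible way to do this, not a procedure described by the author: `occupation_spans` and `attach_holders` are hypothetical helpers, the holder is chosen by majority vote over the span, and the tag lists are copied from the two example outputs.

```python
from collections import Counter


def occupation_spans(ner_tags):
    """Return (start, end, type) for each B-/I- run, with end exclusive."""
    spans, start = [], None
    # A trailing "O" sentinel closes a span that reaches the last word.
    for i, tag in enumerate(list(ner_tags) + ["O"]):
        if start is not None and not tag.startswith("I-"):
            spans.append((start, i, ner_tags[start][2:]))
            start = None
        if tag.startswith("B-"):
            start = i
    return spans


def attach_holders(words, ner_tags, holder_tags):
    """Pair each occupation span with the most frequent holder class inside it."""
    results = []
    for start, end, occupation_type in occupation_spans(ner_tags):
        votes = Counter(
            tag.split("-", 1)[1]
            for tag in holder_tags[start:end]
            if "-" in tag  # skips "O" and "PAD"
        )
        results.append({
            "text": " ".join(w.lstrip("Ġ") for w in words[start:end]),
            "type": occupation_type,
            "holder": votes.most_common(1)[0][0] if votes else None,
        })
    return results


words = ["ĠEl", "Ġpaciente", "Ġtrabaja", "Ġen", "Ġuna", "Ġempresa",
         "Ġde", "Ġconstruccion", "Ġlos", "Ġjueves"]
ner_tags = ["O", "O", "B-PROFESION"] + ["I-PROFESION"] * 5 + ["O", "O"]
holder_tags = ["O", "O", "B-PACIENTE"] + ["I-PACIENTE"] * 5 + ["O", "O"]

mentions = attach_holders(words, ner_tags, holder_tags)
print(mentions)
# [{'text': 'trabaja en una empresa de construccion', 'type': 'PROFESION', 'holder': 'PACIENTE'}]
```

If the second model assigns no holder tag inside a span, the sketch returns `None` for that mention rather than guessing.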