---
license: cc-by-4.0
language:
- es
pipeline_tag: token-classification
---

# Model Card for Model ID

These models aim to recognise occupation mentions (NER) in Spanish clinical notes and to identify to whom each occupation belongs.

## Model Details

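| PLM model | Learning rate | Batch size | Epochs | Max length | Optimizer | Max clip grad norm | Epsilon |
|:---------:|:-------------:|:----------:|:------:|:----------:|:---------:|:------------------:|:-------:|
| PlanTL-GOB-ES/roberta-base-biomedical-es | 2e-05 | 8 | 10 | 510 | AdamW | 1 | 1e-08 |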
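
The hyperparameters above can be kept in one place as a small config object. The sketch below is only an illustration: `FineTuneConfig` and `to_training_kwargs` are hypothetical names, and the card does not say whether the Hugging Face `Trainer` API was used for fine-tuning. The mapping simply shows which `TrainingArguments` keywords would correspond to the table columns if it were.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Values copied from the table above; the class itself is hypothetical."""
    base_model: str = "PlanTL-GOB-ES/roberta-base-biomedical-es"
    learning_rate: float = 2e-05
    batch_size: int = 8
    epochs: int = 10
    max_length: int = 510        # applied when tokenizing, not by the optimizer
    optimizer: str = "AdamW"
    max_grad_norm: float = 1.0   # "Max clip grad norm" column
    epsilon: float = 1e-08


def to_training_kwargs(cfg: FineTuneConfig) -> dict:
    """Map the config onto keyword names accepted by transformers.TrainingArguments.

    Assumption: a Trainer-based setup. max_length and optimizer are left out
    because they are not set through these keywords.
    """
    return {
        "learning_rate": cfg.learning_rate,
        "per_device_train_batch_size": cfg.batch_size,
        "num_train_epochs": cfg.epochs,
        "max_grad_norm": cfg.max_grad_norm,
        "adam_epsilon": cfg.epsilon,
    }


cfg = FineTuneConfig()
print(to_training_kwargs(cfg))
```

Keeping the values in a frozen dataclass makes it harder to change one of them by accident between the two fine-tuning runs, which share the same settings.
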
### Model Description

The PlanTL-GOB-ES/roberta-base-biomedical-es model was fine-tuned on the MEDDOPROF corpus (Salvador Lima-López, Eulàlia Farré-Maduell, Antonio Miranda-Escalada, Vicent Briva-Iglesias, & Martin Krallinger. (2022). MEDDOPROF corpus: complete gold standard annotations for occupation detection in medical documents in Spanish [Data set]. Zenodo. https://doi.org/10.5281/zenodo.7116201).

Two models were built:

- a model for occupation recognition (MEDDO_FINAL_ROBERTA_ner_sentencia_510_8_10_2e-05_1e-08)
- a model to detect to whom the profession belongs (MEDDO_FINAL_ROBERTA_class_sentencia_510_8_10_2e-05_1e-08)

More details can be found in the MEDDOPROF shared task paper: Lima-López, S., Farré-Maduell, E., Miranda-Escalada, A., Briva-Iglesias, V., & Krallinger, M. (2021). NLP applied to occupational health: MEDDOPROF shared task at IberLEF 2021 on automatic recognition, classification and normalization of professions and occupations from medical texts. Procesamiento del Lenguaje Natural, 67, 243-256.

- **Developed by:** Alfredo Madrid
- **Language(s) (NLP):** Spanish
- **License:** CC BY-SA 4.0
- **Finetuned from model:** PlanTL-GOB-ES/roberta-base-biomedical-es

### Model Sources

- **Repository:** https://huggingface.co/HCSCRheuma/Occupations
- **Paper:** Madrid García, A. (2023). Recognition of professions in medical documentation.

## Uses

**Model 1** (occupation recognition)

```
import torch
import pandas as pd
import numpy as np
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Load the fine-tuned occupation recognition model and its tokenizer
model = AutoModelForTokenClassification.from_pretrained("MEDDO_FINAL_ROBERTA_ner_sentencia_510_8_10_2e-05_1e-08")
tokenizer = AutoTokenizer.from_pretrained("MEDDO_FINAL_ROBERTA_ner_sentencia_510_8_10_2e-05_1e-08")
```

```
note = "El paciente trabaja en una empresa de construccion los jueves"

# Encode the note; word_ids maps each sub-word token back to its source word
tokenized_sentence = tokenizer.encode(note, truncation=True)
tokenized_words_ids = tokenizer(note, truncation=True)
word_ids = tokenized_words_ids.word_ids
input_ids = torch.tensor([tokenized_sentence])

# Run inference and keep the highest-scoring label id for each token
with torch.no_grad():
    output = model(input_ids)
label_indices = np.argmax(output[0].to('cpu').numpy(), axis=2)
tokens = tokenizer.convert_ids_to_tokens(input_ids.numpy()[0])
label_indices
```

```
# One row per token: token text, predicted label id, and index of the source word
df = pd.DataFrame(zip(tokens, label_indices[0], word_ids(0)), columns=["tokens", "labels", "relation"])
df['tokens'] = df['tokens'].str.replace('##', '')
df['labels'] = df['labels'].map({0: 'B-PROFESION', 1: 'B-SITUACION_LABORAL', 2: 'I-SITUACION_LABORAL', 3: 'I-ACTIVIDAD', 4: 'I-PROFESION', 5: 'O', 6: 'B-ACTIVIDAD', 7: 'PAD'})
df = df[1:-1]  # drop the special start and end tokens
df['relation'] = df['relation'].astype('int')
# Join the sub-word tokens of each word and keep the label of its first token
df['tokens'] = df.groupby('relation')['tokens'].transform(lambda x: ''.join(x))
df = df.groupby('relation').first()
df
```

**Output**

| relation | tokens | labels |
|:--------:|:-------------:|:-----------:|
| 0 | ĠEl | O |
| 1 | Ġpaciente | O |
| 2 | Ġtrabaja | B-PROFESION |
| 3 | Ġen | I-PROFESION |
| 4 | Ġuna | I-PROFESION |
| 5 | Ġempresa | I-PROFESION |
| 6 | Ġde | I-PROFESION |
| 7 | Ġconstruccion | I-PROFESION |
| 8 | Ġlos | O |
| 9 | Ġjueves | O |

**Model 2** (to whom the occupation belongs)

```
import torch
import pandas as pd
import numpy as np
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Load the fine-tuned model that classifies to whom the occupation belongs
model = AutoModelForTokenClassification.from_pretrained("MEDDO_FINAL_ROBERTA_class_sentencia_510_8_10_2e-05_1e-08")
tokenizer = AutoTokenizer.from_pretrained("MEDDO_FINAL_ROBERTA_class_sentencia_510_8_10_2e-05_1e-08")
```

```
note = "El paciente trabaja en una empresa de construccion los jueves"

# Encode the note; word_ids maps each sub-word token back to its source word
tokenized_sentence = tokenizer.encode(note, truncation=True)
tokenized_words_ids = tokenizer(note, truncation=True)
word_ids = tokenized_words_ids.word_ids
input_ids = torch.tensor([tokenized_sentence])

# Run inference and keep the highest-scoring label id for each token
with torch.no_grad():
    output = model(input_ids)
label_indices = np.argmax(output[0].to('cpu').numpy(), axis=2)
tokens = tokenizer.convert_ids_to_tokens(input_ids.to('cpu').numpy()[0])
label_indices
```

```
# One row per token: token text, predicted label id, and index of the source word
df = pd.DataFrame(zip(tokens, label_indices[0], word_ids(0)), columns=["tokens", "labels", "relation"])
df['tokens'] = df['tokens'].str.replace('##', '')
df['labels'] = df['labels'].map({0: 'B-FAMILIAR', 1: 'I-PACIENTE', 2: 'I-OTROS', 3: 'B-SANITARIO', 4: 'B-PACIENTE', 5: 'I-FAMILIAR', 6: 'O', 7: 'B-OTROS', 8: 'I-SANITARIO', 9: 'PAD'})
df = df[1:-1]  # drop the special start and end tokens
df['relation'] = df['relation'].astype('int')
# Join the sub-word tokens of each word and keep the label of its first token
df['tokens'] = df.groupby('relation')['tokens'].transform(lambda x: ''.join(x))
df = df.groupby('relation').first()
df
```

**Output**

| relation | tokens | labels |
|:--------:|:-------------:|:-----------:|
| 0 | ĠEl | O |
| 1 | Ġpaciente | O |
| 2 | Ġtrabaja | B-PACIENTE |
| 3 | Ġen | I-PACIENTE |
| 4 | Ġuna | I-PACIENTE |
| 5 | Ġempresa | I-PACIENTE |
| 6 | Ġde | I-PACIENTE |
| 7 | Ġconstruccion | I-PACIENTE |
| 8 | Ġlos | O |
| 9 | Ġjueves | O |
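
The two outputs above are word-level BIO tag sequences over the same note, so they can be combined into "occupation mention plus holder" records. The following is a minimal sketch of one way to do that, not the author's pipeline: `bio_to_spans` and `attach_holder` are hypothetical helpers, the tag lists are typed in by hand from the two output tables, and the `Ġ` marker is dropped from the words for readability.

```python
from typing import List, Optional, Tuple

Span = Tuple[int, int, str]  # (first word index, last word index inclusive, type)


def bio_to_spans(tags: List[str]) -> List[Span]:
    """Group consecutive B-/I- tags of the same type into spans."""
    spans: List[Span] = []
    start: Optional[int] = None
    kind: Optional[str] = None
    for i, tag in enumerate(tags):
        prefix, _, name = tag.partition("-")
        continues = prefix == "I" and name == kind
        # Close the open span on O, PAD, a new B- tag, or a type change
        if start is not None and not continues:
            spans.append((start, i - 1, kind))
            start, kind = None, None
        # Open a span on B-, or leniently on an I- tag that has no open span
        if prefix == "B" or (prefix == "I" and start is None):
            start, kind = i, name
    if start is not None:
        spans.append((start, len(tags) - 1, kind))
    return spans


def attach_holder(ner_spans: List[Span], holder_spans: List[Span]):
    """Pair each occupation span with the first overlapping holder span, if any."""
    paired = []
    for s, e, kind in ner_spans:
        holder = next((h for hs, he, h in holder_spans if hs <= e and s <= he), None)
        paired.append((s, e, kind, holder))
    return paired


words = ["El", "paciente", "trabaja", "en", "una", "empresa", "de",
         "construccion", "los", "jueves"]
ner_tags = ["O", "O", "B-PROFESION"] + ["I-PROFESION"] * 5 + ["O", "O"]
holder_tags = ["O", "O", "B-PACIENTE"] + ["I-PACIENTE"] * 5 + ["O", "O"]

for s, e, kind, holder in attach_holder(bio_to_spans(ner_tags), bio_to_spans(holder_tags)):
    print(" ".join(words[s:e + 1]), kind, holder)
# prints: trabaja en una empresa de construccion PROFESION PACIENTE
```

Matching by overlap rather than by exact boundaries is a design choice made here so that small disagreements between the two models about where a mention starts or ends do not lose the holder label.
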