Asier Gutiérrez Fandiño commited on
Commit 8630abf
Parent(s): 2e42ceb
Initial commit
- README.md +165 -0
- args.json +17 -0
- config.json +25 -0
- dict.txt +0 -0
- merges.txt +0 -0
- process.log +1 -0
- pytorch_model.bin +3 -0
- special_tokens_map.json +1 -0
- tokenizer_config.json +1 -0
- vocab.json +0 -0
README.md
ADDED
---
language:
- es
tags:
- biomedical
- spanish
license: apache-2.0
metrics:
- ppl
widget:
- text: "El único antecedente personal a reseñar era la <mask> arterial."
- text: "Las radiologías óseas de cuerpo entero no detectan alteraciones <mask>, ni alteraciones vertebrales."
- text: "En el <mask> toraco-abdómino-pélvico no se encontraron hallazgos patológicos de interés."
---

# Biomedical language model for Spanish

Biomedical pretrained language model for Spanish. For more details about the corpus, the pretraining and the evaluation, check the official [repository](https://github.com/PlanTL-SANIDAD/lm-biomedical-clinical-es) and read our [preprint](https://arxiv.org/abs/2109.03570): "_Carrino, C. P., Armengol-Estapé, J., Gutiérrez-Fandiño, A., Llop-Palao, J., Pàmies, M., Gonzalez-Agirre, A., & Villegas, M. (2021). Biomedical and Clinical Language Models for Spanish: On the Benefits of Domain-Specific Pretraining in a Mid-Resource Scenario._"

## Tokenization and model pretraining

This model is a [RoBERTa-based](https://github.com/pytorch/fairseq/tree/master/examples/roberta) model trained on a **biomedical** corpus in Spanish collected from several sources (see the next section). The training corpus was tokenized with the byte-level [Byte-Pair Encoding (BPE)](https://github.com/openai/gpt-2) used in the original [RoBERTa](https://github.com/pytorch/fairseq/tree/master/examples/roberta) model, with a vocabulary size of 52,000 tokens. Pretraining consists of masked language modelling at the subword level, following the approach of the RoBERTa base model with the same hyperparameters as in the original work. Training took 48 hours on 16 NVIDIA V100 GPUs (16GB), using the Adam optimizer with a peak learning rate of 0.0005 and an effective batch size of 2,048 sentences.
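For intuition, the greedy merge procedure behind BPE can be sketched at the character level. This is a toy illustration only: the actual tokenizer is byte-level with a 52,000-token vocabulary, and the corpus and merge count below are invented.

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Greedily learn BPE merges: repeatedly fuse the most frequent
    adjacent symbol pair. Toy character-level version of byte-level BPE."""
    # Each word starts as a sequence of single-character symbols.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Re-segment every word, fusing occurrences of the chosen pair.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges

# Invented toy corpus: the shared suffix "tensión" dominates the pair counts.
corpus = ["hipertensión"] * 5 + ["hipotensión"] * 3 + ["tensión"] * 2
merges = learn_bpe_merges(corpus, num_merges=4)
```

Frequent fragments such as "tensión" are merged first, which is why domain-specific pretraining data yields subwords well suited to biomedical text.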

## Training corpora and preprocessing

The training corpus is composed of several biomedical corpora in Spanish, collected from publicly available corpora and crawlers. To obtain a high-quality training corpus, a cleaning pipeline with the following operations was applied:

- data parsing from different formats
- sentence splitting
- language detection
- filtering of ill-formed sentences
- deduplication of repetitive content
- preservation of the original document boundaries
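Two of these steps can be sketched as follows; the length and alphabetic-ratio thresholds and the hash-based deduplication are illustrative assumptions, not the project's actual rules.

```python
import hashlib

def clean_sentences(sentences, min_chars=10, min_alpha_ratio=0.5):
    """Toy sketch of ill-formed-sentence filtering and exact deduplication.
    Thresholds are assumed for illustration."""
    seen = set()
    kept = []
    for s in sentences:
        s = s.strip()
        if len(s) < min_chars:
            continue  # too short to be a well-formed sentence
        alpha = sum(c.isalpha() for c in s) / len(s)
        if alpha < min_alpha_ratio:
            continue  # mostly digits/punctuation -> ill-formed
        h = hashlib.sha1(s.lower().encode("utf-8")).hexdigest()
        if h in seen:
            continue  # exact (case-insensitive) duplicate
        seen.add(h)
        kept.append(s)
    return kept

docs = [
    "El paciente presenta hipertensión arterial.",
    "El paciente presenta hipertensión arterial.",  # duplicate
    "123 456 789",                                  # ill-formed
    "ok",                                           # too short
]
cleaned = clean_sentences(docs)
```

Hashing whole sentences keeps memory bounded when deduplicating corpora of hundreds of millions of tokens.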

Finally, the corpora were concatenated and a further global deduplication across corpora was applied. The result is a medium-size biomedical corpus for Spanish of about 963M tokens. The table below shows basic statistics of the individual cleaned corpora:

| Name | No. tokens | Description |
|------|------------|-------------|
| [Medical crawler](https://zenodo.org/record/4561970) | 745,705,946 | Crawler of more than 3,000 URLs belonging to Spanish biomedical and health domains. |
| Clinical cases misc. | 102,855,267 | A miscellany of medical content, essentially clinical cases. Note that a clinical case report is a scientific publication in which medical practitioners share patient cases; it differs from a clinical note or document. |
| [SciELO](https://github.com/PlanTL-SANIDAD/SciELO-Spain-Crawler) | 60,007,289 | Publications written in Spanish, crawled from the Spanish SciELO server in 2017. |
| [BARR2_background](https://temu.bsc.es/BARR2/downloads/background_set.raw_text.tar.bz2) | 24,516,442 | Biomedical Abbreviation Recognition and Resolution (BARR2) background set, containing Spanish clinical case study sections from a variety of clinical disciplines. |
| Wikipedia_life_sciences | 13,890,501 | Wikipedia articles crawled on 04/01/2021 with the [Wikipedia API python library](https://pypi.org/project/Wikipedia-API/), starting from the "Ciencias\_de\_la\_vida" category, up to a maximum of 5 subcategories. Multiple links to the same article were discarded to avoid repeated content. |
| Patents | 13,463,387 | Google Patents in the medical domain for Spain (in Spanish). The accepted medical-domain codes in the patents' JSON files are: "A61B", "A61C", "A61F", "A61H", "A61K", "A61L", "A61M", "A61B", "A61P". |
| [EMEA](http://opus.nlpl.eu/download.php?f=EMEA/v3/moses/en-es.txt.zip) | 5,377,448 | Spanish-side documents extracted from parallel corpora built from PDF documents of the European Medicines Agency. |
| [mespen_Medline](https://zenodo.org/record/3562536#.YTt1fH2xXbR) | 4,166,077 | Spanish-side articles extracted from a collection of Spanish-English parallel corpora of biomedical scientific literature, aggregated from the MedlinePlus source. |
| PubMed | 1,858,966 | Open-access articles from the PubMed repository, crawled in 2017. |
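The per-corpus counts are consistent with the reported total: they sum to roughly 972M tokens, which the global deduplication across corpora then reduces to the ~963M tokens of the final training corpus.

```python
# Token counts from the table above, in the same order.
counts = [745_705_946, 102_855_267, 60_007_289, 24_516_442,
          13_890_501, 13_463_387, 5_377_448, 4_166_077, 1_858_966]
total = sum(counts)  # ~972M before global deduplication
```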

## Evaluation and results

The model has been evaluated on Named Entity Recognition (NER) using the following datasets:

- [PharmaCoNER](https://zenodo.org/record/4270158): a track on chemical and drug mention recognition from Spanish medical texts (for more info see: https://temu.bsc.es/pharmaconer/).
- [CANTEMIST](https://zenodo.org/record/3978041#.YTt5qH2xXbQ): a shared task specifically focusing on named entity recognition of tumor morphology in Spanish (for more info see: https://zenodo.org/record/3978041#.YTt5qH2xXbQ).
- ICTUSnet: 1,006 hospital discharge reports of patients admitted for stroke at 18 different Spanish hospitals, with more than 79,000 annotations of 51 different kinds of variables.

The evaluation results are compared against the [mBERT](https://huggingface.co/bert-base-multilingual-cased) and [BETO](https://huggingface.co/dccuchile/bert-base-spanish-wwm-cased) models:

| F1 - Precision - Recall | roberta-base-biomedical-es | mBERT | BETO |
|-------------------------|----------------------------|-------|------|
| PharmaCoNER | **89.48** - **87.85** - **91.18** | 87.46 - 86.50 - 88.46 | 88.18 - 87.12 - 89.28 |
| CANTEMIST | **83.87** - **81.70** - **86.17** | 82.61 - 81.12 - 84.15 | 82.42 - 80.91 - 84.00 |
| ICTUSnet | **88.12** - **85.56** - **90.83** | 86.75 - 83.53 - 90.23 | 85.95 - 83.10 - 89.02 |
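The reported scores are internally consistent: each F1 value follows from its precision and recall via F1 = 2PR/(P+R), up to rounding.

```python
# Check the roberta-base-biomedical-es column of the table above.
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

pharmaconer = f1(87.85, 91.18)  # reported F1: 89.48
cantemist = f1(81.70, 86.17)    # reported F1: 83.87
ictusnet = f1(85.56, 90.83)     # reported F1: 88.12
```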

## Intended uses & limitations

The model is ready to use only for masked language modelling, i.e. the Fill Mask task (try the inference API or read the next section). However, it is intended to be fine-tuned on downstream tasks such as Named Entity Recognition or Text Classification.

## Cite

If you use our models, please cite our latest preprint:

```bibtex
@misc{carrino2021biomedical,
  title={Biomedical and Clinical Language Models for Spanish: On the Benefits of Domain-Specific Pretraining in a Mid-Resource Scenario},
  author={Casimiro Pio Carrino and Jordi Armengol-Estapé and Asier Gutiérrez-Fandiño and Joan Llop-Palao and Marc Pàmies and Aitor Gonzalez-Agirre and Marta Villegas},
  year={2021},
  eprint={2109.03570},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```

If you use our Medical Crawler corpus, please cite the preprint:

```bibtex
@misc{carrino2021spanish,
  title={Spanish Biomedical Crawled Corpus: A Large, Diverse Dataset for Spanish Biomedical Language Models},
  author={Casimiro Pio Carrino and Jordi Armengol-Estapé and Ona de Gibert Bonet and Asier Gutiérrez-Fandiño and Aitor Gonzalez-Agirre and Martin Krallinger and Marta Villegas},
  year={2021},
  eprint={2109.07765},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```

---

## How to use

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

tokenizer = AutoTokenizer.from_pretrained("BSC-TeMU/roberta-base-biomedical-es")
model = AutoModelForMaskedLM.from_pretrained("BSC-TeMU/roberta-base-biomedical-es")

unmasker = pipeline('fill-mask', model="BSC-TeMU/roberta-base-biomedical-es")
unmasker("El único antecedente personal a reseñar era la <mask> arterial.")
```
```
# Output
[
  {
    "sequence": " El único antecedente personal a reseñar era la hipertensión arterial.",
    "score": 0.9855039715766907,
    "token": 3529,
    "token_str": " hipertensión"
  },
  {
    "sequence": " El único antecedente personal a reseñar era la diabetes arterial.",
    "score": 0.0039140828885138035,
    "token": 1945,
    "token_str": " diabetes"
  },
  {
    "sequence": " El único antecedente personal a reseñar era la hipotensión arterial.",
    "score": 0.002484665485098958,
    "token": 11483,
    "token_str": " hipotensión"
  },
  {
    "sequence": " El único antecedente personal a reseñar era la Hipertensión arterial.",
    "score": 0.0023484621196985245,
    "token": 12238,
    "token_str": " Hipertensión"
  },
  {
    "sequence": " El único antecedente personal a reseñar era la presión arterial.",
    "score": 0.0008009297889657319,
    "token": 2267,
    "token_str": " presión"
  }
]
```
args.json
ADDED
{
  "output_root": "/gpfs/projects/bsc88/corpus-utils-lm/23-12-2020-72f8c7e/output/model-ready_output/2020-12-23-1900-daf4-ab38",
  "files": "/gpfs/projects/bsc88/corpus-utils-lm/23-12-2020-72f8c7e/output/model-ready_output/2020-12-23-1900-daf4-ab38/train_valid_test_split_output/2020-12-23-1905-daf4-a0e0/train.txt",
  "vocab_name": "roberta-ca",
  "clean_text": true,
  "handle_chinese_chars": true,
  "strip_accents": false,
  "lowercase": false,
  "vocab_size": 52000,
  "limit_alphabet": 1000,
  "show_progress": true,
  "min_frequency": 2,
  "extra_tokens": [],
  "reserve_tokens": 0,
  "tokenizer": "bbpe",
  "commit_hash": "daf4d660ec8a4b28d2bc29b3063779100ab85796\n"
}
config.json
ADDED
{
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "transformers_version": "4.4.0",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 52000
}
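These hyperparameters imply a parameter count consistent with the checkpoint size (pytorch_model.bin is 504,420,627 bytes, i.e. about 126M float32 parameters). A rough count, assuming the standard RoBERTa-base layout with tied input/output embeddings (the exact head bookkeeping is an approximation):

```python
# Hyperparameters from config.json above.
V, P, H, I, L = 52_000, 514, 768, 3072, 12  # vocab, positions, hidden, FFN, layers

embeddings = V * H + P * H + 1 * H + 2 * H      # word + position + token-type + LayerNorm
per_layer = (
    4 * (H * H + H)                  # Q, K, V and output projections
    + (H * I + I) + (I * H + H)      # feed-forward up/down projections
    + 2 * 2 * H                      # two LayerNorms
)
lm_head = (H * H + H) + 2 * H + V    # dense + LayerNorm + decoder bias (weights tied)
total = embeddings + L * per_layer + lm_head  # ~126M parameters
```

At 4 bytes per float32 parameter this lands within a fraction of a percent of the checkpoint's on-disk size.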
dict.txt
ADDED
The diff for this file is too large to render.
merges.txt
ADDED
The diff for this file is too large to render.
process.log
ADDED
INFO:root:Function "train_tokenizer" took 306.3926444167737 seconds to complete.
pytorch_model.bin
ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:d720b1dddaef37080df8761bea199e3a307cd86cdd261fe3430a674579118f21
size 504420627
special_tokens_map.json
ADDED
{"bos_token": {"content": "<s>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true}, "eos_token": {"content": "</s>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true}, "unk_token": {"content": "<unk>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true}, "sep_token": {"content": "</s>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true}, "pad_token": {"content": "<pad>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true}, "cls_token": {"content": "<s>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true}, "mask_token": {"content": "<mask>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true}}
tokenizer_config.json
ADDED
{"unk_token": {"content": "<unk>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true, "__type": "AddedToken"}, "bos_token": {"content": "<s>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true, "__type": "AddedToken"}, "eos_token": {"content": "</s>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true, "__type": "AddedToken"}, "add_prefix_space": true, "errors": "replace", "sep_token": {"content": "</s>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true, "__type": "AddedToken"}, "cls_token": {"content": "<s>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true, "__type": "AddedToken"}, "pad_token": {"content": "<pad>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true, "__type": "AddedToken"}, "mask_token": {"content": "<mask>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true, "__type": "AddedToken"}, "max_len": 512, "special_tokens_map_file": null, "name_or_path": "/gpfs/projects/bsc88/corpus-utils-lm/23-12-2020-72f8c7e/output/model-ready_output/2020-12-23-1900-daf4-ab38/train_tokenizer_output/2020-12-23-1913-daf4-ed9c"}
vocab.json
ADDED
The diff for this file is too large to render.