  example_title: "Example 2"
- text: "Il GABA è un amminoacido ed è il principale neurotrasmettitore inibitorio del [MASK]."
  example_title: "Example 3"
---

🤗 + 📚🩺🇮🇹 = BioBIT

In this repository you can download the **BioBIT** (Biomedical Bert for ITalian) checkpoint. You can find the full paper, with all the details you need, at [this link](https://www.sciencedirect.com/science/article/pii/S1532046423001521).
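
As a quick usage sketch (the model id below is a placeholder; substitute this repository's id on the 🤗 Hub), the checkpoint can be loaded with the `transformers` fill-mask pipeline and queried with one of the widget examples above:

```python
from transformers import pipeline

# Placeholder model id: replace with this repository's id on the Hugging Face Hub.
fill_mask = pipeline("fill-mask", model="<this-repo-id>")

# Predict the masked word in one of the widget examples shown above.
for pred in fill_mask(
    "Il GABA è un amminoacido ed è il principale neurotrasmettitore inibitorio del [MASK]."
):
    print(pred["token_str"], round(pred["score"], 3))
```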

BioBIT was created starting from [Italian XXL BERT](https://huggingface.co/dbmdz/bert-base-italian-xxl-cased), which was trained on a recent Wikipedia dump and various Italian texts from the OPUS and OSCAR corpora collections, for a final corpus size of 81 GB and 13B tokens.

To pretrain BioBIT, we followed the general approach outlined in the [BioBERT paper](https://arxiv.org/abs/1901.08746), built on the foundation of the BERT architecture. The pretraining objective is a combination of **MLM** (Masked Language Modelling) and **NSP** (Next Sentence Prediction). The MLM objective randomly masks 15% of the input tokens, which the model then has to predict; for the NSP objective, the model is instead given a pair of sentences and has to guess whether the second follows the first in the original document.
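
As a minimal illustration of the MLM objective (not the actual pretraining script used for BioBIT), the snippet below uses the 🤗 `transformers` data collator to randomly mask 15% of the tokens of an Italian sentence, with the Italian XXL BERT tokenizer mentioned above:

```python
from transformers import BertTokenizerFast, DataCollatorForLanguageModeling

# Tokenizer of the base model BioBIT starts from.
tokenizer = BertTokenizerFast.from_pretrained("dbmdz/bert-base-italian-xxl-cased")

# MLM collator: each token is selected for prediction with probability 0.15.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

encoding = tokenizer(
    "Il GABA è un amminoacido ed è il principale neurotrasmettitore inibitorio.",
    return_tensors="pt",
)
batch = collator([{k: v[0] for k, v in encoding.items()}])

# The corrupted input contains [MASK] (or random) tokens at the selected positions;
# labels are -100 everywhere except at the positions the model must predict.
print(tokenizer.decode(batch["input_ids"][0]))
print(int((batch["labels"][0] != -100).sum()), "tokens selected for prediction")
```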

Since there is no Italian equivalent of the millions of abstracts and full-text scientific papers used by English BERT-based biomedical models, in this work we leveraged machine translation to obtain an Italian biomedical corpus based on PubMed abstracts and used it to train BioBIT. More details in the paper.

BioBIT has been evaluated on 3 downstream tasks: **NER** (Named Entity Recognition), extractive **QA** (Question Answering), and **RE** (Relation Extraction).
Here are the results, summarized:
- NER:
  - [BC2GM](http://refhub.elsevier.com/S1532-0464(23)00152-1/sb32) = 82.14%
  - [BC4CHEMD](http://refhub.elsevier.com/S1532-0464(23)00152-1/sb35) = 80.70%
  - [BC5CDR(CDR)](http://refhub.elsevier.com/S1532-0464(23)00152-1/sb31) = 82.15%
  - [BC5CDR(DNER)](http://refhub.elsevier.com/S1532-0464(23)00152-1/sb31) = 76.27%
  - [NCBI_DISEASE](http://refhub.elsevier.com/S1532-0464(23)00152-1/sb33) = 65.06%
  - [SPECIES-800](http://refhub.elsevier.com/S1532-0464(23)00152-1/sb34) = 61.86%
- QA:
  - [BioASQ 4b](http://refhub.elsevier.com/S1532-0464(23)00152-1/sb30) = 68.49%
  - [BioASQ 5b](http://refhub.elsevier.com/S1532-0464(23)00152-1/sb30) = 78.33%
  - [BioASQ 6b](http://refhub.elsevier.com/S1532-0464(23)00152-1/sb30) = 75.73%
- RE:
  - [CHEMPROT](http://refhub.elsevier.com/S1532-0464(23)00152-1/sb36) = 38.16%
  - [BioRED](http://refhub.elsevier.com/S1532-0464(23)00152-1/sb37) = 67.15%

More details in the paper.
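
As a hypothetical sketch of how the checkpoint is used for such downstream tasks (this is not the evaluation code from the paper; the model id and label set are placeholders), BioBIT can be loaded with a token-classification head and then fine-tuned on a biomedical NER dataset:

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer

# Placeholder model id and illustrative NER label set.
model_id = "<this-repo-id>"
labels = ["O", "B-CHEMICAL", "I-CHEMICAL", "B-DISEASE", "I-DISEASE"]

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(
    model_id,
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)
# The model can now be fine-tuned with the 🤗 Trainer or a plain PyTorch loop.
```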

Feel free to contact us if you have any inquiries!