Neuroinformatica committed 7e4a272 (verified) · 1 parent: b6ecfe4

Update README.md

Files changed (1): README.md (+32, -1)
--- a/README.md
+++ b/README.md
@@ -12,4 +12,35 @@ widget:
   example_title: "Example 2"
 - text: "Il GABA è un amminoacido ed è il principale neurotrasmettitore inibitorio del [MASK]."
   example_title: "Example 3"
----
+
+---
+
+🤗 + 📚🩺🇮🇹 = BioBIT
+
+In this repository you can download the **BioBIT** (Biomedical BERT for ITalian) checkpoint. You can find the full paper, with all the details, at [this link](https://www.sciencedirect.com/science/article/pii/S1532046423001521).
+
+BioBIT starts from [Italian XXL BERT](https://huggingface.co/dbmdz/bert-base-italian-xxl-cased), which was trained on a recent Wikipedia dump and various Italian texts from the OPUS and OSCAR corpora collections, amounting to a final corpus size of 81 GB and 13B tokens.
+
+To pretrain BioBIT, we followed the general approach outlined in the [BioBERT paper](https://arxiv.org/abs/1901.08746), built on the foundation of the BERT architecture. The pretraining objective is a combination of **MLM** (Masked Language Modelling) and **NSP** (Next Sentence Prediction). The MLM objective randomly masks 15% of the input tokens, which the model then tries to predict; for the NSP objective, the model is given a pair of sentences and has to guess whether the second follows the first in the original document.
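As an illustration only, the MLM masking step described above can be sketched as follows. This is a minimal sketch, not the actual pretraining code: real BERT-style pretraining works on subword tokens and also applies the 80/10/10 mask/random/keep replacement scheme, both omitted here.

```python
import random

def mask_tokens(tokens, mask_token="[MASK]", mask_prob=0.15, seed=1):
    """Replace ~15% of tokens with [MASK]; return the masked sequence
    and a map {position: original token} of prediction targets."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = tok      # the model must recover this token
            masked.append(mask_token)
        else:
            masked.append(tok)
    return masked, targets

sentence = ("Il GABA è un amminoacido ed è il principale "
            "neurotrasmettitore inibitorio del cervello").split()
masked, targets = mask_tokens(sentence)
```

During pretraining, the model sees `masked` as input and is trained to predict the tokens stored in `targets` at the masked positions.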
+
+Due to the unavailability of an Italian equivalent of the millions of abstracts and full-text scientific papers used by English BERT-based biomedical models, in this work we leveraged machine translation to obtain an Italian biomedical corpus based on PubMed abstracts, on which BioBIT was then trained. More details are in the paper.
+
+BioBIT has been evaluated on 3 downstream tasks: **NER** (Named Entity Recognition), extractive **QA** (Question Answering), and **RE** (Relation Extraction).
+Here are the results, summarized:
+- NER:
+  - [BC2GM](http://refhub.elsevier.com/S1532-0464(23)00152-1/sb32) = 82.14%
+  - [BC4CHEMD](http://refhub.elsevier.com/S1532-0464(23)00152-1/sb35) = 80.70%
+  - [BC5CDR(CDR)](http://refhub.elsevier.com/S1532-0464(23)00152-1/sb31) = 82.15%
+  - [BC5CDR(DNER)](http://refhub.elsevier.com/S1532-0464(23)00152-1/sb31) = 76.27%
+  - [NCBI_DISEASE](http://refhub.elsevier.com/S1532-0464(23)00152-1/sb33) = 65.06%
+  - [SPECIES-800](http://refhub.elsevier.com/S1532-0464(23)00152-1/sb34) = 61.86%
+- QA:
+  - [BioASQ 4b](http://refhub.elsevier.com/S1532-0464(23)00152-1/sb30) = 68.49%
+  - [BioASQ 5b](http://refhub.elsevier.com/S1532-0464(23)00152-1/sb30) = 78.33%
+  - [BioASQ 6b](http://refhub.elsevier.com/S1532-0464(23)00152-1/sb30) = 75.73%
+- RE:
+  - [CHEMPROT](http://refhub.elsevier.com/S1532-0464(23)00152-1/sb36) = 38.16%
+  - [BioRED](http://refhub.elsevier.com/S1532-0464(23)00152-1/sb37) = 67.15%
+
+More details can be found in the paper.
+
+Feel free to contact us if you have any inquiries!
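As a usage sketch, the checkpoint can be queried with the 🤗 `transformers` fill-mask pipeline, matching the widget examples above. The model id below is a placeholder assumption; use the id shown in this repository's header. Note that the first call downloads the checkpoint.

```python
from transformers import pipeline

# NOTE: "IVN-RIN/bioBIT" is a placeholder model id; replace it with the
# id shown at the top of this repository page.
fill_mask = pipeline("fill-mask", model="IVN-RIN/bioBIT")

predictions = fill_mask(
    "Il GABA è un amminoacido ed è il principale "
    "neurotrasmettitore inibitorio del [MASK]."
)
for p in predictions:
    print(p["token_str"], round(p["score"], 3))
```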