Update README.md
README.md CHANGED
@@ -15,8 +15,6 @@ In addition to cross entropy and cosine teacher-student losses, DistilProtBert w
 # Model description
 
 DistilProtBert was pretrained on millions of protein sequences.
-This means it was pretrained on the raw protein sequences only, with no humans labelling them in any way (which is why it can use lots of
-publicly available data) with an automatic process to generate inputs and labels from those protein sequences.
 
 A few important differences between the DistilProtBert model and the original ProtBert version are:
 1. The size of the model
@@ -33,5 +31,9 @@ The model can be used the same as ProtBert.
 
 ## Training data
 
-DistilProtBert model was pretrained on [Uniref50](https://www.uniprot.org/downloads), a dataset consisting of ~43 million protein sequences
+DistilProtBert model was pretrained on [Uniref50](https://www.uniprot.org/downloads), a dataset consisting of ~43 million protein sequences (only sequences of length 20 to 512 amino acids were used).
 
+# Pretraining procedure
+
+Preprocessing was done using ProtBert's tokenizer.
+The details of the masking procedure for each sequence followed the original BERT (as mentioned in [ProtBert](https://huggingface.co/Rostlab/prot_bert)).
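The preprocessing step added above relies on ProtBert's tokenizer, which (per the Rostlab/prot_bert model card) expects uppercase, space-separated residues, with the rare amino acids U, Z, O, and B mapped to X. A minimal sketch of that formatting step, written without the transformers dependency so the convention is visible on its own:

```python
import re

def preprocess_sequence(seq: str) -> str:
    """Format a raw protein sequence for ProtBert-style tokenization:
    uppercase it, map rare amino acids (U, Z, O, B) to X, and insert
    spaces so each residue becomes its own token."""
    seq = re.sub(r"[UZOB]", "X", seq.upper())
    return " ".join(seq)

print(preprocess_sequence("mktayUiak"))  # -> M K T A Y X I A K
```

The space-joined string is what gets passed to the tokenizer's `__call__`/`encode` methods.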
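The original BERT masking scheme referenced in the last line selects 15% of token positions as prediction targets; of those, 80% are replaced with `[MASK]`, 10% with a random token, and 10% are left unchanged. An illustrative, self-contained sketch of that procedure applied to an amino-acid sequence (the helper name and alphabet constant are assumptions for this example, not DistilProtBert's actual training code):

```python
import random

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard residues

def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Apply original-BERT masking: pick mask_prob of positions as
    prediction targets; of those, replace 80% with [MASK], 10% with a
    random token, and keep 10% unchanged. Returns (masked, labels),
    where labels is None at positions that are not predicted."""
    rng = rng or random.Random()
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model must predict the original token
            r = rng.random()
            if r < 0.8:
                masked.append("[MASK]")
            elif r < 0.9:
                masked.append(rng.choice(AMINO_ACIDS))
            else:
                masked.append(tok)  # kept as-is, but still predicted
        else:
            labels.append(None)
            masked.append(tok)
    return masked, labels

tokens = list("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
masked, labels = mask_tokens(tokens, rng=random.Random(0))
```

The loss is then computed only at positions where `labels` is not `None`.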