Update README.md
README.md CHANGED
@@ -15,8 +15,6 @@ In addition to cross entropy and cosine teacher-student losses, DistilProtBert w
 # Model description
 
 DistilProtBert was pretrained on millions of protein sequences.
-This means it was pretrained on the raw protein sequences only, with no humans labelling them in any way (which is why it can use lots of
-publicly available data) with an automatic process to generate inputs and labels from those protein sequences.
 
 A few important differences between the DistilProtBert model and the original ProtBert version are:
 1. The size of the model
@@ -33,5 +31,9 @@ The model can be used the same as ProtBert.
 
 ## Training data
 
-DistilProtBert model was pretrained on [Uniref50](https://www.uniprot.org/downloads), a dataset consisting of ~43 million protein sequences
+DistilProtBert model was pretrained on [Uniref50](https://www.uniprot.org/downloads), a dataset consisting of ~43 million protein sequences (only sequences of length 20 to 512 amino acids were used).
 
+# Pretraining procedure
+
+Preprocessing was done using ProtBert's tokenizer.
+The details of the masking procedure for each sequence followed the original BERT (as mentioned in [ProtBert](https://huggingface.co/Rostlab/prot_bert)).
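The preprocessing step added above relies on ProtBert's tokenizer, which (per the Rostlab/prot_bert model card) expects uppercase, space-separated residues, with the rare amino acids U, Z, O, and B mapped to X. A minimal sketch of that formatting step, written without the transformers dependency so the convention is visible on its own:

```python
import re

def preprocess_sequence(seq: str) -> str:
    """Format a raw protein sequence for ProtBert-style tokenization:
    uppercase it, map rare amino acids (U, Z, O, B) to X, and insert
    spaces so each residue becomes its own token."""
    seq = re.sub(r"[UZOB]", "X", seq.upper())
    return " ".join(seq)

print(preprocess_sequence("mktayUiak"))  # -> M K T A Y X I A K
```

The space-joined string is what gets passed to the tokenizer's `__call__`/`encode` methods.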
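The original BERT masking scheme referenced in the last line selects 15% of token positions as prediction targets; of those, 80% are replaced with `[MASK]`, 10% with a random token, and 10% are left unchanged. An illustrative, self-contained sketch of that procedure applied to an amino-acid sequence (the helper name and alphabet constant are assumptions for this example, not DistilProtBert's actual training code):

```python
import random

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard residues

def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Apply original-BERT masking: pick mask_prob of positions as
    prediction targets; of those, replace 80% with [MASK], 10% with a
    random token, and keep 10% unchanged. Returns (masked, labels),
    where labels is None at positions that are not predicted."""
    rng = rng or random.Random()
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model must predict the original token
            r = rng.random()
            if r < 0.8:
                masked.append("[MASK]")
            elif r < 0.9:
                masked.append(rng.choice(AMINO_ACIDS))
            else:
                masked.append(tok)  # kept as-is, but still predicted
        else:
            labels.append(None)
            masked.append(tok)
    return masked, labels

tokens = list("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
masked, labels = mask_tokens(tokens, rng=random.Random(0))
```

The loss is then computed only at positions where `labels` is not `None`.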