yarongef/DistilProtBert

protein language model


yarongef committed on Mar 30, 2022

Commit 36d6389 · 1 Parent(s): 0ac854d

Update README.md

Files changed (1)

README.md +22 -0


@@ -13,3 +13,25 @@ Distilled protein language of [ProtBert](https://huggingface.co/Rostlab/prot_ber
 In addition to cross entropy and cosine teacher-student losses, DistilProtBert was pretrained on a masked language modeling (MLM) objective. It only works with capital-letter amino acids.
 # Model description

+DistilProtBert was pretrained on millions of protein sequences in a self-supervised fashion.
+This means it was pretrained on raw protein sequences only, with no human labelling (which is why it can use large amounts of
+publicly available data); an automatic process generates the inputs and labels from those protein sequences.
+A few important differences between DistilProtBert and the original ProtBert are:
+1. The size of the model
+2. The size of the pretraining dataset
+3. The time and hardware used for pretraining
+## Intended uses & limitations
+The model can be used for protein feature extraction or fine-tuned on downstream tasks.
+### How to use
+The model can be used in the same way as ProtBert.
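Since usage follows ProtBert, a minimal sketch of feature extraction might look like the following. It is not part of the commit. It assumes the `transformers` library with PyTorch, the model id `yarongef/DistilProtBert` (taken from this repository's name), and ProtBert's input convention of space-separated uppercase residues with the rare amino acids U, Z, O and B mapped to X. The helper names `prepare_sequence` and `embed` are hypothetical.

```python
import re


def prepare_sequence(sequence: str) -> str:
    """Format a raw protein sequence for a ProtBert-style tokenizer.

    Assumption: same convention as ProtBert (uppercase residues separated
    by spaces, rare amino acids U, Z, O, B replaced by X).
    """
    cleaned = re.sub(r"\s+", "", sequence).upper()  # drop whitespace, uppercase
    cleaned = re.sub(r"[UZOB]", "X", cleaned)       # map rare residues to X
    return " ".join(cleaned)                        # one token per residue


def embed(sequence: str, model_id: str = "yarongef/DistilProtBert"):
    """Return per-token embeddings for one sequence (hypothetical helper).

    Imports are local so that prepare_sequence works without transformers.
    """
    import torch
    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained(model_id, do_lower_case=False)
    model = BertModel.from_pretrained(model_id)
    model.eval()

    inputs = tokenizer(prepare_sequence(sequence), return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Shape: (1, number_of_tokens, hidden_size); includes [CLS] and [SEP].
    return outputs.last_hidden_state
```

Calling `embed("MKTAYIAKQR")` downloads the weights on first use. For fine-tuning, the same tokenizer can be paired with a task-specific head such as `BertForSequenceClassification`.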
+## Training data
+DistilProtBert was pretrained on [UniRef50](https://www.uniprot.org/downloads), a dataset of ~43 million protein sequences after length filtering (only sequences of 20 to 512 amino acids were used).
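The length filter described above can be sketched as follows. This is an illustration only, not the author's preprocessing code. The function name `filter_by_length` is hypothetical, and treating both bounds as inclusive is an assumption, since the text does not say.

```python
from typing import Iterable, Iterator

MIN_LEN = 20   # shortest sequence kept, per the text above
MAX_LEN = 512  # longest sequence kept, per the text above


def filter_by_length(
    sequences: Iterable[str], min_len: int = MIN_LEN, max_len: int = MAX_LEN
) -> Iterator[str]:
    """Yield only sequences whose length lies within [min_len, max_len].

    Assumption: both bounds are inclusive.
    """
    for seq in sequences:
        if min_len <= len(seq) <= max_len:
            yield seq
```

Because the function is a generator, it can stream over a large FASTA-derived iterator without loading every sequence into memory.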