|
|
--- |
|
|
tags: |
|
|
- protein-language-model |
|
|
- protein |
|
|
datasets: |
|
|
- bloyal/uniref100 |
|
|
--- |
|
|
|
|
|
# ProtBert model |
|
|
|
|
|
Pretrained model on protein sequences using a masked language modeling (MLM) objective. It was introduced in [this paper](https://doi.org/10.1101/2020.07.12.199554) and first released in [this repository](https://github.com/agemagician/ProtTrans). This repository is a fork of their [HuggingFace repository](https://huggingface.co/Rostlab/prot_bert/tree/main). The model was trained on uppercase amino acids and therefore only accepts capital-letter amino acid sequences.
|
|
|
|
|
|
|
|
## Model description |
|
|
|
|
|
ProtBert is based on the BERT model and was pretrained on a large corpus of protein sequences in a self-supervised fashion. This means it was pretrained on raw protein sequences only, with no human labeling (which is why it can use lots of publicly available data), using an automatic process to generate inputs and labels from those sequences.
|
|
|
|
|
One important difference between this model and the original BERT is the handling of sequences as separate documents: next sentence prediction is not used, since each sequence is treated as a complete document. The masking follows the original BERT training setup, randomly masking 15% of the amino acids in the input.
|
|
|
|
|
Feature extraction experiments revealed that the LM embeddings learned from unlabeled data (protein sequences only) capture important biophysical properties governing protein shape. This implies that the model learned some of the grammar of the language of life as realized in protein sequences.
|
|
|
|
|
## Intended uses & limitations |
|
|
|
|
|
The model can be used for protein feature extraction or fine-tuned on downstream tasks. We have noticed that on some tasks you can gain more accuracy by fine-tuning the model rather than using it as a feature extractor.
|
|
|
|
|
### How to use |
|
|
|
|
|
You can use this model directly with a pipeline for masked language modeling: |
|
|
|
|
|
```python
>>> from transformers import BertForMaskedLM, BertTokenizer, pipeline
>>> tokenizer = BertTokenizer.from_pretrained("virtual-human-chc/prot_bert", do_lower_case=False)
>>> model = BertForMaskedLM.from_pretrained("virtual-human-chc/prot_bert")
>>> unmasker = pipeline('fill-mask', model=model, tokenizer=tokenizer)
>>> unmasker('D L I P T S S K L V V [MASK] D T S L Q V K K A F F A L V T')

[{'score': 0.11088453233242035,
  'sequence': '[CLS] D L I P T S S K L V V L D T S L Q V K K A F F A L V T [SEP]',
  'token': 5,
  'token_str': 'L'},
 {'score': 0.08402521163225174,
  'sequence': '[CLS] D L I P T S S K L V V S D T S L Q V K K A F F A L V T [SEP]',
  'token': 10,
  'token_str': 'S'},
 {'score': 0.07328339666128159,
  'sequence': '[CLS] D L I P T S S K L V V V D T S L Q V K K A F F A L V T [SEP]',
  'token': 8,
  'token_str': 'V'},
 {'score': 0.06921856850385666,
  'sequence': '[CLS] D L I P T S S K L V V K D T S L Q V K K A F F A L V T [SEP]',
  'token': 12,
  'token_str': 'K'},
 {'score': 0.06382402777671814,
  'sequence': '[CLS] D L I P T S S K L V V I D T S L Q V K K A F F A L V T [SEP]',
  'token': 11,
  'token_str': 'I'}]
```
|
|
|
|
|
Here is how to use this model to get the features of a given protein sequence in PyTorch: |
|
|
|
|
|
```python
from transformers import BertModel, BertTokenizer
import re

tokenizer = BertTokenizer.from_pretrained("virtual-human-chc/prot_bert", do_lower_case=False)
model = BertModel.from_pretrained("virtual-human-chc/prot_bert")

sequence_Example = "A E T C Z A O"
# Map the rare amino acids U, Z, O, B to X, as was done during pretraining
sequence_Example = re.sub(r"[UZOB]", "X", sequence_Example)

encoded_input = tokenizer(sequence_Example, return_tensors='pt')
output = model(**encoded_input)
```
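
The `output.last_hidden_state` tensor above contains one embedding per token. If you want a single fixed-size vector per protein, a common choice is to mean-pool the token embeddings over non-padding positions. The sketch below continues from the snippet above; `per_protein_embedding` is a name introduced here for illustration, not part of the library API:

```python
# Mean-pool per-residue embeddings over non-padding positions to get one
# fixed-size vector per protein (a simple, commonly used pooling choice)
mask = encoded_input["attention_mask"].unsqueeze(-1)  # (batch, seq_len, 1)
hidden = output.last_hidden_state                     # (batch, seq_len, hidden)
per_protein_embedding = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
```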
|
|
|
|
|
## Training data |
|
|
|
|
|
The ProtBert model was pretrained on [Uniref100](https://www.uniprot.org/downloads), a dataset consisting of 217 million protein sequences. |
|
|
|
|
|
## Training procedure |
|
|
|
|
|
### Preprocessing |
|
|
|
|
|
The protein sequences are uppercased and tokenized using a single space and a vocabulary size of 21. The rare amino acids "U,Z,O,B" were mapped to "X". |
|
|
The inputs of the model are then of the form: |
|
|
|
|
|
``` |
|
|
[CLS] Protein Sequence A [SEP] Protein Sequence B [SEP] |
|
|
``` |
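
For illustration, this paired-sequence input form can be produced with the tokenizer's text-pair argument. The sequences below are toy examples, not real proteins:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("virtual-human-chc/prot_bert", do_lower_case=False)
# Encoding two sequences together yields [CLS] A [SEP] B [SEP]
encoded = tokenizer("A E T C", "G K L M")
print(tokenizer.decode(encoded["input_ids"]))
# expected: [CLS] A E T C [SEP] G K L M [SEP]
```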
|
|
|
|
|
Furthermore, each protein sequence was treated as a separate document. |
|
|
The preprocessing step was performed twice, once for a combined length (2 sequences) of less than 512 amino acids, and another time using a combined length (2 sequences) of less than 2048 amino acids. |
|
|
|
|
|
The details of the masking procedure for each sequence followed the original BERT model, as follows (a minimal sketch appears after the list):
- 15% of the amino acids are masked.
- In 80% of the cases, the masked amino acids are replaced by `[MASK]`.
- In 10% of the cases, the masked amino acids are replaced by a random amino acid (different from the one they replace).
- In the 10% remaining cases, the masked amino acids are left as is.
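
As a rough illustration, the following sketch applies the 80/10/10 rule to a list of amino acid tokens. `mask_tokens` and `AMINO_ACIDS` are hypothetical names for exposition only; the actual pretraining used the standard BERT data pipeline:

```python
import random

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWYX")  # 20 standard residues plus X

def mask_tokens(tokens, mask_prob=0.15):
    """Illustrative BERT-style masking over a list of amino acid tokens."""
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() >= mask_prob:
            continue
        labels[i] = tok  # the model must predict the original residue
        r = random.random()
        if r < 0.8:      # 80%: replace with [MASK]
            inputs[i] = "[MASK]"
        elif r < 0.9:    # 10%: replace with a random, different residue
            inputs[i] = random.choice([a for a in AMINO_ACIDS if a != tok])
        # remaining 10%: leave the residue unchanged
    return inputs, labels
```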
|
|
|
|
|
### Pretraining |
|
|
|
|
|
The model was trained on a single TPU Pod V3-512 for 400k steps in total: 300k steps with sequence length 512 (batch size 15k) and 100k steps with sequence length 2048 (batch size 2.5k). The optimizer used is LAMB with a learning rate of 0.002, a weight decay of 0.01, learning rate warmup for 40k steps, and linear decay of the learning rate afterwards, as sketched below.
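
For concreteness, a warmup-then-linear-decay schedule of this shape can be written as follows. The helper name and the decay-to-zero endpoint are assumptions; the card only states 40k warmup steps and linear decay afterwards:

```python
def learning_rate(step, peak_lr=0.002, warmup_steps=40_000, total_steps=400_000):
    # Hypothetical sketch of the stated schedule: linear warmup to the peak
    # learning rate over 40k steps, then linear decay. Decaying to zero at
    # step 400k is an assumption; the card does not state the endpoint.
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))
```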
|
|
|
|
|
## Evaluation results |
|
|
|
|
|
When fine-tuned on downstream tasks, this model achieves the following results: |
|
|
|
|
|
Test results:
|
|
|
|
|
| Task/Dataset | Secondary structure (3-state, % accuracy) | Secondary structure (8-state, % accuracy) | Localization (% accuracy) | Membrane (% accuracy) |
|
|
|:-----:|:-----:|:-----:|:-----:|:-----:| |
|
|
| CASP12 | 75 | 63 | | | |
|
|
| TS115 | 83 | 72 | | | |
|
|
| CB513 | 81 | 66 | | | |
|
|
| DeepLoc | | | 79 | 91 | |
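
If you want to fine-tune the model on a downstream task yourself, a minimal starting point is sketched below. It uses `BertForSequenceClassification` as a generic sequence-level head; the label count, toy sequences, and labels are placeholders, not the setup used for the results above:

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("virtual-human-chc/prot_bert", do_lower_case=False)
# num_labels=2 is a placeholder, e.g. for a binary membrane/soluble task
model = BertForSequenceClassification.from_pretrained(
    "virtual-human-chc/prot_bert", num_labels=2
)

sequences = ["A E T C G K L", "M K T A Y I A K"]  # toy space-separated sequences
labels = torch.tensor([0, 1])                     # placeholder labels

batch = tokenizer(sequences, return_tensors="pt", padding=True)
outputs = model(**batch, labels=labels)
outputs.loss.backward()  # plug into your own optimizer / training loop
```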
|
|
|
|
|
# Copyright |
|
|
|
|
|
Code derived from https://github.com/agemagician/ProtTrans is licensed under the MIT License, Copyright (c) 2025 Ahmed Elnaggar. The ProtTrans pretrained models are released under the terms of the [Academic Free License v3.0](https://choosealicense.com/licenses/afl-3.0/), Copyright (c) 2025 Ahmed Elnaggar. All other code is licensed under the MIT License, Copyright (c) 2025 Maksim Pavlov.