---
language: en
tags:
- protein language model
datasets:
- BFD
license: afl-3.0
library_name: transformers
---

# ProtT5-XL-BFD model

Pretrained model on protein sequences using a masked language modeling (MLM) objective. It was introduced in
[this paper](https://doi.org/10.1101/2020.07.12.199554) by Ahmed Elnaggar et al. and first released in
[this repository](https://github.com/agemagician/ProtTrans). This repository is a fork of their [HuggingFace repository](https://huggingface.co/Rostlab/prot_t5_xxl_bfd/tree/main). The model was trained on uppercase amino acids, so it only works with capital-letter amino acid sequences.

## Model description

ProtT5-XL-BFD is based on the `t5-3b` model and was pretrained on a large corpus of protein sequences in a self-supervised fashion.
This means it was pretrained on the raw protein sequences only, with no humans labelling them in any way (which is why it can use lots of
publicly available data), with an automatic process to generate inputs and labels from those protein sequences.

One important difference between this T5 model and the original T5 version is the denoising objective.
The original T5-3B model was pretrained using a span denoising objective, while this model was pretrained with a BART-like MLM denoising objective.
The masking probability is consistent with the original T5 training: 15% of the amino acids in the input are randomly masked.

It has been shown that the features extracted from this self-supervised model (LM embeddings) capture important biophysical properties governing protein shape.
This implies that the model has learned some of the grammar of the language of life realized in protein sequences.

## Intended uses & limitations

The model can be used for protein feature extraction or fine-tuned on downstream tasks.
We have noticed that for some tasks one can gain more accuracy by fine-tuning the model rather than using it as a feature extractor.
We have also noticed that for feature extraction it is better to use the features extracted from the encoder rather than from the decoder.
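
If you only need encoder features, one option (not part of the original model card) is to load just the encoder half of the checkpoint with the `T5EncoderModel` class from `transformers`, which avoids keeping the decoder weights in memory. A minimal sketch, assuming the same tokenization as in the example below:

```python
from transformers import T5Tokenizer, T5EncoderModel
import torch

tokenizer = T5Tokenizer.from_pretrained("Rostlab/prot_t5_xl_bfd", do_lower_case=False)
encoder = T5EncoderModel.from_pretrained("Rostlab/prot_t5_xl_bfd")

batch = tokenizer(["A E T C X A X"], padding=True, return_tensors="pt")
with torch.no_grad():
    # One hidden-state vector per tokenized residue, from the encoder only
    features = encoder(**batch).last_hidden_state
```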

### How to use

Here is how to use this model to extract the features of a given protein sequence in PyTorch:

```python
from transformers import T5Tokenizer, T5Model
import re
import torch

tokenizer = T5Tokenizer.from_pretrained('Rostlab/prot_t5_xl_bfd', do_lower_case=False)

model = T5Model.from_pretrained("Rostlab/prot_t5_xl_bfd")

sequences_Example = ["A E T C Z A O", "S K T Z P"]

# Map the rare amino acids U, Z, O, B to X, as was done during training
sequences_Example = [re.sub(r"[UZOB]", "X", sequence) for sequence in sequences_Example]

ids = tokenizer.batch_encode_plus(sequences_Example, add_special_tokens=True, padding=True)

input_ids = torch.tensor(ids['input_ids'])
attention_mask = torch.tensor(ids['attention_mask'])

with torch.no_grad():
    # T5Model requires decoder inputs as well; feeding the encoded sequence back in
    # lets us read out both encoder and decoder hidden states in a single pass.
    embedding = model(input_ids=input_ids, attention_mask=attention_mask, decoder_input_ids=input_ids)

# For feature extraction we recommend using the encoder embedding
encoder_embedding = embedding.encoder_last_hidden_state.cpu().numpy()
decoder_embedding = embedding.last_hidden_state.cpu().numpy()
```
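
The encoder output above contains one vector per tokenized residue. If you need a single fixed-size vector per protein (for example, as input to a downstream classifier), a common approach, not prescribed by the original model card, is to average the per-residue encoder states over the non-padded positions. A minimal sketch reusing the `encoder_embedding` and `attention_mask` variables from the example above:

```python
# Zero out padding positions, then mean-pool the encoder states per sequence
mask = attention_mask.cpu().numpy()[:, :, None]      # (batch, seq_len, 1)
summed = (encoder_embedding * mask).sum(axis=1)      # (batch, hidden_size)
per_protein_embedding = summed / mask.sum(axis=1)    # average over real tokens only
```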

## Training data

The ProtT5-XL-BFD model was pretrained on [BFD](https://bfd.mmseqs.com/), a dataset consisting of 2.1 billion protein sequences.

## Training procedure

### Preprocessing

The protein sequences are uppercased and tokenized with a single space between amino acids, using a vocabulary size of 21. The rare amino acids "U, Z, O, B" were mapped to "X".
The inputs of the model are then of the form:

```
Protein Sequence [EOS]
```

The preprocessing step was performed on the fly, by cutting and padding the protein sequences up to 512 tokens.
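
As an illustration only (this helper is not part of the original training code), the sketch below uppercases a raw sequence, maps the rare amino acids to `X`, inserts the single spaces the tokenizer expects, and truncates long sequences:

```python
import re

def preprocess(sequence: str, max_residues: int = 512) -> str:
    """Uppercase, map U/Z/O/B to X, space-separate, and truncate."""
    sequence = re.sub(r"[UZOB]", "X", sequence.upper())
    return " ".join(sequence[:max_residues])

print(preprocess("aetczao"))  # -> "A E T C X A X"
```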

The details of the masking procedure for each sequence are as follows (a sketch follows the list):
- 15% of the amino acids are masked.
- In 90% of the cases, the masked amino acids are replaced by the `[MASK]` token.
- In 10% of the cases, the masked amino acids are replaced by a random amino acid different from the one they replace.
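
The sketch below illustrates this corruption scheme. It is illustrative only and is not the original pretraining pipeline; the `MASK_TOKEN` name and the plain Python sampling are assumptions made for readability:

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWXY"  # 21-letter vocabulary used by the model
MASK_TOKEN = "[MASK]"                   # placeholder name for the mask token

def corrupt(residues):
    """Mask 15% of amino acids: of those, 90% -> mask token, 10% -> a different amino acid."""
    corrupted = []
    for aa in residues:
        if random.random() < 0.15:          # 15% of positions are selected for masking
            if random.random() < 0.9:       # 90% of selected positions become the mask token
                corrupted.append(MASK_TOKEN)
            else:                           # 10% become a different random amino acid
                corrupted.append(random.choice([a for a in AMINO_ACIDS if a != aa]))
        else:
            corrupted.append(aa)
    return corrupted

print(corrupt(list("AETCXAX")))
```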

### Pretraining

The model was trained on a single TPU Pod V3-1024 for 1.2 million steps in total, using a sequence length of 512 and a batch size of 4k.
It has a total of approximately 3B parameters and was trained using the encoder-decoder architecture.
The optimizer used is AdaFactor with an inverse square root learning rate schedule for pretraining.
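
For reference, if you want to mirror this optimizer setup when fine-tuning with `transformers`, the library's `Adafactor` implementation applies an inverse square root decay when run in relative-step mode. This is a rough sketch under that assumption, not the original pretraining code:

```python
from transformers import Adafactor, T5Model

model = T5Model.from_pretrained("Rostlab/prot_t5_xl_bfd")

# With relative_step=True and lr=None, Adafactor uses a time-dependent learning
# rate proportional to 1/sqrt(step), i.e. an inverse square root schedule.
optimizer = Adafactor(
    model.parameters(),
    lr=None,
    scale_parameter=True,
    relative_step=True,
    warmup_init=True,
)
```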

## Evaluation results

When used for feature extraction, this model achieves the following results:

Test results (values are accuracy in %):

| Task/Dataset | secondary structure (3-states) | secondary structure (8-states) | Localization | Membrane |
|:-----:|:-----:|:-----:|:-----:|:-----:|
| CASP12 | 77 | 66 | | |
| TS115 | 85 | 74 | | |
| CB513 | 84 | 71 | | |
| DeepLoc | | | 77 | 91 |

# Copyright

Code derived from https://github.com/agemagician/ProtTrans is licensed under the MIT License, Copyright (c) 2025 Ahmed Elnaggar. The ProtTrans pretrained models are released under the terms of the [Academic Free License v3.0](https://choosealicense.com/licenses/afl-3.0/), Copyright (c) 2025 Ahmed Elnaggar. The other code is licensed under the MIT License, Copyright (c) 2025 Maksim Pavlov.