---
license: mit
language:
- en
pipeline_tag: feature-extraction
tags:
- protein-language-model
library_name: transformers
---
# ProtT5-XXL-BFD
ProtT5-XXL-BFD is a protein language model (pLM) pretrained on BFD, a dataset of 2.1 billion protein sequences, using a masked language modeling (MLM) objective. It is suitable for creating embeddings (feature extraction). The model was developed by Ahmed Elnaggar et al.; more information can be found in the [GitHub repository](https://github.com/agemagician/ProtTrans) and the [accompanying paper](https://ieeexplore.ieee.org/document/9477085). This repository is a fork of their [Hugging Face repository](https://huggingface.co/Rostlab/prot_t5_xxl_bfd/tree/main).
## Inference example
Below is a minimal example showing how to load **ProtT5-XXL-BFD** directly from the Hugging Face Hub and run inference with the `transformers` library.
```python
from transformers import T5EncoderModel, T5Tokenizer
import re
import torch
# Load the tokenizer and the ProtT5-XXL-BFD encoder
tokenizer = T5Tokenizer.from_pretrained("Rostlab/prot_t5_xxl_bfd", do_lower_case=False)
model = T5EncoderModel.from_pretrained("Rostlab/prot_t5_xxl_bfd")

# Move the model to the GPU if available and switch to inference mode
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
model = model.to(device)
model = model.eval()

# Create or load sequences and map rare amino acids (U, Z, O, B) to X
sequences_Example = ["A E T C Z A O", "S K T Z P"]
sequences_Example = [re.sub(r"[UZOB]", "X", sequence) for sequence in sequences_Example]

# Tokenize and encode the sequences, then move the tensors to the device
ids = tokenizer.batch_encode_plus(sequences_Example, add_special_tokens=True, padding=True)
input_ids = torch.tensor(ids['input_ids']).to(device)
attention_mask = torch.tensor(ids['attention_mask']).to(device)

# Extract per-residue embeddings and move them back to the CPU.
# T5EncoderModel has no decoder, so the encoder's last hidden state
# is the only embedding available.
with torch.no_grad():
    output = model(input_ids=input_ids, attention_mask=attention_mask)
encoder_embedding = output.last_hidden_state.cpu().numpy()
```
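The encoder output above assigns one vector per token. To obtain a single fixed-size vector per protein, a common approach is mean pooling over the valid (non-padding) positions using the attention mask. Below is a minimal NumPy sketch; the array shapes and the `hidden_size` of 16 are illustrative stand-ins, not the model's actual dimensions.

```python
import numpy as np

# Stand-in for `encoder_embedding` from the example above:
# shape (batch_size, sequence_length, hidden_size)
batch_size, seq_len, hidden_size = 2, 8, 16
rng = np.random.default_rng(0)
encoder_embedding = rng.random((batch_size, seq_len, hidden_size))

# Stand-in attention mask: 1 for real tokens, 0 for padding
attention_mask = np.array([[1] * 8, [1] * 5 + [0] * 3])

# Mean-pool over valid positions only
mask = attention_mask[:, :, None]           # (batch, seq_len, 1)
summed = (encoder_embedding * mask).sum(axis=1)
counts = mask.sum(axis=1)                   # valid tokens per sequence
per_protein = summed / counts               # (batch_size, hidden_size)
```

Pooling with the mask (rather than a plain `mean(axis=1)`) keeps padding positions from diluting the representation of shorter sequences.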
## Metadata
### Input
The model takes uppercase amino-acid sequences as input, with residues separated by spaces, e.g. "A E T C Z A O". Lowercase input is not supported.
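A raw sequence string therefore needs light preprocessing before tokenization: uppercasing, inserting spaces between residues, and mapping the rare amino acids (U, Z, O, B) to X as in the inference example. A small sketch, where `preprocess` is a hypothetical helper name:

```python
import re

def preprocess(seq: str) -> str:
    """Hypothetical helper: uppercase the sequence, map rare amino
    acids (U, Z, O, B) to X, and space-separate the residues."""
    seq = seq.upper().replace(" ", "")
    seq = re.sub(r"[UZOB]", "X", seq)
    return " ".join(seq)

print(preprocess("aetczao"))  # "A E T C X A X"
```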
### Output
#### Features extraction
The model returns per-residue embeddings as a float array of shape `(batch_size, sequence_length, hidden_size)`.
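Note that `sequence_length` covers tokenizer positions, including the end-of-sequence token appended after the residues, so the per-residue features for a protein of length L are the first L rows. A sketch with a dummy array (the `hidden_size` of 16 is illustrative, not the model's actual dimension):

```python
import numpy as np

hidden_size = 16  # illustrative stand-in for the model's hidden size
# Dummy encoder output for one sequence of 7 residues plus the
# appended end-of-sequence token: shape (1, 8, hidden_size)
embedding = np.random.default_rng(0).random((1, 8, hidden_size))

seq_len = 7                           # residues in "A E T C X A X"
per_residue = embedding[0, :seq_len]  # (7, hidden_size)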
## Copyright
Code derived from https://github.com/agemagician/ProtTrans is licensed under the MIT License, Copyright (c) 2025 Ahmed Elnaggar. The ProtTrans pretrained models are released under the terms of the [Academic Free License v3.0](https://choosealicense.com/licenses/afl-3.0/), Copyright (c) 2025 Ahmed Elnaggar. All other code is licensed under the MIT License, Copyright (c) 2025 Maksim Pavlov.