ProtT5-XXL-BFD

ProtT5-XXL-BFD is a protein language model (pLM) pretrained on BFD, a dataset of 2.1 billion protein sequences, using a masked language modeling (MLM) objective. It is suited for extracting sequence embeddings (feature extraction). The model was developed by Ahmed Elnaggar et al.; more information can be found in the GitHub repository and the accompanying paper. This repository is a fork of their Hugging Face repository.

Inference example

Below is a minimal example showing how to load ProtT5-XXL-BFD directly from the Hugging Face Hub and run inference with the transformers library.

from transformers import T5EncoderModel, T5Tokenizer
import re
import torch 

# Load the vocabulary and the ProtT5-XXL-BFD model
tokenizer = T5Tokenizer.from_pretrained("Rostlab/prot_t5_xxl_bfd", do_lower_case=False)
 
model = T5EncoderModel.from_pretrained("Rostlab/prot_t5_xxl_bfd")

# Move the model to the GPU if available and switch to inference mode
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

model = model.to(device)
model = model.eval()

# Create or load sequences and map rare/ambiguous amino acids (U, Z, O, B) to X
sequences_Example = ["A E T C Z A O","S K T Z P"]

sequences_Example = [re.sub(r"[UZOB]", "X", sequence) for sequence in sequences_Example]

# Tokenize and encode the sequences; the tensors are moved to the GPU below
ids = tokenizer(sequences_Example, add_special_tokens=True, padding="longest")

input_ids = torch.tensor(ids['input_ids']).to(device)
attention_mask = torch.tensor(ids['attention_mask']).to(device)

# Extract the sequences' per-residue features and move them to the CPU if needed
with torch.no_grad():
    outputs = model(input_ids=input_ids, attention_mask=attention_mask)

# T5EncoderModel has no decoder: its output is the encoder's last hidden
# state, one vector per token with shape (batch_size, seq_len, hidden_size)
embedding = outputs.last_hidden_state.cpu().numpy()
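
For many downstream tasks a single fixed-length vector per protein is more convenient than per-residue features. Below is a minimal sketch continuing from the example above, assuming the embedding array and sequences_Example list defined there; mean pooling is one common choice, not something the model prescribes.

# Sketch: mean-pool the per-residue embeddings of each protein into one
# fixed-length vector, excluding padding and the trailing </s> token
per_protein_embeddings = []
for i, seq in enumerate(sequences_Example):
    seq_len = len(seq.replace(" ", ""))          # number of residues
    residue_embeddings = embedding[i, :seq_len]  # (seq_len, hidden_size)
    per_protein_embeddings.append(residue_embeddings.mean(axis=0))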

Metadata

Input

The model expects uppercase single-letter amino acid codes separated by spaces; lowercase input is not supported. Example: "A E T C Z A O".
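
If your sequences come as plain strings (e.g. from a FASTA file), they can be converted to the expected format as in this small sketch. The helper name preprocess_sequence is illustrative, not part of the model's API.

import re

def preprocess_sequence(seq: str) -> str:
    # Illustrative helper: uppercase, map rare amino acids to X,
    # and insert spaces between residues
    seq = seq.upper()
    seq = re.sub(r"[UZOB]", "X", seq)
    return " ".join(seq)

preprocess_sequence("aetczao")  # -> "A E T C X A X"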

Output

Feature extraction

The model returns one embedding per input token as an array of floats with shape (batch_size, sequence_length, hidden_size).
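
As a quick sanity check, the array's shape can be inspected directly. This is a small sketch continuing from the inference example above; the hidden size is read from the model configuration rather than assumed.

# Axes are (batch_size, sequence_length, hidden_size); sequence_length
# counts residues plus the </s> token and any padding
print(embedding.shape)
print(model.config.d_model)  # the encoder's hidden size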

Copyright

Code derived from https://github.com/agemagician/ProtTrans is licensed under the MIT License, Copyright (c) 2025 Ahmed Elnaggar. The ProtTrans pretrained models are released under the terms of the Academic Free License v3.0, Copyright (c) 2025 Ahmed Elnaggar. The other code is licensed under the MIT License, Copyright (c) 2025 Maksim Pavlov.
