---
license: mit
language:
- en
pipeline_tag: feature-extraction
tags:
- protein-language-model
library_name: transformers
---

# ProtT5-XXL-BFD

ProtT5-XXL is a protein Language Model (pLM) pretrained with a masked language modeling (MLM) objective on BFD, a dataset of 2.1 billion protein sequences. It is suitable for creating embeddings (feature extraction). The model was developed by Ahmed Elnaggar et al.; more information can be found in the [GitHub repository](https://github.com/agemagician/ProtTrans) and the [accompanying paper](https://ieeexplore.ieee.org/document/9477085). This repository is a fork of their [Hugging Face repository](https://huggingface.co/Rostlab/prot_t5_xxl_bfd/tree/main).

## Inference example

Below is a minimal example showing how to load **ProtT5-XXL-BFD** directly from the Hugging Face Hub with the `transformers` library and run inference:

```python
import re

import torch
from transformers import T5EncoderModel, T5Tokenizer

# Load the vocabulary and the ProtT5-XXL-BFD model
tokenizer = T5Tokenizer.from_pretrained("Rostlab/prot_t5_xxl_bfd", do_lower_case=False)
model = T5EncoderModel.from_pretrained("Rostlab/prot_t5_xxl_bfd")

# Move the model to the GPU if available and switch to inference mode
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = model.to(device)
model = model.eval()

# Create or load sequences and map rarely occurring amino acids (U, Z, O, B) to X
sequences_example = ["A E T C Z A O", "S K T Z P"]
sequences_example = [re.sub(r"[UZOB]", "X", sequence) for sequence in sequences_example]

# Tokenize and encode the sequences, then move the tensors to the GPU if possible
ids = tokenizer.batch_encode_plus(sequences_example, add_special_tokens=True, padding=True)
input_ids = torch.tensor(ids["input_ids"]).to(device)
attention_mask = torch.tensor(ids["attention_mask"]).to(device)

# Extract per-residue features and move them back to the CPU if needed
with torch.no_grad():
    embedding = model(input_ids=input_ids, attention_mask=attention_mask)
encoder_embedding = embedding.last_hidden_state.cpu().numpy()
```

Note that `T5EncoderModel` loads only the encoder, so its output contains a single hidden-state tensor; decoder embeddings are not available with this class.

## Metadata

### Input

The model takes as input a protein sequence written as uppercase amino acids separated by spaces, e.g. `"A E T C Z A O"`. Lowercase letters are not supported.

### Output

#### Feature extraction

The model returns per-residue embeddings as an array of floats.

# Copyright

Code derived from https://github.com/agemagician/ProtTrans is licensed under the MIT License, Copyright (c) 2025 Ahmed Elnaggar. The ProtTrans pretrained models are released under the terms of the [Academic Free License v3.0](https://choosealicense.com/licenses/afl-3.0/), Copyright (c) 2025 Ahmed Elnaggar. The other code is licensed under the MIT License, Copyright (c) 2025 Maksim Pavlov.
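
## Pooling per-residue embeddings

The encoder returns one embedding per residue, which is often reduced to a single fixed-size vector per protein for downstream tasks (e.g. classification). Below is a minimal sketch of masked mean pooling, a common choice; `hidden` stands in for the encoder hidden states computed in the inference example above, and the tensor shapes here are illustrative dummy values rather than real model output:

```python
import torch

def mean_pool(hidden: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average residue embeddings over the sequence, ignoring padding positions."""
    mask = attention_mask.unsqueeze(-1).type_as(hidden)  # (batch, seq_len, 1)
    summed = (hidden * mask).sum(dim=1)                  # (batch, dim)
    counts = mask.sum(dim=1).clamp(min=1)                # (batch, 1), avoid div by zero
    return summed / counts

# Illustrative shapes: 2 sequences, 7 positions, 1024-dim embeddings
hidden = torch.randn(2, 7, 1024)
attention_mask = torch.tensor([[1] * 7, [1] * 5 + [0] * 2])  # second sequence is padded
per_protein = mean_pool(hidden, attention_mask)
print(per_protein.shape)  # torch.Size([2, 1024])
```

Because padding positions are masked out before averaging, the pooled vector for a short sequence is unaffected by the padding added during batching.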