---
license: mit
language:
- en
pipeline_tag: feature-extraction
tags:
- protein-language-model
library_name: transformers
---
|
|
|
|
|
# ProtT5-XXL-BFD |
|
|
|
|
|
ProtT5-XXL-BFD is a protein language model (pLM) pretrained on BFD, a dataset of 2.1 billion protein sequences, using a masked language modeling (MLM) objective. It is suitable for creating embeddings (feature extraction). The model was developed by Ahmed Elnaggar et al.; more information is available in the [GitHub repository](https://github.com/agemagician/ProtTrans) and the [accompanying paper](https://ieeexplore.ieee.org/document/9477085). This repository is a fork of their [Hugging Face repository](https://huggingface.co/Rostlab/prot_t5_xxl_bfd/tree/main).
|
|
|
|
|
## Inference example |
|
|
|
|
|
Below is a minimal example showing how to load **ProtT5-XXL-BFD** directly from the Hugging Face Hub and run inference with the `transformers` library.
|
|
|
|
|
```python
from transformers import T5EncoderModel, T5Tokenizer
import re
import torch

# Load the vocabulary and the ProtT5-XXL-BFD model
tokenizer = T5Tokenizer.from_pretrained("Rostlab/prot_t5_xxl_bfd", do_lower_case=False)
model = T5EncoderModel.from_pretrained("Rostlab/prot_t5_xxl_bfd")

# Move the model to the GPU if available and switch to inference mode
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
model = model.to(device)
model = model.eval()

# Create or load sequences and map rarely occurring amino acids (U, Z, O, B) to X
sequences_Example = ["A E T C Z A O", "S K T Z P"]
sequences_Example = [re.sub(r"[UZOB]", "X", sequence) for sequence in sequences_Example]

# Tokenize and encode the sequences, then move the tensors to the GPU if possible
ids = tokenizer.batch_encode_plus(sequences_Example, add_special_tokens=True, padding=True)
input_ids = torch.tensor(ids['input_ids']).to(device)
attention_mask = torch.tensor(ids['attention_mask']).to(device)

# Extract the sequences' features
with torch.no_grad():
    embedding = model(input_ids=input_ids, attention_mask=attention_mask)

# T5EncoderModel has no decoder; its output exposes the encoder's last hidden state
encoder_embedding = embedding.last_hidden_state.cpu().numpy()
```
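
The encoder output above is per-token. If a single fixed-size vector per protein is needed (e.g. as input to a downstream classifier), one common approach is to mean-pool the per-residue vectors over the valid positions. The following is a minimal sketch that reuses the variables from the example above; it assumes the trailing `</s>` special token should be excluded from the pool.

```python
# Derive fixed-size per-protein embeddings by mean pooling over valid residues.
# Reuses `embedding` and `attention_mask` from the example above (assumption:
# the trailing </s> special token is excluded from the pool).
features = embedding.last_hidden_state  # (batch_size, seq_len, hidden_size)

per_protein_embeddings = []
for i, seq_len in enumerate(attention_mask.sum(dim=1).tolist()):
    residue_vectors = features[i, : seq_len - 1]  # drop padding and </s>
    per_protein_embeddings.append(residue_vectors.mean(dim=0).cpu().numpy())
```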
|
|
|
|
|
## Metadata |
|
|
|
|
|
### Input |
|
|
|
|
|
The model takes uppercase amino acid sequences as input, with residues separated by spaces, e.g. "A E T C Z A O"; lowercase letters are not supported.
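
If sequences are stored in the usual unspaced FASTA style, they must be converted first. Below is a minimal sketch (the helper name `format_sequence` is illustrative, not part of the model's API) that uppercases a sequence, maps the rare amino acids U, Z, O, and B to X, and inserts the spaces the tokenizer expects.

```python
import re

def format_sequence(sequence: str) -> str:
    # Uppercase, map rare amino acids (U, Z, O, B) to X, and space-separate residues
    sequence = re.sub(r"[UZOB]", "X", sequence.upper())
    return " ".join(sequence)

print(format_sequence("aetczao"))  # -> "A E T C X A X"
```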
|
|
|
|
|
### Output |
|
|
|
|
|
#### Feature extraction
|
|
The model returns per-residue embeddings as an array of floats with shape `(batch_size, sequence_length, hidden_size)`.
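
To make the shape concrete, the sketch below continues the inference example above: the two example sequences are padded to a common length of 8 tokens (7 residues plus the `</s>` special token for the longer one).

```python
# Continuing the inference example above
print(encoder_embedding.shape)  # (2, 8, hidden_size) for the two example sequences

# Per-residue features for the first sequence ("A E T C X A X"):
# positions 0..6 are residues, position 7 is the </s> special token
first_protein_features = encoder_embedding[0, :7]
```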
|
|
|
|
|
## Copyright
|
|
|
|
|
Code derived from https://github.com/agemagician/ProtTrans is licensed under the MIT License, Copyright (c) 2025 Ahmed Elnaggar. The ProtTrans pretrained models are released under the terms of the [Academic Free License v3.0](https://choosealicense.com/licenses/afl-3.0/), Copyright (c) 2025 Ahmed Elnaggar. All other code is licensed under the MIT License, Copyright (c) 2025 Maksim Pavlov.