---
license: mit
language:
- en
pipeline_tag: feature-extraction
tags:
- protein-language-model
library_name: transformers
---
|
|
|
|
|
# ProtT5-XXL-BFD |
|
|
|
|
|
ProtT5-XXL-BFD is a protein language model (pLM) pretrained on BFD, a dataset of 2.1 billion protein sequences, using a masked language modeling (MLM) objective. It is suitable for creating embeddings (feature extraction). The model was developed by Ahmed Elnaggar et al.; more information is available in the [GitHub repository](https://github.com/agemagician/ProtTrans) and the [accompanying paper](https://ieeexplore.ieee.org/document/9477085). This repository is a fork of their [Hugging Face repository](https://huggingface.co/Rostlab/prot_t5_xxl_bfd/tree/main).
|
|
|
|
|
## Inference example |
|
|
|
|
|
Below is a minimal example showing how to load **ProtT5-XXL-BFD** directly from the Hugging Face Hub and run inference with the `transformers` library.
|
|
|
|
|
```python
from transformers import T5EncoderModel, T5Tokenizer
import re
import torch

# Load the vocabulary and the ProtT5-XXL-BFD model
tokenizer = T5Tokenizer.from_pretrained("Rostlab/prot_t5_xxl_bfd", do_lower_case=False)
model = T5EncoderModel.from_pretrained("Rostlab/prot_t5_xxl_bfd")

# Move the model to the GPU if available and switch to inference mode
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
model = model.to(device)
model = model.eval()

# Create or load sequences and map rarely occurring amino acids (U, Z, O, B) to X
sequences_Example = ["A E T C Z A O", "S K T Z P"]
sequences_Example = [re.sub(r"[UZOB]", "X", sequence) for sequence in sequences_Example]

# Tokenize and encode the sequences, then move the tensors to the GPU if possible
ids = tokenizer.batch_encode_plus(sequences_Example, add_special_tokens=True, padding=True)
input_ids = torch.tensor(ids['input_ids']).to(device)
attention_mask = torch.tensor(ids['attention_mask']).to(device)

# Extract the sequences' features
with torch.no_grad():
    embedding = model(input_ids=input_ids, attention_mask=attention_mask)

# T5EncoderModel has no decoder; its output exposes the encoder's last hidden state
encoder_embedding = embedding.last_hidden_state.cpu().numpy()
```
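
The encoder output above is per-token. If a single fixed-size vector per protein is needed (e.g. as input to a downstream classifier), one common approach is to mean-pool the per-residue vectors over the valid positions. The following is a minimal sketch that reuses the variables from the example above; it assumes the trailing `</s>` special token should be excluded from the pool.

```python
# Derive fixed-size per-protein embeddings by mean pooling over valid residues.
# Reuses `embedding` and `attention_mask` from the example above (assumption:
# the trailing </s> special token is excluded from the pool).
features = embedding.last_hidden_state  # (batch_size, seq_len, hidden_size)

per_protein_embeddings = []
for i, seq_len in enumerate(attention_mask.sum(dim=1).tolist()):
    residue_vectors = features[i, : seq_len - 1]  # drop padding and </s>
    per_protein_embeddings.append(residue_vectors.mean(dim=0).cpu().numpy())
```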
|
|
|
|
|
## Metadata |
|
|
|
|
|
### Input |
|
|
|
|
|
The model takes uppercase amino acid sequences as input, with residues separated by spaces, e.g. "A E T C Z A O"; lowercase letters are not supported.
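
If sequences are stored in the usual unspaced FASTA style, they must be converted first. Below is a minimal sketch (the helper name `format_sequence` is illustrative, not part of the model's API) that uppercases a sequence, maps the rare amino acids U, Z, O, and B to X, and inserts the spaces the tokenizer expects.

```python
import re

def format_sequence(sequence: str) -> str:
    # Uppercase, map rare amino acids (U, Z, O, B) to X, and space-separate residues
    sequence = re.sub(r"[UZOB]", "X", sequence.upper())
    return " ".join(sequence)

print(format_sequence("aetczao"))  # -> "A E T C X A X"
```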
|
|
|
|
|
### Output |
|
|
|
|
|
#### Feature extraction
|
|
The model returns per-residue embeddings as an array of floats with shape `(batch_size, sequence_length, hidden_size)`.
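
To make the shape concrete, the sketch below continues the inference example above: the two example sequences are padded to a common length of 8 tokens (7 residues plus the `</s>` special token for the longer one).

```python
# Continuing the inference example above
print(encoder_embedding.shape)  # (2, 8, hidden_size) for the two example sequences

# Per-residue features for the first sequence ("A E T C X A X"):
# positions 0..6 are residues, position 7 is the </s> special token
first_protein_features = encoder_embedding[0, :7]
```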
|
|
|
|
|
## Copyright
|
|
|
|
|
Code derived from https://github.com/agemagician/ProtTrans is licensed under the MIT License, Copyright (c) 2025 Ahmed Elnaggar. The ProtTrans pretrained models are released under the terms of the [Academic Free License v3.0](https://choosealicense.com/licenses/afl-3.0/), Copyright (c) 2025 Ahmed Elnaggar. All other code is licensed under the MIT License, Copyright (c) 2025 Maksim Pavlov.