---
library_name: transformers
tags:
- biology
- protein-language-model
- protein-generation
- msa
- multiple-sequence-alignment
- few-shot-prompting
- homolog-conditioned-generation
- causal-lm
- mixture-of-experts
- transformers
---
# Model Card for ProtGPT3-MSA
## Model Description
ProtGPT3-MSA is a multiple-sequence, homolog-conditioned autoregressive protein language model. It is part of the [ProtGPT3 family](https://huggingface.co/collections/AI4PD/protgpt3-family), an open-source suite of promptable and aligned protein language models for protein sequence generation.
Unlike the single-sequence ProtGPT3 models, ProtGPT3-MSA can be prompted with sets of homologous protein sequences, enabling few-shot, family-conditioned protein generation without task-specific fine-tuning. At inference, users can provide homologous protein sequences as context and generate additional family-consistent sequences.
ProtGPT3-MSA was trained to autoregressively predict sets of 16 concatenated protein sequences, separated by a special token `<s>` (i.e., marking the protein boundaries). Therefore, at inference, the model should be prompted with at most 15 concatenated protein sequences.
- For more details on how to use ProtGPT3-MSA, check out our [Colab notebook](https://colab.research.google.com/drive/1HZFLUkRIhjUJdbQyvJC8ftio_ZHNL7kI?usp=sharing#scrollTo=zwWWcxwkPm6c).
- To quickly generate new sequences by prompting the model with a FASTA file of homologous sequences, check out the [ProtGPT3-MSA API](https://huggingface.co/spaces/AI4PD/ProtGPT3-MSA).
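Since the model accepts at most 15 homologs as context, a FASTA file with more entries has to be subsampled before prompting. The sketch below shows one way to do this in plain Python. The helper names `read_fasta` and `sample_context` are hypothetical and not part of the ProtGPT3 code, and uniform random subsampling is an assumption rather than the authors' recommended selection strategy.

```python
import random

def read_fasta(text):
    """Parse FASTA-formatted text into a list of sequences (headers are discarded)."""
    sequences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # a new header closes the previous record
            if current:
                sequences.append("".join(current))
                current = []
        else:
            current.append(line)
    if current:
        sequences.append("".join(current))
    return sequences

def sample_context(sequences, k=15, seed=None):
    """Return at most k sequences, sampled uniformly without replacement (assumed strategy)."""
    if len(sequences) <= k:
        return list(sequences)
    return random.Random(seed).sample(sequences, k)

fasta_text = """>seq1
MKTAYIAKQRQISFVKSHFSRQDILD
>seq2
MKTVYIAKQRQISFV
KSHFSRQDILD
"""
homologs = sample_context(read_fasta(fasta_text), k=15, seed=0)
print(homologs)
```

The resulting list can be passed as the homolog context when building a prompt.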
### Model Modalities
1. **Aligned vs unaligned mode**: ProtGPT3-MSA has been trained to process concatenated sets of homologs in both "aligned" mode (i.e., the homologs are passed aligned, with gap characters) and "unaligned" mode. The mode is selected with the special `<gap>` and `<no_gap>` tokens, which should be placed at the start of the concatenated protein sequences.
2. **N-to-C vs C-to-N**: ProtGPT3-MSA has been trained to process concatenated homologs in both N-to-C and C-to-N directions. The direction is selected with two special "directional" tokens, `1` for N-to-C and `2` for C-to-N, which should be placed at the start of the concatenated protein sequences (i.e., before the gap token).
We provide some examples below.
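As a quick illustration of the token order described above, the following sketch assembles the prompt string for a toy input. It mirrors the layout used by the `build_prompt` helper in this card (`<|bos|>`, direction token, mode token, then `<s>`-separated residues). The function name `prompt_layout` is hypothetical, and the three-residue sequences are placeholders chosen only to keep the output short.

```python
def prompt_layout(sequences, aligned=False, direction="1"):
    """Return the space-separated prompt string for a list of sequences."""
    mode_token = "<gap>" if aligned else "<no_gap>"
    tokens = ["<|bos|>", direction, mode_token]
    for seq in sequences:
        tokens.append("<s>")   # separator before every sequence
        tokens.extend(seq)     # one token per residue (or gap character)
    tokens.append("<s>")       # trailing separator: the model continues from here
    return " ".join(tokens)

# unaligned, N-to-C
print(prompt_layout(["MKT", "MRT"]))
# <|bos|> 1 <no_gap> <s> M K T <s> M R T <s>

# aligned, C-to-N (the input sequences are already reversed)
print(prompt_layout(["T-M", "TRM"], aligned=True, direction="2"))
# <|bos|> 2 <gap> <s> T - M <s> T R M <s>
```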
## Uses
## How to Get Started with the Model
Install dependencies:
```bash
pip install transformers accelerate torch
```
Load the model and tokenizer:
```python
import random
import re

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM


# ---- Initialise helper functions to prompt ProtGPT3-MSA ----
def process_style(seq: str, gap: bool) -> str:
    """Uppercase insertions and drop X; remove gaps unless gap=True."""
    if not gap:
        # unaligned mode: remove gap characters
        seq = seq.replace("-", "")
    # aligned mode keeps the gaps; both modes uppercase and drop X
    return re.sub(r"[X]", "", seq.upper())


def build_prompt(
    sequences: list,
    gap: bool = False,
    direction: str = "1",
) -> str:
    """Build a prompt for ProtGPT3-MSA.

    Args:
        sequences: list of up to 15 homologous protein sequences (one homolog per entry).
        gap: if True, use the aligned mode; the homologs must then be aligned
            (same length, with gap characters). If False, the sequences can be unaligned.
        direction: "1" for N-to-C, "2" for the reversed C-to-N direction.
            Importantly, the input sequences must also be reversed if direction="2".
    """
    assert len(sequences) <= 15, "The model cannot be prompted with more than 15 sequences; reduce the number of sequences to 15 or fewer"
    # randomise the order of the sequences (on a copy, so the caller's list is not modified)
    sequences = random.sample(sequences, k=len(sequences))
    if gap:
        gap_token = "<gap>"
        assert all(len(s) == len(sequences[0]) for s in sequences), "Sequences in the prompt have different lengths but should be aligned; either align them or use the no_gap mode"
    else:
        gap_token = "<no_gap>"
    tokens: list[str] = ["<|bos|>", direction, gap_token]
    for seq in sequences:
        # add separator token between sequences
        tokens.append("<s>")
        tokens.extend(list(process_style(seq, gap=gap)))
    # match the train-time separator before the continuation
    tokens.append("<s>")
    return " ".join(tokens)
# --------------------------------------


model_id = "AI4PD/ProtGPT3-MSA"
# Load tokenizer for generation (the BOS token is added manually in build_prompt)
tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    trust_remote_code=True,
    add_bos_token=False,
    add_eos_token=False,
    padding_side="left",
)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
model.eval()
```
### Few-shot generation with unaligned homologs
Use the `<no_gap>` modality token for unaligned sequences. Separate homologous sequences with the `<s>` separator token.
```python
import torch

homologs = [
    "MKTAYIAKQRQISFVKSHFSRQDILD",
    "MKTVYIAKQRQISFVKSHFSRQDILD",
    "MKTAYIAKQRQINNVKSHFSRQNILD",
    # Add up to 15 homologous protein sequences
]
prompt = build_prompt(sequences=homologs)
inputs = tokenizer(prompt, return_tensors="pt", padding=True).to(model.device)
with torch.no_grad():
    output_ids = model.generate(
        inputs["input_ids"],
        max_new_tokens=512,  # CHANGE to the desired length (i.e., protein length times the number of homologs generated sequentially)
        do_sample=True,
        temperature=0.8,
        top_p=0.9,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id,
        num_return_sequences=20,  # set to the desired number of protein sequences to be generated in parallel
    )
# decode every returned sequence (the decoded text starts with the prompt homologs)
for ids in output_ids:
    generated = tokenizer.decode(ids, skip_special_tokens=True)
    # split sequences generated sequentially
    segments = generated.split("<s>")
    # print each sequence
    for s in segments:
        print(s.replace(" ", ""), "\n")
```
### Few-shot generation with aligned homologs
Use the `<gap>` modality token for aligned sequences. Gap characters may be included in the prompted sequences.
```python
import torch

# must have the same length and be aligned
aligned_homologs = [
    "MKTAYIAKQRQI--SFVKSHFSRQDILD",
    "MKTVYIAKQRQI--SFVKSHFSRQDILD",
    "MKTAYIAKQRQINNSFVKSHFSRQNILD",
]
prompt = build_prompt(sequences=aligned_homologs, gap=True)
inputs = tokenizer(prompt, return_tensors="pt", padding=True).to(model.device)
with torch.no_grad():
    output_ids = model.generate(
        inputs["input_ids"],
        max_new_tokens=512,  # CHANGE to the desired length (i.e., protein length times the number of homologs generated sequentially)
        do_sample=True,
        temperature=0.8,
        top_p=0.9,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.eos_token_id,
        num_return_sequences=20,  # set to the desired number of protein sequences to be generated in parallel
    )
# decode every returned sequence (the decoded text starts with the prompt homologs)
for ids in output_ids:
    generated = tokenizer.decode(ids, skip_special_tokens=True)
    # split sequences generated sequentially
    segments = generated.split("<s>")
    # print each sequence
    for s in segments:
        print(s.replace(" ", ""), "\n")
```
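The `build_prompt` docstring notes that sequences must be reversed when `direction="2"` (C-to-N) is used. A minimal sketch of that pre- and post-processing is given below. The helper `reverse_sequences` is hypothetical, and reversing the generated output back to N-to-C order is an assumption that follows from the model emitting residues in the prompted direction.

```python
def reverse_sequences(sequences):
    """Reverse each sequence so it reads C-to-N (or back to N-to-C)."""
    return [seq[::-1] for seq in sequences]

homologs = [
    "MKTAYIAKQRQISFVKSHFSRQDILD",
    "MKTVYIAKQRQISFVKSHFSRQDILD",
]

# 1) reverse the context before building a C-to-N prompt
reversed_homologs = reverse_sequences(homologs)
# prompt = build_prompt(sequences=reversed_homologs, direction="2")

# 2) after generation, reverse the decoded sequences again to read them N-to-C
#    (shown here on the reversed context itself as a round-trip check)
restored = reverse_sequences(reversed_homologs)
print(restored == homologs)
```

The same helper works for aligned inputs, since reversing every row of an alignment keeps the columns aligned.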
### Notes on prompting
- Use `<no_gap>` for unaligned homologous sequences.
- Use `<gap>` for aligned MSA-style inputs containing gap characters.
- Separate protein sequences with `<s>`.
- Provide up to 15 homologous sequences as context.
- Sampling parameters such as `temperature` and `top_p` can affect sequence quality, diversity, and family consistency.
- Generated sequences should be validated before experimental use.
- Change `max_new_tokens` in `generate()` to control the number of proteins generated sequentially (e.g., if you pass 15 homologs as the prompt, it should roughly equal the length of a single protein).
- Use `num_return_sequences` in `generate()` to control the number of proteins generated in parallel from the same prompt.
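A rough way to pick `max_new_tokens` from the prompt is sketched below. It assumes one token per residue or gap character plus one `<s>` separator per sequence, which matches how `build_prompt` joins single characters but is not confirmed against the tokenizer. The function name `estimate_max_new_tokens` and the 20% length margin are hypothetical choices.

```python
import math

def estimate_max_new_tokens(homologs, n_new=1, aligned=False, margin=1.2):
    """Estimate a token budget for generating n_new sequences sequentially."""
    if aligned:
        # aligned mode: gap characters are emitted as tokens too
        lengths = [len(h) for h in homologs]
    else:
        lengths = [len(h.replace("-", "")) for h in homologs]
    longest = max(lengths)
    # length budget with a safety margin, plus one separator token per sequence
    per_sequence = math.ceil(longest * margin) + 1
    return per_sequence * n_new

homologs = ["MKTAYIAKQRQISFVKSHFSRQDILD", "MKTVYIAKQRQISFVKSHFSRQDILD"]
print(estimate_max_new_tokens(homologs))           # budget for one new sequence
print(estimate_max_new_tokens(homologs, n_new=3))  # budget for three sequences in a row
```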
### Out-of-Scope Use
The model should not be used as the sole basis for experimental, clinical, environmental, or safety-critical decisions. Generated sequences require downstream computational and experimental validation. The model is not guaranteed to generate functional, soluble, safe, synthesizable, or experimentally successful proteins.
The model should not be used for irresponsible or harmful biological design applications.
## Bias, Risks, and Limitations
ProtGPT3-MSA learns from public protein sequence and MSA datasets and may reproduce biases present in those datasets. The model depends on the quality, relevance, and diversity of the homologous sequences provided in the prompt. Poor, unrelated, noisy, contaminated, or incorrectly aligned prompts may reduce generation quality.
Generated sequences may be nonfunctional, unstable, insoluble, repetitive, low-complexity, or biologically implausible. As with other generative protein models, ProtGPT3-MSA may present dual-use risks if applied irresponsibly.
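Repetitive or low-complexity outputs can be flagged with simple heuristics before any heavier analysis. The sketch below uses the Shannon entropy of the residue composition and the longest single-residue run. The function names and the thresholds (`min_entropy=3.0` bits, `max_run=6`) are illustrative assumptions, not values from the ProtGPT3 authors, and should be tuned per protein family.

```python
import math
from collections import Counter

def shannon_entropy(seq):
    """Shannon entropy (in bits) of the residue composition of seq."""
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in Counter(seq).values())

def longest_run(seq):
    """Length of the longest stretch of one repeated residue."""
    best, run, prev = 0, 0, None
    for ch in seq:
        run = run + 1 if ch == prev else 1
        best = max(best, run)
        prev = ch
    return best

def looks_low_complexity(seq, min_entropy=3.0, max_run=6):
    """Flag sequences with low compositional entropy or long homopolymer runs."""
    return shannon_entropy(seq) < min_entropy or longest_run(seq) > max_run

print(looks_low_complexity("MKTAYIAKQRQISFVKSHFSRQDILD"))
print(looks_low_complexity("A" * 30))
```

Sequences that pass such a filter still need the downstream validation described in this card.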
### Recommendations
Users should provide high-quality homologous protein sequences and validate generated sequences with appropriate downstream computational and experimental methods. For family-conditioned generation, users should carefully curate prompts and assess generated sequences using task-relevant criteria such as sequence identity, structural confidence, family-level consistency, solubility, and functional plausibility.
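One of the criteria above, sequence identity, can be computed directly for sequences generated in aligned mode. The sketch below is a simplified illustration: `aligned_identity` and `max_identity_to_prompt` are hypothetical helpers, and identity is defined here over columns where both sequences have a residue, which is only one of several common conventions. Unaligned sequences would first need a pairwise alignment, which is not shown.

```python
def aligned_identity(a, b):
    """Fraction of identical residues over columns where neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)

def max_identity_to_prompt(candidate, prompt_sequences):
    """Highest identity between a generated sequence and any prompt homolog."""
    return max(aligned_identity(candidate, p) for p in prompt_sequences)

prompt_sequences = [
    "MKTAYIAKQRQI--SFVKSHFSRQDILD",
    "MKTVYIAKQRQI--SFVKSHFSRQDILD",
]
candidate = "MKTAYIAKQRQINNSFVKSHFSRQNILD"
print(round(max_identity_to_prompt(candidate, prompt_sequences), 3))
```

A very high maximum identity suggests the model is close to copying a prompt homolog, while a very low one may indicate drift away from the family.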
## Training Details
### Training Data
ProtGPT3-MSA was trained on approximately 8.5M MSAs from the OpenProteinSet Uniclust30 dataset. From each MSA, 16 sequences were sampled without replacement and concatenated in random order. This process was repeated 15 times for each MSA, resulting in approximately 560B training tokens.
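The figures above can be related with some back-of-the-envelope arithmetic. The sketch below only combines the approximate numbers stated in this section; the derived per-example and per-sequence token counts are rough averages (special tokens and gap characters included), not values reported by the authors.

```python
n_msas = 8_500_000      # approximate number of MSAs
samples_per_msa = 15    # sampling repeats per MSA
seqs_per_sample = 16    # sequences concatenated per training example
total_tokens = 560e9    # approximate total training tokens

n_examples = n_msas * samples_per_msa              # 127,500,000 concatenated examples
tokens_per_example = total_tokens / n_examples     # average tokens per example
tokens_per_sequence = tokens_per_example / seqs_per_sample

print(f"{n_examples:,} examples")
print(f"~{tokens_per_example:.0f} tokens per example")
print(f"~{tokens_per_sequence:.0f} tokens per sequence")
```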
## Technical Specifications
### Model Architecture and Objective
ProtGPT3-MSA is a decoder-only autoregressive protein language model using a Mixtral-style sparse Mixture-of-Experts architecture. It was trained to model concatenated sets of related protein sequences, enabling homolog-conditioned generation through prompting.
The model processes up to 16 concatenated protein sequences and supports both aligned and unaligned modalities. During inference, users may provide up to 15 homologous sequences and generate an additional sequence conditioned on the prompt.
## Citation
**BibTeX:**
```bibtex
@article{garibbo2026protgpt3,
title={ProtGPT3: an Open-source family of Promptable and Aligned Protein Language Models},
author={Garibbo, Michele and Boxo Corominas, Gerard and Stocco, Filippo and Illanes Vicioso, Ramiro and Middendorf, Lasse and Ferruz, Noelia},
journal={bioRxiv},
pages={2026--06},
year={2026},
publisher={Cold Spring Harbor Laboratory}
}
```
## More Information
All models and code are released through the Hugging Face ecosystem and accompanying code repository.