---
library_name: transformers
tags:
- biology
- protein-language-model
- protein-generation
- msa
- multiple-sequence-alignment
- few-shot-prompting
- homolog-conditioned-generation
- causal-lm
- mixture-of-experts
- transformers
---

# Model Card for ProtGPT3-MSA

## Model Details

### Model Description

ProtGPT3-MSA is a multiple-sequence, homolog-conditioned autoregressive protein language model. It is part of the ProtGPT3 family, an open-source suite of promptable and aligned protein language models for protein sequence generation.

Unlike the single-sequence ProtGPT3 checkpoints, ProtGPT3-MSA can be prompted with sets of homologous protein sequences, enabling few-shot, family-conditioned protein generation without task-specific fine-tuning. At inference time, users can provide homologous protein sequences as context and generate additional family-consistent sequences.

ProtGPT3-MSA was initialized from the final ProtGPT3-112M training checkpoint and further trained to autoregressively model sets of 16 concatenated protein sequences. The model supports both aligned and unaligned prompting modes.

- **Developed by:** Anonymous authors
- **Model type:** Autoregressive MSA-promptable protein language model
- **Language(s):** Protein sequences / amino-acid sequences
- **License:** More Information Needed
- **Finetuned from model:** ProtGPT3-112M

### Model Sources

- **Repository:** https://huggingface.co/protgpt3
- **Paper:** ProtGPT3: an Open-source family of Promptable and Aligned Protein Language Models
- **Code:** https://anonymous.4open.science/r/protGPT3-2053/README.md

## Uses

### Direct Use

ProtGPT3-MSA is intended for few-shot, homolog-conditioned protein sequence generation. Users can prompt the model with related protein sequences from a target protein family to generate additional family-consistent sequences.

### Downstream Use

ProtGPT3-MSA can be used in protein design workflows where users have a small set of homologous sequences and want to generate plausible additional sequences from the same family. It may be combined with computational screening, structural prediction, fitness prediction, solubility filtering, or other downstream validation pipelines.

### Out-of-Scope Use

The model should not be used as the sole basis for experimental, clinical, environmental, or safety-critical decisions. Generated sequences require downstream computational and experimental validation. The model is not guaranteed to generate functional, soluble, safe, synthesizable, or experimentally successful proteins.

The model should not be used for irresponsible or harmful biological design applications.

## Bias, Risks, and Limitations

ProtGPT3-MSA learns from public protein sequence and MSA datasets and may reproduce biases present in those datasets. The model depends on the quality, relevance, and diversity of the homologous sequences provided in the prompt. Poor, unrelated, noisy, contaminated, or incorrectly aligned prompts may reduce generation quality.

Generated sequences may be nonfunctional, unstable, insoluble, repetitive, low-complexity, or biologically implausible. As with other generative protein models, ProtGPT3-MSA may present dual-use risks if applied irresponsibly.
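
As a first-pass screen for the repetitive or low-complexity outputs mentioned above, a minimal sketch could compute two simple statistics per sequence. The function names and both thresholds are illustrative assumptions; they are not part of the ProtGPT3 code or the paper's evaluation.

```python
import math
from collections import Counter


def shannon_entropy(seq: str) -> float:
    """Shannon entropy (in bits) of the residue composition of `seq`."""
    if not seq:
        return 0.0
    total = len(seq)
    return -sum((n / total) * math.log2(n / total) for n in Counter(seq).values())


def longest_run(seq: str) -> int:
    """Length of the longest stretch of a single repeated residue."""
    best = run = 0
    prev = None
    for ch in seq:
        run = run + 1 if ch == prev else 1
        prev = ch
        best = max(best, run)
    return best


def looks_low_complexity(seq: str, min_entropy: float = 2.5, max_run: int = 8) -> bool:
    """Flag low compositional entropy or long homopolymer runs.

    Both thresholds are hypothetical defaults and should be tuned per family.
    """
    return shannon_entropy(seq) < min_entropy or longest_run(seq) > max_run
```

A screen like this only removes obvious failures; it says nothing about whether a sequence folds or functions.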

### Recommendations

Users should provide high-quality homologous protein sequences and validate generated sequences with appropriate downstream computational and experimental methods. For family-conditioned generation, users should carefully curate prompts and assess generated sequences using task-relevant criteria such as sequence identity, structural confidence, family-level consistency, solubility, and functional plausibility.
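
One of the criteria above, sequence identity, can be sketched for sequences that are already aligned to the same length. This is a simple column-wise identity written for illustration; the helper names are hypothetical and the paper may compute identity differently (for example after pairwise alignment).

```python
def aligned_identity(a: str, b: str, gap: str = "-") -> float:
    """Fraction of identical columns, ignoring columns that are gaps in both sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = matches = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == gap and y == gap:
            continue  # a column with no residue in either sequence carries no signal
        compared += 1
        if x == y:
            matches += 1
    return matches / compared if compared else 0.0


def max_identity_to_prompt(candidate: str, prompt_seqs: list[str]) -> float:
    """Highest identity between a generated sequence and any aligned prompt sequence."""
    return max(aligned_identity(candidate, s) for s in prompt_seqs)
```

A very high maximum identity to the prompt suggests the model has nearly copied an input, while a very low value may indicate drift away from the family.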

## How to Get Started with the Model

Install dependencies:

```bash
pip install transformers accelerate torch
```

Load the model and tokenizer:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
import random
import re

# ---- Initialise helper functions for prompting ProtGPT3-MSA ----
def process_style(seq: str, gap: bool) -> str:
    """Uppercase the sequence and drop X; strip gap characters unless gap=True."""
    if gap:
        # aligned mode: keep gap characters
        return re.sub(r"[X]", "", seq.upper())
    else:
        # unaligned mode: remove gap characters
        return re.sub(r"[X]", "", seq.replace("-", "").upper())

def build_prompt(
    sequences: list, 
    gap: bool = False,
) -> str:
    """Build prompt for ProtGPT3-MSA"""

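    # Work on a shuffled copy so the caller's list is not reordered in place
    sequences = random.sample(sequences, k=len(sequences))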

    direction = "1" # change this to "2" for reversed C-to-N generation

    if gap:
        gap_token = "<gap>"
        assert all(len(s) == len(sequences[0]) for s in sequences), "Sequences in the prompt have different len(), but should be aligned, either align them or use no_gap mode"
    else:
        gap_token = "<no_gap>"

    tokens: list[str] = ["<|bos|>", direction, gap_token]
    for seq in sequences:
        tokens.append("<s>")
        tokens.extend(list(process_style(seq,gap=gap)))

    # Match train-time separator before continuation
    tokens.append("<s>")
    return " ".join(tokens)
## --------------------------------------

model_id = "protgpt3/ProtGPT3-MSA"  # Replace with the final checkpoint name

# Load tokenizer for generation
tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    trust_remote_code=True,
    add_bos_token=False,  # BOS token is added manually in build_prompt
    add_eos_token=False,
    padding_side="left",
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

model.eval()
```

### Few-shot generation with unaligned homologs

Use the `<no_gap>` modality token for unaligned sequences. Separate homologous sequences with the `<s>` separator token.

```python
import torch


homologs = [
    "MKTAYIAKQRQISFVKSHFSRQDILD",
    "MKTVYIAKQRQISFVKSHFSRQDILD",
    "MKTAYIAKQRQINNVKSHFSRQNILD",
    # Add up to 15 homologous protein sequences
]

prompt = build_prompt(sequences=homologs)

inputs = tokenizer(prompt, return_tensors="pt", padding=True).to(model.device)

with torch.no_grad():
    output_ids = model.generate(
        inputs["input_ids"],
        max_new_tokens=512,
        do_sample=True,
        temperature=0.8,
        top_p=0.9,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id,
    )

generated = tokenizer.decode(output_ids[0], skip_special_tokens=False)
print(generated)
```

### Few-shot generation with aligned homologs

Use the `<gap>` modality token for aligned sequences. Gap characters may be included in the prompted sequences.

```python
import torch

# must have the same length and be aligned
aligned_homologs = [
    "MKTAYIAKQRQI--SFVKSHFSRQDILD",
    "MKTVYIAKQRQI--SFVKSHFSRQDILD",
    "MKTAYIAKQRQINNSFVKSHFSRQNILD",
]

prompt = build_prompt(sequences=aligned_homologs, gap=True)

inputs = tokenizer(prompt, return_tensors="pt", padding=True).to(model.device)

with torch.no_grad():
    output_ids = model.generate(
        inputs["input_ids"],
        max_new_tokens=512,
        do_sample=True,
        temperature=0.8,
        top_p=0.9,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.eos_token_id,
    )

generated = tokenizer.decode(output_ids[0], skip_special_tokens=False)
print(generated)
```

### Extracting the newly generated sequence

For decoder-only models, `generate` returns the prompt tokens followed by the continuation, so the decoded output contains the full prompt as well as the newly generated sequence. A simple post-processing approach is to split on the sequence separator token and take the final segment:

```python
decoded = tokenizer.decode(output_ids[0], skip_special_tokens=False)

segments = decoded.split("<s>")
generated_sequence = segments[-1].replace(tokenizer.eos_token or "", "").strip()

print(generated_sequence)
```
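
If the continuation itself contains further `<s>` separators, the final segment is only the last of several generated sequences. A minimal sketch of a helper that returns every sequence after the prompt follows. The function name, the regular expression used to drop special tokens, and the assumption that decoded residues are separated by spaces are illustrative and not taken from the ProtGPT3 code.

```python
import re


def extract_generated(decoded: str, n_prompt: int, sep: str = "<s>",
                      keep_gaps: bool = False) -> list[str]:
    """Return the sequences that follow the first `n_prompt` prompt sequences."""
    segments = decoded.split(sep)
    # segments[0] is the header (BOS, direction, modality);
    # segments[1 : n_prompt + 1] are the prompt sequences.
    generated = segments[n_prompt + 1:]
    cleaned = []
    for seg in generated:
        seg = re.sub(r"<[^>]*>", "", seg)  # assumption: special tokens look like <...>
        seg = re.sub(r"\s+", "", seg)      # remove spaces between residue tokens
        if not keep_gaps:
            seg = seg.replace("-", "")
        if seg:
            cleaned.append(seg)
    return cleaned
```

With the examples above this would be called as `extract_generated(decoded, n_prompt=len(homologs))`. Pass `keep_gaps=True` to preserve alignment columns in `<gap>` mode.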

### Notes on prompting

- Use `<no_gap>` for unaligned homologous sequences.
- Use `<gap>` for aligned MSA-style inputs containing gap characters.
- Separate protein sequences with `<s>`.
- Provide up to 15 homologous sequences as context.
- Sampling parameters such as `temperature` and `top_p` can affect sequence quality, diversity, and family consistency.
- Generated sequences should be validated before experimental use.
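
The notes above can be turned into a small pre-flight check that runs before `build_prompt`. This is a sketch under stated assumptions: the function name is hypothetical, and the allowed alphabet (the 20 standard residues) is a guess at the model vocabulary rather than something the card specifies. In aligned mode `X` is reported because `process_style` drops it, which would shift alignment columns.

```python
CANONICAL = frozenset("ACDEFGHIKLMNPQRSTVWY")  # assumption: 20 standard residues


def check_homologs(sequences: list[str], aligned: bool, max_context: int = 15,
                   alphabet: frozenset = CANONICAL) -> list[str]:
    """Return human-readable problems; an empty list means the prompt looks usable."""
    problems = []
    if not sequences:
        problems.append("no sequences provided")
    if len(sequences) > max_context:
        problems.append(f"{len(sequences)} sequences given, at most {max_context} expected")
    # Unaligned mode tolerates '-' and 'X' because build_prompt strips them anyway.
    allowed = alphabet | {"-"} if aligned else alphabet | {"-", "X"}
    for i, seq in enumerate(sequences):
        bad = sorted(set(seq.upper()) - allowed)
        if bad:
            problems.append(f"sequence {i} contains unexpected characters: {''.join(bad)}")
    if aligned and len({len(s) for s in sequences}) > 1:
        problems.append("aligned sequences must all have the same length")
    return problems
```

Returning a list of messages instead of raising lets a pipeline log every issue with a prompt at once.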

## Training Details

### Training Data

ProtGPT3-MSA was trained on approximately 8.5M MSAs from the OpenProteinSet Uniclust30 dataset. From each MSA, 16 sequences were sampled without replacement and concatenated in random order. This process was repeated 15 times for each MSA, resulting in approximately 560B training tokens.
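
As a consistency check on these figures, the implied average example size can be worked out directly. The snippet uses only the approximate numbers stated above; the variable names are illustrative, and the result is an average that includes special tokens (and gap tokens where present), not a reported statistic.

```python
n_msas = 8.5e6           # approximate number of MSAs
samples_per_msa = 15     # sampling rounds per MSA
seqs_per_example = 16    # sequences concatenated per training example
total_tokens = 560e9     # approximate number of training tokens

n_examples = n_msas * samples_per_msa
tokens_per_example = total_tokens / n_examples
tokens_per_sequence = tokens_per_example / seqs_per_example

print(f"{n_examples:,.0f} training examples")
print(f"{tokens_per_example:.0f} tokens per example on average")
print(f"{tokens_per_sequence:.1f} tokens per sequence on average")
```

This works out to roughly 4,400 tokens per example, or about 275 tokens per sequence, which sits well inside the 16,384-token maximum sequence length listed under the training hyperparameters.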

### Training Procedure

#### Preprocessing

Each training example consisted of 16 concatenated protein sequences sampled from the same MSA. A special sequence separator token, `<s>`, was used to mark sequence boundaries.

Training included both aligned and unaligned modalities:

- `<gap>`: aligned modality, where sequences include gap tokens
- `<no_gap>`: unaligned modality, where sequences are provided without gaps

The model was trained autoregressively to predict concatenated protein sequences token by token.
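
The preprocessing described above can be sketched as follows. This is not the authors' pipeline: the function name is hypothetical, and the header layout (`<|bos|>`, direction token `1`, modality token) is assumed to mirror the inference-time prompt shown earlier in this card.

```python
import random
from typing import Optional


def make_training_example(msa: list[str], aligned: bool, n: int = 16,
                          rng: Optional[random.Random] = None) -> list[str]:
    """Sample `n` sequences from one MSA and lay them out as a token list."""
    rng = rng or random.Random()
    if len(msa) < n:
        raise ValueError(f"need at least {n} sequences, got {len(msa)}")
    # random.sample draws without replacement and returns picks in random order
    picked = rng.sample(msa, n)
    tokens = ["<|bos|>", "1", "<gap>" if aligned else "<no_gap>"]
    for seq in picked:
        seq = seq.upper()
        if not aligned:
            seq = seq.replace("-", "")  # unaligned modality carries no gap tokens
        tokens.append("<s>")            # separator marks each sequence boundary
        tokens.extend(seq)              # one token per residue or gap character
    return tokens
```

Calling this 15 times per MSA with independent draws would correspond to the repeated sampling described under Training Data.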

#### Training Hyperparameters

- **Training regime:** bfloat16
- **Architecture:** Mixtral-style sparse Mixture-of-Experts causal decoder
- **Maximum sequence length:** 16,384
- **Optimizer:** AdamW
- **Learning rate:** 2e-4
- **Optimizer betas:** β1 = 0.9, β2 = 0.95
- **Weight decay:** 0.1
- **Gradient clipping:** 1.0
- **Gradient accumulation steps:** 16
- **Maximum tokens per batch:** 100,000
- **Router auxiliary loss coefficient:** 0.05
- **Number of training GPUs:** 4
- **Precision:** bfloat16

#### Speeds, Sizes, Times

- **Model size:** 112M parameters
- **Training tokens:** Approximately 560B
- **Training MSAs:** Approximately 8.5M

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

ProtGPT3-MSA was evaluated on held-out protein families, ProteinGym, DMS stability libraries, held-out validation MSAs, PDB-derived MSAs, and targeted enzyme-generation case studies.

#### Factors

Evaluation considered family-conditioned generation quality across different protein families, MSA depths, prompt compositions, aligned versus unaligned prompting, and sampling settings.

#### Metrics

Evaluation included:

- ProteinGym Spearman correlation
- Sequence identity to held-out reference sequences
- Predicted structure confidence
- TM-score
- HHM profile comparison
- Positional KL-divergence
- DMS hit rate
- Computational success rate in targeted enzyme-generation case studies
- Experimental expression and purification outcomes for selected designs
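
To make the positional KL-divergence metric concrete, here is one way it could be computed between a reference alignment and an alignment of generated sequences. This is a sketch under assumptions: the direction KL(reference || generated), the additive pseudocount, and the 21-symbol alphabet are illustrative choices, and the paper's exact definition may differ.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"  # 20 standard residues plus the gap character


def column_frequencies(alignment: list[str], pseudocount: float = 1.0) -> list[dict[str, float]]:
    """Smoothed per-column symbol frequencies; symbols outside ALPHABET are ignored."""
    length = len(alignment[0])
    if any(len(s) != length for s in alignment):
        raise ValueError("all sequences must have the same aligned length")
    freqs = []
    for col in range(length):
        counts = Counter(seq[col] for seq in alignment)
        total = sum(counts.get(a, 0) for a in ALPHABET) + pseudocount * len(ALPHABET)
        freqs.append({a: (counts.get(a, 0) + pseudocount) / total for a in ALPHABET})
    return freqs


def positional_kl(reference: list[str], generated: list[str],
                  pseudocount: float = 1.0) -> list[float]:
    """KL(reference || generated) in nats for each alignment column."""
    p_cols = column_frequencies(reference, pseudocount)
    q_cols = column_frequencies(generated, pseudocount)
    if len(p_cols) != len(q_cols):
        raise ValueError("both alignments must have the same number of columns")
    return [sum(p[a] * math.log(p[a] / q[a]) for a in ALPHABET)
            for p, q in zip(p_cols, q_cols)]
```

Both inputs must be aligned to the same columns, for example by aligning generated sequences to the reference profile before calling `positional_kl`. Lower values indicate that the generated residue distribution at a position tracks the reference more closely.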

### Results

ProtGPT3-MSA supports family-conditioned generation from small sets of homologous sequences. In the paper, prompting ProtGPT3-MSA with as few as 15 homologs produced family-consistent generations and compared favorably to supervised fine-tuning of single-sequence models.

In a low-data defluorinase case study using seven experimentally annotated sequences, ProtGPT3-MSA achieved substantially higher computational success rates than fine-tuned single-sequence baselines and yielded designs that were expressed and soluble when tested experimentally.

#### Summary

ProtGPT3-MSA enables prompt-based protein family conditioning without updating model weights. This makes it suitable for low-data protein design settings where a small number of homologous sequences are available.

## Model Examination

ProtGPT3-MSA was examined for few-shot family-conditioned generation, aligned versus unaligned prompting, prompt ensembling, stability-aware generation, and inference-time steering using Feynman-Kac-style sequential Monte Carlo sampling.

## Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator.

- **Hardware Type:** NVIDIA H100 GPUs
- **Hours used:** More Information Needed
- **Cloud Provider:** More Information Needed
- **Compute Region:** More Information Needed
- **Carbon Emitted:** More Information Needed

## Technical Specifications

### Model Architecture and Objective

ProtGPT3-MSA is a decoder-only autoregressive protein language model using a Mixtral-style sparse Mixture-of-Experts architecture. It was trained to model concatenated sets of related protein sequences, enabling homolog-conditioned generation through prompting.

The model processes up to 16 concatenated protein sequences and supports both aligned and unaligned modalities. During inference, users may provide up to 15 homologous sequences and generate an additional sequence conditioned on the prompt.
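
Before generating, it can help to estimate how much of the context a prompt consumes. The sketch below counts tokens for the prompt layout used in this card. It assumes one token per residue or gap character, and it assumes that the 16,384-token training length also bounds inference; neither is stated explicitly in the card, and the function names are hypothetical.

```python
def prompt_token_count(sequences: list[str], aligned: bool) -> int:
    """Tokens in a prompt laid out as: <|bos|> direction modality (<s> residues)* <s>."""
    header = 3  # <|bos|>, direction token, modality token
    body = 0
    for seq in sequences:
        seq = seq.upper().replace("X", "")  # X is dropped when the prompt is built
        if not aligned:
            seq = seq.replace("-", "")      # gaps are stripped in unaligned mode
        body += 1 + len(seq)                # one <s> separator plus one token per character
    return header + body + 1                # trailing <s> that opens the new sequence


def remaining_budget(sequences: list[str], aligned: bool, max_length: int = 16_384) -> int:
    """Tokens left for generation under the assumed context limit."""
    return max_length - prompt_token_count(sequences, aligned)
```

Setting `max_new_tokens` no higher than `remaining_budget(...)` keeps the prompt and continuation together within the assumed limit.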

### Compute Infrastructure

#### Hardware

The model was trained on NVIDIA H100 GPUs.

#### Software

Training used FlashAttention-2, online mini-batch packing, Liger Kernel, and DeepSpeed.

## Citation

**BibTeX:**

```bibtex
@article{protgpt3,
  title={ProtGPT3: an Open-source family of Promptable and Aligned Protein Language Models},
  author={Anonymous Authors},
  year={2026}
}
```

**APA:**

Anonymous Authors. (2026). *ProtGPT3: an Open-source family of Promptable and Aligned Protein Language Models*.

## Glossary

- **MSA:** Multiple sequence alignment, a collection of related protein sequences aligned by residue position.
- **Homologs:** Evolutionarily related protein sequences.
- **Few-shot prompting:** Conditioning a model on a small number of examples at inference time without updating model weights.
- **Causal language modeling:** Autoregressive prediction of the next token given previous tokens.
- **TM-score:** A metric for structural similarity between protein structures.
- **pLDDT:** A predicted local structure confidence score.
- **KL-divergence:** A measure of difference between probability distributions, used here to compare generated and reference family residue distributions.

## More Information

All models and code are released through the Hugging Face ecosystem and accompanying code repository.

## Model Card Authors

Anonymous authors

## Model Card Contact

More Information Needed