---
library_name: transformers
tags:
- biology
- protein-language-model
- protein-generation
- causal-lm
- mixture-of-experts
- transformers
---
# Model Card for ProtGPT3-112M
## Model Details
### Model Description
ProtGPT3-112M is a single-sequence autoregressive protein language model for protein sequence generation. It is part of the ProtGPT3 family, an open-source suite of promptable and aligned protein language models ranging from 112M to 10B parameters. ProtGPT3 models use a causal Mixtral-style Mixture-of-Experts architecture and are trained for causal language modeling on protein sequences.
The single-sequence ProtGPT3 models can generate proteins in either the N-to-C or the C-to-N direction, selected with a special directional token (`1` for N-to-C, `2` for C-to-N). The model is intended for unconditional or prefix-conditioned protein sequence generation and can serve as a base model for downstream protein design workflows.
- **Developed by:** Anonymous authors
- **Model type:** Autoregressive protein language model; causal decoder-only Mixture-of-Experts model
- **Language(s):** Protein sequences / amino-acid sequences
- **License:** More Information Needed
- **Finetuned from model:** Not applicable / pretrained from scratch
### Model Sources
- **Repository:** https://huggingface.co/protgpt3
- **Paper:** ProtGPT3: an Open-source family of Promptable and Aligned Protein Language Models
- **Code:** https://anonymous.4open.science/r/protGPT3-2053/README.md
## Uses
### Direct Use
ProtGPT3-112M can be used for autoregressive generation of protein sequences. Users can generate sequences unconditionally or condition generation on an amino-acid prefix.
### Downstream Use
The model may be fine-tuned or incorporated into protein design workflows, including family-specific generation, protein variant generation, and computational screening pipelines.
### Out-of-Scope Use
The model should not be used as the sole basis for experimental, clinical, environmental, or safety-critical decisions. Generated proteins require downstream computational and experimental validation. The model is not guaranteed to generate functional, soluble, safe, or synthesizable proteins.
## Bias, Risks, and Limitations
ProtGPT3-112M learns from public protein sequence datasets and may reproduce biases present in those datasets. Generated sequences may be low-complexity, nonfunctional, unstable, insoluble, or biologically implausible. Protein generation models may also present dual-use risks if used irresponsibly.
### Recommendations
Users should apply appropriate computational filters, expert review, and experimental validation before using generated sequences. Users should also consider responsible-use practices for generative protein design.
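As one illustration of a computational filter, the sketch below rejects sequences that are very short, contain non-canonical residues, or look low-complexity. It is a minimal example, not the filtering used by the authors: the function names and all three thresholds are hypothetical and should be tuned for the application.

```python
import math
from collections import Counter

# The 20 canonical amino acids; other letters (e.g. X, B, Z, U, O) are rejected here.
CANONICAL_AA = set("ACDEFGHIKLMNPQRSTVWY")


def shannon_entropy(seq: str) -> float:
    """Shannon entropy, in bits, of the residue composition of a non-empty sequence."""
    counts = Counter(seq)
    total = len(seq)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def passes_basic_filters(
    seq: str,
    min_length: int = 30,                # hypothetical threshold
    min_entropy: float = 3.0,            # hypothetical threshold, in bits
    max_residue_fraction: float = 0.3,   # hypothetical threshold
) -> bool:
    """Return True if seq passes the length, alphabet, and complexity checks."""
    if len(seq) < min_length:
        return False
    if not set(seq) <= CANONICAL_AA:
        return False
    if shannon_entropy(seq) < min_entropy:
        return False
    # Reject sequences dominated by a single residue type.
    most_common_count = Counter(seq).most_common(1)[0][1]
    return most_common_count / len(seq) <= max_residue_fraction
```

A filter like this only removes obviously degenerate outputs. It says nothing about structure, function, or safety, so it complements rather than replaces expert review and experimental validation.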
## How to Get Started with the Model
Install dependencies:
```bash
pip install transformers accelerate torch
```
Load the model and tokenizer:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "protgpt3/ProtGPT3-112M"  # Replace with the final checkpoint name

# Load the tokenizer configured for generation: prepend BOS, do not append EOS,
# and pad on the left.
tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    trust_remote_code=True,
    add_bos_token=True,
    add_eos_token=False,
    padding_side="left",
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
model.eval()
```
### Generate a protein sequence
```python
import torch

# Optionally provide a direction token ("1" or "2"), optionally followed by an amino-acid prefix
prompt = ""
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(
        inputs["input_ids"],
        max_new_tokens=512,
        do_sample=True,
        temperature=0.8,
        top_p=0.9,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id,
    )

sequence = tokenizer.decode(output_ids[0], skip_special_tokens=True)
# The output includes the directional token "1" or "2", which denotes whether
# the sequence was generated N-to-C or C-to-N.
print(sequence)
```
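Because the decoded output keeps the leading directional token, downstream code usually needs to strip it. The helper below is a minimal sketch with hypothetical function names. It rests on two assumptions that should be checked against the tokenizer and the paper: that the token appears as the first character of the decoded string, and that C-to-N generations are written C-terminus first, so reversing them gives the conventional N-to-C order.

```python
def split_direction(decoded: str) -> tuple[str, str]:
    """Split a decoded generation into (direction, residues)."""
    # Some tokenizers decode with spaces between tokens; drop them defensively.
    decoded = decoded.strip().replace(" ", "")
    if decoded[:1] == "1":
        return "N-to-C", decoded[1:]
    if decoded[:1] == "2":
        return "C-to-N", decoded[1:]
    return "unknown", decoded


def to_n_to_c(decoded: str) -> str:
    """Return the residues in N-to-C order."""
    direction, residues = split_direction(decoded)
    # ASSUMPTION: C-to-N generations are emitted C-terminus first, so they are reversed.
    return residues[::-1] if direction == "C-to-N" else residues
```

For example, `to_n_to_c(sequence)` can be applied to the `sequence` string printed above before writing it to a FASTA file.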
### Generate from an amino-acid prefix
```python
import torch

# Forward N-to-C generation with special token "1".
# Use special token "2" instead of "1" for reverse C-to-N generation.
prefix = "1MKT"
inputs = tokenizer(prefix, return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(
        inputs["input_ids"],
        max_new_tokens=256,
        do_sample=True,
        temperature=0.8,
        top_p=0.9,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.eos_token_id,
    )

sequence = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(sequence)
```
### Batch generation
```python
import torch

prompts = [
    "",
    "1MKT",  # N-to-C generation
    "2MAV",  # C-to-N generation
]
inputs = tokenizer(
    prompts,
    return_tensors="pt",
    padding=True,
).to(model.device)

with torch.no_grad():
    output_ids = model.generate(
        inputs["input_ids"],
        # Prompts of different lengths are left-padded, so pass the attention
        # mask to keep the model from attending to padding positions.
        attention_mask=inputs["attention_mask"],
        max_new_tokens=256,
        do_sample=True,
        temperature=0.8,
        top_p=0.9,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.bos_token_id,
    )

sequences = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
for sequence in sequences:
    print(sequence)
```
## Training Details
### Training Data
ProtGPT3-112M was trained on publicly available protein sequence data from UniRef90 and the GigaRef subset of the Dayhoff Atlas. The 112M-parameter model used approximately 15M UniRef90 sequences and 28M GigaRef sequences, corresponding to approximately 9.8B training tokens.
### Training Procedure
#### Preprocessing
Protein sequences were sampled from UniRef90 and GigaRef. During training, each sequence was assigned a generation direction, either N-to-C or C-to-N, with a special token prepended to indicate the direction.
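The direction-tagging step can be sketched as follows. This is an illustrative reconstruction, not the authors' preprocessing code: the function name, the 50/50 sampling probability, and the choice to reverse the residue string for C-to-N examples are all assumptions. Only the tokens `1` and `2` come from the usage examples in this card.

```python
import random

FORWARD_TOKEN = "1"  # N-to-C, as in the generation examples in this card
REVERSE_TOKEN = "2"  # C-to-N


def add_direction(seq: str, rng: random.Random, p_reverse: float = 0.5) -> str:
    """Prepend a direction token, reversing the residues for C-to-N examples."""
    # ASSUMPTION: directions are sampled independently with probability p_reverse,
    # and a C-to-N example is the reversed residue string.
    if rng.random() < p_reverse:
        return REVERSE_TOKEN + seq[::-1]
    return FORWARD_TOKEN + seq


rng = random.Random(0)  # fixed seed so the sketch is reproducible
examples = [add_direction("MKTAYIAKQR", rng) for _ in range(4)]
```

Each element of `examples` is either `"1MKTAYIAKQR"` or `"2RQKAIYATKM"`, which is the shape of input the causal language modeling objective would then be trained on under these assumptions.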
#### Training Hyperparameters
- **Training regime:** bfloat16
- **Architecture:** Mixtral-style sparse Mixture-of-Experts causal decoder
- **Maximum sequence length:** 1024
- **Optimizer:** AdamW
- **Learning rate:** 5e-4
- **Weight decay:** 0.1
- **Gradient clipping:** 1.0
- **Batch size:** 500
- **Number of training GPUs:** 4
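For readers reproducing the setup, the listed values can be collected into a small configuration object. The sketch below is only a convenience: the class and field names are hypothetical, and anything not listed above (learning-rate schedule, warmup, AdamW betas, whether the batch size is global or per device, and whether clipping is by norm or by value) is not specified in this card and is deliberately left out.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class TrainingConfig:
    """Hyperparameters as listed in this model card (field names are hypothetical)."""

    precision: str = "bfloat16"
    max_seq_length: int = 1024
    optimizer: str = "AdamW"
    learning_rate: float = 5e-4
    weight_decay: float = 0.1
    grad_clip: float = 1.0
    batch_size: int = 500
    num_gpus: int = 4


config = TrainingConfig()

# Keyword arguments in the form accepted by torch.optim.AdamW(params, **optimizer_kwargs).
optimizer_kwargs = {"lr": config.learning_rate, "weight_decay": config.weight_decay}
```

`asdict(config)` gives a plain dictionary that can be logged alongside a training run.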
## Evaluation
### Testing Data, Factors & Metrics
#### Testing Data
The model was evaluated on held-out protein sequences with at most 50% sequence identity to the training set. It was also benchmarked on ProteinGym.
#### Metrics
Evaluation included validation perplexity, sequence diversity, predicted pLDDT, proportion of terminating sequences, proportion of low-complexity sequences, and ProteinGym Spearman correlation.
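Two of these metrics are simple enough to sketch directly. The definitions below are the standard ones and are assumptions about how the evaluation was implemented: perplexity as the exponential of the mean per-token negative log-likelihood (natural log), and the proportion of terminating sequences as the share of generations that emit the end-of-sequence token. The function names are hypothetical.

```python
import math


def perplexity(token_nlls: list[float]) -> float:
    """exp of the mean per-token negative log-likelihood (natural log)."""
    return math.exp(sum(token_nlls) / len(token_nlls))


def fraction_terminated(generations: list[list[int]], eos_token_id: int) -> float:
    """Share of generated token-id lists that contain the EOS token."""
    return sum(eos_token_id in g for g in generations) / len(generations)
```

As a point of reference, a model that assigns uniform probability to the 20 canonical amino acids has a per-token negative log-likelihood of ln 20 and therefore a perplexity of about 20, so useful models should score well below that.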
### Results
Larger ProtGPT3 single-sequence models showed improved perplexity, sequence quality, and diversity. ProtGPT3-112M serves as the smallest single-sequence model in the family and provides a computationally accessible checkpoint for protein generation.
## Technical Specifications
### Model Architecture and Objective
ProtGPT3-112M is a decoder-only causal language model using a Mixtral-style sparse Mixture-of-Experts architecture. It was trained with a causal language modeling objective on protein sequences.
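To make "Mixtral-style sparse Mixture-of-Experts" concrete, the sketch below shows top-k routing for a single token: a router scores every expert, only the k highest-scoring experts run, and their outputs are combined with softmax weights computed over the selected logits. It is a toy illustration with scalar expert outputs and hypothetical function names. The number of experts and the value of k used in ProtGPT3-112M are not stated in this card, so `k=2` is only a placeholder.

```python
import math


def top_k_route(router_logits: list[float], k: int = 2) -> dict[int, float]:
    """Select the k highest-scoring experts and softmax-normalize their logits."""
    top = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)[:k]
    # Subtract the maximum selected logit for numerical stability.
    max_logit = max(router_logits[i] for i in top)
    exp = {i: math.exp(router_logits[i] - max_logit) for i in top}
    total = sum(exp.values())
    return {i: e / total for i, e in exp.items()}


def moe_output(expert_outputs: list[float], weights: dict[int, float]) -> float:
    """Weighted sum over the selected experts only; the others are never evaluated."""
    return sum(w * expert_outputs[i] for i, w in weights.items())
```

Because only k experts are evaluated per token, the compute per token depends on k rather than on the total number of experts, which is what makes the layer sparse.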
### Compute Infrastructure
#### Hardware
NVIDIA H100 GPUs.
#### Software
Training used FlashAttention-2, online mini-batch packing, Liger Kernel, and DeepSpeed.
## Citation
**BibTeX:**
```bibtex
@article{protgpt3,
title={ProtGPT3: an Open-source family of Promptable and Aligned Protein Language Models},
author={Anonymous Authors},
year={2026}
}
```
## More Information
All models and code are released through the Hugging Face ecosystem and accompanying code repository.
## Model Card Authors
Anonymous authors
## Model Card Contact
Anonymous authors