---
library_name: transformers
tags:
- biology
- protein-language-model
- protein-generation
- causal-lm
- mixture-of-experts
- transformers
---

# Model Card for ProtGPT3-112M

## Model Details

### Model Description

ProtGPT3-112M is a single-sequence autoregressive protein language model for protein sequence generation. It is part of the ProtGPT3 family, an open-source suite of promptable and aligned protein language models ranging from 112M to 10B parameters. ProtGPT3 models use a causal Mixtral-style Mixture-of-Experts architecture and are trained for causal language modeling on protein sequences.

The single-sequence ProtGPT3 models can generate proteins in either the N-to-C or the C-to-N direction, selected with special directional tokens. The model is intended for unconditional or prefix-conditioned protein sequence generation and can be used as a base model for downstream protein design workflows.

- **Developed by:** Anonymous authors
- **Model type:** Autoregressive protein language model; causal decoder-only Mixture-of-Experts model
- **Language(s):** Protein sequences / amino-acid sequences
- **License:** More Information Needed
- **Finetuned from model:** Not applicable / pretrained from scratch

### Model Sources

- **Repository:** https://huggingface.co/protgpt3
- **Paper:** ProtGPT3: an Open-source family of Promptable and Aligned Protein Language Models
- **Code:** https://anonymous.4open.science/r/protGPT3-2053/README.md

## Uses

### Direct Use

ProtGPT3-112M can be used for autoregressive generation of protein sequences. Users can generate sequences unconditionally or condition generation on an amino-acid prefix.

### Downstream Use

The model may be fine-tuned or incorporated into protein design workflows, including family-specific generation, protein variant generation, and computational screening pipelines.

### Out-of-Scope Use

The model should not be used as the sole basis for experimental, clinical, environmental, or safety-critical decisions. Generated proteins require downstream computational and experimental validation. The model is not guaranteed to generate functional, soluble, safe, or synthesizable proteins.

## Bias, Risks, and Limitations

ProtGPT3-112M learns from public protein sequence datasets and may reproduce biases present in those datasets. Generated sequences may be low-complexity, nonfunctional, unstable, insoluble, or biologically implausible. Protein generation models may also present dual-use risks if used irresponsibly.

### Recommendations

Users should apply appropriate computational filters, expert review, and experimental validation before using generated sequences. Users should also consider responsible-use practices for generative protein design.
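
As one illustration of a computational filter, the sketch below flags low-complexity sequences by the Shannon entropy of their residue composition. It is a minimal example, not the filter used by the authors: the function names and the `min_entropy` threshold of 2.5 bits are assumptions chosen for illustration and should be tuned for the application.

```python
import math
from collections import Counter


def shannon_entropy(sequence: str) -> float:
    """Shannon entropy (in bits) of the residue composition of a sequence."""
    counts = Counter(sequence)
    total = len(sequence)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def is_low_complexity(sequence: str, min_entropy: float = 2.5) -> bool:
    """Flag sequences whose composition entropy falls below a threshold.

    The default threshold is a hypothetical value, not one taken from
    the ProtGPT3 paper or code.
    """
    if not sequence:
        return True  # treat empty generations as failures
    return shannon_entropy(sequence) < min_entropy


generated = ["AAAAAAAAAAAAAAAA", "MKTAYIAKQRQISFVKSHFSRQ"]
kept = [seq for seq in generated if not is_low_complexity(seq)]
print(kept)
```

A composition-based score like this only catches the most obvious repeats; it complements, rather than replaces, structure prediction and expert review.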

## How to Get Started with the Model

Install dependencies:

```bash
pip install transformers accelerate torch
```

Load the model and tokenizer:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "protgpt3/ProtGPT3-112M"  # Replace with the final checkpoint name

# Load tokenizer for generation
tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    trust_remote_code=True,
    add_bos_token=True,
    add_eos_token=False,
    padding_side="left",
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

model.eval()
```

### Generate a protein sequence

```python
import torch

prompt = ""  # Optionally provide a direction token ("1" or "2") followed by an amino-acid prefix

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(
        inputs["input_ids"],
        max_new_tokens=512,
        do_sample=True,
        temperature=0.8,
        top_p=0.9,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id,
    )

sequence = tokenizer.decode(output_ids[0], skip_special_tokens=True)
# The output includes the direction token: "1" for N-to-C or "2" for C-to-N generation
print(sequence)
```

### Generate from an amino-acid prefix

```python
import torch

# Forward N-to-C generation uses the special token "1"
prefix = "1MKT"  # Use the special token "2" instead of "1" for reverse C-to-N generation

inputs = tokenizer(prefix, return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(
        inputs["input_ids"],
        max_new_tokens=256,
        do_sample=True,
        temperature=0.8,
        top_p=0.9,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.eos_token_id,
    )

sequence = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(sequence)
```

### Batch generation

```python
import torch

prompts = [
    "",
    "1MKT", # N-to-C generation
    "2MAV", # C-to-N generation
]

inputs = tokenizer(
    prompts,
    return_tensors="pt",
    padding=True,
).to(model.device)

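with torch.no_grad():
    output_ids = model.generate(
        input_ids=inputs["input_ids"],
        # Pass the attention mask so left-padded positions are ignored
        attention_mask=inputs["attention_mask"],
        max_new_tokens=256,
        do_sample=True,
        temperature=0.8,
        top_p=0.9,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id,
    )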

sequences = tokenizer.batch_decode(output_ids, skip_special_tokens=True)

for sequence in sequences:
    print(sequence)
```
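
Because decoded outputs carry a direction token, downstream tools that expect N-to-C order may need a small post-processing step. The helper below is a hypothetical sketch: it assumes the decoded string begins with "1" or "2" and that C-to-N sequences are written in reversed residue order, neither of which is confirmed by this card, so verify both against the released tokenizer before relying on it.

```python
def to_n_to_c(decoded: str) -> tuple[str, str]:
    """Split a decoded generation into (direction, sequence in N-to-C order).

    Assumes "1" marks N-to-C and "2" marks C-to-N, and that C-to-N
    sequences are stored reversed (an assumption, not a documented fact).
    """
    text = "".join(decoded.split())  # drop any whitespace added by decoding
    if text.startswith("1"):
        return "N-to-C", text[1:]
    if text.startswith("2"):
        return "C-to-N", text[1:][::-1]
    raise ValueError("decoded sequence does not start with a direction token")


for raw in ["1MKTAYIAK", "2KAIYATKM"]:
    direction, seq = to_n_to_c(raw)
    print(direction, seq)
```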

## Training Details

### Training Data

ProtGPT3-112M was trained on publicly available protein sequence data from UniRef90 and the GigaRef subset of the Dayhoff Atlas. The 112M-parameter model used approximately 15M UniRef90 sequences and 28M GigaRef sequences, corresponding to approximately 9.8B training tokens.

### Training Procedure

#### Preprocessing

Protein sequences were sampled from UniRef90 and GigaRef. During training, each sequence was assigned a generation direction, either N-to-C or C-to-N, with a special token prepended to indicate the direction.
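
The direction assignment described above could look roughly like the following. This is a sketch under stated assumptions rather than the authors' preprocessing code: the function name, the 50/50 split between directions, and the choice to reverse the residue string for C-to-N examples are all illustrative guesses.

```python
import random


def add_direction_token(sequence: str, rng: random.Random, p_reverse: float = 0.5) -> str:
    """Prepend a direction token, reversing the residues for C-to-N examples.

    "1" marks N-to-C and "2" marks C-to-N, as in the usage examples of this
    card. The sampling probability and the reversal are assumptions.
    """
    if rng.random() < p_reverse:
        return "2" + sequence[::-1]
    return "1" + sequence


rng = random.Random(0)  # fixed seed so the example is reproducible
examples = [add_direction_token("MKTAYIAK", rng) for _ in range(4)]
print(examples)
```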

#### Training Hyperparameters

- **Training regime:** bfloat16
- **Architecture:** Mixtral-style sparse Mixture-of-Experts causal decoder
- **Maximum sequence length:** 1024
- **Optimizer:** AdamW
- **Learning rate:** 5e-4
- **Weight decay:** 0.1
- **Gradient clipping:** 1.0
- **Batch size:** 500
- **Number of training GPUs:** 4

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

The model was evaluated on held-out protein sequences with at most 50% sequence identity to the training set. It was also benchmarked on ProteinGym.

#### Metrics

Evaluation included validation perplexity, sequence diversity, predicted pLDDT, proportion of terminating sequences, proportion of low-complexity sequences, and ProteinGym Spearman correlation.

### Results

Larger ProtGPT3 single-sequence models showed improved perplexity, sequence quality, and diversity. ProtGPT3-112M serves as the smallest single-sequence model in the family and provides a computationally accessible checkpoint for protein generation.

## Technical Specifications

### Model Architecture and Objective

ProtGPT3-112M is a decoder-only causal language model using a Mixtral-style sparse Mixture-of-Experts architecture. It was trained with a causal language modeling objective on protein sequences.

### Compute Infrastructure

#### Hardware

NVIDIA H100 GPUs.

#### Software

Training used FlashAttention-2, online mini-batch packing, Liger Kernel, and DeepSpeed.

## Citation

**BibTeX:**

```bibtex
@article{protgpt3,
  title={ProtGPT3: an Open-source family of Promptable and Aligned Protein Language Models},
  author={Anonymous Authors},
  year={2026}
}
```

## More Information

All models and code are released through the Hugging Face ecosystem and the accompanying code repository.

## Model Card Authors

Anonymous authors

## Model Card Contact

Anonymous authors