Model Card for ProtGPT3-112M-dpo

Model Details

Model Description

ProtGPT3-112M-dpo is a DPO-aligned single-sequence autoregressive protein language model for protein sequence generation. It is part of the ProtGPT3 family, an open-source suite of promptable and aligned protein language models for protein design.

The base ProtGPT3-112M model is a causal decoder-only protein language model using a Mixtral-style sparse Mixture-of-Experts architecture. It was trained for causal language modeling on protein sequences and supports generation in both N-to-C and C-to-N directions using special directional tokens.

This checkpoint was further aligned with Direct Preference Optimization (DPO) to improve generation quality. The alignment procedure shifts the model toward protein sequences with higher predicted structural confidence and reduced low-complexity content, while preserving sequence diversity.

  • Developed by: Anonymous authors
  • Model type: DPO-aligned autoregressive protein language model; causal decoder-only Mixture-of-Experts model
  • Language(s): Protein sequences / amino-acid sequences
  • License: More Information Needed
  • Finetuned from model: protgpt3/ProtGPT3-112M

Model Sources

  • Repository: More Information Needed
  • Paper: More Information Needed

Uses

Direct Use

ProtGPT3-112M-dpo can be used for single-sequence autoregressive protein generation. Users can generate protein sequences unconditionally or condition generation on an amino-acid prefix.

Compared with the base ProtGPT3-112M checkpoint, this DPO-aligned model is intended for users who want generations biased toward higher-complexity sequences with improved predicted structural confidence.

Downstream Use

The model may be used in protein design workflows, computational screening pipelines, protein variant generation, and candidate sequence proposal. Generated sequences can be further evaluated with structure prediction, sequence-complexity filters, solubility filters, fitness predictors, or experimental validation.
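
As a concrete example of a sequence-complexity filter, the sketch below estimates the fraction of residues that fall inside low-entropy windows. This is a minimal stand-in for dedicated LCR tools such as SEG, not the ProtGPT3 pipeline's own measure; the window size and entropy cutoff are illustrative assumptions.

import math
from collections import Counter

def low_complexity_fraction(seq: str, window: int = 20, entropy_cutoff: float = 2.2) -> float:
    # Flag residues inside sliding windows whose Shannon entropy falls below
    # the cutoff, then return the flagged fraction of the sequence.
    if len(seq) < window:
        return 0.0
    flagged = [False] * len(seq)
    for i in range(len(seq) - window + 1):
        counts = Counter(seq[i : i + window])
        entropy = -sum((c / window) * math.log2(c / window) for c in counts.values())
        if entropy < entropy_cutoff:
            flagged[i : i + window] = [True] * window
    return sum(flagged) / len(seq)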

Out-of-Scope Use

The model should not be used as the sole basis for experimental, clinical, environmental, or safety-critical decisions. Generated sequences require downstream computational and experimental validation. The model is not guaranteed to generate functional, soluble, safe, synthesizable, or experimentally successful proteins.

The model should not be used for irresponsible or harmful biological design applications.

Bias, Risks, and Limitations

ProtGPT3-112M-dpo learns from public protein sequence datasets and may reproduce biases present in those datasets. Although DPO alignment reduces low-complexity generations and improves generation quality according to the alignment objectives (predicted pLDDT and low-complexity-region content, combined into a binary pass/fail criterion; see the main manuscript), generated sequences may still be nonfunctional, unstable, insoluble, repetitive, biologically implausible, or unsuitable for a user’s intended application.

The DPO alignment objective uses predicted structural confidence and low-complexity filtering as proxy objectives. These proxies do not guarantee biological function, experimental success, safety, solubility, or manufacturability.

As with other generative protein models, ProtGPT3-112M-dpo may present dual-use risks if applied irresponsibly.

Recommendations

Users should validate generated sequences with appropriate downstream computational and experimental methods. Recommended checks include sequence-complexity filtering, structure prediction, predicted confidence scoring, similarity searches against known proteins, solubility assessment, and task-specific functional evaluation.
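
A cheap pre-screen can run before the heavier checks above. The sketch below validates the amino-acid alphabet, bounds the length, and rejects long homopolymer runs; all thresholds are illustrative assumptions, not values from the paper.

import re

STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")

def passes_basic_screen(seq: str, min_len: int = 50, max_len: int = 1024, max_run: int = 8) -> bool:
    # Length bounds and alphabet check, then a crude repetitiveness test
    # that rejects any residue repeated max_run or more times in a row.
    if not (min_len <= len(seq) <= max_len):
        return False
    if set(seq) - STANDARD_AA:
        return False
    return re.search(r"(.)\1{%d,}" % (max_run - 1), seq) is None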

How to Get Started with the Model

Install dependencies:

pip install transformers accelerate torch

Load the model and tokenizer:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "protgpt3/ProtGPT3-112M-dpo"

# Load tokenizer for generation
tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    trust_remote_code=True,
    add_bos_token=True,    # prepend BOS so generation starts from a clean context
    add_eos_token=False,
    padding_side="left",   # left padding for batched decoder-only generation
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

model.eval()

Generate a protein sequence

import torch

prompt = ""  # Optionally provide an amino-acid prefix or model-specific direction

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(
        inputs["input_ids"],
        max_new_tokens=512,
        do_sample=True,
        temperature=0.8,
        top_p=0.9,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.eos_token_id,  # fall back to EOS for padding
    )

sequence = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(sequence)  # output begins with directional token "1" (N-to-C) or "2" (C-to-N)
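
Because the decoded output keeps the leading directional token, a small post-processing step can normalize generations to conventional N-to-C orientation. This is a minimal sketch; reversing C-to-N outputs assumes the model emits those residues in reverse order, which the card implies but does not state explicitly.

def to_n_to_c(decoded: str) -> str:
    # Strip the directional token; for C-to-N generations ("2"), reverse the
    # residues so the returned string reads N-terminus to C-terminus.
    if decoded.startswith("1"):
        return decoded[1:]
    if decoded.startswith("2"):
        return decoded[1:][::-1]
    return decoded

print(to_n_to_c(sequence))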

Generate from an amino-acid prefix

import torch

# forward N-to-C generation with directional token "1";
# use "2" instead of "1" for reverse C-to-N generation
prefix = "1MKT"

inputs = tokenizer(prefix, return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(
        inputs["input_ids"],
        max_new_tokens=256,
        do_sample=True,
        temperature=0.8,
        top_p=0.9,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.eos_token_id,
    )

sequence = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(sequence)

Batch generation

import torch

prompts = [
    "",
    "1MKT", # N-to-C generation
    "2MAV", # C-to-N generation
]

# ensure a pad token exists for batching; padding with BOS is an assumption,
# since the card does not document a dedicated pad token
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.bos_token

inputs = tokenizer(
    prompts,
    return_tensors="pt",
    padding=True,
).to(model.device)

with torch.no_grad():
    output_ids = model.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"],  # mask out left padding
        max_new_tokens=256,
        do_sample=True,
        temperature=0.8,
        top_p=0.9,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id,
    )

sequences = tokenizer.batch_decode(output_ids, skip_special_tokens=True)

for sequence in sequences:
    print(sequence)

Notes on generation

  • Use this checkpoint for single-sequence protein generation.
  • Sampling parameters such as temperature and top_p can strongly affect sequence quality and diversity; a small sweep sketch follows this list.
  • Lower temperatures may produce more conservative sequences.
  • Higher temperatures may increase diversity but can also increase failure modes.
  • Generated sequences should be validated before experimental use.
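
The sketch below illustrates a small sampling sweep, reusing the model and tokenizer loaded earlier; the grid values are arbitrary starting points, not recommended settings.

import torch

# hypothetical sweep over sampling settings to probe the quality/diversity trade-off
for temperature in (0.6, 0.8, 1.0):
    for top_p in (0.9, 0.95):
        inputs = tokenizer("1", return_tensors="pt").to(model.device)  # N-to-C generation
        with torch.no_grad():
            out = model.generate(
                inputs["input_ids"],
                max_new_tokens=256,
                do_sample=True,
                temperature=temperature,
                top_p=top_p,
                eos_token_id=tokenizer.eos_token_id,
                pad_token_id=tokenizer.eos_token_id,
            )
        seq = tokenizer.decode(out[0], skip_special_tokens=True)
        print(f"T={temperature}, top_p={top_p}, length={len(seq)}")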

Training Details

Training Data

The base ProtGPT3-112M model was trained on publicly available protein sequence data from UniRef90 and the GigaRef subset of the Dayhoff Atlas. The 112M-parameter model used approximately 64M UniRef90 sequences and 120M GigaRef sequences, corresponding to approximately 43B training tokens.

The DPO alignment dataset was constructed from model-generated sequences. Sequences were scored using predicted structural confidence and low-complexity-region content. Sequences with pLDDT greater than 0.7 and fewer than 25% low-complexity residues were treated as positive examples, while the remaining generations were treated as negative examples.
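
In code, the stated pass/fail rule reduces to a simple predicate; the scores themselves come from external tools (a structure predictor for pLDDT, an LCR measure), which are assumed here.

def label_generation(mean_plddt: float, lcr_fraction: float) -> str:
    # Positive: pLDDT > 0.7 and fewer than 25% low-complexity residues;
    # everything else is a negative example (binary criterion from the card).
    return "positive" if mean_plddt > 0.7 and lcr_fraction < 0.25 else "negative"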

Training Procedure

Preprocessing

For base-model pretraining, protein sequences were sampled from UniRef90 and GigaRef. During training, each sequence was assigned a generation direction, either N-to-C or C-to-N, with a special token prepended to indicate the direction.
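
A minimal sketch of how such directional training examples might be constructed is shown below; reversing the residues for the C-to-N direction is an assumption, since the card specifies only the prepended token.

import random

def make_directional_example(seq: str) -> str:
    # Randomly assign a direction: "1" keeps natural N-to-C order, "2" marks
    # C-to-N and (by assumption) reverses the residue order.
    if random.random() < 0.5:
        return "1" + seq
    return "2" + seq[::-1]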

For DPO alignment, generated sequences were classified as pass or fail according to predicted pLDDT and low-complexity-region thresholds. Pass and fail sequences were clustered separately at 50% sequence identity and 0.8 coverage. Preference pairs were constructed by pairing positive and negative examples with matched sequence lengths, helping prevent the model from learning sequence length as a shortcut.
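
The sketch below shows one way to build such length-matched preference pairs; the clustering step is omitted, and the matching tolerance is a hypothetical parameter.

import random
from collections import defaultdict

def make_length_matched_pairs(positives, negatives, tol: int = 0):
    # Index negatives by length, then pair each positive with a negative of
    # (near-)equal length so length cannot act as a preference shortcut.
    by_len = defaultdict(list)
    for neg in negatives:
        by_len[len(neg)].append(neg)
    pairs = []
    for pos in positives:
        candidates = [n for d in range(-tol, tol + 1) for n in by_len[len(pos) + d]]
        if candidates:
            pairs.append({"prompt": "", "chosen": pos, "rejected": random.choice(candidates)})
    return pairs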

Training Hyperparameters

Base ProtGPT3-112M pretraining:

  • Training regime: bfloat16
  • Architecture: Mixtral-style sparse Mixture-of-Experts causal decoder
  • Maximum sequence length: 1024
  • Optimizer: AdamW
  • Learning rate: 2.5e-4
  • Optimizer betas: β1 = 0.9, β2 = 0.999
  • Weight decay: 0.1
  • Gradient clipping: 1.0
  • Gradient accumulation steps: 4
  • Batch size: 100
  • Router auxiliary loss coefficient: 0.05
  • Number of training GPUs: 16

DPO alignment (a minimal training sketch follows this list):

  • Alignment method: Direct Preference Optimization
  • Positive-example criterion: pLDDT > 0.7 and low-complexity regions < 25%
  • Negative-example criterion: all other generated sequences
  • Pairing strategy: length-matched positive and negative sequence pairs
  • Preference-data clustering: 50% sequence identity, 0.8 coverage
  • Alignment objective: shift the model toward higher-complexity, higher-pLDDT generations
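
The card does not name the alignment training stack. The sketch below shows one plausible setup using the trl library's DPOTrainer with preference pairs in the prompt/chosen/rejected format; the beta and batch-size values are illustrative, not the paper's.

from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base_id = "protgpt3/ProtGPT3-112M"  # base checkpoint to align
tokenizer = AutoTokenizer.from_pretrained(base_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(base_id, trust_remote_code=True)

# toy preference pairs; in practice, use the length-matched pairs built above
pairs = [{"prompt": "", "chosen": "1MKTAYIAKQR", "rejected": "1MKTAAAAAAA"}]
train_dataset = Dataset.from_list(pairs)

config = DPOConfig(
    output_dir="protgpt3-112m-dpo",
    beta=0.1,  # illustrative; the paper's beta is not stated in this card
    per_device_train_batch_size=8,
    bf16=True,
)
# trl >= 0.12 uses processing_class; older versions take tokenizer= instead
trainer = DPOTrainer(model=model, args=config, train_dataset=train_dataset, processing_class=tokenizer)
trainer.train()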

Speeds, Sizes, Times

  • Model size: 112M parameters
  • Base-model training tokens: Approximately 43B
  • Hardware: NVIDIA H100 GPUs

Evaluation

Testing Data, Factors & Metrics

Testing Data

ProtGPT3 models were evaluated on held-out protein sequences with at most 50% sequence identity to the training set. The model family was also benchmarked on ProteinGym and assessed for generation quality across sampling settings.

The DPO-aligned models were evaluated on generated sequences and on naturally occurring protein sequences from PDB-derived data to assess whether the alignment objective generalized beyond the model-generated preference data.

Factors

Evaluation considered model scale, sampling temperature, nucleus sampling parameter top_p, sequence direction, predicted structure confidence, low-complexity-region content, and sequence diversity.

Metrics

Evaluation included:

  • Validation perplexity
  • ProteinGym Spearman correlation
  • Predicted pLDDT
  • Fraction of low-complexity generations
  • Sequence diversity
  • Fraction of sequences passing the pLDDT and low-complexity filters
  • Intrinsic reward discrimination between high-quality and low-quality natural sequences

Results

DPO alignment improved generation quality across the ProtGPT3 single-sequence model family. Alignment reduced the fraction of low-complexity generations while preserving high predicted structural confidence and sequence diversity.

For the 112M-scale model, DPO alignment increased the pass rate of generated sequences under the pLDDT and low-complexity criteria. The paper reports that alignment reduced low-complexity generations by more than 20% for the 112M and 1B-scale models, while preserving diversity and causing little change in held-out pretraining perplexity.

Summary

ProtGPT3-112M-dpo is the DPO-aligned version of ProtGPT3-112M. It is intended for users who want a single-sequence protein generator biased toward higher-complexity and higher-predicted-confidence generations compared with the base checkpoint.

Model Examination

ProtGPT3-112M-dpo was examined as part of the ProtGPT3 alignment study. The DPO alignment pipeline was designed to reduce repetitive or low-complexity protein generations while maintaining diversity and preserving base-model knowledge.

The aligned models were also examined using an intrinsic reward discrimination analysis on real protein sequences, where aligned models assigned systematically higher intrinsic rewards to high-quality sequences than to low-quality sequences.
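
A sketch of this intrinsic-reward computation follows. Under DPO, the implicit reward of a sequence x is beta * (log pi_aligned(x) - log pi_ref(x)); the beta value and example sequence below are illustrative, and `model`/`tokenizer` are assumed loaded as in the quickstart above.

import torch
from transformers import AutoModelForCausalLM

def sequence_logprob(lm, seq: str) -> float:
    # Total log-likelihood of a tokenized sequence under a causal LM.
    ids = tokenizer(seq, return_tensors="pt")["input_ids"].to(lm.device)
    with torch.no_grad():
        logits = lm(ids).logits
    logps = torch.log_softmax(logits[:, :-1], dim=-1)
    return logps.gather(-1, ids[:, 1:, None]).sum().item()

# reference = the unaligned base checkpoint
ref_model = AutoModelForCausalLM.from_pretrained(
    "protgpt3/ProtGPT3-112M",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
).eval()

beta = 0.1  # illustrative; the paper's value is not stated in this card
seq = "1MKTAYIAKQR"  # hypothetical sequence with an N-to-C directional token
reward = beta * (sequence_logprob(model, seq) - sequence_logprob(ref_model, seq))
print(reward)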

Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator.

  • Hardware Type: NVIDIA H100 GPUs
  • Hours used: More Information Needed
  • Cloud Provider: More Information Needed
  • Compute Region: More Information Needed
  • Carbon Emitted: More Information Needed

Technical Specifications

Model Architecture and Objective

ProtGPT3-112M-dpo is a decoder-only autoregressive protein language model using a Mixtral-style sparse Mixture-of-Experts architecture. The base model was trained with a causal language modeling objective on protein sequences.

The DPO-aligned checkpoint was optimized to prefer generated sequences with higher predicted structural confidence and lower low-complexity-region content.
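
For intuition, the sketch below implements generic Mixtral-style top-2 routing. It illustrates the architecture family only; it is not the ProtGPT3 implementation, and all dimensions are placeholders.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2MoE(nn.Module):
    # Generic Mixtral-style sparse MoE layer: a linear router picks the top-2
    # experts per token and mixes their outputs with softmaxed router weights.
    def __init__(self, hidden: int = 512, n_experts: int = 8):
        super().__init__()
        self.router = nn.Linear(hidden, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden, 4 * hidden), nn.SiLU(), nn.Linear(4 * hidden, hidden))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, hidden)
        weights, idx = torch.topk(self.router(x), k=2, dim=-1)
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(2):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out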

Compute Infrastructure

Hardware

The base ProtGPT3-112M model was trained on NVIDIA H100 GPUs.

Software

Training used FlashAttention-2, online mini-batch packing, Liger Kernel, and DeepSpeed.

Citation

BibTeX:

@article{protgpt3,
  title={ProtGPT3: an Open-source family of Promptable and Aligned Protein Language Models},
  author={Anonymous Authors},
  year={2026}
}

APA:

Anonymous Authors. (2026). ProtGPT3: an Open-source family of Promptable and Aligned Protein Language Models.

Glossary

  • DPO: Direct Preference Optimization, an alignment method that optimizes a model directly on preference pairs (the standard objective is shown after this glossary).
  • pLDDT: Predicted local distance difference test, a per-residue structural-confidence score (reported here on a 0-1 scale).
  • Low-complexity region: A repetitive or compositionally simple sequence region.
  • Causal language modeling: Autoregressive prediction of the next token given previous tokens.
  • Mixture-of-Experts: A sparse neural architecture using multiple expert subnetworks.
  • N-to-C / C-to-N: Protein sequence generation directions from N-terminus to C-terminus or C-terminus to N-terminus.
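
For reference, the standard DPO objective (Rafailov et al., 2023) over preference pairs of chosen sequences $y_w$ and rejected sequences $y_l$ given prompt $x$, with frozen reference model $\pi_{\mathrm{ref}}$ and strength parameter $\beta$:

$$\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]$$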

More Information

All models and code are released through the Hugging Face ecosystem and accompanying code repository.

Model Card Authors

Anonymous authors

Model Card Contact

More Information Needed
