	---
	library_name: transformers
	tags:
	- biology
	- protein-language-model
	- protein-generation
	- causal-lm
	- mixture-of-experts
	- transformers
	---

	# Model Card for ProtGPT3-112M

	## Model Details

	### Model Description

	ProtGPT3-112M is a single-sequence autoregressive protein language model for protein sequence generation. It is part of the ProtGPT3 family, an open-source suite of promptable and aligned protein language models ranging from 112M to 10B parameters. ProtGPT3 models use a causal Mixtral-style Mixture-of-Experts architecture and are trained for causal language modeling on protein sequences.

	The single-sequence ProtGPT3 models can generate proteins in either the N-to-C or the C-to-N direction, selected by a special directional token at the start of the sequence (`1` for N-to-C and `2` for C-to-N in the usage examples). The model is intended for unconditional or prefix-conditioned protein sequence generation and can serve as a base model for downstream protein design workflows.

	- Developed by: Anonymous authors
	- Model type: Autoregressive protein language model; causal decoder-only Mixture-of-Experts model
	- Language(s): Protein sequences / amino-acid sequences
	- License: More Information Needed
	- Finetuned from model: Not applicable / pretrained from scratch

	### Model Sources

	- Repository: https://huggingface.co/protgpt3
	- Paper: ProtGPT3: an Open-source family of Promptable and Aligned Protein Language Models
	- Code: https://anonymous.4open.science/r/protGPT3-2053/README.md

	## Uses

	### Direct Use

	ProtGPT3-112M can be used for autoregressive generation of protein sequences. Users can generate sequences unconditionally or condition generation on an amino-acid prefix.

	### Downstream Use

	The model may be fine-tuned or incorporated into protein design workflows, including family-specific generation, protein variant generation, and computational screening pipelines.

	### Out-of-Scope Use

	The model should not be used as the sole basis for experimental, clinical, environmental, or safety-critical decisions. Generated proteins require downstream computational and experimental validation. The model is not guaranteed to generate functional, soluble, safe, or synthesizable proteins.

	## Bias, Risks, and Limitations

	ProtGPT3-112M learns from public protein sequence datasets and may reproduce biases present in those datasets. Generated sequences may be low-complexity, nonfunctional, unstable, insoluble, or biologically implausible. Protein generation models may also present dual-use risks if used irresponsibly.

	### Recommendations

	Users should apply appropriate computational filters, expert review, and experimental validation before using generated sequences. Users should also consider responsible-use practices for generative protein design.
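As one possible starting point for such computational filters, the sketch below rejects sequences that are very short, contain non-standard residues, or look low-complexity. It is a minimal illustration and not the authors' filtering pipeline: the function names and all thresholds (`min_len`, `min_entropy`, `max_run`) are hypothetical and should be tuned per project.

```python
import math
from collections import Counter

# The 20 standard amino acids. Whether the model can also emit other symbols
# (for example X, B, or Z) is not stated in this card.
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")


def shannon_entropy(seq: str) -> float:
    """Shannon entropy (in bits) of the residue composition of `seq`."""
    counts = Counter(seq)
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def longest_run(seq: str) -> int:
    """Length of the longest homopolymer run, e.g. 4 for 'MAAAAK'."""
    best = run = 0
    prev = None
    for aa in seq:
        run = run + 1 if aa == prev else 1
        best = max(best, run)
        prev = aa
    return best


def passes_basic_filters(seq, min_len=30, min_entropy=3.0, max_run=8):
    """Cheap sanity checks; all thresholds are hypothetical defaults."""
    if len(seq) < min_len or not set(seq) <= STANDARD_AA:
        return False
    return shannon_entropy(seq) >= min_entropy and longest_run(seq) <= max_run
```

Passing these checks says nothing about fold, function, or safety; they only remove obviously degenerate outputs before more expensive steps such as structure prediction.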

	## How to Get Started with the Model

	Install dependencies:

	```bash
	pip install transformers accelerate torch
	```

	Load the model and tokenizer:

	```python
	import torch
	from transformers import AutoTokenizer, AutoModelForCausalLM

	model_id = "protgpt3/ProtGPT3-112M" # Replace with the final checkpoint name

	# Load tokenizer for generation (left padding is required for batched generation)
	tokenizer = AutoTokenizer.from_pretrained(
	    model_id,
	    trust_remote_code=True,
	    add_bos_token=True,
	    add_eos_token=False,
	    padding_side="left",
	)

	model = AutoModelForCausalLM.from_pretrained(
	    model_id,
	    torch_dtype=torch.bfloat16,
	    device_map="auto",
	    trust_remote_code=True,
	)

	model.eval()
	```

	### Generate a protein sequence

	```python
	import torch

	prompt = "" # Optionally provide an amino-acid prefix or model-specific direction

	inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

	with torch.no_grad():
	    output_ids = model.generate(
	        inputs["input_ids"],
	        max_new_tokens=512,
	        do_sample=True,
	        temperature=0.8,
	        top_p=0.9,
	        eos_token_id=tokenizer.eos_token_id,
	        pad_token_id=tokenizer.pad_token_id,
	    )

	sequence = tokenizer.decode(output_ids[0], skip_special_tokens=True)
	# The output includes the directional token "1" or "2", which denotes whether
	# the sequence was generated N-to-C or C-to-N.
	print(sequence)
	```

	### Generate from an amino-acid prefix

	```python
	import torch

	# forward N-to-C generation with special token "1"
	prefix = "1MKT" # use special token "2" instead of "1" for reverse C-to-N generation

	inputs = tokenizer(prefix, return_tensors="pt").to(model.device)

	with torch.no_grad():
	    output_ids = model.generate(
	        inputs["input_ids"],
	        max_new_tokens=256,
	        do_sample=True,
	        temperature=0.8,
	        top_p=0.9,
	        eos_token_id=tokenizer.eos_token_id,
	        pad_token_id=tokenizer.eos_token_id,
	    )

	sequence = tokenizer.decode(output_ids[0], skip_special_tokens=True)
	print(sequence)
	```

	### Batch generation

	```python
	import torch

	prompts = [
	    "",      # unconditional generation
	    "1MKT",  # N-to-C generation
	    "2MAV",  # C-to-N generation
	]

	inputs = tokenizer(
	    prompts,
	    return_tensors="pt",
	    padding=True,
	).to(model.device)

	with torch.no_grad():
	    output_ids = model.generate(
	        inputs["input_ids"],
	        # Pass the attention mask so that left-padding tokens are ignored
	        attention_mask=inputs["attention_mask"],
	        max_new_tokens=256,
	        do_sample=True,
	        temperature=0.8,
	        top_p=0.9,
	        eos_token_id=tokenizer.eos_token_id,
	        pad_token_id=tokenizer.bos_token_id,
	    )

	sequences = tokenizer.batch_decode(output_ids, skip_special_tokens=True)

	for sequence in sequences:
	    print(sequence)
	```
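Because decoded outputs keep the leading directional token, a small post-processing step can separate the direction from the residues. The helper below is a sketch, not part of the released code: `parse_generation` is a hypothetical name, and it assumes that C-to-N outputs are written as the reversed residue string, so it flips them back to the conventional N-to-C orientation. Verify that assumption against the tokenizer and paper before relying on it.

```python
# Direction tokens as used in the examples above
DIRECTION_TOKENS = {"1": "N-to-C", "2": "C-to-N"}


def parse_generation(decoded: str):
    """Split decoded text into (direction, sequence written N-to-C).

    Assumption: C-to-N outputs are the reversed residue string, so they are
    flipped here to the conventional N-to-C orientation.
    """
    # Some tokenizers decode with spaces between residues; drop them.
    text = decoded.replace(" ", "").strip()
    if not text or text[0] not in DIRECTION_TOKENS:
        return None, text  # no direction token found; return the text unchanged
    direction = DIRECTION_TOKENS[text[0]]
    residues = text[1:]
    if direction == "C-to-N":
        residues = residues[::-1]
    return direction, residues
```

For example, `parse_generation("2VAM")` returns `("C-to-N", "MAV")` under the reversal assumption.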

	## Training Details

	### Training Data

	ProtGPT3-112M was trained on publicly available protein sequence data from UniRef90 and the GigaRef subset of the Dayhoff Atlas. The 112M-parameter model used approximately 15M UniRef90 sequences and 28M GigaRef sequences, corresponding to approximately 9.8B training tokens.
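As a rough sanity check on these figures, the arithmetic below derives the implied average number of tokens per sequence. It assumes every sequence contributes to the token count exactly once, which the card does not state, so the result is only an order-of-magnitude estimate.

```python
uniref90_sequences = 15e6  # approximate, from the paragraph above
gigaref_sequences = 28e6   # approximate, from the paragraph above
training_tokens = 9.8e9    # approximate, from the paragraph above

total_sequences = uniref90_sequences + gigaref_sequences
# Assumption: each sequence is counted once in the token total.
tokens_per_sequence = training_tokens / total_sequences

print(f"{total_sequences / 1e6:.0f}M sequences")          # 43M sequences
print(f"~{tokens_per_sequence:.0f} tokens per sequence")  # ~228 tokens per sequence
```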

	### Training Procedure

	#### Preprocessing

	Protein sequences were sampled from UniRef90 and GigaRef. During training, each sequence was assigned a generation direction, either N-to-C or C-to-N, with a special token prepended to indicate the direction.
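The direction-tagging step can be sketched as follows. This is an illustration only: the token strings `1` and `2` come from the usage examples in this card, while the 50/50 split (`p_reverse`) and the string reversal for C-to-N are assumptions that the card does not specify. `add_direction` is a hypothetical helper name.

```python
import random

N_TO_C, C_TO_N = "1", "2"  # direction tokens as used in the usage examples


def add_direction(seq: str, rng: random.Random, p_reverse: float = 0.5) -> str:
    """Prepend a direction token, reversing the residues for C-to-N.

    Assumptions: the 50/50 default split and the string reversal are
    illustrative; the card does not specify either.
    """
    if rng.random() < p_reverse:
        return C_TO_N + seq[::-1]
    return N_TO_C + seq


rng = random.Random(0)  # seeded for reproducibility
examples = [add_direction("MKTAYIAKQR", rng) for _ in range(4)]
```

Each element of `examples` is either `"1MKTAYIAKQR"` or `"2RQKAIYATKM"`, depending on the sampled direction.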

	#### Training Hyperparameters

	- Training regime: bfloat16
	- Architecture: Mixtral-style sparse Mixture-of-Experts causal decoder
	- Maximum sequence length: 1024
	- Optimizer: AdamW
	- Learning rate: 5e-4
	- Weight decay: 0.1
	- Gradient clipping: 1.0
	- Batch size: 500
	- Number of training GPUs: 4

	## Evaluation

	### Testing Data, Factors & Metrics

	#### Testing Data

	The model was evaluated on held-out protein sequences with at most 50% sequence identity to the training set. It was also benchmarked on ProteinGym.

	#### Metrics

	Evaluation included validation perplexity, sequence diversity, predicted pLDDT, proportion of terminating sequences, proportion of low-complexity sequences, and ProteinGym Spearman correlation.

	### Results

	Larger ProtGPT3 single-sequence models showed improved perplexity, sequence quality, and diversity. ProtGPT3-112M serves as the smallest single-sequence model in the family and provides a computationally accessible checkpoint for protein generation.

	## Technical Specifications

	### Model Architecture and Objective

	ProtGPT3-112M is a decoder-only causal language model using a Mixtral-style sparse Mixture-of-Experts architecture. It was trained with a causal language modeling objective on protein sequences.
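To illustrate what sparse Mixture-of-Experts routing means, the sketch below shows top-k routing for a single token: a router scores every expert, only the `k` best experts run, and their outputs are combined with softmax-normalised weights. Mixtral itself routes each token to 2 of 8 experts; the number of experts and the value of `k` in ProtGPT3-112M are not given in this card, so the values here are assumptions, and the scalar "experts" are stand-ins for feed-forward blocks.

```python
import math


def route_top_k(router_logits, k=2):
    """Pick the k highest-scoring experts and softmax-normalise their logits.

    Returns a list of (expert_index, weight) pairs whose weights sum to 1.
    """
    top = sorted(
        range(len(router_logits)), key=lambda i: router_logits[i], reverse=True
    )[:k]
    m = max(router_logits[i] for i in top)  # subtract max for numerical stability
    exps = [math.exp(router_logits[i] - m) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_layer(x, experts, router_logits, k=2):
    """Weighted sum of the selected experts' outputs for one token."""
    return sum(w * experts[i](x) for i, w in route_top_k(router_logits, k))
```

Because only `k` experts are evaluated per token, the compute per token stays well below what the total parameter count would suggest.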

	### Compute Infrastructure

	#### Hardware

	NVIDIA H100 GPUs.

	#### Software

	Training used FlashAttention-2, online mini-batch packing, Liger Kernel, and DeepSpeed.
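Mini-batch packing places several short sequences into one fixed-length window so that fewer positions are wasted on padding. The sketch below uses first-fit-decreasing bin packing with the 1024-token maximum sequence length listed under the training hyperparameters. It is an assumption-laden illustration: the card does not say which packing algorithm was used, and `pack_sequences` is a hypothetical name.

```python
def pack_sequences(lengths, max_len=1024):
    """First-fit-decreasing packing of sequence lengths into bins of max_len.

    Returns a list of bins, each a list of indices into `lengths`.
    Sequences longer than max_len are counted as max_len (i.e. truncated).
    """
    bins, free = [], []
    # Visit sequences from longest to shortest.
    for idx in sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True):
        need = min(lengths[idx], max_len)
        for b, space in enumerate(free):
            if need <= space:  # first bin with enough room
                bins[b].append(idx)
                free[b] -= need
                break
        else:  # no existing bin fits; open a new one
            bins.append([idx])
            free.append(max_len - need)
    return bins
```

For a mini-batch with lengths `[600, 500, 400, 300, 200]`, this returns `[[0, 2], [1, 3, 4]]`: two packed windows of 1000 tokens each instead of five mostly padded ones. A real implementation would also need to keep attention from crossing sequence boundaries within a window.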

	## Citation

	BibTeX:

	```bibtex
	@article{protgpt3,
	  title={ProtGPT3: an Open-source family of Promptable and Aligned Protein Language Models},
	  author={Anonymous Authors},
	  year={2026}
	}
	```

	## More Information

	All models and code are released through the Hugging Face ecosystem and accompanying code repository.

	## Model Card Authors

	Anonymous authors

	## Model Card Contact

	Anonymous authors