---
library_name: transformers
tags:
- biology
- protein-language-model
- protein-generation
- causal-lm
- mixture-of-experts
- transformers
---

# Model Card for ProtGPT3-112M

## Model Details

### Model Description

ProtGPT3-112M is a single-sequence autoregressive protein language model for protein sequence generation. It is part of the ProtGPT3 family, an open-source suite of promptable and aligned protein language models ranging from 112M to 10B parameters. ProtGPT3 models use a causal Mixtral-style Mixture-of-Experts architecture and are trained for causal language modeling on protein sequences. The single-sequence ProtGPT3 models can generate proteins in either N-to-C or C-to-N direction using special directional tokens.

The model is intended for unconditional or prefix-conditioned protein sequence generation and can be used as a base model for downstream protein design workflows.

- **Developed by:** Anonymous authors
- **Model type:** Autoregressive protein language model; causal decoder-only Mixture-of-Experts model
- **Language(s):** Protein sequences / amino-acid sequences
- **License:** More Information Needed
- **Finetuned from model:** Not applicable / pretrained from scratch

### Model Sources

- **Repository:** https://huggingface.co/protgpt3
- **Paper:** ProtGPT3: an Open-source family of Promptable and Aligned Protein Language Models
- **Code:** https://anonymous.4open.science/r/protGPT3-2053/README.md

## Uses

### Direct Use

ProtGPT3-112M can be used for autoregressive generation of protein sequences. Users can generate sequences unconditionally or condition generation on an amino-acid prefix.

### Downstream Use

The model may be fine-tuned or incorporated into protein design workflows, including family-specific generation, protein variant generation, and computational screening pipelines.

### Out-of-Scope Use

The model should not be used as the sole basis for experimental, clinical, environmental, or safety-critical decisions.
Generated proteins require downstream computational and experimental validation. The model is not guaranteed to generate functional, soluble, safe, or synthesizable proteins.

## Bias, Risks, and Limitations

ProtGPT3-112M learns from public protein sequence datasets and may reproduce biases present in those datasets. Generated sequences may be low-complexity, nonfunctional, unstable, insoluble, or biologically implausible. Protein generation models may also present dual-use risks if used irresponsibly.

### Recommendations

Users should apply appropriate computational filters, expert review, and experimental validation before using generated sequences. Users should also consider responsible-use practices for generative protein design.

## How to Get Started with the Model

Install dependencies:

```bash
pip install transformers accelerate torch
```

Load the model and tokenizer:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "protgpt3/ProtGPT3-112M"  # Replace with the final checkpoint name

# Load the tokenizer for generation: prepend BOS, do not append EOS,
# and pad on the left so batched prompts end where generation starts.
tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    trust_remote_code=True,
    add_bos_token=True,
    add_eos_token=False,
    padding_side="left",
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
model.eval()
```

### Generate a protein sequence

```python
import torch

prompt = ""  # Optionally provide an amino-acid prefix or model-specific direction

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(
        inputs["input_ids"],
        max_new_tokens=512,
        do_sample=True,
        temperature=0.8,
        top_p=0.9,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id,
    )

sequence = tokenizer.decode(output_ids[0], skip_special_tokens=True)

# The output includes the directional token "1" or "2", which denotes whether
# the sequence was generated N-to-C or C-to-N.
print(sequence)
```

### Generate from an amino-acid prefix

```python
import torch

# Forward N-to-C generation with special token "1".
# Use special token "2" instead of "1" for reverse C-to-N generation.
prefix = "1MKT"

inputs = tokenizer(prefix, return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(
        inputs["input_ids"],
        max_new_tokens=256,
        do_sample=True,
        temperature=0.8,
        top_p=0.9,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.eos_token_id,
    )

sequence = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(sequence)
```

### Batch generation

```python
import torch

prompts = [
    "",
    "1MKT",  # N-to-C generation
    "2MAV",  # C-to-N generation
]

inputs = tokenizer(
    prompts,
    return_tensors="pt",
    padding=True,
).to(model.device)

with torch.no_grad():
    output_ids = model.generate(
        inputs["input_ids"],
        # Pass the mask so the left-padding tokens are ignored during generation.
        attention_mask=inputs["attention_mask"],
        max_new_tokens=256,
        do_sample=True,
        temperature=0.8,
        top_p=0.9,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.bos_token_id,
    )

sequences = tokenizer.batch_decode(output_ids, skip_special_tokens=True)

for sequence in sequences:
    print(sequence)
```

## Training Details

### Training Data

ProtGPT3-112M was trained on publicly available protein sequence data from UniRef90 and the GigaRef subset of the Dayhoff Atlas. The 112M-parameter model used approximately 15M UniRef90 sequences and 28M GigaRef sequences, corresponding to approximately 9.8B training tokens.

### Training Procedure

#### Preprocessing

Protein sequences were sampled from UniRef90 and GigaRef. During training, each sequence was assigned a generation direction, either N-to-C or C-to-N, with a special token prepended to indicate the direction.
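
The direction-tagging step described above can be illustrated with a minimal sketch. This is not the authors' preprocessing code: the helper name `add_direction`, the 50/50 split between directions, and the choice to reverse the residue string for C-to-N examples are all assumptions made for illustration. Only the direction tokens `"1"` (N-to-C) and `"2"` (C-to-N) are taken from the prompts used elsewhere in this card.

```python
import random

# Direction tokens as used in the prompts of this card.
N_TO_C = "1"  # forward, N-terminus to C-terminus
C_TO_N = "2"  # reverse, C-terminus to N-terminus


def add_direction(sequence, rng, p_reverse=0.5):
    """Prepend a direction token to one amino-acid sequence.

    Hypothetical helper. Assumptions: the direction is drawn at random with
    probability p_reverse for C-to-N, and C-to-N examples store the residues
    in reversed order.
    """
    if rng.random() < p_reverse:
        return C_TO_N + sequence[::-1]
    return N_TO_C + sequence


rng = random.Random(0)  # fixed seed so the sketch is reproducible
examples = [add_direction("MKTAYIAK", rng) for _ in range(4)]
for example in examples:
    print(example)
```

Under these assumptions, every training example starts with exactly one direction token, so the same token can be used at inference time to choose the generation direction.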
#### Training Hyperparameters

- **Training regime:** bfloat16
- **Architecture:** Mixtral-style sparse Mixture-of-Experts causal decoder
- **Maximum sequence length:** 1024
- **Optimizer:** AdamW
- **Learning rate:** 5e-4
- **Weight decay:** 0.1
- **Gradient clipping:** 1.0
- **Batch size:** 500
- **Number of training GPUs:** 4

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

The model was evaluated on held-out protein sequences with at most 50% sequence identity to the training set. It was also benchmarked on ProteinGym.

#### Metrics

Evaluation included validation perplexity, sequence diversity, predicted pLDDT, proportion of terminating sequences, proportion of low-complexity sequences, and ProteinGym Spearman correlation.

### Results

Larger ProtGPT3 single-sequence models showed improved perplexity, sequence quality, and diversity. ProtGPT3-112M serves as the smallest single-sequence model in the family and provides a computationally accessible checkpoint for protein generation.

## Technical Specifications

### Model Architecture and Objective

ProtGPT3-112M is a decoder-only causal language model using a Mixtral-style sparse Mixture-of-Experts architecture. It was trained with a causal language modeling objective on protein sequences.

### Compute Infrastructure

#### Hardware

NVIDIA H100 GPUs.

#### Software

Training used FlashAttention-2, online mini-batch packing, Liger Kernel, and DeepSpeed.

## Citation

**BibTeX:**

```bibtex
@article{protgpt3,
  title={ProtGPT3: an Open-source family of Promptable and Aligned Protein Language Models},
  author={Anonymous Authors},
  year={2026}
}
```

## More Information

All models and code are released through the Hugging Face ecosystem and accompanying code repository.

## Model Card Authors

Anonymous authors

## Model Card Contact

Anonymous authors