@@ -1,199 +1,325 @@
 ---
 library_name: transformers
-tags: []
 ---
-# Model Card for Model ID
-<!-- Provide a quick summary of what the model is/does. -->
 ## Model Details
 ### Model Description
-<!-- Provide a longer summary of what this model is. -->
-This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
-- **Developed by:** [More Information Needed]
-- **Funded by [optional]:** [More Information Needed]
-- **Shared by [optional]:** [More Information Needed]
-- **Model type:** [More Information Needed]
-- **Language(s) (NLP):** [More Information Needed]
-- **License:** [More Information Needed]
-- **Finetuned from model [optional]:** [More Information Needed]
-### Model Sources [optional]
-<!-- Provide the basic links for the model. -->
-- **Repository:** [More Information Needed]
-- **Paper [optional]:** [More Information Needed]
-- **Demo [optional]:** [More Information Needed]
 ## Uses
-<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
 ### Direct Use
-<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
-[More Information Needed]
-### Downstream Use [optional]
-<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
-[More Information Needed]
 ### Out-of-Scope Use
-<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
-[More Information Needed]
 ## Bias, Risks, and Limitations
-<!-- This section is meant to convey both technical and sociotechnical limitations. -->
-[More Information Needed]
 ### Recommendations
-<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
-Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
 ## How to Get Started with the Model
-Use the code below to get started with the model.
-[More Information Needed]
-## Training Details
-### Training Data
-<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
-[More Information Needed]
-### Training Procedure
-<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
-#### Preprocessing [optional]
-[More Information Needed]
-#### Training Hyperparameters
-- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
-#### Speeds, Sizes, Times [optional]
-<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
-[More Information Needed]
-## Evaluation
-<!-- This section describes the evaluation protocols and provides the results. -->
 ### Testing Data, Factors & Metrics
 #### Testing Data
-<!-- This should link to a Dataset Card if possible. -->
-[More Information Needed]
 #### Factors
-<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
-[More Information Needed]
 #### Metrics
-<!-- These are the evaluation metrics being used, ideally with a description of why. -->
-[More Information Needed]
 ### Results
-[More Information Needed]
-#### Summary
-## Model Examination [optional]
-<!-- Relevant interpretability work for the model goes here -->
-[More Information Needed]
 ## Environmental Impact
-<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
-Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
-- **Hardware Type:** [More Information Needed]
-- **Hours used:** [More Information Needed]
-- **Cloud Provider:** [More Information Needed]
-- **Compute Region:** [More Information Needed]
-- **Carbon Emitted:** [More Information Needed]
-## Technical Specifications [optional]
 ### Model Architecture and Objective
-[More Information Needed]
-### Compute Infrastructure
-[More Information Needed]
 #### Hardware
-[More Information Needed]
 #### Software
-[More Information Needed]
-## Citation [optional]
-<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
 **BibTeX:**
-[More Information Needed]
 **APA:**
-[More Information Needed]
-## Glossary [optional]
-<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
-[More Information Needed]
-## More Information [optional]
-[More Information Needed]
-## Model Card Authors [optional]
-[More Information Needed]
 ## Model Card Contact
-[More Information Needed]

 ---
 library_name: transformers
+tags:
+- biology
+- protein-language-model
+- protein-generation
+- msa
+- multiple-sequence-alignment
+- few-shot-prompting
+- homolog-conditioned-generation
+- causal-lm
+- mixture-of-experts
+- transformers
 ---
+# Model Card for ProtGPT3-MSA
 ## Model Details
 ### Model Description
+ProtGPT3-MSA is a multiple-sequence, homolog-conditioned autoregressive protein language model. It is part of the ProtGPT3 family, an open-source suite of promptable and aligned protein language models for protein sequence generation.
+Unlike the single-sequence ProtGPT3 checkpoints, ProtGPT3-MSA can be prompted with a set of homologous protein sequences and then generates additional family-consistent sequences. This enables few-shot, family-conditioned protein generation without task-specific fine-tuning.
+ProtGPT3-MSA was initialized from the final ProtGPT3-112M training checkpoint and further trained to autoregressively model sets of 16 concatenated protein sequences. The model supports both aligned and unaligned prompting modes.
+- **Developed by:** Anonymous authors
+- **Model type:** Autoregressive MSA-promptable protein language model
+- **Language(s):** Protein sequences / amino-acid sequences
+- **License:** More Information Needed
+- **Finetuned from model:** ProtGPT3-112M
+### Model Sources
+- **Repository:** https://huggingface.co/protgpt3
+- **Paper:** ProtGPT3: an Open-source family of Promptable and Aligned Protein Language Models
+- **Code:** https://anonymous.4open.science/r/protGPT3-2053/README.md
 ## Uses
 ### Direct Use
+ProtGPT3-MSA is intended for few-shot, homolog-conditioned protein sequence generation. Users can prompt the model with related protein sequences from a target protein family to generate additional family-consistent sequences.
+### Downstream Use
+ProtGPT3-MSA can be used in protein design workflows where users have a small set of homologous sequences and want to generate plausible additional sequences from the same family. It may be combined with computational screening, structural prediction, fitness prediction, solubility filtering, or other downstream validation pipelines.
 ### Out-of-Scope Use
+The model should not be used as the sole basis for experimental, clinical, environmental, or safety-critical decisions. Generated sequences require downstream computational and experimental validation. The model is not guaranteed to generate functional, soluble, safe, synthesizable, or experimentally successful proteins.
+The model should not be used for irresponsible or harmful biological design applications.
 ## Bias, Risks, and Limitations
+ProtGPT3-MSA learns from public protein sequence and MSA datasets and may reproduce biases present in those datasets. The model depends on the quality, relevance, and diversity of the homologous sequences provided in the prompt. Poor, unrelated, noisy, contaminated, or incorrectly aligned prompts may reduce generation quality.
+Generated sequences may be nonfunctional, unstable, insoluble, repetitive, low-complexity, or biologically implausible. As with other generative protein models, ProtGPT3-MSA may present dual-use risks if applied irresponsibly.
 ### Recommendations
+Users should provide high-quality homologous protein sequences and validate generated sequences with appropriate downstream computational and experimental methods. For family-conditioned generation, users should carefully curate prompts and assess generated sequences using task-relevant criteria such as sequence identity, structural confidence, family-level consistency, solubility, and functional plausibility.
 ## How to Get Started with the Model
+Install dependencies:
+```bash
+pip install transformers accelerate torch
+```
+Load the model and tokenizer:
+```python
+import torch
+from transformers import AutoTokenizer, AutoModelForCausalLM
+model_id = "protgpt3/ProtGPT3-MSA"  # Replace with the final checkpoint name
+tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
+model = AutoModelForCausalLM.from_pretrained(
+    model_id,
+    torch_dtype=torch.bfloat16,
+    device_map="auto",
+    trust_remote_code=True,
+)
+model.eval()
+```
+### Few-shot generation with unaligned homologs
+Use the `<no_gap>` modality token for unaligned sequences. Separate homologous sequences with the `<s>` separator token.
+```python
+import torch
+homologs = [
+    "MKTAYIAKQRQISFVKSHFSRQDILD",
+    "MKTVYIAKQRQISFVKSHFSRQDILD",
+    "MKTAYIAKQRQINNVKSHFSRQNILD",
+    # Add up to 15 homologous protein sequences
+]
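+# Build the prompt: modality token, homologs joined by the separator, and a
+# trailing separator so that the model starts a new sequence.
+prompt = "<no_gap>" + "<s>".join(homologs) + "<s>"
+# Assumes `model` and `tokenizer` were loaded as in the snippet above.
+inputs = tokenizer(prompt, return_tensors="pt").to(model.device)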
+with torch.no_grad():
+    output_ids = model.generate(
+        **inputs,
+        max_new_tokens=512,
+        do_sample=True,
+        temperature=0.8,
+        top_p=0.9,
+        eos_token_id=tokenizer.eos_token_id,
+        pad_token_id=tokenizer.eos_token_id,
+    )
+generated = tokenizer.decode(output_ids[0], skip_special_tokens=False)
+print(generated)
+```
+### Few-shot generation with aligned homologs
+Use the `<gap>` modality token for aligned sequences. Gap characters may be included in the prompted sequences.
+```python
+import torch
+aligned_homologs = [
+    "MKTAYIAKQRQI--SFVKSHFSRQDILD",
+    "MKTVYIAKQRQI--SFVKSHFSRQDILD",
+    "MKTAYIAKQRQINNSFVKSHFSRQNILD",
+]
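+# Same prompt layout as the unaligned case, but with the aligned modality token.
+prompt = "<gap>" + "<s>".join(aligned_homologs) + "<s>"
+# Assumes `model` and `tokenizer` were loaded as in the snippet above.
+inputs = tokenizer(prompt, return_tensors="pt").to(model.device)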
+with torch.no_grad():
+    output_ids = model.generate(
+        **inputs,
+        max_new_tokens=512,
+        do_sample=True,
+        temperature=0.8,
+        top_p=0.9,
+        eos_token_id=tokenizer.eos_token_id,
+        pad_token_id=tokenizer.eos_token_id,
+    )
+generated = tokenizer.decode(output_ids[0], skip_special_tokens=False)
+print(generated)
+```
+### Extracting the newly generated sequence
+For decoder-only models, `generate` returns the prompt tokens followed by the continuation. A simple post-processing approach is to decode only the newly generated tokens and keep the text up to the first sequence separator, so that any further sequences the model starts afterwards are ignored:
+```python
+# Decode only the tokens generated after the prompt.
+prompt_length = inputs["input_ids"].shape[1]
+decoded = tokenizer.decode(output_ids[0, prompt_length:], skip_special_tokens=False)
+# Keep the first new segment and strip the end-of-sequence token if present.
+generated_sequence = decoded.split("<s>")[0].replace(tokenizer.eos_token or "", "").strip()
+print(generated_sequence)
+```
+### Notes on prompting
+- Use `<no_gap>` for unaligned homologous sequences.
+- Use `<gap>` for aligned MSA-style inputs containing gap characters.
+- Separate protein sequences with `<s>`.
+- Provide up to 15 homologous sequences as context.
+- Sampling parameters such as `temperature` and `top_p` can affect sequence quality, diversity, and family consistency.
+- Generated sequences should be validated before experimental use.
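The prompting rules above can be collected into a small helper. This is a minimal sketch and not part of the released code: the function name `build_prompt`, the equal-length check for aligned input, and the removal of `-` characters in unaligned mode are assumptions. Only the `<gap>`, `<no_gap>`, and `<s>` tokens and the 15-sequence limit come from this card.

```python
from typing import Sequence

MAX_CONTEXT_SEQUENCES = 15  # limit stated in this card
SEPARATOR = "<s>"


def build_prompt(homologs: Sequence[str], aligned: bool = False) -> str:
    """Assemble a ProtGPT3-MSA prompt string (hypothetical helper)."""
    if not homologs:
        raise ValueError("Provide at least one homologous sequence.")
    if len(homologs) > MAX_CONTEXT_SEQUENCES:
        raise ValueError(
            f"Provide at most {MAX_CONTEXT_SEQUENCES} homologous sequences."
        )
    if aligned:
        # Assumption: rows taken from one alignment share a single length.
        if len({len(seq) for seq in homologs}) != 1:
            raise ValueError("Aligned sequences must all have the same length.")
        modality = "<gap>"
        sequences = list(homologs)
    else:
        # Assumption: gap characters left over from an alignment are dropped.
        modality = "<no_gap>"
        sequences = [seq.replace("-", "") for seq in homologs]
    # Trailing separator so that the model begins a new sequence.
    return modality + SEPARATOR.join(sequences) + SEPARATOR


print(build_prompt(["MKTAYIAK", "MKTVYIAK"]))
```

The returned string can be passed to the tokenizer in place of the hand-built `prompt` in the examples above.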
+## Training Details
+### Training Data
+ProtGPT3-MSA was trained on approximately 8.5M MSAs from the OpenProteinSet Uniclust30 dataset. From each MSA, 16 sequences were sampled without replacement and concatenated in random order. This process was repeated 15 times for each MSA, resulting in approximately 560B training tokens.
+### Training Procedure
+#### Preprocessing
+Each training example consisted of 16 concatenated protein sequences sampled from the same MSA. A special sequence separator token, `<s>`, was used to mark sequence boundaries.
+Training included both aligned and unaligned modalities:
+- `<gap>`: aligned modality, where sequences include gap tokens
+- `<no_gap>`: unaligned modality, where sequences are provided without gaps
+The model was trained autoregressively to predict concatenated protein sequences token by token.
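The preprocessing described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the function name `make_training_example`, the handling of MSAs with fewer than 16 rows, the way gaps are stripped for the unaligned modality, and the absence of a trailing separator are not specified in this card.

```python
import random
from typing import List, Optional

SEQUENCES_PER_EXAMPLE = 16  # from this card
SEPARATOR = "<s>"


def make_training_example(
    msa_rows: List[str], aligned: bool, rng: Optional[random.Random] = None
) -> str:
    """Build one concatenated training string from an MSA (illustrative only)."""
    rng = rng or random.Random()
    if len(msa_rows) < SEQUENCES_PER_EXAMPLE:
        # Assumption: shallow MSAs are rejected; the card does not say.
        raise ValueError("MSA has fewer rows than sequences per example.")
    # random.sample draws without replacement and returns picks in random order.
    chosen = rng.sample(msa_rows, SEQUENCES_PER_EXAMPLE)
    if aligned:
        modality = "<gap>"
    else:
        modality = "<no_gap>"
        chosen = [row.replace("-", "") for row in chosen]
    return modality + SEPARATOR.join(chosen)


# Toy MSA with 20 distinct aligned rows of equal length.
toy_msa = ["MKT-AYIAK" + residue for residue in "ACDEFGHIKLMNPQRSTVWY"]
example = make_training_example(toy_msa, aligned=False, rng=random.Random(0))
print(example.count(SEPARATOR) + 1)  # number of concatenated sequences
```

Calling the function repeatedly on the same MSA with different random states mirrors the repeated sampling described under Training Data.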
+#### Training Hyperparameters
+- **Training regime:** bfloat16
+- **Architecture:** Mixtral-style sparse Mixture-of-Experts causal decoder
+- **Maximum sequence length:** 16,384
+- **Optimizer:** AdamW
+- **Learning rate:** 2e-4
+- **Optimizer betas:** β1 = 0.9, β2 = 0.95
+- **Weight decay:** 0.1
+- **Gradient clipping:** 1.0
+- **Gradient accumulation steps:** 16
+- **Maximum tokens per batch:** 100,000
+- **Router auxiliary loss coefficient:** 0.05
+- **Number of training GPUs:** 4
+- **Precision:** bfloat16
+#### Speeds, Sizes, Times
+- **Model size:** 112M parameters
+- **Training tokens:** Approximately 560B
+- **Training MSAs:** Approximately 8.5M
+## Evaluation
 ### Testing Data, Factors & Metrics
 #### Testing Data
+ProtGPT3-MSA was evaluated on held-out protein families, ProteinGym, DMS stability libraries, held-out validation MSAs, PDB-derived MSAs, and targeted enzyme-generation case studies.
 #### Factors
+Evaluation considered family-conditioned generation quality across different protein families, MSA depths, prompt compositions, aligned versus unaligned prompting, and sampling settings.
 #### Metrics
+Evaluation included:
+- ProteinGym Spearman correlation
+- Sequence identity to held-out reference sequences
+- Predicted structure confidence (pLDDT)
+- TM-score
+- HHM profile comparison
+- Positional KL-divergence
+- DMS hit rate
+- Computational success rate in targeted enzyme-generation case studies
+- Experimental expression and purification outcomes for selected designs
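Two of the sequence-level metrics listed above, sequence identity and positional KL-divergence, can be illustrated with a short sketch. The definitions here are assumptions chosen for illustration (identity over aligned columns excluding gap-gap pairs, and KL computed per column with add-one smoothing over 20 amino acids plus the gap character); the definitions used in the paper may differ.

```python
import math
from collections import Counter
from typing import List

ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"  # 20 amino acids plus the gap character


def sequence_identity(a: str, b: str) -> float:
    """Fraction of identical positions between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("Sequences must be aligned to the same length.")
    # Ignore columns where both sequences have a gap.
    pairs = [(x, y) for x, y in zip(a, b) if not (x == "-" and y == "-")]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)


def column_distribution(column: str, pseudocount: float = 1.0) -> List[float]:
    """Smoothed residue frequencies for one alignment column."""
    counts = Counter(column)
    known = sum(counts.get(aa, 0) for aa in ALPHABET)
    total = known + pseudocount * len(ALPHABET)
    return [(counts.get(aa, 0) + pseudocount) / total for aa in ALPHABET]


def positional_kl(reference: List[str], generated: List[str]) -> List[float]:
    """KL(reference || generated) for each alignment column, in nats."""
    length = len(reference[0])
    if any(len(seq) != length for seq in reference + generated):
        raise ValueError("All sequences must share one alignment length.")
    divergences = []
    for i in range(length):
        p = column_distribution("".join(seq[i] for seq in reference))
        q = column_distribution("".join(seq[i] for seq in generated))
        divergences.append(sum(pi * math.log(pi / qi) for pi, qi in zip(p, q)))
    return divergences


reference = ["MKTAY", "MKTVY", "MKSAY"]
generated = ["MKTAY", "MRTAY", "MKTAF"]
print(sequence_identity(reference[0], generated[1]))
print(positional_kl(reference, generated))
```

Columns where generated and reference sequences have the same residue frequencies give a divergence of zero, and larger values indicate positions where the generated set departs from the family profile.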
 ### Results
+ProtGPT3-MSA supports family-conditioned generation from small sets of homologous sequences. In the paper, prompting ProtGPT3-MSA with as few as 15 homologs produced family-consistent generations and compared favorably to supervised fine-tuning of single-sequence models.
+In a low-data defluorinase case study using seven experimentally annotated sequences, ProtGPT3-MSA achieved substantially higher computational success rates than fine-tuned single-sequence baselines, and selected designs were expressed and soluble when tested experimentally.
+#### Summary
+ProtGPT3-MSA enables prompt-based protein family conditioning without updating model weights. This makes it suitable for low-data protein design settings where a small number of homologous sequences are available.
+## Model Examination
+ProtGPT3-MSA was examined for few-shot family-conditioned generation, aligned versus unaligned prompting, prompt ensembling, stability-aware generation, and inference-time steering using Feynman-Kac-style sequential Monte Carlo sampling.
 ## Environmental Impact
+Carbon emissions can be estimated using the Machine Learning Impact calculator.
+- **Hardware Type:** NVIDIA H100 GPUs
+- **Hours used:** More Information Needed
+- **Cloud Provider:** More Information Needed
+- **Compute Region:** More Information Needed
+- **Carbon Emitted:** More Information Needed
+## Technical Specifications
 ### Model Architecture and Objective
+ProtGPT3-MSA is a decoder-only autoregressive protein language model using a Mixtral-style sparse Mixture-of-Experts architecture. It was trained to model concatenated sets of related protein sequences, enabling homolog-conditioned generation through prompting.
+The model processes up to 16 concatenated protein sequences and supports both aligned and unaligned modalities. During inference, users may provide up to 15 homologous sequences and generate an additional sequence conditioned on the prompt.
+### Compute Infrastructure
 #### Hardware
+The model was trained on NVIDIA H100 GPUs.
 #### Software
+Training used FlashAttention-2, online mini-batch packing, Liger Kernel, and DeepSpeed.
+## Citation
 **BibTeX:**
+```bibtex
+@article{protgpt3,
+  title={ProtGPT3: an Open-source family of Promptable and Aligned Protein Language Models},
+  author={Anonymous Authors},
+  year={2026}
+}
+```
 **APA:**
+Anonymous Authors. (2026). *ProtGPT3: an Open-source family of Promptable and Aligned Protein Language Models*.
+## Glossary
+- **MSA:** Multiple sequence alignment, a collection of related protein sequences aligned by residue position.
+- **Homologs:** Evolutionarily related protein sequences.
+- **Few-shot prompting:** Conditioning a model on a small number of examples at inference time without updating model weights.
+- **Causal language modeling:** Autoregressive prediction of the next token given previous tokens.
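+- **TM-score:** Template modeling score, a length-normalized measure of similarity between two protein structures that ranges from 0 to 1.
+- **pLDDT:** Predicted local distance difference test, a per-residue confidence score for a predicted structure.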
+- **KL-divergence:** A measure of difference between probability distributions, used here to compare generated and reference family residue distributions.
+## More Information
+All models and code are released through the Hugging Face ecosystem and the accompanying code repository.
+## Model Card Authors
+Anonymous authors
 ## Model Card Contact
+More Information Needed