protgpt3 committed on
Commit 32fba5f · verified · 1 Parent(s): 49b896a

Update README.md

Files changed (1)
  1. README.md +221 -95
README.md CHANGED
@@ -1,199 +1,325 @@
1
  ---
2
  library_name: transformers
3
- tags: []
 
 
 
 
 
 
 
 
 
 
4
  ---
5
 
6
- # Model Card for Model ID
7
-
8
- <!-- Provide a quick summary of what the model is/does. -->
9
-
10
-
11
 
12
  ## Model Details
13
 
14
  ### Model Description
15
 
16
- <!-- Provide a longer summary of what this model is. -->
17
 
18
- This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
19
 
20
- - **Developed by:** [More Information Needed]
21
- - **Funded by [optional]:** [More Information Needed]
22
- - **Shared by [optional]:** [More Information Needed]
23
- - **Model type:** [More Information Needed]
24
- - **Language(s) (NLP):** [More Information Needed]
25
- - **License:** [More Information Needed]
26
- - **Finetuned from model [optional]:** [More Information Needed]
27
 
28
- ### Model Sources [optional]
 
 
 
 
29
 
30
- <!-- Provide the basic links for the model. -->
31
 
32
- - **Repository:** [More Information Needed]
33
- - **Paper [optional]:** [More Information Needed]
34
- - **Demo [optional]:** [More Information Needed]
35
 
36
  ## Uses
37
 
38
- <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
39
-
40
  ### Direct Use
41
 
42
- <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
43
 
44
- [More Information Needed]
45
 
46
- ### Downstream Use [optional]
47
-
48
- <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
49
-
50
- [More Information Needed]
51
 
52
  ### Out-of-Scope Use
53
 
54
- <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
55
 
56
- [More Information Needed]
57
 
58
  ## Bias, Risks, and Limitations
59
 
60
- <!-- This section is meant to convey both technical and sociotechnical limitations. -->
61
 
62
- [More Information Needed]
63
 
64
  ### Recommendations
65
 
66
- <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
67
-
68
- Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
69
 
70
  ## How to Get Started with the Model
71
 
72
- Use the code below to get started with the model.
73
 
74
- [More Information Needed]
 
 
75
 
76
- ## Training Details
77
 
78
- ### Training Data
 
 
79
 
80
- <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
81
 
82
- [More Information Needed]
 
 
 
 
 
 
83
 
84
- ### Training Procedure
 
85
 
86
- <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
87
 
88
- #### Preprocessing [optional]
89
 
90
- [More Information Needed]
 
91
 
 
 
 
 
 
 
92
 
93
- #### Training Hyperparameters
94
 
95
- - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
96
 
97
- #### Speeds, Sizes, Times [optional]
 
 
 
 
 
 
 
 
 
98
 
99
- <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
 
 
100
 
101
- [More Information Needed]
102
 
103
- ## Evaluation
 
 
 
104
 
105
- <!-- This section describes the evaluation protocols and provides the results. -->
106
 
107
  ### Testing Data, Factors & Metrics
108
 
109
  #### Testing Data
110
 
111
- <!-- This should link to a Dataset Card if possible. -->
112
-
113
- [More Information Needed]
114
 
115
  #### Factors
116
 
117
- <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
118
-
119
- [More Information Needed]
120
 
121
  #### Metrics
122
 
123
- <!-- These are the evaluation metrics being used, ideally with a description of why. -->
124
 
125
- [More Information Needed]
 
 
 
 
 
 
 
 
126
 
127
  ### Results
128
 
129
- [More Information Needed]
130
-
131
- #### Summary
132
 
 
133
 
 
134
 
135
- ## Model Examination [optional]
136
 
137
- <!-- Relevant interpretability work for the model goes here -->
138
 
139
- [More Information Needed]
140
 
141
  ## Environmental Impact
142
 
143
- <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
144
-
145
- Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
146
 
147
- - **Hardware Type:** [More Information Needed]
148
- - **Hours used:** [More Information Needed]
149
- - **Cloud Provider:** [More Information Needed]
150
- - **Compute Region:** [More Information Needed]
151
- - **Carbon Emitted:** [More Information Needed]
152
 
153
- ## Technical Specifications [optional]
154
 
155
  ### Model Architecture and Objective
156
 
157
- [More Information Needed]
158
 
159
- ### Compute Infrastructure
160
 
161
- [More Information Needed]
162
 
163
  #### Hardware
164
 
165
- [More Information Needed]
166
 
167
  #### Software
168
 
169
- [More Information Needed]
170
-
171
- ## Citation [optional]
172
 
173
- <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
174
 
175
  **BibTeX:**
176
 
177
- [More Information Needed]
 
 
 
 
 
 
178
 
179
  **APA:**
180
 
181
- [More Information Needed]
182
-
183
- ## Glossary [optional]
184
 
185
- <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
186
 
187
- [More Information Needed]
 
 
 
 
 
 
188
 
189
- ## More Information [optional]
190
 
191
- [More Information Needed]
192
 
193
- ## Model Card Authors [optional]
194
 
195
- [More Information Needed]
196
 
197
  ## Model Card Contact
198
 
199
- [More Information Needed]
 
1
  ---
2
  library_name: transformers
3
+ tags:
4
+ - biology
5
+ - protein-language-model
6
+ - protein-generation
7
+ - msa
8
+ - multiple-sequence-alignment
9
+ - few-shot-prompting
10
+ - homolog-conditioned-generation
11
+ - causal-lm
12
+ - mixture-of-experts
13
+ - transformers
14
  ---
15
 
16
+ # Model Card for ProtGPT3-MSA
 
 
 
 
17
 
18
  ## Model Details
19
 
20
  ### Model Description
21
 
22
+ ProtGPT3-MSA is a multiple-sequence, homolog-conditioned autoregressive protein language model. It is part of the ProtGPT3 family, an open-source suite of promptable and aligned protein language models for protein sequence generation.
23
 
24
+ Unlike the single-sequence ProtGPT3 checkpoints, ProtGPT3-MSA can be prompted with sets of homologous protein sequences, enabling few-shot, family-conditioned protein generation without task-specific fine-tuning. At inference time, users can provide homologous protein sequences as context and generate additional family-consistent sequences.
25
 
26
+ ProtGPT3-MSA was initialized from the final ProtGPT3-112M training checkpoint and further trained to autoregressively model sets of 16 concatenated protein sequences. The model supports both aligned and unaligned prompting modes.
 
 
 
 
 
 
27
 
28
+ - **Developed by:** Anonymous authors
29
+ - **Model type:** Autoregressive MSA-promptable protein language model
30
+ - **Language(s):** Protein sequences / amino-acid sequences
31
+ - **License:** More Information Needed
32
+ - **Finetuned from model:** ProtGPT3-112M
33
 
34
+ ### Model Sources
35
 
36
+ - **Repository:** https://huggingface.co/protgpt3
37
+ - **Paper:** ProtGPT3: an Open-source family of Promptable and Aligned Protein Language Models
38
+ - **Code:** https://anonymous.4open.science/r/protGPT3-2053/README.md
39
 
40
  ## Uses
41
 
 
 
42
  ### Direct Use
43
 
44
+ ProtGPT3-MSA is intended for few-shot, homolog-conditioned protein sequence generation. Users can prompt the model with related protein sequences from a target protein family to generate additional family-consistent sequences.
45
 
46
+ ### Downstream Use
47
 
48
+ ProtGPT3-MSA can be used in protein design workflows where users have a small set of homologous sequences and want to generate plausible additional sequences from the same family. It may be combined with computational screening, structural prediction, fitness prediction, solubility filtering, or other downstream validation pipelines.
 
 
 
 
49
 
50
  ### Out-of-Scope Use
51
 
52
+ The model should not be used as the sole basis for experimental, clinical, environmental, or safety-critical decisions. Generated sequences require downstream computational and experimental validation. The model is not guaranteed to generate functional, soluble, safe, synthesizable, or experimentally successful proteins.
53
 
54
+ The model should not be used for irresponsible or harmful biological design applications.
55
 
56
  ## Bias, Risks, and Limitations
57
 
58
+ ProtGPT3-MSA learns from public protein sequence and MSA datasets and may reproduce biases present in those datasets. The model depends on the quality, relevance, and diversity of the homologous sequences provided in the prompt. Poor, unrelated, noisy, contaminated, or incorrectly aligned prompts may reduce generation quality.
59
 
60
+ Generated sequences may be nonfunctional, unstable, insoluble, repetitive, low-complexity, or biologically implausible. As with other generative protein models, ProtGPT3-MSA may present dual-use risks if applied irresponsibly.
61
 
62
  ### Recommendations
63
 
64
+ Users should provide high-quality homologous protein sequences and validate generated sequences with appropriate downstream computational and experimental methods. For family-conditioned generation, users should carefully curate prompts and assess generated sequences using task-relevant criteria such as sequence identity, structural confidence, family-level consistency, solubility, and functional plausibility.
 
 
65
 
66
  ## How to Get Started with the Model
67
 
68
+ Install dependencies:
69
 
70
+ ```bash
71
+ pip install transformers accelerate torch
72
+ ```
73
 
74
+ Load the model and tokenizer:
75
 
+ ```python
+ import torch
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+
+ model_id = "protgpt3/ProtGPT3-MSA"  # Replace with the final checkpoint name
+
+ tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
+ model = AutoModelForCausalLM.from_pretrained(
+     model_id,
+     torch_dtype=torch.bfloat16,
+     device_map="auto",
+     trust_remote_code=True,
+ )
+
+ model.eval()
+ ```
92
 
93
+ ### Few-shot generation with unaligned homologs
94
 
95
+ Use the `<no_gap>` modality token for unaligned sequences. Separate homologous sequences with the `<s>` separator token.
96
 
+ ```python
+ import torch
+
+ homologs = [
+     "MKTAYIAKQRQISFVKSHFSRQDILD",
+     "MKTVYIAKQRQISFVKSHFSRQDILD",
+     "MKTAYIAKQRQINNVKSHFSRQNILD",
+     # Add up to 15 homologous protein sequences
+ ]
+
+ prompt = "<no_gap>" + "<s>".join(homologs) + "<s>"
+
+ inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
+
+ with torch.no_grad():
+     output_ids = model.generate(
+         **inputs,
+         max_new_tokens=512,
+         do_sample=True,
+         temperature=0.8,
+         top_p=0.9,
+         eos_token_id=tokenizer.eos_token_id,
+         pad_token_id=tokenizer.eos_token_id,
+     )
+
+ generated = tokenizer.decode(output_ids[0], skip_special_tokens=False)
+ print(generated)
+ ```
125
 
126
+ ### Few-shot generation with aligned homologs
127
 
128
+ Use the `<gap>` modality token for aligned sequences. Gap characters may be included in the prompted sequences.
129
+
+ ```python
+ import torch
+
+ aligned_homologs = [
+     "MKTAYIAKQRQI--SFVKSHFSRQDILD",
+     "MKTVYIAKQRQI--SFVKSHFSRQDILD",
+     "MKTAYIAKQRQINNSFVKSHFSRQNILD",
+ ]
+
+ prompt = "<gap>" + "<s>".join(aligned_homologs) + "<s>"
+
+ inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
+
+ with torch.no_grad():
+     output_ids = model.generate(
+         **inputs,
+         max_new_tokens=512,
+         do_sample=True,
+         temperature=0.8,
+         top_p=0.9,
+         eos_token_id=tokenizer.eos_token_id,
+         pad_token_id=tokenizer.eos_token_id,
+     )
+
+ generated = tokenizer.decode(output_ids[0], skip_special_tokens=False)
+ print(generated)
+ ```
157
+
158
+ ### Extracting the newly generated sequence
159
+
160
+ Depending on tokenizer behavior and special-token handling, the decoded output may include the full prompt plus the continuation. A simple post-processing approach is to split on the sequence separator token and inspect the final generated segment:
161
+
+ ```python
+ decoded = tokenizer.decode(output_ids[0], skip_special_tokens=False)
+
+ segments = decoded.split("<s>")
+ generated_sequence = segments[-1].replace(tokenizer.eos_token or "", "").strip()
+
+ print(generated_sequence)
+ ```
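Taking the last segment after splitting can return an empty string when the model closes its sequence with a trailing `<s>`. The sketch below is one way to guard against that, by dropping empty segments and skipping the known number of prompt sequences. The helper name `extract_new_sequences`, its parameters, and the whitespace stripping are illustrative assumptions, not part of the released code. The actual decoded format depends on the tokenizer.

```python
from typing import List, Sequence


def extract_new_sequences(
    decoded: str,
    n_prompt: int,
    sep: str = "<s>",
    eos: str = "",
    modality_tokens: Sequence[str] = ("<gap>", "<no_gap>"),
) -> List[str]:
    """Return the non-empty sequences that follow the first n_prompt prompt sequences.

    Hypothetical helper: assumes the decoded text looks like
    <modality>SEQ<s>SEQ<s>...<s>NEW, optionally ending with sep or eos.
    """
    text = decoded
    # Drop the modality marker so it does not end up inside the first segment.
    for token in modality_tokens:
        text = text.replace(token, "")
    # Drop the end-of-sequence token if it is distinct from the separator.
    if eos and eos != sep:
        text = text.replace(eos, "")
    # Some tokenizers decode with spaces between residues; remove all whitespace.
    segments = ["".join(part.split()) for part in text.split(sep)]
    segments = [segment for segment in segments if segment]
    return segments[n_prompt:]


new_sequences = extract_new_sequences(
    "<no_gap>MKTAYIAK<s>MKTVYIAK<s>MKSAYIAR<s>", n_prompt=2
)
print(new_sequences)
```

An alternative that avoids string parsing is to decode only the tokens after the prompt, for example `output_ids[0][inputs["input_ids"].shape[1]:]`, before applying any splitting.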
170
+
171
+ ### Notes on prompting
172
+
173
+ - Use `<no_gap>` for unaligned homologous sequences.
174
+ - Use `<gap>` for aligned MSA-style inputs containing gap characters.
175
+ - Separate protein sequences with `<s>`.
176
+ - Provide up to 15 homologous sequences as context.
177
+ - Sampling parameters such as `temperature` and `top_p` can affect sequence quality, diversity, and family consistency.
178
+ - Generated sequences should be validated before experimental use.
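The prompting rules above can be collected into a small validation helper. This is a minimal sketch: the function name `build_prompt`, the `-` gap character, and the requirement that aligned rows share one length are assumptions made for illustration and are not stated in the model card.

```python
from typing import Sequence

MAX_CONTEXT_SEQUENCES = 15  # the card allows up to 15 homologs as context
GAP_CHAR = "-"              # assumed gap character, as in the aligned example


def build_prompt(homologs: Sequence[str], aligned: bool) -> str:
    """Build a ProtGPT3-MSA style prompt string from homologous sequences."""
    sequences = [seq.strip() for seq in homologs]
    if not sequences or any(not seq for seq in sequences):
        raise ValueError("provide at least one non-empty sequence")
    if len(sequences) > MAX_CONTEXT_SEQUENCES:
        raise ValueError(f"at most {MAX_CONTEXT_SEQUENCES} context sequences are supported")
    if aligned:
        # Assumption: aligned rows come from one alignment and share a length.
        if len({len(seq) for seq in sequences}) != 1:
            raise ValueError("aligned sequences must all have the same length")
        modality = "<gap>"
    else:
        if any(GAP_CHAR in seq for seq in sequences):
            raise ValueError("unaligned sequences must not contain gap characters")
        modality = "<no_gap>"
    # A trailing separator asks the model to start a new sequence.
    return modality + "<s>".join(sequences) + "<s>"


print(build_prompt(["MKTAYIAK", "MKTVYIAK"], aligned=False))
```

Failing early on a malformed prompt is cheaper than sampling from it and discovering the problem in downstream filtering.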
179
+
180
+ ## Training Details
181
+
182
+ ### Training Data
183
+
184
+ ProtGPT3-MSA was trained on approximately 8.5M MSAs from the OpenProteinSet Uniclust30 dataset. From each MSA, 16 sequences were sampled without replacement and concatenated in random order. This process was repeated 15 times for each MSA, resulting in approximately 560B training tokens.
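As a rough consistency check on these figures, the numbers above imply an average example length well below the 16,384-token maximum. The calculation below uses only the approximate values quoted in this section and ignores special tokens, so the results are order-of-magnitude estimates rather than reported statistics.

```python
n_msas = 8_500_000          # approximate number of MSAs
repeats_per_msa = 15        # sampling rounds per MSA
sequences_per_example = 16  # sequences concatenated per training example
total_tokens = 560e9        # approximate total training tokens

n_examples = n_msas * repeats_per_msa
tokens_per_example = total_tokens / n_examples
tokens_per_sequence = tokens_per_example / sequences_per_example

print(f"examples: {n_examples:,}")
print(f"average tokens per example: {tokens_per_example:,.0f}")
print(f"average tokens per sequence: {tokens_per_sequence:,.0f}")
```

This works out to about 127.5M examples, roughly 4,400 tokens per example, and on the order of 275 tokens per sequence.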
185
+
186
+ ### Training Procedure
187
+
188
+ #### Preprocessing
189
+
190
+ Each training example consisted of 16 concatenated protein sequences sampled from the same MSA. A special sequence separator token, `<s>`, was used to mark sequence boundaries.
191
+
192
+ Training included both aligned and unaligned modalities:
193
+
194
+ - `<gap>`: aligned modality, where sequences include gap tokens
195
+ - `<no_gap>`: unaligned modality, where sequences are provided without gaps
196
+
197
+ The model was trained autoregressively to predict concatenated protein sequences token by token.
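The example construction described above can be sketched as follows. This is not the authors' pipeline: the function name, the position of the modality token at the start of each example, the trailing separator, the removal of `-` characters for the unaligned modality, and the handling of MSAs with fewer than 16 rows are all assumptions inferred from the inference prompt format.

```python
import random
from typing import List, Optional, Sequence


def sample_training_examples(
    msa_rows: Sequence[str],
    set_size: int = 16,
    repeats: int = 15,
    aligned: bool = True,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Draw `repeats` examples, each made of `set_size` rows from one MSA."""
    rng = rng or random.Random()
    if len(msa_rows) < set_size:
        # Assumption: shallow MSAs are rejected; the card does not say how they were handled.
        raise ValueError(f"need at least {set_size} sequences in the MSA")
    modality = "<gap>" if aligned else "<no_gap>"
    examples = []
    for _ in range(repeats):
        # random.sample draws without replacement and returns rows in random order.
        chosen = rng.sample(list(msa_rows), set_size)
        if not aligned:
            # Unaligned modality: strip gap characters from every row.
            chosen = [row.replace("-", "") for row in chosen]
        examples.append(modality + "<s>".join(chosen) + "<s>")
    return examples


toy_msa = ["MK" + "A" * i + "-" * (20 - i) for i in range(20)]
toy_examples = sample_training_examples(toy_msa, rng=random.Random(0))
print(len(toy_examples), toy_examples[0][:40])
```

Each of the 15 draws is independent here, so the same row can appear in more than one example from the same MSA but never twice within one example.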
198
+
199
+ #### Training Hyperparameters
200
+
201
+ - **Training regime:** bfloat16
202
+ - **Architecture:** Mixtral-style sparse Mixture-of-Experts causal decoder
203
+ - **Maximum sequence length:** 16,384
204
+ - **Optimizer:** AdamW
205
+ - **Learning rate:** 2e-4
206
+ - **Optimizer betas:** β1 = 0.9, β2 = 0.95
207
+ - **Weight decay:** 0.1
208
+ - **Gradient clipping:** 1.0
209
+ - **Gradient accumulation steps:** 16
210
+ - **Maximum tokens per batch:** 100,000
211
+ - **Router auxiliary loss coefficient:** 0.05
212
+ - **Number of training GPUs:** 4
213
+ - **Precision:** bfloat16
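For readers who want these settings in machine-readable form, the values listed above can be written as a small configuration object. The class and field names are illustrative and do not come from the training code. The derived quantity assumes the 100,000-token cap applies per GPU and per micro-batch, which the card does not state, so it should be read as an upper bound under that assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters as listed in this model card (field names are hypothetical)."""

    learning_rate: float = 2e-4
    adam_beta1: float = 0.9
    adam_beta2: float = 0.95
    weight_decay: float = 0.1
    max_grad_norm: float = 1.0
    gradient_accumulation_steps: int = 16
    max_tokens_per_batch: int = 100_000
    router_aux_loss_coef: float = 0.05
    num_gpus: int = 4
    max_sequence_length: int = 16_384
    precision: str = "bfloat16"

    def max_tokens_per_optimizer_step(self) -> int:
        # ASSUMPTION: the token cap is per GPU and per micro-batch.
        return (
            self.max_tokens_per_batch
            * self.gradient_accumulation_steps
            * self.num_gpus
        )


config = TrainConfig()
print(f"{config.max_tokens_per_optimizer_step():,} tokens per optimizer step (upper bound)")
```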
214
+
215
+ #### Speeds, Sizes, Times
216
+
217
+ - **Model size:** 112M parameters
218
+ - **Training tokens:** Approximately 560B
219
+ - **Training MSAs:** Approximately 8.5M
220
+
221
+ ## Evaluation
222
 
223
  ### Testing Data, Factors & Metrics
224
 
225
  #### Testing Data
226
 
227
+ ProtGPT3-MSA was evaluated on held-out protein families, ProteinGym, DMS stability libraries, held-out validation MSAs, PDB-derived MSAs, and targeted enzyme-generation case studies.
 
 
228
 
229
  #### Factors
230
 
231
+ Evaluation considered family-conditioned generation quality across different protein families, MSA depths, prompt compositions, aligned versus unaligned prompting, and sampling settings.
 
 
232
 
233
  #### Metrics
234
 
235
+ Evaluation included:
236
 
237
+ - ProteinGym Spearman correlation
238
+ - Sequence identity to held-out reference sequences
239
+ - Predicted structure confidence
240
+ - TM-score
241
+ - HHM profile comparison
242
+ - Positional KL-divergence
243
+ - DMS hit rate
244
+ - Computational success rate in targeted enzyme-generation case studies
245
+ - Experimental expression and purification outcomes for selected designs
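To make the positional KL-divergence metric concrete, here is a minimal sketch of a per-column comparison between a reference alignment and generated sequences. It is not the evaluation code from the paper. It assumes the generated sequences are already aligned to the reference columns, uses the direction KL(reference || generated), and applies a pseudocount for smoothing, all of which are illustrative choices.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence

ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"  # 20 standard residues plus the gap character


def column_distribution(rows: Sequence[str], col: int, pseudocount: float = 1.0) -> Dict[str, float]:
    """Smoothed residue frequencies in one alignment column."""
    counts = Counter(row[col] for row in rows)
    # Characters outside ALPHABET (e.g. X) are ignored so the result sums to 1.
    observed = sum(counts[a] for a in ALPHABET)
    total = observed + pseudocount * len(ALPHABET)
    return {a: (counts[a] + pseudocount) / total for a in ALPHABET}


def positional_kl(reference: Sequence[str], generated: Sequence[str], pseudocount: float = 1.0) -> List[float]:
    """KL(reference || generated) for every alignment column, in nats."""
    width = len(reference[0])
    if any(len(row) != width for row in list(reference) + list(generated)):
        raise ValueError("all rows must share the same alignment width")
    divergences = []
    for col in range(width):
        p = column_distribution(reference, col, pseudocount)
        q = column_distribution(generated, col, pseudocount)
        divergences.append(sum(p[a] * math.log(p[a] / q[a]) for a in ALPHABET))
    return divergences


reference_rows = ["MKTA", "MKTV", "MKSA"]
generated_rows = ["MKTA", "MRTA", "MKTA"]
print([round(value, 4) for value in positional_kl(reference_rows, generated_rows)])
```

Columns where the generated sequences reproduce the reference residue frequencies score zero, and columns where they diverge score higher.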
246
 
247
  ### Results
248
 
249
+ ProtGPT3-MSA supports family-conditioned generation from small sets of homologous sequences. In the paper, prompting ProtGPT3-MSA with as few as 15 homologs produced family-consistent generations and compared favorably to supervised fine-tuning of single-sequence models.
 
 
250
 
251
+ In a low-data defluorinase case study using seven experimentally annotated sequences, ProtGPT3-MSA achieved substantially higher computational success rates than fine-tuned single-sequence baselines, and selected designs were expressed and soluble in experimental validation.
252
 
253
+ #### Summary
254
 
255
+ ProtGPT3-MSA enables prompt-based protein family conditioning without updating model weights. This makes it suitable for low-data protein design settings where a small number of homologous sequences are available.
256
 
257
+ ## Model Examination
258
 
259
+ ProtGPT3-MSA was examined for few-shot family-conditioned generation, aligned versus unaligned prompting, prompt ensembling, stability-aware generation, and inference-time steering using Feynman-Kac-style sequential Monte Carlo sampling.
260
 
261
  ## Environmental Impact
262
 
263
+ Carbon emissions can be estimated using the Machine Learning Impact calculator.
 
 
264
 
265
+ - **Hardware Type:** NVIDIA H100 GPUs
266
+ - **Hours used:** More Information Needed
267
+ - **Cloud Provider:** More Information Needed
268
+ - **Compute Region:** More Information Needed
269
+ - **Carbon Emitted:** More Information Needed
270
 
271
+ ## Technical Specifications
272
 
273
  ### Model Architecture and Objective
274
 
275
+ ProtGPT3-MSA is a decoder-only autoregressive protein language model using a Mixtral-style sparse Mixture-of-Experts architecture. It was trained to model concatenated sets of related protein sequences, enabling homolog-conditioned generation through prompting.
276
 
277
+ The model processes up to 16 concatenated protein sequences and supports both aligned and unaligned modalities. During inference, users may provide up to 15 homologous sequences and generate an additional sequence conditioned on the prompt.
278
 
279
+ ### Compute Infrastructure
280
 
281
  #### Hardware
282
 
283
+ The model was trained on NVIDIA H100 GPUs.
284
 
285
  #### Software
286
 
287
+ Training used FlashAttention-2, online mini-batch packing, Liger Kernel, and DeepSpeed.
 
 
288
 
289
+ ## Citation
290
 
291
  **BibTeX:**
292
 
293
+ ```bibtex
294
+ @article{protgpt3,
295
+ title={ProtGPT3: an Open-source family of Promptable and Aligned Protein Language Models},
296
+ author={Anonymous Authors},
297
+ year={2026}
298
+ }
299
+ ```
300
 
301
  **APA:**
302
 
303
+ Anonymous Authors. (2026). *ProtGPT3: an Open-source family of Promptable and Aligned Protein Language Models*.
 
 
304
 
305
+ ## Glossary
306
 
307
+ - **MSA:** Multiple sequence alignment, a collection of related protein sequences aligned by residue position.
308
+ - **Homologs:** Evolutionarily related protein sequences.
309
+ - **Few-shot prompting:** Conditioning a model on a small number of examples at inference time without updating model weights.
310
+ - **Causal language modeling:** Autoregressive prediction of the next token given previous tokens.
311
+ - **TM-score:** A metric for structural similarity between protein structures.
312
+ - **pLDDT:** A predicted local structure confidence score.
313
+ - **KL-divergence:** A measure of difference between probability distributions, used here to compare generated and reference family residue distributions.
314
 
315
+ ## More Information
316
 
317
+ All models and code are released through the Hugging Face ecosystem and accompanying code repository.
318
 
319
+ ## Model Card Authors
320
 
321
+ Anonymous authors
322
 
323
  ## Model Card Contact
324
 
325
+ More Information Needed