lhallee committed on
Commit
de49bf5
·
verified ·
1 Parent(s): bd808c8

Upload README.md with huggingface_hub

  1. README.md +391 -199
README.md CHANGED
@@ -1,199 +1,391 @@
1
- ---
2
- library_name: transformers
3
- tags: []
4
- ---
5
-
6
- # Model Card for Model ID
7
-
8
- <!-- Provide a quick summary of what the model is/does. -->
9
-
10
-
11
-
12
- ## Model Details
13
-
14
- ### Model Description
15
-
16
- <!-- Provide a longer summary of what this model is. -->
17
-
18
- This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
19
-
20
- - **Developed by:** [More Information Needed]
21
- - **Funded by [optional]:** [More Information Needed]
22
- - **Shared by [optional]:** [More Information Needed]
23
- - **Model type:** [More Information Needed]
24
- - **Language(s) (NLP):** [More Information Needed]
25
- - **License:** [More Information Needed]
26
- - **Finetuned from model [optional]:** [More Information Needed]
27
-
28
- ### Model Sources [optional]
29
-
30
- <!-- Provide the basic links for the model. -->
31
-
32
- - **Repository:** [More Information Needed]
33
- - **Paper [optional]:** [More Information Needed]
34
- - **Demo [optional]:** [More Information Needed]
35
-
36
- ## Uses
37
-
38
- <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
39
-
40
- ### Direct Use
41
-
42
- <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
43
-
44
- [More Information Needed]
45
-
46
- ### Downstream Use [optional]
47
-
48
- <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
49
-
50
- [More Information Needed]
51
-
52
- ### Out-of-Scope Use
53
-
54
- <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
55
-
56
- [More Information Needed]
57
-
58
- ## Bias, Risks, and Limitations
59
-
60
- <!-- This section is meant to convey both technical and sociotechnical limitations. -->
61
-
62
- [More Information Needed]
63
-
64
- ### Recommendations
65
-
66
- <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
67
-
68
- Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
69
-
70
- ## How to Get Started with the Model
71
-
72
- Use the code below to get started with the model.
73
-
74
- [More Information Needed]
75
-
76
- ## Training Details
77
-
78
- ### Training Data
79
-
80
- <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
81
-
82
- [More Information Needed]
83
-
84
- ### Training Procedure
85
-
86
- <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
87
-
88
- #### Preprocessing [optional]
89
-
90
- [More Information Needed]
91
-
92
-
93
- #### Training Hyperparameters
94
-
95
- - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
96
-
97
- #### Speeds, Sizes, Times [optional]
98
-
99
- <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
100
-
101
- [More Information Needed]
102
-
103
- ## Evaluation
104
-
105
- <!-- This section describes the evaluation protocols and provides the results. -->
106
-
107
- ### Testing Data, Factors & Metrics
108
-
109
- #### Testing Data
110
-
111
- <!-- This should link to a Dataset Card if possible. -->
112
-
113
- [More Information Needed]
114
-
115
- #### Factors
116
-
117
- <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
118
-
119
- [More Information Needed]
120
-
121
- #### Metrics
122
-
123
- <!-- These are the evaluation metrics being used, ideally with a description of why. -->
124
-
125
- [More Information Needed]
126
-
127
- ### Results
128
-
129
- [More Information Needed]
130
-
131
- #### Summary
132
-
133
-
134
-
135
- ## Model Examination [optional]
136
-
137
- <!-- Relevant interpretability work for the model goes here -->
138
-
139
- [More Information Needed]
140
-
141
- ## Environmental Impact
142
-
143
- <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
144
-
145
- Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
146
-
147
- - **Hardware Type:** [More Information Needed]
148
- - **Hours used:** [More Information Needed]
149
- - **Cloud Provider:** [More Information Needed]
150
- - **Compute Region:** [More Information Needed]
151
- - **Carbon Emitted:** [More Information Needed]
152
-
153
- ## Technical Specifications [optional]
154
-
155
- ### Model Architecture and Objective
156
-
157
- [More Information Needed]
158
-
159
- ### Compute Infrastructure
160
-
161
- [More Information Needed]
162
-
163
- #### Hardware
164
-
165
- [More Information Needed]
166
-
167
- #### Software
168
-
169
- [More Information Needed]
170
-
171
- ## Citation [optional]
172
-
173
- <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
174
-
175
- **BibTeX:**
176
-
177
- [More Information Needed]
178
-
179
- **APA:**
180
-
181
- [More Information Needed]
182
-
183
- ## Glossary [optional]
184
-
185
- <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
186
-
187
- [More Information Needed]
188
-
189
- ## More Information [optional]
190
-
191
- [More Information Needed]
192
-
193
- ## Model Card Authors [optional]
194
-
195
- [More Information Needed]
196
-
197
- ## Model Card Contact
198
-
199
- [More Information Needed]
1
+ ---
2
+ library_name: transformers
3
+ tags: []
4
+ ---
5
+ # DSM: Diffusion Models for Protein Sequence Generation
6
+ ### Note: This readme is shared between our GitHub and Huggingface pages.
7
+
8
+ ## Table of Contents
9
+ - [Introduction](#introduction)
10
+ - [Models](#models)
11
+ - [Usage](#usage)
12
+ - [Demos](#demos)
13
+ - [Local installation](#installation)
14
+ - [Training](#training)
15
+ - [Evaluation](#evaluation)
16
+ - [Results](#results)
17
+ - [Cite](#cite)
18
+
19
+ ## Introduction
20
+
21
+ DSM (Diffusion Sequence Model) is a novel Protein Language Model (pLM) developed in collaboration between the Gleghorn Lab and [Synthyra](https://synthyra.com/). It was trained with masked diffusion to enable both high-quality representation learning and generative protein design. This repository contains the code for training, evaluating, and applying DSM and its variants.
22
+
23
+ DSM is capable of generating diverse, biomimetic sequences that align with expected amino acid compositions, secondary structures, and predicted functions. Furthermore, DSM's learned representations match or exceed those of comparably sized pLMs on various downstream tasks. DSM is detailed extensively in our [preprint](https://arxiv.org/abs/2506.08293), which is currently in review. Beyond the base and PPI variants, we are currently training versions that jointly diffuse over sequence and foldseek tokens, as well as [Annotation Vocabulary](https://www.biorxiv.org/content/10.1101/2024.07.30.605924v1) tokens. Since the preprint release, Synthyra has trained [Synthyra/DSM_ppi_full](https://huggingface.co/Synthyra/DSM_ppi_full), which replaces the LoRA PPI training with full fine-tuning. Additionally, SeqA and SeqB are jointly masked, instead of only SeqB as in the original version. We plan to add the **many** new results to the second version of the preprint and the eventual journal article.
24
+
25
+ ## Models
26
+
27
+ Relevant Huggingface hosted models and datasets
28
+
29
+ - **Base DSM Models**:
30
+ - [GleghornLab/DSM_150](https://huggingface.co/GleghornLab/DSM_150) - 150M parameter DSM model
31
+ - [GleghornLab/DSM_650](https://huggingface.co/GleghornLab/DSM_650) - 650M parameter DSM model
32
+
33
+ - **DSM-ppi Models**:
34
+ (LoRA versions - results reported in paper but not recommended for real use)
35
+ - [GleghornLab/DSM_150_ppi_lora](https://huggingface.co/GleghornLab/DSM_150_ppi_lora) - 150M parameter LoRA DSM-ppi model
36
+ - [GleghornLab/DSM_650_ppi_lora](https://huggingface.co/GleghornLab/DSM_650_ppi_Lora) - 650M parameter LoRA DSM-ppi model
37
+ - [GleghornLab/DSM_150_ppi_control](https://huggingface.co/GleghornLab/DSM_150_ppi_control) - Control version of LoRA DSM-ppi
38
+ (Fully finetuned - recommended for real use)
39
+ - [Synthyra/DSM_ppi_full](https://huggingface.co/Synthyra/DSM_ppi_full) - 650M parameter DSM-ppi model
40
+
41
+ - **Datasets**:
42
+ - [Synthyra/omg_prot50](https://huggingface.co/Synthyra/omg_prot50) - Open MetaGenomic dataset clustered at 50% identity (207M sequences)
43
+ - [GleghornLab/stringv12_modelorgs_9090](https://huggingface.co/GleghornLab/stringv12_modelorgs_9090) - STRING database model organisms (653k sequences)
44
+
45
+ - **Utility Models**:
46
+ - [GleghornLab/production_ss4_model](https://huggingface.co/GleghornLab/production_ss4_model) - Secondary structure prediction (4-class)
47
+ - [GleghornLab/production_ss9_model](https://huggingface.co/GleghornLab/production_ss9_model) - Secondary structure prediction (9-class)
48
+
49
+ ## Usage
50
+
51
+ This section outlines how to use a trained `DSM` model for common generation tasks. The core generation logic is provided by the `GenerateMixin` class, used by `DSM` models.
52
+
53
+ First, ensure you have a trained model (either one you trained or a pre-trained one from Hugging Face Hub) and the necessary environment set up.
54
+
55
+ ```python
56
+ import torch
57
+ from models.modeling_dsm import DSM # Or DSM_ppi for binder generation
58
+
59
+ # Load a pre-trained model
60
+ model_name_or_path = "GleghornLab/DSM_650" # Replace with your model of choice
61
+ device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
62
+ model = DSM.from_pretrained(model_name_or_path).to(device).eval()
63
+ tokenizer = model.tokenizer
64
+ ```
65
+
66
+ ```console
67
+ You are using a model of type esm_diff to instantiate a model of type dsm. This is not supported for all configurations of models and can yield errors.
68
+ ```
69
+ This warning is normal - all good!
70
+
71
+ ### 1. Unconditional Sequence Generation
72
+ To generate a novel sequence of a specific length, start from a masked template of that length. DSM fills it in with a progressive denoising approach.
73
+
74
+ ```python
75
+ ### Unconditional generation
76
+ length = 100
77
+ mask_token = tokenizer.mask_token
78
+ # optionally, enforce starting with methionine
79
+ input_template = tokenizer.encode('M' + ''.join([mask_token] * (length - 1)), add_special_tokens=True, return_tensors='pt').to(device)
80
+ output = model.mask_diffusion_generate(
81
+ input_tokens=input_template,
82
+ step_divisor=100, # lower is slower but better
83
+ temperature=1.0, # sampling temperature
84
+ remasking="random", # strategy for remasking tokens not kept
85
+ preview=False # set this to True to watch the mask tokens get filled in real time
86
+ )
87
+
88
+ generated_sequences = model.decode_output(output)
89
+ print(f"Generated sequence: {generated_sequences[0]}")
90
+ ```
91
+
92
+ ```console
93
+ Generated sequence: MFRVDALQVAQQETLAIGRSTAYDKQESPSMAQRQVLTQLAAYGGENDLRQICIPAERRNFLSIANGASYQFVEEDNEANGGYWSPHKAGLPESACKRFI
94
+ ```
95
+
96
+ ### 2. Mask Filling (Inpainting)
97
+ To fill in masked regions of a template sequence:
98
+
99
+ ```python
100
+ # Mask Filling / Inpainting
101
+ template_sequence = "MA<mask><mask><mask>KEG<mask><mask>STL"
102
+ template_tokens = model.tokenizer.encode(template_sequence, add_special_tokens=True, return_tensors='pt').to(device)
103
+
104
+ filled_ids = model.mask_diffusion_generate(
105
+ input_tokens=template_tokens,
106
+ step_divisor=100, # lower is slower but better
107
+ temperature=1.0, # sampling temperature
108
+ remasking="random", # strategy for remasking tokens not kept
109
+ preview=False
110
+ )
111
+
112
+ generated_sequences = model.decode_output(filled_ids)
113
+ print(f"Generated sequence: {generated_sequences[0]}")
114
+ ```
115
+
116
+ ```console
117
+ Generated sequence: MAVKFKEGGISTL
118
+ ```
119
+
120
+ ### 3. Conditional Generation (e.g., Binders - using DSM-ppi)
121
+ ```python
122
+ # from models.modeling_dsm import DSM_ppi
123
+ # model_binder = DSM_ppi.from_pretrained("GleghornLab/DSM_650_ppi_lora").to(device).eval()
124
+ # The lora version from the paper leads to unreliable outputs
125
+ # Synthyra has generously trained a version through full fine tuning
126
+ from models.modeling_dsm import DSM
127
+
128
+ model_binder = DSM.from_pretrained("Synthyra/DSM_ppi_full").to(device).eval()
129
+
130
+ # BBF-14
131
+ target_seq = "MGTPLWALLGGPWRGTATYEDGTKVTLDYRYTRVSPDRLRADVTYTTPDGTTLEATVDLWKDANGVIRYHATYPDGTSADGTLTQLDADTLLATGTYDDGTKYTVTLTRVAPGSGWHHHHHH"
132
+ # For binder generation, the 'interactor' (SeqB) part is what gets generated/filled.
133
+ # Start with a fully masked interactor of desired length.
134
+ interactor_template_len = 256
135
+ interactor_template = ''.join([mask_token] * interactor_template_len)
136
+
137
+ combined_input_str = target_seq + '<eos>' + interactor_template
138
+
139
+ input_tokens = tokenizer.encode(combined_input_str, add_special_tokens=True, return_tensors='pt').to(device)
140
+
141
+ output = model_binder.mask_diffusion_generate(
142
+ tokenizer=tokenizer,
143
+ input_tokens=input_tokens,
144
+ step_divisor=10, # lower is slower but better
145
+ temperature=1.0, # sampling temperature
146
+ remasking="random", # strategy for remasking tokens not kept
147
+ preview=False, # set this to True to watch the mask tokens get filled in real time
148
+ slow=False, # adds a small delay to the real time filling (because it is usually very fast and watching carefully is hard!)
149
+ return_trajectory=False # set this to True to return the trajectory of the generation (what you watch in the preview)
150
+ ) # Note: output will be a tuple if return_trajectory is True
151
+
152
+ target, binder = model_binder.decode_dual_input(output, seperator='<eos>')
153
+ # Parse out the generated interactor part based on EOS tokens.
154
+ # Example: generated_full_seq_str.split(model_binder.tokenizer.eos_token)[1]
155
+ print(f"Generated binder {binder[0]}")
156
+ ```
157
+
158
+ ```console
159
+ Generated binder HRHHHRRPTHARETEWLARMRLGIAEHQRIAVPRSDLEPDQMRERAADNQRLVKEYDQVIDHQTEGSTERLFEVLRVWEQVNTEQAHHEASAALEFGRVGYPDDEGGRAFYTQANAHKKDLVEYIGGIDEDAKWDPRIAWLMPEGGQPVKATVIGVSEERINGLKVLDDHWGRERRLWLINLFTALQAYDDPTRPTQVTLTPATDQLTNDVQYLLLSTRYTPPGVTTAVKIRKLDGRTLKVLTTEAPYVVRGATLS
160
+ ```
161
+
162
+ Folded with Chai1:
163
+
164
+ ![image](https://github.com/user-attachments/assets/782d7bba-6f25-4a27-b0c4-fef88565dd33)
165
+
166
+ `Synthyra/DSM_ppi_full` was actually trained to fill masks from any part of SeqA and SeqB. That means you can fully hallucinate plausibly interacting protein pairs.
167
+
168
+
169
+
170
+
171
+ ## Demos
172
+ There are various demos with many more to come. For example, in `demo_dsm_ppi_full.py` (run by `python -m demos.demo_dsm_ppi_full`) we perform a test on DSM-ppi.
173
+ We take 1000 protein pairs from BIOGRID (real protein-protein interactions) and 1000 from Negatome (non-interacting protein pairs) and mask 50% of the second sequence (SeqB).
174
+ This acts as a sanity check, as we expect the accuracy on reconstructing real positive PPIs to be higher than the accuracy on non-interacting proteins.
175
+ Indeed, this is the case:
176
+
177
+ ```console
178
+ ==================================================
179
+ RESULTS COMPARISON
180
+ ==================================================
181
+ Positive examples:
182
+ Mean accuracy: 0.495 ± 0.322
183
+ Processed: 1000 examples
184
+
185
+ Negative examples:
186
+ Mean accuracy: 0.227 ± 0.231
187
+ Processed: 1000 examples
188
+
189
+ Difference (Positive - Negative): 0.267
190
+ T-test: t=21.331, p=0.000
191
+ Difference is statistically significant (p < 0.05)
192
+ ```
193
+
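As a rough cross-check, the reported t statistic can be approximated from the summary statistics alone. This is a minimal sketch, assuming the ± values are standard deviations and using Welch's formula for two independent samples. The helper name `welch_t` is hypothetical, and the demo script may compute its test differently (for example, directly on per-example accuracies), so a small gap from the printed value is expected.

```python
import math


def welch_t(mean_a, std_a, n_a, mean_b, std_b, n_b):
    """Welch's t statistic from summary statistics of two independent samples."""
    standard_error = math.sqrt(std_a**2 / n_a + std_b**2 / n_b)
    return (mean_a - mean_b) / standard_error


# Means, standard deviations, and counts copied from the console output above.
t_stat = welch_t(0.495, 0.322, 1000, 0.227, 0.231, 1000)
print(f"approximate t = {t_stat:.2f}")
```

The result lands close to the reported `t=21.331`; the remaining difference comes from rounding in the printed means and standard deviations.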
194
+
195
+ ## Installation
196
+
197
+ 1. **Clone the repository:**
198
+ ```bash
199
+ git clone <repository-url>
200
+ cd <repository-name>
201
+ ```
202
+
203
+ 2. **Set up the Python virtual environment:**
204
+ The `setup_bioenv.sh` script creates a virtual environment named `bioenv` in your home directory (`~/bioenv`), installs PyTorch with CUDA 12.6 support, and then installs all other dependencies from `requirements.txt`.
205
+
206
+ Make the script executable:
207
+ ```bash
208
+ chmod +x setup_bioenv.sh
209
+ ```
210
+ Run the script:
211
+ ```bash
212
+ ./setup_bioenv.sh
213
+ ```
214
+
215
+ 3. **Activate the environment:**
216
+ Each time you want to work on this project, activate the virtual environment:
217
+ ```bash
218
+ source ~/bioenv/bin/activate
219
+ ```
220
+
221
+ 4. **To deactivate the environment:**
222
+ ```bash
223
+ deactivate
224
+ ```
225
+
226
+ ## Training
227
+
228
+ The primary script for training models is `training/train_dsm.py`. This script further pretrains an ESM2 checkpoint using the DSM objective (masked diffusion based on LLaDA) on a large protein sequence dataset like [OMG-prot50](https://huggingface.co/Synthyra/omg_prot50).
229
+
230
+ ### Main Training Script: `train_dsm.py`
231
+
232
+ - **Base Model**: DSM models are extended from pre-trained ESM2 checkpoints (e.g., ESM2-150M, ESM2-650M).
233
+ - **Training Objective**: Masked diffusion loss, where the model predicts masked tokens. The loss is scaled by `1/(t + epsilon)` where `t` is the corruption level, penalizing errors more at low mask rates.
234
+ - **Language Modeling Head**: Uses a modified head with a soft-logit cap (`tau=30`) and output projection weights tied to the token embeddings.
235
+ - **Data Handling**:
236
+ - Training data can be streamed from datasets like [Synthyra/omg_prot50](https://huggingface.co/Synthyra/omg_prot50) (a version of Open MetaGenomic dataset clustered at 50% identity).
237
+ - Uses `data.dataset_classes.SequenceDatasetFromList` for validation/test sets and `data.dataset_classes.IterableDatasetFromHF` for streaming training.
238
+ - `data.data_collators.SequenceCollator` is used for batching.
239
+ - **Training Process**:
240
+ - Utilizes Hugging Face `TrainingArguments`.
241
+ - A custom `IterableTrainer` (from `training/iterable_trainer.py`) handles iterable datasets.
242
+ - Uses AdamW optimizer and a cosine learning rate scheduler with linear warmup.
243
+ - Supports logging to Weights & Biases (wandb).
244
+ - The trained model can be pushed to Hugging Face Hub.
245
+ - Example checkpoints mentioned in the paper: [DSM-150](https://huggingface.co/GleghornLab/DSM_150) (from ESM2-150M, 100k steps, batch 32, seqlen 512, LR 1e-4) and [DSM-650](https://huggingface.co/GleghornLab/DSM_650) (from ESM2-650M, 100k steps, global batch 128, seqlen 2048, LR 1e-4).
246
+
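The training objective and capped head described above can be illustrated with a small, dependency-free sketch. It is not the code in `train_dsm.py`: the tanh form of the soft cap, the averaging over masked positions, the `epsilon` default, and all function names are assumptions chosen for illustration.

```python
import math
import random


def soft_cap(logits, tau=30.0):
    """Squash logits smoothly into [-tau, tau]; a tanh-based cap is assumed here."""
    return [tau * math.tanh(x / tau) for x in logits]


def cross_entropy(logits, target_index):
    """Negative log-likelihood of target_index under softmax(logits)."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[target_index]


def masked_diffusion_loss(logits_per_position, targets, is_masked, t, epsilon=1e-3):
    """Mean cross-entropy over masked positions, scaled by 1 / (t + epsilon)."""
    losses = [
        cross_entropy(soft_cap(logits), target)
        for logits, target, masked in zip(logits_per_position, targets, is_masked)
        if masked
    ]
    if not losses:
        return 0.0
    return sum(losses) / len(losses) / (t + epsilon)


# Toy example: 6 positions, vocabulary of 4 tokens, 4 positions masked.
rng = random.Random(0)
vocab_size, length = 4, 6
logits = [[rng.uniform(-2, 2) for _ in range(vocab_size)] for _ in range(length)]
targets = [rng.randrange(vocab_size) for _ in range(length)]
mask = [True, False, True, True, False, True]

low_t = masked_diffusion_loss(logits, targets, mask, t=0.1)
high_t = masked_diffusion_loss(logits, targets, mask, t=0.9)
print(low_t > high_t)  # the same token errors weigh more at a low mask rate
```

With identical predictions, the only difference between the two losses is the `1/(t + epsilon)` factor, which is why the low-corruption case is penalized more heavily.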
247
+ **Usage Example:**
248
+
249
+ ```bash
250
+ python -m training.train_dsm \
251
+ --model_path facebook/esm2_t33_650M_UR50D \
252
+ --save_path GleghornLab/DSM_650 \
253
+ --lr 1e-4 \
254
+ --batch_size 8 \
255
+ --grad_accum 16 \
256
+ --max_steps 100000 \
257
+ --save_every 1000 \
258
+ --fp16 \
259
+ --wandb_project "DSM_Training" \
260
+ --token <your_hf_token_if_needed_for_private_repo_or_saving>
261
+ ```
262
+
263
+ **Key Command-Line Arguments for `train_dsm.py`:**
264
+
265
+ * `--token`: Hugging Face token.
266
+ * `--model_path`: Path to the base ESM2 model to start from.
267
+ * `--save_path`: Path to save the trained DSM model on Hugging Face Hub.
268
+ * `--lr`: Learning rate.
269
+ * `--batch_size`: Batch size per device.
270
+ * `--grad_accum`: Gradient accumulation steps.
271
+ * `--max_steps`: Maximum training steps.
272
+ * `--wandb_project`: Wandb project name (default: `DSM`).
273
+ * `--max_length`: Maximum sequence length.
274
+ * `--save_every`: Save model and evaluate every N steps.
275
+ * `--fp16`: Enable mixed-precision training.
276
+ * `--bugfix`: Use small batch size and max length for debugging.
277
+
278
+ ### Other Training Scripts (e.g., for DSM-ppi)
279
+
280
+ The `training/` directory may also contain scripts like `train_dsm_bind.py`.
281
+ - DSM-ppi (e.g., [DSM-150-ppi](https://huggingface.co/GleghornLab/DSM_150_ppi), [DSM-650-ppi](https://huggingface.co/GleghornLab/DSM_650_ppi)) is fine-tuned on PPI datasets.
282
+ - Training involves conditioning on a target sequence (SeqA) to generate an interactor (SeqB) using the format `[CLS]--SeqA--[EOS]--[MASKED~SeqB]--[EOS]`.
283
+ - LoRA (Low-Rank Adaptation) can be applied to attention layers for efficient fine-tuning.
284
+
285
+ `training/iterable_trainer.py` provides the `get_iterable_trainer` function used by `train_dsm.py` to enable training with iterable datasets.
286
+
287
+ ## Evaluation
288
+
289
+ The repository includes a comprehensive suite for evaluating model performance, focusing on:
290
+
291
+ 1. **Sequence Reconstruction (Mask Filling):**
292
+ * Evaluated by masking validation/test sets at various corruption rates (5% to 90%) and measuring cross-entropy loss, weighted F1 score, and Alignment Score (ASc) for the masked positions.
293
+ * The script `evaluation/mask_filling.py` is central to this.
294
+
295
+ 2. **Unconditional Generation Quality:**
296
+ * Generate a corpus of sequences based on lengths from a reference set (e.g., validation data).
297
+ * Compare distributions (1-mers, 2-mers, 3-mers) of amino acids and predicted secondary structures between generated and natural sequences using χ² test and Jensen-Shannon (JS) divergence.
298
+ * Compare distributions of predicted functional annotations (e.g., using Annotation Vocabulary - AV terms).
299
+ * Scripts involved: `evaluation/unconditional_generation_tuning.py` (to find optimal generation parameters like temperature and step divisor `s`), `evaluation/unconditional_generation.py`, `evaluation/ss_pred.py` (using [production_ss4_model](https://huggingface.co/GleghornLab/production_ss4_model) or [production_ss9_model](https://huggingface.co/GleghornLab/production_ss9_model)), `evaluation/annotate_comparisons.py`, `evaluation/compare_distributions.py`, `evaluation/plot_distribution_comparisons.py`.
300
+ * The `run_eval_pipeline.py` script automates this workflow.
301
+
302
+ 3. **Representation Quality (Model Probing):**
303
+ * Evaluate learned embeddings by training linear probes (or simple transformer blocks) on various downstream tasks (e.g., secondary structure prediction, localization prediction, etc.).
304
+ * Performance is compared against random vectors, randomized transformers, and other established pLMs.
305
+
306
+ 4. **Conditional Generation (Binder Design for DSM-ppi):**
307
+ * Evaluate DSM-ppi on benchmarks like BenchBB.
308
+ * Generate binders for target proteins using template-based masking strategies.
309
+ * Assess generated binders using *in-silico* tools like Synteract2 for predicted binding affinity (ppKd).
310
+
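The k-mer comparison used for unconditional generation quality can be sketched in a few lines. This is an illustrative stand-in for the evaluation scripts, not their implementation: the function names and toy sequences are hypothetical, and the log base (2 here, which bounds the divergence to [0, 1]) is an assumption.

```python
import math
from collections import Counter


def kmer_distribution(sequences, k):
    """Relative frequency of every overlapping k-mer across a set of sequences."""
    counts = Counter(seq[i:i + k] for seq in sequences for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: n / total for kmer, n in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits between two discrete distributions."""
    keys = set(p) | set(q)
    mixture = {key: 0.5 * (p.get(key, 0.0) + q.get(key, 0.0)) for key in keys}

    def kl_to_mixture(dist):
        return sum(v * math.log2(v / mixture[key]) for key, v in dist.items() if v > 0)

    return 0.5 * kl_to_mixture(p) + 0.5 * kl_to_mixture(q)


# Hypothetical toy corpora standing in for natural and generated sequences.
natural = ["MKTAYIAKQR", "MKVLAAGIVG"]
generated = ["MKTAYLAKQR", "MAVLSAGIVG"]

for k in (1, 2, 3):
    divergence = js_divergence(kmer_distribution(natural, k), kmer_distribution(generated, k))
    print(f"{k}-mer JS divergence: {divergence:.3f}")
```

Lower values mean the generated corpus is closer to the natural one; identical distributions give 0 and distributions with no shared k-mers give 1.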
311
+ The `evaluation/` directory also contains a `readme.md` which provides further details on some evaluation workflows. Key metrics used include:
312
+ - **Alignment Score (ASc):** A normalized Needleman-Wunsch global alignment score (using BLOSUM62) to measure sequence similarity, robust to length variations. ASc(a, b) = l/(f(a, a) - f(a, b) + l).
313
+ - **Jensen-Shannon (JS) Divergence:** To compare distributions of k-mers and functional terms.
314
+
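The ASc formula above can be made concrete with a toy implementation. This is a sketch under stated assumptions: `f` is taken to be the Needleman-Wunsch global alignment score, `l` is assumed to be the length of the reference sequence `a`, and a simple match/mismatch scheme with a linear gap penalty stands in for BLOSUM62. The function names and scoring values are hypothetical.

```python
def needleman_wunsch_score(a, b, match=4, mismatch=-1, gap=-4):
    """Global alignment score with a linear gap penalty (toy scoring, not BLOSUM62)."""
    previous = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [i * gap]
        for j in range(1, len(b) + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            current.append(max(
                previous[j - 1] + substitution,  # align a[i-1] with b[j-1]
                previous[j] + gap,               # a[i-1] aligned to a gap
                current[j - 1] + gap,            # b[j-1] aligned to a gap
            ))
        previous = current
    return previous[-1]


def alignment_score(a, b):
    """ASc(a, b) = l / (f(a, a) - f(a, b) + l), with l assumed to be len(a)."""
    length = len(a)
    self_score = needleman_wunsch_score(a, a)
    pair_score = needleman_wunsch_score(a, b)
    return length / (self_score - pair_score + length)


reference = "MAVKFKEGGISTL"
print(alignment_score(reference, reference))  # 1.0 for identical sequences
print(alignment_score(reference, "MAVKEGISTL"))
```

Because `f(a, a)` is the best score any alignment against `a` can reach under this scoring, the denominator is at least `l`, so the score stays in (0, 1] and shrinks as the sequences diverge.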
315
+ **Running the Full Unconditional Evaluation Pipeline:**
316
+
317
+ ```bash
318
+ python run_eval_pipeline.py --token YOUR_HF_TOKEN --data_dir ./evaluation_results
319
+ ```
320
+
321
+ Refer to `run_eval_pipeline.py --help` for more options, such as `--skip_tuning`.
322
+
323
+ ### Mask Filling Evaluation
324
+
325
+ The script `evaluation/mask_filling.py` is used to evaluate models on their ability to predict masked tokens in a sequence across various masking rates.
326
+
327
+ - **Functionality:**
328
+ - Evaluates different models (DSM, DPLM, standard ESM models).
329
+ - Tests across multiple datasets ([Synthyra/omg_prot50](https://huggingface.co/Synthyra/omg_prot50), [GleghornLab/stringv12_modelorgs_9090](https://huggingface.co/GleghornLab/stringv12_modelorgs_9090)).
330
+ - Calculates metrics: loss, perplexity, precision, recall, F1, accuracy, MCC, and alignment score.
331
+ - Saves detailed results to CSV files.
332
+ - Can generate a summary plot comparing model performance across different mask rates using `evaluation/plot_mask_fill_results.py`.
333
+
334
+ - **Usage Example:**
335
+ ```bash
336
+ python -m evaluation.mask_filling \
337
+ --token YOUR_HF_TOKEN \
338
+ --batch_size 4 \
339
+ --mask_rates 0.15 0.30 0.50 \
340
+ --data_splits valid test \
341
+ --results_dir ./results/mask_fill_custom
342
+ ```
343
+ To generate a comparison plot from existing results:
344
+ ```bash
345
+ python -m evaluation.mask_filling --generate_comparison_plot --results_dir ./results/mask_fill_custom --plot_output ./results/mask_fill_custom/comparison.png
346
+ ```
347
+
348
+ ### Other Evaluation Scripts
349
+
350
+ The `evaluation/` directory contains additional scripts for more specific analyses. These are typically run independently:
351
+
352
+ - `evaluation/all_targets_uncond.py` and `evaluation/all_targets_cond.py`: Likely for evaluating generation towards specific targets, unconditionally and conditionally.
353
+ - `evaluation/conditional_binder.py` and `evaluation/unconditional_binder.py`: Suggest evaluation focused on generating protein binders.
354
+ - `evaluation/unconditional_by_length.py`: May evaluate unconditional generation focusing on sequence length distributions.
355
+ - `evaluation/utils.py`: Utility functions for evaluation scripts.
356
+
357
+ Users should refer to individual scripts (e.g., using `python -m evaluation.<script_name> --help`) for their specific usage and arguments.
358
+ The `evaluation/` directory also contains a `readme.md` which provides further details on the unconditional generation evaluation workflow.
359
+
360
+ ## Results
361
+
362
+ DSM demonstrates strong performance in both protein sequence generation and representation learning, establishing masked diffusion as a powerful paradigm.
363
+
364
+ - **Biomimetic Sequence Generation**: Unconditionally generated DSM sequences closely mimic natural protein distributions in terms of amino acid k-mers (JS divergence < 0.01), predicted secondary structures, and predicted functional annotations (AV terms, JS divergence ~0.1). This suggests DSM captures underlying biological principles.
365
+
366
+ - **Superior Sequence Reconstruction**: DSM models significantly outperform MLM-based ESM2 models in reconstructing sequences from highly corrupted inputs (up to 90% masking).
367
+ - At 90% masking, DSM achieves an Alignment Score (ASc) of ~0.27, considerably higher than random.
368
+ - DSM models show higher F1 scores in reconstruction tasks compared to DPLM models, especially at high mask rates.
369
+
370
+ - **High-Quality Embeddings**: DSM embeddings match or exceed the quality of those from comparably sized pLMs (ESM2, DPLM) and even larger autoregressive models (ProtCLM 1B) on various downstream tasks evaluated by linear probing. [DSM-650](https://huggingface.co/GleghornLab/DSM_650) generally provides the best representations among tested models of similar size.
371
+
372
+ - **Effective Binder Design (DSM-ppi):**
373
+ - [DSM-ppi](https://huggingface.co/GleghornLab/DSM_150_ppi), fine-tuned on protein-protein interaction data, demonstrates the ability to generate protein binders conditioned on target sequences.
374
+ - On the BenchBB benchmark, DSM-generated binders (both unconditional DSM and conditional DSM-ppi) show promising predicted binding affinities, in some cases superior to known binders. For example, designs for EGFR showed high predicted pKd and good structural metrics (ipTM, pTM with AlphaFold3).
375
+
376
+ - **Efficiency**: DSM can generate realistic protein sequences from a single forward pass during reconstruction tasks at high mask rates, offering potential efficiency advantages over iterative AR or some discrete diffusion models.
377
+
378
+ These results highlight DSM's capability to unify high-quality protein representation learning and biologically coherent generative modeling within a single framework.
379
+
380
+ ## Cite
381
+ ```
382
+ @misc{hallee2025diffusionsequencemodelsenhanced,
383
+ title={Diffusion Sequence Models for Enhanced Protein Representation and Generation},
384
+ author={Logan Hallee and Nikolaos Rafailidis and David B. Bichara and Jason P. Gleghorn},
385
+ year={2025},
386
+ eprint={2506.08293},
387
+ archivePrefix={arXiv},
388
+ primaryClass={q-bio.BM},
389
+ url={https://arxiv.org/abs/2506.08293},
390
+ }
391
+ ```