lhallee committed
Commit 7588282 · verified · 1 Parent(s): b97d6e2

Upload README.md with huggingface_hub

Files changed (1):
  1. README.md (+9 -9)
README.md CHANGED
@@ -34,15 +34,15 @@ Relevant Huggingface hosted models and datasets
 - **DSM-ppi Models**:
   (LoRA versions - results reported in paper but not recommended for real use)
   - [GleghornLab/DSM_150_ppi_lora](https://huggingface.co/GleghornLab/DSM_150_ppi_lora) - 150M parameter LoRA DSM-ppi model
-  - [GleghornLab/DSM_650_ppi_lora](https://huggingface.co/GleghornLab/DSM_650_ppi_Lora) - 650M parameter LoRA DSM-ppi model
+  - [GleghornLab/DSM_650_ppi_lora](https://huggingface.co/GleghornLab/DSM_650_ppi_lora) - 650M parameter LoRA DSM-ppi model
   - [GleghornLab/DSM_150_ppi_control](https://huggingface.co/GleghornLab/DSM_150_ppi_control) - Control version of LoRA DSM-ppi
 
   (Fully finetuned - recommended for real use)
   - [Synthyra/DSM_ppi_full](https://huggingface.co/Synthyra/DSM_ppi_full) - 650M parameter DSM-ppi model
 
 - **Datasets**:
-  - [Synthyra/omg_prot50](https://huggingface.co/Synthyra/omg_prot50) - Open MetaGenomic dataset clustered at 50% identity (207M sequences)
-  - [GleghornLab/stringv12_modelorgs_9090](https://huggingface.co/GleghornLab/stringv12_modelorgs_9090) - STRING database model organisms (653k sequences)
+  - [Synthyra/omg_prot50](https://huggingface.co/datasets/Synthyra/omg_prot50) - Open MetaGenomic dataset clustered at 50% identity (207M sequences)
+  - [GleghornLab/stringv12_modelorgs_9090](https://huggingface.co/datasets/GleghornLab/stringv12_modelorgs_9090) - STRING database model organisms (653k sequences)
 
 - **Utility Models**:
   - [GleghornLab/production_ss4_model](https://huggingface.co/GleghornLab/production_ss4_model) - Secondary structure prediction (4-class)
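For readers who want to try these checkpoints, here is a minimal loading sketch. It assumes the repos work with the standard `transformers` auto classes and that `trust_remote_code=True` is required; neither detail is confirmed by this README.

```python
# Minimal sketch: loading a DSM checkpoint from the Hugging Face Hub.
# AutoModelForMaskedLM compatibility and trust_remote_code=True are
# both assumptions, not documented behavior.
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_id = "Synthyra/DSM_ppi_full"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained(model_id, trust_remote_code=True)
```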
@@ -207,7 +207,7 @@ Folded with Chai1:
 
 ## Demos
 There are various demos with many more to come. For example, in `demo_dsm_ppi_full.py` (run by `python -m demos.demo_dsm_ppi_full`) we perform a test on DSM-ppi.
-We take 1000 proteins pairs from BIOGRID (real protein-protein interactions) and 1000 from Negatome (non interacting protein pairs) and mask the second sequence (SeqB) by 50%.
+We take 1000 protein pairs from BIOGRID (real protein-protein interactions) and 1000 from Negatome (non interacting protein pairs) and mask the second sequence (SeqB) by 50%.
 This acts as a sanity check, as we expect the accuracy on reconstructing real positive PPIs to be higher than the accuracy on non-interacting proteins.
 Indeed, this is the case:
 
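The sanity check described in this hunk can be sketched as follows; helper names are illustrative, not the actual `demo_dsm_ppi_full.py` code.

```python
# Illustrative sketch of the BIOGRID-vs-Negatome sanity check:
# mask 50% of SeqB, let the model reconstruct it, and score accuracy
# over the masked positions only. All names here are hypothetical.
import random

def mask_seq_b(seq_b: str, mask_rate: float = 0.5, mask_char: str = "#") -> str:
    """Randomly replace mask_rate of SeqB's residues with a mask symbol."""
    positions = random.sample(range(len(seq_b)), k=int(len(seq_b) * mask_rate))
    chars = list(seq_b)
    for i in positions:
        chars[i] = mask_char
    return "".join(chars)

def masked_accuracy(pred: str, target: str, masked: str, mask_char: str = "#") -> float:
    """Fraction of masked positions the model reconstructed correctly."""
    idx = [i for i, c in enumerate(masked) if c == mask_char]
    return sum(pred[i] == target[i] for i in idx) / max(len(idx), 1)
```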
@@ -262,7 +262,7 @@ Difference is statistically significant (p < 0.05)
 
 ## Training
 
-The primary script for training models is `training/train_dsm.py`. This script further pretrains an ESM2 checkpoint using the DSM objective (masked diffusion based on LLaDA) on a large protein sequence dataset like [OMG-prot50](https://huggingface.co/Synthyra/omg_prot50).
+The primary script for training models is `training/train_dsm.py`. This script further pretrains an ESM2 checkpoint using the DSM objective (masked diffusion based on LLaDA) on a large protein sequence dataset like [OMG-prot50](https://huggingface.co/datasets/Synthyra/omg_prot50).
 
 ### Main Training Script: `train_dsm.py`
 
@@ -270,7 +270,7 @@ The primary script for training models is `training/train_dsm.py`. This script f
 - **Training Objective**: Masked diffusion loss, where the model predicts masked tokens. The loss is scaled by `1/(t + epsilon)` where `t` is the corruption level, penalizing errors more at low mask rates.
 - **Language Modeling Head**: Uses a modified head with a soft-logit cap (`tau=30`) and tied output projection weights to the token embeddings.
 - **Data Handling**:
-  - Training data can be streamed from datasets like [Synthyra/omg_prot50](https://huggingface.co/Synthyra/omg_prot50) (a version of Open MetaGenomic dataset clustered at 50% identity).
+  - Training data can be streamed from datasets like [Synthyra/omg_prot50](https://huggingface.co/datasets/Synthyra/omg_prot50) (a version of Open MetaGenomic dataset clustered at 50% identity).
   - Uses `data.dataset_classes.SequenceDatasetFromList` for validation/test sets and `data.dataset_classes.IterableDatasetFromHF` for streaming training.
   - `data.data_collators.SequenceCollator` is used for batching.
 - **Training Process**:
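The training objective in this hunk can be written out as a short sketch. Tensor shapes, the epsilon value, and the reduction are assumptions, not the `train_dsm.py` source.

```python
# Hedged sketch of the masked diffusion loss with 1/(t + epsilon)
# weighting and a soft-logit cap (tau=30). Details are assumptions.
import torch
import torch.nn.functional as F

def soft_cap(logits: torch.Tensor, tau: float = 30.0) -> torch.Tensor:
    # Smoothly bounds logits to (-tau, tau).
    return tau * torch.tanh(logits / tau)

def dsm_loss(logits, targets, mask, t, epsilon: float = 1e-3):
    # logits: (B, L, V); targets: (B, L); mask: (B, L) bool (True = corrupted)
    # t: (B,) corruption level in (0, 1]
    token_loss = F.cross_entropy(
        soft_cap(logits).transpose(1, 2), targets, reduction="none"
    )  # (B, L)
    per_seq = (token_loss * mask).sum(1) / mask.sum(1).clamp(min=1)
    # 1/(t + epsilon): errors at low corruption (small t) are penalized more.
    return (per_seq / (t + epsilon)).mean()
```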
@@ -315,7 +315,7 @@ python -m training.train_dsm \
 ### Other Training Scripts (e.g., for DSM-ppi)
 
 The `training/` directory may also contain scripts like `train_dsm_bind.py`.
-- DSM-ppi (e.g., [DSM-150-ppi](https://huggingface.co/GleghornLab/DSM_150_ppi), [DSM-650-ppi](https://huggingface.co/GleghornLab/DSM_650_ppi)) is fine-tuned on PPI datasets.
+- DSM-ppi (e.g., [DSM-150-ppi](https://huggingface.co/GleghornLab/DSM_150_ppi_lora), [DSM-650-ppi](https://huggingface.co/GleghornLab/DSM_650_ppi_lora)) is fine-tuned on PPI datasets.
 - Training involves conditioning on a target sequence (SeqA) to generate an interactor (SeqB) using the format `[CLS]--SeqA--[EOS]--[MASKED~SeqB]--[EOS]`.
 - LoRA (Low-Rank Adaptation) can be applied to attention layers for efficient fine-tuning.
 
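A sketch of how such a conditioned input might be assembled, following the format string in this hunk, plus an equally hedged `peft` LoRA configuration. The special-token spellings mirror the README's notation (not necessarily the real tokenizer), and the `target_modules` names are assumptions.

```python
# Illustrative DSM-ppi input construction and LoRA setup. Token
# strings and target_modules names are assumptions for this sketch.
from peft import LoraConfig, get_peft_model

def build_ppi_input(seq_a: str, masked_seq_b: str) -> str:
    # [CLS]--SeqA--[EOS]--[MASKED~SeqB]--[EOS]
    return f"[CLS]{seq_a}[EOS]{masked_seq_b}[EOS]"

lora_config = LoraConfig(r=8, lora_alpha=16, target_modules=["query", "key", "value"])
# model = get_peft_model(base_model, lora_config)  # base_model: an ESM2-style model
```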
@@ -364,7 +364,7 @@ The script `evaluation/mask_filling.py` is used to evaluate models on their abil
 
 - **Functionality:**
   - Evaluates different models (DSM, DPLM, standard ESM models).
-  - Tests across multiple datasets ([Synthyra/omg_prot50](https://huggingface.co/Synthyra/omg_prot50), [GleghornLab/stringv12_modelorgs_9090](https://huggingface.co/GleghornLab/stringv12_modelorgs_9090)).
+  - Tests across multiple datasets ([Synthyra/omg_prot50](https://huggingface.co/datasets/Synthyra/omg_prot50), [GleghornLab/stringv12_modelorgs_9090](https://huggingface.co/datasets/GleghornLab/stringv12_modelorgs_9090)).
   - Calculates metrics: loss, perplexity, precision, recall, F1, accuracy, MCC, and alignment score.
   - Saves detailed results to CSV files.
   - Can generate a summary plot comparing model performance across different mask rates using `evaluation/plot_mask_fill_results.py`.
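As an illustration of the token-level metrics listed in this hunk, restricted to masked positions (a sketch, not the actual `evaluation/mask_filling.py` code):

```python
# Sketch: token-level metrics over masked positions only. Inputs are
# numpy arrays of token ids; names here are illustrative.
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, matthews_corrcoef

def masked_token_metrics(pred_ids: np.ndarray, true_ids: np.ndarray, mask: np.ndarray) -> dict:
    p, t = pred_ids[mask], true_ids[mask]
    return {
        "accuracy": accuracy_score(t, p),
        "f1": f1_score(t, p, average="weighted"),
        "mcc": matthews_corrcoef(t, p),
    }
```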
@@ -408,7 +408,7 @@ DSM demonstrates strong performance in both protein sequence generation and repr
 - **High-Quality Embeddings**: DSM embeddings match or exceed the quality of those from comparably sized pLMs (ESM2, DPLM) and even larger autoregressive models (ProtCLM 1B) on various downstream tasks evaluated by linear probing. [DSM-650](https://huggingface.co/GleghornLab/DSM_650) generally provides the best representations among tested models of similar size.
 
 - **Effective Binder Design (DSM-ppi):**
-  - [DSM-ppi](https://huggingface.co/GleghornLab/DSM_150_ppi) fine-tuned on protein-protein interaction data, demonstrates the ability to generate protein binders conditioned on target sequences.
+  - DSM-ppi fine-tuned on protein-protein interaction data, demonstrates the ability to generate protein binders conditioned on target sequences.
   - On the BenchBB benchmark, DSM-generated binders (both unconditional DSM and conditional DSM-ppi) show promising predicted binding affinities, in some cases superior to known binders. For example, designs for EGFR showed high predicted pKd and good structural metrics (ipTM, pTM with AlphaFold3).
 
 - **Efficiency**: DSM can generate realistic protein sequences from a single forward pass during reconstruction tasks at high mask rates, offering potential efficiency advantages over iterative AR or some discrete diffusion models.