Upload README.md with huggingface_hub
README.md CHANGED
@@ -34,15 +34,15 @@ Relevant Huggingface hosted models and datasets
 - **DSM-ppi Models**:
 (LoRA versions - results reported in paper but not recommended for real use)
 - [GleghornLab/DSM_150_ppi_lora](https://huggingface.co/GleghornLab/DSM_150_ppi_lora) - 150M parameter LoRA DSM-ppi model
-- [GleghornLab/DSM_650_ppi_lora](https://huggingface.co/GleghornLab/
+- [GleghornLab/DSM_650_ppi_lora](https://huggingface.co/GleghornLab/DSM_650_ppi_lora) - 650M parameter LoRA DSM-ppi model
 - [GleghornLab/DSM_150_ppi_control](https://huggingface.co/GleghornLab/DSM_150_ppi_control) - Control version of LoRA DSM-ppi
 
 (Fully finetuned - recommended for real use)
 - [Synthyra/DSM_ppi_full](https://huggingface.co/Synthyra/DSM_ppi_full) - 650M parameter DSM-ppi model
 
 - **Datasets**:
-- [Synthyra/omg_prot50](https://huggingface.co/Synthyra/omg_prot50) - Open MetaGenomic dataset clustered at 50% identity (207M sequences)
+- [Synthyra/omg_prot50](https://huggingface.co/datasets/Synthyra/omg_prot50) - Open MetaGenomic dataset clustered at 50% identity (207M sequences)
-- [GleghornLab/stringv12_modelorgs_9090](https://huggingface.co/GleghornLab/stringv12_modelorgs_9090) - STRING database model organisms (653k sequences)
+- [GleghornLab/stringv12_modelorgs_9090](https://huggingface.co/datasets/GleghornLab/stringv12_modelorgs_9090) - STRING database model organisms (653k sequences)
 
 - **Utility Models**:
 - [GleghornLab/production_ss4_model](https://huggingface.co/GleghornLab/production_ss4_model) - Secondary structure prediction (4-class)
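The hunk above moves the dataset links to the Hub's `datasets` namespace. As a quick hedged sketch of pulling one of them with the `datasets` library (the `train` split name is an assumption about the dataset config; streaming avoids materializing the 207M-sequence corpus on disk):

```python
from datasets import load_dataset

# Stream omg_prot50 rather than downloading all 207M sequences;
# split="train" is assumed, not confirmed from the dataset card.
ds = load_dataset("Synthyra/omg_prot50", split="train", streaming=True)
print(next(iter(ds)))  # one example record
```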
@@ -207,7 +207,7 @@ Folded with Chai1:
 
 ## Demos
 There are various demos with many more to come. For example, in `demo_dsm_ppi_full.py` (run by `python -m demos.demo_dsm_ppi_full`) we perform a test on DSM-ppi.
-We take 1000
+We take 1000 protein pairs from BIOGRID (real protein-protein interactions) and 1000 from Negatome (non-interacting protein pairs) and mask the second sequence (SeqB) by 50%.
 This acts as a sanity check, as we expect the accuracy on reconstructing real positive PPIs to be higher than the accuracy on non-interacting proteins.
 Indeed, this is the case:
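A minimal sketch of the check the completed sentence describes: mask half of SeqB, run one forward pass, and score reconstruction at the masked positions. The helper names and unbatched tensor layout are illustrative, not the demo's actual code.

```python
import torch

def mask_seqb_half(input_ids: torch.Tensor, b_start: int, b_end: int,
                   mask_id: int) -> tuple[torch.Tensor, torch.Tensor]:
    """Mask 50% of SeqB positions in a [CLS] SeqA [EOS] SeqB [EOS] input (1-D)."""
    span = torch.arange(b_start, b_end)
    picks = span[torch.randperm(span.numel())[: span.numel() // 2]]
    masked = input_ids.clone()
    masked[picks] = mask_id
    return masked, picks

def reconstruction_accuracy(logits: torch.Tensor, labels: torch.Tensor,
                            picks: torch.Tensor) -> float:
    """Accuracy over masked SeqB tokens only; compare BIOGRID vs. Negatome pairs."""
    return (logits.argmax(-1)[picks] == labels[picks]).float().mean().item()
```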
@@ -262,7 +262,7 @@ Difference is statistically significant (p < 0.05)
 
 ## Training
 
-The primary script for training models is `training/train_dsm.py`. This script further pretrains an ESM2 checkpoint using the DSM objective (masked diffusion based on LLaDA) on a large protein sequence dataset like [OMG-prot50](https://huggingface.co/Synthyra/omg_prot50).
+The primary script for training models is `training/train_dsm.py`. This script further pretrains an ESM2 checkpoint using the DSM objective (masked diffusion based on LLaDA) on a large protein sequence dataset like [OMG-prot50](https://huggingface.co/datasets/Synthyra/omg_prot50).
 
 ### Main Training Script: `train_dsm.py`
 
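Since training starts from an ESM2 checkpoint, loading one from the Hub with `transformers` looks like the sketch below. The checkpoint names are ESM2's public ones and presumably correspond in size to DSM-150 and DSM-650; the repo may pin different starting weights.

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Public ESM2 checkpoints; esm2_t30_150M_UR50D (150M) and
# esm2_t33_650M_UR50D (650M) are assumed analogues of DSM-150/DSM-650.
tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t33_650M_UR50D")
model = AutoModelForMaskedLM.from_pretrained("facebook/esm2_t33_650M_UR50D")
```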
@@ -270,7 +270,7 @@ The primary script for training models is `training/train_dsm.py`. This script f
 - **Training Objective**: Masked diffusion loss, where the model predicts masked tokens. The loss is scaled by `1/(t + epsilon)` where `t` is the corruption level, penalizing errors more at low mask rates.
 - **Language Modeling Head**: Uses a modified head with a soft-logit cap (`tau=30`) and tied output projection weights to the token embeddings.
 - **Data Handling**:
-- Training data can be streamed from datasets like [Synthyra/omg_prot50](https://huggingface.co/Synthyra/omg_prot50) (a version of Open MetaGenomic dataset clustered at 50% identity).
+- Training data can be streamed from datasets like [Synthyra/omg_prot50](https://huggingface.co/datasets/Synthyra/omg_prot50) (a version of the Open MetaGenomic dataset clustered at 50% identity).
 - Uses `data.dataset_classes.SequenceDatasetFromList` for validation/test sets and `data.dataset_classes.IterableDatasetFromHF` for streaming training.
 - `data.data_collators.SequenceCollator` is used for batching.
 - **Training Process**:
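The loss scaling and soft-logit cap in the first two bullets of this hunk are compact enough to express directly. A hedged sketch, assuming masked positions are the only labeled tokens (labels set to -100 elsewhere) and an illustrative `epsilon`; `tau=30` comes from the bullet above:

```python
import torch
import torch.nn.functional as F

def soft_cap(logits: torch.Tensor, tau: float = 30.0) -> torch.Tensor:
    """Soft-logit cap: squash logits smoothly into (-tau, tau)."""
    return tau * torch.tanh(logits / tau)

def dsm_loss(logits: torch.Tensor, labels: torch.Tensor, t: float,
             epsilon: float = 1e-3) -> torch.Tensor:
    """Masked diffusion loss: cross-entropy over masked tokens, scaled by
    1/(t + epsilon) so errors at low corruption levels t cost more.
    epsilon here is an assumed value, not the repo's constant."""
    ce = F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1),
                         ignore_index=-100)
    return ce / (t + epsilon)
```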
@@ -315,7 +315,7 @@ python -m training.train_dsm \
 ### Other Training Scripts (e.g., for DSM-ppi)
 
 The `training/` directory may also contain scripts like `train_dsm_bind.py`.
-- DSM-ppi (e.g., [DSM-150-ppi](https://huggingface.co/GleghornLab/
+- DSM-ppi (e.g., [DSM-150-ppi](https://huggingface.co/GleghornLab/DSM_150_ppi_lora), [DSM-650-ppi](https://huggingface.co/GleghornLab/DSM_650_ppi_lora)) is fine-tuned on PPI datasets.
 - Training involves conditioning on a target sequence (SeqA) to generate an interactor (SeqB) using the format `[CLS]--SeqA--[EOS]--[MASKED~SeqB]--[EOS]`.
 - LoRA (Low-Rank Adaptation) can be applied to attention layers for efficient fine-tuning.
 
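To make the `[CLS]--SeqA--[EOS]--[MASKED~SeqB]--[EOS]` format concrete, here is a hypothetical illustration; the literal bracket strings and `<mask>` placeholder stand in for whatever special-token IDs the repo's tokenizer actually uses.

```python
def build_ppi_example(seq_a: str, seq_b: str, mask_token: str = "<mask>") -> str:
    """SeqA is visible conditioning context; every SeqB position starts
    masked and is denoised during generation. Token strings are illustrative."""
    masked_b = mask_token * len(seq_b)
    return f"[CLS]{seq_a}[EOS]{masked_b}[EOS]"
```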
@@ -364,7 +364,7 @@ The script `evaluation/mask_filling.py` is used to evaluate models on their abil
 
 - **Functionality:**
 - Evaluates different models (DSM, DPLM, standard ESM models).
-- Tests across multiple datasets ([Synthyra/omg_prot50](https://huggingface.co/Synthyra/omg_prot50), [GleghornLab/stringv12_modelorgs_9090](https://huggingface.co/GleghornLab/stringv12_modelorgs_9090)).
+- Tests across multiple datasets ([Synthyra/omg_prot50](https://huggingface.co/datasets/Synthyra/omg_prot50), [GleghornLab/stringv12_modelorgs_9090](https://huggingface.co/datasets/GleghornLab/stringv12_modelorgs_9090)).
 - Calculates metrics: loss, perplexity, precision, recall, F1, accuracy, MCC, and alignment score.
 - Saves detailed results to CSV files.
 - Can generate a summary plot comparing model performance across different mask rates using `evaluation/plot_mask_fill_results.py`.
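For the classification metrics in that list, a hedged sketch of scoring predicted vs. true amino-acid tokens at masked positions with scikit-learn; macro averaging and the `mask_fill_metrics` helper are assumptions, and the actual script may weight classes differently.

```python
from sklearn.metrics import (accuracy_score, f1_score, matthews_corrcoef,
                             precision_score, recall_score)

def mask_fill_metrics(y_true: list[int], y_pred: list[int]) -> dict[str, float]:
    """Score token predictions at masked positions (one entry per position)."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, average="macro", zero_division=0),
        "recall": recall_score(y_true, y_pred, average="macro", zero_division=0),
        "f1": f1_score(y_true, y_pred, average="macro", zero_division=0),
        "mcc": matthews_corrcoef(y_true, y_pred),
    }
```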
@@ -408,7 +408,7 @@ DSM demonstrates strong performance in both protein sequence generation and repr
 - **High-Quality Embeddings**: DSM embeddings match or exceed the quality of those from comparably sized pLMs (ESM2, DPLM) and even larger autoregressive models (ProtCLM 1B) on various downstream tasks evaluated by linear probing. [DSM-650](https://huggingface.co/GleghornLab/DSM_650) generally provides the best representations among tested models of similar size.
 
 - **Effective Binder Design (DSM-ppi):**
--
+- DSM-ppi, fine-tuned on protein-protein interaction data, demonstrates the ability to generate protein binders conditioned on target sequences.
 - On the BenchBB benchmark, DSM-generated binders (both unconditional DSM and conditional DSM-ppi) show promising predicted binding affinities, in some cases superior to known binders. For example, designs for EGFR showed high predicted pKd and good structural metrics (ipTM, pTM with AlphaFold3).
 
 - **Efficiency**: DSM can generate realistic protein sequences from a single forward pass during reconstruction tasks at high mask rates, offering potential efficiency advantages over iterative AR or some discrete diffusion models.