lhallee committed on
Commit 19edba9 · verified · 1 Parent(s): e92ca8b

Upload README.md with huggingface_hub

Files changed (1): README.md (+33 -17)

README.md CHANGED
@@ -1,3 +1,7 @@
 # DSM: Diffusion Models for Protein Sequence Generation
 ### Note: This readme is shared between our GitHub and Huggingface pages.

@@ -10,17 +14,17 @@
 - [Training](#training)
 - [Evaluation](#evaluation)
 - [Results](#results)
- - [Cite](#Cite)

 ## Introduction

- DSM (Diffusion Sequence Model) is a novel Protein Language Model (pLM) developed in collaboration between the Gleghorn Lab and [Synthyra](https://synthyra.com/). It was trained with masked diffusion to enable both high-quality representation learning and generative protein design, detailed extensively in our [preprint](https://arxiv.org/abs/2506.08293). This repository contains the code for training and evaluating DSM and its variants.

- DSM is capable of generating diverse, biomimetic sequences that align with expected amino acid compositions, secondary structures, and predicted functions, even under high corruption rates. Furthermore, DSM's learned representations match or exceed those of comparably sized pLMs on various downstream tasks. The repository also includes DSM-ppi, a variant fine-tuned to generate protein binders by attending to target sequences.

 ## Models

- The following models are available on Hugging Face:

 - **Base DSM Models**:
   - [GleghornLab/DSM_150](https://huggingface.co/GleghornLab/DSM_150) - 150M parameter DSM model
@@ -114,8 +118,6 @@ Generated sequence: MAVKFKEGGISTL
 ```

 ### 3. Conditional Generation (e.g., Binders - using DSM-ppi)
- If using DSM-ppi, the input format is specific for generating a binder (SeqB) for a target (SeqA).
-
 ```python
 # from models.modeling_dsm import DSM_ppi
 # model_binder = DSM_ppi.from_pretrained("GleghornLab/DSM_650_ppi_lora").to(device).eval()
@@ -125,24 +127,29 @@ from models.modeling_dsm import DSM

 model_binder = DSM.from_pretrained("Synthyra/DSM_ppi_full").to(device).eval()

- target_seq = "TARGETSEQUENCEAMINOACIDS"
 # For binder generation, the 'interactor' (SeqB) part is what gets generated/filled.
 # Start with a fully masked interactor of desired length.
- interactor_template_len = 20
 interactor_template = ''.join([mask_token] * interactor_template_len)

 combined_input_str = target_seq + '<eos>' + interactor_template

- binder_template_tokens = tokenizer.encode(combined_input_str, add_special_tokens=True).to(device)

 output = model_binder.mask_diffusion_generate(
-     input_tokens=binder_template_tokens,
-     step_divisor=100, # lower is slower but better
-     temperature=1.0, # sampling temperature
-     remasking="random", # strategy for remasking tokens not kept
- )
-
- target, binder = model.decode_dual_input(output, seperator='<eos>')
 # Parse out the generated interactor part based on EOS tokens.
 # Example: generated_full_seq_str.split(model_binder.tokenizer.eos_token)[1]
 print(f"Generated binder {binder[0]}")
@@ -152,6 +159,15 @@ print(f"Generated binder {binder[0]}")
 Generated binder HRHHHRRPTHARETEWLARMRLGIAEHQRIAVPRSDLEPDQMRERAADNQRLVKEYDQVIDHQTEGSTERLFEVLRVWEQVNTEQAHHEASAALEFGRVGYPDDEGGRAFYTQANAHKKDLVEYIGGIDEDAKWDPRIAWLMPEGGQPVKATVIGVSEERINGLKVLDDHWGRERRLWLINLFTALQAYDDPTRPTQVTLTPATDQLTNDVQYLLLSTRYTPPGVTTAVKIRKLDGRTLKVLTTEAPYVVRGATLS
 ```

 ## Demos
 There are various demos, with many more to come. For example, in `demo_dsm_ppi_full.py` (run by `python -m demos.demo_dsm_ppi_full`) we perform a test on DSM-ppi.
 We take 1000 protein pairs from BIOGRID (real protein-protein interactions) and 1000 from Negatome (non-interacting protein pairs) and mask the second sequence (SeqB) by 50%.
@@ -372,4 +388,4 @@ These results highlight DSM's capability to unify high-quality protein represent
 primaryClass={q-bio.BM},
 url={https://arxiv.org/abs/2506.08293},
 }
- ```
 
+ ---
+ library_name: transformers
+ tags: []
+ ---
 # DSM: Diffusion Models for Protein Sequence Generation
 ### Note: This readme is shared between our GitHub and Huggingface pages.

 
 
 - [Training](#training)
 - [Evaluation](#evaluation)
 - [Results](#results)
+ - [Cite](#cite)

 ## Introduction

+ DSM (Diffusion Sequence Model) is a novel Protein Language Model (pLM) developed in collaboration between the Gleghorn Lab and [Synthyra](https://synthyra.com/). It was trained with masked diffusion to enable both high-quality representation learning and generative protein design. This repository contains the code for training, evaluating, and applying DSM and its variants.

+ DSM is capable of generating diverse, biomimetic sequences that align with expected amino acid compositions, secondary structures, and predicted functions. Furthermore, DSM's learned representations match or exceed those of comparably sized pLMs on various downstream tasks. DSM is detailed extensively in our [preprint](https://arxiv.org/abs/2506.08293) (currently in review). Beyond the base and PPI variants, we are currently training versions that jointly diffuse over sequence and foldseek tokens, as well as [Annotation Vocabulary](https://www.biorxiv.org/content/10.1101/2024.07.30.605924v1) tokens. Since the preprint release, Synthyra has trained [Synthyra/DSM_ppi_full](https://huggingface.co/Synthyra/DSM_ppi_full), which forgoes the LoRA PPI training in favor of full fine-tuning. Additionally, SeqA and SeqB are jointly masked, rather than just SeqB as in the original version. We plan to add the **many** new results to the second version of the preprint and the eventual journal article.

 ## Models

+ Relevant Hugging Face-hosted models and datasets:

 - **Base DSM Models**:
   - [GleghornLab/DSM_150](https://huggingface.co/GleghornLab/DSM_150) - 150M parameter DSM model
 
 ```

 ### 3. Conditional Generation (e.g., Binders - using DSM-ppi)
 ```python
 # from models.modeling_dsm import DSM_ppi
 # model_binder = DSM_ppi.from_pretrained("GleghornLab/DSM_650_ppi_lora").to(device).eval()

 model_binder = DSM.from_pretrained("Synthyra/DSM_ppi_full").to(device).eval()

+ # BBF-14
+ target_seq = "MGTPLWALLGGPWRGTATYEDGTKVTLDYRYTRVSPDRLRADVTYTTPDGTTLEATVDLWKDANGVIRYHATYPDGTSADGTLTQLDADTLLATGTYDDGTKYTVTLTRVAPGSGWHHHHHH"
 # For binder generation, the 'interactor' (SeqB) part is what gets generated/filled.
 # Start with a fully masked interactor of desired length.
+ interactor_template_len = 256
 interactor_template = ''.join([mask_token] * interactor_template_len)

 combined_input_str = target_seq + '<eos>' + interactor_template

+ input_tokens = tokenizer.encode(combined_input_str, add_special_tokens=True, return_tensors='pt').to(device)

 output = model_binder.mask_diffusion_generate(
+     tokenizer=tokenizer,
+     input_tokens=input_tokens,
+     step_divisor=10,          # lower is slower but better quality
+     temperature=1.0,          # sampling temperature
+     remasking="random",       # strategy for remasking tokens not kept
+     preview=False,            # set to True to watch the mask tokens get filled in real time
+     slow=False,               # adds a small delay to the real-time filling (it is usually very fast, and watching carefully is hard!)
+     return_trajectory=False   # set to True to also return the trajectory of the generation (what you watch in the preview)
+ )  # Note: output will be a tuple if return_trajectory is True
+
+ target, binder = model_binder.decode_dual_input(output, seperator='<eos>')
 # Parse out the generated interactor part based on EOS tokens.
 # Example: generated_full_seq_str.split(model_binder.tokenizer.eos_token)[1]
 print(f"Generated binder {binder[0]}")

 Generated binder HRHHHRRPTHARETEWLARMRLGIAEHQRIAVPRSDLEPDQMRERAADNQRLVKEYDQVIDHQTEGSTERLFEVLRVWEQVNTEQAHHEASAALEFGRVGYPDDEGGRAFYTQANAHKKDLVEYIGGIDEDAKWDPRIAWLMPEGGQPVKATVIGVSEERINGLKVLDDHWGRERRLWLINLFTALQAYDDPTRPTQVTLTPATDQLTNDVQYLLLSTRYTPPGVTTAVKIRKLDGRTLKVLTTEAPYVVRGATLS
 ```
+ Folded with Chai1:
+
+ ![image](https://github.com/user-attachments/assets/782d7bba-6f25-4a27-b0c4-fef88565dd33)
+
+ `Synthyra/DSM_ppi_full` was actually trained to fill masks from any part of SeqA and SeqB. That means you can fully hallucinate plausibly interacting protein pairs.

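Because the model fills masks anywhere in SeqA and SeqB, a fully masked pair template is enough to hallucinate both sides. Below is a minimal sketch of building such a template; the helper name is illustrative, the `'<mask>'` default is an assumption (use your tokenizer's actual mask token), and the generation call itself follows the binder example above.

```python
def build_pair_template(len_a: int, len_b: int,
                        mask_token: str = '<mask>',   # assumption: replace with tokenizer.mask_token
                        separator: str = '<eos>') -> str:
    """Build a fully masked SeqA<eos>SeqB template for joint hallucination."""
    return mask_token * len_a + separator + mask_token * len_b

template = build_pair_template(64, 64)
# Then tokenize and generate exactly as in the binder example, e.g.:
# input_tokens = tokenizer.encode(template, add_special_tokens=True, return_tensors='pt').to(device)
# output = model_binder.mask_diffusion_generate(tokenizer=tokenizer, input_tokens=input_tokens, ...)
```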
 ## Demos
 There are various demos, with many more to come. For example, in `demo_dsm_ppi_full.py` (run by `python -m demos.demo_dsm_ppi_full`) we perform a test on DSM-ppi.
 We take 1000 protein pairs from BIOGRID (real protein-protein interactions) and 1000 from Negatome (non-interacting protein pairs) and mask the second sequence (SeqB) by 50%.
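The 50% SeqB corruption used in the demo can be sketched as follows. This is an illustrative helper, not the demo's actual code (the real demo operates on token IDs via the tokenizer), and the example sequence is arbitrary.

```python
import random

def mask_fraction(seq: str, mask_token: str = '<mask>', frac: float = 0.5, seed: int = 0) -> str:
    """Replace a random fraction of residues with the mask token."""
    rng = random.Random(seed)
    n_mask = int(len(seq) * frac)
    positions = set(rng.sample(range(len(seq)), k=n_mask))
    return ''.join(mask_token if i in positions else c for i, c in enumerate(seq))

# Mask half of an (arbitrary) SeqB before asking the model to reconstruct it.
seq_b_masked = mask_fraction("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEV")
```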
 
 primaryClass={q-bio.BM},
 url={https://arxiv.org/abs/2506.08293},
 }
+ ```