yinuozhang committed on
Commit
069410e
·
1 Parent(s): 2216d16
Files changed (1)
  1. README.md +358 -1
README.md CHANGED
@@ -10,4 +10,361 @@ This repo contains important large files for [PeptiVerse](https://huggingface.co
10
  - `training_data` hosts all **raw data** to train the classifiers
11
  - `functions` contains files to utilize the trained weights and classifiers
12
  - `train` contains the script to train classifiers on the pre-processed embeddings, either through xgboost or MLPs.
13
- - `scoring_function.py` contains a class that aggregates all trained classifiers for diverse downstream sampling applications
10
  - `training_data` hosts all **raw data** to train the classifiers
11
  - `functions` contains files to utilize the trained weights and classifiers
12
  - `train` contains the script to train classifiers on the pre-processed embeddings, either through xgboost or MLPs.
13
+ - `scoring_function.py` contains a class that aggregates all trained classifiers for diverse downstream sampling applications
14
+
15
+ # PeptiVerse 🧬🌌
16
+
17
+ A collection of machine learning predictors for the properties of canonical and non-canonical peptides, all operating on SMILES representations. 🧬 PeptiVerse 🌌 evaluates key biophysical and therapeutic properties of peptides to support property-optimized generation.
18
+
19
+ ## Predictors 🧫
20
+
21
+ PeptiVerse includes the following property predictors:
22
+
23
+ | Predictor | Measurement | Interpretation | Training Data Source | Dataset Size | Model Type |
24
+ |-----------|-------------|-----------------| --------------------|--------------|------------|
25
+ | **Non-Hemolysis** | Probability of non-hemolytic behavior | 0-1 scale, higher = less hemolytic | PeptideBERT, PepLand | 6,077 peptides | XGBoost + PeptideCLM embeddings |
26
+ | **Solubility** | Probability of aqueous solubility | 0-1 scale, higher = more soluble | PeptideBERT, PepLand | 18,454 peptides | XGBoost + PeptideCLM embeddings |
27
+ | **Non-Fouling** | Probability of non-fouling properties | 0-1 scale, higher = lower probability of binding to off-targets | PeptideBERT, PepLand | 17,186 peptides | XGBoost + PeptideCLM embeddings |
28
+ | **Permeability** | Cell membrane permeability (PAMPA lipophilicity score log P scale, range -10 to 0) | Values ≥ −6.0 indicate strong permeability; values < −6.0 indicate weak permeability | ChEMBL (22,040), CycPeptMPDB (7,451) | 34,853 peptides | XGBoost + PeptideCLM embeddings + molecular descriptors |
29
+ | **Binding Affinity** | Peptide-protein binding strength (-log Kd/Ki/IC50 scale) | Weak binding (< 6.0), medium binding (6.0 − 7.5), and high binding (≥ 7.5) | PepLand | 1,806 peptide-protein pairs | Cross-attention transformer (ESM2 + PeptideCLM) |
30
+
31
+ ## Model Performance 🌟
32
+
33
+ #### Binary Classification Predictors
34
+
35
+ | Predictor | Val AUC | Val F1 |
36
+ |-----------|----------------|----------|
37
+ | **Non-Hemolysis** | 0.7902 | 0.8260 |
38
+ | **Solubility** | 0.6016 | 0.5767 |
39
+ | **Non-Fouling** | TBD | TBD |
40
+
41
+ #### Regression Predictors
42
+
43
+ | Predictor | Train Correlation (Spearman) | Val Correlation (Spearman) |
44
+ |-----------|------------------------------|----------------------------|
45
+ | **Permeability** | 0.958 | 0.710 |
46
+ | **Binding Affinity** | 0.805 | 0.611 |
47
+
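The regression predictors above are scored by Spearman correlation, which compares the rank order of predictions against the rank order of measured values. The sketch below shows one way to compute it in plain Python. It is not the repository's evaluation code; the helper names and the example values are made up for illustration.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the window over a run of equal values.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def spearman(y_true, y_pred):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    rank_true = average_ranks(y_true)
    rank_pred = average_ranks(y_pred)
    mean_true, mean_pred = mean(rank_true), mean(rank_pred)
    covariance = sum(
        (a - mean_true) * (b - mean_pred) for a, b in zip(rank_true, rank_pred)
    )
    spread_true = sum((a - mean_true) ** 2 for a in rank_true) ** 0.5
    spread_pred = sum((b - mean_pred) ** 2 for b in rank_pred) ** 0.5
    return covariance / (spread_true * spread_pred)


# Made-up measured and predicted permeability values, for illustration only.
measured = [-7.2, -6.1, -5.4, -4.9]
predicted = [-6.8, -6.3, -5.9, -5.0]
print(f"Spearman: {spearman(measured, predicted):.3f}")
```

Because only ranks matter, a predictor can reach a high Spearman score while still being offset or rescaled in absolute terms.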
48
+ ## Setup 🌟
49
+
50
+ 1. Clone the repository:
51
+ ```bash
52
+ git clone https://github.com/sophtang/PeptiVerse.git
53
+ cd PeptiVerse
54
+ ```
55
+
56
+ 2. Install environment:
57
+ ```bash
58
+ conda env create -f environment.yml
59
+
60
+ conda activate peptiverse
61
+ ```
62
+
63
+ 3. Change the `base_path` in each file to ensure that all model weights and tokenizers are loaded correctly.
64
+
65
+ ## Usage 🌟
66
+
67
+ #### 1. Hemolysis Prediction
68
+
69
+ Predicts the probability that a peptide is **not hemolytic**. Higher scores indicate safer peptides.
70
+
71
+ ```python
72
+ import sys
73
+ sys.path.append('/path/to/PeptiVerse')
74
+ from functions.hemolysis.hemolysis import Hemolysis
75
+
76
+ # Initialize predictor
77
+ hemo = Hemolysis()
78
+
79
+ # Input peptide in SMILES format
80
+ peptides = [
81
+ "NCC(=O)N[C@H](CS)C(=O)N[C@@H](CO)C(=O)NCC(=O)N[C@@H](CC1=CN=C-N1)C(=O)O"
82
+ ]
83
+
84
+ # Get predictions
85
+ scores = hemo(peptides)
86
+ print(f"Non-hemolytic probability: {scores[0]:.3f}")
87
+ ```
88
+
89
+ **Output interpretation:**
90
+ - Score close to 1.0 = likely non-hemolytic (safe)
91
+ - Score close to 0.0 = likely hemolytic (unsafe)
92
+
93
+ ---
94
+
95
+ #### 2. Solubility Prediction
96
+
97
+ Predicts aqueous solubility. Higher scores indicate better solubility.
98
+
99
+ ```python
100
+ from functions.solubility.solubility import Solubility
101
+
102
+ # Initialize predictor
103
+ sol = Solubility()
104
+
105
+ # Input peptide
106
+ peptides = [
107
+ "NCC(=O)N[C@H](CS)C(=O)N[C@@H](CO)C(=O)NCC(=O)N[C@@H](CC1=CN=C-N1)C(=O)O"
108
+ ]
109
+
110
+ # Get predictions
111
+ scores = sol(peptides)
112
+ print(f"Solubility probability: {scores[0]:.3f}")
113
+ ```
114
+
115
+ **Output interpretation:**
116
+ - Score close to 1.0 = highly soluble
117
+ - Score close to 0.0 = poorly soluble
118
+
119
+ ---
120
+
121
+ #### 3. Nonfouling Prediction
122
+
123
+ Predicts protein resistance/non-fouling properties.
124
+
125
+ ```python
126
+ from functions.nonfouling.nonfouling import Nonfouling
127
+
128
+ # Initialize predictor
129
+ nf = Nonfouling()
130
+
131
+ # Input peptide
132
+ peptides = [
133
+ "NCC(=O)N[C@H](CS)C(=O)N[C@@H](CO)C(=O)NCC(=O)N[C@@H](CC1=CN=C-N1)C(=O)O"
134
+ ]
135
+
136
+ # Get predictions
137
+ scores = nf(peptides)
138
+ print(f"Nonfouling score: {scores[0]:.3f}")
139
+ ```
140
+
141
+ **Output interpretation:**
142
+ - Higher scores = better non-fouling properties
143
+
144
+ ---
145
+
146
+ #### 4. Permeability Prediction
147
+
148
+ Predicts membrane permeability on a log P scale.
149
+
150
+ ```python
151
+ from functions.permeability.permeability import Permeability
152
+
153
+ # Initialize predictor
154
+ perm = Permeability()
155
+
156
+ # Input peptide
157
+ peptides = [
158
+ "N[C@@H](CCCNC(=N)N)C(=O)N[C@@H](Cc1cNc2c1cc(O)cc2)C(=O)O"
159
+ ]
160
+
161
+ # Get predictions
162
+ scores = perm(peptides)
163
+ print(f"Permeability (log P): {scores[0]:.3f}")
164
+ ```
165
+
166
+ **Output interpretation:**
167
+ - Higher values = more permeable
168
+ - Typical range: -10 to 0 (log scale)
169
+
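The Predictors table treats −6.0 as the cutoff between strong and weak permeability. A minimal sketch of applying that cutoff to predicted scores follows. The function name and the example scores are hypothetical and not part of the repository.

```python
PERMEABILITY_THRESHOLD = -6.0  # cutoff taken from the Predictors table


def permeability_label(score, threshold=PERMEABILITY_THRESHOLD):
    """Label a predicted permeability score as 'strong' or 'weak'."""
    return "strong" if score >= threshold else "weak"


# Made-up scores standing in for the output of perm(peptides).
example_scores = [-5.2, -6.0, -7.8]
for score in example_scores:
    print(f"{score:.1f} -> {permeability_label(score)}")
```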
170
+ ---
171
+
172
+ #### 5. Binding Affinity Prediction
173
+
174
+ Predicts peptide-protein binding affinity. Requires both peptide and target protein sequence.
175
+
176
+ ```python
177
+ from functions.binding.binding import BindingAffinity
178
+
179
+ # Target protein sequence (amino acid format)
180
+ target_protein = "MTKSNGEEPKMGGRMERFQQGVRKRTLLAKKKVQNITKEDVKSYLFRNAFVLL..."
181
+
182
+ # Initialize predictor with target protein
183
+ binding = BindingAffinity(prot_seq=target_protein)
184
+
185
+ # Input peptide in SMILES format
186
+ peptides = [
187
+ "CC[C@H](C)[C@H](NC(=O)[C@H](C)NC(=O)[C@@H](N)Cc1c[nH]cn1)C(=O)O"
188
+ ]
189
+
190
+ # Get predictions
191
+ scores = binding(peptides)
192
+ print(f"Binding affinity (-log Kd): {scores[0]:.3f}")
193
+ ```
194
+
195
+ **Output interpretation:**
196
+ - Higher values = stronger binding
197
+ - Scale: -log(Kd/Ki/IC50)
198
+ - 7.5+ = tight binding (≤ ~30nM)
199
+ - 6.0-7.5 = medium binding (~30nM - 1μM)
200
+ - <6.0 = weak binding (> 1μM)
201
+
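The thresholds above can be turned into concentrations and labels with a few lines of arithmetic. The sketch below assumes the score is a base-10 negative logarithm of a molar concentration, which is the usual convention for pKd-style values but is not stated explicitly here. The function names are hypothetical.

```python
def score_to_nanomolar(score):
    """Convert a -log10(molar concentration) score to nanomolar."""
    # 10**(-score) is the molar value; multiplying by 1e9 gives nM.
    return 10 ** (9 - score)


def binding_label(score):
    """Bucket a score using the 6.0 and 7.5 thresholds listed above."""
    if score >= 7.5:
        return "tight"
    if score >= 6.0:
        return "medium"
    return "weak"


# Made-up scores standing in for the output of binding(peptides).
for score in [5.1, 6.0, 7.5, 8.3]:
    print(f"{score:.1f} -> {score_to_nanomolar(score):.1f} nM ({binding_label(score)})")
```

Under this assumption a score of 7.5 corresponds to roughly 32 nM and a score of 6.0 to 1 μM, which matches the approximate ranges in the list.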
202
+ ---
203
+
204
+ ## Batch Processing 🌟
205
+
206
+ All predictors support batch processing for multiple peptides:
207
+
208
+ ```python
209
+ from functions.hemolysis.hemolysis import Hemolysis
210
+
211
+ hemo = Hemolysis()
212
+
213
+ # Multiple peptides
214
+ peptides = [
215
+ "NCC(=O)N[C@H](CS)C(=O)O",
216
+ "CC(C)C[C@H](NC(=O)[C@H](CC(C)C)NC(=O)O)C(=O)O",
217
+ "N[C@@H](CO)C(=O)N[C@@H](CC(C)C)C(=O)O"
218
+ ]
219
+
220
+ # Get predictions for all
221
+ scores = hemo(peptides)
222
+
223
+ for i, score in enumerate(scores):
224
+ print(f"Peptide {i+1}: {score:.3f}")
225
+ ```
226
+
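For long peptide lists it may help to feed a predictor in fixed-size chunks to limit memory use. The helper below is a sketch, not part of the repository. It assumes only what the examples show: a predictor is a callable that takes a list of SMILES strings and returns one score per peptide, in order. `predict_in_chunks`, `chunk_size`, and the stand-in predictor are hypothetical.

```python
def predict_in_chunks(predictor, smiles_list, chunk_size=64):
    """Call `predictor` on successive slices and concatenate the scores."""
    scores = []
    for start in range(0, len(smiles_list), chunk_size):
        chunk = smiles_list[start:start + chunk_size]
        scores.extend(predictor(chunk))
    return scores


def length_predictor(smiles_batch):
    """Stand-in predictor for illustration: scores each string by its length."""
    return [float(len(smiles)) for smiles in smiles_batch]


peptides = [
    "NCC(=O)N[C@H](CS)C(=O)O",
    "N[C@@H](CO)C(=O)N[C@@H](CC(C)C)C(=O)O",
    "NCC(=O)O",
]
print(predict_in_chunks(length_predictor, peptides, chunk_size=2))
```

In practice `length_predictor` would be replaced by an initialized predictor such as `hemo`, and a suitable chunk size depends on the available memory.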
227
+ ---
228
+
229
+ ## Unified Scoring with Multiple Predictors 🌟
230
+
231
+ For convenience, you can use `scoring_functions.py` to evaluate multiple properties at once and get a score vector for each peptide.
232
+
233
+ ### Basic Usage
234
+
235
+ ```python
236
+ import sys
237
+ sys.path.append('/path/to/PeptiVerse')
238
+ from scoring_functions import ScoringFunctions
239
+
240
+ # Initialize with desired scoring functions
241
+ # Available: 'binding_affinity1', 'binding_affinity2', 'permeability',
242
+ # 'solubility', 'hemolysis', 'nonfouling'
243
+ scoring = ScoringFunctions(
244
+ score_func_names=['solubility', 'hemolysis', 'nonfouling', 'permeability'],
245
+ prot_seqs=[] # Empty if not using binding affinity
246
+ )
247
+
248
+ # Input peptides in SMILES format
249
+ peptides = [
250
+ 'N2[C@H](CC(C)C)C(=O)N1[C@@H](CCC1)C(=O)N[C@@H](Cc1ccccc1)C2(=O)',
251
+ 'NCC(=O)N[C@H](CS)C(=O)N[C@@H](CO)C(=O)O'
252
+ ]
253
+
254
+ # Get scores (returns numpy array of shape: num_peptides x num_functions)
255
+ scores = scoring(input_seqs=peptides)
256
+ print(scores)
257
+ ```
258
+
259
+ ### Adding Binding Affinity
260
+
261
+ ```python
262
+ from scoring_functions import ScoringFunctions
263
+
264
+ # Target protein sequence (amino acid format)
265
+ tfr_protein = "MMDQARSAFSNLFGGEPLSYTRFSLARQVDGDNSHVEMKLAVDEEENADNNT..."
266
+
267
+ # Initialize with binding affinity for one protein
268
+ scoring = ScoringFunctions(
269
+ score_func_names=['binding_affinity1', 'solubility', 'hemolysis', 'permeability'],
270
+ prot_seqs=[tfr_protein] # Provide target protein sequence
271
+ )
272
+
273
+ peptides = ['N2[C@H](CC(C)C)C(=O)N1[C@@H](CCC1)C(=O)N[C@@H](Cc1ccccc1)C2(=O)']
274
+ scores = scoring(input_seqs=peptides)
275
+
276
+ # scores[0] will contain: [binding_affinity, solubility, hemolysis, permeability]
277
+ print("Scores for peptide 1:")
278
+ print(f" Binding Affinity: {scores[0][0]:.3f}")
279
+ print(f" Solubility: {scores[0][1]:.3f}")
280
+ print(f" Hemolysis (non-hemolytic prob): {scores[0][2]:.3f}")
281
+ print(f" Permeability: {scores[0][3]:.3f}")
282
+ ```
283
+
284
+ ### Multiple Binding Targets
285
+
286
+ ```python
287
+ # For dual binding affinity prediction
288
+ protein1 = "MMDQARSAFSNLFGGEPLSYTR..." # First target
289
+ protein2 = "MTKSNGEEPKMGGRMERFQQGV..." # Second target
290
+
291
+ scoring = ScoringFunctions(
292
+ score_func_names=['binding_affinity1', 'binding_affinity2', 'solubility', 'hemolysis'],
293
+ prot_seqs=[protein1, protein2] # Provide both protein sequences
294
+ )
295
+
296
+ peptides = ['N2[C@H](CC(C)C)C(=O)N1[C@@H](CCC1)C(=O)...']
297
+ scores = scoring(input_seqs=peptides)
298
+
299
+ # scores[0] will contain: [binding_aff1, binding_aff2, solubility, hemolysis]
300
+ ```
301
+
302
+ ### Output Format
303
+
304
+ The `ScoringFunctions` class returns a numpy array where:
305
+ - **Rows**: Each row corresponds to one input peptide
306
+ - **Columns**: Each column corresponds to one scoring function (in the order specified)
307
+
308
+ ```python
309
+ # Example with 3 peptides and 4 scoring functions
310
+ scores = scoring(input_seqs=peptides)
311
+ # Shape: (3, 4)
312
+ # scores[0] = [func1_score, func2_score, func3_score, func4_score] for peptide 1
313
+ # scores[1] = [func1_score, func2_score, func3_score, func4_score] for peptide 2
314
+ # scores[2] = [func1_score, func2_score, func3_score, func4_score] for peptide 3
315
+ ```
316
+
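Since columns follow the order of `score_func_names`, it can be convenient to attach names to each row. The sketch below uses a nested list with made-up numbers as a stand-in for the returned array. `label_scores` and the peptide identifiers are hypothetical helpers, not part of `ScoringFunctions`.

```python
def label_scores(score_matrix, peptide_ids, score_func_names):
    """Map each peptide id to a {function name: score} dictionary."""
    return {
        peptide_id: dict(zip(score_func_names, (float(value) for value in row)))
        for peptide_id, row in zip(peptide_ids, score_matrix)
    }


score_func_names = ["solubility", "hemolysis", "nonfouling", "permeability"]

# Made-up values with shape (2 peptides, 4 scoring functions).
example_scores = [
    [0.71, 0.88, 0.40, -5.9],
    [0.35, 0.92, 0.55, -7.1],
]

labelled = label_scores(example_scores, ["pep1", "pep2"], score_func_names)
for peptide_id, row in labelled.items():
    print(peptide_id, row)
```

Iterating row by row works the same way on a two-dimensional numpy array, so the helper should apply to the real output without changes.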
317
+ ---
318
+
319
+ ## Complete Example 🌟
320
+
321
+ ```python
322
+ import sys
323
+ sys.path.append('/path/to/PeptiVerse')
324
+ from functions.hemolysis.hemolysis import Hemolysis
325
+ from functions.solubility.solubility import Solubility
326
+ from functions.permeability.permeability import Permeability
327
+
328
+ # Initialize predictors
329
+ hemo = Hemolysis()
330
+ sol = Solubility()
331
+ perm = Permeability()
332
+
333
+ # Test peptide
334
+ peptide = ["NCC(=O)N[C@H](CS)C(=O)N[C@@H](CO)C(=O)O"]
335
+
336
+ # Get all predictions
337
+ hemo_score = hemo(peptide)[0]
338
+ sol_score = sol(peptide)[0]
339
+ perm_score = perm(peptide)[0]
340
+
341
+ print("Peptide Property Predictions:")
342
+ print(f" Hemolysis (non-hemolytic prob): {hemo_score:.3f}")
343
+ print(f" Solubility: {sol_score:.3f}")
344
+ print(f" Permeability: {perm_score:.3f}")
345
+ ```
346
+
347
+ ---
348
+
349
+ ## Model Architecture 🌟
350
+
351
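+ The classification and permeability predictors share the setup below (permeability additionally uses molecular descriptors). The binding affinity predictor instead uses a cross-attention transformer over ESM2 and PeptideCLM embeddings.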
352
+ - **Embeddings**: PeptideCLM-23M (RoFormer-based peptide language model)
353
+ - **Classifier**: XGBoost gradient boosting
354
+ - **Input**: SMILES representation of peptides
355
+ - **Training**: Models trained on curated datasets with cross-validation
356
+
357
+ ---
358
+ ## Citation
359
+
360
+ If you find this repository helpful for your publications, please consider citing our paper:
361
+
362
+ ```
363
+ @article{tang2025peptune,
364
+ title={PepTune: De novo generation of therapeutic peptides with multi-objective-guided discrete diffusion},
365
+ author={Tang, Sophia and Zhang, Yinuo and Chatterjee, Pranam},
366
+ journal={42nd International Conference on Machine Learning},
367
+ year={2025}
368
+ }
369
+ ```
370
+ By using this repository, you agree to abide by the MIT License.