Commit 311d99e by pranamanam (verified)
1 Parent(s): 36d2203

Update README.md

  1. README.md +372 -369
README.md CHANGED
@@ -1,370 +1,373 @@
1
- ---
2
- license: cc-by-nc-nd-4.0
3
- ---
4
-
5
- This repo contains important large files for [PeptiVerse](https://huggingface.co/spaces/ChatterjeeLab/PeptiVerse), an interactive app for peptide property prediction.
6
-
7
- - `embeddings` folder contains processed huggingface datasets with peptideCLM embeddings. The `.csv` is the pre-processed data.
8
- - `metrics` folder contains the model performance on the validation data
9
- - `models` host all trained model weights
10
- - `training_data` host all **raw data** to train the classifiers
11
- - `functions` contains files to utilize the trained weights and classifiers
12
- - `train` contains the script to train classifiers on the pre-processed embeddings, either through xgboost or MLPs.
13
- - `scoring_function.py` contains a class that aggregates all trained classifiers for diverse downstream sampling applications
14
-
15
- # PeptiVerse 🧬🌌
16
-
17
- A collection of machine learning predictors for non-canonical and canonical peptide property prediction for SMILES representation. 🧬 PeptiVerse 🌌 enables evaluation of key biophysical and therapeutic properties of peptides for property-optimized generation.
18
-
19
- ## Predictors 🧫
20
-
21
- PeptiVerse includes the following property predictors:
22
-
23
- | Predictor | Measurement | Interpretation | Training Data Source | Dataset Size | Model Type |
24
- |-----------|-------------|-----------------| --------------------|--------------|------------|
25
- | **Non-Hemolysis** | Probability of non-hemolytic behavior | 0-1 scale, higher = less hemolytic | PeptideBERT, PepLand | 6,077 peptides | XGBoost + PeptideCLM embeddings |
26
- | **Solubility** | Probability of aqueous solubility | 0-1 scale, higher = more soluble | PeptideBERT, PepLand | 18,454 peptides | XGBoost + PeptideCLM embeddings |
27
- | **Non-Fouling** | Probability of non-fouling properties | 0-1 scale, higher = lower probability of binding to off-targets | PeptideBERT, PepLand | 17,186 peptides | XGBoost + PeptideCLM embeddings |
28
- | **Permeability** | Cell membrane permeability (PAMPA lipophilicity score log P scale, range -10 to 0) | ≥ −6.0 indicate strong permeability and values < 6.0 indicate weak permeability | ChEMBL (22,040), CycPeptMPDB (7451) | 34,853 peptides | XGBoost + PeptideCLM embeddings + molecular descriptors |
29
- | **Binding Affinity** | Peptide-protein binding strength (-log Kd/Ki/IC50 scale) | Weak binding (< 6.0), medium binding (6.0 − 7.5), and high binding (≥ 7.5) | PepLand | 1806 peptide-protein pairs | Cross-attention transformer (ESM2 + PeptideCLM) |
30
-
31
- ## Model Performance 🌟
32
-
33
- #### Binary Classification Predictors
34
-
35
- | Predictor | Val AUC | Val F1 |
36
- |-----------|----------------|----------|
37
- | **Non-Hemolysis** | 0.7902 | 0.8260 |
38
- | **Solubility** | 0.6016 | 0.5767 |
39
- | **Nonfouling** | 0.9327 | 0.8774 |
40
-
41
- #### Regression Predictors
42
-
43
- | Predictor | Train Correlation (Spearman) | Val Correlation (Spearman) |
44
- |-----------|------------------------------|----------------------------|
45
- | **Permeability** | 0.958 | 0.710 |
46
- | **Binding Affinity** | 0.805 | 0.611 |
47
-
48
- ## Setup 🌟
49
-
50
- 1. Clone the repository:
51
- ```bash
52
- git clone https://github.com/sophtang/PeptiVerse.git
53
- cd PeptiVerse
54
- ```
55
-
56
- 2. Install environment:
57
- ```bash
58
- conda env create -f environment.yml
59
-
60
- conda activate peptiverse
61
- ```
62
-
63
- 3. Change the `base_path` in each file to ensure that all model weights and tokenizers are loaded correctly.
64
-
65
- ## Usage 🌟
66
-
67
- #### 1. Hemolysis Prediction
68
-
69
- Predicts the probability that a peptide is **not hemolytic**. Higher scores indicate safer peptides.
70
-
71
- ```python
72
- import sys
73
- sys.path.append('/path/to/PeptiVerse')
74
- from functions.hemolysis.hemolysis import Hemolysis
75
-
76
- # Initialize predictor
77
- hemo = Hemolysis()
78
-
79
- # Input peptide in SMILES format
80
- peptides = [
81
- "NCC(=O)N[C@H](CS)C(=O)N[C@@H](CO)C(=O)NCC(=O)N[C@@H](CC1=CN=C-N1)C(=O)O"
82
- ]
83
-
84
- # Get predictions
85
- scores = hemo(peptides)
86
- print(f"Non-hemolytic probability: {scores[0]:.3f}")
87
- ```
88
-
89
- **Output interpretation:**
90
- - Score close to 1.0 = likely non-hemolytic (safe)
91
- - Score close to 0.0 = likely hemolytic (unsafe)
92
-
93
- ---
94
-
95
- #### 2. Solubility Prediction
96
-
97
- Predicts aqueous solubility. Higher scores indicate better solubility.
98
-
99
- ```python
100
- from functions.solubility.solubility import Solubility
101
-
102
- # Initialize predictor
103
- sol = Solubility()
104
-
105
- # Input peptide
106
- peptides = [
107
- "NCC(=O)N[C@H](CS)C(=O)N[C@@H](CO)C(=O)NCC(=O)N[C@@H](CC1=CN=C-N1)C(=O)O"
108
- ]
109
-
110
- # Get predictions
111
- scores = sol(peptides)
112
- print(f"Solubility probability: {scores[0]:.3f}")
113
- ```
114
-
115
- **Output interpretation:**
116
- - Score close to 1.0 = highly soluble
117
- - Score close to 0.0 = poorly soluble
118
-
119
- ---
120
-
121
- #### 3. Nonfouling Prediction
122
-
123
- Predicts protein resistance/non-fouling properties.
124
-
125
- ```python
126
- from functions.nonfouling.nonfouling import Nonfouling
127
-
128
- # Initialize predictor
129
- nf = Nonfouling()
130
-
131
- # Input peptide
132
- peptides = [
133
- "NCC(=O)N[C@H](CS)C(=O)N[C@@H](CO)C(=O)NCC(=O)N[C@@H](CC1=CN=C-N1)C(=O)O"
134
- ]
135
-
136
- # Get predictions
137
- scores = nf(peptides)
138
- print(f"Nonfouling score: {scores[0]:.3f}")
139
- ```
140
-
141
- **Output interpretation:**
142
- - Higher scores = better non-fouling properties
143
-
144
- ---
145
-
146
- #### 4. Permeability Prediction
147
-
148
- Predicts membrane permeability on a log P scale.
149
-
150
- ```python
151
- from functions.permeability.permeability import Permeability
152
-
153
- # Initialize predictor
154
- perm = Permeability()
155
-
156
- # Input peptide
157
- peptides = [
158
- "N[C@@H](CCCNC(=N)N)C(=O)N[C@@H](Cc1cNc2c1cc(O)cc2)C(=O)O"
159
- ]
160
-
161
- # Get predictions
162
- scores = perm(peptides)
163
- print(f"Permeability (log P): {scores[0]:.3f}")
164
- ```
165
-
166
- **Output interpretation:**
167
- - Higher values = more permeable
168
- - Typical range: -10 to 0 (log scale)
169
-
170
- ---
171
-
172
- #### 5. Binding Affinity Prediction
173
-
174
- Predicts peptide-protein binding affinity. Requires both peptide and target protein sequence.
175
-
176
- ```python
177
- from functions.binding.binding import BindingAffinity
178
-
179
- # Target protein sequence (amino acid format)
180
- target_protein = "MTKSNGEEPKMGGRMERFQQGVRKRTLLAKKKVQNITKEDVKSYLFRNAFVLL..."
181
-
182
- # Initialize predictor with target protein
183
- binding = BindingAffinity(prot_seq=target_protein)
184
-
185
- # Input peptide in SMILES format
186
- peptides = [
187
- "CC[C@H](C)[C@H](NC(=O)[C@H](C)NC(=O)[C@@H](N)Cc1c[nH]cn1)C(=O)O"
188
- ]
189
-
190
- # Get predictions
191
- scores = binding(peptides)
192
- print(f"Binding affinity (-log Kd): {scores[0]:.3f}")
193
- ```
194
-
195
- **Output interpretation:**
196
- - Higher values = stronger binding
197
- - Scale: -log(Kd/Ki/IC50)
198
- - 7.5+ = tight binding (≤ ~30nM)
199
- - 6.0-7.5 = medium binding (~30nM - 1μM)
200
- - <6.0 = weak binding (> 1μM)
201
-
202
- ---
203
-
204
- ## Batch Processing 🌟
205
-
206
- All predictors support batch processing for multiple peptides:
207
-
208
- ```python
209
- from functions.hemolysis.hemolysis import Hemolysis
210
-
211
- hemo = Hemolysis()
212
-
213
- # Multiple peptides
214
- peptides = [
215
- "NCC(=O)N[C@H](CS)C(=O)O",
216
- "CC(C)C[C@H](NC(=O)[C@H](CC(C)C)NC(=O)O)C(=O)O",
217
- "N[C@@H](CO)C(=O)N[C@@H](CC(C)C)C(=O)O"
218
- ]
219
-
220
- # Get predictions for all
221
- scores = hemo(peptides)
222
-
223
- for i, score in enumerate(scores):
224
- print(f"Peptide {i+1}: {score:.3f}")
225
- ```
226
-
227
- ---
228
-
229
- ## Unified Scoring with Multiple Predictors 🌟
230
-
231
- For convenience, you can use `scoring_functions.py` to evaluate multiple properties at once and get a score vector for each peptide.
232
-
233
- ### Basic Usage
234
-
235
- ```python
236
- import sys
237
- sys.path.append('/path/to/PeptiVerse')
238
- from scoring_functions import ScoringFunctions
239
-
240
- # Initialize with desired scoring functions
241
- # Available: 'binding_affinity1', 'binding_affinity2', 'permeability',
242
- # 'solubility', 'hemolysis', 'nonfouling'
243
- scoring = ScoringFunctions(
244
- score_func_names=['solubility', 'hemolysis', 'nonfouling', 'permeability'],
245
- prot_seqs=[] # Empty if not using binding affinity
246
- )
247
-
248
- # Input peptides in SMILES format
249
- peptides = [
250
- 'N2[C@H](CC(C)C)C(=O)N1[C@@H](CCC1)C(=O)N[C@@H](Cc1ccccc1)C2(=O)',
251
- 'NCC(=O)N[C@H](CS)C(=O)N[C@@H](CO)C(=O)O'
252
- ]
253
-
254
- # Get scores (returns numpy array of shape: num_peptides x num_functions)
255
- scores = scoring(input_seqs=peptides)
256
- print(scores)
257
- ```
258
-
259
- ### Adding Binding Affinity
260
-
261
- ```python
262
- from scoring_functions import ScoringFunctions
263
-
264
- # Target protein sequence (amino acid format)
265
- tfr_protein = "MMDQARSAFSNLFGGEPLSYTRFSLARQVDGDNSHVEMKLAVDEEENADNNT..."
266
-
267
- # Initialize with binding affinity for one protein
268
- scoring = ScoringFunctions(
269
- score_func_names=['binding_affinity1', 'solubility', 'hemolysis', 'permeability'],
270
- prot_seqs=[tfr_protein] # Provide target protein sequence
271
- )
272
-
273
- peptides = ['N2[C@H](CC(C)C)C(=O)N1[C@@H](CCC1)C(=O)N[C@@H](Cc1ccccc1)C2(=O)']
274
- scores = scoring(input_seqs=peptides)
275
-
276
- # scores[0] will contain: [binding_affinity, solubility, hemolysis, permeability]
277
- print(f"Scores for peptide 1:")
278
- print(f" Binding Affinity: {scores[0][0]:.3f}")
279
- print(f" Solubility: {scores[0][1]:.3f}")
280
- print(f" Hemolysis: {scores[0][2]:.3f}")
281
- print(f" Permeability: {scores[0][3]:.3f}")
282
- ```
283
-
284
- ### Multiple Binding Targets
285
-
286
- ```python
287
- # For dual binding affinity prediction
288
- protein1 = "MMDQARSAFSNLFGGEPLSYTR..." # First target
289
- protein2 = "MTKSNGEEPKMGGRMERFQQGV..." # Second target
290
-
291
- scoring = ScoringFunctions(
292
- score_func_names=['binding_affinity1', 'binding_affinity2', 'solubility', 'hemolysis'],
293
- prot_seqs=[protein1, protein2] # Provide both protein sequences
294
- )
295
-
296
- peptides = ['N2[C@H](CC(C)C)C(=O)N1[C@@H](CCC1)C(=O)...']
297
- scores = scoring(input_seqs=peptides)
298
-
299
- # scores[0] will contain: [binding_aff1, binding_aff2, solubility, hemolysis]
300
- ```
301
-
302
- ### Output Format
303
-
304
- The `ScoringFunctions` class returns a numpy array where:
305
- - **Rows**: Each row corresponds to one input peptide
306
- - **Columns**: Each column corresponds to one scoring function (in the order specified)
307
-
308
- ```python
309
- # Example with 3 peptides and 4 scoring functions
310
- scores = scoring(input_seqs=peptides)
311
- # Shape: (3, 4)
312
- # scores[0] = [func1_score, func2_score, func3_score, func4_score] for peptide 1
313
- # scores[1] = [func1_score, func2_score, func3_score, func4_score] for peptide 2
314
- # scores[2] = [func1_score, func2_score, func3_score, func4_score] for peptide 3
315
- ```
316
-
317
- ---
318
-
319
- ## Complete Example 🌟
320
-
321
- ```python
322
- import sys
323
- sys.path.append('/path/to/PeptiVerse')
324
- from functions.hemolysis.hemolysis import Hemolysis
325
- from functions.solubility.solubility import Solubility
326
- from functions.permeability.permeability import Permeability
327
-
328
- # Initialize predictors
329
- hemo = Hemolysis()
330
- sol = Solubility()
331
- perm = Permeability()
332
-
333
- # Test peptide
334
- peptide = ["NCC(=O)N[C@H](CS)C(=O)N[C@@H](CO)C(=O)O"]
335
-
336
- # Get all predictions
337
- hemo_score = hemo(peptide)[0]
338
- sol_score = sol(peptide)[0]
339
- perm_score = perm(peptide)[0]
340
-
341
- print("Peptide Property Predictions:")
342
- print(f" Hemolysis (non-hemolytic prob): {hemo_score:.3f}")
343
- print(f" Solubility: {sol_score:.3f}")
344
- print(f" Permeability: {perm_score:.3f}")
345
- ```
346
-
347
- ---
348
-
349
- ## Model Architecture 🌟
350
-
351
- All predictors use:
352
- - **Embeddings**: PeptideCLM-23M (RoFormer-based peptide language model)
353
- - **Classifier**: XGBoost gradient boosting
354
- - **Input**: SMILES representation of peptides
355
- - **Training**: Models trained on curated datasets with cross-validation
356
-
357
- ---
358
- ## Citation
359
-
360
- If you find this repository helpful for your publications, please consider citing our paper:
361
-
362
- ```
363
- @article{tang2025peptune,
364
- title={Peptune: De novo generation of therapeutic peptides with multi-objective-guided discrete diffusion},
365
- author={Tang, Sophia and Zhang, Yinuo and Chatterjee, Pranam},
366
- journal={42nd International Conference on Machine Learning},
367
- year={2025}
368
- }
369
- ```
 
 
 
370
  To use this repository, you agree to abide by the MIT License.
 
1
+ ---
2
+ license: cc-by-nc-nd-4.0
3
+ ---
4
+
5
+
6
+ ![Untitled design (3)](https://cdn-uploads.huggingface.co/production/uploads/64cd5b3f0494187a9e8b7c69/bpOe1xggl9lw90JMi3VsC.png)
7
+
8
+ This repo contains important large files for [PeptiVerse](https://huggingface.co/spaces/ChatterjeeLab/PeptiVerse), an interactive app for peptide property prediction.
9
+
10
+ - `embeddings` folder contains processed Hugging Face datasets with PeptideCLM embeddings. The `.csv` is the pre-processed data.
11
+ - `metrics` folder contains the model performance on the validation data
12
+ - `models` hosts all trained model weights
13
+ - `training_data` hosts all **raw data** used to train the classifiers
14
+ - `functions` contains files to utilize the trained weights and classifiers
15
+ - `train` contains the script to train classifiers on the pre-processed embeddings, using either XGBoost or MLPs.
16
+ - `scoring_functions.py` contains a class that aggregates all trained classifiers for diverse downstream sampling applications
17
+
18
+ # PeptiVerse 🧬🌌
19
+
20
+ A collection of machine learning predictors that estimate properties of canonical and non-canonical peptides from their SMILES representations. 🧬 PeptiVerse 🌌 enables evaluation of key biophysical and therapeutic properties of peptides for property-optimized generation.
21
+
22
+ ## Predictors 🧫
23
+
24
+ PeptiVerse includes the following property predictors:
25
+
26
+ | Predictor | Measurement | Interpretation | Training Data Source | Dataset Size | Model Type |
27
+ |-----------|-------------|-----------------| --------------------|--------------|------------|
28
+ | **Non-Hemolysis** | Probability of non-hemolytic behavior | 0-1 scale, higher = less hemolytic | PeptideBERT, PepLand | 6,077 peptides | XGBoost + PeptideCLM embeddings |
29
+ | **Solubility** | Probability of aqueous solubility | 0-1 scale, higher = more soluble | PeptideBERT, PepLand | 18,454 peptides | XGBoost + PeptideCLM embeddings |
30
+ | **Non-Fouling** | Probability of non-fouling properties | 0-1 scale, higher = lower probability of binding to off-targets | PeptideBERT, PepLand | 17,186 peptides | XGBoost + PeptideCLM embeddings |
31
+ | **Permeability** | Cell membrane permeability (PAMPA lipophilicity score, log P scale, range −10 to 0) | Values ≥ −6.0 indicate strong permeability; values < −6.0 indicate weak permeability | ChEMBL (22,040), CycPeptMPDB (7,451) | 34,853 peptides | XGBoost + PeptideCLM embeddings + molecular descriptors |
32
+ | **Binding Affinity** | Peptide-protein binding strength (-log Kd/Ki/IC50 scale) | Weak binding (< 6.0), medium binding (6.0–7.5), and high binding (≥ 7.5) | PepLand | 1,806 peptide-protein pairs | Cross-attention transformer (ESM2 + PeptideCLM) |
33
+
34
+ ## Model Performance 🌟
35
+
36
+ #### Binary Classification Predictors
37
+
38
+ | Predictor | Val AUC | Val F1 |
39
+ |-----------|----------------|----------|
40
+ | **Non-Hemolysis** | 0.7902 | 0.8260 |
41
+ | **Solubility** | 0.6016 | 0.5767 |
42
+ | **Non-Fouling** | 0.9327 | 0.8774 |
43
+
44
+ #### Regression Predictors
45
+
46
+ | Predictor | Train Correlation (Spearman) | Val Correlation (Spearman) |
47
+ |-----------|------------------------------|----------------------------|
48
+ | **Permeability** | 0.958 | 0.710 |
49
+ | **Binding Affinity** | 0.805 | 0.611 |
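
The regression table reports Spearman correlation, which is the Pearson correlation computed on ranks rather than raw values. The sketch below shows that definition in plain Python. It is only an illustration: the helper names `average_ranks` and `spearman` and the example numbers are assumptions made here, and the repository's own metric code may well use a library routine instead.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the window over every value tied with the one at `pos`.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Made-up values for illustration only (not taken from the validation set)
measured = [-5.2, -6.1, -7.0, -4.9]
predicted = [-5.5, -5.9, -6.8, -5.0]
print(f"Spearman: {spearman(measured, predicted):.3f}")
```

Because only the ordering matters, a predictor can score well on this metric even when its absolute values are offset from the measurements.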
50
+
51
+ ## Setup 🌟
52
+
53
+ 1. Clone the repository:
54
+ ```bash
55
+ git clone https://github.com/sophtang/PeptiVerse.git
56
+ cd PeptiVerse
57
+ ```
58
+
59
+ 2. Install environment:
60
+ ```bash
61
+ conda env create -f environment.yml
62
+
63
+ conda activate peptiverse
64
+ ```
65
+
66
+ 3. Update the `base_path` variable in each file to match your local setup so that all model weights and tokenizers load correctly.
67
+
68
+ ## Usage 🌟
69
+
70
+ #### 1. Hemolysis Prediction
71
+
72
+ Predicts the probability that a peptide is **not hemolytic**. Higher scores indicate safer peptides.
73
+
74
+ ```python
75
+ import sys
76
+ sys.path.append('/path/to/PeptiVerse')
77
+ from functions.hemolysis.hemolysis import Hemolysis
78
+
79
+ # Initialize predictor
80
+ hemo = Hemolysis()
81
+
82
+ # Input peptide in SMILES format
83
+ peptides = [
84
+ "NCC(=O)N[C@H](CS)C(=O)N[C@@H](CO)C(=O)NCC(=O)N[C@@H](CC1=CN=C-N1)C(=O)O"
85
+ ]
86
+
87
+ # Get predictions
88
+ scores = hemo(peptides)
89
+ print(f"Non-hemolytic probability: {scores[0]:.3f}")
90
+ ```
91
+
92
+ **Output interpretation:**
93
+ - Score close to 1.0 = likely non-hemolytic (safe)
94
+ - Score close to 0.0 = likely hemolytic (unsafe)
95
+
96
+ ---
97
+
98
+ #### 2. Solubility Prediction
99
+
100
+ Predicts aqueous solubility. Higher scores indicate better solubility.
101
+
102
+ ```python
103
+ from functions.solubility.solubility import Solubility
104
+
105
+ # Initialize predictor
106
+ sol = Solubility()
107
+
108
+ # Input peptide
109
+ peptides = [
110
+ "NCC(=O)N[C@H](CS)C(=O)N[C@@H](CO)C(=O)NCC(=O)N[C@@H](CC1=CN=C-N1)C(=O)O"
111
+ ]
112
+
113
+ # Get predictions
114
+ scores = sol(peptides)
115
+ print(f"Solubility probability: {scores[0]:.3f}")
116
+ ```
117
+
118
+ **Output interpretation:**
119
+ - Score close to 1.0 = highly soluble
120
+ - Score close to 0.0 = poorly soluble
121
+
122
+ ---
123
+
124
+ #### 3. Nonfouling Prediction
125
+
126
+ Predicts protein resistance/non-fouling properties.
127
+
128
+ ```python
129
+ from functions.nonfouling.nonfouling import Nonfouling
130
+
131
+ # Initialize predictor
132
+ nf = Nonfouling()
133
+
134
+ # Input peptide
135
+ peptides = [
136
+ "NCC(=O)N[C@H](CS)C(=O)N[C@@H](CO)C(=O)NCC(=O)N[C@@H](CC1=CN=C-N1)C(=O)O"
137
+ ]
138
+
139
+ # Get predictions
140
+ scores = nf(peptides)
141
+ print(f"Nonfouling score: {scores[0]:.3f}")
142
+ ```
143
+
144
+ **Output interpretation:**
145
+ - Higher scores = better non-fouling properties
146
+
147
+ ---
148
+
149
+ #### 4. Permeability Prediction
150
+
151
+ Predicts membrane permeability on a log P scale.
152
+
153
+ ```python
154
+ from functions.permeability.permeability import Permeability
155
+
156
+ # Initialize predictor
157
+ perm = Permeability()
158
+
159
+ # Input peptide
160
+ peptides = [
161
+ "N[C@@H](CCCNC(=N)N)C(=O)N[C@@H](Cc1cNc2c1cc(O)cc2)C(=O)O"
162
+ ]
163
+
164
+ # Get predictions
165
+ scores = perm(peptides)
166
+ print(f"Permeability (log P): {scores[0]:.3f}")
167
+ ```
168
+
169
+ **Output interpretation:**
170
+ - Higher values = more permeable
171
+ - Typical range: -10 to 0 (log scale)
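
As a minimal sketch of how a permeability score might be turned into a label, the helper below applies the −6.0 cutoff given in the Predictors table. The function name `label_permeability` and its `threshold` argument are hypothetical and not part of the PeptiVerse code; the example values are made up.

```python
def label_permeability(log_p: float, threshold: float = -6.0) -> str:
    """Label a predicted permeability value using the -6.0 cutoff from the Predictors table."""
    # Hypothetical helper: values at or above the cutoff count as strong permeability.
    return "strong" if log_p >= threshold else "weak"


# Made-up example values, not model outputs
for value in (-4.8, -6.0, -7.3):
    print(f"{value:.1f} -> {label_permeability(value)}")
```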
172
+
173
+ ---
174
+
175
+ #### 5. Binding Affinity Prediction
176
+
177
+ Predicts peptide-protein binding affinity. Requires both peptide and target protein sequence.
178
+
179
+ ```python
180
+ from functions.binding.binding import BindingAffinity
181
+
182
+ # Target protein sequence (amino acid format)
183
+ target_protein = "MTKSNGEEPKMGGRMERFQQGVRKRTLLAKKKVQNITKEDVKSYLFRNAFVLL..."
184
+
185
+ # Initialize predictor with target protein
186
+ binding = BindingAffinity(prot_seq=target_protein)
187
+
188
+ # Input peptide in SMILES format
189
+ peptides = [
190
+ "CC[C@H](C)[C@H](NC(=O)[C@H](C)NC(=O)[C@@H](N)Cc1c[nH]cn1)C(=O)O"
191
+ ]
192
+
193
+ # Get predictions
194
+ scores = binding(peptides)
195
+ print(f"Binding affinity (-log Kd): {scores[0]:.3f}")
196
+ ```
197
+
198
+ **Output interpretation:**
199
+ - Higher values = stronger binding
200
+ - Scale: -log(Kd/Ki/IC50)
201
+ - 7.5+ = tight binding (≤ ~30nM)
202
+ - 6.0-7.5 = medium binding (~30nM - 1μM)
203
+ - <6.0 = weak binding (> 1μM)
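
The thresholds above can be related to concentrations with a short conversion. This sketch assumes the score is a base-10 negative logarithm of an affinity expressed in molar units, which is consistent with the nM and μM figures in the list but is not stated explicitly. The function names `affinity_to_nanomolar` and `label_binding` are hypothetical.

```python
def affinity_to_nanomolar(neg_log_affinity: float) -> float:
    """Convert a -log10(Kd/Ki/IC50) score (assumed molar) to a concentration in nM."""
    # 10**(-x) mol/L times 1e9 nM per mol/L, written as a single power of ten.
    return 10 ** (9.0 - neg_log_affinity)


def label_binding(neg_log_affinity: float) -> str:
    """Apply the tight / medium / weak cutoffs listed above."""
    if neg_log_affinity >= 7.5:
        return "tight"
    if neg_log_affinity >= 6.0:
        return "medium"
    return "weak"


# Made-up scores for illustration only
for score in (8.2, 6.9, 5.4):
    print(f"{score:.1f}: {label_binding(score)} (~{affinity_to_nanomolar(score):.1f} nM)")
```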
204
+
205
+ ---
206
+
207
+ ## Batch Processing 🌟
208
+
209
+ All predictors support batch processing for multiple peptides:
210
+
211
+ ```python
212
+ from functions.hemolysis.hemolysis import Hemolysis
213
+
214
+ hemo = Hemolysis()
215
+
216
+ # Multiple peptides
217
+ peptides = [
218
+ "NCC(=O)N[C@H](CS)C(=O)O",
219
+ "CC(C)C[C@H](NC(=O)[C@H](CC(C)C)NC(=O)O)C(=O)O",
220
+ "N[C@@H](CO)C(=O)N[C@@H](CC(C)C)C(=O)O"
221
+ ]
222
+
223
+ # Get predictions for all
224
+ scores = hemo(peptides)
225
+
226
+ for i, score in enumerate(scores):
227
+ print(f"Peptide {i+1}: {score:.3f}")
228
+ ```
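
For very long peptide lists it may be convenient to call a predictor on smaller slices and concatenate the results. The sketch below assumes only what the examples above show: a predictor is callable on a list of SMILES strings and returns one score per peptide. The helper `predict_in_chunks`, its `chunk_size` parameter, and the stand-in `fake_predictor` are hypothetical and exist purely for illustration.

```python
from typing import Callable, List, Sequence


def predict_in_chunks(
    predictor: Callable[[List[str]], Sequence[float]],
    peptides: Sequence[str],
    chunk_size: int = 64,
) -> List[float]:
    """Call `predictor` on successive slices of `peptides` and concatenate the scores."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    scores: List[float] = []
    for start in range(0, len(peptides), chunk_size):
        chunk = list(peptides[start:start + chunk_size])
        scores.extend(float(s) for s in predictor(chunk))
    return scores


# Stand-in predictor for demonstration only: scores each SMILES by its length.
def fake_predictor(smiles_batch: List[str]) -> List[float]:
    return [len(s) / 100 for s in smiles_batch]


demo = ["NCC(=O)O", "N[C@@H](CO)C(=O)O", "N[C@@H](C)C(=O)O"]
print(predict_in_chunks(fake_predictor, demo, chunk_size=2))
```

In real use, `fake_predictor` would be replaced by an initialized predictor such as `hemo`.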
229
+
230
+ ---
231
+
232
+ ## Unified Scoring with Multiple Predictors 🌟
233
+
234
+ For convenience, you can use `scoring_functions.py` to evaluate multiple properties at once and get a score vector for each peptide.
235
+
236
+ ### Basic Usage
237
+
238
+ ```python
239
+ import sys
240
+ sys.path.append('/path/to/PeptiVerse')
241
+ from scoring_functions import ScoringFunctions
242
+
243
+ # Initialize with desired scoring functions
244
+ # Available: 'binding_affinity1', 'binding_affinity2', 'permeability',
245
+ # 'solubility', 'hemolysis', 'nonfouling'
246
+ scoring = ScoringFunctions(
247
+ score_func_names=['solubility', 'hemolysis', 'nonfouling', 'permeability'],
248
+ prot_seqs=[] # Empty if not using binding affinity
249
+ )
250
+
251
+ # Input peptides in SMILES format
252
+ peptides = [
253
+ 'N2[C@H](CC(C)C)C(=O)N1[C@@H](CCC1)C(=O)N[C@@H](Cc1ccccc1)C2(=O)',
254
+ 'NCC(=O)N[C@H](CS)C(=O)N[C@@H](CO)C(=O)O'
255
+ ]
256
+
257
+ # Get scores (returns numpy array of shape: num_peptides x num_functions)
258
+ scores = scoring(input_seqs=peptides)
259
+ print(scores)
260
+ ```
261
+
262
+ ### Adding Binding Affinity
263
+
264
+ ```python
265
+ from scoring_functions import ScoringFunctions
266
+
267
+ # Target protein sequence (amino acid format)
268
+ tfr_protein = "MMDQARSAFSNLFGGEPLSYTRFSLARQVDGDNSHVEMKLAVDEEENADNNT..."
269
+
270
+ # Initialize with binding affinity for one protein
271
+ scoring = ScoringFunctions(
272
+ score_func_names=['binding_affinity1', 'solubility', 'hemolysis', 'permeability'],
273
+ prot_seqs=[tfr_protein] # Provide target protein sequence
274
+ )
275
+
276
+ peptides = ['N2[C@H](CC(C)C)C(=O)N1[C@@H](CCC1)C(=O)N[C@@H](Cc1ccccc1)C2(=O)']
277
+ scores = scoring(input_seqs=peptides)
278
+
279
+ # scores[0] will contain: [binding_affinity, solubility, hemolysis, permeability]
280
+ print("Scores for peptide 1:")
281
+ print(f" Binding Affinity: {scores[0][0]:.3f}")
282
+ print(f" Solubility: {scores[0][1]:.3f}")
283
+ print(f" Hemolysis: {scores[0][2]:.3f}")
284
+ print(f" Permeability: {scores[0][3]:.3f}")
285
+ ```
286
+
287
+ ### Multiple Binding Targets
288
+
289
+ ```python
290
+ # For dual binding affinity prediction
291
+ protein1 = "MMDQARSAFSNLFGGEPLSYTR..." # First target
292
+ protein2 = "MTKSNGEEPKMGGRMERFQQGV..." # Second target
293
+
294
+ scoring = ScoringFunctions(
295
+ score_func_names=['binding_affinity1', 'binding_affinity2', 'solubility', 'hemolysis'],
296
+ prot_seqs=[protein1, protein2] # Provide both protein sequences
297
+ )
298
+
299
+ peptides = ['N2[C@H](CC(C)C)C(=O)N1[C@@H](CCC1)C(=O)...']
300
+ scores = scoring(input_seqs=peptides)
301
+
302
+ # scores[0] will contain: [binding_aff1, binding_aff2, solubility, hemolysis]
303
+ ```
304
+
305
+ ### Output Format
306
+
307
+ The `ScoringFunctions` class returns a numpy array where:
308
+ - **Rows**: Each row corresponds to one input peptide
309
+ - **Columns**: Each column corresponds to one scoring function (in the order specified)
310
+
311
+ ```python
312
+ # Example with 3 peptides and 4 scoring functions
313
+ scores = scoring(input_seqs=peptides)
314
+ # Shape: (3, 4)
315
+ # scores[0] = [func1_score, func2_score, func3_score, func4_score] for peptide 1
316
+ # scores[1] = [func1_score, func2_score, func3_score, func4_score] for peptide 2
317
+ # scores[2] = [func1_score, func2_score, func3_score, func4_score] for peptide 3
318
+ ```
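
Since the columns follow the order of `score_func_names`, each row can be paired with those names to make the output easier to read. The helper `label_scores` below is a hypothetical convenience, not part of `ScoringFunctions`, and the numbers are made-up placeholders for the returned array. It iterates over rows, so it should work equally on a nested list or a 2-D numpy array.

```python
from typing import Dict, List, Sequence


def label_scores(
    scores: Sequence[Sequence[float]],
    score_func_names: Sequence[str],
) -> List[Dict[str, float]]:
    """Pair each column of a (num_peptides x num_functions) score matrix with its name."""
    labeled = []
    for row in scores:
        if len(row) != len(score_func_names):
            raise ValueError("each row needs one score per scoring function")
        labeled.append({name: float(value) for name, value in zip(score_func_names, row)})
    return labeled


names = ['solubility', 'hemolysis', 'nonfouling', 'permeability']
# Made-up numbers standing in for the array returned by ScoringFunctions
example_scores = [
    [0.71, 0.93, 0.40, -6.8],
    [0.55, 0.88, 0.62, -5.1],
]
for i, row in enumerate(label_scores(example_scores, names), start=1):
    print(f"Peptide {i}: {row}")
```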
319
+
320
+ ---
321
+
322
+ ## Complete Example 🌟
323
+
324
+ ```python
325
+ import sys
326
+ sys.path.append('/path/to/PeptiVerse')
327
+ from functions.hemolysis.hemolysis import Hemolysis
328
+ from functions.solubility.solubility import Solubility
329
+ from functions.permeability.permeability import Permeability
330
+
331
+ # Initialize predictors
332
+ hemo = Hemolysis()
333
+ sol = Solubility()
334
+ perm = Permeability()
335
+
336
+ # Test peptide
337
+ peptide = ["NCC(=O)N[C@H](CS)C(=O)N[C@@H](CO)C(=O)O"]
338
+
339
+ # Get all predictions
340
+ hemo_score = hemo(peptide)[0]
341
+ sol_score = sol(peptide)[0]
342
+ perm_score = perm(peptide)[0]
343
+
344
+ print("Peptide Property Predictions:")
345
+ print(f" Hemolysis (non-hemolytic prob): {hemo_score:.3f}")
346
+ print(f" Solubility: {sol_score:.3f}")
347
+ print(f" Permeability: {perm_score:.3f}")
348
+ ```
349
+
350
+ ---
351
+
352
+ ## Model Architecture 🌟
353
+
354
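+ The predictors share the following components:
355
+ - **Embeddings**: PeptideCLM-23M (RoFormer-based peptide language model)
356
+ - **Classifier**: XGBoost gradient boosting for all predictors except binding affinity, which uses a cross-attention transformer over ESM2 and PeptideCLM embeddings (see the Predictors table)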
357
+ - **Input**: SMILES representation of peptides
358
+ - **Training**: Models trained on curated datasets with cross-validation
359
+
360
+ ---
361
+ ## Citation
362
+
363
+ If you find this repository helpful for your publications, please consider citing our paper:
364
+
365
+ ```
366
+ @inproceedings{tang2025peptune,
367
+ title={{PepTune}: De novo generation of therapeutic peptides with multi-objective-guided discrete diffusion},
368
+ author={Tang, Sophia and Zhang, Yinuo and Chatterjee, Pranam},
369
+ booktitle={Proceedings of the 42nd International Conference on Machine Learning},
370
+ year={2025}
371
+ }
372
+ ```
373
  To use this repository, you agree to abide by the MIT License.