ChatterjeeLab/Classifier_Weight

yinuozhang committed 29 days ago

Commit 069410e (1 parent: 2216d16)

readme

README.md +358 -1

@@ -10,4 +10,361 @@ This repo contains important large files for [PeptiVerse](https://huggingface.co
 - `training_data` hosts all **raw data** used to train the classifiers
 - `functions` contains files to utilize the trained weights and classifiers
 - `train` contains the script to train classifiers on the pre-processed embeddings, either through xgboost or MLPs.
-- `scoring_function.py` contains a class that aggregates all trained classifiers for diverse downstream sampling applications
+- `scoring_function.py` contains a class that aggregates all trained classifiers for diverse downstream sampling applications
+# PeptiVerse 🧬🌌
+A collection of machine learning predictors for the properties of canonical and non-canonical peptides represented as SMILES strings. 🧬 PeptiVerse 🌌 evaluates key biophysical and therapeutic properties of peptides to support property-optimized generation.
+## Predictors 🧫
+PeptiVerse includes the following property predictors:
+| Predictor | Measurement | Interpretation | Training Data Source | Dataset Size | Model Type |
+|-----------|-------------|-----------------| --------------------|--------------|------------|
+| **Non-Hemolysis** | Probability of non-hemolytic behavior | 0-1 scale, higher = less hemolytic | PeptideBERT, PepLand | 6,077 peptides | XGBoost + PeptideCLM embeddings |
+| **Solubility** | Probability of aqueous solubility | 0-1 scale, higher = more soluble | PeptideBERT, PepLand | 18,454 peptides | XGBoost + PeptideCLM embeddings |
+| **Non-Fouling** | Probability of non-fouling properties | 0-1 scale, higher = lower probability of binding to off-targets | PeptideBERT, PepLand | 17,186 peptides | XGBoost + PeptideCLM embeddings |
+| **Permeability** | Cell membrane permeability (PAMPA lipophilicity score, log P scale, range −10 to 0) | Values ≥ −6.0 indicate strong permeability; values < −6.0 indicate weak permeability | ChEMBL (22,040), CycPeptMPDB (7,451) | 34,853 peptides | XGBoost + PeptideCLM embeddings + molecular descriptors |
+| **Binding Affinity** | Peptide-protein binding strength (−log Kd/Ki/IC50 scale) | Weak binding (< 6.0), medium binding (6.0 to 7.5), and high binding (≥ 7.5) | PepLand | 1,806 peptide-protein pairs | Cross-attention transformer (ESM2 + PeptideCLM) |
+## Model Performance 🌟
+#### Binary Classification Predictors
+| Predictor | Val AUC | Val F1 |
+|-----------|----------------|----------|
+| **Non-Hemolysis** | 0.7902 | 0.8260 |
+| **Solubility** | 0.6016 | 0.5767 |
+| **Non-Fouling** | TBD | TBD |
+#### Regression Predictors
+| Predictor | Train Correlation (Spearman) | Val Correlation (Spearman) |
+|-----------|------------------------------|----------------------------|
+| **Permeability** | 0.958 | 0.710 |
+| **Binding Affinity** | 0.805 | 0.611 |
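The regression predictors above are reported as Spearman correlations, that is, the Pearson correlation of the rank-transformed predictions and targets. The following is a minimal standard-library sketch of that metric for readers who want to check numbers on their own held-out data. The function names are illustrative and are not part of the PeptiVerse code base, and the README does not say which implementation produced the reported values.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current group of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have the same length (at least 2)")
    return pearson(average_ranks(x), average_ranks(y))


# Made-up numbers for illustration only, not model output.
predicted = [-5.1, -6.4, -7.0, -4.9]
measured = [-5.3, -6.0, -7.2, -5.0]
rho = spearman(predicted, measured)
print(f"Spearman correlation: {rho:.3f}")
```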
+## Setup 🌟
+1. Clone the repository:
+```bash
+git clone https://github.com/sophtang/PeptiVerse.git
+cd PeptiVerse
+```
+2. Install environment:
+```bash
+conda env create -f environment.yml
+conda activate peptiverse
+```
+3. Change the `base_path` in each file to ensure that all model weights and tokenizers are loaded correctly.
+## Usage 🌟
+#### 1. Hemolysis Prediction
+Predicts the probability that a peptide is **not hemolytic**. Higher scores indicate safer peptides.
+```python
+import sys
+sys.path.append('/path/to/PeptiVerse')
+from functions.hemolysis.hemolysis import Hemolysis
+# Initialize predictor
+hemo = Hemolysis()
+# Input peptide in SMILES format
+peptides = [
+    "NCC(=O)N[C@H](CS)C(=O)N[C@@H](CO)C(=O)NCC(=O)N[C@@H](CC1=CN=C-N1)C(=O)O"
+]
+# Get predictions
+scores = hemo(peptides)
+print(f"Non-hemolytic probability: {scores[0]:.3f}")
+```
+**Output interpretation:**
+- Score close to 1.0 = likely non-hemolytic (safe)
+- Score close to 0.0 = likely hemolytic (unsafe)
+---
+#### 2. Solubility Prediction
+Predicts aqueous solubility. Higher scores indicate better solubility.
+```python
+from functions.solubility.solubility import Solubility
+# Initialize predictor
+sol = Solubility()
+# Input peptide
+peptides = [
+    "NCC(=O)N[C@H](CS)C(=O)N[C@@H](CO)C(=O)NCC(=O)N[C@@H](CC1=CN=C-N1)C(=O)O"
+]
+# Get predictions
+scores = sol(peptides)
+print(f"Solubility probability: {scores[0]:.3f}")
+```
+**Output interpretation:**
+- Score close to 1.0 = highly soluble
+- Score close to 0.0 = poorly soluble
+---
+#### 3. Nonfouling Prediction
+Predicts non-fouling behavior, i.e. resistance to nonspecific binding. Higher scores indicate a lower probability of binding to off-targets.
+```python
+from functions.nonfouling.nonfouling import Nonfouling
+# Initialize predictor
+nf = Nonfouling()
+# Input peptide
+peptides = [
+    "NCC(=O)N[C@H](CS)C(=O)N[C@@H](CO)C(=O)NCC(=O)N[C@@H](CC1=CN=C-N1)C(=O)O"
+]
+# Get predictions
+scores = nf(peptides)
+print(f"Nonfouling score: {scores[0]:.3f}")
+```
+**Output interpretation:**
+- Higher scores = better non-fouling properties
+---
+#### 4. Permeability Prediction
+Predicts membrane permeability on a log P scale.
+```python
+from functions.permeability.permeability import Permeability
+# Initialize predictor
+perm = Permeability()
+# Input peptide
+peptides = [
+    "N[C@@H](CCCNC(=N)N)C(=O)N[C@@H](Cc1cNc2c1cc(O)cc2)C(=O)O"
+]
+# Get predictions
+scores = perm(peptides)
+print(f"Permeability (log P): {scores[0]:.3f}")
+```
+**Output interpretation:**
+- Higher values = more permeable
+- Typical range: -10 to 0 (log scale)
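As a small illustration of how the output might be interpreted, the sketch below applies the −6.0 cutoff quoted in the Predictors table to a list of permeability values. The helper name `permeability_label` and the example values are hypothetical. Only the cutoff itself comes from this README.

```python
PERMEABILITY_CUTOFF = -6.0  # cutoff quoted in the Predictors table


def permeability_label(score: float, cutoff: float = PERMEABILITY_CUTOFF) -> str:
    """Label a predicted permeability value as 'strong' or 'weak'."""
    return "strong" if score >= cutoff else "weak"


# Made-up values for illustration, not model output.
example_scores = [-4.8, -6.0, -7.3]
labels = [permeability_label(s) for s in example_scores]
print(labels)
```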
+---
+#### 5. Binding Affinity Prediction
+Predicts peptide-protein binding affinity. Requires both the peptide (as SMILES) and the target protein sequence (as amino acids).
+```python
+from functions.binding.binding import BindingAffinity
+# Target protein sequence (amino acid format)
+target_protein = "MTKSNGEEPKMGGRMERFQQGVRKRTLLAKKKVQNITKEDVKSYLFRNAFVLL..."
+# Initialize predictor with target protein
+binding = BindingAffinity(prot_seq=target_protein)
+# Input peptide in SMILES format
+peptides = [
+    "CC[C@H](C)[C@H](NC(=O)[C@H](C)NC(=O)[C@@H](N)Cc1c[nH]cn1)C(=O)O"
+]
+# Get predictions
+scores = binding(peptides)
+print(f"Binding affinity (-log Kd): {scores[0]:.3f}")
+```
+**Output interpretation:**
+- Higher values = stronger binding
+- Scale: -log(Kd/Ki/IC50)
+  - 7.5+ = tight binding (≤ ~30nM)
+  - 6.0-7.5 = medium binding (~30nM - 1μM)
+  - <6.0 = weak binding (> 1μM)
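The thresholds above can be turned into a small helper. This is a sketch under two assumptions that the README implies through its nM/μM annotations but does not state outright: the logarithm is base 10, and the underlying Kd/Ki/IC50 is expressed in molar units. The function names are illustrative only.

```python
def binding_category(score: float) -> str:
    """Map a -log(Kd/Ki/IC50) score to the bins listed in this README."""
    if score >= 7.5:
        return "tight"
    if score >= 6.0:
        return "medium"
    return "weak"


def score_to_nanomolar(score: float) -> float:
    """Convert a -log10(molar) score to a concentration in nM (assumes base 10)."""
    return 10 ** (-score) * 1e9


# Made-up scores for illustration, not model output.
for s in (5.2, 6.0, 7.5):
    print(f"{s:.1f} -> {binding_category(s)} (~{score_to_nanomolar(s):.1f} nM)")
```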
+---
+## Batch Processing 🌟
+All predictors support batch processing for multiple peptides:
+```python
+from functions.hemolysis.hemolysis import Hemolysis
+hemo = Hemolysis()
+# Multiple peptides
+peptides = [
+    "NCC(=O)N[C@H](CS)C(=O)O",
+    "CC(C)C[C@H](NC(=O)[C@H](CC(C)C)NC(=O)O)C(=O)O",
+    "N[C@@H](CO)C(=O)N[C@@H](CC(C)C)C(=O)O"
+]
+# Get predictions for all
+scores = hemo(peptides)
+for i, score in enumerate(scores):
+    print(f"Peptide {i+1}: {score:.3f}")
+```
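If a list is too long to score in one call (an assumption about memory limits, not something this README states), one option is to feed the predictor fixed-size chunks. The sketch below only relies on the predictor being callable on a list of SMILES strings, as in the examples above. `predict_in_chunks` and the stand-in `fake_predictor` are hypothetical and exist only so the snippet runs without model weights.

```python
from typing import Callable, Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def predict_in_chunks(
    predictor: Callable[[List[str]], Sequence[float]],
    peptides: Sequence[str],
    chunk_size: int = 64,
) -> List[float]:
    """Score peptides chunk by chunk and return one flat list of floats."""
    scores: List[float] = []
    for chunk in chunked(peptides, chunk_size):
        scores.extend(float(s) for s in predictor(chunk))
    return scores


def fake_predictor(smiles_list: List[str]) -> List[float]:
    """Stand-in for a real predictor such as Hemolysis(); returns dummy values."""
    return [len(s) / 100 for s in smiles_list]


peptides = [
    "NCC(=O)N[C@H](CS)C(=O)O",
    "CC(C)C[C@H](NC(=O)[C@H](CC(C)C)NC(=O)O)C(=O)O",
    "N[C@@H](CO)C(=O)N[C@@H](CC(C)C)C(=O)O",
]
scores = predict_in_chunks(fake_predictor, peptides, chunk_size=2)
print(scores)
```

With a real predictor, `fake_predictor` would be replaced by an instance such as `hemo` from the example above.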
+---
+## Unified Scoring with Multiple Predictors 🌟
+For convenience, you can use `scoring_functions.py` to evaluate multiple properties at once and get a score vector for each peptide.
+### Basic Usage
+```python
+import sys
+sys.path.append('/path/to/PeptiVerse')
+from scoring_functions import ScoringFunctions
+# Initialize with desired scoring functions
+# Available: 'binding_affinity1', 'binding_affinity2', 'permeability',
+#            'solubility', 'hemolysis', 'nonfouling'
+scoring = ScoringFunctions(
+    score_func_names=['solubility', 'hemolysis', 'nonfouling', 'permeability'],
+    prot_seqs=[]  # Empty if not using binding affinity
+)
+# Input peptides in SMILES format
+peptides = [
+    'N2[C@H](CC(C)C)C(=O)N1[C@@H](CCC1)C(=O)N[C@@H](Cc1ccccc1)C2(=O)',
+    'NCC(=O)N[C@H](CS)C(=O)N[C@@H](CO)C(=O)O'
+]
+# Get scores (returns numpy array of shape: num_peptides x num_functions)
+scores = scoring(input_seqs=peptides)
+print(scores)
+```
+### Adding Binding Affinity
+```python
+from scoring_functions import ScoringFunctions
+# Target protein sequence (amino acid format)
+tfr_protein = "MMDQARSAFSNLFGGEPLSYTRFSLARQVDGDNSHVEMKLAVDEEENADNNT..."
+# Initialize with binding affinity for one protein
+scoring = ScoringFunctions(
+    score_func_names=['binding_affinity1', 'solubility', 'hemolysis', 'permeability'],
+    prot_seqs=[tfr_protein]  # Provide target protein sequence
+)
+peptides = ['N2[C@H](CC(C)C)C(=O)N1[C@@H](CCC1)C(=O)N[C@@H](Cc1ccccc1)C2(=O)']
+scores = scoring(input_seqs=peptides)
+# scores[0] will contain: [binding_affinity, solubility, hemolysis, permeability]
+print("Scores for peptide 1:")
+print(f"  Binding Affinity: {scores[0][0]:.3f}")
+print(f"  Solubility: {scores[0][1]:.3f}")
+print(f"  Hemolysis (non-hemolytic prob): {scores[0][2]:.3f}")
+print(f"  Permeability: {scores[0][3]:.3f}")
+```
+### Multiple Binding Targets
+```python
+from scoring_functions import ScoringFunctions
+# For dual binding affinity prediction
+protein1 = "MMDQARSAFSNLFGGEPLSYTR..."  # First target
+protein2 = "MTKSNGEEPKMGGRMERFQQGV..."  # Second target
+scoring = ScoringFunctions(
+    score_func_names=['binding_affinity1', 'binding_affinity2', 'solubility', 'hemolysis'],
+    prot_seqs=[protein1, protein2]  # Provide both protein sequences
+)
+peptides = ['N2[C@H](CC(C)C)C(=O)N1[C@@H](CCC1)C(=O)...']
+scores = scoring(input_seqs=peptides)
+# scores[0] will contain: [binding_aff1, binding_aff2, solubility, hemolysis]
+```
+### Output Format
+Calling a `ScoringFunctions` instance returns a NumPy array of shape `(num_peptides, num_functions)` where:
+- **Rows**: Each row corresponds to one input peptide
+- **Columns**: Each column corresponds to one scoring function, in the order given in `score_func_names`
+```python
+# Example with 3 peptides and 4 scoring functions
+scores = scoring(input_seqs=peptides)
+# Shape: (3, 4)
+# scores[0] = [func1_score, func2_score, func3_score, func4_score] for peptide 1
+# scores[1] = [func1_score, func2_score, func3_score, func4_score] for peptide 2
+# scores[2] = [func1_score, func2_score, func3_score, func4_score] for peptide 3
+```
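Because columns follow the order of `score_func_names`, it can be convenient to pair each column with its name instead of indexing by position. The helper below is a hypothetical sketch, not part of `ScoringFunctions`. It works on any row-iterable of numbers, so a nested list stands in for the NumPy array here, and the values are made up.

```python
from typing import Dict, List, Sequence


def scores_by_name(
    score_func_names: Sequence[str],
    scores: Sequence[Sequence[float]],
) -> List[Dict[str, float]]:
    """Turn a (num_peptides x num_functions) score matrix into one dict per peptide."""
    records: List[Dict[str, float]] = []
    for row in scores:
        if len(row) != len(score_func_names):
            raise ValueError("each row needs one value per scoring function")
        records.append({name: float(v) for name, v in zip(score_func_names, row)})
    return records


names = ["solubility", "hemolysis", "nonfouling", "permeability"]
# Made-up values standing in for the array returned by scoring(input_seqs=...).
fake_scores = [
    [0.81, 0.92, 0.40, -5.7],
    [0.33, 0.75, 0.62, -7.1],
]
records = scores_by_name(names, fake_scores)
print(records[0]["permeability"])
```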
+---
+## Complete Example 🌟
+```python
+import sys
+sys.path.append('/path/to/PeptiVerse')
+from functions.hemolysis.hemolysis import Hemolysis
+from functions.solubility.solubility import Solubility
+from functions.permeability.permeability import Permeability
+# Initialize predictors
+hemo = Hemolysis()
+sol = Solubility()
+perm = Permeability()
+# Test peptide
+peptide = ["NCC(=O)N[C@H](CS)C(=O)N[C@@H](CO)C(=O)O"]
+# Get all predictions
+hemo_score = hemo(peptide)[0]
+sol_score = sol(peptide)[0]
+perm_score = perm(peptide)[0]
+print("Peptide Property Predictions:")
+print(f"  Hemolysis (non-hemolytic prob): {hemo_score:.3f}")
+print(f"  Solubility: {sol_score:.3f}")
+print(f"  Permeability: {perm_score:.3f}")
+```
+---
+## Model Architecture 🌟
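+The predictors share the following components, with the exceptions noted:
+- **Embeddings**: PeptideCLM-23M (RoFormer-based peptide language model); the binding affinity predictor additionally uses ESM2 embeddings of the target protein
+- **Model**: XGBoost gradient boosting for hemolysis, solubility, non-fouling, and permeability (permeability also uses molecular descriptors); a cross-attention transformer for binding affinity
+- **Input**: SMILES representation of peptides
+- **Training**: Models trained on curated datasets with cross-validation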
+---
+## Citation
+If you find this repository helpful for your publications, please consider citing our paper:
+```
+@inproceedings{tang2025peptune,
+  title={{PepTune}: De novo generation of therapeutic peptides with multi-objective-guided discrete diffusion},
+  author={Tang, Sophia and Zhang, Yinuo and Chatterjee, Pranam},
+  booktitle={Proceedings of the 42nd International Conference on Machine Learning},
+  year={2025}
+}
+```
+To use this repository, you agree to abide by the MIT License.