### Input Requirements and Constraints
> Supported Inputs
- Amino acid sequences: Linear peptides composed of the 20 standard amino acids
- SMILES: Chemically modified peptides, including cyclization, D-amino acids, and noncanonical residues
> Validation
- Invalid sequences or SMILES strings are rejected
- Properties that are not supported for a given input type are labeled (Not Supported)
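
The amino acid check described above could look roughly like the sketch below. It is an illustration only: the function name `is_valid_aa_sequence` is hypothetical, and trimming whitespace and uppercasing the input are assumptions rather than documented behavior of the app. SMILES validation is not shown, since it would normally rely on a cheminformatics toolkit.

```python
# The 20 standard amino acids in one-letter code.
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def is_valid_aa_sequence(sequence: str) -> bool:
    """Return True if the sequence is non-empty and uses only standard residues.

    Assumption: surrounding whitespace is ignored and lowercase letters are accepted.
    """
    cleaned = sequence.strip().upper()
    return len(cleaned) > 0 and all(residue in STANDARD_AMINO_ACIDS for residue in cleaned)


if __name__ == "__main__":
    for candidate in ["ACDKLLR", "acdk", "ACBX", ""]:
        print(repr(candidate), is_valid_aa_sequence(candidate))
```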
                
### Training Data Collection

**Data distribution.** Classification tasks report counts for class 0/1; regression tasks report total sample size (N). AA stands for amino acid-based sequences.

#### Classification (counts for class 0 / 1)

| Property | AA (0) | AA (1) | SMILES (0) | SMILES (1) |
|---|---:|---:|---:|---:|
| Hemolysis | 4765 | 1311 | 4765 | 1311 |
| Non-Fouling | 13580 | 3600 | 13580 | 3600 |
| Solubility | 9668 | 8785 | – | – |
| Permeability (Penetrance) | 1162 | 1162 | – | – |
| Toxicity | – | – | 5518 | 5518 |

#### Regression (total N)

| Property | AA (N) | SMILES (N) |
|---|---:|---:|
| Permeability (PAMPA) | – | 6869 |
| Permeability (CACO2) | – | 606 |
| Half-Life | 130 | 245 |
| Binding Affinity | 1436 | 1597 |


Our models are trained on curated datasets from multiple sources. For detailed data-cleaning procedures, please refer to our [paper](https://www.biorxiv.org/content/10.64898/2025.12.31.697180v1).

#### 🩸 Hemolysis Dataset
- **Primary Source:** [the Database of Antimicrobial Activity and Structure of Peptides (DBAASPv3)](https://academic.oup.com/nar/article-abstract/49/D1/D288/5957160)
- **Secondary Source:** [peptide-dashboard](https://pubs.acs.org/doi/full/10.1021/acs.jcim.2c01317)
- **Description:** Probability that a peptide disrupts red blood cell membranes.
- **Interpretation:** HC50 is the concentration (x µg/mL) at which 50% of red blood cells are lysed. Peptides with HC50 < 100 µM are labeled hemolytic (1) and all others non-hemolytic (0), giving a binary dataset. Scores close to 1 indicate a high probability of red blood cell membrane disruption, while scores close to 0 indicate low hemolytic risk. The predicted probability should therefore be interpreted as a risk indicator, not an exact concentration estimate.
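
As a minimal sketch of the labeling rule described above, the function below turns an HC50 measurement into a binary label. The name `hemolysis_label` is hypothetical, and the sketch assumes the HC50 value is already expressed in the same unit as the threshold of 100.

```python
def hemolysis_label(hc50: float, threshold: float = 100.0) -> int:
    """Return 1 (hemolytic) if HC50 is below the threshold, else 0 (non-hemolytic).

    Assumption: hc50 and threshold are given in the same concentration unit.
    """
    if hc50 < 0:
        raise ValueError("HC50 must be non-negative")
    return 1 if hc50 < threshold else 0


if __name__ == "__main__":
    for value in [12.5, 100.0, 350.0]:
        print(value, hemolysis_label(value))
```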

#### 💧 Solubility Dataset
- **Primary Source:** [PROSO-II](https://febs.onlinelibrary.wiley.com/doi/abs/10.1111/j.1742-4658.2012.08603.x)
- **Secondary Source:** [peptideBERT](https://pubs.acs.org/doi/abs/10.1021/acs.jpclett.3c02398)
- **Description:** Probability of peptide remaining dissolved in aqueous conditions.
- **Interpretation:** Outputs a probability (0–1) that a peptide remains soluble in aqueous conditions. Higher scores indicate lower aggregation risk and better formulation stability.


#### 👯 Non-Fouling Dataset
- **Primary Source:** [Classifying antimicrobial and multifunctional peptides with Bayesian network models](https://doi.org/10.1002/pep2.24079)
- **Secondary Source:** [peptideBERT](https://pubs.acs.org/doi/abs/10.1021/acs.jpclett.3c02398)
- **Description:** A nonfouling peptide resists nonspecific interactions and protein adsorption.
- **Interpretation:** Outputs the probability (0–1) that a peptide resists nonspecific protein adsorption.
Higher scores indicate stronger non-fouling behavior, desirable for circulation and surface-exposed applications.


#### 🪣 Permeability Dataset
- **Primary Source:** [CycPeptMPDB](https://pubs.acs.org/doi/abs/10.1021/acs.jcim.2c01573), [PAMPA](https://doi.org/10.1517/17425255.1.2.325)
- **Secondary Source:** [PepLand](https://arxiv.org/abs/2311.04419)
- **Description:** Probability that a peptide penetrates the cell membrane.
- **Interpretation:** For PAMPA and Caco-2 regression, outputs are log-scaled permeability values. Following CycPeptMPDB conventions, log Pexp ≥ −6.0 indicates favorable permeability, while values below −6.0 indicate weak permeability. For penetrance prediction, a probability closer to 1 indicates a higher likelihood of cell penetration, and *vice versa*.
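
The −6.0 cutoff above can be applied to a regression output as in this small sketch. The function name `classify_permeability` and the returned strings are illustrative choices, not part of the app.

```python
def classify_permeability(log_pexp: float, threshold: float = -6.0) -> str:
    """Map a predicted log Pexp value to a coarse permeability category.

    Values at or above the threshold count as favorable, values below it as weak.
    """
    return "favorable" if log_pexp >= threshold else "weak"


if __name__ == "__main__":
    for prediction in [-5.2, -6.0, -7.4]:
        print(prediction, classify_permeability(prediction))
```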


#### ⏱️ Half-Life Dataset
- **Primary Source:** [Thpdb2](https://doi.org/10.1016/j.drudis.2024.104047), [PepTherDia](https://doi.org/10.1016/j.drudis.2021.02.019), [peplife](https://www.nature.com/articles/srep36617)
- **Interpretation:** Predicted values are expressed in hours and reflect relative peptide stability. Higher scores indicate longer persistence in serum, while lower scores suggest faster degradation.

#### ☠️ Toxicity Dataset
- **Primary Source:** [ToxinPred3.0](https://www.sciencedirect.com/science/article/pii/S0010482524010114)
- **Interpretation:** Outputs a probability (0–1) that a peptide exhibits toxic effects. Higher scores indicate increased toxicity risk.


#### 🔗 Binding Affinity Dataset
- **Primary Source:** [PepLand](https://arxiv.org/abs/2311.04419)
- **Description:** Binding values combine Kd, Ki, and IC50 measurements and are already normalized in PepLand. The model predicts a continuous binding affinity score, where higher values indicate stronger binding. Scores are comparable across peptides binding to the same protein target.
- **Interpretation:**<br>
    - Scores ≥ 9 correspond to tight binders (K ≤ 10⁻⁹ M, nanomolar to picomolar range)<br>
    - Scores between 7 and 9 correspond to medium binders (10⁻⁷–10⁻⁹ M, nanomolar to micromolar range)<br>
    - Scores < 7 correspond to weak binders (K ≥ 10⁻⁶ M, micromolar and weaker)<br>
    - A difference of 1 unit in score corresponds to an approximately tenfold change in binding affinity.<br>
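
The score ranges above can be read off with a small helper like the one below. This is a sketch, not the app's code: the function names are hypothetical, and the conversion K ≈ 10^(−score) M is an inference from the stated ranges (score 9 at 10⁻⁹ M, one unit per tenfold change), not a formula given in the source.

```python
def binding_category(score: float) -> str:
    """Bucket a predicted affinity score using the ranges listed above."""
    if score >= 9:
        return "tight"
    if score >= 7:
        return "medium"
    return "weak"


def approximate_k_molar(score: float) -> float:
    """Assumed conversion: treat the score as -log10(K), with K in mol/L."""
    return 10.0 ** (-score)


if __name__ == "__main__":
    for score in [9.4, 8.0, 6.2]:
        print(score, binding_category(score), f"{approximate_k_molar(score):.2e} M")
```

Under this assumed conversion, raising the score by 1 divides the approximate K by ten, which matches the tenfold rule stated in the list.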

### Model Architecture

- **Sequence Embeddings:** [ESM-2 650M model](https://huggingface.co/facebook/esm2_t33_650M_UR50D) / [PeptideCLM model](https://huggingface.co/aaronfeller/PeptideCLM-23M-all). Foundational embeddings are frozen.
- **XGBoost Model:** Gradient boosting on pooled embedding features for efficient, high-performance prediction.
- **CNN/Transformer Model:** One-dimensional convolutional/self-attention transformer networks operating on unpooled embeddings to capture local sequence patterns.
- **Binding Model:** Transformer-based architecture with cross-attention between protein and peptide representations.
- **SVR Model:** Support Vector Regression applied to pooled embeddings, providing a kernel-based, nonparametric regression baseline that is robust on smaller or noisy datasets.
- **Others:** SVM and Elastic Nets were trained with [RAPIDS cuML](https://github.com/rapidsai/cuml), which requires a CUDA environment and is therefore not supported in the web app. Model checkpoints remain available in the Hugging Face repository.
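
To illustrate the difference between pooled and unpooled embeddings mentioned above, the sketch below collapses per-token vectors into one fixed-length vector. Mean pooling over unmasked tokens is an assumption here; the source does not state which pooling the XGBoost and SVR models use, and `mean_pool` is a hypothetical helper written in plain Python for clarity.

```python
from typing import List, Sequence


def mean_pool(token_embeddings: Sequence[Sequence[float]],
              attention_mask: Sequence[int]) -> List[float]:
    """Average the embedding vectors of unmasked tokens into a single vector."""
    kept = [vec for vec, keep in zip(token_embeddings, attention_mask) if keep]
    if not kept:
        raise ValueError("no unmasked tokens to pool")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


if __name__ == "__main__":
    # Three tokens with 2-dimensional embeddings; the last token is padding.
    embeddings = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
    mask = [1, 1, 0]
    print(mean_pool(embeddings, mask))
```

A pooled vector like this has the same length for every peptide, which is what tabular learners need, whereas the CNN/Transformer models consume the full per-token matrix.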

### Model Training and Weight Hosting
- More instructions can be found at [PeptiVerse](https://huggingface.co/ChatterjeeLab/PeptiVerse)

### 🧪 Physicochemical Properties
                
#### Net Charge Calculation
- Uses Henderson-Hasselbalch equation
- pH-dependent calculation
- Considers all ionizable groups (K, R, H, D, E, C, Y, termini)
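
A Henderson-Hasselbalch net charge calculation over these groups can be sketched as follows. The pKa values are illustrative assumptions; the source does not list the values the app uses, so results may differ slightly from the web app.

```python
# Illustrative pKa values (assumptions, not taken from the source).
POSITIVE_PKA = {"K": 10.8, "R": 12.5, "H": 6.5}
NEGATIVE_PKA = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}
N_TERM_PKA = 8.6
C_TERM_PKA = 3.6


def net_charge(sequence: str, ph: float = 7.0) -> float:
    """Estimate the net charge of a linear peptide at a given pH."""
    # Basic groups carry +1 when protonated: fraction = 1 / (1 + 10^(pH - pKa)).
    charge = 1.0 / (1.0 + 10 ** (ph - N_TERM_PKA))
    # Acidic groups carry -1 when deprotonated: fraction = 1 / (1 + 10^(pKa - pH)).
    charge -= 1.0 / (1.0 + 10 ** (C_TERM_PKA - ph))
    for residue in sequence.upper():
        if residue in POSITIVE_PKA:
            charge += 1.0 / (1.0 + 10 ** (ph - POSITIVE_PKA[residue]))
        elif residue in NEGATIVE_PKA:
            charge -= 1.0 / (1.0 + 10 ** (NEGATIVE_PKA[residue] - ph))
    return charge


if __name__ == "__main__":
    for peptide in ["KKKK", "DDDD", "ACDKR"]:
        print(peptide, round(net_charge(peptide, ph=7.0), 2))
```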

#### Isoelectric Point (pI)
- Bisection method to find pH where net charge = 0
- Precision: ±0.01 pH units
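
The bisection search can be sketched as below. It relies on the net charge decreasing monotonically with pH, so the sign of the charge at the midpoint tells which half of the interval contains the pI. The pKa values, the search bounds of pH 0 to 14, and the function names are assumptions for illustration.

```python
# Illustrative pKa values (assumptions, not taken from the source).
POSITIVE_PKA = {"K": 10.8, "R": 12.5, "H": 6.5}
NEGATIVE_PKA = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}
N_TERM_PKA, C_TERM_PKA = 8.6, 3.6


def net_charge(sequence: str, ph: float) -> float:
    """Henderson-Hasselbalch net charge of a linear peptide at a given pH."""
    charge = 1.0 / (1.0 + 10 ** (ph - N_TERM_PKA))
    charge -= 1.0 / (1.0 + 10 ** (C_TERM_PKA - ph))
    for residue in sequence.upper():
        if residue in POSITIVE_PKA:
            charge += 1.0 / (1.0 + 10 ** (ph - POSITIVE_PKA[residue]))
        elif residue in NEGATIVE_PKA:
            charge -= 1.0 / (1.0 + 10 ** (NEGATIVE_PKA[residue] - ph))
    return charge


def isoelectric_point(sequence: str, tolerance: float = 0.01) -> float:
    """Find the pH where the net charge is zero by bisection on [0, 14]."""
    low, high = 0.0, 14.0
    while high - low > tolerance:
        mid = (low + high) / 2.0
        if net_charge(sequence, mid) > 0:
            low = mid   # still positively charged, so the pI lies at higher pH
        else:
            high = mid  # zero or negative charge, so the pI lies at lower pH
    return (low + high) / 2.0


if __name__ == "__main__":
    for peptide in ["KKKK", "DDDD", "ACDKR"]:
        print(peptide, round(isoelectric_point(peptide), 2))
```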

#### Hydrophobicity (GRAVY)
- Grand Average of Hydropathy
- Uses Kyte-Doolittle scale
- Range: -4.5 (hydrophilic) to +4.5 (hydrophobic)
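
GRAVY is the mean Kyte-Doolittle hydropathy value over all residues, which can be computed as in this sketch. The function name `gravy` is illustrative, and rejecting nonstandard residues with an error is an assumption about how such input should be handled.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence: str) -> float:
    """Return the grand average of hydropathy for a peptide sequence."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    unknown = set(seq) - set(KYTE_DOOLITTLE)
    if unknown:
        raise ValueError(f"unsupported residues: {sorted(unknown)}")
    return sum(KYTE_DOOLITTLE[residue] for residue in seq) / len(seq)


if __name__ == "__main__":
    for peptide in ["IIII", "RRRR", "ACDKR"]:
        print(peptide, round(gravy(peptide), 3))
```

The extremes of the stated range correspond to all-isoleucine (+4.5) and all-arginine (−4.5) sequences.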

### Citation

If you use this tool, please cite:
```
@article {Zhang2025.12.31.697180,
	author = {Zhang, Yinuo and Tang, Sophia and Chen, Tong and Mahood, Elizabeth and Vincoff, Sophia and Chatterjee, Pranam},
	title = {PeptiVerse: A Unified Platform for Therapeutic Peptide Property Prediction},
	elocation-id = {2025.12.31.697180},
	year = {2026},
	doi = {10.64898/2025.12.31.697180},
	publisher = {Cold Spring Harbor Laboratory},
	URL = {https://www.biorxiv.org/content/early/2026/01/03/2025.12.31.697180},
	eprint = {https://www.biorxiv.org/content/early/2026/01/03/2025.12.31.697180.full.pdf},
	journal = {bioRxiv}
}
```

### Contact

For questions or collaborations: [yzhang@u.duke.nus.edu](mailto:yzhang@u.duke.nus.edu) or [pranam@seas.upenn.edu](mailto:pranam@seas.upenn.edu)