### Input Requirements and Constraints
> Supported Inputs
- Amino acid sequences: Linear peptides composed of the 20 standard amino acids
- SMILES: Chemically modified peptides, including cyclization, D-amino acids, and noncanonical residues
> Validation
- Invalid sequences or SMILES will be rejected
- Properties not supported are labeled as (Not Supported)
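As a minimal sketch of the amino acid check described above, the snippet below accepts only the 20 standard one-letter codes. The function name `is_valid_aa_sequence` and the choice to ignore case and surrounding whitespace are assumptions for illustration, not the app's actual validator. SMILES validation would need a cheminformatics toolkit and is not shown.

```python
# The 20 standard amino acids, as one-letter codes.
STANDARD_AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def is_valid_aa_sequence(sequence: str) -> bool:
    """Return True if the input is a non-empty string of standard residues.

    Hypothetical helper: case-insensitivity and whitespace stripping are
    assumptions, not documented behavior of the tool.
    """
    cleaned = sequence.strip().upper()
    return bool(cleaned) and all(res in STANDARD_AMINO_ACIDS for res in cleaned)


if __name__ == "__main__":
    for candidate in ["ACDEFGHIK", "ACDX", ""]:
        print(repr(candidate), is_valid_aa_sequence(candidate))
```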
### Training Data Collection
**Data distribution.** Classification tasks report counts for class 0/1; regression tasks report total sample size (N). AA stands for amino acid-based sequences.
#### Classification (counts for class 0 / 1)
| Property | AA (0) | AA (1) | SMILES (0) | SMILES (1) |
|---|---:|---:|---:|---:|
| Hemolysis | 4765 | 1311 | 4765 | 1311 |
| Non-Fouling | 13580 | 3600 | 13580 | 3600 |
| Solubility | 9668 | 8785 | – | – |
| Permeability (Penetrance) | 1162 | 1162 | – | – |
| Toxicity | – | – | 5518 | 5518 |
#### Regression (total N)
| Property | AA (N) | SMILES (N) |
|---|---:|---:|
| Permeability (PAMPA) | – | 6869 |
| Permeability (CACO2) | – | 606 |
| Half-Life | 130 | 245 |
| Binding Affinity | 1436 | 1597 |
Our models are trained on curated datasets from multiple sources. For detailed data-cleaning procedures, please refer to our [paper](https://www.biorxiv.org/content/10.64898/2025.12.31.697180v1).
#### 🩸 Hemolysis Dataset
- **Primary Source:** [the Database of Antimicrobial Activity and Structure of Peptides (DBAASPv3)](https://academic.oup.com/nar/article-abstract/49/D1/D288/5957160)
- **Secondary Source:** [peptide-dashboard](https://pubs.acs.org/doi/full/10.1021/acs.jcim.2c01317)
- **Description:** Probability of peptide disrupting red blood cell membranes.
- **Interpretation:** HC50 is the concentration (x µg/mL) at which 50% of red blood cells are lysed. Peptides with HC50 < 100 µM are labeled hemolytic (1) and all others non-hemolytic (0), giving a binary dataset. Scores close to 1 indicate a high probability of red blood cell membrane disruption, while scores close to 0 indicate low hemolytic risk. The predicted probability should therefore be interpreted as a risk indicator, not an exact concentration estimate.
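The labeling rule above can be sketched as follows. This is an illustration under stated assumptions, not the dataset's actual preprocessing: the helper names are hypothetical, and the µg/mL to µM conversion assumes the peptide's molecular weight is known.

```python
def ug_per_ml_to_um(conc_ug_per_ml: float, mol_weight_g_per_mol: float) -> float:
    """Convert a mass concentration (ug/mL) to a molar concentration (uM).

    1 ug/mL equals 1 mg/L, and mg/L divided by g/mol gives mmol/L,
    so multiplying by 1000 yields umol/L.
    """
    return conc_ug_per_ml / mol_weight_g_per_mol * 1000.0


def hemolysis_label(hc50_um: float, threshold_um: float = 100.0) -> int:
    """Return 1 (hemolytic) if HC50 is below the threshold, else 0."""
    return 1 if hc50_um < threshold_um else 0


if __name__ == "__main__":
    # Hypothetical peptide: HC50 of 50 ug/mL and a molecular weight of 2000 g/mol.
    hc50_um = ug_per_ml_to_um(50.0, 2000.0)
    print(hc50_um, hemolysis_label(hc50_um))
```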
#### 💧 Solubility Dataset
- **Primary Source:** [PROSO-II](https://febs.onlinelibrary.wiley.com/doi/abs/10.1111/j.1742-4658.2012.08603.x)
- **Secondary Source:** [peptideBERT](https://pubs.acs.org/doi/abs/10.1021/acs.jpclett.3c02398)
- **Description:** Probability of peptide remaining dissolved in aqueous conditions.
- **Interpretation:** Outputs a probability (0–1) that a peptide remains soluble in aqueous conditions. Higher scores indicate lower aggregation risk and better formulation stability.
#### 🎯 Non-Fouling Dataset
- **Primary Source:** [Classifying antimicrobial and multifunctional peptides with Bayesian network models](https://doi.org/10.1002/pep2.24079)
- **Secondary Source:** [peptideBERT](https://pubs.acs.org/doi/abs/10.1021/acs.jpclett.3c02398)
- **Description:** A nonfouling peptide resists nonspecific interactions and protein adsorption.
- **Interpretation:** Outputs the probability (0–1) that a peptide resists nonspecific protein adsorption. Higher scores indicate stronger non-fouling behavior, desirable for circulation and surface-exposed applications.
#### 🪣 Permeability Dataset
- **Primary Source:** [CycPeptMPDB](https://pubs.acs.org/doi/abs/10.1021/acs.jcim.2c01573), [PAMPA](https://doi.org/10.1517/17425255.1.2.325)
- **Secondary Source:** [PepLand](https://arxiv.org/abs/2311.04419)
- **Description:** Probability of peptide penetrating the cell membrane.
- **Interpretation:** For PAMPA and CACO-2 regression, outputs are log-scaled permeability values. Following CycPeptMPDB conventions, log Pexp ≥ −6.0 indicates favorable permeability, while values below −6.0 indicate weak permeability. For penetrance prediction, a probability closer to 1 indicates a higher likelihood of cell penetrance, and *vice versa*.
#### ⏱️ Half-Life Dataset
- **Primary Source:** [Thpdb2](https://doi.org/10.1016/j.drudis.2024.104047), [PepTherDia](https://doi.org/10.1016/j.drudis.2021.02.019), [peplife](https://www.nature.com/articles/srep36617)
- **Interpretation:** Predicted values are half-lives in hours and reflect relative peptide stability. Higher scores indicate longer persistence in serum, while lower scores suggest faster degradation.
#### Toxicity Dataset
- **Primary Source:** [ToxinPred3.0](https://www.sciencedirect.com/science/article/pii/S0010482524010114)
- **Interpretation:** Outputs a probability (0–1) that a peptide exhibits toxic effects. Higher scores indicate increased toxicity risk.
#### Binding Affinity Dataset
- **Primary Source:** [PepLand](https://arxiv.org/abs/2311.04419)
- **Description:** Binding labels are already normalized in PepLand and combine Kd, Ki, and IC50 measurements. The model predicts a continuous binding affinity score, where higher values indicate stronger binding. Scores are comparable across peptides binding to the same protein target.
- **Interpretation:**<br>
- Scores ≥ 9 correspond to tight binders (K ≤ 10⁻⁹ M, nanomolar to picomolar range)<br>
- Scores between 7 and 9 correspond to medium binders (10⁻⁷–10⁻⁹ M, nanomolar to micromolar range)<br>
- Scores < 7 correspond to weak binders (K ≥ 10⁻⁶ M, micromolar and weaker)<br>
- A difference of 1 unit in score corresponds to an approximately tenfold change in binding affinity.<br>
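The score bands above are consistent with reading the score as a negative base-10 logarithm of the binding constant in molar units. The sketch below makes that reading explicit; the relation score = −log10(K) is inferred from the thresholds rather than stated by the source, and the function names are hypothetical.

```python
def score_to_molar(score: float) -> float:
    """Convert a score to an approximate binding constant in M.

    Assumes score = -log10(K), which matches the listed bands
    (score 9 corresponds to 1e-9 M) but is an inference, not a documented formula.
    """
    return 10.0 ** (-score)


def classify_binder(score: float) -> str:
    """Map a predicted score onto the tight / medium / weak bands."""
    if score >= 9.0:
        return "tight"
    if score >= 7.0:
        return "medium"
    return "weak"


if __name__ == "__main__":
    for s in (9.5, 8.0, 6.2):
        print(f"score={s}: K ~ {score_to_molar(s):.1e} M, {classify_binder(s)} binder")
```

Under this reading, two peptides scored 8.0 and 9.0 against the same target differ by roughly a factor of ten in K.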
### Model Architecture
- **Sequence Embeddings:** [ESM-2 650M model](https://huggingface.co/facebook/esm2_t33_650M_UR50D) / [PeptideCLM model](https://huggingface.co/aaronfeller/PeptideCLM-23M-all). Foundational embeddings are frozen.
- **XGBoost Model:** Gradient boosting on pooled embedding features for efficient, high-performance prediction.
- **CNN/Transformer Model:** One-dimensional convolutional/self-attention transformer networks operating on unpooled embeddings to capture local sequence patterns.
- **Binding Model:** Transformer-based architecture with cross-attention between protein and peptide representations.
- **SVR Model:** Support Vector Regression applied to pooled embeddings, providing a kernel-based, nonparametric regression baseline that is robust on smaller or noisy datasets.
- **Others:** SVM and Elastic Nets were trained with [RAPIDS cuML](https://github.com/rapidsai/cuml), which requires a CUDA environment and is therefore not supported in the web app. Model checkpoints remain available in the Hugging Face repository.
### Model Training and Weight Hosting
- More instructions can be found at [PeptiVerse](https://huggingface.co/ChatterjeeLab/PeptiVerse)
### 🧪 Physicochemical Properties
#### Net Charge Calculation
- Uses Henderson-Hasselbalch equation
- pH-dependent calculation
- Considers all ionizable groups (K, R, H, D, E, C, Y, termini)
#### Isoelectric Point (pI)
- Bisection method to find pH where net charge = 0
- Precision: ±0.01 pH units
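The net charge and pI calculations above can be sketched as below. The pKa values are illustrative (close to the EMBOSS set) and are an assumption, since the table the tool actually uses is not given here. The function names are hypothetical, and the sketch treats the peptide as linear with free termini.

```python
# Illustrative pKa values (assumed, close to the EMBOSS set).
PKA_POSITIVE = {"K": 10.8, "R": 12.5, "H": 6.5}
PKA_NEGATIVE = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}
PKA_N_TERM = 8.6
PKA_C_TERM = 3.6


def net_charge(sequence: str, ph: float = 7.0) -> float:
    """Henderson-Hasselbalch net charge of a linear peptide at a given pH."""
    # Basic groups contribute the fraction that is protonated (+1 each).
    charge = 1.0 / (1.0 + 10 ** (ph - PKA_N_TERM))
    # Acidic groups contribute the fraction that is deprotonated (-1 each).
    charge -= 1.0 / (1.0 + 10 ** (PKA_C_TERM - ph))
    for residue in sequence.upper():
        if residue in PKA_POSITIVE:
            charge += 1.0 / (1.0 + 10 ** (ph - PKA_POSITIVE[residue]))
        elif residue in PKA_NEGATIVE:
            charge -= 1.0 / (1.0 + 10 ** (PKA_NEGATIVE[residue] - ph))
    return charge


def isoelectric_point(sequence: str, tol: float = 0.01) -> float:
    """Bisect over pH 0-14 until the bracket is narrower than `tol`."""
    low, high = 0.0, 14.0
    while high - low > tol:
        mid = (low + high) / 2.0
        # Net charge decreases monotonically with pH, so a positive
        # charge means the pI lies above the midpoint.
        if net_charge(sequence, mid) > 0:
            low = mid
        else:
            high = mid
    return (low + high) / 2.0


if __name__ == "__main__":
    peptide = "ACDEFGHIKLMNPQRSTVWY"
    print(f"charge at pH 7: {net_charge(peptide):.2f}")
    print(f"pI: {isoelectric_point(peptide):.2f}")
```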
#### Hydrophobicity (GRAVY)
- Grand Average of Hydropathy
- Uses Kyte-Doolittle scale
- Range: -4.5 (hydrophilic) to +4.5 (hydrophobic)
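As a short sketch of the calculation above, GRAVY is the mean Kyte-Doolittle hydropathy value over all residues. The function name `gravy` is hypothetical, and the sketch assumes a sequence of standard residues only.

```python
# Kyte-Doolittle hydropathy scale.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence: str) -> float:
    """Grand Average of Hydropathy: mean hydropathy over the sequence."""
    residues = sequence.upper()
    if not residues:
        raise ValueError("sequence must not be empty")
    # A KeyError here signals a residue outside the 20 standard amino acids.
    return sum(KYTE_DOOLITTLE[res] for res in residues) / len(residues)


if __name__ == "__main__":
    for peptide in ("IIII", "RRRR", "ACDEFGHIK"):
        print(peptide, round(gravy(peptide), 3))
```

The two extremes of the range correspond to poly-isoleucine (+4.5) and poly-arginine (-4.5).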
### Citation
If you use this tool, please cite:
```
@article {Zhang2025.12.31.697180,
author = {Zhang, Yinuo and Tang, Sophia and Chen, Tong and Mahood, Elizabeth and Vincoff, Sophia and Chatterjee, Pranam},
title = {PeptiVerse: A Unified Platform for Therapeutic Peptide Property Prediction},
elocation-id = {2025.12.31.697180},
year = {2026},
doi = {10.64898/2025.12.31.697180},
publisher = {Cold Spring Harbor Laboratory},
URL = {https://www.biorxiv.org/content/early/2026/01/03/2025.12.31.697180},
eprint = {https://www.biorxiv.org/content/early/2026/01/03/2025.12.31.697180.full.pdf},
journal = {bioRxiv}
}
```
### Contact
For questions or collaborations: [yzhang@u.duke.nus.edu](mailto:yzhang@u.duke.nus.edu) or [pranam@seas.upenn.edu](mailto:pranam@seas.upenn.edu)