| ### Input Requirements and Constraints | |
| > Supported Inputs | |
| - Amino acid sequences: Linear peptides composed of standard 20 amino acids | |
| - SMILES: Chemically modified peptides, including cyclization, D-amino acids, and noncanonical residues | |
| > Validation | |
| - Invalid sequences or SMILES will be rejected | |
| - Properties not supported are labeled as (Not Supported) | |
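The amino-acid side of the validation rule above can be sketched as follows. This is a minimal illustration, not the tool's actual validator: the function names are hypothetical, and treating lowercase input as valid after uppercasing is an assumption. SMILES validation is not covered here because it would need a cheminformatics toolkit.

```python
# The 20 standard amino acids in one-letter code.
STANDARD_AA = frozenset("ACDEFGHIKLMNPQRSTVWY")


def is_valid_sequence(seq: str) -> bool:
    """Return True if seq is non-empty and uses only the 20 standard residues."""
    cleaned = seq.strip().upper()  # assumption: case-insensitive input
    return bool(cleaned) and all(ch in STANDARD_AA for ch in cleaned)


def invalid_residues(seq: str) -> list[str]:
    """List distinct characters outside the standard alphabet, in order of appearance."""
    seen: list[str] = []
    for ch in seq.strip().upper():
        if ch not in STANDARD_AA and ch not in seen:
            seen.append(ch)
    return seen
```

A caller could reject the input when `is_valid_sequence` returns False and report `invalid_residues(seq)` to explain why.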
| ### Training Data Collection | |
| **Data distribution.** Classification tasks report counts for class 0/1; regression tasks report total sample size (N). AA stands for amino acid-based sequences. | |
| #### Classification (counts for class 0 / 1) | |
| | Property | AA (0) | AA (1) | SMILES (0) | SMILES (1) | | |
| |---|---:|---:|---:|---:| | |
| | Hemolysis | 4765 | 1311 | 4765 | 1311 | | |
| | Non-Fouling | 13580 | 3600 | 13580 | 3600 | | |
| | Solubility | 9668 | 8785 | – | – | | |
| | Permeability (Penetrance) | 1162 | 1162 | – | – | | |
| | Toxicity | – | – | 5518 | 5518 | | |
| #### Regression (total N) | |
| | Property | AA (N) | SMILES (N) | | |
| |---|---:|---:| | |
| | Permeability (PAMPA) | – | 6869 | | |
| | Permeability (CACO2) | – | 606 | | |
| | Half-Life | 130 | 245 | | |
| | Binding Affinity | 1436 | 1597 | | |
| Our models are trained on curated datasets from multiple sources. For detailed data-cleaning procedures, please refer to our [paper](https://www.biorxiv.org/content/10.64898/2025.12.31.697180v1). | |
| #### Hemolysis Dataset | |
| - **Primary Source:** [the Database of Antimicrobial Activity and Structure of Peptides (DBAASPv3)](https://academic.oup.com/nar/article-abstract/49/D1/D288/5957160) | |
| - **Secondary Source:** [peptide-dashboard](https://pubs.acs.org/doi/full/10.1021/acs.jcim.2c01317) | |
| - **Description:** Probability of peptide disrupting red blood cell membranes. | |
| - **Interpretation:** HC50 is the peptide concentration at which 50% of red blood cells are lysed. Peptides with HC50 < 100 µM are labeled hemolytic (1) and all others non-hemolytic (0), giving a binary dataset. Scores close to 1 indicate a high probability of red blood cell membrane disruption, while scores close to 0 indicate low hemolytic risk. The predicted probability should therefore be interpreted as a risk indicator, not an exact concentration estimate. | |
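The labeling rule described above can be written as a small helper. This is a sketch under the stated 100 µM threshold; the function and constant names are hypothetical, and the input is assumed to be already expressed in µM.

```python
# Threshold from the dataset description: HC50 below 100 uM counts as hemolytic.
HC50_THRESHOLD_UM = 100.0


def hemolysis_label(hc50_um: float) -> int:
    """Map a measured HC50 (in uM, assumed) to the binary training label.

    Returns 1 for hemolytic (HC50 < 100 uM) and 0 for non-hemolytic.
    """
    return 1 if hc50_um < HC50_THRESHOLD_UM else 0
```

Note that the comparison is strict, so a peptide with HC50 exactly equal to 100 µM is labeled non-hemolytic.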
| #### Solubility Dataset | |
| - **Primary Source:** [PROSO-II](https://febs.onlinelibrary.wiley.com/doi/abs/10.1111/j.1742-4658.2012.08603.x) | |
| - **Secondary Source:** [peptideBERT](https://pubs.acs.org/doi/abs/10.1021/acs.jpclett.3c02398) | |
| - **Description:** Probability of peptide remaining dissolved in aqueous conditions. | |
| - **Interpretation:** Outputs a probability (0–1) that a peptide remains soluble in aqueous conditions. Higher scores indicate lower aggregation risk and better formulation stability. | |
| #### Non-Fouling Dataset | |
| - **Primary Source:** [Classifying antimicrobial and multifunctional peptides with Bayesian network models](https://doi.org/10.1002/pep2.24079) | |
| - **Secondary Source:** [peptideBERT](https://pubs.acs.org/doi/abs/10.1021/acs.jpclett.3c02398) | |
| - **Description:** A nonfouling peptide resists nonspecific interactions and protein adsorption. | |
| - **Interpretation:** Outputs the probability (0–1) that a peptide resists nonspecific protein adsorption. Higher scores indicate stronger non-fouling behavior, desirable for circulation and surface-exposed applications. | |
| #### Permeability Dataset | |
| - **Primary Source:** [CycPeptMPDB](https://pubs.acs.org/doi/abs/10.1021/acs.jcim.2c01573), [PAMPA](https://doi.org/10.1517/17425255.1.2.325) | |
| - **Secondary Source:** [PepLand](https://arxiv.org/abs/2311.04419) | |
| - **Description:** Probability of peptide penetrating the cell membrane. | |
| - **Interpretation:** For PAMPA and Caco-2 regression, outputs are log-scaled permeability values. Following CycPeptMPDB conventions, log Pexp ≥ −6.0 indicates favorable permeability, while values below −6.0 indicate weak permeability. For penetrance classification, a probability closer to 1 indicates a higher likelihood of cell penetration, and *vice versa*. | |
| #### Half-Life Dataset | |
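The −6.0 cutoff for the regression outputs can be applied as in the sketch below. The function and constant names are hypothetical, and the helper only encodes the threshold quoted above; it makes no claim about how the web app displays results.

```python
# Cutoff on log Pexp quoted from the CycPeptMPDB convention in the text above.
PERMEABILITY_CUTOFF = -6.0


def permeability_class(log_pexp: float) -> str:
    """Classify a predicted log Pexp value as 'favorable' or 'weak'."""
    return "favorable" if log_pexp >= PERMEABILITY_CUTOFF else "weak"
```

For example, a prediction of −5.2 falls on the favorable side of the cutoff, while −7.1 falls on the weak side.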
| - **Primary Source:** [THPdb2](https://doi.org/10.1016/j.drudis.2024.104047), [PepTherDia](https://doi.org/10.1016/j.drudis.2021.02.019), [PEPlife](https://www.nature.com/articles/srep36617) | |
| - **Interpretation:** Predicted values are in hours and reflect relative peptide stability. Higher values indicate longer persistence in serum, while lower values suggest faster degradation. | |
| #### Toxicity Dataset | |
| - **Primary Source:** [ToxinPred3.0](https://www.sciencedirect.com/science/article/pii/S0010482524010114) | |
| - **Interpretation:** Outputs a probability (0–1) that a peptide exhibits toxic effects. Higher scores indicate increased toxicity risk. | |
| #### Binding Affinity Dataset | |
| - **Primary Source:** [PepLand](https://arxiv.org/abs/2311.04419) | |
| - **Description:** The model predicts a continuous binding affinity score, where higher values indicate stronger binding. Affinity labels combine Kd, Ki, and IC50 measurements and are used as already normalized in PepLand. Scores are comparable across peptides binding to the same protein target. | |
| - **Interpretation:** | |
| - Scores ≥ 9 correspond to tight binders (K ≤ 10⁻⁹ M, nanomolar to picomolar range) | |
| - Scores between 7 and 9 correspond to medium binders (K between 10⁻⁷ and 10⁻⁹ M, nanomolar range) | |
| - Scores < 7 correspond to weak binders (K > 10⁻⁷ M, sub-micromolar and weaker) | |
| - A difference of 1 unit in score corresponds to an approximately tenfold change in binding affinity. | |
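The score bands above can be read as a negative base-10 logarithm of the affinity constant. The sketch below makes that assumption explicit (score = −log10 of K in molar); it is inferred from the listed thresholds rather than stated by the authors, and the function names are hypothetical.

```python
def score_to_molar(score: float) -> float:
    """Convert a score to an approximate K in molar, assuming score = -log10(K)."""
    return 10.0 ** (-score)


def binder_class(score: float) -> str:
    """Bucket a score using the thresholds listed above (>= 9, 7 to 9, < 7)."""
    if score >= 9:
        return "tight"
    if score >= 7:
        return "medium"
    return "weak"


def fold_change(score_a: float, score_b: float) -> float:
    """Approximate affinity ratio between two scores (one unit is about tenfold)."""
    return 10.0 ** (score_a - score_b)
```

Because scores are described as comparable only across peptides binding the same target, `fold_change` is meaningful for ranking candidates against one protein, not across different proteins.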
| ### Model Architecture | |
| - **Sequence Embeddings:** [ESM-2 650M model](https://huggingface.co/facebook/esm2_t33_650M_UR50D) / [PeptideCLM model](https://huggingface.co/aaronfeller/PeptideCLM-23M-all). Foundational embeddings are frozen. | |
| - **XGBoost Model:** Gradient boosting on pooled embedding features for efficient, high-performance prediction. | |
| - **CNN/Transformer Model:** One-dimensional convolutional/self-attention transformer networks operating on unpooled embeddings to capture local sequence patterns. | |
| - **Binding Model:** Transformer-based architecture with cross-attention between protein and peptide representations. | |
| - **SVR Model:** Support Vector Regression applied to pooled embeddings, providing a kernel-based, nonparametric regression baseline that is robust on smaller or noisy datasets. | |
| - **Others:** SVM and Elastic Nets were trained with [RAPIDS cuML](https://github.com/rapidsai/cuml), which requires a CUDA environment and is therefore not supported in the web app. Model checkpoints remain available in the Hugging Face repository. | |
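The difference between pooled and unpooled embeddings in the list above can be illustrated with a small pure-Python sketch. The source does not say which pooling is used, so masked mean pooling here is an assumption, and `mean_pool` is a hypothetical helper rather than part of the PeptiVerse code.

```python
def mean_pool(token_embeddings: list[list[float]], attention_mask: list[int]) -> list[float]:
    """Average per-token vectors over positions where the mask is 1.

    token_embeddings: one vector per token (the "unpooled" representation).
    attention_mask:   1 for real tokens, 0 for padding.
    Returns a single fixed-length vector (the "pooled" representation).
    """
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vec):
                total[i] += value
    if count == 0:
        raise ValueError("attention mask selects no tokens")
    return [t / count for t in total]
```

A fixed-length pooled vector suits tabular learners such as XGBoost or SVR, while the CNN and Transformer heads consume the full per-token matrix and can therefore pick up position-specific patterns.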
| ### Model Training and Weight Hosting | |
| - More instructions can be found at [PeptiVerse](https://huggingface.co/ChatterjeeLab/PeptiVerse) | |
| ### Physicochemical Properties | |
| #### Net Charge Calculation | |
| - Uses Henderson-Hasselbalch equation | |
| - pH-dependent calculation | |
| - Considers all ionizable groups (K, R, H, D, E, C, Y, termini) | |
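A Henderson-Hasselbalch net charge calculation over the groups listed above might look like the following sketch. The pKa values are illustrative and are not taken from the tool, whose pKa table is not given here. The sketch also assumes free termini (one amine, one carboxyl) and free cysteine thiols.

```python
# Illustrative pKa values (assumed); the tool's own table may differ.
POSITIVE_PKA = {"K": 10.8, "R": 12.5, "H": 6.5, "N_TERM": 8.6}
NEGATIVE_PKA = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1, "C_TERM": 3.6}


def net_charge(seq: str, ph: float = 7.0) -> float:
    """Estimate the net charge of a linear peptide at a given pH."""
    seq = seq.upper()
    counts = {aa: seq.count(aa) for aa in "KRHDECY"}
    counts["N_TERM"] = 1  # assumption: free N-terminus
    counts["C_TERM"] = 1  # assumption: free C-terminus

    charge = 0.0
    # Basic groups carry +1 when protonated: fraction = 1 / (1 + 10^(pH - pKa)).
    for group, pka in POSITIVE_PKA.items():
        charge += counts[group] / (1.0 + 10.0 ** (ph - pka))
    # Acidic groups carry -1 when deprotonated: fraction = 1 / (1 + 10^(pKa - pH)).
    for group, pka in NEGATIVE_PKA.items():
        charge -= counts[group] / (1.0 + 10.0 ** (pka - ph))
    return charge
```

Cyclized or terminally capped peptides would need the terminal counts set to zero, which this sketch does not handle.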
| #### Isoelectric Point (pI) | |
| - Bisection method to find pH where net charge = 0 | |
| - Precision: ±0.01 pH units | |
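The bisection search described above can be sketched as follows. The block carries its own compact charge function so it runs standalone. The pKa values are illustrative assumptions, free termini are assumed, and the search interval of pH 0 to 14 is a choice made for this example.

```python
# Illustrative (pKa, charge sign) pairs; assumed, not taken from the tool.
IONIZABLE = {
    "N_TERM": (8.6, +1), "K": (10.8, +1), "R": (12.5, +1), "H": (6.5, +1),
    "C_TERM": (3.6, -1), "D": (3.9, -1), "E": (4.1, -1), "C": (8.5, -1), "Y": (10.1, -1),
}


def net_charge(seq: str, ph: float) -> float:
    """Henderson-Hasselbalch net charge for a linear peptide with free termini."""
    seq = seq.upper()
    charge = 0.0
    for group, (pka, sign) in IONIZABLE.items():
        n = 1 if group.endswith("_TERM") else seq.count(group)
        # sign * (pH - pKa) yields the right exponent for basic and acidic groups.
        charge += sign * n / (1.0 + 10.0 ** (sign * (ph - pka)))
    return charge


def isoelectric_point(seq: str, tol: float = 0.01) -> float:
    """Find the pH where net charge crosses zero by bisection on [0, 14]."""
    lo, hi = 0.0, 14.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        # Net charge decreases monotonically with pH, so a positive charge
        # means the zero crossing lies at a higher pH.
        if net_charge(seq, mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0
```

Bisection works here because the charge curve is monotonic in pH, so the bracket always contains exactly one zero crossing and the loop stops once the bracket is narrower than the tolerance.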
| #### Hydrophobicity (GRAVY) | |
| - Grand Average of Hydropathy | |
| - Uses Kyte-Doolittle scale | |
| - Range: -4.5 (hydrophilic) to +4.5 (hydrophobic) | |
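GRAVY is the mean Kyte-Doolittle hydropathy over all residues, which the sketch below computes directly. The function name is hypothetical, and raising an error on an empty sequence is a design choice for this example.

```python
# Kyte-Doolittle hydropathy scale for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(seq: str) -> float:
    """Return the grand average of hydropathy for a peptide sequence."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    # A KeyError here signals a residue outside the standard 20.
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)
```

The extremes of the range correspond to all-arginine (−4.5) and all-isoleucine (+4.5) sequences.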
| ### Citation | |
| If you use this tool, please cite: | |
| ``` | |
| @article {Zhang2025.12.31.697180, | |
| author = {Zhang, Yinuo and Tang, Sophia and Chen, Tong and Mahood, Elizabeth and Vincoff, Sophia and Chatterjee, Pranam}, | |
| title = {PeptiVerse: A Unified Platform for Therapeutic Peptide Property Prediction}, | |
| elocation-id = {2025.12.31.697180}, | |
| year = {2026}, | |
| doi = {10.64898/2025.12.31.697180}, | |
| publisher = {Cold Spring Harbor Laboratory}, | |
| URL = {https://www.biorxiv.org/content/early/2026/01/03/2025.12.31.697180}, | |
| eprint = {https://www.biorxiv.org/content/early/2026/01/03/2025.12.31.697180.full.pdf}, | |
| journal = {bioRxiv} | |
| } | |
| ``` | |
| ### Contact | |
| For questions or collaborations: [yzhang@u.duke.nus.edu](mailto:yzhang@u.duke.nus.edu) or [pranam@seas.upenn.edu](mailto:pranam@seas.upenn.edu) |