

Input Requirements and Constraints

Supported Inputs

  • Amino acid sequences: Linear peptides composed of the 20 standard amino acids
  • SMILES: Chemically modified peptides, including cyclization, D-amino acids, and noncanonical residues

Validation

  • Invalid sequences or SMILES will be rejected
  • Properties not supported are labeled as (Not Supported)
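As a rough illustration of the sequence check described above, the following sketch accepts only non-empty strings built from the 20 standard one-letter codes. The function name `is_valid_sequence` and the strict upper-case rule are assumptions made for this example, not the app's actual validator. SMILES validation would typically rely on a cheminformatics toolkit and is not shown here.

```python
# Hypothetical validator: the real app's rules may differ.
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard one-letter codes


def is_valid_sequence(seq: str) -> bool:
    """Return True if seq is non-empty and uses only standard residues."""
    seq = seq.strip()
    # Lower-case letters are rejected here; this strictness is an assumption.
    return bool(seq) and all(ch in STANDARD_AA for ch in seq)


if __name__ == "__main__":
    for candidate in ["ACDEFGHIK", "ACXK", ""]:
        print(repr(candidate), is_valid_sequence(candidate))
```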

Training Data Collection

Data distribution. Classification tasks report counts for class 0/1; regression tasks report total sample size (N). AA stands for amino acid-based sequences.

Classification (counts for class 0 / 1)

| Property | AA (0) | AA (1) | SMILES (0) | SMILES (1) |
|---|---|---|---|---|
| Hemolysis | 4765 | 1311 | 4765 | 1311 |
| Non-Fouling | 13580 | 3600 | 13580 | 3600 |
| Solubility | 9668 | 8785 | – | – |
| Permeability (Penetrance) | 1162 | 1162 | – | – |
| Toxicity | – | – | 5518 | 5518 |

Regression (total N)

| Property | AA (N) | SMILES (N) |
|---|---|---|
| Permeability (PAMPA) | – | 6869 |
| Permeability (CACO2) | – | 606 |
| Half-Life | 130 | 245 |
| Binding Affinity | 1436 | 1597 |

Our models are trained on curated datasets from multiple sources. For detailed data-cleaning procedures, please refer to our paper.

🩸 Hemolysis Dataset

  • Primary Source: the Database of Antimicrobial Activity and Structure of Peptides (DBAASPv3)
  • Secondary Source: peptide-dashboard
  • Description: Probability of peptide disrupting red blood cell membranes.
  • Interpretation: HC50 is the concentration at which 50% of red blood cells are lysed (x µg/mL). Peptides with HC50 < 100 µM are labeled hemolytic (1) and all others non-hemolytic (0), resulting in a binary 0/1 dataset. Scores close to 1 indicate a high probability of red blood cell membrane disruption, while scores close to 0 indicate low hemolytic risk. The predicted probability should therefore be interpreted as a risk indicator, not an exact concentration estimate.
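The labeling rule above can be sketched as a simple threshold on HC50. The function name `hemolysis_label` and the assumption that the input is already expressed in µM are illustrative; the actual preprocessing is described in the paper.

```python
# Threshold taken from the text above: HC50 < 100 uM counts as hemolytic.
HC50_THRESHOLD_UM = 100.0


def hemolysis_label(hc50_um: float) -> int:
    """Return 1 (hemolytic) if HC50 is below the threshold, else 0."""
    return 1 if hc50_um < HC50_THRESHOLD_UM else 0


if __name__ == "__main__":
    for value in (12.5, 100.0, 350.0):
        print(value, hemolysis_label(value))
```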

💧 Solubility Dataset

  • Primary Source: PROSO-II
  • Secondary Source: peptideBERT
  • Description: Probability of peptide remaining dissolved in aqueous conditions.
  • Interpretation: Outputs a probability (0–1) that a peptide remains soluble in aqueous conditions. Higher scores indicate lower aggregation risk and better formulation stability.

👯 Non-Fouling Dataset

🪣 Permeability Dataset

  • Primary Source: CycPeptMPDB, PAMPA
  • Secondary Source: PepLand
  • Description: Probability of peptide penetrating the cell membrane.
  • Interpretation: For PAMPA and CACO-2 regression, outputs are log-scaled permeability values. Following CycPeptMPDB conventions, log Pexp ≥ −6.0 indicates favorable permeability, while values below −6.0 indicate weak permeability. For penetrance prediction, a probability closer to 1 indicates a higher likelihood of cell penetrance, and vice versa.
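A minimal sketch of how the −6.0 cutoff above could be applied to a predicted log-scaled permeability value. The function name `permeability_class` and the returned labels are assumptions for illustration only.

```python
def permeability_class(log_pexp: float, cutoff: float = -6.0) -> str:
    """Classify a log-scaled permeability value using the -6.0 convention."""
    return "favorable" if log_pexp >= cutoff else "weak"


if __name__ == "__main__":
    for value in (-5.2, -6.0, -7.4):
        print(value, permeability_class(value))
```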

⏱️ Half-Life Dataset

  • Primary Source: Thpdb2, PepTherDia, peplife
  • Interpretation: Predicted values reflect relative peptide stability, expressed in hours. Higher scores indicate longer persistence in serum, while lower scores suggest faster degradation.

☠️ Toxicity Dataset

  • Primary Source: ToxinPred3.0
  • Interpretation: Outputs a probability (0–1) that a peptide exhibits toxic effects. Higher scores indicate increased toxicity risk.

🔗 Binding Affinity Dataset

  • Primary Source: PepLand
  • Description: Binding values are already normalized in PepLand and combine Kd, Ki, and IC50 measurements.
  • Description: The model predicts a continuous binding affinity score, where higher values indicate stronger binding. Scores are comparable across peptides binding to the same protein target.
  • Interpretation:
    • Scores ≥ 9 correspond to tight binders (K ≤ 10⁻⁹ M, nanomolar to picomolar range)
    • Scores between 7 and 9 correspond to medium binders (10⁻⁷–10⁻⁹ M, nanomolar to micromolar range)
    • Scores < 7 correspond to weak binders (K ≥ 10⁻⁶ M, micromolar and weaker)
    • A difference of 1 unit in score corresponds to an approximately tenfold change in binding affinity.
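The thresholds above are consistent with a score that behaves like a negative base-10 logarithm of the binding constant in molar units. The sketch below assumes exactly that relationship (score = −log10 K), which the source implies but does not state outright. The function names are hypothetical.

```python
import math


def score_to_molar(score: float) -> float:
    """Convert a score to a binding constant in M, assuming score = -log10(K)."""
    return 10.0 ** (-score)


def binder_class(score: float) -> str:
    """Bin a score using the cutoffs listed above (9 and 7)."""
    if score >= 9:
        return "tight"
    if score >= 7:
        return "medium"
    return "weak"


if __name__ == "__main__":
    for s in (9.5, 8.0, 6.2):
        print(s, binder_class(s), f"{score_to_molar(s):.2e} M")
    # One score unit corresponds to roughly a tenfold change in K.
    print(math.isclose(score_to_molar(7.0) / score_to_molar(8.0), 10.0))
```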

Model Architecture

  • Sequence Embeddings: ESM-2 650M model / PeptideCLM model. Foundational embeddings are frozen.
  • XGBoost Model: Gradient boosting on pooled embedding features for efficient, high-performance prediction.
  • CNN/Transformer Model: One-dimensional convolutional/self-attention transformer networks operating on unpooled embeddings to capture local sequence patterns.
  • Binding Model: Transformer-based architecture with cross-attention between protein and peptide representations.
  • SVR Model: Support Vector Regression applied to pooled embeddings, providing a kernel-based, nonparametric regression baseline that is robust on smaller or noisy datasets.
  • Others: SVM and Elastic Nets were trained with RAPIDS cuML, which requires a CUDA environment and is therefore not supported in the web app. Model checkpoints remain available in the Hugging Face repository.
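To make the pooled versus unpooled distinction concrete: a frozen encoder yields one vector per residue or token, and pooling collapses these into a single fixed-length vector that a model such as XGBoost or SVR can consume. The sketch below uses mean pooling in plain Python. The pooling operator is an assumption, since the source only says "pooled", and `mean_pool` is a hypothetical helper.

```python
from typing import List


def mean_pool(token_embeddings: List[List[float]]) -> List[float]:
    """Average per-token vectors into one fixed-length feature vector."""
    if not token_embeddings:
        raise ValueError("need at least one token embedding")
    length = len(token_embeddings)
    dim = len(token_embeddings[0])
    # Average each embedding dimension across all tokens.
    return [sum(tok[d] for tok in token_embeddings) / length for d in range(dim)]


if __name__ == "__main__":
    # Toy "unpooled" embeddings: 3 tokens, 2 dimensions each.
    unpooled = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    print(mean_pool(unpooled))
```

The CNN/Transformer models would instead take the full per-token matrix as input, so sequence position is preserved.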

Model Training and Weight Hosting

🧪 Physicochemical Properties

Net Charge Calculation

  • Uses Henderson-Hasselbalch equation
  • pH-dependent calculation
  • Considers all ionizable groups (K, R, H, D, E, C, Y, termini)

Isoelectric Point (pI)

  • Bisection method to find pH where net charge = 0
  • Precision: ±0.01 pH units
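The net charge and pI calculations above can be sketched together, since the bisection search simply looks for the pH at which the Henderson-Hasselbalch net charge crosses zero. The pKa values below are illustrative assumptions (the source does not list the table the app uses), and the function names are hypothetical.

```python
# Illustrative pKa values; the app's actual table is not given in the source.
PKA_POSITIVE = {"K": 10.8, "R": 12.5, "H": 6.5, "N_term": 8.6}
PKA_NEGATIVE = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1, "C_term": 3.6}


def net_charge(seq: str, ph: float = 7.0) -> float:
    """Net charge at a given pH from Henderson-Hasselbalch fractions."""
    charge = 0.0
    for group, pka in PKA_POSITIVE.items():
        n = 1 if group == "N_term" else seq.count(group)
        # Fraction of a basic group that is protonated (charge +1).
        charge += n / (1.0 + 10.0 ** (ph - pka))
    for group, pka in PKA_NEGATIVE.items():
        n = 1 if group == "C_term" else seq.count(group)
        # Fraction of an acidic group that is deprotonated (charge -1).
        charge -= n / (1.0 + 10.0 ** (pka - ph))
    return charge


def isoelectric_point(seq: str, tol: float = 0.01) -> float:
    """Bisection on pH in [0, 14]; net charge decreases as pH rises."""
    lo, hi = 0.0, 14.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if net_charge(seq, mid) > 0:
            lo = mid  # still positive: the zero crossing is at higher pH
        else:
            hi = mid
    return (lo + hi) / 2.0


if __name__ == "__main__":
    peptide = "ACDK"
    print(round(net_charge(peptide, 7.0), 3), round(isoelectric_point(peptide), 2))
```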

Hydrophobicity (GRAVY)

  • Grand Average of Hydropathy
  • Uses Kyte-Doolittle scale
  • Range: -4.5 (hydrophilic) to +4.5 (hydrophobic)
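GRAVY is the arithmetic mean of the Kyte-Doolittle hydropathy values over all residues, which is why it is bounded by the scale's extremes (arginine at −4.5, isoleucine at +4.5). A minimal sketch, with `gravy` as a hypothetical function name:

```python
# Kyte-Doolittle hydropathy scale for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(seq: str) -> float:
    """Grand Average of Hydropathy: mean hydropathy over the sequence."""
    if not seq:
        raise ValueError("sequence must be non-empty")
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)


if __name__ == "__main__":
    for peptide in ("IIII", "RRRR", "AG"):
        print(peptide, round(gravy(peptide), 2))
```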

Citation

If you use this tool, please cite:

@article {Zhang2025.12.31.697180,
    author = {Zhang, Yinuo and Tang, Sophia and Chen, Tong and Mahood, Elizabeth and Vincoff, Sophia and Chatterjee, Pranam},
    title = {PeptiVerse: A Unified Platform for Therapeutic Peptide Property Prediction},
    elocation-id = {2025.12.31.697180},
    year = {2026},
    doi = {10.64898/2025.12.31.697180},
    publisher = {Cold Spring Harbor Laboratory},
    URL = {https://www.biorxiv.org/content/early/2026/01/03/2025.12.31.697180},
    eprint = {https://www.biorxiv.org/content/early/2026/01/03/2025.12.31.697180.full.pdf},
    journal = {bioRxiv}
}

Contact

For questions or collaborations: yzhang@u.duke.nus.edu or pranam@seas.upenn.edu