### Input Requirements and Constraints

> Supported Inputs
- Amino acid sequences: Linear peptides composed of the 20 standard amino acids
- SMILES: Chemically modified peptides, including cyclization, D-amino acids, and noncanonical residues

> Validation
- Invalid sequences or SMILES will be rejected
- Properties not supported are labeled as (Not Supported)

### Training Data Collection

**Data distribution.** Classification tasks report counts for class 0/1; regression tasks report total sample size (N). AA stands for amino acid-based sequences.

#### Classification (counts for class 0 / 1)

| Property | AA (0) | AA (1) | SMILES (0) | SMILES (1) |
|---|---:|---:|---:|---:|
| Hemolysis | 4765 | 1311 | 4765 | 1311 |
| Non-Fouling | 13580 | 3600 | 13580 | 3600 |
| Solubility | 9668 | 8785 | – | – |
| Permeability (Penetrance) | 1162 | 1162 | – | – |
| Toxicity | – | – | 5518 | 5518 |

#### Regression (total N)

| Property | AA (N) | SMILES (N) |
|---|---:|---:|
| Permeability (PAMPA) | – | 6869 |
| Permeability (CACO2) | – | 606 |
| Half-Life | 130 | 245 |
| Binding Affinity | 1436 | 1597 |

Our models are trained on curated datasets from multiple sources. For detailed data-cleaning procedures, please refer to our [paper](https://www.biorxiv.org/content/10.64898/2025.12.31.697180v1).

#### 🩸 Hemolysis Dataset
- **Primary Source:** [the Database of Antimicrobial Activity and Structure of Peptides (DBAASPv3)](https://academic.oup.com/nar/article-abstract/49/D1/D288/5957160)
- **Secondary Source:** [peptide-dashboard](https://pubs.acs.org/doi/full/10.1021/acs.jcim.2c01317)
- **Description:** Probability of peptide disrupting red blood cell membranes.
- **Interpretation:** HC50 is the peptide concentration at which 50% of red blood cells are lysed. Peptides with HC50 < 100 µM are labeled hemolytic (1) and all others non-hemolytic (0), resulting in a binary 0/1 dataset. Scores close to 1 indicate a high probability of red blood cell membrane disruption, while scores close to 0 indicate low hemolytic risk.
  The predicted probability should therefore be interpreted as a risk indicator, not an exact concentration estimate.

#### 💧 Solubility Dataset
- **Primary Source:** [PROSO-II](https://febs.onlinelibrary.wiley.com/doi/abs/10.1111/j.1742-4658.2012.08603.x)
- **Secondary Source:** [peptideBERT](https://pubs.acs.org/doi/abs/10.1021/acs.jpclett.3c02398)
- **Description:** Probability of peptide remaining dissolved in aqueous conditions.
- **Interpretation:** Outputs a probability (0–1) that a peptide remains soluble in aqueous conditions. Higher scores indicate lower aggregation risk and better formulation stability.

#### 👯 Non-Fouling Dataset
- **Primary Source:** [Classifying antimicrobial and multifunctional peptides with Bayesian network models](https://doi.org/10.1002/pep2.24079)
- **Secondary Source:** [peptideBERT](https://pubs.acs.org/doi/abs/10.1021/acs.jpclett.3c02398)
- **Description:** A nonfouling peptide resists nonspecific interactions and protein adsorption.
- **Interpretation:** Outputs the probability (0–1) that a peptide resists nonspecific protein adsorption. Higher scores indicate stronger non-fouling behavior, desirable for circulation and surface-exposed applications.

#### 🪣 Permeability Dataset
- **Primary Source:** [CycPeptMPDB](https://pubs.acs.org/doi/abs/10.1021/acs.jcim.2c01573), [PAMPA](https://doi.org/10.1517/17425255.1.2.325)
- **Secondary Source:** [PepLand](https://arxiv.org/abs/2311.04419)
- **Description:** Probability of peptide penetrating the cell membrane.
- **Interpretation:** For PAMPA and CACO-2 regression, outputs are log-scaled permeability values. Following CycPeptMPDB conventions, log Pexp ≥ −6.0 indicates favorable permeability, while values below −6.0 indicate weak permeability. For penetrance prediction, a probability closer to 1 indicates a higher likelihood of cell penetrance, and *vice versa*.
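
The permeability cutoff above can be applied to a regression output with a one-line rule. The sketch below is illustrative only: the function name `permeability_label` and the label strings are hypothetical and not part of the PeptiVerse code; only the −6.0 threshold comes from the text.

```python
def permeability_label(log_pexp: float, threshold: float = -6.0) -> str:
    """Label a predicted log Pexp value using the CycPeptMPDB-style cutoff.

    Values at or above the threshold are treated as favorable,
    values below it as weak. The label strings are illustrative.
    """
    return "favorable" if log_pexp >= threshold else "weak"


if __name__ == "__main__":
    for value in (-5.2, -6.0, -7.1):
        print(value, permeability_label(value))
```

The threshold is kept as a parameter so that a different convention can be substituted without changing the function.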
#### ⏱️ Half-Life Dataset
- **Primary Source:** [Thpdb2](https://doi.org/10.1016/j.drudis.2024.104047), [PepTherDia](https://doi.org/10.1016/j.drudis.2021.02.019), [peplife](https://www.nature.com/articles/srep36617)
- **Interpretation:** Predicted values are half-lives in hours and reflect relative peptide stability. Higher scores indicate longer persistence in serum, while lower scores suggest faster degradation.

#### ☠️ Toxicity Dataset
- **Primary Source:** [ToxinPred3.0](https://www.sciencedirect.com/science/article/pii/S0010482524010114)
- **Interpretation:** Outputs a probability (0–1) that a peptide exhibits toxic effects. Higher scores indicate increased toxicity risk.

#### 🔗 Binding Affinity Dataset
- **Primary Source:** [PepLand](https://arxiv.org/abs/2311.04419)
- **Description:** The model predicts a continuous binding affinity score, where higher values indicate stronger binding. The training labels combine Kd, Ki, and IC50 measurements and are used as already normalized in PepLand. Scores are comparable across peptides binding to the same protein target.
- **Interpretation:**
  - Scores ≥ 9 correspond to tight binders (K ≤ 10⁻⁹ M, nanomolar to picomolar range)
  - Scores between 7 and 9 correspond to medium binders (K between 10⁻⁹ and 10⁻⁷ M, nanomolar to sub-micromolar range)
  - Scores < 7 correspond to weak binders (K > 10⁻⁷ M, sub-micromolar and weaker)
  - A difference of 1 unit in score corresponds to an approximately tenfold change in binding affinity.
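
The score bands above are consistent with a negative log10 scale. The sketch below assumes that score = −log10(K in mol/L), which is inferred from the listed thresholds rather than stated by the authors. The function names are hypothetical.

```python
def score_to_molar(score: float) -> float:
    """Convert an affinity score to a constant in mol/L.

    Assumes score = -log10(K), an inference from the documented bands.
    """
    return 10.0 ** (-score)


def binder_class(score: float) -> str:
    """Map a score to the tight / medium / weak bands listed above."""
    if score >= 9:
        return "tight"
    if score >= 7:
        return "medium"
    return "weak"


def fold_change(score_a: float, score_b: float) -> float:
    """Approximate ratio of binding strength between two scores.

    One score unit corresponds to roughly a tenfold change.
    """
    return 10.0 ** (score_a - score_b)


if __name__ == "__main__":
    for s in (9.3, 8.0, 6.5):
        print(s, binder_class(s), f"{score_to_molar(s):.2e} M")
    print("8.0 vs 7.0 fold change:", fold_change(8.0, 7.0))
```

As the dataset description notes, scores are comparable for peptides binding to the same protein target, so `fold_change` is most meaningful within one target.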
### Model Architecture

- **Sequence Embeddings:** [ESM-2 650M model](https://huggingface.co/facebook/esm2_t33_650M_UR50D) / [PeptideCLM model](https://huggingface.co/aaronfeller/PeptideCLM-23M-all). Foundational embeddings are frozen.
- **XGBoost Model:** Gradient boosting on pooled embedding features for efficient, high-performance prediction.
- **CNN/Transformer Model:** One-dimensional convolutional/self-attention transformer networks operating on unpooled embeddings to capture local sequence patterns.
- **Binding Model:** Transformer-based architecture with cross-attention between protein and peptide representations.
- **SVR Model:** Support Vector Regression applied to pooled embeddings, providing a kernel-based, nonparametric regression baseline that is robust on smaller or noisy datasets.
- **Others:** SVM and Elastic Nets were trained with [RAPIDS cuML](https://github.com/rapidsai/cuml), which requires a CUDA environment and is therefore not supported in the web app. Model checkpoints remain available in the Hugging Face repository.
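
Several of the models above consume "pooled" embeddings, meaning the per-residue (or per-token) vectors from the frozen encoder are reduced to one fixed-length vector per peptide. The source does not say which pooling is used. The sketch below assumes masked mean pooling, and the function name `mean_pool` is hypothetical. It uses plain Python lists so it runs without any dependencies.

```python
from typing import List, Sequence


def mean_pool(token_embeddings: Sequence[Sequence[float]],
              mask: Sequence[int]) -> List[float]:
    """Average the token vectors whose mask entry is 1.

    token_embeddings: one vector per token (all of equal length).
    mask: 1 for real tokens, 0 for padding or special tokens.
    """
    if len(token_embeddings) != len(mask):
        raise ValueError("embeddings and mask must have the same length")
    kept = [vec for vec, m in zip(token_embeddings, mask) if m]
    if not kept:
        raise ValueError("mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


if __name__ == "__main__":
    # Three tokens with 2-dimensional embeddings; the last one is padding.
    embeddings = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
    print(mean_pool(embeddings, [1, 1, 0]))
```

The pooled vector has the same length for every peptide, which is what lets tabular learners such as gradient boosting or SVR use it as a feature row. The CNN/Transformer models instead keep the unpooled per-token matrix.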
### Model Training and Weight Hosting

- More instructions can be found at [PeptiVerse](https://huggingface.co/ChatterjeeLab/PeptiVerse)

### 🧪 Physicochemical Properties

#### Net Charge Calculation
- Uses Henderson-Hasselbalch equation
- pH-dependent calculation
- Considers all ionizable groups (K, R, H, D, E, C, Y, termini)

#### Isoelectric Point (pI)
- Bisection method to find pH where net charge = 0
- Precision: ±0.01 pH units

#### Hydrophobicity (GRAVY)
- Grand Average of Hydropathy
- Uses Kyte-Doolittle scale
- Range: -4.5 (hydrophilic) to +4.5 (hydrophobic)

### Citation

If you use this tool, please cite:

```
@article {Zhang2025.12.31.697180,
  author = {Zhang, Yinuo and Tang, Sophia and Chen, Tong and Mahood, Elizabeth and Vincoff, Sophia and Chatterjee, Pranam},
  title = {PeptiVerse: A Unified Platform for Therapeutic Peptide Property Prediction},
  elocation-id = {2025.12.31.697180},
  year = {2026},
  doi = {10.64898/2025.12.31.697180},
  publisher = {Cold Spring Harbor Laboratory},
  URL = {https://www.biorxiv.org/content/early/2026/01/03/2025.12.31.697180},
  eprint = {https://www.biorxiv.org/content/early/2026/01/03/2025.12.31.697180.full.pdf},
  journal = {bioRxiv}
}
```

### Contact

For questions or collaborations: [yzhang@u.duke.nus.edu](mailto:yzhang@u.duke.nus.edu) or [pranam@seas.upenn.edu](mailto:pranam@seas.upenn.edu)
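
For readers who want to reproduce the calculations listed under Physicochemical Properties, here is a minimal sketch of net charge (Henderson-Hasselbalch), pI by bisection, and GRAVY. The Kyte-Doolittle values are the published scale. The pKa values are an assumption, since the source does not state which set the tool uses, so results may differ slightly from the web app. All function and constant names are hypothetical.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Illustrative pKa values (assumption: not taken from the PeptiVerse code).
PKA_POSITIVE = {"K": 10.8, "R": 12.5, "H": 6.5}
PKA_NEGATIVE = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}
PKA_N_TERM = 8.6
PKA_C_TERM = 3.6


def net_charge(seq: str, ph: float = 7.0) -> float:
    """Net charge at a given pH from the Henderson-Hasselbalch equation."""
    # Free N-terminus (basic) and free C-terminus (acidic).
    charge = 1.0 / (1.0 + 10.0 ** (ph - PKA_N_TERM))
    charge -= 1.0 / (1.0 + 10.0 ** (PKA_C_TERM - ph))
    for aa in seq:
        if aa in PKA_POSITIVE:
            charge += 1.0 / (1.0 + 10.0 ** (ph - PKA_POSITIVE[aa]))
        elif aa in PKA_NEGATIVE:
            charge -= 1.0 / (1.0 + 10.0 ** (PKA_NEGATIVE[aa] - ph))
    return charge


def isoelectric_point(seq: str, tol: float = 0.01) -> float:
    """Find the pH where net charge is zero by bisection on [0, 14].

    Net charge decreases monotonically with pH, so the sign at the
    midpoint tells us which half contains the root. The loop stops
    once the bracket is narrower than tol.
    """
    lo, hi = 0.0, 14.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if net_charge(seq, mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def gravy(seq: str) -> float:
    """Grand average of hydropathy: mean Kyte-Doolittle value."""
    if not seq:
        raise ValueError("sequence must not be empty")
    try:
        return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)
    except KeyError as err:
        raise ValueError(f"non-standard residue: {err.args[0]}") from None


if __name__ == "__main__":
    peptide = "ACDKRH"  # arbitrary example sequence
    print("net charge at pH 7:", round(net_charge(peptide), 2))
    print("pI:", round(isoelectric_point(peptide), 2))
    print("GRAVY:", round(gravy(peptide), 2))
```

This sketch assumes free termini and a linear peptide of standard residues. Cyclized or chemically modified peptides supplied as SMILES would need a different treatment.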