### Input Requirements and Constraints
> Supported Inputs
- Amino acid sequences: Linear peptides composed of standard 20 amino acids
- SMILES: Chemically modified peptides, including cyclization, D-amino acids, and noncanonical residues
> Validation
- Invalid sequences or SMILES will be rejected
- Properties that are not supported for the given input type are labeled as (Not Supported)
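
The amino acid check can be sketched as below. This is a minimal illustration, not the tool's actual validator: the function name `validate_aa_sequence`, the uppercase normalization, and the error messages are assumptions. SMILES validation usually needs a cheminformatics toolkit and is not covered here.

```python
# Hypothetical helper: checks that a sequence uses only the 20 standard residues.
STANDARD_AA = frozenset("ACDEFGHIKLMNPQRSTVWY")


def validate_aa_sequence(seq: str) -> tuple[bool, str]:
    """Return (is_valid, message) for a linear peptide sequence."""
    cleaned = seq.strip().upper()  # assumption: input is case-insensitive
    if not cleaned:
        return False, "empty sequence"
    invalid = sorted(set(cleaned) - STANDARD_AA)
    if invalid:
        return False, "unsupported characters: " + ", ".join(invalid)
    return True, "ok"
```

For example, `validate_aa_sequence("ACXZ")` flags `X` and `Z`, since neither is one of the 20 standard residues.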
                
### Training Data Collection

**Data distribution.** Classification tasks report counts for class 0/1; regression tasks report total sample size (N). AA stands for amino acid-based sequences.

#### Classification (counts for class 0 / 1)

| Property | AA (0) | AA (1) | SMILES (0) | SMILES (1) |
|---|---:|---:|---:|---:|
| Hemolysis | 4765 | 1311 | 4765 | 1311 |
| Non-Fouling | 13580 | 3600 | 13580 | 3600 |
| Solubility | 9668 | 8785 | – | – |
| Permeability (Penetrance) | 1162 | 1162 | – | – |
| Toxicity | – | – | 5518 | 5518 |

#### Regression (total N)

| Property | AA (N) | SMILES (N) |
|---|---:|---:|
| Permeability (PAMPA) | – | 6869 |
| Permeability (CACO2) | – | 606 |
| Half-Life | 130 | 245 |
| Binding Affinity | 1436 | 1597 |


Our models are trained on curated datasets from multiple sources. For detailed data-cleaning procedures, please refer to our [paper](https://www.biorxiv.org/content/10.64898/2025.12.31.697180v1).

#### 🩸 Hemolysis Dataset
- **Primary Source:** [the Database of Antimicrobial Activity and Structure of Peptides (DBAASPv3)](https://academic.oup.com/nar/article-abstract/49/D1/D288/5957160)
- **Secondary Source:** [peptide-dashboard](https://pubs.acs.org/doi/full/10.1021/acs.jcim.2c01317)
- **Description:** Probability of peptide disrupting red blood cell membranes.
- **Interpretation:** HC50 is the concentration (x µg/mL) at which 50% of red blood cells are lysed. Peptides with HC50 < 100 µM are labeled hemolytic (1) and all others non-hemolytic (0), resulting in a binary 0/1 dataset. Scores close to 1 indicate a high probability of red blood cell membrane disruption, while scores close to 0 indicate low hemolytic risk. The predicted probability should therefore be interpreted as a risk indicator, not an exact concentration estimate.
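
The HC50 labeling rule can be sketched as follows. This is a minimal illustration that assumes HC50 values are already expressed in µM; `hemolysis_label` and its parameters are hypothetical names, not part of the PeptiVerse code.

```python
def hemolysis_label(hc50_um: float, threshold_um: float = 100.0) -> int:
    """Return 1 (hemolytic) if HC50 is below the threshold, otherwise 0."""
    if hc50_um <= 0:
        raise ValueError("HC50 must be a positive concentration")
    # A lower HC50 means less peptide is needed to lyse half of the cells.
    return 1 if hc50_um < threshold_um else 0
```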

#### 💧 Solubility Dataset
- **Primary Source:** [PROSO-II](https://febs.onlinelibrary.wiley.com/doi/abs/10.1111/j.1742-4658.2012.08603.x)
- **Secondary Source:** [peptideBERT](https://pubs.acs.org/doi/abs/10.1021/acs.jpclett.3c02398)
- **Description:** Probability of peptide remaining dissolved in aqueous conditions.
- **Interpretation:** Outputs a probability (0–1) that a peptide remains soluble in aqueous conditions. Higher scores indicate lower aggregation risk and better formulation stability.


#### 👯 Non-Fouling Dataset
- **Primary Source:** [Classifying antimicrobial and multifunctional peptides with Bayesian network models](https://doi.org/10.1002/pep2.24079)
- **Secondary Source:** [peptideBERT](https://pubs.acs.org/doi/abs/10.1021/acs.jpclett.3c02398)
- **Description:** A nonfouling peptide resists nonspecific interactions and protein adsorption.
- **Interpretation:** Outputs the probability (0–1) that a peptide resists nonspecific protein adsorption. Higher scores indicate stronger non-fouling behavior, desirable for circulation and surface-exposed applications.


#### 🪣 Permeability Dataset
- **Primary Source:** [CycPeptMPDB](https://pubs.acs.org/doi/abs/10.1021/acs.jcim.2c01573), [PAMPA](https://doi.org/10.1517/17425255.1.2.325)
- **Secondary Source:** [PepLand](https://arxiv.org/abs/2311.04419)
- **Description:** Probability of peptide penetrating the cell membrane.
- **Interpretation:** For PAMPA and Caco-2 regression, outputs are log-scaled permeability values. Following CycPeptMPDB conventions, log Pexp ≥ −6.0 indicates favorable permeability, while values below −6.0 indicate weak permeability. For penetrance prediction, a probability closer to 1 indicates a higher likelihood of cell penetrance, and *vice versa*.
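
The −6.0 cutoff can be applied to regression outputs as in the sketch below. The function name and the example predictions are made up for illustration; only the cutoff itself comes from the text above.

```python
def permeability_class(log_pexp: float, cutoff: float = -6.0) -> str:
    """Label a predicted log Pexp value using the -6.0 convention."""
    return "favorable" if log_pexp >= cutoff else "weak"


# Made-up predictions, for illustration only.
predictions = {"pep_a": -5.2, "pep_b": -6.0, "pep_c": -7.4}
labels = {name: permeability_class(value) for name, value in predictions.items()}
```

A value exactly at −6.0 counts as favorable, because the convention is stated as "≥ −6.0".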


#### ⏱️ Half-Life Dataset
- **Primary Source:** [Thpdb2](https://doi.org/10.1016/j.drudis.2024.104047), [PepTherDia](https://doi.org/10.1016/j.drudis.2021.02.019), [peplife](https://www.nature.com/articles/srep36617)
- **Interpretation:** Predicted values are half-lives in hours and reflect relative peptide stability. Higher values indicate longer persistence in serum, while lower values suggest faster degradation.

#### ☠️ Toxicity Dataset
- **Primary Source:** [ToxinPred3.0](https://www.sciencedirect.com/science/article/pii/S0010482524010114)
- **Interpretation:** Outputs a probability (0–1) that a peptide exhibits toxic effects. Higher scores indicate increased toxicity risk.


#### 🔗 Binding Affinity Dataset
- **Primary Source:** [PepLand](https://arxiv.org/abs/2311.04419)
- **Description:** The model predicts a continuous binding affinity score, where higher values indicate stronger binding. The training labels combine Kd, Ki, and IC50 measurements and were already normalized in PepLand. Scores are comparable across peptides binding to the same protein target.
- **Interpretation:**
    - Scores ≥ 9 correspond to tight binders (K ≤ 10⁻⁹ M, nanomolar to picomolar range)
    - Scores between 7 and 9 correspond to medium binders (10⁻⁷–10⁻⁹ M, nanomolar to micromolar range)
    - Scores < 7 correspond to weak binders (K ≥ 10⁻⁶ M, micromolar and weaker)
    - A difference of 1 unit in score corresponds to an approximately tenfold change in binding affinity.
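
The score ranges can be read as a negative log10 scale, which the sketch below makes explicit. The relation score = −log10(K in mol/L) is an assumption inferred from the ranges listed here and is not stated as the model's definition. All function names are hypothetical.

```python
import math


def score_to_molar(score: float) -> float:
    """Convert a score to a constant in mol/L, assuming score = -log10(K)."""
    return 10.0 ** (-score)


def molar_to_score(k_molar: float) -> float:
    """Inverse of score_to_molar under the same assumption."""
    if k_molar <= 0:
        raise ValueError("K must be positive")
    return -math.log10(k_molar)


def binder_class(score: float) -> str:
    """Bucket a score using the cutoffs listed above (9 and 7)."""
    if score >= 9:
        return "tight"
    if score >= 7:
        return "medium"
    return "weak"
```

Under this reading, moving from a score of 8 to 9 lowers K tenfold, which matches the one-unit rule.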

### Model Architecture

- **Sequence Embeddings:** [ESM-2 650M model](https://huggingface.co/facebook/esm2_t33_650M_UR50D) / [PeptideCLM model](https://huggingface.co/aaronfeller/PeptideCLM-23M-all). Foundational embeddings are frozen.
- **XGBoost Model:** Gradient boosting on pooled embedding features for efficient, high-performance prediction.
- **CNN/Transformer Model:** One-dimensional convolutional/self-attention transformer networks operating on unpooled embeddings to capture local sequence patterns.
- **Binding Model:** Transformer-based architecture with cross-attention between protein and peptide representations.
- **SVR Model:** Support Vector Regression applied to pooled embeddings, providing a kernel-based, nonparametric regression baseline that is robust on smaller or noisy datasets.
- **Others:** SVM and Elastic Nets were trained with [RAPIDS cuML](https://github.com/rapidsai/cuml), which requires a CUDA environment and is therefore not supported in the web app. Model checkpoints remain available in the Hugging Face repository.
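
To illustrate the difference between pooled and unpooled embeddings, the sketch below averages per-token vectors into one fixed-length vector. Mean pooling over non-padding tokens is an assumption here; the text above does not say which pooling the models use. Plain Python lists stand in for the tensors a real pipeline would use, and `mean_pool` is a hypothetical name.

```python
def mean_pool(token_embeddings, mask):
    """Average the token vectors whose mask entry is 1 (non-padding)."""
    kept = [vec for vec, keep in zip(token_embeddings, mask) if keep]
    if not kept:
        raise ValueError("mask excludes every token")
    dim = len(kept[0])
    # One averaged value per embedding dimension.
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Three tokens with 2-dimensional embeddings; the last token is padding.
pooled = mean_pool([[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]], [1, 1, 0])
```

The pooled vector has a fixed length regardless of peptide length, which suits tabular learners such as XGBoost or SVR. The CNN and transformer heads keep the per-token matrix instead.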

### Model Training and Weight Hosting
- More instructions can be found at [PeptiVerse](https://huggingface.co/ChatterjeeLab/PeptiVerse)

### 🧪 Physicochemical Properties
                
#### Net Charge Calculation
- Uses the Henderson-Hasselbalch equation
- pH-dependent calculation
- Considers all ionizable groups (K, R, H, D, E, C, Y, termini)
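
A minimal sketch of this calculation is shown below. The pKa values are illustrative assumptions, since the tool's actual pKa set is not stated here, and `net_charge` is a hypothetical name. Each ionizable group contributes a fractional charge given by the Henderson-Hasselbalch relation.

```python
# Illustrative pKa values (assumed; the tool's own set may differ).
POSITIVE_PKA = {"K": 10.8, "R": 12.5, "H": 6.5}
NEGATIVE_PKA = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}
N_TERM_PKA = 8.6
C_TERM_PKA = 3.6


def net_charge(seq: str, ph: float = 7.0) -> float:
    """Net charge of a linear peptide with free termini at a given pH."""
    # Basic groups carry +1 when protonated: fraction = 1 / (1 + 10^(pH - pKa)).
    charge = 1.0 / (1.0 + 10.0 ** (ph - N_TERM_PKA))
    # Acidic groups carry -1 when deprotonated: fraction = 1 / (1 + 10^(pKa - pH)).
    charge -= 1.0 / (1.0 + 10.0 ** (C_TERM_PKA - ph))
    for aa in seq.upper():
        if aa in POSITIVE_PKA:
            charge += 1.0 / (1.0 + 10.0 ** (ph - POSITIVE_PKA[aa]))
        elif aa in NEGATIVE_PKA:
            charge -= 1.0 / (1.0 + 10.0 ** (NEGATIVE_PKA[aa] - ph))
    return charge
```

The result depends noticeably on the chosen pKa set, so values from different tools can differ by a few tenths of a charge unit.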

#### Isoelectric Point (pI)
- Bisection method to find pH where net charge = 0
- Precision: ±0.01 pH units
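
The bisection search can be sketched as follows. A compact charge function with assumed, illustrative pKa values is included so the block runs on its own. The 0–14 search interval and all names are likewise assumptions rather than the tool's actual code.

```python
# Illustrative pKa values (assumed), including both termini.
PKA_POS = {"K": 10.8, "R": 12.5, "H": 6.5, "Nterm": 8.6}
PKA_NEG = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1, "Cterm": 3.6}


def net_charge(seq: str, ph: float) -> float:
    """Henderson-Hasselbalch net charge for a peptide with free termini."""
    charge = 0.0
    for group in ["Nterm", "Cterm"] + list(seq.upper()):
        if group in PKA_POS:
            charge += 1.0 / (1.0 + 10.0 ** (ph - PKA_POS[group]))
        elif group in PKA_NEG:
            charge -= 1.0 / (1.0 + 10.0 ** (PKA_NEG[group] - ph))
    return charge


def isoelectric_point(seq: str, tol: float = 0.01) -> float:
    """Bisect on pH until the bracketing interval is narrower than tol."""
    lo, hi = 0.0, 14.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if net_charge(seq, mid) > 0:
            lo = mid  # still positively charged: pI lies at higher pH
        else:
            hi = mid  # neutral or negative: pI lies at lower pH
    return (lo + hi) / 2.0
```

Bisection works here because net charge decreases monotonically with pH, so the interval always brackets the single zero crossing.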

#### Hydrophobicity (GRAVY)
- Grand Average of Hydropathy
- Uses Kyte-Doolittle scale
- Range: -4.5 (hydrophilic) to +4.5 (hydrophobic)
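
GRAVY is the mean Kyte-Doolittle hydropathy over all residues, which the sketch below computes. The function name `gravy` is hypothetical, and the sketch handles only the 20 standard amino acids.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(seq: str) -> float:
    """Grand average of hydropathy: sum of residue values / sequence length."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)
```

The extremes of the range come from homopolymers: an all-isoleucine peptide scores +4.5 and an all-arginine peptide scores −4.5.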

### Citation

If you use this tool, please cite:
```
@article {Zhang2025.12.31.697180,
	author = {Zhang, Yinuo and Tang, Sophia and Chen, Tong and Mahood, Elizabeth and Vincoff, Sophia and Chatterjee, Pranam},
	title = {PeptiVerse: A Unified Platform for Therapeutic Peptide Property Prediction},
	elocation-id = {2025.12.31.697180},
	year = {2026},
	doi = {10.64898/2025.12.31.697180},
	publisher = {Cold Spring Harbor Laboratory},
	URL = {https://www.biorxiv.org/content/early/2026/01/03/2025.12.31.697180},
	eprint = {https://www.biorxiv.org/content/early/2026/01/03/2025.12.31.697180.full.pdf},
	journal = {bioRxiv}
}
```

### Contact

For questions or collaborations: [yzhang@u.duke.nus.edu](mailto:yzhang@u.duke.nus.edu) or [pranam@seas.upenn.edu](mailto:pranam@seas.upenn.edu)