---
license: apache-2.0
---

![Overview of PeptiVerse](peptiverse-cover.png)

# PeptiVerse: A Unified Platform for Therapeutic Peptide Property Prediction 🧬🌌

This is the repository for [PeptiVerse: A Unified Platform for Therapeutic Peptide Property Prediction](https://www.biorxiv.org/content/10.64898/2025.12.31.697180), a collection of machine learning predictors for canonical and non-canonical peptide property prediction using sequence and SMILES representations. 🧬 PeptiVerse 🌌 enables evaluation of key biophysical and therapeutic properties of peptides for property-optimized generation.

## Table of Contents

- [Quick start](#quick-start)
- [Installation](#installation)
- [Repository Structure](#repository-structure)
- [Training data collection](#training-data-collection)
- [Best model list](#best-model-list)
   - [Full model set (cuML-enabled)](#full-model-set-cuml-enabled)
   - [Minimal deployable model set (no cuML)](#minimal-deployable-model-set-no-cuml)
- [Usage](#usage)
   - [Local Application Hosting](#local-application-hosting)
   - [Dataset integration](#dataset-integration)
   - [Quick inference by property per model](#quick-inference-by-property-per-model)
- [Property Interpretations](#property-interpretations)
- [Model Architecture](#model-architecture)
- [Troubleshooting](#troubleshooting)
- [Citation](#citation)

## Quick Start

```bash
# Clone repository
git clone https://huggingface.co/ChatterjeeLab/PeptiVerse

# Install dependencies
pip install -r requirements.txt

# Run inference
python inference.py
```
## Installation
### Minimal Setup
- Easy start-up environment (covers the transformer and XGBoost models)
```bash
pip install -r requirements.txt
```
### Full Setup
- Access to the trained SVM and Elastic Net models additionally requires `RAPIDS cuML`; installation instructions are available on the official [GitHub page](https://github.com/rapidsai/cuml) (**CUDA-capable GPU required**).
- Optional: a pre-built Singularity/Apptainer environment (7.52 GB) is available on [Google Drive](https://drive.google.com/file/d/1RJQ9HK0_gsPOhRo5H5ZmH_MYcpJqQD7e/view?usp=sharing) with everything you need (a CUDA-capable GPU is still required to load the cuML models).
    ```bash
    # test
    apptainer exec peptiverse.sif python -c "import sys; print(sys.executable)"

    # run inference (see below)
    apptainer exec peptiverse.sif python inference.py
    ```
## Repository Structure
This repo contains the large model and data files for [PeptiVerse](https://huggingface.co/spaces/ChatterjeeLab/PeptiVerse), an interactive app for peptide property prediction. [Paper link.](https://www.biorxiv.org/content/10.64898/2025.12.31.697180v1)

```
PeptiVerse/
├── training_data_cleaned/     # Processed datasets with embeddings
│   └── <property>/            # Property-specific data
│       ├── train/val splits
│       └── precomputed embeddings
├── training_classifiers/      # Trained model weights
│   └── <property>/
│       ├── cnn_wt/            # CNN architectures
│       ├── mlp_wt/            # MLP architectures
│       └── xgb_wt/            # XGBoost models
├── tokenizer/                 # PeptideCLM tokenizer
├── training_data/             # Raw training data
├── inference.py               # Main prediction interface
├── best_models.txt            # Model selection manifest
└── requirements.txt           # Python dependencies
```

## Training Data Collection

<table>
  <caption><strong>Data distribution.</strong> Classification tasks report counts for class 0/1; regression tasks report total sample size (N).</caption>
  <thead>
    <tr>
      <th rowspan="2"><strong>Properties</strong></th>
      <th colspan="2"><strong>Amino Acid Sequences</strong></th>
      <th colspan="2"><strong>SMILES Sequences</strong></th>
    </tr>
    <tr>
      <th><strong>0</strong></th>
      <th><strong>1</strong></th>
      <th><strong>0</strong></th>
      <th><strong>1</strong></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td colspan="5"><strong>Classification</strong></td>
    </tr>
    <tr>
      <td>Hemolysis</td>
      <td>4765</td>
      <td>1311</td>
      <td>4765</td>
      <td>1311</td>
    </tr>
    <tr>
      <td>Non-Fouling</td>
      <td>13580</td>
      <td>3600</td>
      <td>13580</td>
      <td>3600</td>
    </tr>
    <tr>
      <td>Solubility</td>
      <td>9668</td>
      <td>8785</td>
      <td>-</td>
      <td>-</td>
    </tr>
    <tr>
      <td>Permeability (Penetrance)</td>
      <td>1162</td>
      <td>1162</td>
      <td>-</td>
      <td>-</td>
    </tr>
    <tr>
      <td>Toxicity</td>
      <td>-</td>
      <td>-</td>
      <td>5518</td>
      <td>5518</td>
    </tr>
    <tr>
      <td colspan="5"><strong>Regression (N)</strong></td>
    </tr>
    <tr>
      <td>Permeability (PAMPA)</td>
      <td colspan="2" align="center">-</td>
      <td colspan="2" align="center">6869</td>
    </tr>
    <tr>
      <td>Permeability (CACO2)</td>
      <td colspan="2" align="center">-</td>
      <td colspan="2" align="center">606</td>
    </tr>
    <tr>
      <td>Half-Life</td>
      <td colspan="2" align="center">130</td>
      <td colspan="2" align="center">245</td>
    </tr>
    <tr>
      <td>Binding Affinity</td>
      <td colspan="2" align="center">1436</td>
      <td colspan="2" align="center">1597</td>
    </tr>
  </tbody>
</table>


## Best Model List

### Full model set (cuML-enabled)
| Property                    | Best Model (Sequence) | Best Model (SMILES) | Task Type   | Threshold (Sequence) | Threshold (SMILES) |
|----------------------------|-----------------|---------------------|-------------|----------------|--------------------|
| Hemolysis                  | SVM             | Transformer         | Classifier  | 0.2521         | 0.4343             |
| Non-Fouling                | MLP             | ENET                | Classifier  | 0.57           | 0.6969             |
| Solubility                 | CNN             | –                   | Classifier  | 0.377          | –                  |
| Permeability (Penetrance)  | SVM             | –                   | Classifier  | 0.5493         | –                  |
| Toxicity                   | –               | Transformer         | Classifier  | –              | 0.3401             |
| Binding Affinity           | unpooled        | unpooled            | Regression  | –              | –                  |
| Permeability (PAMPA)       | –               | CNN                 | Regression  | –              | –                  |
| Permeability (Caco-2)      | –               | SVR                 | Regression  | –              | –                  |
| Half-life                  | Transformer     | XGB                 | Regression  | –              | –                  |
>Note: *unpooled* indicates models operating on token-level embeddings with cross-attention, rather than mean-pooled representations.

### Minimal deployable model set (no cuML)
| Property                    | Best Model (WT) | Best Model (SMILES) | Task Type   | Threshold (WT) | Threshold (SMILES) |
|----------------------------|-----------------|---------------------|-------------|----------------|--------------------|
| Hemolysis                  | XGB             | Transformer         | Classifier  | 0.2801         | 0.4343             |
| Non-Fouling                | MLP             | XGB                 | Classifier  | 0.57           | 0.3982             |
| Solubility                 | CNN             | –                   | Classifier  | 0.377          | –                  |
| Permeability (Penetrance)  | XGB             | –                   | Classifier  | 0.4301         | –                  |
| Toxicity                   | –               | Transformer         | Classifier  | –              | 0.3401             |
| Binding Affinity           | unpooled        | unpooled            | Regression  | –              | –                  |
| Permeability (PAMPA)       | –               | CNN                 | Regression  | –              | –                  |
| Permeability (Caco-2)      | –               | SVR                 | Regression  | –              | –                  |
| Half-life                  | xgb_wt_log      | xgb_smiles          | Regression  | –              | –                  |

>Note: Models marked as SVM or ENET in the full set are replaced with XGB here, since those models are not supported in the deployment environment without a cuML setup. *xgb_wt_log* indicates that the half-life (time) target was log-transformed during training.
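
The thresholds above are applied to the raw classifier score to produce the 0/1 label in the prediction output. Below is a minimal sketch of that decision rule, assuming a label of 1 when the score meets or exceeds the threshold; the helper and dictionary are illustrative and not part of `inference.py`.

```python
# Illustrative thresholds for sequence ("wt") classifiers, taken from the minimal-set table above
THRESHOLDS_WT = {
    "hemolysis": 0.2801,
    "nf": 0.57,
    "solubility": 0.377,
    "permeability_penetrance": 0.4301,
}

def score_to_label(prop: str, score: float) -> int:
    """Return 1 if the predicted probability meets or exceeds the property threshold."""
    return int(score >= THRESHOLDS_WT[prop])

print(score_to_label("hemolysis", 0.35))  # 1 -> flagged as hemolytic
```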


## Usage

### Local Application Hosting
- Host the [PeptiVerse UI](https://huggingface.co/spaces/ChatterjeeLab/PeptiVerse) locally with your own resources. 
```bash
git clone https://huggingface.co/spaces/ChatterjeeLab/PeptiVerse
cd PeptiVerse

# Configure models in best_models.txt, then launch the app
python app.py
```
### Dataset integration
- Every property is provided as raw data, split-ready CSVs, and [Hugging Face datasets](https://huggingface.co/docs/datasets/en/index).
- Selectively download only the data you need with `huggingface-cli`:
```bash
# Download only the cleaned training data, skipping model weights/artifacts;
# make real copies instead of symlinks
huggingface-cli download ChatterjeeLab/PeptiVerse \
  --include "training_data_cleaned/**" \
  --exclude "**/*.pt" "**/*.joblib" \
  --local-dir PeptiVerse_partial \
  --local-dir-use-symlinks False
```
- Or in Python:
```python
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="ChatterjeeLab/PeptiVerse",
    allow_patterns=["training_data_cleaned/**"],     # only this folder
    ignore_patterns=["**/*.pt", "**/*.joblib"],     # skip weights/artifacts
    local_dir="PeptiVerse_partial",
    local_dir_use_symlinks=False,                   # make real copies
)
print("Downloaded to:", local_dir)
```
- Using the Hugging Face datasets (with pre-computed embeddings and splits); see also the feature-matrix sketch after the schema list below.
    - All embedding datasets are saved via `DatasetDict.save_to_disk` and loadable with:
    ```python
    from datasets import load_from_disk
    ds = load_from_disk(PATH)
    train_ds = ds["train"]
    val_ds = ds["val"]
    ```
- A) Sequence-based ([ESM-2](https://huggingface.co/facebook/esm2_t33_650M_UR50D) embeddings)
    - Pooled (fixed-length vector per sequence)
        - Generated by mean-pooling token embeddings excluding special tokens (CLS/EOS) and padding.
        - Each item:
            sequence: `str`;
            label: `int` (classification) or `float` (regression);
            embedding: `float32[H]` (H=1280 for ESM-2 650M);
    - Unpooled (variable-length token matrix)
        - Generated by keeping all valid token embeddings (excluding special tokens + padding) as a per-sequence matrix.
        - Each item:
            sequence: `str`;
            label: `int` (classification) or `float` (regression);
            embedding: `float16[L, H]` (nested lists);
            attention_mask: `int8[L]`;
            length: `int` (=L);
- B) SMILES-based ([PeptideCLM](https://github.com/AaronFeller/PeptideCLM) embeddings)
    - Pooled (fixed-length vector per sequence)
        - Generated by mean-pooling token embeddings excluding special tokens (CLS/EOS) and padding.
        - Each item:
            sequence: `str` (SMILES);
            label: `int` (classification) or `float` (regression);
            embedding: `float32[H]`;
    - Unpooled (variable-length token matrix)
        - Generated by keeping all valid token embeddings (excluding special tokens + padding) as a per-sequence matrix.
        - Each item:
            sequence: `str` (SMILES);
            label: `int` (classification) or `float` (regression);
            embedding: `float16[L, H]` (nested lists);
            attention_mask: `int8[L]`;
            length: `int` (=L);
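
For example, the pooled datasets can be loaded straight into fixed-size feature matrices for the XGBoost/MLP heads. A minimal sketch under the schema described above (the dataset path is a placeholder):

```python
import numpy as np
from datasets import load_from_disk

# Placeholder path to a pooled embedding dataset saved with DatasetDict.save_to_disk
ds = load_from_disk("training_data_cleaned/<property>/pooled")

# Stack the fixed-length embeddings into feature matrices and label vectors
X_train = np.asarray(ds["train"]["embedding"], dtype=np.float32)  # shape (N, H)
y_train = np.asarray(ds["train"]["label"])
X_val = np.asarray(ds["val"]["embedding"], dtype=np.float32)
y_val = np.asarray(ds["val"]["label"])

print(X_train.shape, y_train.shape)
```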
    

### Quick Inference By Property Per Model
```python
from inference import PeptiVersePredictor

pred = PeptiVersePredictor(
    manifest_path="best_models.txt",          # best model list
    classifier_weight_root=".",               # repo root (where training_classifiers/ lives)
    device="cuda",                            # or "cpu"
)

# mode: smiles (SMILES-based models) / wt (Sequence-based models) 
# property keys (with some level of name normalization)
# hemolysis
# nf (Non-Fouling)
# solubility
# permeability_penetrance
# toxicity
# permeability_pampa
# permeability_caco2
# halflife
# binding_affinity

seq = "GIVEQCCTSICSLYQLENYCN"
smiles = "CC(C)C[C@@H]1NC(=O)[C@@H](CC(C)C)N(C)C(=O)[C@@H](C)N(C)C(=O)[C@H](Cc2ccccc2)NC(=O)[C@H](CC(C)C)N(C)C(=O)[C@H]2CCCN2C1=O"

# Hemolysis
out = pred.predict_property("hemolysis", mode="wt", input_str=seq)
print(out)
# {"property":"hemolysis","mode":"wt","score":prob,"label":0/1,"threshold":...}

out = pred.predict_property("hemolysis", mode="smiles", input_str=smiles)
print(out)

# Non-fouling (key is nf)
out = pred.predict_property("nf", mode="wt", input_str=seq)
print(out)

out = pred.predict_property("nf", mode="smiles", input_str=smiles)
print(out)

# Solubility (Sequence-only)
out = pred.predict_property("solubility", mode="wt", input_str=seq)
print(out)

# Permeability (Penetrance) (Sequence-only)
out = pred.predict_property("permeability_penetrance", mode="wt", input_str=seq)
print(out)

# Toxicity (SMILES-only)
out = pred.predict_property("toxicity", mode="smiles", input_str=smiles)
print(out)

# Permeability (PAMPA) (SMILES regression)
out = pred.predict_property("permeability_pampa", mode="smiles", input_str=smiles)
print(out)
# {"property":"permeability_pampa","mode":"smiles","score":value}

# Permeability (Caco-2) (SMILES regression)
out = pred.predict_property("permeability_caco2", mode="smiles", input_str=smiles)
print(out)

# Half-life (sequence-based + SMILES regression)
out = pred.predict_property("halflife", mode="wt", input_str=seq)
print(out)

out = pred.predict_property("halflife", mode="smiles", input_str=smiles)
print(out)

# Binding Affinity
protein = "MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQV..."  # target protein
peptide_seq = "GIVEQCCTSICSLYQLENYCN"

out = pred.predict_binding_affinity(
    mode="wt",
    target_seq=protein,
    binder_str=peptide_seq,
)
print(out)
# {
#   "property":"binding_affinity",
#   "mode":"wt",
#   "affinity": float,
#   "class_by_threshold": "High (β‰₯9)" / "Moderate (7-9)" / "Low (<7)",
#   "class_by_logits": same buckets,
#   "binding_model": "pooled" or "unpooled",
# }

```

## Property Interpretations

The same descriptions are available in the paper and in the PeptiVerse app's `Documentation` tab.

---
#### 🩸 Hemolysis Prediction
Labels are derived from HC50, the concentration at which 50% of red blood cells are lysed. Peptides with HC50 < 100 µM are considered hemolytic and all others non-hemolytic, resulting in a binary 0/1 dataset. The predicted probability should therefore be interpreted as a risk indicator, not an exact concentration estimate. <br>

**Output interpretation:**<br>

- Score close to 1.0 = high probability of red blood cell membrane disruption<br>
- Score close to 0.0 = non-hemolytic
---

#### 💧 Solubility Prediction
Outputs a probability (0–1) that a peptide remains soluble in aqueous conditions.<br>

**Output interpretation:**<br>

- Score close to 1.0 = highly soluble<br>
- Score close to 0.0 = poorly soluble<br>
---

#### 👯 Non-Fouling Prediction
Higher scores indicate stronger non-fouling behavior, desirable for circulation and surface-exposed applications.<br>

**Output interpretation:**<br>
- Score close to 1.0 = non-fouling<br>
- Score close to 0.0 = fouling<br>

---

#### 🪣 Permeability Prediction
Predicts membrane permeability on a log P scale.<br>

**Output interpretation:**<br>
- Higher values = more permeable (>-6.0)<br>
- Penetrance is a classification task: the score lies in the [0, 1] range, and values closer to 1 indicate higher predicted permeability.<br>

---

#### ⏱️ Half-Life Prediction
**Interpretation:** Predicted values reflect relative peptide stability, reported in hours. Higher scores indicate longer persistence in serum, while lower scores suggest faster degradation.

---

#### ☠️ Toxicity Prediction
**Interpretation:** Outputs a probability (0–1) that a peptide exhibits toxic effects. Higher scores indicate increased toxicity risk.

---

#### 🔗 Binding Affinity Prediction

Predicts peptide-protein binding affinity. Requires both peptide and target protein sequence.<br>

**Interpretation:**<br>
    - Scores ≥ 9 correspond to tight binders (K ≤ 10⁻⁹ M, nanomolar to picomolar range)<br>
    - Scores between 7 and 9 correspond to medium binders (10⁻⁷–10⁻⁹ M, nanomolar to micromolar range)<br>
    - Scores < 7 correspond to weak binders (K ≥ 10⁻⁶ M, micromolar and weaker)<br>
    - A difference of 1 unit in score corresponds to an approximately tenfold change in binding affinity.<br>
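
Since one score unit corresponds to roughly a tenfold change in affinity, the score can be read as approximately -log10 of the dissociation constant. A small illustrative conversion; the helper below is not part of the repository:

```python
def score_to_kd_molar(score: float) -> float:
    """Approximate dissociation constant (M) implied by an affinity score,
    treating the score as roughly -log10(K)."""
    return 10.0 ** (-score)

for s in (9.5, 8.0, 6.5):  # tight / moderate / weak examples
    print(f"score {s} -> K ~ {score_to_kd_molar(s):.1e} M")
# score 9.5 -> K ~ 3.2e-10 M
# score 8.0 -> K ~ 1.0e-08 M
# score 6.5 -> K ~ 3.2e-07 M
```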


## Model Architecture

- **Sequence Embeddings:** [ESM-2 650M model](https://huggingface.co/facebook/esm2_t33_650M_UR50D) / [PeptideCLM model](https://huggingface.co/aaronfeller/PeptideCLM-23M-all). Foundational embeddings are frozen.
- **XGBoost Model:** Gradient boosting on pooled embedding features for efficient, high-performance prediction.
- **CNN/Transformer Model:** One-dimensional convolutional/self-attention transformer networks operating on unpooled embeddings to capture local sequence patterns.
- **Binding Model:** Transformer-based architecture with cross-attention between protein and peptide representations.
- **SVR Model:** Support Vector Regression applied to pooled embeddings, providing a kernel-based, nonparametric regression baseline that is robust on smaller or noisy datasets.
- **Others:** SVM and Elastic Nets were trained with [RAPIDS cuML](https://github.com/rapidsai/cuml), which requires a CUDA environment and is therefore not supported in the web app. Model checkpoints remain available in the Hugging Face repository.
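
As a concrete illustration of the pooled sequence embeddings, here is a minimal sketch of masked mean-pooling with ESM-2 via `transformers`, excluding special tokens and padding as described in the dataset section. It mirrors that approach but is not the repository's exact preprocessing code.

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "facebook/esm2_t33_650M_UR50D"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name).eval()

seq = "GIVEQCCTSICSLYQLENYCN"
inputs = tokenizer(seq, return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (1, L, 1280)

# Mask out padding and special tokens (CLS/EOS) before averaging
special_ids = torch.tensor(tokenizer.all_special_ids)
is_special = torch.isin(inputs["input_ids"], special_ids)
mask = (inputs["attention_mask"].bool() & ~is_special).float()  # (1, L)

pooled = (hidden * mask.unsqueeze(-1)).sum(1) / mask.sum(1, keepdim=True)
print(pooled.shape)  # torch.Size([1, 1280])
```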

## Troubleshooting

### LFS Download Issues

If downloaded files appear as Git LFS pointer files instead of actual content:

```bash
huggingface-cli download ChatterjeeLab/PeptiVerse \
    training_data_cleaned/hemolysis/hemo_smiles_meta_with_split.csv \
    --local-dir . \
    --local-dir-use-symlinks False
```
### TODOs
- There is currently a bug when loading the transformer half-life model; a fix is planned.

## Citation

If you find this repository helpful for your publications, please consider citing our paper:

```
@article {zhang2025peptiverse,
	author = {Zhang, Yinuo and Tang, Sophia and Chen, Tong and Mahood, Elizabeth and Vincoff, Sophia and Chatterjee, Pranam},
	title = {PeptiVerse: A Unified Platform for Therapeutic Peptide Property Prediction},
	year = {2026},
	doi = {10.64898/2025.12.31.697180},
	URL = {https://doi.org/10.64898/2025.12.31.697180},
	journal = {bioRxiv}
}
```
To use this repository, you agree to abide by the Apache 2.0 license.