---
license: mit
language:
- en
library_name: pytorch
tags:
- biology
- proteins
- peptides
- binding-site-prediction
- esm3
- kan
- protein-language-model
- drug-design
pipeline_tag: token-classification
---
# GeoPep
**Geometric-aware Peptide-Protein Binding Site Prediction**
GeoPep is a per-residue binding site predictor that combines the ESM3 protein
foundation model with Kolmogorov-Arnold Network (KAN) heads. Given a peptide
and a protein, it predicts which residues of the protein bind the peptide and
which residues of the peptide make contact.
- 📦 Code: https://github.com/Dian0212/GeoPep
- 📄 Checkpoint: `model_distanceLoss.ckpt` (~16 GB)
## Model Summary
| | |
|---|---|
| **Backbone** | ESM3 (sm-open-v1, 1.4B params) |
| **Head** | 5 stacked FastKAN layers, 1536 → 1153 → 770 → 387 → 3 → 3 |
| **Granularity** | Per-residue (one prediction per amino acid) |
| **Classes** | 3 — `0`: non-interface, `1`: interface (binding), `2`: padding |
| **Input** | Peptide (≤50 residues) `\|` Protein (≤500 residues), 551 tokens total (50 + 1 separator + 500) |
| **Output** | Logits `[B, 3, 551]` → softmax binding probability per residue |
| **Training loss** | Weighted cross-entropy + differentiable geometric distance loss |
## How to Use
### 1. Install the GeoPep code
```bash
git clone https://github.com/Dian0212/GeoPep.git
cd GeoPep
conda create -n geopep python=3.10 -y
conda activate geopep
pip install -r requirements.txt
```
### 2. Download the checkpoint from this repo
```bash
huggingface-cli download dchenqwer/GeoPep model_distanceLoss.ckpt \
--local-dir model_weights/ --local-dir-use-symlinks False
```
This places the file at `model_weights/model_distanceLoss.ckpt`.
### 3. Run inference on your PDBs
Put PDB files (named `<PDBID>_<peptide_chain>_<protein_chain>.pdb`) into a
folder, then:
```bash
cd scripts
python inference_pipeline.py \
--pdb-dir /path/to/pdb \
--checkpoint ../model_weights/model_distanceLoss.ckpt
```
The script writes per-residue binding probabilities to `result/predictions.json`.
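
Before launching a long run, it can help to check that every file name matches the `<PDBID>_<peptide_chain>_<protein_chain>.pdb` convention. The snippet below is a minimal sketch of such a check. `ComplexId` and `parse_complex_filename` are hypothetical helpers written for this example and are not part of the GeoPep code, which may validate names differently.

```python
from pathlib import Path
from typing import NamedTuple


class ComplexId(NamedTuple):
    """Hypothetical container for the three fields encoded in the file name."""
    pdb_id: str
    peptide_chain: str
    protein_chain: str


def parse_complex_filename(path: str) -> ComplexId:
    """Split '<PDBID>_<peptide_chain>_<protein_chain>.pdb' into its parts."""
    p = Path(path)
    if p.suffix.lower() != ".pdb":
        raise ValueError(f"not a .pdb file: {p.name}")
    parts = p.stem.split("_")
    # Exactly three non-empty fields are expected.
    if len(parts) != 3 or not all(parts):
        raise ValueError(
            f"expected <PDBID>_<peptide_chain>_<protein_chain>.pdb, got {p.name}"
        )
    return ComplexId(*parts)


print(parse_complex_filename("/path/to/pdb/1a1r_C_A.pdb"))
```

Files that raise `ValueError` here would need renaming before they are passed to `inference_pipeline.py`.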
### 4. Result format
```json
{
"1a1r_C_A": {
"peptide_chain": "GSVVIVGRIVLSGKPA",
"protein_chain": "VEGEVQIVSTATQTFLAT...",
"peptide_bindingProbability": "0.99 0.99 0.99 ...",
"protein_bindingProbability": "0.35 0.97 0.12 ..."
}
}
```
- `peptide_chain` / `protein_chain`: the actual residue sequences (padding stripped).
- `*_bindingProbability`: a space-separated string of floats rounded to two
decimals, one per residue. Each value is the raw class-1 (interface) softmax
probability.
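
Because the probabilities are stored as a space-separated string, they need to be split and converted before use. The following is a minimal sketch of reading one entry and listing residues above a cutoff. `interface_residues` is a hypothetical helper, the toy entry uses made-up values, and the `0.5` threshold is an arbitrary assumption rather than a recommended setting.

```python
import json


def interface_residues(entry: dict, chain: str = "protein", threshold: float = 0.5):
    """Return (index, residue, probability) for residues at or above threshold."""
    sequence = entry[f"{chain}_chain"]
    probs = [float(x) for x in entry[f"{chain}_bindingProbability"].split()]
    if len(probs) != len(sequence):
        raise ValueError("expected one probability per residue")
    return [
        (i, aa, p)
        for i, (aa, p) in enumerate(zip(sequence, probs))
        if p >= threshold
    ]


# Toy entry in the documented format (values are made up for illustration).
example = json.loads("""
{
  "toy_P_R": {
    "peptide_chain": "GSV",
    "protein_chain": "VEGE",
    "peptide_bindingProbability": "0.99 0.40 0.75",
    "protein_bindingProbability": "0.35 0.97 0.12 0.50"
  }
}
""")

for name, entry in example.items():
    print(name, interface_residues(entry, "protein"))
```

For real output, replace the toy dictionary with the contents of `result/predictions.json` loaded through `json.load`.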
## Architecture
```
input tokens [B, 553] BOS + 551 residues + EOS
│
ESM3 encoder
↓ embeddings[:, 1:552, :] drop BOS / EOS
[B, 551, 1536] per-residue embeddings
↓ reshape
[B * 551, 1536] each residue independent
↓
KAN_1: 1536 → 1153
KAN_2: 1153 → 770
KAN_3: 770 → 387
KAN_4: 387 → 3
KAN_5: 3 → 3
↓ reshape + permute
logits [B, 3, 551]
↓ softmax(dim=1)
binding probability = softmax[:, 1, :]
```
The per-residue head is what distinguishes GeoPep from the original
CLS-token approach: every residue's embedding passes through the KAN stack
independently, so each position receives its own prediction instead of
sharing a single sequence-level representation.
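
The last two steps of the diagram, softmax over the class dimension followed by selecting class 1, can be illustrated for individual residues in plain Python. This is only a sketch of the arithmetic: the logits are hypothetical, and the model itself applies the same operation to a `[B, 3, 551]` tensor.

```python
import math


def binding_probability(logits):
    """Softmax over the 3 class logits of one residue; return the class-1 share."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    return exps[1] / sum(exps)


# Hypothetical logits for three residues,
# ordered [non-interface, interface, padding].
per_residue_logits = [
    [2.0, 0.0, -5.0],
    [0.0, 3.0, -5.0],
    [1.0, 1.0, -5.0],
]
print([round(binding_probability(l), 2) for l in per_residue_logits])
```

Because the padding class takes part in the softmax, the class-0 and class-1 probabilities of a residue do not have to sum exactly to one.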
## Training
- **Dataset**: peptide–protein complexes from the PDB, stored as paired PDB
files: `complex/` (full structure) and `interface/` (binding residues only).
- **Length filters**: peptide ∈ [10, 50] residues, protein ∈ [10, 500] residues.
- **Loss**:
- Per-half cross-entropy with class weights `[0.2, 0.8, 0.0]`
(padding contributes zero gradient).
- Differentiable distance loss
`L_dist = Σᵢ P_binding(i) · dist(i) / num_valid_residues`,
which penalizes high binding probability at residues far from the true
interface (distance computed from residue centers of mass).
- **Backbone**: the ESM3 backbone is fine-tuned jointly with the KAN head.
- **Optimizer**: AdamW, learning rate 1e-4, weight decay 1e-4.
- **Mixed precision**: FP16.
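
The distance loss listed above can be worked through on a few residues to see how it behaves. The sketch below evaluates the formula in plain Python under two assumptions that the text does not spell out: `dist(i)` is taken as the distance from residue `i` to the nearest true interface residue (zero for interface residues themselves), and `valid` marks non-padding positions. Training would use tensor operations so that gradients flow through `P_binding`; this version only shows the arithmetic.

```python
def distance_loss(p_binding, dist, valid):
    """L_dist = sum_i P_binding(i) * dist(i) / num_valid_residues."""
    num_valid = sum(valid)
    if num_valid == 0:
        return 0.0
    # Padding positions (valid == False) contribute nothing to the sum.
    total = sum(p * d for p, d, v in zip(p_binding, dist, valid) if v)
    return total / num_valid


# Hypothetical values: four positions, the last one is padding.
p_binding = [0.9, 0.1, 0.8, 0.5]
dist = [0.0, 12.0, 6.0, 3.0]  # assumed distance to the true interface
valid = [True, True, True, False]

print(distance_loss(p_binding, dist, valid))
```

In this toy case the first residue is confidently predicted and sits on the interface, so it adds nothing. Most of the penalty comes from the third residue, which has a high binding probability while lying some distance from the interface.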
## Input Format
The model expects a single concatenated sequence of 551 tokens (50 peptide
slots, 1 separator, 500 protein slots):
```
PEPTIDESEQ<pad><pad>...|PROTEINSEQ<pad><pad>...
|<------- 50 ------->|<-------- 500 ----------->|
```
| Position | Content |
|---|---|
| 0 .. 49 | Peptide (right-padded with `<pad>` to 50) |
| 50 | Separator `\|` |
| 51 .. 550 | Protein (right-padded with `<pad>` to 500) |
The `inference_pipeline.py` script handles padding and tokenization for you —
you only provide raw PDB files.
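
For readers who want to see the layout concretely, here is a minimal sketch that builds the 551-slot token list from two sequences. It assumes the pads follow each sequence, as drawn in the diagram above, and that each residue occupies one slot. `build_input_tokens` is a hypothetical helper and is not the function used inside `inference_pipeline.py`.

```python
PAD = "<pad>"
SEPARATOR = "|"
PEPTIDE_SLOTS, PROTEIN_SLOTS = 50, 500


def build_input_tokens(peptide: str, protein: str) -> list:
    """Lay out peptide and protein in the fixed 50 + 1 + 500 slot format."""
    if not 0 < len(peptide) <= PEPTIDE_SLOTS:
        raise ValueError("peptide must have 1 to 50 residues")
    if not 0 < len(protein) <= PROTEIN_SLOTS:
        raise ValueError("protein must have 1 to 500 residues")
    # Assumption: pads are appended after each sequence.
    peptide_part = list(peptide) + [PAD] * (PEPTIDE_SLOTS - len(peptide))
    protein_part = list(protein) + [PAD] * (PROTEIN_SLOTS - len(protein))
    return peptide_part + [SEPARATOR] + protein_part


tokens = build_input_tokens("GSVVIVGRIVLSGKPA", "VEGEVQIVSTATQTFLAT")
print(len(tokens), tokens[:3], tokens[50], tokens[51:54])
```

With this layout the separator always sits at index 50 and the protein always starts at index 51, which is what lets the per-residue outputs be split back into peptide and protein halves.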
## Limitations
- **Sequence-length cap**: peptide must be ≤ 50 residues, protein ≤ 500
residues. Sequences longer than this are silently skipped at the
preprocessing stage.
- **Requires per-PDB chain annotation**: filenames must follow the
`<PDBID>_<peptide_chain>_<protein_chain>.pdb` convention so the pipeline
knows which chain is the peptide.
- **No uncertainty calibration**: raw softmax probabilities are not calibrated;
use them as relative scores rather than absolute confidences.
- **Single peptide-protein pair per call**: multi-peptide or multi-chain
contexts are not supported in the standard pipeline.
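
Since over-length inputs are skipped without a message, a quick pre-check can show which files would be dropped. The sketch below counts residues per chain from the `ATOM` records of a PDB file, using the fixed-column layout of the PDB format (chain ID in column 22, residue number and insertion code in columns 23 to 27). `chain_lengths` and `within_limits` are hypothetical helpers. The pipeline may count residues differently, for example for modified residues stored as `HETATM`, so treat the result as an estimate.

```python
from collections import defaultdict

MAX_PEPTIDE, MAX_PROTEIN = 50, 500


def chain_lengths(pdb_text: str) -> dict:
    """Count distinct residues per chain from ATOM records."""
    residues = defaultdict(set)
    for line in pdb_text.splitlines():
        if line.startswith("ATOM"):
            chain_id = line[21]        # column 22
            residue_key = line[22:27]  # residue number + insertion code
            residues[chain_id].add(residue_key)
    return {chain: len(keys) for chain, keys in residues.items()}


def within_limits(pdb_text: str, peptide_chain: str, protein_chain: str) -> bool:
    """True if both chains exist and fit the documented length caps."""
    lengths = chain_lengths(pdb_text)
    return (0 < lengths.get(peptide_chain, 0) <= MAX_PEPTIDE
            and 0 < lengths.get(protein_chain, 0) <= MAX_PROTEIN)


# Toy structure: chain C with 3 residues, chain A with 5 (one CA atom each).
toy = "\n".join(
    f"ATOM  {i:>5}  CA  GLY {chain}{i:>4}"
    for chain, n in (("C", 3), ("A", 5))
    for i in range(1, n + 1)
)
print(chain_lengths(toy), within_limits(toy, "C", "A"))
```

For a real file, read its text with `open(path).read()` and pass the chain IDs taken from the file name.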
## Citation
If you use GeoPep in your work, please cite:
```bibtex
@misc{geopep2026,
title = {GeoPep: Geometric-aware Peptide-Protein Binding Site Prediction},
author = {Chen, Dian},
year = {2026},
howpublished = {\url{https://github.com/Dian0212/GeoPep}}
}
```
## License
MIT