---
license: mit
language:
- en
library_name: pytorch
tags:
- biology
- proteins
- peptides
- binding-site-prediction
- esm3
- kan
- protein-language-model
- drug-design
pipeline_tag: token-classification
---
# GeoPep
**Geometric-aware Peptide-Protein Binding Site Prediction**
GeoPep is a per-residue binding site predictor that combines the ESM3 protein
foundation model with Kolmogorov-Arnold Network (KAN) heads. Given a peptide
and a protein, it predicts which residues of the protein bind the peptide and
which residues of the peptide make contact.
- Code: https://github.com/Dian0212/GeoPep
- Checkpoint: `model_distanceLoss.ckpt` (~16 GB)
## Model Summary
| | |
|---|---|
| **Backbone** | ESM3 (sm-open-v1, 1.4B params) |
| **Head** | 5 stacked FastKAN layers, 1536 → 1153 → 770 → 387 → 3 → 3 |
| **Granularity** | Per-residue (one prediction per amino acid) |
| **Classes** | 3: `0` = non-interface, `1` = interface (binding), `2` = padding |
| **Input** | Peptide (≤ 50 residues) `\|` Protein (≤ 500 residues), 551 tokens total |
| **Output** | Logits `[B, 3, 551]` → softmax binding probability per residue |
| **Training loss** | Weighted cross-entropy + differentiable geometric distance loss |
## How to Use
### 1. Install the GeoPep code
```bash
git clone https://github.com/Dian0212/GeoPep.git
cd GeoPep
conda create -n geopep python=3.10 -y
conda activate geopep
pip install -r requirements.txt
```
### 2. Download the checkpoint from this repo
```bash
huggingface-cli download dchenqwer/GeoPep model_distanceLoss.ckpt \
    --local-dir model_weights/ --local-dir-use-symlinks False
```
This places the file at `model_weights/model_distanceLoss.ckpt`.
### 3. Run inference on your PDBs
Put PDB files (named `<PDBID>_<peptide_chain>_<protein_chain>.pdb`) into a
folder, then:
```bash
cd scripts
python inference_pipeline.py \
    --pdb-dir /path/to/pdb \
    --checkpoint ../model_weights/model_distanceLoss.ckpt
```
Output: `result/predictions.json` with per-residue binding probabilities.
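Because the pipeline reads the chain roles from the file name, it can help to check names before a run. The sketch below is not part of the GeoPep code: `parse_pdb_filename` is a hypothetical helper, and the pattern (a four-character PDB ID and single-character chain IDs) is an assumption based on names like `1a1r_C_A`.

```python
import re
from pathlib import Path

# Assumed pattern: 4-character PDB ID, then one-character peptide and protein chain IDs.
_NAME_RE = re.compile(r"^([0-9A-Za-z]{4})_([0-9A-Za-z])_([0-9A-Za-z])$")

def parse_pdb_filename(path):
    """Return (pdb_id, peptide_chain, protein_chain), or None if the name does not match."""
    p = Path(path)
    if p.suffix.lower() != ".pdb":
        return None
    match = _NAME_RE.match(p.stem)
    return match.groups() if match else None

print(parse_pdb_filename("/path/to/pdb/1a1r_C_A.pdb"))
print(parse_pdb_filename("complex.pdb"))
```

Files for which the helper returns `None` are worth renaming before they reach the pipeline.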
### 4. Result format
```json
{
  "1a1r_C_A": {
    "peptide_chain": "GSVVIVGRIVLSGKPA",
    "protein_chain": "VEGEVQIVSTATQTFLAT...",
    "peptide_bindingProbability": "0.99 0.99 0.99 ...",
    "protein_bindingProbability": "0.35 0.97 0.12 ..."
  }
}
```
- `peptide_chain` / `protein_chain`: actual residue sequences (padding stripped).
- `*_bindingProbability`: space-separated floats with two decimals, one per residue.
  Each value is the raw class-1 (interface) softmax probability.
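A minimal sketch of reading this format in Python follows. The embedded sample is shortened to three residues per chain for illustration only, `interface_residues` is a hypothetical helper, and the 0.5 cut-off is an arbitrary assumption rather than a recommended threshold.

```python
import json

# Illustrative sample in the documented format, shortened to three residues per chain.
sample = """
{
  "1a1r_C_A": {
    "peptide_chain": "GSV",
    "protein_chain": "VEG",
    "peptide_bindingProbability": "0.99 0.99 0.99",
    "protein_bindingProbability": "0.35 0.97 0.12"
  }
}
"""

def interface_residues(entry, side, threshold=0.5):
    """List (index, residue, probability) for residues scoring at or above threshold.

    side is "peptide" or "protein"; threshold is an assumed, uncalibrated cut-off.
    """
    seq = entry[f"{side}_chain"]
    probs = [float(x) for x in entry[f"{side}_bindingProbability"].split()]
    if len(seq) != len(probs):
        raise ValueError("sequence and probability lengths differ")
    return [(i, aa, p) for i, (aa, p) in enumerate(zip(seq, probs)) if p >= threshold]

results = json.loads(sample)
for complex_id, entry in results.items():
    print(complex_id, interface_residues(entry, "protein"))
```

For real output, replace `json.loads(sample)` with `json.load` on an open handle to `result/predictions.json`.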
## Architecture
```
input tokens   [B, 553]           BOS + 551 residues + EOS
      ↓
ESM3 encoder
      ↓  embeddings[:, 1:552, :]  drop BOS / EOS
[B, 551, 1536]                    per-residue embeddings
      ↓  reshape
[B * 551, 1536]                   each residue independent
      ↓
KAN_1: 1536 → 1153
KAN_2: 1153 → 770
KAN_3:  770 → 387
KAN_4:  387 → 3
KAN_5:    3 → 3
      ↓  reshape + permute
logits [B, 3, 551]
      ↓  softmax(dim=1)
binding probability = softmax[:, 1, :]
```
The per-residue head is what distinguishes GeoPep from the original
CLS-token approach: every residue's embedding flows independently through
the KAN stack, giving sharper position-level predictions.
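The last step of the diagram, softmax over the class dimension followed by selecting class 1, can be illustrated without any deep-learning dependency. This is a plain-Python sketch for a single sample, not the model code; `binding_probability` and the toy logits are made up for illustration.

```python
import math

def binding_probability(logits):
    """logits: one list per class (3 classes), each of length L.

    Returns the class-1 (interface) softmax probability for each residue,
    mirroring softmax(dim=1)[:, 1, :] for a single sample.
    """
    n_classes = len(logits)
    length = len(logits[0])
    probs = []
    for j in range(length):
        column = [logits[c][j] for c in range(n_classes)]
        m = max(column)  # subtract the max for numerical stability
        exps = [math.exp(v - m) for v in column]
        probs.append(exps[1] / sum(exps))
    return probs

# Toy logits for 2 residues; rows are classes 0 (non-interface), 1 (interface), 2 (padding).
toy = [
    [2.0, 0.0],
    [0.0, 2.0],
    [-5.0, -5.0],
]
print([round(p, 3) for p in binding_probability(toy)])
```

Because the softmax runs over all three classes, the class-1 value is a probability relative to both the non-interface and padding classes.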
## Training
- **Dataset**: peptide-protein complexes from the PDB, encoded as
  paired `complex/` (full structure) + `interface/` (binding residues only)
  PDB files.
- **Length filters**: peptide ∈ [10, 50] residues, protein ∈ [10, 500] residues.
- **Loss**:
  - Per-half cross-entropy with class weights `[0.2, 0.8, 0.0]`
    (padding contributes zero gradient).
  - Differentiable distance loss
    `L_dist = Σᵢ P_binding(i) · dist(i) / num_valid_residues`,
    which penalizes high binding probability at residues far from the true
    interface (distance computed from residue centers of mass).
- **Backbone**: ESM3 backbone is fine-tuned together with the KAN head.
- **Optimizer**: AdamW, learning rate 1e-4, weight decay 1e-4.
- **Mixed precision**: FP16.
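The distance loss formula can be worked through on a toy example. The sketch below is plain Python rather than the training code, and it rests on assumptions: `dist[i]` is taken to be the distance from residue `i` to the true interface (0 for true interface residues), and a boolean `valid` mask marks non-padding residues. The function name and the numbers are illustrative only.

```python
def distance_loss(p_binding, dist, valid):
    """L_dist = sum_i P_binding(i) * dist(i) / num_valid_residues.

    p_binding: predicted interface probability per residue
    dist:      assumed distance of each residue to the true interface
    valid:     True for real residues, False for padding (assumed mask)
    """
    num_valid = sum(1 for v in valid if v)
    if num_valid == 0:
        return 0.0
    total = sum(p * d for p, d, v in zip(p_binding, dist, valid) if v)
    return total / num_valid

# Toy case: residue 1 is predicted as binding but lies 10 units from the interface.
p = [0.9, 0.8, 0.1, 0.0]
d = [0.0, 10.0, 20.0, 0.0]
mask = [True, True, True, False]
print(distance_loss(p, d, mask))
```

A confident prediction on a true interface residue adds nothing to the loss, while the same confidence on a distant residue is penalized in proportion to its distance.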
## Input Format
The model expects a single concatenated sequence of 551 tokens:
```
PEPTIDESEQ<pad><pad>...|PROTEINSEQ<pad><pad>...
|<------- 50 ------->|<-------- 500 ----------->|
```
| Position | Content |
|---|---|
| 0 to 49 | Peptide (right-padded with `<pad>` to 50, as in the diagram above) |
| 50 | Separator `\|` |
| 51 to 550 | Protein (right-padded with `<pad>` to 500, as in the diagram above) |
The `inference_pipeline.py` script handles padding and tokenization for you,
so you only provide raw PDB files.
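For readers who want to see the layout concretely, here is a minimal sketch that builds the 551-token sequence by hand. It is not the pipeline's tokenizer: `build_input_tokens` is a hypothetical helper, padding is placed after each sequence as drawn in the layout diagram, and the protein string is a short toy sequence.

```python
PAD = "<pad>"
SEP = "|"
MAX_PEPTIDE = 50
MAX_PROTEIN = 500

def build_input_tokens(peptide, protein):
    """Return a list of 551 tokens: padded peptide, separator, padded protein."""
    if len(peptide) > MAX_PEPTIDE:
        raise ValueError(f"peptide has {len(peptide)} residues, limit is {MAX_PEPTIDE}")
    if len(protein) > MAX_PROTEIN:
        raise ValueError(f"protein has {len(protein)} residues, limit is {MAX_PROTEIN}")
    pep = list(peptide) + [PAD] * (MAX_PEPTIDE - len(peptide))
    prot = list(protein) + [PAD] * (MAX_PROTEIN - len(protein))
    return pep + [SEP] + prot

# Toy protein sequence used only to show the positions.
tokens = build_input_tokens("GSVVIVGRIVLSGKPA", "VEGEVQIVSTATQTFLAT")
print(len(tokens), tokens[50], tokens[16], tokens[51])
```

Tokens are kept in a list rather than a string because `<pad>` spans several characters but counts as a single position.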
## Limitations
- **Sequence-length cap**: peptide must be ≤ 50 residues, protein ≤ 500
  residues. Longer sequences are silently skipped at the
  preprocessing stage.
- **Requires per-PDB chain annotation**: filenames must follow the
`<PDBID>_<peptide_chain>_<protein_chain>.pdb` convention so the pipeline
knows which chain is the peptide.
- **No uncertainty calibration**: raw softmax probabilities are not calibrated;
use them as relative scores rather than absolute confidences.
- **Single peptide-protein pair per call**: multi-peptide or multi-chain
contexts are not supported in the standard pipeline.
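Since the probabilities are best read as relative scores, one way to use them is to rank residues instead of applying a fixed cut-off. The sketch below assumes the space-separated probability format from `predictions.json`; `top_k_residues` and the toy values are hypothetical.

```python
def top_k_residues(sequence, prob_string, k=5):
    """Return the k highest-scoring residues as (index, residue, score), best first."""
    probs = [float(x) for x in prob_string.split()]
    if len(probs) != len(sequence):
        raise ValueError("sequence and probability lengths differ")
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(i, sequence[i], probs[i]) for i in ranked[:k]]

# Toy values for illustration, not model output.
print(top_k_residues("VEGEV", "0.35 0.97 0.12 0.60 0.05", k=2))
```

Ranking within one complex avoids comparing raw scores across complexes, where the lack of calibration matters most.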
## Citation
If you use GeoPep in your work, please cite:
```bibtex
@misc{geopep2026,
title = {GeoPep: Geometric-aware Peptide-Protein Binding Site Prediction},
author = {Chen, Dian},
year = {2026},
howpublished = {\url{https://github.com/Dian0212/GeoPep}}
}
```
## License
MIT