---
license: mit
language:
- en
library_name: pytorch
tags:
- biology
- proteins
- peptides
- binding-site-prediction
- esm3
- kan
- protein-language-model
- drug-design
pipeline_tag: token-classification
---

# GeoPep

**Geometric-aware Peptide-Protein Binding Site Prediction**

GeoPep is a per-residue binding site predictor that combines the ESM3 protein
foundation model with Kolmogorov-Arnold Network (KAN) heads. Given a peptide
and a protein, it predicts which residues of the protein bind the peptide and
which residues of the peptide make contact.

- 📦 Code: https://github.com/Dian0212/GeoPep
- 📄 Checkpoint: `model_distanceLoss.ckpt` (~16 GB)

## Model Summary

| | |
|---|---|
| **Backbone** | ESM3 (sm-open-v1, 1.4B params) |
| **Head** | 5 stacked FastKAN layers, 1536 → 1153 → 770 → 387 → 3 → 3 |
| **Granularity** | Per-residue (one prediction per amino acid) |
| **Classes** | 3 classes: `0` = non-interface, `1` = interface (binding), `2` = padding |
| **Input** | Peptide (≤50 residues) `\|` Protein (≤500 residues), 551 tokens total |
| **Output** | Logits `[B, 3, 551]` → softmax binding probability per residue |
| **Training loss** | Weighted cross-entropy + differentiable geometric distance loss |

## How to Use

### 1. Install the GeoPep code

```bash
git clone https://github.com/Dian0212/GeoPep.git
cd GeoPep
conda create -n geopep python=3.10 -y
conda activate geopep
pip install -r requirements.txt
```

### 2. Download the checkpoint from this repo

```bash
huggingface-cli download dchenqwer/GeoPep model_distanceLoss.ckpt \
    --local-dir model_weights/ --local-dir-use-symlinks False
```

This places the file at `model_weights/model_distanceLoss.ckpt`.

### 3. Run inference on your PDBs

Put PDB files (named `<PDBID>_<peptide_chain>_<protein_chain>.pdb`) into a
folder, then:

```bash
cd scripts
python inference_pipeline.py \
    --pdb-dir /path/to/pdb \
    --checkpoint ../model_weights/model_distanceLoss.ckpt
```

Output: `result/predictions.json` with per-residue binding probabilities.
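
Because the pipeline relies on the `<PDBID>_<peptide_chain>_<protein_chain>.pdb` naming convention, it can help to check filenames before running inference. The following is a minimal sketch and not part of the GeoPep code: `parse_complex_name` and `ComplexName` are hypothetical names, and the pattern assumes that none of the three fields contains an underscore.

```python
import re
from typing import NamedTuple, Optional

# Assumption: PDB ID and chain IDs never contain an underscore.
_NAME_RE = re.compile(
    r"^(?P<pdb_id>[^_]+)_(?P<peptide>[^_]+)_(?P<protein>[^_]+)\.pdb$"
)


class ComplexName(NamedTuple):
    pdb_id: str
    peptide_chain: str
    protein_chain: str


def parse_complex_name(filename: str) -> Optional[ComplexName]:
    """Return the parsed fields, or None if the name does not match."""
    match = _NAME_RE.match(filename)
    if match is None:
        return None
    return ComplexName(match["pdb_id"], match["peptide"], match["protein"])


print(parse_complex_name("1a1r_C_A.pdb"))  # fields: 1a1r, C, A
print(parse_complex_name("1a1r.pdb"))      # None: chain IDs are missing
```

Files for which the helper returns `None` can be renamed or set aside before calling `inference_pipeline.py`.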

### 4. Result format

```json
{
  "1a1r_C_A": {
    "peptide_chain": "GSVVIVGRIVLSGKPA",
    "protein_chain": "VEGEVQIVSTATQTFLAT...",
    "peptide_bindingProbability": "0.99 0.99 0.99 ...",
    "protein_bindingProbability": "0.35 0.97 0.12 ..."
  }
}
```

- `peptide_chain` / `protein_chain`: actual residue sequences (padding stripped).
- `*_bindingProbability`: space-separated floats with two decimals, one per residue.
  Each value is the raw class-1 (interface) softmax probability.
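
The space-separated probability strings need to be split and converted before use. The sketch below shows one way to read an entry in this format and list residues at or above a cutoff. `binding_residues` is a hypothetical helper, the sample sequences and values are made up for illustration, and the 0.5 threshold is an arbitrary choice rather than a recommended setting.

```python
import json

# Made-up sample in the documented format (shortened for illustration).
sample = """
{
  "1a1r_C_A": {
    "peptide_chain": "GSVV",
    "protein_chain": "VEGE",
    "peptide_bindingProbability": "0.99 0.99 0.40 0.10",
    "protein_bindingProbability": "0.35 0.97 0.12 0.50"
  }
}
"""


def binding_residues(entry: dict, chain: str, threshold: float = 0.5) -> list:
    """Return (index, residue, probability) for residues >= threshold.

    `chain` is "peptide" or "protein"; indices are 0-based.
    """
    sequence = entry[f"{chain}_chain"]
    probs = [float(p) for p in entry[f"{chain}_bindingProbability"].split()]
    if len(probs) != len(sequence):
        raise ValueError("expected one probability per residue")
    return [
        (i, aa, p)
        for i, (aa, p) in enumerate(zip(sequence, probs))
        if p >= threshold
    ]


results = json.loads(sample)
entry = results["1a1r_C_A"]
print(binding_residues(entry, "protein"))  # [(1, 'E', 0.97), (3, 'E', 0.5)]
```

To read real output, replace `json.loads(sample)` with `json.load` on an open handle to `result/predictions.json`.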

## Architecture

```
input tokens [B, 553]                  BOS + 551 residues + EOS
       │
   ESM3 encoder
       ↓ embeddings[:, 1:552, :]       drop BOS / EOS
   [B, 551, 1536]                      per-residue embeddings
       ↓ reshape
   [B * 551, 1536]                     each residue independent
       ↓
   KAN_1: 1536 → 1153
   KAN_2: 1153 → 770
   KAN_3:  770 → 387
   KAN_4:  387 → 3
   KAN_5:    3 → 3
       ↓ reshape + permute
   logits [B, 3, 551]
       ↓ softmax(dim=1)
   binding probability = softmax[:, 1, :]
```
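
The tensor shapes in the diagram can be traced with a small NumPy sketch. This is only a shape walkthrough under stated assumptions: random numbers stand in for the ESM3 encoder output, and plain linear maps with `tanh` stand in for the FastKAN layers, so the values are meaningless and this is not the GeoPep head itself.

```python
import numpy as np

B, L, D, C = 2, 551, 1536, 3            # batch, residues, embedding dim, classes
widths = [1536, 1153, 770, 387, 3, 3]   # head widths from the diagram above

rng = np.random.default_rng(0)

# Stand-in for the ESM3 encoder output: BOS + 551 residues + EOS.
embeddings = rng.standard_normal((B, L + 2, D)).astype(np.float32)

x = embeddings[:, 1:552, :]             # drop BOS / EOS -> [B, 551, 1536]
x = x.reshape(B * L, D)                 # each residue independent

# Stand-in for the five FastKAN layers (NOT FastKAN): linear map + tanh,
# used only to reproduce the layer-by-layer shapes.
for d_in, d_out in zip(widths[:-1], widths[1:]):
    w = rng.standard_normal((d_in, d_out)).astype(np.float32) / np.sqrt(d_in)
    x = np.tanh(x @ w)

logits = x.reshape(B, L, C).transpose(0, 2, 1)   # [B, 3, 551]

# Numerically stable softmax over the class axis (dim=1).
shifted = logits - logits.max(axis=1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

binding_prob = probs[:, 1, :]           # class 1 = interface -> [B, 551]
print(logits.shape, binding_prob.shape)  # (2, 3, 551) (2, 551)
```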

The per-residue head is what makes GeoPep different from the original
CLS-token approach: every residue's embedding flows independently through
the KAN stack, giving sharper position-level predictions.

## Training

- **Dataset**: peptide-protein complexes from the PDB, encoded as
  paired `complex/` (full structure) + `interface/` (binding residues only)
  PDB files.
- **Length filters**: peptide ∈ [10, 50] residues, protein ∈ [10, 500] residues.
- **Loss**:
  - Per-half cross-entropy with class weights `[0.2, 0.8, 0.0]`
    (padding contributes zero gradient).
  - Differentiable distance loss
    `L_dist = Σᵢ P_binding(i) · dist(i) / num_valid_residues`,
    which penalizes high binding probability at residues far from the true
    interface (distance computed from residue centers of mass).
- **Backbone**: ESM3 is fine-tuned jointly with the KAN head.
- **Optimizer**: AdamW, learning rate 1e-4, weight decay 1e-4.
- **Mixed precision**: FP16.
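
The distance-loss formula in the list above can be illustrated with a few lines of plain Python. This is a sketch of the arithmetic only, not the training code: `distance_loss` is a hypothetical name, the per-residue distances are assumed to be precomputed, padding positions are assumed to be excluded through a validity mask, and the toy numbers are made up. In training, the same expression would be evaluated on tensors so that gradients can flow through the probabilities.

```python
from typing import Sequence


def distance_loss(p_binding: Sequence[float],
                  dist: Sequence[float],
                  valid: Sequence[bool]) -> float:
    """L_dist = sum_i P_binding(i) * dist(i) / num_valid_residues.

    p_binding[i]: predicted class-1 probability for residue i
    dist[i]:      distance from residue i to the true interface
    valid[i]:     False for padding positions (assumed to be excluded)
    """
    num_valid = sum(valid)
    if num_valid == 0:
        return 0.0
    total = sum(p * d for p, d, ok in zip(p_binding, dist, valid) if ok)
    return total / num_valid


# Toy values: residue 0 lies on the interface (distance 0),
# residue 2 is far away but still receives a high probability.
p = [0.9, 0.5, 0.8, 0.7]
d = [0.0, 4.0, 12.0, 0.0]
mask = [True, True, True, False]   # last position is padding

print(distance_loss(p, d, mask))   # (0.9*0 + 0.5*4 + 0.8*12) / 3
```

The confident prediction at residue 0 costs nothing because its distance is zero, while the distant residue 2 dominates the loss.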

## Input Format

The model expects a single concatenated sequence of 551 tokens:

```
PEPTIDESEQ<pad><pad>...|PROTEINSEQ<pad><pad>...
|<------- 50 ------->|<-------- 500 ----------->|
```

| Position | Content |
|---|---|
| 0 .. 49     | Peptide (right-padded with `<pad>` to 50) |
| 50          | Separator `\|` |
| 51 .. 550   | Protein (right-padded with `<pad>` to 500) |

The `inference_pipeline.py` script handles padding and tokenization for you;
you only provide raw PDB files.
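
For readers who want to see the layout concretely, the sketch below builds the 551-position token list by following the diagram above (sequence first, then `<pad>`). It is an illustration only: `build_input_tokens` is a hypothetical helper, it works on characters rather than ESM3 token IDs, and it raises on over-length input, whereas the actual pipeline is described as skipping such sequences. The protein string is just the truncated prefix shown in the result example.

```python
from typing import List

PAD = "<pad>"
SEP = "|"
MAX_PEPTIDE = 50
MAX_PROTEIN = 500


def build_input_tokens(peptide: str, protein: str) -> List[str]:
    """Lay out peptide and protein as 50 + 1 + 500 = 551 tokens."""
    if len(peptide) > MAX_PEPTIDE or len(protein) > MAX_PROTEIN:
        raise ValueError("sequence exceeds the 50 / 500 residue limits")
    # Assumption: padding follows the sequence, as drawn in the diagram.
    pep = list(peptide) + [PAD] * (MAX_PEPTIDE - len(peptide))
    pro = list(protein) + [PAD] * (MAX_PROTEIN - len(protein))
    return pep + [SEP] + pro


tokens = build_input_tokens("GSVVIVGRIVLSGKPA", "VEGEVQIVSTATQTFLAT")
print(len(tokens))   # 551
print(tokens[50])    # separator
print(tokens[16])    # first padding slot after the 16-residue peptide
print(tokens[51])    # first protein residue
```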

## Limitations

- **Sequence-length cap**: peptide must be ≤ 50 residues, protein ≤ 500
  residues. Sequences longer than this are silently skipped at the
  preprocessing stage.
- **Requires per-PDB chain annotation**: filenames must follow the
  `<PDBID>_<peptide_chain>_<protein_chain>.pdb` convention so the pipeline
  knows which chain is the peptide.
- **No uncertainty calibration**: raw softmax probabilities are not calibrated;
  use them as relative scores rather than absolute confidences.
- **Single peptide-protein pair per call**: multi-peptide or multi-chain
  contexts are not supported in the standard pipeline.

## Citation

If you use GeoPep in your work, please cite:

```bibtex
@misc{geopep2026,
  title  = {GeoPep: Geometric-aware Peptide-Protein Binding Site Prediction},
  author = {Chen, Dian},
  year   = {2026},
  howpublished = {\url{https://github.com/Dian0212/GeoPep}}
}
```

## License

MIT