---
license: mit
language:
- en
library_name: pytorch
tags:
- biology
- proteins
- peptides
- binding-site-prediction
- esm3
- kan
- protein-language-model
- drug-design
pipeline_tag: token-classification
---

# GeoPep

**Geometric-aware Peptide-Protein Binding Site Prediction**

GeoPep is a per-residue binding site predictor that combines the ESM3 protein
foundation model with Kolmogorov-Arnold Network (KAN) heads. Given a peptide
and a protein, it predicts which residues of the protein bind the peptide and
which residues of the peptide make contact.

- 📦 Code: https://github.com/Dian0212/GeoPep
- 📄 Checkpoint: `model_distanceLoss.ckpt` (~16 GB)

## Model Summary

| Property | Value |
|---|---|
| **Backbone** | ESM3 (sm-open-v1, 1.4B params) |
| **Head** | 5 stacked FastKAN layers, 1536 → 1153 → 770 → 387 → 3 → 3 |
| **Granularity** | Per-residue (one prediction per amino acid) |
| **Classes** | 3: `0` = non-interface, `1` = interface (binding), `2` = padding |
| **Input** | Peptide (≤50 residues) `\|` Protein (≤500 residues), 551 tokens total |
| **Output** | Logits `[B, 3, 551]` → softmax binding probability per residue |
| **Training loss** | Weighted cross-entropy + differentiable geometric distance loss |

## How to Use

### 1. Install the GeoPep code

```bash
git clone https://github.com/Dian0212/GeoPep.git
cd GeoPep
conda create -n geopep python=3.10 -y
conda activate geopep
pip install -r requirements.txt
```

### 2. Download the checkpoint from this repo

```bash
huggingface-cli download dchenqwer/GeoPep model_distanceLoss.ckpt \
    --local-dir model_weights/ --local-dir-use-symlinks False
```

This places the file at `model_weights/model_distanceLoss.ckpt`.

### 3. Run inference on your PDBs

Put PDB files (named `<PDBID>_<peptide_chain>_<protein_chain>.pdb`) into a
folder, then:

```bash
cd scripts
python inference_pipeline.py \
    --pdb-dir /path/to/pdb \
    --checkpoint ../model_weights/model_distanceLoss.ckpt
```

Output: `result/predictions.json` with per-residue binding probabilities.

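
The `<PDBID>_<peptide_chain>_<protein_chain>.pdb` naming convention can be checked before starting a long run. The following is a minimal sketch, assuming that none of the three fields contains an underscore; `parse_complex_name` and `ComplexName` are hypothetical helpers written for illustration and are not part of the GeoPep repository.

```python
from pathlib import Path
from typing import NamedTuple


class ComplexName(NamedTuple):
    pdb_id: str
    peptide_chain: str
    protein_chain: str


def parse_complex_name(path: str) -> ComplexName:
    """Split '<PDBID>_<peptide_chain>_<protein_chain>.pdb' into its parts.

    Hypothetical helper: assumes none of the three fields contains '_'.
    """
    p = Path(path)
    if p.suffix.lower() != ".pdb":
        raise ValueError(f"expected a .pdb file, got {p.name!r}")
    parts = p.stem.split("_")
    if len(parts) != 3 or not all(parts):
        raise ValueError(
            f"{p.name!r} does not match <PDBID>_<peptide_chain>_<protein_chain>.pdb"
        )
    return ComplexName(*parts)


print(parse_complex_name("/path/to/pdb/1a1r_C_A.pdb"))
```

Running a check like this over the input folder surfaces misnamed files up front instead of relying on the pipeline to report them.
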
### 4. Result format

```json
{
  "1a1r_C_A": {
    "peptide_chain": "GSVVIVGRIVLSGKPA",
    "protein_chain": "VEGEVQIVSTATQTFLAT...",
    "peptide_bindingProbability": "0.99 0.99 0.99 ...",
    "protein_bindingProbability": "0.35 0.97 0.12 ..."
  }
}
```

- `peptide_chain` / `protein_chain`: actual residue sequences (padding stripped).
- `*_bindingProbability`: space-separated floats with two decimals, one per residue.
  Each value is the raw class-1 (interface) softmax probability.

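
Because each chain's probabilities are stored as one space-separated string, they have to be split and converted before use. Here is a minimal sketch of doing that and listing residues above a cutoff. The helper name `interface_residues`, the `demo_C_A` entry, and the 0.5 threshold are illustrative assumptions, not values recommended by the GeoPep authors.

```python
import json


def interface_residues(sequence: str, prob_string: str, threshold: float = 0.5):
    """Return (index, residue, probability) for residues at or above threshold.

    Indices are 0-based positions in the unpadded sequence.
    """
    probs = [float(x) for x in prob_string.split()]
    if len(probs) != len(sequence):
        raise ValueError("expected one probability per residue")
    return [
        (i, aa, p)
        for i, (aa, p) in enumerate(zip(sequence, probs))
        if p >= threshold
    ]


# Tiny made-up entry in the same layout as result/predictions.json.
# For a real run, load the file with json.load() instead.
predictions = json.loads("""
{
  "demo_C_A": {
    "peptide_chain": "GSV",
    "protein_chain": "VEGE",
    "peptide_bindingProbability": "0.99 0.40 0.75",
    "protein_bindingProbability": "0.35 0.97 0.12 0.50"
  }
}
""")

for name, entry in predictions.items():
    hits = interface_residues(
        entry["protein_chain"], entry["protein_bindingProbability"]
    )
    print(name, hits)
```

Since the scores are uncalibrated, any fixed threshold is a judgment call; ranking residues by probability may be more robust than a hard cutoff.
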
## Architecture

```
input tokens [B, 553]               BOS + 551 residues + EOS
        │
   ESM3 encoder
        ↓ embeddings[:, 1:552, :]   drop BOS / EOS
[B, 551, 1536]                      per-residue embeddings
        ↓ reshape
[B * 551, 1536]                     each residue independent
        ↓
   KAN_1: 1536 → 1153
   KAN_2: 1153 → 770
   KAN_3: 770 → 387
   KAN_4: 387 → 3
   KAN_5: 3 → 3
        ↓ reshape + permute
logits [B, 3, 551]
        ↓ softmax(dim=1)
binding probability = softmax[:, 1, :]
```

The per-residue head is what distinguishes GeoPep from the original
CLS-token approach: every residue's embedding flows independently through
the KAN stack, giving sharper position-level predictions.

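
The last two steps of the diagram (softmax over the class axis, then selecting class 1) can be illustrated without PyTorch. This is a plain-Python sketch for a single sample with logits laid out as `[3, L]`; the function name `binding_probabilities` and the logit values are made up for illustration.

```python
import math


def binding_probabilities(logits):
    """Softmax over the class axis of a [3, L] logit table; return class 1.

    logits[c][i] is the score of class c (0 non-interface, 1 interface,
    2 padding) at residue position i, mirroring one sample of [B, 3, 551].
    """
    n_classes, length = len(logits), len(logits[0])
    probs = []
    for i in range(length):
        column = [logits[c][i] for c in range(n_classes)]
        m = max(column)  # subtract the max for numerical stability
        exps = [math.exp(v - m) for v in column]
        probs.append(exps[1] / sum(exps))
    return probs


# Made-up logits for 3 positions (real inputs have 551).
example = [
    [2.0, -1.0, 0.0],   # class 0: non-interface
    [0.0, 3.0, 0.0],    # class 1: interface
    [-5.0, -5.0, 0.0],  # class 2: padding
]
print([round(p, 2) for p in binding_probabilities(example)])
```

With tensors, the same computation is the `softmax(dim=1)` followed by the `[:, 1, :]` slice shown in the diagram.
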
## Training

- **Dataset**: peptide–protein complexes from the PDB, encoded as
  paired `complex/` (full structure) + `interface/` (binding residues only)
  PDB files.
- **Length filters**: peptide ∈ [10, 50] residues, protein ∈ [10, 500] residues.
- **Loss**:
  - Per-half cross-entropy with class weights `[0.2, 0.8, 0.0]`
    (padding contributes zero gradient).
  - Differentiable distance loss
    `L_dist = Σᵢ P_binding(i) · dist(i) / num_valid_residues`,
    which penalizes high binding probability at residues far from the true
    interface (distance computed from residue centers of mass).
- **Backbone**: the ESM3 backbone is fine-tuned jointly with the KAN head.
- **Optimizer**: AdamW, learning rate 1e-4, weight decay 1e-4.
- **Mixed precision**: FP16.

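
The distance loss formula from the list above can be made concrete with a small plain-Python sketch. It assumes `dist(i)` is already available as a per-residue distance to the true interface (zero for interface residues) and that padding positions are excluded through a validity mask. How GeoPep computes or scales `dist(i)`, and how this term is weighted against the cross-entropy, is not specified here, so treat `distance_loss` as illustrative only.

```python
def distance_loss(p_binding, dist, valid):
    """Sum of P_binding(i) * dist(i) over valid residues, divided by their count.

    p_binding: class-1 probabilities, one per position
    dist:      assumed precomputed distance of each residue to the true interface
    valid:     True for real residues, False for padding
    """
    n_valid = sum(valid)
    if n_valid == 0:
        return 0.0
    total = sum(p * d for p, d, v in zip(p_binding, dist, valid) if v)
    return total / n_valid


# Made-up values: the last position is padding and is ignored.
dist = [0.0, 4.0, 12.0, 0.0]
valid = [True, True, True, False]
near = distance_loss([0.9, 0.2, 0.0, 0.5], dist, valid)  # mass near the interface
far = distance_loss([0.1, 0.2, 0.9, 0.5], dist, valid)   # mass far from it
print(near, far)
```

The second call puts most of the probability on a residue far from the interface and therefore yields the larger loss, which is the behavior the formula is meant to penalize. In training, the same expression would be written with tensors so that gradients flow through `p_binding`.
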
## Input Format

The model expects a single concatenated sequence of 551 tokens:

```
PEPTIDESEQ<pad><pad>...|PROTEINSEQ<pad><pad>...
|<------- 50 ------->|<-------- 500 ----------->|
```

| Position | Content |
|---|---|
| 0 .. 49 | Peptide (padded with trailing `<pad>` tokens to 50) |
| 50 | Separator `\|` |
| 51 .. 550 | Protein (padded with trailing `<pad>` tokens to 500) |

The `inference_pipeline.py` script handles padding and tokenization for you,
so you only provide raw PDB files.

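
For readers who want to see the layout spelled out, the following is a minimal sketch of the 50 + 1 + 500 arrangement. It assumes one token per residue and padding placed after each sequence, as drawn in the diagram. `build_input_tokens` is a hypothetical helper; the actual pipeline performs this step (plus tokenization and BOS/EOS insertion) internally.

```python
PEPTIDE_MAX = 50
PROTEIN_MAX = 500
PAD = "<pad>"
SEP = "|"


def build_input_tokens(peptide: str, protein: str) -> list[str]:
    """Lay out peptide and protein as 50 + 1 + 500 = 551 tokens."""
    if not 0 < len(peptide) <= PEPTIDE_MAX:
        raise ValueError(f"peptide must have 1..{PEPTIDE_MAX} residues")
    if not 0 < len(protein) <= PROTEIN_MAX:
        raise ValueError(f"protein must have 1..{PROTEIN_MAX} residues")
    # Assumption: padding follows the sequence, matching the diagram.
    pep = list(peptide) + [PAD] * (PEPTIDE_MAX - len(peptide))
    prot = list(protein) + [PAD] * (PROTEIN_MAX - len(protein))
    return pep + [SEP] + prot


tokens = build_input_tokens("GSVVIVGRIVLSGKPA", "VEGEVQIVSTATQTFLAT")
print(len(tokens), tokens[50], tokens[51])
```

The fixed offsets mean that position 50 is always the separator and the protein always starts at position 51, regardless of the actual sequence lengths.
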
## Limitations

- **Sequence-length cap**: the peptide must be ≤ 50 residues and the protein
  ≤ 500 residues. Longer sequences are silently skipped at the preprocessing
  stage.
- **Requires per-PDB chain annotation**: filenames must follow the
  `<PDBID>_<peptide_chain>_<protein_chain>.pdb` convention so the pipeline
  knows which chain is the peptide.
- **No uncertainty calibration**: raw softmax probabilities are not calibrated;
  use them as relative scores rather than absolute confidences.
- **Single peptide-protein pair per call**: multi-peptide or multi-chain
  contexts are not supported in the standard pipeline.

## Citation

If you use GeoPep in your work, please cite:

```bibtex
@misc{geopep2026,
  title        = {GeoPep: Geometric-aware Peptide-Protein Binding Site Prediction},
  author       = {Chen, Dian},
  year         = {2026},
  howpublished = {\url{https://github.com/Dian0212/GeoPep}}
}
```

## License

MIT