---
license: mit
language:
- en
library_name: pytorch
tags:
- biology
- proteins
- peptides
- binding-site-prediction
- esm3
- kan
- protein-language-model
- drug-design
pipeline_tag: token-classification
---

# GeoPep

**Geometric-aware Peptide-Protein Binding Site Prediction**

GeoPep is a per-residue binding site predictor that combines the ESM3 protein foundation model with Kolmogorov-Arnold Network (KAN) heads. Given a peptide and a protein, it predicts which residues of the protein bind the peptide and which residues of the peptide make contact.

- 📦 Code: https://github.com/Dian0212/GeoPep
- 📄 Checkpoint: `model_distanceLoss.ckpt` (~16 GB)

## Model Summary

| | |
|---|---|
| **Backbone** | ESM3 (sm-open-v1, 1.4B params) |
| **Head** | 5 stacked FastKAN layers, 1536 → 1153 → 770 → 387 → 3 → 3 |
| **Granularity** | Per-residue (one prediction per amino acid) |
| **Classes** | 3: `0` = non-interface, `1` = interface (binding), `2` = padding |
| **Input** | Peptide (≤50 residues) `\|` Protein (≤500 residues), 551 tokens total |
| **Output** | Logits `[B, 3, 551]` → softmax binding probability per residue |
| **Training loss** | Weighted cross-entropy + differentiable geometric distance loss |

## How to Use

### 1. Install the GeoPep code

```bash
git clone https://github.com/Dian0212/GeoPep.git
cd GeoPep
conda create -n geopep python=3.10 -y
conda activate geopep
pip install -r requirements.txt
```

### 2. Download the checkpoint from this repo

```bash
huggingface-cli download dchenqwer/GeoPep model_distanceLoss.ckpt \
    --local-dir model_weights/ --local-dir-use-symlinks False
```

This places the file at `model_weights/model_distanceLoss.ckpt`.

### 3. Run inference on your PDBs

Put PDB files (named `__.pdb`) into a folder, then:

```bash
cd scripts
python inference_pipeline.py \
    --pdb-dir /path/to/pdb \
    --checkpoint ../model_weights/model_distanceLoss.ckpt
```

Output: `result/predictions.json` with per-residue binding probabilities.

### 4. Result format

```json
{
  "1a1r_C_A": {
    "peptide_chain": "GSVVIVGRIVLSGKPA",
    "protein_chain": "VEGEVQIVSTATQTFLAT...",
    "peptide_bindingProbability": "0.99 0.99 0.99 ...",
    "protein_bindingProbability": "0.35 0.97 0.12 ..."
  }
}
```

- `peptide_chain` / `protein_chain`: actual residue sequences (padding stripped).
- `*_bindingProbability`: space-separated 2-decimal floats, one per residue. Each value is the raw class-1 (interface) softmax probability.

## Architecture

```
input tokens [B, 553]                  BOS + 551 residues + EOS
      │  ESM3 encoder
      ↓
embeddings[:, 1:552, :]                drop BOS / EOS
[B, 551, 1536]                         per-residue embeddings
      ↓  reshape
[B * 551, 1536]                        each residue independent
      ↓
KAN_1: 1536 → 1153
KAN_2: 1153 → 770
KAN_3:  770 → 387
KAN_4:  387 → 3
KAN_5:    3 → 3
      ↓  reshape + permute
logits [B, 3, 551]
      ↓  softmax(dim=1)
binding probability = softmax[:, 1, :]
```

The per-residue head is what makes GeoPep different from the original CLS-token approach: every residue's embedding flows independently through the KAN stack, giving sharper position-level predictions.

## Training

- **Dataset**: peptide–protein complexes from the PDB, encoded as paired `complex/` (full structure) + `interface/` (binding residues only) PDB files.
- **Length filters**: peptide ∈ [10, 50] residues, protein ∈ [10, 500] residues.
- **Loss**:
  - Per-half cross-entropy with class weights `[0.2, 0.8, 0.0]` (padding contributes zero gradient).
  - Differentiable distance loss `L_dist = Σᵢ P_binding(i) · dist(i) / num_valid_residues`, which penalizes high binding probability at residues far from the true interface (distance computed from residue centers of mass).
- **Backbone**: ESM3 backbone is fine-tuned together with the KAN head.
- **Optimizer**: AdamW, learning rate 1e-4, weight decay 1e-4.
- **Mixed precision**: FP16.

## Input Format

The model expects a single concatenated string of length 551 tokens:

```
PEPTIDESEQ...|PROTEINSEQ...
|<------- 50 ------->|<-------- 500 ----------->|
```

| Position | Content |
|---|---|
| 0 .. 49 | Peptide (left-padded with `` to 50) |
| 50 | Separator `\|` |
| 51 .. 550 | Protein (left-padded with `` to 500) |

The `inference_pipeline.py` script handles padding and tokenization for you; you only provide raw PDB files.

## Limitations

- **Sequence-length cap**: peptide must be ≤ 50 residues, protein ≤ 500 residues. Sequences longer than this are silently skipped at the preprocessing stage.
- **Requires per-PDB chain annotation**: filenames must follow the `__.pdb` convention so the pipeline knows which chain is the peptide.
- **No uncertainty calibration**: raw softmax probabilities are not calibrated; use them as relative scores rather than absolute confidences.
- **Single peptide-protein pair per call**: multi-peptide or multi-chain contexts are not supported in the standard pipeline.

## Citation

If you use GeoPep in your work, please cite:

```bibtex
@misc{geopep2026,
  title        = {GeoPep: Geometric-aware Peptide-Protein Binding Site Prediction},
  author       = {Chen, Dian},
  year         = {2026},
  howpublished = {\url{https://github.com/Dian0212/GeoPep}}
}
```

## License

MIT
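
## Reading the Results

The result format described above stores each chain's probabilities as a space-separated string, so a small amount of parsing is needed before the scores can be used. The following is a minimal sketch, not part of the GeoPep repository: the helper names `parse_probabilities` and `rank_residues`, the `top_k` parameter, and the `demo_C_A` record are all hypothetical. Because the probabilities are uncalibrated, the sketch ranks residues by score rather than applying a fixed cutoff. It also assumes the returned index is the 0-based position in the padding-stripped sequence, not the PDB residue number.

```python
import json


def parse_probabilities(field):
    """Split a space-separated probability string into a list of floats."""
    return [float(token) for token in field.split()]


def rank_residues(sequence, prob_field, top_k=5):
    """Return the top_k residues as (index, amino_acid, probability),
    sorted from highest to lowest interface probability.

    The index is the 0-based position in `sequence` (an assumption of this
    sketch), not the residue number from the PDB file.
    """
    probs = parse_probabilities(prob_field)
    if len(probs) != len(sequence):
        raise ValueError(
            f"expected {len(sequence)} probabilities, got {len(probs)}"
        )
    ranked = sorted(
        enumerate(zip(sequence, probs)),
        key=lambda item: item[1][1],  # sort by probability
        reverse=True,
    )
    return [(index, aa, p) for index, (aa, p) in ranked[:top_k]]


# Hypothetical record shaped like an entry in result/predictions.json.
# For a real run, replace this with: json.load(open("result/predictions.json"))
predictions = json.loads("""
{
  "demo_C_A": {
    "peptide_chain": "GSVV",
    "protein_chain": "VEGE",
    "peptide_bindingProbability": "0.99 0.42 0.87 0.10",
    "protein_bindingProbability": "0.35 0.97 0.12 0.64"
  }
}
""")

for complex_id, record in predictions.items():
    top = rank_residues(
        record["protein_chain"], record["protein_bindingProbability"], top_k=2
    )
    print(complex_id, top)
# prints: demo_C_A [(1, 'E', 0.97), (3, 'E', 0.64)]
```

The length check guards against truncated or mismatched records; the same function can be applied to `peptide_chain` with `peptide_bindingProbability`.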