	---
	license: mit
	language:
	- en
	library_name: pytorch
	tags:
	- biology
	- proteins
	- peptides
	- binding-site-prediction
	- esm3
	- kan
	- protein-language-model
	- drug-design
	pipeline_tag: token-classification
	---

	# GeoPep

	Geometric-aware Peptide-Protein Binding Site Prediction

	GeoPep is a per-residue binding site predictor that combines the ESM3 protein
	foundation model with Kolmogorov-Arnold Network (KAN) heads. Given a peptide
	and a protein, it predicts which residues of the protein bind the peptide and
	which residues of the peptide make contact.

	- 📦 Code: https://github.com/Dian0212/GeoPep
	- 📄 Checkpoint: `model_distanceLoss.ckpt` (~16 GB)

	## Model Summary

	| | |
	|---|---|
	| Backbone | ESM3 (sm-open-v1, 1.4B params) |
	| Head | 5 stacked FastKAN layers, 1536 → 1153 → 770 → 387 → 3 → 3 |
	| Granularity | Per-residue (one prediction per amino acid) |
	| Classes | 3 — `0`: non-interface, `1`: interface (binding), `2`: padding |
	| Input | Peptide (≤50 residues) `\|` Protein (≤500 residues), 551 tokens total |
	| Output | Logits `[B, 3, 551]` → softmax binding probability per residue |
	| Training loss | Weighted cross-entropy + differentiable geometric distance loss |

	## How to Use

	### 1. Install the GeoPep code

	```bash
	git clone https://github.com/Dian0212/GeoPep.git
	cd GeoPep
	conda create -n geopep python=3.10 -y
	conda activate geopep
	pip install -r requirements.txt
	```

	### 2. Download the checkpoint from this repo

	```bash
	huggingface-cli download dchenqwer/GeoPep model_distanceLoss.ckpt \
	  --local-dir model_weights/ --local-dir-use-symlinks False
	```

	This places the file at `model_weights/model_distanceLoss.ckpt`.

	### 3. Run inference on your PDBs

	Put PDB files (named `<PDBID>_<peptide_chain>_<protein_chain>.pdb`) into a
	folder, then:

	```bash
	cd scripts
	python inference_pipeline.py \
	  --pdb-dir /path/to/pdb \
	  --checkpoint ../model_weights/model_distanceLoss.ckpt
	```

	Output: `result/predictions.json` with per-residue binding probabilities.

	### 4. Result format

	```json
	{
	  "1a1r_C_A": {
	    "peptide_chain": "GSVVIVGRIVLSGKPA",
	    "protein_chain": "VEGEVQIVSTATQTFLAT...",
	    "peptide_bindingProbability": "0.99 0.99 0.99 ...",
	    "protein_bindingProbability": "0.35 0.97 0.12 ..."
	  }
	}
	```

	- `peptide_chain` / `protein_chain` — actual residue sequences (padding stripped).
	- `*_bindingProbability` — space-separated 2-decimal floats, one per residue.
	Value is the raw class-1 (interface) softmax probability.
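
A minimal sketch of reading one entry in this format, assuming the JSON layout shown above. The helper name `interface_residues`, the toy entry, and the 0.5 cutoff are illustrative choices and not part of the GeoPep code.

```python
import json


def interface_residues(entry, chain="protein", threshold=0.5):
    """Return (index, residue, probability) for residues at or above threshold.

    `threshold` is an arbitrary illustrative cutoff, not a GeoPep default.
    """
    sequence = entry[f"{chain}_chain"]
    probs = [float(p) for p in entry[f"{chain}_bindingProbability"].split()]
    # zip stops at the shorter input, so a truncated probability string
    # yields fewer results instead of raising an IndexError.
    return [
        (i, aa, p)
        for i, (aa, p) in enumerate(zip(sequence, probs))
        if p >= threshold
    ]


# Toy entry in the same shape as result/predictions.json (values invented).
example = json.loads("""
{
  "toy_C_A": {
    "peptide_chain": "GSV",
    "protein_chain": "VEGE",
    "peptide_bindingProbability": "0.99 0.40 0.75",
    "protein_bindingProbability": "0.35 0.97 0.12 0.50"
  }
}
""")

hits = interface_residues(example["toy_C_A"], chain="protein")
print(hits)
```

For a real run, load the file with `json.load` instead of the inline string. Indices are 0-based positions in the stripped sequence, not PDB residue numbers.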

	## Architecture

	```
	input tokens [B, 553]              BOS + 551 residues + EOS
	      │
	ESM3 encoder
	      ↓ embeddings[:, 1:552, :]    drop BOS / EOS
	[B, 551, 1536]                     per-residue embeddings
	      ↓ reshape
	[B * 551, 1536]                    each residue independent
	      ↓
	KAN_1: 1536 → 1153
	KAN_2: 1153 → 770
	KAN_3: 770 → 387
	KAN_4: 387 → 3
	KAN_5: 3 → 3
	      ↓ reshape + permute
	logits [B, 3, 551]
	      ↓ softmax(dim=1)
	binding probability = softmax[:, 1, :]
	```

	The per-residue head is what distinguishes GeoPep from the original
	CLS-token approach: every residue's embedding passes through the KAN stack
	independently, which is intended to give sharper position-level predictions.
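
The final step of the diagram, a softmax over the class axis followed by selecting class 1, can be illustrated without ESM3 or the KAN layers. The following is a minimal pure-Python sketch for a single sample. `binding_probability` is a hypothetical helper, and the logits are invented for two positions rather than 551.

```python
import math


def binding_probability(logits):
    """Class-1 softmax probability for every position of one sample.

    logits: three rows (classes 0, 1, 2), each holding one value per position,
    i.e. the [3, L] slice of the [B, 3, L] output for a single batch item.
    """
    n_positions = len(logits[0])
    probs = []
    for j in range(n_positions):
        column = [logits[c][j] for c in range(3)]
        m = max(column)  # subtract the max for numerical stability
        exps = [math.exp(x - m) for x in column]
        probs.append(exps[1] / sum(exps))  # keep only class 1 (interface)
    return probs


toy_logits = [
    [2.0, 0.0],    # class 0: non-interface
    [0.0, 2.0],    # class 1: interface
    [-5.0, -5.0],  # class 2: padding
]
probs = binding_probability(toy_logits)
print([round(p, 2) for p in probs])
```

Because the softmax runs over all three classes, any probability mass the model assigns to the padding class lowers the reported class-1 value.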

	## Training

	- Dataset: peptide–protein complexes from the PDB, encoded as
	paired `complex/` (full structure) + `interface/` (binding residues only)
	PDB files.
	- Length filters: peptide ∈ [10, 50] residues, protein ∈ [10, 500] residues.
	- Loss:
	  - Per-half cross-entropy with class weights `[0.2, 0.8, 0.0]`
	    (padding contributes zero gradient).
	  - Differentiable distance loss
	    `L_dist = Σᵢ P_binding(i) · dist(i) / num_valid_residues`,
	    which penalizes high binding probability at residues far from the true
	    interface (distance computed from residue centers of mass).
	- Backbone: ESM3 is fine-tuned jointly with the KAN head.
	- Optimizer: AdamW, learning rate 1e-4, weight decay 1e-4.
	- Mixed precision: FP16.
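
To make the distance term concrete, here is a minimal pure-Python sketch of the `L_dist` formula given above. It is not the training code: the function name, the `valid` mask, and the decision to exclude padding positions from the sum are assumptions made for illustration, and the distances are invented numbers in arbitrary units.

```python
def distance_loss(p_binding, dist, valid):
    """L_dist = sum_i P_binding(i) * dist(i) / num_valid_residues.

    p_binding: class-1 probability per position
    dist:      distance of each position from the true interface
    valid:     False for padding positions (assumed to be excluded)
    """
    num_valid = sum(valid)
    if num_valid == 0:
        return 0.0
    total = sum(p * d for p, d, v in zip(p_binding, dist, valid) if v)
    return total / num_valid


dist = [0.0, 10.0, 4.0, 99.0]
valid = [True, True, True, False]  # last position is padding

# High probability on the true interface residue (distance 0).
good = distance_loss([0.9, 0.1, 0.5, 0.7], dist, valid)
# High probability on a residue far from the interface (distance 10).
bad = distance_loss([0.1, 0.9, 0.5, 0.7], dist, valid)
print(good, bad)
```

Moving probability mass from the residue at distance 0 to the one at distance 10 raises the loss, which is the behavior the formula describes.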

	## Input Format

	The model expects a single concatenated string of length 551 tokens:

	```
	PEPTIDESEQ<pad><pad>...|PROTEINSEQ<pad><pad>...
	|<------- 50 ------->|<-------- 500 ----------->|
	```

	| Position | Content |
	|---|---|
	| 0 .. 49 | Peptide (right-padded with `<pad>` to 50) |
	| 50 | Separator `\|` |
	| 51 .. 550 | Protein (right-padded with `<pad>` to 500) |

	The `inference_pipeline.py` script handles padding and tokenization for you —
	you only provide raw PDB files.
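
For readers who want to see the layout concretely, the sketch below builds the 551-slot sequence as a list of tokens. It assumes padding is placed after each sequence, as in the diagram above. `build_layout` is a hypothetical helper, the protein string is a toy sequence, and the real tokenization in `inference_pipeline.py` may differ.

```python
PEPTIDE_SLOTS = 50
PROTEIN_SLOTS = 500
PAD = "<pad>"
SEP = "|"


def build_layout(peptide, protein):
    """Return a list of 551 tokens: 50 peptide slots, separator, 500 protein slots."""
    if len(peptide) > PEPTIDE_SLOTS or len(protein) > PROTEIN_SLOTS:
        raise ValueError("sequence exceeds the 50 / 500 residue limits")
    # One token per residue, then <pad> tokens up to the fixed slot count.
    pep = list(peptide) + [PAD] * (PEPTIDE_SLOTS - len(peptide))
    prot = list(protein) + [PAD] * (PROTEIN_SLOTS - len(protein))
    return pep + [SEP] + prot


layout = build_layout("GSVVIVGRIVLSGKPA", "ACDEFGHIK")
print(len(layout), layout[50], layout[51])
```

A token list is used rather than a plain string because `<pad>` spans several characters, so string length would not match the token count.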

	## Limitations

	- Sequence-length cap: peptide must be ≤ 50 residues, protein ≤ 500
	residues. Sequences longer than this are silently skipped at the
	preprocessing stage.
	- Requires per-PDB chain annotation: filenames must follow the
	`<PDBID>_<peptide_chain>_<protein_chain>.pdb` convention so the pipeline
	knows which chain is the peptide.
	- No uncertainty calibration: raw softmax probabilities are not calibrated;
	use them as relative scores rather than absolute confidences.
	- Single peptide-protein pair per call: multi-peptide or multi-chain
	contexts are not supported in the standard pipeline.
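
Because over-length inputs are skipped silently, a pre-flight check can help explain a missing entry in the output. The sketch below is not part of the GeoPep pipeline: the regular expression is one reading of the filename convention (it assumes none of the three fields contains an underscore), and both helper names are hypothetical.

```python
import re
from pathlib import Path

# Assumed reading of <PDBID>_<peptide_chain>_<protein_chain>.pdb
FILENAME_RE = re.compile(
    r"^(?P<pdb_id>[^_]+)_(?P<peptide_chain>[^_]+)_(?P<protein_chain>[^_]+)\.pdb$"
)


def parse_pdb_filename(path):
    """Return the three filename fields as a dict, or None if the name does not match."""
    match = FILENAME_RE.match(Path(path).name)
    return match.groupdict() if match else None


def within_length_limits(peptide_len, protein_len, max_peptide=50, max_protein=500):
    """True if both chains fit the documented 50 / 500 residue caps."""
    return peptide_len <= max_peptide and protein_len <= max_protein


info = parse_pdb_filename("/path/to/pdb/1a1r_C_A.pdb")
print(info, within_length_limits(16, 300))
```

Chain lengths have to come from the structure itself, for example by counting residues per chain with a PDB parser of your choice.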

	## Citation

	If you use GeoPep in your work, please cite:

	```bibtex
	@misc{geopep2026,
	title = {GeoPep: Geometric-aware Peptide-Protein Binding Site Prediction},
	author = {Chen, Dian},
	year = {2026},
	howpublished = {\url{https://github.com/Dian0212/GeoPep}}
	}
	```

	## License

	MIT