---
license: apache-2.0
library_name: pytorch
language:
- en
tags:
- protein-pocket-detection
- esm2
- binding-site-prediction
---
# PockNet – Fusion Transformer (Selective SWA, multi-seed release)
## Model Summary
- **Architecture:** Fusion transformer combining tabular SAS descriptors with centred ESM2-3B residue embeddings, followed by k-NN attention over local neighbourhoods.
- **Checkpoint:** `selective_swa_epoch09_12.ckpt` (stochastic weight averaged blend of epochs 20–30).
- **Evaluation:** Release metrics aggregate **five** independently seeded SWA runs; per-seed artefacts live under `outputs/final_seed_sweep/`.
- **Input:** Optimised H5 datasets from `run_h5_generation_optimized.sh` (`tabular`, `esm`, `neighbour` tensors).
- **Output:** Residue-wise ligandability probabilities plus P2Rank-style pocket CSVs/visualisations.
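The fusion described above can be sketched roughly as follows. This is an illustrative sketch only: module names, dimensions (other than the ESM2-3B hidden size of 2560), and the exact fusion order are assumptions, not the release implementation.

```python
import torch
import torch.nn as nn

class FusionPocketSketch(nn.Module):
    """Illustrative sketch: tabular SAS descriptors and ESM2 embeddings are
    projected to a shared width, summed, and attended over each point's k-NN
    neighbourhood. Dimensions other than esm_dim=2560 are made up."""

    def __init__(self, tab_dim=42, esm_dim=2560, d_model=256, k=16, nhead=8):
        super().__init__()
        self.tab_proj = nn.Linear(tab_dim, d_model)
        self.esm_proj = nn.Linear(esm_dim, d_model)
        # one transformer encoder layer applied within each k-NN neighbourhood
        self.encoder = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.head = nn.Linear(d_model, 1)

    def forward(self, tab, esm):
        # tab: (N, k, tab_dim), esm: (N, k, esm_dim) — N neighbourhoods of k points
        x = self.tab_proj(tab) + self.esm_proj(esm)
        x = self.encoder(x)                       # attention within each neighbourhood
        logits = self.head(x[:, 0])               # score the centre point (assumed index 0)
        return torch.sigmoid(logits).squeeze(-1)  # residue-wise ligandability probability

model = FusionPocketSketch()
probs = model(torch.randn(4, 16, 42), torch.randn(4, 16, 2560))
print(probs.shape)  # torch.Size([4])
```

The real architecture lives in the repository's `src` tree; treat this only as a mental model of the data flow.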
## Intended Use & Limitations
| Intended Use | Notes |
|--------------|-------|
| Structure-based binding-pocket detection for academic or non-commercial research | Designed to reproduce and extend P2Rank experiments using BU48 and related datasets |
| Evaluation via the provided `auto-run` / `predict-dataset` orchestration | Ensures calibration, clustering, and reporting match the release scripts |
**Limitations**
- Trained on BU48-style protein chains with solvent-accessible surface sampling; transfer to radically different proteins is unverified.
- Requires pretrained ESM2-3B embeddings; ensure consistent preprocessing (chain-level `.pt` files) for best results.
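The on-disk format is defined by `src/tools/generate_esm2_embeddings.py`; purely as an illustration of "centred" chain-level embeddings, a `.pt` file might hold a per-residue tensor like this (the dict keys and the per-chain centring statistic are assumptions):

```python
import torch

# Hypothetical chain-level file layout: {"chain_id": str, "embeddings": (num_residues, 2560)}.
emb = torch.randn(120, 2560)  # stand-in for a loaded ESM2-3B chain (hidden size 2560)
torch.save({"chain_id": "1a4j_H", "embeddings": emb}, "1a4j_H.pt")

chain = torch.load("1a4j_H.pt")
e = chain["embeddings"]
# "Centred" embeddings: subtract a mean vector. Whether the mean is per-chain
# or dataset-wide is decided by the release preprocessing; per-chain shown here.
centred = e - e.mean(dim=0, keepdim=True)
print(centred.mean(dim=0).abs().max())  # ≈ 0: zero-mean per dimension
```

Whatever the exact layout, the key point is that inference-time embeddings must be produced and centred with the same preprocessing as training.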
## Training Data & Procedure
- **Datasets:** Training/validation draw from CHEN11 plus the full set of “joint” P2Rank datasets (directories under `data/p2rank-datasets/joined/*`) aggregated in `data/all_train.ds`. BU48 (48 apo/holo pairs) is held out exclusively for evaluation/testing.
- **Features:** `src/datagen/extract_protein_features.py` (tabular descriptors) + `src/datagen/merge_chainfix_complete.py`.
- **Embeddings:** `src/tools/generate_esm2_embeddings.py` (ESM2_t36_3B_UR50D).
- **H5 assembly:** `run_h5_generation_optimized.sh` → `data/h5/all_train_transformer_v2_optimized.h5` with neighbour tensors and split labels.
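The assembled H5 can be inspected with `h5py`. The snippet below builds a toy file with the same top-level layout for illustration; the real dataset names follow the `tabular`/`esm`/`neighbour` tensors mentioned above, but exact keys and shapes in the release file may differ.

```python
import h5py
import numpy as np

# Toy file mirroring the assumed top-level layout of the optimised H5.
with h5py.File("toy_optimized.h5", "w") as f:
    f.create_dataset("tabular", data=np.random.rand(100, 42).astype(np.float32))
    f.create_dataset("esm", data=np.random.rand(100, 2560).astype(np.float32))
    f.create_dataset("neighbour", data=np.random.randint(0, 100, (100, 16)))

with h5py.File("toy_optimized.h5", "r") as f:
    for name, dset in f.items():
        print(name, dset.shape, dset.dtype)
```

Running the same loop on `data/h5/all_train_transformer_v2_optimized.h5` shows the actual release layout.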
- **Training:** Run via the preferred entry point `python src/scripts/end_to_end_pipeline.py train-model -o experiment=fusion_transformer_aggressive ...`.
- **Multi-seed sweep:** Seeds `{13, 21, 34, 55, 89}` plus the reference `2025` run; SWA averages checkpoints from epochs 20–30.
- **Hardware:** 3× NVIDIA V100 (16 GB) for training, single V100 for inference/post-processing.
- **Logging:** PyTorch Lightning 2.5 + Hydra 1.3, W&B project `fusion_pocknet_thesis`.
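At its core, SWA amounts to a uniform average of checkpoint weights across the chosen epoch window. A minimal sketch of that averaging (checkpoint loading and key layout are assumptions; the release uses its own scripts):

```python
import torch

def average_state_dicts(state_dicts):
    """Uniformly average matching tensors across checkpoints (the core of SWA)."""
    avg = {}
    for key in state_dicts[0]:
        avg[key] = torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
    return avg

# Toy demonstration with two fake "epoch checkpoints".
sd_a = {"layer.weight": torch.ones(2, 2)}
sd_b = {"layer.weight": torch.full((2, 2), 3.0)}
swa = average_state_dicts([sd_a, sd_b])
print(swa["layer.weight"])  # all entries 2.0
```

PyTorch also ships `torch.optim.swa_utils.AveragedModel` for doing this during training; the function above just makes the arithmetic explicit.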
## Metrics
### Point-level (single-seed SWA checkpoint)
| Metric | Value | Split |
| --- | --- | --- |
| IoU | 0.2950 | BU48 (test) |
| PR-AUC | 0.414 | BU48 (test) |
| ROC-AUC | 0.944 | BU48 (test) |
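These point-level metrics can be computed from per-residue probabilities and binary labels. A sketch with scikit-learn on toy data (the IoU threshold of 0.5 and the use of average precision as PR-AUC are assumptions about the release evaluation):

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score, jaccard_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)                          # toy binary labels
y_prob = np.clip(y_true * 0.6 + rng.random(1000) * 0.5, 0, 1)   # toy scores

pr_auc = average_precision_score(y_true, y_prob)          # PR-AUC (average precision)
roc_auc = roc_auc_score(y_true, y_prob)                   # ROC-AUC
iou = jaccard_score(y_true, (y_prob >= 0.5).astype(int))  # IoU at an assumed 0.5 cut
print(f"PR-AUC={pr_auc:.3f} ROC-AUC={roc_auc:.3f} IoU={iou:.3f}")
```

Feeding the model's BU48 predictions through the same calls should recover numbers of the order shown in the table.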
### Pocket-level (5-seed aggregated release, DBSCAN post-processing)
| Metric | Mean | 95 % CI | Notes |
| --- | --- | --- | --- |
| Mean IoU | 0.1276 | ±0.0124 | Average pocket IoU across BU48 |
| Best IoU (oracle) | 0.1580 | ±0.0141 | Max IoU per protein |
| GT Coverage | 0.8979 | ±0.0057 | Fraction of GT pockets matched |
| Avg pockets / protein | 6.37 | ±0.87 | Post-threshold pockets |
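The mean ± 95 % CI entries aggregate per-seed results. One common way to compute such an interval over n = 5 seeds is a two-sided t interval; the per-seed values below are hypothetical stand-ins, and the exact formula used for the release lives in the repo scripts.

```python
import numpy as np
from scipy import stats

seed_ious = np.array([0.121, 0.135, 0.124, 0.130, 0.128])  # hypothetical per-seed means
n = len(seed_ious)
mean = seed_ious.mean()
sem = seed_ious.std(ddof=1) / np.sqrt(n)          # standard error of the mean
half_width = stats.t.ppf(0.975, df=n - 1) * sem   # 95% two-sided t interval
print(f"{mean:.4f} ± {half_width:.4f}")
```

Replacing `seed_ious` with the values from `outputs/final_seed_sweep/*.csv` reproduces the table's intervals (up to the release's choice of interval formula).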
Success rates (DBSCAN, `eps=3.0`, `min_samples=5`, score threshold 0.91):
- **DCA success@1:** 75 %
- **DCC success@1:** 39 %
- **DCA success@3:** 89 %
- **DCC success@3:** 50 %
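The post-processing applies the score threshold first and then clusters the surviving points with DBSCAN. A sketch with the release parameters on toy data (the coordinate source and score column are assumptions):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(1)
coords = rng.uniform(0, 50, size=(500, 3))  # toy surface-point coordinates (Å)
scores = rng.random(500)                    # toy ligandability probabilities

keep = scores >= 0.91                       # release score threshold
labels = DBSCAN(eps=3.0, min_samples=5).fit_predict(coords[keep])
n_pockets = labels.max() + 1                # DBSCAN labels noise points as -1
print(f"{keep.sum()} points kept, {n_pockets} pockets")
```

Each surviving cluster becomes one predicted pocket; DCA/DCC success rates are then measured from pocket centres against the ligand geometry.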
Refer to `outputs/final_seed_sweep/*.csv` for the exact release numbers cited by
the thesis (Chapters 5–7 and Appendix 91).
## How to Use
### 1. Download with `huggingface_hub`
```python
from huggingface_hub import hf_hub_download
ckpt_path = hf_hub_download("lal3lu03/PockNet", "selective_swa_epoch09_12.ckpt")
print(ckpt_path) # local file path
```
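The downloaded file can be inspected with plain PyTorch. Lightning checkpoints typically nest weights under a `state_dict` key; the toy checkpoint below only illustrates that layout, and the real file's keys may vary.

```python
import torch

# Toy Lightning-style checkpoint to show the typical layout; with the real file,
# pass the path returned by hf_hub_download instead.
torch.save({"epoch": 12, "state_dict": {"head.weight": torch.zeros(1, 8)}}, "toy.ckpt")

ckpt = torch.load("toy.ckpt", map_location="cpu")
weights = ckpt.get("state_dict", ckpt)  # fall back if weights are top-level
print(sorted(ckpt.keys()), len(weights), "tensor(s)")
```

For actual inference, prefer the pipeline commands below, which handle model construction and preprocessing for you.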
### 2. Run the end-to-end pipeline (CLI / Docker)
Preferred CLI workflow:
```bash
python src/scripts/end_to_end_pipeline.py predict-dataset \
--checkpoint /path/to/selective_swa_epoch09_12.ckpt \
--h5 data/h5/all_train_transformer_v2_optimized.h5 \
--csv data/vectorsTrain_all_chainfix.csv \
--output outputs/bu48_release
```
Or inside Docker:
```bash
make docker-run ARGS="predict-dataset --checkpoint /ckpts/best.ckpt --h5 /data/h5/all_train_transformer_v2_optimized.h5 --csv /data/vectorsTrain_all_chainfix.csv --output /logs/bu48_release"
```
### 3. Single-protein inference
If you already have an H5 + vectors CSV and want to inspect a single structure:
```bash
python src/scripts/end_to_end_pipeline.py predict-pdb 1a4j_H \
--checkpoint /path/to/selective_swa_epoch09_12.ckpt \
--h5 data/h5/all_train_transformer_v2_optimized.h5 \
--csv data/vectorsTrain_all_chainfix.csv \
--output outputs/pocknet_single_1a4j
```
## Files Included in the Hugging Face Repo
- `selective_swa_epoch09_12.ckpt` – release checkpoint
- `MODEL_CARD.md` – this document
All supporting scripts (`src/scripts/end_to_end_pipeline.py`, Dockerfile,
data-generation tooling, notebooks) and artefacts (`outputs/final_seed_sweep/*`,
figures, thesis sources) remain in the public GitHub repository:
<https://github.com/lal3lu03/PockNet>. Refer there for full reproducibility
instructions, figures, and provenance logs.
## Citation
If you use PockNet in your work, please cite:
```bibtex
@misc{lal3lu03_pocknet_2025,
  title  = {PockNet Fusion Transformer Release},
  author = {Hageneder, Max},
  year   = {2025},
  url    = {https://huggingface.co/lal3lu03/PockNet}
}
```
## License
Apache License 2.0. Refer to the repository `LICENSE` for full terms and ensure compliance with upstream dataset/ESM2 licenses when redistributing.