---
license: apache-2.0
library_name: pytorch
language:
- en
tags:
- protein-pocket-detection
- esm2
- binding-site-prediction
---
# PockNet – Fusion Transformer (Selective SWA, multi-seed release)
## Model Summary
- **Architecture:** Fusion transformer combining tabular SAS descriptors with centred ESM2-3B residue embeddings, followed by k-NN attention over local neighbourhoods.
- **Checkpoint:** `selective_swa_epoch09_12.ckpt` (stochastic weight average (SWA) of checkpoints from epochs 20–30).
- **Evaluation:** Release metrics aggregate **five** independently seeded SWA runs; per-seed artefacts live under `outputs/final_seed_sweep/`.
- **Input:** Optimised H5 datasets from `run_h5_generation_optimized.sh` (`tabular`, `esm`, `neighbour` tensors).
- **Output:** Residue-wise ligandability probabilities plus P2Rank-style pocket CSVs/visualisations.
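The pocket CSVs can be consumed with nothing more than the standard library. The sketch below parses a hypothetical P2Rank-style file; the column names (`name`, `rank`, `score`, `center_x/y/z`) are assumptions for illustration, so check the headers of the actual release output before relying on them.

```python
import csv
import io

# Hypothetical P2Rank-style pocket CSV; column names are assumptions,
# not confirmed by this release -- inspect the real output headers.
sample = """name,rank,score,center_x,center_y,center_z
pocket1,1,0.97,12.3,-4.1,8.8
pocket2,2,0.58,30.2,15.6,-2.4
"""

def load_pockets(text):
    """Parse pocket rows and return them sorted by descending score."""
    rows = list(csv.DictReader(io.StringIO(text)))
    for r in rows:
        r["score"] = float(r["score"])
        r["center"] = tuple(float(r[f"center_{ax}"]) for ax in "xyz")
    return sorted(rows, key=lambda r: -r["score"])

pockets = load_pockets(sample)
```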
## Intended Use & Limitations
| Intended Use | Notes |
|--------------|-------|
| Structure-based binding-pocket detection for academic or non-commercial research | Designed to reproduce and extend P2Rank experiments using BU48 and related datasets |
| Evaluation via the provided `auto-run` / `predict-dataset` orchestration | Ensures calibration, clustering, and reporting match the release scripts |
**Limitations**
- Trained on BU48-style protein chains with solvent-accessible surface sampling; transfer to radically different proteins is unverified.
- Requires pretrained ESM2-3B embeddings; ensure consistent preprocessing (chain-level `.pt` files) for best results.
## Training Data & Procedure
- **Datasets:** Training/validation draw from CHEN11 plus the full set of “joint” P2Rank datasets (directories under `data/p2rank-datasets/joined/*`) aggregated in `data/all_train.ds`. BU48 (48 apo/holo pairs) is held out exclusively for evaluation/testing.
- **Features:** `src/datagen/extract_protein_features.py` (tabular descriptors) + `src/datagen/merge_chainfix_complete.py`.
- **Embeddings:** `src/tools/generate_esm2_embeddings.py` (ESM2_t36_3B_UR50D).
- **H5 assembly:** `run_h5_generation_optimized.sh` → `data/h5/all_train_transformer_v2_optimized.h5` with neighbour tensors and split labels.
- **Training:** Preferred via `python src/scripts/end_to_end_pipeline.py train-model -o experiment=fusion_transformer_aggressive ...`.
- **Multi-seed sweep:** Seeds `{13, 21, 34, 55, 89}` plus the reference `2025` run; SWA averages checkpoints from epochs 20–30.
- **Hardware:** 3× NVIDIA V100 (16 GB) for training, single V100 for inference/post-processing.
- **Logging:** PyTorch Lightning 2.5 + Hydra 1.3, W&B project `fusion_pocknet_thesis`.
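Before training or inference it can help to sanity-check the assembled H5 file. A minimal inspection sketch with `h5py` follows; the dataset names (`tabular`, `esm`, `neighbour`) come from the model summary above, but the exact layout inside the file is an assumption, so list the keys rather than hard-coding them.

```python
import h5py

def summarise_h5(path):
    """Return {dataset_path: shape} for every dataset in an H5 file."""
    shapes = {}
    with h5py.File(path, "r") as f:
        # visititems walks groups recursively; keep only datasets.
        f.visititems(lambda name, obj: shapes.update(
            {name: obj.shape} if isinstance(obj, h5py.Dataset) else {}))
    return shapes

# Example (path from the release scripts):
# summarise_h5("data/h5/all_train_transformer_v2_optimized.h5")
```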
## Metrics
### Point-level (single-seed SWA checkpoint)
| Metric | Value | Split |
| --- | --- | --- |
| IoU | 0.2950 | BU48 (test) |
| PR-AUC | 0.414 | BU48 (test) |
| ROC-AUC | 0.944 | BU48 (test) |
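The point-level metrics above can be reproduced from residue-wise probabilities with scikit-learn; here is a minimal sketch. The 0.5 binarisation threshold for IoU is an assumption for illustration, as the release pipeline may apply its own calibrated cut-off.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

def point_metrics(y_true, y_prob, threshold=0.5):
    """Point-level IoU / PR-AUC / ROC-AUC from residue probabilities.

    threshold=0.5 is an assumed binarisation cut-off for IoU; the
    release scripts may calibrate a different value.
    """
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    inter = np.logical_and(y_true, y_pred).sum()
    union = np.logical_or(y_true, y_pred).sum()
    return {
        "iou": inter / union if union else 0.0,
        "pr_auc": average_precision_score(y_true, y_prob),
        "roc_auc": roc_auc_score(y_true, y_prob),
    }
```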
### Pocket-level (5-seed aggregated release, DBSCAN post-processing)
| Metric | Mean | 95 % CI | Notes |
| --- | --- | --- | --- |
| Mean IoU | 0.1276 | ±0.0124 | Average pocket IoU across BU48 |
| Best IoU (oracle) | 0.1580 | ±0.0141 | Max IoU per protein |
| GT Coverage | 0.8979 | ±0.0057 | Fraction of GT pockets matched |
| Avg pockets / protein | 6.37 | ±0.87 | Post-threshold pockets |
Success rates (DBSCAN, `eps=3.0`, `min_samples=5`, score threshold 0.91):
- **DCA success@1:** 75 %
- **DCC success@1:** 39 %
- **DCA success@3:** 89 %
- **DCC success@3:** 50 %
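For reference, DCA success@k asks whether any of the top-k ranked pocket centres lies within a distance cut-off of the closest ligand atom. The sketch below implements that criterion; the 4 Å cut-off follows the usual P2Rank/DCA convention and is an assumption here, not a value read from the release scripts.

```python
import numpy as np

def dca_success_at_k(pocket_centers, ligand_atoms, k=1, cutoff=4.0):
    """DCA success@k: is any top-k pocket centre within `cutoff`
    angstroms of the nearest ligand atom?

    cutoff=4.0 follows the common DCA convention (an assumption).
    Pocket centres must already be sorted by descending score.
    """
    centers = np.asarray(pocket_centers, dtype=float)[:k]
    atoms = np.asarray(ligand_atoms, dtype=float)
    # Pairwise centre-to-atom distances, shape (k, n_atoms).
    dists = np.linalg.norm(centers[:, None, :] - atoms[None, :, :], axis=-1)
    return bool((dists.min(axis=1) <= cutoff).any())
```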
Refer to `outputs/final_seed_sweep/*.csv` for the exact release numbers cited by
the thesis (Chapters 5–7 and Appendix 91).
## How to Use
### 1. Download with `huggingface_hub`
```python
from huggingface_hub import hf_hub_download
ckpt_path = hf_hub_download("lal3lu03/PockNet", "selective_swa_epoch09_12.ckpt")
print(ckpt_path) # local file path
```
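A PyTorch Lightning checkpoint is a pickled dictionary; `"state_dict"` is the standard key, while other entries (`epoch`, `hyper_parameters`, ...) vary by Lightning version. A small inspection sketch, under those assumptions:

```python
import torch

def inspect_checkpoint(path):
    """Load a Lightning checkpoint on CPU and summarise its contents.

    "state_dict" is the standard Lightning key; other top-level keys
    vary by version, so inspect rather than hard-code them.
    """
    ckpt = torch.load(path, map_location="cpu")
    return sorted(ckpt.keys()), len(ckpt.get("state_dict", {}))

# keys, n_tensors = inspect_checkpoint(ckpt_path)
```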
### 2. Run the end-to-end pipeline (CLI / Docker)
Preferred CLI workflow:
```bash
python src/scripts/end_to_end_pipeline.py predict-dataset \
--checkpoint /path/to/selective_swa_epoch09_12.ckpt \
--h5 data/h5/all_train_transformer_v2_optimized.h5 \
--csv data/vectorsTrain_all_chainfix.csv \
--output outputs/bu48_release
```
Or inside Docker:
```bash
make docker-run ARGS="predict-dataset --checkpoint /ckpts/best.ckpt --h5 /data/h5/all_train_transformer_v2_optimized.h5 --csv /data/vectorsTrain_all_chainfix.csv --output /logs/bu48_release"
```
### 3. Single-protein inference
If you already have an H5 + vectors CSV and want to inspect a single structure:
```bash
python src/scripts/end_to_end_pipeline.py predict-pdb 1a4j_H \
--checkpoint /path/to/selective_swa_epoch09_12.ckpt \
--h5 data/h5/all_train_transformer_v2_optimized.h5 \
--csv data/vectorsTrain_all_chainfix.csv \
--output outputs/pocknet_single_1a4j
```
## Files Included in the Hugging Face Repo
- `selective_swa_epoch09_12.ckpt` – release checkpoint
- `MODEL_CARD.md` – this document
All supporting scripts (`src/scripts/end_to_end_pipeline.py`, Dockerfile,
data-generation tooling, notebooks) and artefacts (`outputs/final_seed_sweep/*`,
figures, thesis sources) remain in the public GitHub repository:
<https://github.com/lal3lu03/PockNet>. Refer there for full reproducibility
instructions, figures, and provenance logs.
## Citation
If you use PockNet in your work, please cite:
```bibtex
@misc{lal3lu03_pocknet_2025,
  title  = {PockNet Fusion Transformer Release},
  author = {Hageneder, Max},
  year   = {2025},
  url    = {https://huggingface.co/lal3lu03/PockNet}
}
```
## License
Apache License 2.0. Refer to the repository `LICENSE` for full terms and ensure compliance with upstream dataset/ESM2 licenses when redistributing.