---
license: apache-2.0
library_name: pytorch
language:
  - en
tags:
  - protein-pocket-detection
  - esm2
  - binding-site-prediction
---

# PockNet – Fusion Transformer (Selective SWA, multi-seed release)

## Model Summary

- **Architecture:** Fusion transformer combining tabular SAS descriptors with centred ESM2-3B residue embeddings, followed by k-NN attention over local neighbourhoods.  
- **Checkpoint:** `selective_swa_epoch09_12.ckpt` (a stochastic weight averaging (SWA) blend of epochs 20–30).  
- **Evaluation:** Release metrics aggregate **five** independently seeded SWA runs; per-seed artefacts live under `outputs/final_seed_sweep/`.  
- **Input:** Optimised H5 datasets from `run_h5_generation_optimized.sh` (`tabular`, `esm`, `neighbour` tensors).  
- **Output:** Residue-wise ligandability probabilities plus P2Rank-style pocket CSVs/visualisations.
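The centring and k-NN neighbourhood steps in the summary above can be sketched as follows. This is illustrative only: the tensor shapes, the value of `k`, and the use of raw Euclidean distance over residue coordinates are assumptions, not the release configuration.

```python
import torch

def centre_embeddings(emb: torch.Tensor) -> torch.Tensor:
    """Mean-centre per-chain residue embeddings (illustrative)."""
    return emb - emb.mean(dim=0, keepdim=True)

def knn_neighbours(coords: torch.Tensor, k: int) -> torch.Tensor:
    """Indices of the k nearest residues for each residue (Euclidean)."""
    dist = torch.cdist(coords, coords)                      # (N, N) pairwise distances
    return dist.topk(k + 1, largest=False).indices[:, 1:]   # drop self (distance 0)

emb = torch.randn(50, 2560)    # toy stand-in for ESM2-3B residue embeddings
coords = torch.randn(50, 3)    # toy residue coordinates
centred = centre_embeddings(emb)
idx = knn_neighbours(coords, k=8)   # (50, 8) neighbour indices
neigh = centred[idx]                # (50, 8, 2560) gathered neighbour features
```

The gathered `neigh` tensor is the kind of local-neighbourhood input that a k-NN attention layer would consume alongside the per-residue tabular descriptors.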

## Intended Use & Limitations

| Intended Use | Notes |
|--------------|-------|
| Structure-based binding-pocket detection for academic or non-commercial research | Designed to reproduce and extend P2Rank experiments using BU48 and related datasets |
| Evaluation via the provided `auto-run` / `predict-dataset` orchestration | Ensures calibration, clustering, and reporting match the release scripts |

**Limitations**
- Trained on BU48-style protein chains with solvent-accessible surface sampling; transfer to radically different proteins is unverified.
- Requires pretrained ESM2-3B embeddings; ensure consistent preprocessing (chain-level `.pt` files) for best results.

## Training Data & Procedure

- **Datasets:** Training/validation draw from CHEN11 plus the full set of “joint” P2Rank datasets (directories under `data/p2rank-datasets/joined/*`) aggregated in `data/all_train.ds`. BU48 (48 apo/holo pairs) is held out exclusively for evaluation/testing.  
- **Features:** `src/datagen/extract_protein_features.py` (tabular descriptors) + `src/datagen/merge_chainfix_complete.py`.  
- **Embeddings:** `src/tools/generate_esm2_embeddings.py` (ESM2_t36_3B_UR50D).  
- **H5 assembly:** `run_h5_generation_optimized.sh` → `data/h5/all_train_transformer_v2_optimized.h5` with neighbour tensors and split labels.  
- **Training:** Preferred via `python src/scripts/end_to_end_pipeline.py train-model -o experiment=fusion_transformer_aggressive ...`.  
- **Multi-seed sweep:** Seeds `{13, 21, 34, 55, 89}` plus the reference `2025` run; SWA averages checkpoints from epochs 20–30.  
- **Hardware:** 3× NVIDIA V100 (16 GB) for training, single V100 for inference/post-processing.  
- **Logging:** PyTorch Lightning 2.5 + Hydra 1.3, W&B project `fusion_pocknet_thesis`.
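The SWA step in the sweep above amounts to averaging parameter tensors across the per-epoch checkpoints. A minimal sketch with toy state dicts, assuming a uniform average (the actual Lightning SWA callback handles buffers and batch-norm statistics, which this omits):

```python
import torch

def average_checkpoints(state_dicts):
    """Uniformly average parameter tensors across checkpoints (SWA-style)."""
    avg = {}
    for key in state_dicts[0]:
        stacked = torch.stack([sd[key].float() for sd in state_dicts])
        avg[key] = stacked.mean(dim=0)
    return avg

# toy checkpoints standing in for the epoch-20..30 Lightning state dicts
ckpts = [{"w": torch.full((2, 2), float(i))} for i in range(3)]
swa = average_checkpoints(ckpts)   # swa["w"] is the element-wise mean
```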

## Metrics

### Point-level (single-seed SWA checkpoint)

| Metric | Value | Split |
| --- | --- | --- |
| IoU | 0.2950 | BU48 (test) |
| PR-AUC | 0.414 | BU48 (test) |
| ROC-AUC | 0.944 | BU48 (test) |

### Pocket-level (5-seed aggregated release, DBSCAN post-processing)

| Metric | Mean | 95 % CI | Notes |
| --- | --- | --- | --- |
| Mean IoU | 0.1276 | ±0.0124 | Average pocket IoU across BU48 |
| Best IoU (oracle) | 0.1580 | ±0.0141 | Max IoU per protein |
| GT Coverage | 0.8979 | ±0.0057 | Fraction of GT pockets matched |
| Avg pockets / protein | 6.37 | ±0.87 | Post-threshold pockets |

Success rates (DBSCAN, `eps=3.0`, `min_samples=5`, score threshold 0.91):

- **DCA success@1:** 75 %  
- **DCC success@1:** 39 %  
- **DCA success@3:** 89 %  
- **DCC success@3:** 50 %
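The DBSCAN post-processing with the hyperparameters listed above can be sketched like this. The synthetic coordinates, the exact feature space, and the pocket-matching logic are assumptions; only the thresholding-then-clustering pattern and the `eps`/`min_samples`/score-threshold values come from the card.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_pockets(coords, scores, eps=3.0, min_samples=5, threshold=0.91):
    """Threshold residue scores, then cluster surviving points into pockets."""
    keep = scores >= threshold
    pts = coords[keep]
    if len(pts) == 0:
        return []
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(pts)
    # group points per pocket, ignoring DBSCAN noise (label -1)
    return [pts[labels == lab] for lab in sorted(set(labels)) if lab != -1]

rng = np.random.default_rng(0)
coords = np.vstack([rng.normal(0, 0.5, (20, 3)),    # two well-separated
                    rng.normal(20, 0.5, (20, 3))])  # point clouds
scores = np.full(40, 0.95)
pockets = cluster_pockets(coords, scores)
```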

Refer to `outputs/final_seed_sweep/*.csv` for the exact release numbers cited by
the thesis (Chapters 5–7 and Appendix 91).

## How to Use

### 1. Download with `huggingface_hub`
```python
from huggingface_hub import hf_hub_download
ckpt_path = hf_hub_download("lal3lu03/PockNet", "selective_swa_epoch09_12.ckpt")
print(ckpt_path)  # local file path
```
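Before wiring the downloaded file into the pipeline, the checkpoint can be sanity-checked by unwrapping it. This is a sketch assuming the standard PyTorch Lightning layout (a top-level `state_dict` key); the toy checkpoint below stands in for the real `selective_swa_epoch09_12.ckpt`.

```python
import os
import tempfile
import torch

def load_state_dict(ckpt_path: str) -> dict:
    """Return the parameter dict from a (possibly Lightning-wrapped) checkpoint."""
    ckpt = torch.load(ckpt_path, map_location="cpu")
    # Lightning nests weights under "state_dict"; fall back to a raw dict
    return ckpt.get("state_dict", ckpt) if isinstance(ckpt, dict) else ckpt

# demo on a toy checkpoint file
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "toy.ckpt")
    torch.save({"state_dict": {"w": torch.zeros(2)}}, path)
    sd = load_state_dict(path)   # {"w": tensor([0., 0.])}
```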

### 2. Run the end-to-end pipeline (CLI / Docker)

Preferred CLI workflow:

```bash
python src/scripts/end_to_end_pipeline.py predict-dataset \
  --checkpoint /path/to/selective_swa_epoch09_12.ckpt \
  --h5 data/h5/all_train_transformer_v2_optimized.h5 \
  --csv data/vectorsTrain_all_chainfix.csv \
  --output outputs/bu48_release
```

Or inside Docker:
```bash
make docker-run ARGS="predict-dataset --checkpoint /ckpts/best.ckpt --h5 /data/h5/all_train_transformer_v2_optimized.h5 --csv /data/vectorsTrain_all_chainfix.csv --output /logs/bu48_release"
```

### 3. Single-protein inference

If you already have an H5 + vectors CSV and want to inspect a single structure:

```bash
python src/scripts/end_to_end_pipeline.py predict-pdb 1a4j_H \
  --checkpoint /path/to/selective_swa_epoch09_12.ckpt \
  --h5 data/h5/all_train_transformer_v2_optimized.h5 \
  --csv data/vectorsTrain_all_chainfix.csv \
  --output outputs/pocknet_single_1a4j
```

## Files Included in the Hugging Face Repo

- `selective_swa_epoch09_12.ckpt` – release checkpoint
- `MODEL_CARD.md` – this document

All supporting scripts (`src/scripts/end_to_end_pipeline.py`, Dockerfile,
data-generation tooling, notebooks) and artefacts (`outputs/final_seed_sweep/*`,
figures, thesis sources) remain in the public GitHub repository:
<https://github.com/lal3lu03/PockNet>. Refer there for full reproducibility
instructions, figures, and provenance logs.

## Citation

If you use PockNet in your work, please cite:

```
@misc{lal3lu03_pocknet_2025,
  title   = {PockNet Fusion Transformer Release},
  author  = {Hageneder, Max},
  year    = {2025},
  url     = {https://huggingface.co/lal3lu03/PockNet}
}
```

## License

Apache License 2.0. Refer to the repository `LICENSE` for full terms and ensure compliance with upstream dataset/ESM2 licenses when redistributing.