Commit e35871f (verified) by lal3lu03 · 1 parent: a58dee0

Update README.md

Files changed (1): README.md (+62 -31)
@@ -9,15 +9,15 @@ tags:
9
  - binding-site-prediction
10
  ---
11
 
12
- # PockNet – Selective SWA Epoch09_12
13
 
14
  ## Model Summary
15
 
16
- - **Architecture:** Fusion transformer combining tabular SAS descriptors with ESM2-3B residue embeddings
17
- - **Checkpoint:** `selective_swa_epoch09_12.ckpt` (SWA blend of epoch 09 baseline and epoch 12 finetune)
18
- - **Input:** H5 files generated via `generate_h5_v2_optimized.py` (contains tabular + ESM tensors)
19
- - **Output:** Residue-wise ligandability probabilities + pocket clusters (P2Rank-style CSV/visualisations)
20
- - **Tasks:** Protein binding-pocket detection / ligandability ranking
21
 
22
  ## Intended Use & Limitations
23
 
@@ -32,23 +32,43 @@ tags:
32
 
33
  ## Training Data & Procedure
34
 
35
- - **Datasets:** BU48 plus auxiliary P2Rank-style splits encoded via `.ds` manifests.
36
- - **Features:** Generated using `src/datagen/extract_protein_features.py` and merged with chain-fixed CSVs.
37
- - **Embeddings:** `generate_esm2_embeddings.py` (ESM2_t36_3B_UR50D) per chain.
38
- - **H5 assembly:** `generate_h5_v2_optimized.py` storing tabular features, embeddings, neighbour tensors, and split labels.
39
- - **Training:** `python src/train.py experiment=fusion_transformer_aggressive_oct17 ...`
40
- - **Checkpoint selection:** SWA blend (50/50) between epoch 09 baseline and epoch 12 finetune, validated on held-out BU48.
 
 
41
 
42
  ## Metrics
43
 
44
- | Metric | Value | Notes |
45
- |--------|-------|-------|
46
- | Validation AUPRC | ~0.31 | On BU48 validation split |
47
- | Test AUPRC | ~0.445 | Single-seed evaluation on BU48 test split |
48
- | DCA Success@1 | 75% | From P2Rank-like DBSCAN analysis |
49
- | DCC Success@1 | 39% | From P2Rank-like DBSCAN analysis |
50
 
51
- Refer to `outputs/pocknet_eval_run*/summary/summary.csv` for the exact values produced by the release pipeline.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
52
 
53
  ## How to Use
54
 
@@ -59,30 +79,45 @@ ckpt_path = hf_hub_download("lal3lu03/PockNet", "selective_swa_epoch09_12.ckpt")
59
  print(ckpt_path) # local file path
60
  ```
61
 
62
- ### 2. Run the end-to-end pipeline (Docker / local)
 
 
 
63
  ```bash
64
- python src/scripts/end_to_end_pipeline.py auto-run data/bu48.ds \
65
  --checkpoint /path/to/selective_swa_epoch09_12.ckpt \
 
 
66
  --output outputs/bu48_release
67
  ```
68
 
69
- The command creates all intermediate artefacts (`features/`, `embeddings/`, `h5/`) and writes pockets + metrics under `<output>/predictions`.
 
 
 
 
 
 
 
70
 
71
- ### 3. Direct dataset inference
72
- If you already have an H5 + vectors CSV:
73
  ```bash
74
- python src/scripts/end_to_end_pipeline.py predict-dataset \
75
  --checkpoint /path/to/selective_swa_epoch09_12.ckpt \
76
  --h5 data/h5/all_train_transformer_v2_optimized.h5 \
77
  --csv data/vectorsTrain_all_chainfix.csv \
78
- --output outputs/pocknet_eval_cli
79
  ```
80
 
81
  ## Files Included in the Hugging Face Repo
82
 
83
  - `selective_swa_epoch09_12.ckpt` – release checkpoint
84
  - `MODEL_CARD.md` – this document
85
- - (Optional) auxiliary scripts / instructions for inference
 
 
 
 
 
86
 
87
  ## Citation
88
 
@@ -100,7 +135,3 @@ If you use PockNet in your work, please cite:
100
  ## License
101
 
102
  Apache License 2.0. Refer to the repository `LICENSE` for full terms and ensure compliance with upstream dataset/ESM2 licenses when redistributing.
103
- # PockNet – Selective SWA Epoch09_12
104
- ---
105
- license: apache-2.0
106
- # PockNet – Selective SWA Epoch09_12
 
9
  - binding-site-prediction
10
  ---
11
 
12
+ # PockNet – Fusion Transformer (Selective SWA, multi-seed release)
13
 
14
  ## Model Summary
15
 
16
+ - **Architecture:** Fusion transformer combining tabular SAS descriptors with centred ESM2-3B residue embeddings, followed by k-NN attention over local neighbourhoods.
17
+ - **Checkpoint:** `selective_swa_epoch09_12.ckpt` (stochastic weight averaged blend of epochs 20–30).
18
+ - **Evaluation:** Release metrics aggregate **five** independently seeded SWA runs; per-seed artefacts live under `outputs/final_seed_sweep/`.
19
+ - **Input:** Optimised H5 datasets from `run_h5_generation_optimized.sh` (`tabular`, `esm`, `neighbour` tensors).
20
+ - **Output:** Residue-wise ligandability probabilities plus P2Rank-style pocket CSVs/visualisations.
21
 
22
  ## Intended Use & Limitations
23
 
 
32
 
33
  ## Training Data & Procedure
34
 
35
+ - **Datasets:** Training/validation draw from CHEN11 plus the full set of “joint” P2Rank datasets (directories under `data/p2rank-datasets/joined/*`) aggregated in `data/all_train.ds`. BU48 (48 apo/holo pairs) is held out exclusively for evaluation/testing.
36
+ - **Features:** `src/datagen/extract_protein_features.py` (tabular descriptors), then merged with the chain-fixed CSVs via `src/datagen/merge_chainfix_complete.py`.
37
+ - **Embeddings:** `src/tools/generate_esm2_embeddings.py` (ESM2_t36_3B_UR50D).
38
+ - **H5 assembly:** `run_h5_generation_optimized.sh` writes `data/h5/all_train_transformer_v2_optimized.h5`, including neighbour tensors and split labels.
39
+ - **Training:** Preferred via `python src/scripts/end_to_end_pipeline.py train-model -o experiment=fusion_transformer_aggressive ...`.
40
+ - **Multi-seed sweep:** Seeds `{13, 21, 34, 55, 89}` plus the reference `2025` run; SWA averages checkpoints from epochs 20–30.
41
+ - **Hardware:** 3× NVIDIA V100 (16 GB) for training, single V100 for inference/post-processing.
42
+ - **Logging:** PyTorch Lightning 2.5 + Hydra 1.3, W&B project `fusion_pocknet_thesis`.
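The multi-seed bullet above says SWA averages checkpoints from epochs 20–30. As a rough illustration of what such averaging does, here is a minimal sketch that assumes equal weighting across checkpoints; the card does not state the actual weighting or selection rule, and `average_state_dicts` and the toy parameter names are hypothetical, not part of the PockNet code base.

```python
from typing import Dict, List


def average_state_dicts(states: List[Dict[str, List[float]]]) -> Dict[str, List[float]]:
    """Average each parameter element-wise across checkpoints (equal weights assumed)."""
    if not states:
        raise ValueError("need at least one checkpoint")
    averaged = {}
    for key in states[0]:
        # zip(*...) pairs up the i-th element of this parameter from every checkpoint
        columns = zip(*(state[key] for state in states))
        averaged[key] = [sum(col) / len(states) for col in columns]
    return averaged


# Toy stand-ins for per-epoch checkpoints (illustrative values only)
epoch_states = [
    {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]},
    {"layer.weight": [3.0, 4.0], "layer.bias": [1.0]},
]
swa_state = average_state_dicts(epoch_states)
print(swa_state)
```

With real checkpoints the same idea applies tensor by tensor; the plain lists here only keep the sketch dependency-free.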
43
 
44
  ## Metrics
45
 
46
+ ### Point-level (single-seed SWA checkpoint)
 
 
 
 
 
47
 
48
+ | Metric | Value | Split |
49
+ | --- | --- | --- |
50
+ | IoU | 0.2950 | BU48 (test) |
51
+ | PR-AUC | 0.414 | BU48 (test) |
52
+ | ROC-AUC | 0.944 | BU48 (test) |
53
+
54
+ ### Pocket-level (5-seed aggregated release, DBSCAN post-processing)
55
+
56
+ | Metric | Mean | 95 % CI | Notes |
57
+ | --- | --- | --- | --- |
58
+ | Mean IoU | 0.1276 | ±0.0124 | Average pocket IoU across BU48 |
59
+ | Best IoU (oracle) | 0.1580 | ±0.0141 | Max IoU per protein |
60
+ | GT Coverage | 0.8979 | ±0.0057 | Fraction of GT pockets matched |
61
+ | Avg pockets / protein | 6.37 | ±0.87 | Post-threshold pockets |
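
The table above reports a mean and a 95 % CI over five seeded runs. The card does not say how the interval was derived; one common choice, assumed here, is a two-sided Student-t interval on the per-seed values. The per-seed numbers below are placeholders for illustration, not release results, and `mean_ci95` is a hypothetical helper.

```python
import math
import statistics

# Two-sided 95 % Student-t critical values, keyed by sample size n (df = n - 1)
T_CRIT_95 = {2: 12.706, 3: 4.303, 4: 3.182, 5: 2.776, 6: 2.571}


def mean_ci95(values):
    """Return (mean, half_width) of a t-based 95 % confidence interval."""
    n = len(values)
    mean = statistics.fmean(values)
    # Standard error of the mean, using the sample standard deviation
    sem = statistics.stdev(values) / math.sqrt(n)
    return mean, T_CRIT_95[n] * sem


# Placeholder per-seed scores (NOT the values behind the table)
per_seed = [0.10, 0.12, 0.14, 0.13, 0.11]
mean, half_width = mean_ci95(per_seed)
print(f"{mean:.4f} ± {half_width:.4f}")
```

With only five seeds the t critical value (2.776) is noticeably larger than the normal-approximation 1.96, so the choice of method affects the reported interval width.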
62
+
63
+ Success rates (DBSCAN, `eps=3.0`, `min_samples=5`, score threshold 0.91):
64
+
65
+ - **DCA success@1:** 75 %
66
+ - **DCC success@1:** 39 %
67
+ - **DCA success@3:** 89 %
68
+ - **DCC success@3:** 50 %
69
+
70
+ Refer to `outputs/final_seed_sweep/*.csv` for the exact release numbers cited by
71
+ the thesis (Chapters 5–7 and Appendix 91).
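
For readers unfamiliar with the DCA/DCC success rates listed above, the sketch below shows one way such criteria are commonly evaluated: DCA measures the distance from a predicted pocket centre to the closest ligand atom, DCC the distance to the ligand centre. The 4 Å cutoff and the helper names are assumptions for illustration; the card does not state the cutoff used for the release numbers.

```python
import math


def centroid(points):
    """Geometric centre of a list of 3D points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))


def dca(pocket_center, ligand_atoms):
    """Distance from the pocket centre to the closest ligand atom."""
    return min(math.dist(pocket_center, atom) for atom in ligand_atoms)


def dcc(pocket_center, ligand_atoms):
    """Distance from the pocket centre to the ligand centre."""
    return math.dist(pocket_center, centroid(ligand_atoms))


def success_at_k(ranked_centers, ligand_atoms, k, metric, cutoff=4.0):
    """True if any of the top-k ranked pockets lies within the cutoff (assumed 4 Å)."""
    return any(metric(c, ligand_atoms) <= cutoff for c in ranked_centers[:k])


# Toy ligand and three predicted pocket centres, ordered by score (illustrative)
ligand = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (4.0, 0.0, 0.0)]
pockets = [(9.0, 0.0, 0.0), (7.0, 0.0, 0.0), (2.0, 1.0, 0.0)]

dca_at_1 = success_at_k(pockets, ligand, 1, dca)
dca_at_2 = success_at_k(pockets, ligand, 2, dca)
dcc_at_2 = success_at_k(pockets, ligand, 2, dcc)
dcc_at_3 = success_at_k(pockets, ligand, 3, dcc)
print(dca_at_1, dca_at_2, dcc_at_2, dcc_at_3)
```

In this toy case the second pocket passes DCA but not DCC, which mirrors why DCC success rates are typically lower than DCA at the same rank.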
72
 
73
  ## How to Use
74
 
 
79
  print(ckpt_path) # local file path
80
  ```
81
 
82
+ ### 2. Run the end-to-end pipeline (CLI / Docker)
83
+
84
+ Preferred CLI workflow:
85
+
86
  ```bash
87
+ python src/scripts/end_to_end_pipeline.py predict-dataset \
88
  --checkpoint /path/to/selective_swa_epoch09_12.ckpt \
89
+ --h5 data/h5/all_train_transformer_v2_optimized.h5 \
90
+ --csv data/vectorsTrain_all_chainfix.csv \
91
  --output outputs/bu48_release
92
  ```
93
 
94
+ Or inside Docker:
95
+ ```bash
96
+ make docker-run ARGS="predict-dataset --checkpoint /ckpts/best.ckpt --h5 /data/h5/all_train_transformer_v2_optimized.h5 --csv /data/vectorsTrain_all_chainfix.csv --output /logs/bu48_release"
97
+ ```
98
+
99
+ ### 3. Single-protein inference
100
+
101
+ If you already have an H5 + vectors CSV and want to inspect a single structure:
102
 
 
 
103
  ```bash
104
+ python src/scripts/end_to_end_pipeline.py predict-pdb 1a4j_H \
105
  --checkpoint /path/to/selective_swa_epoch09_12.ckpt \
106
  --h5 data/h5/all_train_transformer_v2_optimized.h5 \
107
  --csv data/vectorsTrain_all_chainfix.csv \
108
+ --output outputs/pocknet_single_1a4j
109
  ```
110
 
111
  ## Files Included in the Hugging Face Repo
112
 
113
  - `selective_swa_epoch09_12.ckpt` – release checkpoint
114
  - `MODEL_CARD.md` – this document
115
+
116
+ All supporting scripts (`src/scripts/end_to_end_pipeline.py`, Dockerfile,
117
+ data-generation tooling, notebooks) and artefacts (`outputs/final_seed_sweep/*`,
118
+ figures, thesis sources) remain in the public GitHub repository:
119
+ <https://github.com/hageneder/PockNet>. Refer there for full reproducibility
120
+ instructions, figures, and provenance logs.
121
 
122
  ## Citation
123
 
 
135
  ## License
136
 
137
  Apache License 2.0. Refer to the repository `LICENSE` for full terms and ensure compliance with upstream dataset/ESM2 licenses when redistributing.