	---
	language:
	- en
	license: mit
	tags:
	- drug-discovery
	- binding-affinity
	- protein-ligand
	- graph-neural-network
	- esm2
	- drug-repurposing
	- multimodal
	- transfer-learning
	datasets:
	- pdbbind-v2020
	metrics:
	- rmse
	- pearsonr
	pipeline_tag: other
	---

	# DeepPharm: Multi-Modal Transfer Learning for Drug-Target Affinity Prediction

	## Model Description

	DeepPharm is a multi-modal deep learning framework for predicting protein–ligand binding affinity ($pK$). It combines:

	- GATv2 molecular graph encoder (3 layers, 4 heads)
	- ECFP4 fingerprint MLP encoder (2048→128)
	- Gated Fusion mechanism for adaptive ligand representation
	- ESM-2 protein language model (150M params, fine-tuned)
	- Stacked Cross-Attention (2 layers, 4 heads) for drug-protein interaction
	- Residual Prediction Head with SiLU activation
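
The card does not spell out the gated fusion step. One common formulation, given here only as an assumption, mixes the graph embedding $h_{graph}$ and the fingerprint embedding $h_{fp}$ with a learned sigmoid gate. The symbols $W_g$, $b_g$, and $h_{lig}$ are illustrative names, and both embeddings are assumed to share the same dimension (for example 128, the fingerprint MLP output size):

$$
g = \sigma\left(W_g\,[h_{graph};\,h_{fp}] + b_g\right), \qquad
h_{lig} = g \odot h_{graph} + (1 - g) \odot h_{fp}
$$

Here $[\cdot\,;\cdot]$ is concatenation and $\odot$ is element-wise multiplication, so each coordinate of the ligand representation interpolates between the two encoders. The implementation in the repository may differ.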

	### Two Modes of Operation

	| Mode | Task | Input | Output |
	|------|------|-------|--------|
	| Mode A | Supervised affinity prediction | Drug SMILES + Protein sequence | pK value |
	| Mode B | Weakly supervised drug repurposing | Drug + Disease signature | Ranked candidates |
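
Mode A returns a pK value. The card does not define it; the sketch below assumes the usual convention $pK = -\log_{10} K$ with $K$ (a $K_d$, $K_i$, or $IC_{50}$) in mol/L, as commonly used for PDBbind labels. The function names are hypothetical and not part of the DeepPharm code base.

```python
import math


def pk_from_molar(k_molar):
    """Convert a binding constant in mol/L to pK (assumed: pK = -log10 K)."""
    return -math.log10(k_molar)


def molar_from_pk(pk):
    """Invert the conversion: binding constant in mol/L for a given pK."""
    return 10.0 ** (-pk)


def fold_error(rmse_pk):
    """Multiplicative error in K implied by an error of `rmse_pk` pK units."""
    return 10.0 ** rmse_pk


print(pk_from_molar(1e-9))   # a 1 nM binder
print(molar_from_pk(6.0))    # pK 6 expressed in mol/L
print(fold_error(1.229))     # fold error implied by the best reported RMSE
```

Under this convention an error of one pK unit is a tenfold error in $K$, so the best RMSE reported under Performance (1.229) corresponds to roughly a 17-fold uncertainty in the binding constant.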

	## Performance

	### Systematic Ablation (PDBbind v2020, $N_{test}=3{,}775$)

	| Config | RMSE ↓ | Pearson ↑ | Spearman ↑ |
	|--------|--------|-----------|------------|
	| V1 Baseline (ESM-35M) | 1.266 | 0.743 | 0.743 |
	| V2 Architecture | 1.258 | 0.748 | 0.746 |
	| V2 + CosineWR | 1.244 | 0.753 | 0.750 |
	| V2 + ESM-150M (Best) | 1.229 | 0.762 | 0.760 |
	| V2 + EMA | 1.247 | 0.753 | 0.753 |
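
For reference, the three metric columns above can be computed as follows. This is a dependency-free sketch of the standard definitions, not the project's evaluation script; the function names and the toy pK values are made up for illustration.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def pearson(x, y):
    """Pearson correlation coefficient."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy experimental vs. predicted pK values (illustrative only).
y_true = [4.2, 6.8, 5.5, 7.9, 3.1]
y_pred = [4.6, 6.1, 5.9, 7.2, 3.9]
print(rmse(y_true, y_pred), pearson(y_true, y_pred), spearman(y_true, y_pred))
```

RMSE is in pK units and penalizes absolute error, while the two correlation coefficients only measure how well the predicted ordering and linear trend match the labels.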

	### Five-Seed Ensemble (Best Configuration)

	| Metric | Mean ± Std |
	|--------|-----------|
	| RMSE | 1.246 ± 0.005 |
	| Pearson r | 0.751 ± 0.002 |
	| Spearman ρ | 0.750 ± 0.002 |

	The coefficient of variation (std / mean) is about 0.4% or lower for every metric, indicating low run-to-run variance across seeds.
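
The coefficient of variation can be recomputed from the five-seed table as std / mean. A minimal sketch using only the rounded values reported there, so the result is approximate:

```python
# Mean and standard deviation per metric, copied from the five-seed table.
five_seed = {
    "RMSE": (1.246, 0.005),
    "Pearson r": (0.751, 0.002),
    "Spearman rho": (0.750, 0.002),
}


def coefficient_of_variation(mean, std):
    """Return std / mean expressed as a percentage."""
    return 100.0 * std / mean


cv_percent = {name: coefficient_of_variation(m, s) for name, (m, s) in five_seed.items()}
for name, cv in cv_percent.items():
    print(f"{name}: CV = {cv:.2f}%")
```

With these rounded inputs the RMSE comes out at about 0.40% and both correlation coefficients at about 0.27%.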

	### Baselines (all re-implemented on the same split)

	| Model | RMSE ↓ | Pearson ↑ |
	|-------|--------|-----------|
	| DeepDTA (CNN) | 1.48 | 0.61 |
	| GraphDTA (GCN) | 1.39 | 0.67 |
	| MolCLR* | 1.30 | 0.74 |
	| DrugBAN | 1.28 | 0.76 |
	| DeepPharm V2 | 1.23 | 0.76 |

	## Intended Use

	- High-throughput virtual screening of drug candidates
	- Binding affinity prediction for drug-target pairs
	- Hypothesis generation for drug repurposing in orphan diseases
	- Research and academic purposes

	## Limitations

	- 2D topological encoder; cannot distinguish stereoisomers
	- Trained on PDBbind v2020, which overrepresents kinases
	- Mode B uses drug priors (guilt-by-association), not zero-shot inference
	- Predictions require experimental validation

	## Training Details

	- Dataset: PDBbind v2020 General Set (15,100 train / 3,775 test, seed=42)
	- Hardware: 1× NVIDIA H100 80 GB
	- Optimizer: AdamW (backbone LR: 5e-6, head LR: 8e-4)
	- Scheduler: CosineAnnealing with Warm Restarts ($T_0$=10, $T_{mult}$=2)
	- Loss: MSE + 0.3·RankingLoss + 0.2·HuberLoss
	- Training time: ~11 min/epoch (ESM-2 150M), best checkpoint at epoch 18
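
To make the scheduler line concrete, the sketch below evaluates the cosine-annealing-with-warm-restarts formula (as in SGDR and PyTorch's `CosineAnnealingWarmRestarts`) in plain Python for the two learning rates listed above. It assumes per-epoch stepping, epochs counted from 0, and a minimum learning rate of 0; none of these are stated in this card, and `warm_restart_lr` is a hypothetical helper.

```python
import math


def warm_restart_lr(epoch, base_lr, t_0=10, t_mult=2, eta_min=0.0):
    """Learning rate at integer `epoch` under cosine annealing with warm restarts."""
    t_i, t_cur = t_0, epoch
    # Skip over completed cycles; each cycle is t_mult times longer than the previous one.
    while t_cur >= t_i:
        t_cur -= t_i
        t_i *= t_mult
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * t_cur / t_i))


HEAD_LR, BACKBONE_LR = 8e-4, 5e-6  # values from the optimizer line above

for epoch in (0, 5, 9, 10, 18, 29, 30):
    head = warm_restart_lr(epoch, HEAD_LR)
    backbone = warm_restart_lr(epoch, BACKBONE_LR)
    print(f"epoch {epoch:2d}  head {head:.2e}  backbone {backbone:.2e}")
```

Under these assumptions the restarts fall at epochs 10, 30, and 70, so the best checkpoint (epoch 18) lies in the second cycle, where both learning rates have been reset to their peak and are decaying again.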

	## Available Checkpoints

	| File | Description | RMSE |
	|------|-------------|------|
	| `best_v2_esm150m.pt` | Best V2 model (ESM-2 150M) | 1.229 |
	| `best_v1_esm35m.pt` | V1 Baseline (ESM-2 35M) | 1.266 |

	## How to Use

	```python
	import torch
	from huggingface_hub import hf_hub_download

	# Download the best checkpoint from the Hub (cached locally after the first call)
	path = hf_hub_download("chamoso/DeepPharm", "best_v2_esm150m.pt")

	# Load the checkpoint onto the CPU
	checkpoint = torch.load(path, map_location="cpu")
	```

	For full inference with data preprocessing:

	```bash
	git clone https://github.com/chamoso/DeepPharm.git
	cd DeepPharm
	python scripts/predict.py \
	  --checkpoint weights/best_v2_esm150m.pt \
	  --smiles "CC(=O)Oc1ccccc1C(=O)O" \
	  --sequence "MKTAYIAKQRQISFVKSHFSRQLE..."
	```
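
For screening several ligands against one target, the same command can be generated in a loop. The sketch below only builds the argument lists, using the three flags shown above; `build_predict_command` is a hypothetical helper, the second SMILES (caffeine) is an arbitrary example, and the sequence is a placeholder fragment rather than a real full-length protein.

```python
import shlex


def build_predict_command(checkpoint, smiles, sequence, script="scripts/predict.py"):
    """Return the argv list for one predict.py call, using only the flags shown above."""
    return [
        "python", script,
        "--checkpoint", checkpoint,
        "--smiles", smiles,
        "--sequence", sequence,
    ]


ligands = [
    "CC(=O)Oc1ccccc1C(=O)O",         # aspirin (from the example above)
    "CN1C=NC2=C1C(=O)N(C(=O)N2C)C",  # caffeine (arbitrary second ligand)
]
target = "MKTAYIAKQRQISFVKSHFSRQLE"  # placeholder fragment, not a full sequence

commands = [build_predict_command("weights/best_v2_esm150m.pt", s, target) for s in ligands]
for argv in commands:
    # Print a shell-safe command line; pass argv to subprocess.run(argv, check=True) to execute it.
    print(shlex.join(argv))
```

Each call starts a new process and reloads the checkpoint, so for large libraries it is likely faster to load the model once in Python, if the repository exposes such an entry point.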

	## Links

	- GitHub: [chamoso/DeepPharm](https://github.com/chamoso/DeepPharm)
	- Live Demo: [HuggingFace Spaces](https://huggingface.co/spaces/chamoso/DeepPharm)

	## Citation

	Preprint coming soon.

	## License

	MIT License