---
language:
  - en
license: mit
tags:
  - drug-discovery
  - binding-affinity
  - protein-ligand
  - graph-neural-network
  - esm2
  - drug-repurposing
  - multimodal
  - transfer-learning
datasets:
  - pdbbind-v2020
metrics:
  - rmse
  - pearsonr
pipeline_tag: other
---

# DeepPharm: Multi-Modal Transfer Learning for Drug-Target Affinity Prediction

## Model Description

**DeepPharm** is a multi-modal deep learning framework for predicting protein–ligand binding affinity ($pK$). It combines:

- **GATv2** molecular graph encoder (3 layers, 4 heads)
- **ECFP4** fingerprint MLP encoder (2048→128)
- **Gated Fusion** mechanism for adaptive ligand representation
- **ESM-2** protein language model (150M params, fine-tuned)
- **Stacked Cross-Attention** (2 layers, 4 heads) for drug-protein interaction
- **Residual Prediction Head** with SiLU activation

### Two Modes of Operation

| Mode | Task | Input | Output |
|------|------|-------|--------|
| **Mode A** | Supervised affinity prediction | Drug SMILES + Protein sequence | pK value |
| **Mode B** | Weakly supervised drug repurposing | Drug + Disease signature | Ranked candidates |
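Because Mode A reports a pK value, it can help to translate a prediction back into a concentration. The following is a minimal sketch, assuming the usual convention $pK = -\log_{10} K$ with $K$ (a $K_d$, $K_i$, or IC50) expressed in mol/L. The helper names are illustrative and are not part of the DeepPharm code base.

```python
import math


def pk_to_nanomolar(pk: float) -> float:
    """Convert pK = -log10(K in mol/L) into K in nanomolar (1 M = 1e9 nM)."""
    return 10.0 ** (9.0 - pk)


def nanomolar_to_pk(k_nm: float) -> float:
    """Inverse conversion: K in nanomolar back to pK."""
    return 9.0 - math.log10(k_nm)


def fold_error(pk_error: float) -> float:
    """Multiplicative error in K implied by an additive error in pK."""
    return 10.0 ** abs(pk_error)


# A predicted pK of 6 corresponds to a micromolar binder (about 1000 nM).
print(pk_to_nanomolar(6.0))
# An error of one pK unit is a tenfold error in K.
print(fold_error(1.0))
```

Under this convention, errors on the pK scale are multiplicative on the concentration scale, which is worth keeping in mind when reading the RMSE values reported in this card.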

## Performance

### Systematic Ablation (PDBbind v2020, $N_{test}=3{,}775$)

| Config | RMSE ↓ | Pearson ↑ | Spearman ↑ |
|--------|--------|-----------|------------|
| V1 Baseline (ESM-35M) | 1.266 | 0.743 | 0.743 |
| V2 Architecture | 1.258 | 0.748 | 0.746 |
| V2 + CosineWR | 1.244 | 0.753 | 0.750 |
| **V2 + ESM-150M (Best)** | **1.229** | **0.762** | **0.760** |
| V2 + EMA | 1.247 | 0.753 | 0.753 |
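The three columns above are standard regression and correlation metrics. As a reference for how they are defined, here is a small dependency-free sketch; it is not the project's evaluation script, and the toy values at the bottom are made up purely for illustration.

```python
import math
from typing import List, Sequence


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error between true and predicted pK values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation: covariance normalised by both standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy values (not from PDBbind): predictions with a constant +0.5 offset.
y_true = [5.0, 6.0, 7.0, 8.0]
y_pred = [5.5, 6.5, 7.5, 8.5]
print(rmse(y_true, y_pred), pearson(y_true, y_pred), spearman(y_true, y_pred))
```

The toy case shows why RMSE and correlation are reported together: a constant offset leaves both correlations perfect while still producing a non-zero RMSE.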

### Five-Seed Ensemble (Best Configuration)

| Metric | Mean ± Std |
|--------|-----------|
| RMSE | 1.246 ± 0.005 |
| Pearson r | 0.751 ± 0.002 |
| Spearman ρ | 0.750 ± 0.002 |

The coefficient of variation (std / mean) is about 0.4% or lower for every metric, indicating that results are stable across random seeds.
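The CV figure can be checked directly from the table above. This short sketch assumes CV is defined as standard deviation divided by mean, expressed as a percentage, and uses only the rounded values reported in the table.

```python
# Mean and standard deviation as reported in the five-seed table.
five_seed = {
    "RMSE": (1.246, 0.005),
    "Pearson r": (0.751, 0.002),
    "Spearman rho": (0.750, 0.002),
}


def coefficient_of_variation(mean: float, std: float) -> float:
    """Coefficient of variation in percent: 100 * std / mean."""
    return 100.0 * std / mean


for name, (mean, std) in five_seed.items():
    print(f"{name}: CV = {coefficient_of_variation(mean, std):.2f}%")
# RMSE: CV = 0.40%
# Pearson r: CV = 0.27%
# Spearman rho: CV = 0.27%
```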

### Baselines (all re-implemented on the same split)

| Model | RMSE ↓ | Pearson ↑ |
|-------|--------|-----------|
| DeepDTA (CNN) | 1.48 | 0.61 |
| GraphDTA (GCN) | 1.39 | 0.67 |
| MolCLR* | 1.30 | 0.74 |
| DrugBAN | 1.28 | 0.76 |
| **DeepPharm V2** | **1.23** | **0.76** |

## Intended Use

- High-throughput virtual screening of drug candidates
- Binding affinity prediction for drug-target pairs
- Hypothesis generation for drug repurposing in orphan diseases
- Research and academic purposes

## Limitations

- 2D topological encoder; cannot distinguish stereoisomers
- Trained on PDBbind v2020, which overrepresents kinases
- Mode B uses drug priors (guilt-by-association), not zero-shot inference
- Predictions require experimental validation

## Training Details

- **Dataset:** PDBbind v2020 General Set (15,100 train / 3,775 test, seed=42)
- **Hardware:** 1× NVIDIA H100 80 GB
- **Optimizer:** AdamW (backbone LR: 5e-6, head LR: 8e-4)
- **Scheduler:** CosineAnnealing with Warm Restarts ($T_0$=10, $T_{mult}$=2)
- **Loss:** MSE + 0.3·RankingLoss + 0.2·HuberLoss
- **Training time:** ~11 min/epoch (ESM-2 150M), best checkpoint at epoch 18
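To make the scheduler settings concrete, the sketch below reproduces the cosine-with-warm-restarts formula in plain Python. It assumes the scheduler is stepped once per epoch and that the minimum learning rate is 0 (the default in PyTorch's `CosineAnnealingWarmRestarts`); neither detail is stated above, so treat both as assumptions. The function names are illustrative.

```python
import math
from typing import List


def cosine_warm_restart_lr(epoch: int, base_lr: float, t_0: int = 10,
                           t_mult: int = 2, eta_min: float = 0.0) -> float:
    """Learning rate at a given epoch under cosine annealing with warm restarts."""
    t_cur, t_i = epoch, t_0
    # Walk through completed cycles; each cycle is t_mult times longer than the last.
    while t_cur >= t_i:
        t_cur -= t_i
        t_i *= t_mult
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + math.cos(math.pi * t_cur / t_i))


def restart_epochs(n_epochs: int, t_0: int = 10, t_mult: int = 2) -> List[int]:
    """Epochs (0-based) at which the learning rate jumps back to base_lr."""
    restarts = []
    next_restart, t_i = t_0, t_0
    while next_restart < n_epochs:
        restarts.append(next_restart)
        t_i *= t_mult
        next_restart += t_i
    return restarts


HEAD_LR = 8e-4  # head learning rate from the Training Details list
print(restart_epochs(100))  # [10, 30, 70]
for epoch in (0, 5, 9, 10, 18):
    print(epoch, f"{cosine_warm_restart_lr(epoch, HEAD_LR):.2e}")
```

With $T_0$=10 and $T_{mult}$=2, the cycles last 10, 20, and 40 epochs, so under these assumptions the best checkpoint at epoch 18 falls inside the second cycle.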

## Available Checkpoints

| File | Description | RMSE |
|------|-------------|------|
| `best_v2_esm150m.pt` | Best V2 model (ESM-2 150M) | 1.229 |
| `best_v1_esm35m.pt` | V1 Baseline (ESM-2 35M) | 1.266 |

## How to Use

```python
import torch
from huggingface_hub import hf_hub_download

# Download the best checkpoint from the Hugging Face Hub (cached after the first call)
path = hf_hub_download(repo_id="chamoso/DeepPharm", filename="best_v2_esm150m.pt")

# Load the checkpoint onto CPU
checkpoint = torch.load(path, map_location="cpu")
```

For full inference with data preprocessing:

```bash
git clone https://github.com/chamoso/DeepPharm.git
cd DeepPharm
python scripts/predict.py \
    --checkpoint weights/best_v2_esm150m.pt \
    --smiles "CC(=O)Oc1ccccc1C(=O)O" \
    --sequence "MKTAYIAKQRQISFVKSHFSRQLE..."
```
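For screening several ligands against one target, the command above can be driven from Python. This is a hedged sketch: it uses only the three flags shown above, assumes it is run from the repository root, and assumes `scripts/predict.py` writes its result to standard output (the output format is not documented here, so it is returned as raw text). The helper names and the placeholder sequence are hypothetical.

```python
import shlex
import subprocess
from typing import List


def build_predict_command(smiles: str, sequence: str,
                          checkpoint: str = "weights/best_v2_esm150m.pt") -> List[str]:
    """Assemble the argument list for scripts/predict.py (flags as shown above)."""
    return [
        "python", "scripts/predict.py",
        "--checkpoint", checkpoint,
        "--smiles", smiles,
        "--sequence", sequence,
    ]


def run_prediction(smiles: str, sequence: str) -> str:
    """Run one prediction and return whatever the script printed (assumed stdout)."""
    result = subprocess.run(
        build_predict_command(smiles, sequence),
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# Show the command without running it; replace the placeholder with a real sequence.
command = build_predict_command("CC(=O)Oc1ccccc1C(=O)O", "YOUR_PROTEIN_SEQUENCE")
print(shlex.join(command))
```

Passing the arguments as a list avoids shell quoting problems with SMILES strings, which often contain parentheses and other characters that a shell would interpret.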

## Links

- **GitHub:** [chamoso/DeepPharm](https://github.com/chamoso/DeepPharm)
- **Live Demo:** [HuggingFace Spaces](https://huggingface.co/spaces/chamoso/DeepPharm)

## Citation

*Preprint coming soon.*

## License

MIT License