Structure-Based Protein-Ligand Binding Affinity Prediction
BindStream Technologies
February 26, 2026
We report an independent validation of MillerBind v9 and v12, proprietary structure-based scoring functions for protein-ligand binding affinity prediction, using the Therapeutics Data Commons (TDC) evaluation framework (PyTDC v1.1.15). On the CASF-2016 benchmark (n=285, strictly held out), MillerBind v9 achieves PCC = 0.890 (95% CI: 0.862–0.912) and MAE = 0.780 pKd units (95% CI: 0.708–0.857). MillerBind v12 achieves PCC = 0.938 (95% CI: 0.921–0.950) and MAE = 0.637 pKd units (95% CI: 0.571–0.707). Both correlations are highly significant (Pearson test, p < 10⁻⁹⁸ for both models), and both models exceed the PCC reported for each published structure-based scoring function in our comparison (Table 2). v12 significantly improves over v9 (paired t-test on absolute errors: t = 5.30, p = 2.4×10⁻⁷). Cross-referencing with TDC BindingDB_Kd (52,274 entries) shows 49.5% structural coverage, and v9 performance on TDC-overlapping targets remains strong (PCC = 0.880, n=170).
Accurate prediction of protein-ligand binding affinity is fundamental to structure-based drug discovery. Scoring functions estimate the binding free energy (ΔG) or the negative base-10 logarithm of the dissociation constant, pKd = −log10(Kd), from three-dimensional protein-ligand complex structures. The CASF-2016 benchmark (Su et al., 2019) is a widely used standard for evaluating scoring power, providing 285 diverse complexes spanning pKd 2.0–12.0.
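To make the two target quantities concrete, the following minimal sketch converts a dissociation constant to pKd and to a standard binding free energy via ΔG = RT ln(Kd). The function names, the temperature of 298.15 K, and the 1 M standard state are assumptions chosen for illustration; they are not taken from the MillerBind pipeline.

```python
import math

GAS_CONSTANT_KCAL = 1.98720425864083e-3  # kcal/(mol*K)


def pkd_from_kd(kd_molar: float) -> float:
    """Return pKd = -log10(Kd), with Kd expressed in mol/L."""
    if kd_molar <= 0:
        raise ValueError("Kd must be positive")
    return -math.log10(kd_molar)


def delta_g_from_pkd(pkd: float, temperature_k: float = 298.15) -> float:
    """Return the standard binding free energy in kcal/mol.

    Uses dG = RT ln(Kd) = -RT ln(10) * pKd (assumed 1 M standard state).
    """
    return -GAS_CONSTANT_KCAL * temperature_k * math.log(10) * pkd


# A 1 nM binder corresponds to pKd = 9, or roughly -12.3 kcal/mol at 298.15 K.
print(round(pkd_from_kd(1e-9), 3))
print(round(delta_g_from_pkd(9.0), 2))
```

At this temperature one pKd unit corresponds to about 1.36 kcal/mol, which gives a rough physical scale for the MAE values reported in this work.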
MillerBind is a family of proprietary machine-learning scoring functions developed by BindStream Technologies. To ensure independent, reproducible validation, we evaluate both the general-purpose model (v9) and the hard-target model (v12) using the Therapeutics Data Commons (TDC) evaluation framework (Huang et al., 2021), a community-standard platform for benchmarking ML models in drug discovery.
All metrics were computed using tdc.Evaluator from PyTDC v1.1.15 (Huang et al., 2021). Metrics: Pearson correlation coefficient (PCC), Spearman rank correlation (ρ), mean absolute error (MAE), root mean squared error (RMSE), and coefficient of determination (R²).
95% confidence intervals were computed via bootstrap resampling (1,000 iterations, seed=42). Statistical significance of each PCC was assessed with the p-value of the Pearson correlation test (null hypothesis: zero correlation). The comparison between v9 and v12 used a two-sided paired t-test on per-complex absolute prediction errors. Per-affinity-range analysis stratified complexes by experimental pKd into five bins: Weak (2–4), Low (4–6), Medium (6–8), High (8–10), and Very High (10–12).
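The bootstrap procedure can be sketched as follows. The iteration count and seed match the values stated above; the use of the percentile method, the resampling of complexes as (true, predicted) pairs, and the helper names are assumptions, since the text does not specify which bootstrap variant was used.

```python
import random


def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def bootstrap_ci(y_true, y_pred, metric, n_boot=1000, seed=42, alpha=0.05):
    """Percentile bootstrap confidence interval for a paired metric."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        # Resample complexes with replacement, keeping pairs together.
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(metric([y_true[i] for i in idx],
                            [y_pred[i] for i in idx]))
    stats.sort()
    k = int(n_boot * alpha / 2)  # number of resamples trimmed from each tail
    return stats[k], stats[n_boot - 1 - k]


# Toy values for illustration only (not benchmark data).
toy_true = [3.1, 4.8, 6.2, 7.5, 9.0, 10.4]
toy_pred = [3.6, 4.5, 6.9, 7.1, 8.2, 9.1]
low, high = bootstrap_ci(toy_true, toy_pred, mean_absolute_error)
print(f"MAE 95% CI: [{low:.3f}, {high:.3f}]")
```

Fixing the seed makes the interval reproducible across runs. Correlation metrics can be passed in the same way, although very small samples may produce resamples with zero variance.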
TDC BindingDB_Kd (52,274 entries, 1,090 UniProt targets) was cross-referenced with the PDBbind refined set (19,037 complexes) via the UniProt REST API. We then evaluated MillerBind on the subset of CASF-2016 complexes whose targets also appear in TDC BindingDB_Kd.
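Once both datasets are keyed by UniProt accession, the cross-reference reduces to a set intersection. The sketch below assumes the accession mapping has already been done and uses placeholder identifiers. It reports coverage both per unique target and per affinity entry, because the text does not state which denominator the coverage figure uses.

```python
def structural_coverage(entry_targets, structure_targets):
    """Fraction of affinity data whose target has at least one structure.

    entry_targets: one target accession per affinity entry (repeats allowed).
    structure_targets: accessions with at least one complex structure.
    """
    structure_targets = set(structure_targets)
    unique_targets = set(entry_targets)
    covered_targets = unique_targets & structure_targets
    covered_entries = sum(1 for t in entry_targets if t in structure_targets)
    return {
        "target_fraction": len(covered_targets) / len(unique_targets),
        "entry_fraction": covered_entries / len(entry_targets),
    }


# Placeholder accessions for illustration only.
toy_entries = ["T1", "T1", "T2", "T3"]
toy_structures = {"T1", "T4"}
print(structural_coverage(toy_entries, toy_structures))
```

The two fractions can differ substantially when a few well-studied targets contribute many affinity entries.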
Table 1. TDC-validated performance on CASF-2016 (n=285). 95% CI from 1,000 bootstrap resamples.
| Model | PCC [95% CI] | Spearman ρ | MAE [95% CI] | RMSE | R² | p-value |
|---|---|---|---|---|---|---|
| v9 Ensemble | 0.890 [0.862, 0.912] | 0.877 | 0.780 [0.708, 0.857] | 1.030 | 0.775 | < 10⁻⁹⁸ |
| v12 Best | 0.938 [0.921, 0.950] | 0.960 | 0.637 [0.571, 0.707] | 0.869 | 0.840 | < 10⁻¹³¹ |
v12 significantly improves over v9: paired t-test on |error|, t = 5.30, p = 2.4×10⁻⁷.
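The paired test statistic can be reproduced from per-complex predictions as sketched below. The sign convention (positive t when the first model has the larger absolute errors) and the function name are assumptions. The two-sided p-value follows from Student's t distribution with n − 1 degrees of freedom, for example via scipy.stats.ttest_rel applied to the two absolute-error arrays.

```python
import math


def paired_t_on_abs_errors(y_true, pred_a, pred_b):
    """Paired t statistic on |error_a| - |error_b|, with degrees of freedom.

    Assumes the differences are not all identical (nonzero variance).
    """
    diffs = [abs(t - a) - abs(t - b)
             for t, a, b in zip(y_true, pred_a, pred_b)]
    n = len(diffs)
    mean_d = sum(diffs) / n
    var_d = sum((d - mean_d) ** 2 for d in diffs) / (n - 1)  # sample variance
    t_stat = mean_d / math.sqrt(var_d / n)
    return t_stat, n - 1


# Toy values for illustration only (not benchmark data).
toy_true = [5.0, 6.0, 7.0, 8.0]
toy_model_a = [6.0, 7.5, 7.5, 9.5]
toy_model_b = [5.5, 6.5, 7.2, 8.5]
t_stat, dof = paired_t_on_abs_errors(toy_true, toy_model_a, toy_model_b)
print(f"t = {t_stat:.2f}, df = {dof}")
```

Pairing by complex removes the between-complex variation in difficulty, which is why this test is more sensitive than comparing the two MAE values directly.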
Table 2. CASF-2016 scoring power. Values for published methods are taken from the original papers; MillerBind values were computed with the TDC Evaluator.
| Method | PCC | MAE (pKd) | Type | Year |
|---|---|---|---|---|
| AutoDock Vina [6] | 0.604 | 2.05 | Physics-based | 2010 |
| X-Score | 0.614 | 1.98 | Empirical | 2002 |
| RF-Score v3 | 0.800 | 1.40 | Random Forest | 2015 |
| Pafnucy [5] | 0.780 | 1.42 | 3D-CNN | 2018 |
| OnionNet-2 | 0.816 | 1.28 | Deep Learning | 2021 |
| PIGNet [7] | 0.830 | 1.21 | GNN | 2022 |
| IGN [8] | 0.850 | 1.15 | GNN | 2021 |
| HAC-Net [9] | 0.860 | 1.10 | DL Ensemble | 2023 |
| MillerBind v9 * | 0.890 | 0.780 | Proprietary ML | 2025 |
| MillerBind v12 * | 0.938 | 0.637 | Proprietary ML | 2025 |
* Validated using TDC Evaluator (PyTDC v1.1.15)
Table 3. Performance stratified by experimental pKd range.
| Range (pKd) | n | v9 PCC | v9 MAE | v12 PCC | v12 MAE |
|---|---|---|---|---|---|
| Weak (2–4) | 39 | 0.462 | 0.629 | 0.852 | 0.431 |
| Low (4–6) | 79 | 0.813 | 0.702 | 0.890 | 0.522 |
| Medium (6–8) | 92 | 0.402 | 0.481 | 0.680 | 0.322 |
| High (8–10) | 58 | 0.372 | 1.056 | 0.576 | 0.968 |
| Very High (10–12) | 17 | 0.698 | 2.169 | 0.514 | 2.225 |
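v12 shows particular strength in the weak and low-affinity ranges (PCC 0.852 and 0.890, respectively), which are often difficult for scoring functions. Both models show increased error at very high affinities (pKd > 10; MAE above 2.1 pKd units), consistent with the known regression-to-the-mean effect in this regime. This is also the only range in which v12 does not improve on v9 (PCC 0.514 vs. 0.698; MAE 2.225 vs. 2.169), although the bin is small (n = 17).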
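The stratification can be sketched as follows. The bin edges are those given in the text; treating each bin as half-open, [low, high), with the last bin closed at 12 is an assumption, as is every name in the code.

```python
BINS = [
    ("Weak", 2.0, 4.0),
    ("Low", 4.0, 6.0),
    ("Medium", 6.0, 8.0),
    ("High", 8.0, 10.0),
    ("Very High", 10.0, 12.0),
]


def bin_label(pkd):
    """Return the affinity bin for an experimental pKd, or None if outside."""
    for i, (label, low, high) in enumerate(BINS):
        is_last = i == len(BINS) - 1
        if low <= pkd < high or (is_last and pkd == high):
            return label
    return None


def mae_by_bin(y_true, y_pred):
    """Map each bin label to (n, MAE); MAE is None for empty bins."""
    errors = {label: [] for label, _, _ in BINS}
    for t, p in zip(y_true, y_pred):
        label = bin_label(t)
        if label is not None:
            errors[label].append(abs(t - p))
    return {label: (len(e), sum(e) / len(e) if e else None)
            for label, e in errors.items()}


# Toy values for illustration only (not benchmark data).
print(mae_by_bin([3.0, 5.0, 5.5, 11.0], [3.5, 5.0, 6.5, 9.0]))
```

Within-bin PCC values are computed over a restricted affinity range and are therefore expected to be lower than the full-set PCC, so per-bin MAE is the more directly comparable quantity across bins.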
Table 4. MillerBind v9 on full vs. TDC-overlapping CASF-2016 subsets.
| Metric | Full CASF-2016 (n=285) | TDC-Overlapping (n=170) | Δ |
|---|---|---|---|
| PCC | 0.890 | 0.880 | −0.010 |
| Spearman ρ | 0.877 | 0.857 | −0.020 |
| MAE (pKd) | 0.780 | 0.808 | +0.028 |
| RMSE | 1.030 | 1.085 | +0.055 |
| R² | 0.775 | 0.744 | −0.031 |
v9 performance on TDC-overlapping targets (170 complexes) shows minimal degradation (ΔPCC = −0.010, ΔMAE = +0.028 pKd units), suggesting that the model generalizes to TDC-relevant drug targets.
The TDC-validated results indicate that MillerBind v9 and v12 achieve state-of-the-art scoring power on CASF-2016 among the methods compared in Table 2. Several observations merit discussion:
Using the TDC evaluation framework as an independent validator:
[1] Huang, K., Fu, T., Gao, W., et al. (2021). Therapeutics Data Commons: Machine Learning Datasets and Tasks for Drug Discovery and Development. NeurIPS Datasets and Benchmarks.
[2] Li, Y., Su, M., Liu, Z., et al. (2019). Assessing protein-ligand interaction scoring functions with the CASF-2016 benchmark. Protein Science, 28(2), 293–301.
[3] Su, M., Yang, Q., Du, Y., et al. (2019). Comparative Assessment of Scoring Functions: The CASF-2016 Update. J. Chem. Inf. Model., 59(2), 895–913.
[4] Wang, R., Fang, X., Lu, Y., & Wang, S. (2004). The PDBbind Database. J. Med. Chem., 47(12), 2977–2980.
[5] Stepniewska-Dziubinska, M.M., Zielenkiewicz, P., & Siedlecki, P. (2018). Development and evaluation of a deep learning model for protein-ligand binding affinity prediction. Bioinformatics, 34(21), 3666–3674.
[6] Trott, O. & Olson, A.J. (2010). AutoDock Vina: Improving the Speed and Accuracy of Docking. J. Comput. Chem., 31(2), 455–461.
[7] Moon, S., Zhung, W., Yang, S., Lim, J., & Kim, W.Y. (2022). PIGNet: a physics-informed deep learning model toward generalized drug-target interaction predictions. Chem. Sci., 13, 3661–3673.
[8] Jiang, D., Hsieh, C.Y., Wu, Z., et al. (2021). InteractionGraphNet: A Novel and Efficient Deep Graph Representation Learning Framework for Accurate Protein-Ligand Interaction Predictions. J. Med. Chem., 64(24), 18209–18232.
[9] Kyro, G.W., Brent, R.I., & Batista, V.S. (2023). HAC-Net: A Hybrid Attention-Based Convolutional Neural Network for Highly Accurate Protein-Ligand Binding Affinity Prediction. J. Chem. Inf. Model., 63(6), 1868–1878.