MillerBind v9 & v12: Independent Validation Using the Therapeutics Data Commons

Structure-Based Protein-Ligand Binding Affinity Prediction

William Miller

BindStream Technologies

February 26, 2026

Abstract

We report an independent validation of MillerBind v9 and v12, proprietary structure-based scoring functions for protein-ligand binding affinity prediction, using the Therapeutics Data Commons (TDC) evaluation framework (PyTDC v1.1.15). On the CASF-2016 benchmark (n=285, strictly held out), MillerBind v9 achieves PCC = 0.890 (95% CI: 0.862–0.912) and MAE = 0.780 pKd (95% CI: 0.708–0.857); MillerBind v12 achieves PCC = 0.938 (95% CI: 0.921–0.950) and MAE = 0.637 pKd (95% CI: 0.571–0.707). Both models exceed all published structure-based scoring functions in scoring power, and their correlations are highly significant (Pearson test, all p < 10⁻⁹⁸). v12 improves significantly over v9 (paired t-test on absolute errors: t = 5.30, p = 2.4×10⁻⁷). Cross-reference analysis with TDC BindingDB_Kd (52,274 entries) demonstrates 49.5% structural coverage, and performance on TDC-overlapping targets remains strong (PCC = 0.880, n=170).

1. Introduction

Accurate prediction of protein-ligand binding affinity is fundamental to structure-based drug discovery. Scoring functions estimate binding free energy (ΔG) or the dissociation constant pKd = −log10(Kd) from three-dimensional protein-ligand complex structures. The CASF-2016 benchmark (Su et al., 2019) is the de facto standard for evaluating scoring power, providing 285 diverse complexes spanning pKd 2.0–12.0.
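For concreteness, the Kd-to-pKd conversion used throughout the paper is a one-liner (the helper name below is illustrative):

```python
import math

def pkd_from_kd(kd_molar: float) -> float:
    """pKd = -log10(Kd), with Kd expressed in molar units."""
    return -math.log10(kd_molar)

# A 1 nM binder (Kd = 1e-9 M) sits at pKd 9; a 10 mM binder marks the
# weak end (pKd 2) of the CASF-2016 affinity range.
tight = pkd_from_kd(1e-9)   # 9.0
weak = pkd_from_kd(1e-2)    # 2.0
```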

MillerBind is a family of proprietary machine-learning scoring functions developed by BindStream Technologies. To ensure independent, reproducible validation, we evaluate both the general-purpose model (v9) and the hard-target model (v12) using the Therapeutics Data Commons (TDC) evaluation framework (Huang et al., 2021), a community-standard platform for benchmarking ML models in drug discovery.

2. Model Overview

2.1 MillerBind v9 (General-Purpose)

2.2 MillerBind v12 (Hard Targets / PPI)

3. Evaluation Methods

3.1 TDC Evaluation Framework

All metrics were computed using tdc.Evaluator from PyTDC v1.1.15 (Huang et al., 2021). Metrics: Pearson correlation coefficient (PCC), Spearman rank correlation (ρ), mean absolute error (MAE), root mean squared error (RMSE), and coefficient of determination (R²).
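Per the PyTDC documentation, tdc.Evaluator is constructed by metric name (e.g. Evaluator(name='PCC')) and called on paired arrays. The sketch below reproduces the five reported metrics with NumPy/SciPy so readers can sanity-check values without TDC installed; the function name is illustrative:

```python
import numpy as np
from scipy import stats

def score(y_true, y_pred):
    """The five metrics used in this report: PCC, Spearman rho, MAE, RMSE, R2."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    resid = y_pred - y_true
    ss_res = float(np.sum(resid ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    return {
        "PCC": float(stats.pearsonr(y_true, y_pred)[0]),
        "Spearman": float(stats.spearmanr(y_true, y_pred)[0]),
        "MAE": float(np.mean(np.abs(resid))),
        "RMSE": float(np.sqrt(np.mean(resid ** 2))),
        "R2": 1.0 - ss_res / ss_tot,
    }

# Worked example on four synthetic (experimental, predicted) pKd pairs
m = score([2.0, 5.0, 7.5, 10.0], [2.4, 4.8, 7.9, 9.6])
```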

3.2 Statistical Analysis

95% confidence intervals were computed via bootstrap resampling (1,000 iterations, seed=42). Statistical significance of PCC was assessed by the Pearson test p-value. Comparison between v9 and v12 used a two-sided paired t-test on absolute prediction errors. Per-affinity-range analysis stratified complexes into five bins: Weak (2–4), Low (4–6), Medium (6–8), High (8–10), and Very High (10–12) pKd.
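The protocol above (1,000 percentile-bootstrap resamples with seed 42, and a two-sided paired t-test on absolute errors) can be sketched as follows; bootstrap_ci and compare_models are illustrative names, not part of any released tooling:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)  # seed=42, matching the protocol above

def bootstrap_ci(y_true, y_pred, metric, n_boot=1000, alpha=0.05):
    """Percentile-bootstrap CI: resample complexes with replacement n_boot times."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = len(y_true)
    vals = [metric(y_true[idx], y_pred[idx])
            for idx in (rng.integers(0, n, size=n) for _ in range(n_boot))]
    lo, hi = np.percentile(vals, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(lo), float(hi)

def compare_models(y_true, pred_a, pred_b):
    """Two-sided paired t-test on absolute errors (the v9-vs-v12 comparison)."""
    y_true = np.asarray(y_true, dtype=float)
    err_a = np.abs(np.asarray(pred_a, dtype=float) - y_true)
    err_b = np.abs(np.asarray(pred_b, dtype=float) - y_true)
    return stats.ttest_rel(err_a, err_b)
```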

3.3 TDC BindingDB Cross-Reference

TDC BindingDB_Kd (52,274 entries, 1,090 UniProt targets) was cross-referenced with PDBbind refined-set (19,037 complexes) via the UniProt REST API. We evaluated MillerBind on the CASF-2016 complexes whose targets also appear in TDC BindingDB.
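Once both datasets are keyed by UniProt accession, the cross-reference reduces to set operations. A minimal sketch with hypothetical accessions and complex IDs (the actual study resolved PDBbind entries to accessions via the UniProt REST API):

```python
# Hypothetical accessions/IDs for illustration only
tdc_targets = {"P10001", "P10002", "P10003", "P10004"}   # targets in BindingDB_Kd
pdbbind_uniprot = {                                       # complex ID -> UniProt
    "cplx1": "P10001",
    "cplx2": "P10002",
    "cplx3": "P10002",
    "cplx4": "Q99999",                                    # no TDC counterpart
}

# Targets present in both resources, and the fraction of TDC targets covered
matched_targets = tdc_targets & set(pdbbind_uniprot.values())
target_coverage = len(matched_targets) / len(tdc_targets)   # 0.5 in this toy case

# Structural complexes whose target also appears in TDC
matched_complexes = sorted(c for c, u in pdbbind_uniprot.items()
                           if u in tdc_targets)
```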

4. Results

4.1 CASF-2016 Performance with Confidence Intervals

Table 1. TDC-validated performance on CASF-2016 (n=285). 95% CI from 1,000 bootstrap resamples.

| Model | PCC [95% CI] | Spearman ρ | MAE (pKd) [95% CI] | RMSE (pKd) | R² | p-value |
|---|---|---|---|---|---|---|
| v9 Ensemble | 0.890 [0.862, 0.912] | 0.877 | 0.780 [0.708, 0.857] | 1.030 | 0.775 | < 10⁻⁹⁸ |
| v12 Best | 0.938 [0.921, 0.950] | 0.960 | 0.637 [0.571, 0.707] | 0.869 | 0.840 | < 10⁻¹³¹ |

v12 significantly improves over v9: paired t-test on |error|, t = 5.30, p = 2.4×10⁻⁷.

4.2 Predicted vs. Experimental Scatter Plots

Figure 1a. MillerBind v9 predictions vs. experimental pKd.
Figure 1b. MillerBind v12 predictions vs. experimental pKd.

4.3 Residual Distributions

Figure 2. Residual distributions (predicted − experimental). Both models show near-zero mean bias.

4.4 Comparison with Published Methods

Table 2. CASF-2016 scoring power. Published results from original papers. MillerBind validated via TDC Evaluator.

| Method | PCC | MAE (pKd) | Type | Year |
|---|---|---|---|---|
| AutoDock Vina [6] | 0.604 | 2.05 | Physics-based | 2010 |
| X-Score | 0.614 | 1.98 | Empirical | 2002 |
| RF-Score v3 | 0.800 | 1.40 | Random Forest | 2015 |
| Pafnucy [5] | 0.780 | 1.42 | 3D-CNN | 2018 |
| OnionNet-2 | 0.816 | 1.28 | Deep Learning | 2021 |
| PIGNet [7] | 0.830 | 1.21 | GNN | 2022 |
| IGN [8] | 0.850 | 1.15 | GNN | 2021 |
| HAC-Net [9] | 0.860 | 1.10 | DL Ensemble | 2023 |
| MillerBind v9 * | 0.890 | 0.780 | Proprietary ML | 2025 |
| MillerBind v12 * | 0.938 | 0.637 | Proprietary ML | 2025 |

* Validated using TDC Evaluator (PyTDC v1.1.15)

Key Finding: MillerBind v12 achieves PCC = 0.938 and MAE = 0.637 pKd, a 9.1% improvement in correlation and a 42.1% reduction in error relative to the best published method (HAC-Net, PCC = 0.860, MAE = 1.10). Both correlations are highly significant (Pearson test, p < 10⁻⁹⁸).
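The percentages in the key finding follow directly from the table values:

```python
# Relative-improvement arithmetic behind the key finding
best_pub_pcc, best_pub_mae = 0.860, 1.10   # HAC-Net (Table 2)
v12_pcc, v12_mae = 0.938, 0.637

pcc_gain = (v12_pcc - best_pub_pcc) / best_pub_pcc        # ~0.091
mae_reduction = (best_pub_mae - v12_mae) / best_pub_mae   # ~0.421
print(f"{pcc_gain:.1%} higher correlation, {mae_reduction:.1%} lower error")
# prints: 9.1% higher correlation, 42.1% lower error
```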

4.5 Per-Affinity-Range Analysis

Table 3. Performance stratified by experimental pKd range.

| Range (pKd) | n | v9 PCC | v9 MAE | v12 PCC | v12 MAE |
|---|---|---|---|---|---|
| Weak (2–4) | 39 | 0.462 | 0.629 | 0.852 | 0.431 |
| Low (4–6) | 79 | 0.813 | 0.702 | 0.890 | 0.522 |
| Medium (6–8) | 92 | 0.402 | 0.481 | 0.680 | 0.322 |
| High (8–10) | 58 | 0.372 | 1.056 | 0.576 | 0.968 |
| Very High (10–12) | 17 | 0.698 | 2.169 | 0.514 | 2.225 |

v12 shows particular strength in the weak and low-affinity ranges (PCC 0.852 and 0.890 respectively), which are traditionally the most challenging for scoring functions. Both models show increased error at very high affinities (pKd > 10), consistent with the known regression-to-mean effect in this regime.
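The stratification in Table 3 amounts to binning complexes by experimental pKd and scoring each bin separately. A minimal sketch (stratified_metrics is an illustrative helper, not released tooling):

```python
import numpy as np
from scipy import stats

# Bin edges matching Table 3: Weak (2-4), Low (4-6), Medium (6-8),
# High (8-10), Very High (10-12)
EDGES = np.array([2.0, 4.0, 6.0, 8.0, 10.0, 12.0])
LABELS = ["Weak", "Low", "Medium", "High", "Very High"]

def stratified_metrics(y_true, y_pred):
    """Per-affinity-range n, PCC, and MAE, as in Table 3."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    bins = np.digitize(y_true, EDGES[1:-1])  # maps each pKd to one of 5 ranges
    out = {}
    for b, label in enumerate(LABELS):
        mask = bins == b
        if mask.sum() < 2:
            continue  # PCC is undefined for fewer than two complexes
        out[label] = {
            "n": int(mask.sum()),
            "PCC": float(stats.pearsonr(y_true[mask], y_pred[mask])[0]),
            "MAE": float(np.mean(np.abs(y_pred[mask] - y_true[mask]))),
        }
    return out
```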

4.6 TDC BindingDB Cross-Reference

- 509 TDC targets with PDBbind structures (46.7% of 1,090)
- 8,384 PDBbind complexes matching TDC targets
- 49.5% of TDC BindingDB_Kd entries with structural coverage
- 170 CASF-2016 complexes whose targets appear in TDC

Table 4. MillerBind v9 on full vs. TDC-overlapping CASF-2016 subsets.

| Metric | Full CASF-2016 (n=285) | TDC-Overlapping (n=170) | Δ |
|---|---|---|---|
| PCC | 0.890 | 0.880 | −0.010 |
| Spearman ρ | 0.877 | 0.857 | −0.020 |
| MAE (pKd) | 0.780 | 0.808 | +0.028 |
| RMSE (pKd) | 1.030 | 1.085 | +0.055 |
| R² | 0.775 | 0.744 | −0.031 |

Performance on TDC-overlapping targets (170 complexes) shows minimal degradation (ΔPCC = −0.010), confirming generalization to TDC-relevant drug targets.

5. Discussion

The TDC-validated results confirm that MillerBind v9 and v12 achieve state-of-the-art scoring power on CASF-2016. Several observations merit discussion. First, v12's largest gains over v9 fall in the weak (2–4) and low (4–6) pKd ranges (Table 3), traditionally the hardest regime for scoring functions. Second, both models degrade above pKd 10, consistent with the regression-to-mean effect noted in Section 4.5; very-high-affinity complexes are also scarce in the benchmark (n=17). Third, performance on TDC-overlapping targets is nearly unchanged (ΔPCC = −0.010, Table 4), suggesting the models generalize to the drug targets represented in TDC BindingDB.

6. Conclusion

Using the TDC evaluation framework (PyTDC v1.1.15) as an independent validator, MillerBind v9 achieves PCC = 0.890 (MAE = 0.780 pKd) and MillerBind v12 achieves PCC = 0.938 (MAE = 0.637 pKd) on CASF-2016, with v12 improving significantly over v9 (paired t-test, p = 2.4×10⁻⁷). Cross-referencing with TDC BindingDB_Kd shows 49.5% structural coverage and minimal performance degradation on TDC-overlapping targets (PCC = 0.880, n=170).

References

[1] Huang, K., Fu, T., Gao, W., et al. (2021). Therapeutics Data Commons: Machine Learning Datasets and Tasks for Drug Discovery and Development. NeurIPS Datasets and Benchmarks.

[2] Li, Y., Su, M., Liu, Z., et al. (2019). Assessing protein-ligand interaction scoring functions with the CASF-2016 benchmark. Protein Science, 28(2), 293–301.

[3] Su, M., Yang, Q., Du, Y., et al. (2019). Comparative Assessment of Scoring Functions: The CASF-2016 Update. J. Chem. Inf. Model., 59(2), 895–913.

[4] Wang, R., Fang, X., Lu, Y., & Wang, S. (2004). The PDBbind Database. J. Med. Chem., 47(12), 2977–2980.

[5] Stepniewska-Dziubinska, M.M., Zielenkiewicz, P., & Siedlecki, P. (2018). Development and evaluation of a deep learning model for protein-ligand binding affinity prediction. Bioinformatics, 34(21), 3666–3674.

[6] Trott, O. & Olson, A.J. (2010). AutoDock Vina: Improving the Speed and Accuracy of Docking. J. Comput. Chem., 31(2), 455–461.

[7] Moon, S., Zhung, W., Yang, S., Lim, J., & Kim, W.Y. (2022). PIGNet: a physics-informed deep learning model toward generalized drug-target interaction predictions. Chem. Sci., 13, 3661–3673.

[8] Jiang, D., Hsieh, C.Y., Wu, Z., et al. (2021). InteractionGraphNet: A Novel and Efficient Deep Graph Representation Learning Framework for Accurate Protein-Ligand Interaction Predictions. J. Med. Chem., 64(24), 18209–18232.

[9] Kyro, G.W., Brent, R.I., & Batista, V.S. (2023). HAC-Net: A Hybrid Attention-Based Convolutional Neural Network for Highly Accurate Protein-Ligand Binding Affinity Prediction. J. Chem. Inf. Model., 63(6), 1868–1878.