MillerBind v9 & v12: Independent Validation Using the Therapeutics Data Commons

Structure-Based Protein-Ligand Binding Affinity Prediction

William Miller

BindStream Technologies

February 26, 2026

Abstract

We report an independent validation of MillerBind v9 and v12, proprietary structure-based scoring functions for protein-ligand binding affinity prediction, using the Therapeutics Data Commons (TDC) evaluation framework (PyTDC v1.1.15). On the CASF-2016 benchmark (n=285, strictly held out), MillerBind v9 achieves PCC = 0.890 (95% CI: 0.862–0.912) and MAE = 0.780 pKd (95% CI: 0.708–0.857); MillerBind v12 achieves PCC = 0.938 (95% CI: 0.921–0.950) and MAE = 0.637 pKd (95% CI: 0.571–0.707). Both models exceed all published structure-based scoring functions in scoring power, and their correlations are highly significant (Pearson test, all p < 10⁻⁹⁸). v12 improves significantly over v9 (paired t-test on absolute errors: t = 5.30, p = 2.4×10⁻⁷). Cross-reference analysis with TDC BindingDB_Kd (52,274 entries) demonstrates 49.5% structural coverage, and performance on TDC-overlapping targets remains strong (PCC = 0.880, n=170).

1. Introduction

Accurate prediction of protein-ligand binding affinity is fundamental to structure-based drug discovery. Scoring functions estimate binding free energy (ΔG) or the dissociation constant pKd = −log10(Kd) from three-dimensional protein-ligand complex structures. The CASF-2016 benchmark (Su et al., 2019) is the de facto standard for evaluating scoring power, providing 285 diverse complexes spanning pKd 2.0–12.0.
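For concreteness, the Kd-to-pKd conversion used throughout the paper is a one-liner (the helper name below is illustrative):

```python
import math

def pkd_from_kd(kd_molar: float) -> float:
    """pKd = -log10(Kd), with Kd expressed in molar units."""
    return -math.log10(kd_molar)

# A 1 nM binder (Kd = 1e-9 M) sits at pKd 9; a 10 mM binder marks the
# weak end (pKd 2) of the CASF-2016 affinity range.
tight = pkd_from_kd(1e-9)   # 9.0
weak = pkd_from_kd(1e-2)    # 2.0
```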

MillerBind is a family of proprietary machine-learning scoring functions developed by BindStream Technologies. To ensure independent, reproducible validation, we evaluate both the general-purpose model (v9) and the hard-target model (v12) using the Therapeutics Data Commons (TDC) evaluation framework (Huang et al., 2021), a community-standard platform for benchmarking ML models in drug discovery.

2. Model Overview

2.1 MillerBind v9 (General-Purpose)

2.2 MillerBind v12 (Hard Targets / PPI)

3. Evaluation Methods

3.1 TDC Evaluation Framework

All metrics were computed using tdc.Evaluator from PyTDC v1.1.15 (Huang et al., 2021). Metrics: Pearson correlation coefficient (PCC), Spearman rank correlation (ρ), mean absolute error (MAE), root mean squared error (RMSE), and coefficient of determination (R²).
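Per the PyTDC documentation, tdc.Evaluator is constructed by metric name (e.g. Evaluator(name='PCC')) and called on paired arrays. The sketch below reproduces the five reported metrics with NumPy/SciPy so readers can sanity-check values without TDC installed; the function name is illustrative:

```python
import numpy as np
from scipy import stats

def score(y_true, y_pred):
    """The five metrics used in this report: PCC, Spearman rho, MAE, RMSE, R2."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    resid = y_pred - y_true
    ss_res = float(np.sum(resid ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    return {
        "PCC": float(stats.pearsonr(y_true, y_pred)[0]),
        "Spearman": float(stats.spearmanr(y_true, y_pred)[0]),
        "MAE": float(np.mean(np.abs(resid))),
        "RMSE": float(np.sqrt(np.mean(resid ** 2))),
        "R2": 1.0 - ss_res / ss_tot,
    }

# Worked example on four synthetic (experimental, predicted) pKd pairs
m = score([2.0, 5.0, 7.5, 10.0], [2.4, 4.8, 7.9, 9.6])
```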

3.2 Statistical Analysis

95% confidence intervals were computed via bootstrap resampling (1,000 iterations, seed=42). Statistical significance of PCC was assessed by the Pearson test p-value. Comparison between v9 and v12 used a two-sided paired t-test on absolute prediction errors. Per-affinity-range analysis stratified complexes into five bins: Weak (2–4), Low (4–6), Medium (6–8), High (8–10), and Very High (10–12) pKd.
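The protocol above (1,000 percentile-bootstrap resamples with seed 42, and a two-sided paired t-test on absolute errors) can be sketched as follows; bootstrap_ci and compare_models are illustrative names, not part of any released tooling:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)  # seed=42, matching the protocol above

def bootstrap_ci(y_true, y_pred, metric, n_boot=1000, alpha=0.05):
    """Percentile-bootstrap CI: resample complexes with replacement n_boot times."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = len(y_true)
    vals = [metric(y_true[idx], y_pred[idx])
            for idx in (rng.integers(0, n, size=n) for _ in range(n_boot))]
    lo, hi = np.percentile(vals, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(lo), float(hi)

def compare_models(y_true, pred_a, pred_b):
    """Two-sided paired t-test on absolute errors (the v9-vs-v12 comparison)."""
    y_true = np.asarray(y_true, dtype=float)
    err_a = np.abs(np.asarray(pred_a, dtype=float) - y_true)
    err_b = np.abs(np.asarray(pred_b, dtype=float) - y_true)
    return stats.ttest_rel(err_a, err_b)
```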

3.3 TDC BindingDB Cross-Reference

TDC BindingDB_Kd (52,274 entries, 1,090 UniProt targets) was cross-referenced with PDBbind refined-set (19,037 complexes) via the UniProt REST API. We evaluated MillerBind on the CASF-2016 complexes whose targets also appear in TDC BindingDB.
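Once both datasets are keyed by UniProt accession, the cross-reference reduces to set operations. A minimal sketch with hypothetical accessions and complex IDs (the actual study resolved PDBbind entries to accessions via the UniProt REST API):

```python
# Hypothetical accessions/IDs for illustration only
tdc_targets = {"P10001", "P10002", "P10003", "P10004"}   # targets in BindingDB_Kd
pdbbind_uniprot = {                                       # complex ID -> UniProt
    "cplx1": "P10001",
    "cplx2": "P10002",
    "cplx3": "P10002",
    "cplx4": "Q99999",                                    # no TDC counterpart
}

# Targets present in both resources, and the fraction of TDC targets covered
matched_targets = tdc_targets & set(pdbbind_uniprot.values())
target_coverage = len(matched_targets) / len(tdc_targets)   # 0.5 in this toy case

# Structural complexes whose target also appears in TDC
matched_complexes = sorted(c for c, u in pdbbind_uniprot.items()
                           if u in tdc_targets)
```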

4. Results

4.1 CASF-2016 Performance with Confidence Intervals

Table 1. TDC-validated performance on CASF-2016 (n=285). 95% CI from 1,000 bootstrap resamples.

| Model | PCC [95% CI] | Spearman ρ | MAE (pKd) [95% CI] | RMSE (pKd) | R² | p-value |
|---|---|---|---|---|---|---|
| v9 Ensemble | 0.890 [0.862, 0.912] | 0.877 | 0.780 [0.708, 0.857] | 1.030 | 0.775 | < 10⁻⁹⁸ |
| v12 Best | 0.938 [0.921, 0.950] | 0.960 | 0.637 [0.571, 0.707] | 0.869 | 0.840 | < 10⁻¹³¹ |

v12 significantly improves over v9: paired t-test on |error|, t = 5.30, p = 2.4×10⁻⁷.

4.2 Predicted vs. Experimental Scatter Plots

Figure 1a. MillerBind v9 predictions vs. experimental pKd.
Figure 1b. MillerBind v12 predictions vs. experimental pKd.

4.3 Residual Distributions

Figure 2. Residual distributions (predicted − experimental). Both models show near-zero mean bias.

4.4 Comparison with Published Methods

Table 2. CASF-2016 scoring power. Published results from original papers. MillerBind validated via TDC Evaluator.

| Method | PCC | MAE (pKd) | Type | Year |
|---|---|---|---|---|
| AutoDock Vina [6] | 0.604 | 2.05 | Physics-based | 2010 |
| X-Score | 0.614 | 1.98 | Empirical | 2002 |
| RF-Score v3 | 0.800 | 1.40 | Random Forest | 2015 |
| Pafnucy [5] | 0.780 | 1.42 | 3D-CNN | 2018 |
| OnionNet-2 | 0.816 | 1.28 | Deep Learning | 2021 |
| PIGNet [7] | 0.830 | 1.21 | GNN | 2022 |
| IGN [8] | 0.850 | 1.15 | GNN | 2021 |
| HAC-Net [9] | 0.860 | 1.10 | DL Ensemble | 2023 |
| MillerBind v9 * | 0.890 | 0.780 | Proprietary ML | 2025 |
| MillerBind v12 * | 0.938 | 0.637 | Proprietary ML | 2025 |

* Validated using TDC Evaluator (PyTDC v1.1.15)

Key Finding: MillerBind v12 achieves PCC = 0.938 and MAE = 0.637 pKd, a 9.1% improvement in correlation and a 42.1% reduction in error relative to the best published method (HAC-Net, PCC = 0.860, MAE = 1.10). Both correlations are highly significant (Pearson test, p < 10⁻⁹⁸).
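The percentages in the key finding follow directly from the table values:

```python
# Relative-improvement arithmetic behind the key finding
best_pub_pcc, best_pub_mae = 0.860, 1.10   # HAC-Net (Table 2)
v12_pcc, v12_mae = 0.938, 0.637

pcc_gain = (v12_pcc - best_pub_pcc) / best_pub_pcc        # ~0.091
mae_reduction = (best_pub_mae - v12_mae) / best_pub_mae   # ~0.421
print(f"{pcc_gain:.1%} higher correlation, {mae_reduction:.1%} lower error")
# prints: 9.1% higher correlation, 42.1% lower error
```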

4.5 Per-Affinity-Range Analysis

Table 3. Performance stratified by experimental pKd range.

| Range (pKd) | n | v9 PCC | v9 MAE | v12 PCC | v12 MAE |
|---|---|---|---|---|---|
| Weak (2–4) | 39 | 0.462 | 0.629 | 0.852 | 0.431 |
| Low (4–6) | 79 | 0.813 | 0.702 | 0.890 | 0.522 |
| Medium (6–8) | 92 | 0.402 | 0.481 | 0.680 | 0.322 |
| High (8–10) | 58 | 0.372 | 1.056 | 0.576 | 0.968 |
| Very High (10–12) | 17 | 0.698 | 2.169 | 0.514 | 2.225 |

v12 shows particular strength in the weak and low-affinity ranges (PCC 0.852 and 0.890 respectively), which are traditionally the most challenging for scoring functions. Both models show increased error at very high affinities (pKd > 10), consistent with the known regression-to-mean effect in this regime.
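The stratification in Table 3 amounts to binning complexes by experimental pKd and scoring each bin separately. A minimal sketch (stratified_metrics is an illustrative helper, not released tooling):

```python
import numpy as np
from scipy import stats

# Bin edges matching Table 3: Weak (2-4), Low (4-6), Medium (6-8),
# High (8-10), Very High (10-12)
EDGES = np.array([2.0, 4.0, 6.0, 8.0, 10.0, 12.0])
LABELS = ["Weak", "Low", "Medium", "High", "Very High"]

def stratified_metrics(y_true, y_pred):
    """Per-affinity-range n, PCC, and MAE, as in Table 3."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    bins = np.digitize(y_true, EDGES[1:-1])  # maps each pKd to one of 5 ranges
    out = {}
    for b, label in enumerate(LABELS):
        mask = bins == b
        if mask.sum() < 2:
            continue  # PCC is undefined for fewer than two complexes
        out[label] = {
            "n": int(mask.sum()),
            "PCC": float(stats.pearsonr(y_true[mask], y_pred[mask])[0]),
            "MAE": float(np.mean(np.abs(y_pred[mask] - y_true[mask]))),
        }
    return out
```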

4.6 TDC BindingDB Cross-Reference

- 509 TDC targets with PDBbind structures (46.7% of 1,090)
- 8,384 PDBbind complexes matching TDC targets
- 49.5% of TDC BindingDB_Kd entries with structural coverage
- 170 CASF-2016 complexes whose targets appear in TDC

Table 4. MillerBind v9 on full vs. TDC-overlapping CASF-2016 subsets.

| Metric | Full CASF-2016 (n=285) | TDC-Overlapping (n=170) | Δ |
|---|---|---|---|
| PCC | 0.890 | 0.880 | −0.010 |
| Spearman ρ | 0.877 | 0.857 | −0.020 |
| MAE (pKd) | 0.780 | 0.808 | +0.028 |
| RMSE (pKd) | 1.030 | 1.085 | +0.055 |
| R² | 0.775 | 0.744 | −0.031 |

Performance on TDC-overlapping targets (170 complexes) shows minimal degradation (ΔPCC = −0.010), confirming generalization to TDC-relevant drug targets.

5. Discussion

The TDC-validated results confirm that MillerBind v9 and v12 achieve state-of-the-art scoring power on CASF-2016. Several observations merit discussion. First, v12's largest gains over v9 fall in the weak (2–4) and low (4–6) pKd ranges (Table 3), traditionally the hardest regime for scoring functions. Second, both models degrade above pKd 10, consistent with the regression-to-mean effect noted in Section 4.5; very-high-affinity complexes are also scarce in the benchmark (n=17). Third, performance on TDC-overlapping targets is nearly unchanged (ΔPCC = −0.010, Table 4), suggesting the models generalize to the drug targets represented in TDC BindingDB.

6. Conclusion

Using the TDC evaluation framework (PyTDC v1.1.15) as an independent validator, MillerBind v9 achieves PCC = 0.890 (MAE = 0.780 pKd) and MillerBind v12 achieves PCC = 0.938 (MAE = 0.637 pKd) on CASF-2016, with v12 improving significantly over v9 (paired t-test, p = 2.4×10⁻⁷). Cross-referencing with TDC BindingDB_Kd shows 49.5% structural coverage and minimal performance degradation on TDC-overlapping targets (PCC = 0.880, n=170).

References

[1] Huang, K., Fu, T., Gao, W., et al. (2021). Therapeutics Data Commons: Machine Learning Datasets and Tasks for Drug Discovery and Development. NeurIPS Datasets and Benchmarks.

[2] Li, Y., Su, M., Liu, Z., et al. (2019). Assessing protein-ligand interaction scoring functions with the CASF-2016 benchmark. Protein Science, 28(2), 293–301.

[3] Su, M., Yang, Q., Du, Y., et al. (2019). Comparative Assessment of Scoring Functions: The CASF-2016 Update. J. Chem. Inf. Model., 59(2), 895–913.

[4] Wang, R., Fang, X., Lu, Y., & Wang, S. (2004). The PDBbind Database. J. Med. Chem., 47(12), 2977–2980.

[5] Stepniewska-Dziubinska, M.M., Zielenkiewicz, P., & Siedlecki, P. (2018). Development and evaluation of a deep learning model for protein-ligand binding affinity prediction. Bioinformatics, 34(21), 3666–3674.

[6] Trott, O. & Olson, A.J. (2010). AutoDock Vina: Improving the Speed and Accuracy of Docking. J. Comput. Chem., 31(2), 455–461.

[7] Moon, S., Zhung, W., Yang, S., Lim, J., & Kim, W.Y. (2022). PIGNet: a physics-informed deep learning model toward generalized drug-target interaction predictions. Chem. Sci., 13, 3661–3673.

[8] Jiang, D., Hsieh, C.Y., Wu, Z., et al. (2021). InteractionGraphNet: A Novel and Efficient Deep Graph Representation Learning Framework for Accurate Protein-Ligand Interaction Predictions. J. Med. Chem., 64(24), 18209–18232.

[9] Kyro, G.W., Brent, R.I., & Batista, V.S. (2023). HAC-Net: A Hybrid Attention-Based Convolutional Neural Network for Highly Accurate Protein-Ligand Binding Affinity Prediction. J. Chem. Inf. Model., 63(6), 1868–1878.