
MillerBind v9 & v12 – TDC Validation

Independent third-party validation of MillerBind scoring functions using the Therapeutics Data Commons (TDC) evaluation framework.

Developed by William Miller – BindStream Technologies


Results Summary

CASF-2016 Scoring Power Benchmark (n = 285, held out)

All metrics computed using tdc.Evaluator from PyTDC v1.1.15.

Model            PCC     PCC 95% CI       Spearman ρ   MAE (pKd)   MAE 95% CI       RMSE    R²
MillerBind v9    0.890   [0.862, 0.912]   0.877        0.780       [0.708, 0.857]   1.030   0.775
MillerBind v12   0.938   [0.921, 0.950]   0.960        0.637       [0.571, 0.707]   0.869   0.840

95% confidence intervals from 1,000 bootstrap resamples.
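
The resampling procedure behind these intervals is standard; a minimal sketch with NumPy, using invented data (the real values live in the prediction CSVs) and PCC as the metric:

```python
import numpy as np

def bootstrap_ci(y_true, y_pred, metric, n_boot=1000, seed=0):
    """Percentile-bootstrap 95% CI for a paired metric (resamples pairs)."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)          # resample indices with replacement
        stats.append(metric(y_true[idx], y_pred[idx]))
    return np.percentile(stats, [2.5, 97.5])

# Synthetic stand-in data, illustrative only
rng = np.random.default_rng(42)
y_true = rng.normal(6.0, 1.5, 285)           # pKd-like experimental values
y_pred = y_true + rng.normal(0.0, 0.6, 285)  # correlated predictions

pcc = lambda a, b: np.corrcoef(a, b)[0, 1]
lo, hi = bootstrap_ci(y_true, y_pred, pcc)
print(f"PCC 95% CI: [{lo:.3f}, {hi:.3f}]")
```

Resampling complex/prediction pairs (rather than residuals) is the usual choice for correlation metrics, since it preserves the pairing.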

CASF-2016 Ranking Power (53 target clusters)

Ranking power measures whether the model correctly ranks ligands by affinity within each target protein cluster.

Model            Avg Spearman ρ   Avg Kendall τ   Concordance   Top-1 Success
X-Score          0.247            –               –             –
AutoDock Vina    0.281            –               –             –
RF-Score v3      0.464            –               –             –
ΔVinaRF20        0.476            –               –             –
OnionNet-2       0.488            –               –             –
MillerBind v9    0.740            0.662           82.7%         60.4%
MillerBind v12   0.979            0.962           97.9%         92.5%

v12 achieves near-perfect ranking across the 53 protein targets, correctly identifying the strongest binder in 49/53 targets.
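
The per-cluster computation behind these numbers is straightforward: CASF-2016 groups the 285 complexes into 53 clusters of 5 ligands, and ranking power averages the within-cluster rank correlation. A minimal sketch with synthetic data (the cluster structure mimics CASF-2016; the values are invented):

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr, kendalltau

# Synthetic stand-in: 53 clusters of 5 ligands each
rng = np.random.default_rng(0)
rows = []
for cluster in range(53):
    exp = rng.normal(6.0, 1.5, 5)             # experimental pKd
    pred = exp + rng.normal(0.0, 0.5, 5)      # model predictions
    rows += [{"cluster": cluster, "exp": e, "pred": p} for e, p in zip(exp, pred)]
df = pd.DataFrame(rows)

spearmans, kendalls, top1 = [], [], []
for _, g in df.groupby("cluster"):
    spearmans.append(spearmanr(g["exp"], g["pred"])[0])
    kendalls.append(kendalltau(g["exp"], g["pred"])[0])
    # Top-1 success: is the true strongest binder also ranked first?
    top1.append(g["exp"].idxmax() == g["pred"].idxmax())

print(f"Avg Spearman ρ: {np.mean(spearmans):.3f}")
print(f"Avg Kendall τ:  {np.mean(kendalls):.3f}")
print(f"Top-1 success:  {np.mean(top1):.1%}")
```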

Comparison with Published Methods (Scoring Power)

Method           PCC     MAE (pKd)   Type             Year
AutoDock Vina    0.604   2.05        Physics-based    2010
RF-Score v3      0.800   1.40        Random Forest    2015
OnionNet-2       0.816   1.28        Deep Learning    2021
PIGNet           0.830   1.21        GNN              2022
IGN              0.850   1.15        GNN              2021
HAC-Net          0.860   1.10        DL Ensemble      2023
MillerBind v9    0.890   0.780       Proprietary ML   2025
MillerBind v12   0.938   0.637       Proprietary ML   2025

TDC BindingDB Cross-Reference

Metric                                                 Value
TDC BindingDB_Kd targets with PDBbind structures       509 / 1,090 (46.7%)
PDBbind complexes matching TDC targets                 8,384
TDC dataset structural coverage                        49.5% (25,869 / 52,274)
v9 PCC on TDC-overlapping CASF-2016 subset (n = 170)   0.880

Full Validation Report

The complete validation report, prepared for peer review, with scatter plots, bootstrap confidence intervals, residual distributions, per-affinity-range analysis, and statistical significance tests, is included in this repository:

View the Full Report (HTML) – download and open in any browser, or print to PDF.


Verify Results

Option 1: Run TDC Evaluator on predictions (quick)

pip install PyTDC numpy pandas scipy
python verify_with_tdc.py

This loads the pre-computed predictions CSV and evaluates them using TDC's official Evaluator.

Option 2: Docker – full independent validation (comprehensive)

docker run --rm bindstream/millerbind-v9-validation

The Docker image contains:

  • AES-256 encrypted model weights (not readable)
  • AES-256 encrypted CASF-2016 features (not readable)
  • Compiled Python bytecode (no source code)
  • Runs predictions and reports metrics, fully offline (no network needed)

Repository Contents

├── README.md                          ← This file
├── predictions/
│   ├── casf2016_v9_predictions.csv    ← 285 predictions (PDB ID, experimental, predicted pKd)
│   └── casf2016_v12_predictions.csv   ← 285 predictions for v12
├── verify_with_tdc.py                 ← TDC Evaluator verification script
├── report/
│   └── MillerBind_TDC_Validation_Report.html  ← Full peer-review report with figures
├── Dockerfile                         ← Docker build reference (for transparency)
└── LICENSE

Why 3D Structures?

MillerBind is a structure-based scoring function: it requires 3D protein-ligand complex structures (PDB + ligand file) as input, not SMILES strings or amino acid sequences.

This is fundamentally different from sequence-based models (e.g., DeepDTA, MolTrans) that predict binding from 1D representations. Structure-based scoring uses the actual 3D atomic coordinates of both the protein and ligand, capturing:

  • Precise interatomic distances between protein and ligand atoms
  • Binding pocket geometry and shape complementarity
  • Hydrogen bonds, hydrophobic contacts, and electrostatic interactions in 3D space

This is why structure-based methods consistently outperform sequence-based methods on binding-affinity benchmarks: they score the real physical interaction rather than inferring it from strings.
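
As a toy illustration of the geometric signal available only to structure-based methods, interatomic distances and close contacts can be computed directly from coordinates. The coordinates and the 4 Å cutoff below are invented for the example; real inputs would come from a PDB/ligand parser:

```python
import numpy as np

# Toy coordinates (angstroms), invented for illustration
protein_atoms = np.array([[0.0, 0.0, 0.0],
                          [1.5, 0.2, 0.1],
                          [3.0, 1.0, 0.5]])
ligand_atoms = np.array([[0.5, 0.3, 0.2],
                         [2.8, 1.1, 0.4]])

# Pairwise protein-ligand distance matrix via broadcasting
diff = protein_atoms[:, None, :] - ligand_atoms[None, :, :]
dist = np.linalg.norm(diff, axis=-1)      # shape (n_protein, n_ligand)

# Count contacts within a 4 A cutoff, a common featurization choice
contacts = int((dist < 4.0).sum())
print("min distance:", dist.min().round(3), "A; contacts:", contacts)
```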

CASF-2016 is the gold-standard benchmark specifically designed for evaluating structure-based scoring functions (Su et al., 2019), and is the standard reported by AutoDock Vina, Glide, RF-Score, OnionNet, PIGNet, IGN, HAC-Net, and now MillerBind.


Model Details

                 MillerBind v9                                  MillerBind v12
Input            3D protein-ligand complex (PDB + ligand file)  3D protein-ligand complex (PDB + ligand file)
Output           Predicted pKd                                  Predicted binding affinity
Use case         General-purpose scoring                        PPI, hard targets, cancer, large proteins
Training data    PDBbind v2020 (18,438 complexes)               PDBbind v2020 (18,438 complexes)
Test set         CASF-2016 core set (285, strictly held out)    CASF-2016 core set (285, strictly held out)
Inference        < 1 second, CPU-only                           < 1 second, CPU-only
Architecture     Proprietary                                    Proprietary

Statistical Significance

  • v9 PCC: p < 10⁻⁹⁸
  • v12 PCC: p < 10⁻¹³¹
  • v12 vs v9 improvement: paired t-test, t = 5.30, p = 2.4 × 10⁻⁷
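
The summary above does not state which per-complex quantity the paired t-test compares; assuming it is absolute prediction error on the same 285 complexes, the test can be sketched with SciPy (data synthetic, the real errors come from the two prediction CSVs):

```python
import numpy as np
from scipy.stats import ttest_rel

# Synthetic per-complex absolute errors for two models on the same complexes
rng = np.random.default_rng(1)
err_v9 = np.abs(rng.normal(0.0, 1.0, 285))
err_v12 = np.abs(0.8 * err_v9 + rng.normal(0.0, 0.05, 285))  # smaller errors

t, p = ttest_rel(err_v9, err_v12)   # paired: each complex scored by both models
print(f"t = {t:.2f}, p = {p:.2e}")
```

A paired test is appropriate here because both models are evaluated on identical complexes, so per-complex differences remove shared variance.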

References

  1. Huang, K., et al. (2021). Therapeutics Data Commons. NeurIPS Datasets and Benchmarks.
  2. Su, M., et al. (2019). Comparative Assessment of Scoring Functions: The CASF-2016 Update. J. Chem. Inf. Model., 59(2), 895–913.
  3. Wang, R., et al. (2004). The PDBbind Database. J. Med. Chem., 47(12), 2977–2980.

License

Results and predictions are provided for independent verification of benchmark performance.

Model weights, feature engineering, and training code are proprietary.

© 2026 BindStream Technologies. All rights reserved.