MillerBind-Open v1
An open-weight, fully reproducible reference model for protein-ligand binding-affinity prediction.
Trained from scratch on public RCSB PDB / BindingDB data, using only
publicly-disclosed mathematics (see 01_build_dataset.py and
02_featurize_and_train.py in this repository for the complete, runnable
training recipe, from data collection through final weights).
Author / inventor: William T. L. Miller
License: CC BY-NC 4.0 (free for research and non-commercial use)
Paper: PAPER.md / PAPER.pdf (full technical report: methods, data pipeline, honest evaluation, limitations)
What this is
MillerBind-Open is the open-weight member of the MillerBind family of structure-based binding-affinity models. It implements the publicly-disclosed core of the MillerBind method (patent pending, US provisional application no. 64/102,152):
- HIN-12 atom classification: every atom is folded into one of 12 harmonic classes by its atomic number: HIN(Z) = 1 + ((Z - 1) mod 12).
- Contact histogram features: for every protein-ligand atom pair within 8 Å, the model accumulates a 12×12 raw contact-count histogram and a 12×12 distance-weighted contact histogram (297 features total, including geometry and phase-coherence summary statistics).
- Phase-coherence ratio: the publicly-disclosed modular rule (|HIN_p − HIN_l| mod 3 == 0) is used to compute a constructive/destructive contact ratio.
- An ExtraTrees regressor, trained end-to-end on these raw features to predict a pKd-equivalent binding affinity.
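As an illustration of the three feature rules above, here is a minimal pure-Python sketch. It is not the code in 02_featurize_and_train.py. The function names, the 1/(1 + d) distance weight, the inclusive 8 Å cutoff, and the choice to report the ratio as the constructive fraction of all contacts are assumptions made for this example. It also omits the geometry and summary statistics that make up the rest of the 297 features.

```python
import math

CUTOFF_ANGSTROM = 8.0  # contact cutoff stated in the description above


def hin_class(z: int) -> int:
    """Fold an atomic number into one of 12 classes: HIN(Z) = 1 + ((Z - 1) mod 12)."""
    if z < 1:
        raise ValueError("atomic number must be >= 1")
    return 1 + ((z - 1) % 12)


def contact_features(protein, ligand, cutoff=CUTOFF_ANGSTROM):
    """Build 12x12 contact histograms for one complex.

    protein, ligand: lists of (atomic_number, (x, y, z)) tuples.
    Returns (counts, weighted, constructive_fraction).
    """
    counts = [[0] * 12 for _ in range(12)]
    weighted = [[0.0] * 12 for _ in range(12)]
    constructive = 0
    total = 0
    for z_p, xyz_p in protein:
        h_p = hin_class(z_p)
        for z_l, xyz_l in ligand:
            d = math.dist(xyz_p, xyz_l)
            if d > cutoff:  # ASSUMPTION: the cutoff is inclusive
                continue
            h_l = hin_class(z_l)
            counts[h_p - 1][h_l - 1] += 1
            # ASSUMPTION: 1/(1 + d) is a placeholder distance weight.
            weighted[h_p - 1][h_l - 1] += 1.0 / (1.0 + d)
            # Disclosed rule: a contact is "constructive" when
            # |HIN_p - HIN_l| mod 3 == 0, otherwise "destructive".
            if abs(h_p - h_l) % 3 == 0:
                constructive += 1
            total += 1
    # ASSUMPTION: the ratio is reported as constructive / all contacts.
    fraction = constructive / total if total else 0.0
    return counts, weighted, fraction


# Toy complex: a carbon and a distant oxygen vs. a nitrogen and a carbon.
toy_protein = [(6, (0.0, 0.0, 0.0)), (8, (20.0, 0.0, 0.0))]
toy_ligand = [(7, (3.0, 0.0, 0.0)), (6, (0.0, 4.0, 0.0))]
counts, weighted, fraction = contact_features(toy_protein, toy_ligand)
print(hin_class(16))   # sulfur folds to class 4
print(fraction)        # 0.5: one constructive (C-C), one destructive (C-N)
```

Flattening the two 12×12 matrices row by row gives 288 of the input columns; the ordering actually used by the released model is the one recorded in feature_names.json.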
This is a real, working, independently-trained model, not a stub, demo, or placeholder. Run it on any protein-ligand complex and get a genuine prediction (see How to Use below).
What this is NOT
This is not MillerBind's private production model (v9 / v12), which powers the hosted platform and is validated at state-of-the-art accuracy on CASF-2016 and the TDC BindingDB_Kd leaderboard. The production models use a calibrated 12×12 atom-pair compatibility matrix, a calibrated 9×9 residue-pair matrix, calibrated HTOE energy-boost constants, and a calibrated XGBoost+ExtraTrees ensemble blend. All of these are trade secrets and are not included here, not derived here, and were not used to build this repository in any way.
MillerBind-Open instead lets the regressor learn pairwise interaction importance directly from a small public training set, with no precomputed compatibility matrix at all. It is intentionally simpler, and its accuracy (below) is intentionally and honestly weaker than the production system.
How to Use
Installation
git clone https://huggingface.co/williamTLmiller/millerbind-open-v1
cd millerbind-open-v1
pip install -r requirements.txt
Command line
python predict.py --complex your_complex.pdb --ligand-resname LIG
Or with separate protein/ligand files:
python predict.py --protein protein.pdb --ligand ligand.pdb --ligand-resname LIG
Example (using the bundled example_4dkl.pdb: PDB 4DKL, the μ-opioid
receptor bound to the antagonist BF0, a structure with no public
binding-affinity annotation, so it is genuinely held out of this model's
training data):
python predict.py --complex example_4dkl.pdb --ligand-resname BF0
Protein atoms: 3476
Ligand atoms: 33
==========================================
MillerBind-Open v1 Prediction
==========================================
Predicted pAffinity (pKd-equivalent): 6.72
Predicted Kd-equivalent: 189.95 nM
Affinity: Moderate (uM)
==========================================
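The two numbers in the sample output are related by Kd = 10^(−pAffinity) M. The short sketch below shows the conversion to nanomolar; the function name is hypothetical and not part of predict.py. Applied to the rounded value 6.72 it gives about 190.5 nM, so the 189.95 nM shown above is presumably computed from the unrounded prediction.

```python
def p_affinity_to_kd_nm(p_affinity: float) -> float:
    """Convert a pKd-equivalent to a Kd-equivalent in nM.

    Kd [M] = 10 ** (-pAffinity), and 1 M = 1e9 nM,
    so Kd [nM] = 10 ** (9 - pAffinity).
    """
    return 10.0 ** (9.0 - p_affinity)


print(round(p_affinity_to_kd_nm(6.72), 1))  # 190.5
```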
Python API
from predict import parse_protein, parse_ligand, predict
p_atoms, p_coords = parse_protein("protein.pdb")
l_atoms, l_coords = parse_ligand("ligand.pdb", resname="LIG")
p_affinity, kd_nm = predict(p_atoms, p_coords, l_atoms, l_coords)
print(f"pAffinity={p_affinity:.2f} Kd={kd_nm:.1f} nM")
Training data
621 protein-ligand complexes were assembled live from RCSB PDB's own
public rcsb_binding_affinity annotations (sourced from BindingDB,
queried via RCSB's public GraphQL API; see 01_build_dataset.py). For each
entry:
- A primary binding measurement was selected (preferring Kd, then Ki, then IC50/EC50; median taken when multiple measurements of the same type exist) and converted to a pKd-equivalent: pAffinity = -log10(value_in_M).
- The actual 3D structure was downloaded directly from RCSB PDB (public domain).
- Protein atoms (ATOM records) and the specific annotated ligand (HETATM records matching the annotation's component ID) were parsed.
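The label-selection step in the first bullet can be sketched as follows. This is an illustration under stated assumptions, not an excerpt from 01_build_dataset.py: the function name and the (type, value) tuple layout are hypothetical, values are assumed to be in nM, and IC50 is assumed to rank ahead of EC50 (the text above does not say how the two are ordered).

```python
import math
from statistics import median

# Preference order from the text; IC50-before-EC50 is an ASSUMPTION.
PRIORITY = ("Kd", "Ki", "IC50", "EC50")


def select_p_affinity(measurements):
    """Pick one label for a complex.

    measurements: list of (measurement_type, value_in_nM) tuples.
    Returns (measurement_type, pAffinity), or None if nothing is usable.
    """
    for kind in PRIORITY:
        values = [v for t, v in measurements if t == kind and v > 0]
        if values:
            # Median over repeated measurements of the preferred type,
            # then nM -> M before taking -log10.
            value_in_m = median(values) * 1e-9
            return kind, -math.log10(value_in_m)
    return None


example = [("IC50", 5.0), ("Kd", 10.0), ("Kd", 1000.0), ("Kd", 100.0)]
kind, p_affinity = select_p_affinity(example)
print(kind, round(p_affinity, 2))  # Kd 7.0 (median Kd is 100 nM)
```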
This is not a redistribution of any third-party curated dataset (e.g. PDBbind). Every data point is fetched live from RCSB's own public APIs by the included scripts, so the entire pipeline is independently reproducible by anyone with no special access.
Final affinity range: pAffinity 2.19–11.15 (a ~9-log-unit span, i.e. a genuinely diverse set of strong, moderate, and weak binders), mean 6.82, std 1.58.
Training procedure
See 01_build_dataset.py (data collection) and 02_featurize_and_train.py
(featurization + training) for the complete, runnable recipe. Summary:
- 80/20 random train/test split (random_state=42), 493 train / 124 test.
- sklearn.ensemble.ExtraTreesRegressor: 400 trees, max_depth=12, min_samples_leaf=3, max_features="sqrt".
- The released model.joblib is refit on the full 617-complex dataset (after dropping a handful of complexes with fewer than 3 contacts); metrics below are from the held-out split before that final refit.
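The summary above maps onto scikit-learn roughly as in the sketch below. It uses random placeholder arrays in place of the real features and labels, so the numbers it produces are meaningless; only the split sizes and hyperparameters mirror the text. Passing random_state to the regressor and using n_jobs=-1 are assumptions, and 02_featurize_and_train.py remains the authoritative recipe.

```python
import numpy as np
from sklearn.base import clone
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import train_test_split

# Placeholder data with the shapes described above (617 complexes, 297 features).
rng = np.random.default_rng(0)
X = rng.random((617, 297))
y = rng.normal(loc=6.82, scale=1.58, size=617)

# 80/20 random split with random_state=42 -> 493 train / 124 test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = ExtraTreesRegressor(
    n_estimators=400,
    max_depth=12,
    min_samples_leaf=3,
    max_features="sqrt",
    random_state=42,  # ASSUMPTION: seed for the trees is not stated above
    n_jobs=-1,        # ASSUMPTION: parallelism is a convenience only
)
model.fit(X_train, y_train)
held_out_pred = model.predict(X_test)  # held-out metrics are computed from these

# Final refit on all 617 complexes, as done for the released model.joblib.
final_model = clone(model).fit(X, y)
print(X_train.shape, X_test.shape)
```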
Evaluation (held-out test set, n=124)
| Metric | Value |
|---|---|
| Pearson R | 0.623 |
| Spearman ρ | 0.593 |
| MAE (pAffinity) | 0.999 |
| RMSE (pAffinity) | 1.238 |
For context: AutoDock Vina scores R≈0.60 on the CASF-2016 benchmark. The two
values are not directly comparable, since this open model's R≈0.62 is measured
on its own (different, smaller, more diverse) held-out set, but it is a
believable result for a from-scratch baseline with ~500 training examples and
no calibrated chemistry knowledge. It is, as expected, nowhere near the private
production model's published CASF-2016 results (R=0.890 for v9, R=0.938 for
v12). See metrics.json for the full held-out PDB ID list used in this evaluation.
This is not a claim of state-of-the-art performance. It is an honest, reproducible, small-data reference baseline.
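For readers who want to recompute the four metrics in the table from their own predictions, here is a dependency-free sketch. The function names are chosen for this example and are not taken from the repository; Spearman ρ is computed as the Pearson correlation of average ranks, which is the standard tie-aware definition.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rho: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def mae(y_true, y_pred):
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))


# Toy pAffinity values, not taken from the real test set.
y_true = [5.0, 6.0, 7.0, 8.0]
y_pred = [5.5, 6.5, 7.5, 9.0]
print(mae(y_true, y_pred))       # 0.625
print(spearman(y_true, y_pred))  # 1.0, since the ordering is preserved
```

The sketch does not guard against constant inputs, for which both correlations are undefined.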
Intended use
- Research, education, and benchmarking of structure-based binding-affinity methods.
- A transparent, fully-reproducible reference implementation of the publicly-disclosed MillerBind feature-engineering approach (HIN-12 folding, contact histograms, phase-coherence).
- A starting point for further open research: the training scripts are included specifically so the community can extend, retrain on a larger public corpus, or build on this baseline.
Not intended for: clinical use, FDA submissions, or any decision with real-world health or safety consequences. Predictions are not validated for production drug-discovery decisions; for that, see the commercially licensed production models.
Limitations
- Trained on only 617 complexes, which is small by modern ML standards. Expect higher variance and lower accuracy than models trained on the full PDBbind corpus (18,000+ complexes).
- No explicit water modeling, no pose generation/docking, no learned compatibility matrix: the model only sees raw geometric contact histograms.
- Ligand selection in --complex mode without --ligand-resname uses a simple heuristic (first non-solvent/non-ion HETATM group) and may pick the wrong group in complexes with multiple bound ligands; specify --ligand-resname explicitly for reliable results.
- No uncertainty quantification beyond the held-out metrics above.
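To make the ligand-selection limitation concrete, the sketch below shows one way a "first non-solvent/non-ion HETATM group" heuristic can be written. It is not the code in predict.py: the function names and the exclusion set are assumptions for illustration, and the only fixed fact used is the standard PDB column layout (residue name in columns 18-20).

```python
# ASSUMPTION: an illustrative exclusion list of waters and common ions.
EXCLUDED_RESNAMES = {"HOH", "WAT", "DOD", "NA", "CL", "K", "MG", "CA", "ZN", "MN", "FE"}


def guess_ligand_resname(pdb_lines, excluded=EXCLUDED_RESNAMES):
    """Return the residue name of the first HETATM group that is not excluded."""
    for line in pdb_lines:
        if line.startswith("HETATM"):
            resname = line[17:20].strip()  # PDB columns 18-20
            if resname and resname not in excluded:
                return resname
    return None


def make_hetatm(serial, atom_name, resname, chain, resseq):
    """Hypothetical helper: build a column-aligned HETATM prefix for the demo."""
    return f"HETATM{serial:5d} {atom_name:<4s} {resname:>3s} {chain}{resseq:4d}"


demo = [
    make_hetatm(1, "O", "HOH", "A", 601),
    make_hetatm(2, "NA", "NA", "A", 602),
    make_hetatm(3, "C1", "CLR", "A", 603),  # a second bound group listed first
    make_hetatm(4, "C1", "BF0", "A", 604),  # the ligand of interest
]
print(guess_ligand_resname(demo))  # CLR, not BF0
```

In this made-up input the heuristic returns the first non-excluded group rather than the intended ligand, which is the failure mode that passing --ligand-resname avoids.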
Relationship to the patent
This implementation is consistent with and disclosed by US provisional patent application no. 64/102,152 ("System and method for predictive modeling of structured systems using a modular harmonic fold map, compatibility-matrix interaction scoring, phase-coherence noise filtering, and multi-scale modular-shell feature decomposition"), filed by William T. L. Miller. The patent's "learned end-to-end" embodiment (as opposed to the calibrated-matrix embodiment used by the private production models) is what this repository implements and releases as open weights.
Commercial use & the production models
For commercial use, higher accuracy, or access to the state-of-the-art production models (CASF-2016 R=0.938, TDC BindingDB_Kd leaderboard results), contact the inventor, William T. L. Miller, directly.
Citation
@software{miller2026millerbindopen,
author = {Miller, William T. L.},
title = {{MillerBind-Open}: an open-weight reference model for
protein-ligand binding-affinity prediction},
year = {2026},
license = {CC-BY-NC-4.0}
}
Running the tests
pip install -r requirements.txt pytest
pytest tests/ -v
9 tests cover the HIN-12 fold map, feature computation, end-to-end prediction, numerical stability, error handling, and a guard test that fails the build if any private module is ever accidentally imported.
Repository layout
millerbind-open-v1/
├── README.md # this file (model card)
├── LICENSE.md # CC BY-NC 4.0
├── CITATION.cff # machine-readable citation
├── CHANGELOG.md
├── CONTRIBUTING.md
├── CODE_OF_CONDUCT.md
├── requirements.txt
├── model.joblib # trained ExtraTreesRegressor (open weights)
├── feature_names.json # ordered list of the 297 input features
├── metrics.json # held-out metrics + train/test PDB ID split
├── predict.py # standalone CLI + Python API
├── example_4dkl.pdb # sample complex used in the How to Use example
├── 01_build_dataset.py # reproducible data collection (RCSB/BindingDB)
├── 02_featurize_and_train.py # reproducible featurization + training
└── tests/
    └── test_predict.py
Files in this repository
| File | Description |
|---|---|
| model.joblib | Trained ExtraTreesRegressor (open weights) |
| feature_names.json | Ordered list of the 297 input feature names |
| metrics.json | Held-out evaluation metrics + train/test PDB ID split |
| predict.py | Standalone predictor, no dependency on any private code |
| 01_build_dataset.py | Reproducible data-collection script (RCSB/BindingDB) |
| 02_featurize_and_train.py | Reproducible feature engineering + training script |
| tests/test_predict.py | Test suite (run with pytest tests/ -v) |
| requirements.txt | Python dependencies |
| LICENSE.md | CC BY-NC 4.0 license text |
| CITATION.cff | Machine-readable citation metadata |
| CHANGELOG.md | Release history |
| CONTRIBUTING.md | How to contribute |
| CODE_OF_CONDUCT.md | Community guidelines |
The architecture is public so anyone can read it, audit it, retrain it, or build on it. The production models' calibration and trained weights remain private under a separate commercial license.