Questions for training / fine-tuning
Thanks for open-sourcing this work! I'm potentially interested in some fine-tuning and have a few questions:
pIC50 conversion
Is this the correct conversion from pIC50 to your training scale (e.g., the regression target values in data/training_data.csv)?
binding_affinity = pIC50 - 6
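To show where the offset of 6 comes from, here is a minimal sketch that checks the conversion numerically. It assumes pIC50 is defined in the usual way as -log10 of IC50 in mol/L; the function name is hypothetical and not part of this repository, and whether the training targets really use this scale is exactly what I am asking.

```python
import math


def pic50_to_neg_log10_ic50_um(pic50: float) -> float:
    """Hypothetical helper: convert pIC50 to -log10(IC50 in µM).

    Assumes pIC50 = -log10(IC50 in mol/L).
    """
    ic50_molar = 10.0 ** (-pic50)   # undo the pIC50 transform
    ic50_um = ic50_molar * 1e6      # mol/L -> µmol/L
    return -math.log10(ic50_um)


# The round trip should match the shortcut "pIC50 - 6" for any value.
for pic50 in (5.0, 6.0, 9.0):
    converted = pic50_to_neg_log10_ic50_um(pic50)
    assert math.isclose(converted, pic50 - 6.0, abs_tol=1e-9)
```

Under that assumption, a 1 µM compound (pIC50 = 6) maps to 0, and more potent compounds map to positive values.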
(This is based on the output being -log10(IC50_μM).)
Default assay_batch_size=1 behavior
With the default assay_batch_size=1, the CliffLoss relative component (90% weight) computes 0 pairs, since pairwise comparisons need ≥2 samples from the same assay. Is this intentional, or should users increase assay_batch_size for training?
Recommended training hyperparameters
What settings were used to train the released checkpoint? The OpenFold3 defaults (epoch_len=4, batch_size=1) seem like placeholders. Any guidance on:
- assay_batch_size
- epoch_len
- batch_size
- Number of epochs
Data source for query_ids
Do the numeric query_id values map to BindingDB entry IDs? And do binding_affinity_dataset_* / PUBCHEM_* assay IDs correspond to BindingDB/PubChem assay IDs?
Pre-computed embeddings
Are there plans to release the cached OpenFold3 embeddings, or is the expectation that users generate their own from scratch?
Thanks!