Questions for training / fine-tuning

#5
by KirillShmilovich - opened

Thanks for open-sourcing this work! I'm potentially interested in some fine-tuning and have a few questions:

  1. pIC50 conversion
    Is this the correct conversion from pIC50 to your training scale (e.g., the regression target values in data/training_data.csv)?
    binding_affinity = pIC50 - 6
    (This assumes the model output is -log10(IC50_μM), i.e. IC50 expressed in μM rather than M.)

  2. Default assay_batch_size=1 behavior
    With the default assay_batch_size=1, the CliffLoss relative component (90% weight) computes 0 pairs since you need ≥2 samples from the same assay for pairwise comparisons. Is this intentional, or should users increase assay_batch_size for training?

  3. Recommended training hyperparameters
    What settings were used to train the released checkpoint? The OpenFold3 defaults (epoch_len=4, batch_size=1) seem like placeholders. Any guidance on:

    • assay_batch_size
    • epoch_len
    • batch_size
    • Number of epochs

  4. Data source for query_ids
    Do the numeric query_id values map to BindingDB entry IDs? And do binding_affinity_dataset_* / PUBCHEM_* assay IDs correspond to BindingDB/PubChem assay IDs?

  5. Pre-computed embeddings
    Are there plans to release the cached OpenFold3 embeddings, or is the expectation that users generate their own from scratch?
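
To make the first two questions concrete, here is a minimal Python sketch of what I am assuming. The function names are hypothetical and not taken from the repository. The conversion assumes the target really is -log10(IC50_μM), and the pair count assumes the relative loss term only compares samples that share an assay ID within a batch.

```python
import math
from collections import Counter


def pic50_to_target(pic50: float) -> float:
    """Convert pIC50 (-log10 of IC50 in mol/L) to -log10(IC50 in μM).

    Assumption: the training target is -log10(IC50_μM).
    Since IC50_μM = IC50_M * 1e6, the target equals pIC50 - 6.
    """
    return pic50 - 6.0


def ic50_um_to_target(ic50_um: float) -> float:
    """Same target computed directly from an IC50 given in μM."""
    return -math.log10(ic50_um)


def count_same_assay_pairs(assay_ids) -> int:
    """Count unordered sample pairs in a batch that share an assay ID.

    Assumption: only such pairs contribute to the relative loss term.
    """
    counts = Counter(assay_ids)
    return sum(n * (n - 1) // 2 for n in counts.values())


if __name__ == "__main__":
    # A 1 nM compound: pIC50 = 9, IC50 = 0.001 μM.
    print(pic50_to_target(9.0))
    print(ic50_um_to_target(0.001))
    # One sample from one assay yields no pairs.
    print(count_same_assay_pairs(["assay_A"]))
    # Three samples from one assay plus one from another.
    print(count_same_assay_pairs(["assay_A", "assay_A", "assay_A", "assay_B"]))
```

If the batch only ever holds one sample per assay, the pair count is zero, which is the situation I am asking about in question 2.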

Thanks!

KirillShmilovich changed discussion title from Questions for reproducing training / fine-tuning on custom data to Questions for training / fine-tuning
