Data CSVs: were affinity measurements aggregated across assays?
We noticed that the released dataset appears smaller than what is reported for Boltz2. In addition, there seem to be only one or two rows per protein–ligand pair (whereas in the source data the same pair often appears across multiple assays), many assay_id values are NA, and there is a pIC50_IQR column. Together, these observations suggest that affinity measurements may have been aggregated across assays.
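For reference, the checks behind these observations can be sketched roughly as follows. This is a minimal illustration, not the exact script we ran: `assay_id` is the column name from the released CSVs, while `protein_id`, `ligand_id`, the set of NA spellings, and the toy rows are placeholders we made up for the example.

```python
from collections import Counter

# Strings treated as missing; an assumption about how NA is encoded in the CSVs.
NA_VALUES = {"", "NA", "NaN", "nan", "None"}


def summarize_pairs(rows, protein_col="protein_id", ligand_col="ligand_id",
                    assay_col="assay_id"):
    """Count rows per protein-ligand pair and the fraction of missing assay IDs.

    `rows` is an iterable of dicts, e.g. from csv.DictReader.
    `protein_col` and `ligand_col` are hypothetical column names.
    """
    rows = list(rows)
    # Number of rows for each (protein, ligand) pair.
    pair_counts = Counter((r[protein_col], r[ligand_col]) for r in rows)
    # Histogram: how many pairs have 1 row, 2 rows, and so on.
    rows_per_pair = Counter(pair_counts.values())
    n_missing = sum(
        1 for r in rows if str(r.get(assay_col) or "").strip() in NA_VALUES
    )
    return {
        "n_rows": len(rows),
        "n_pairs": len(pair_counts),
        "rows_per_pair": dict(rows_per_pair),
        "frac_assay_na": n_missing / len(rows) if rows else 0.0,
    }


# Toy rows for illustration only; not taken from the released data.
example = [
    {"protein_id": "P1", "ligand_id": "L1", "assay_id": "NA"},
    {"protein_id": "P1", "ligand_id": "L1", "assay_id": "A7"},
    {"protein_id": "P2", "ligand_id": "L3", "assay_id": "NA"},
]
summary = summarize_pairs(example)
print(summary)
```

On the real files, `rows` would come from `csv.DictReader` opened on one of the released CSVs, with the two pair columns set to whatever the actual header uses.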
Could you clarify whether aggregation was performed? If so, we would greatly appreciate details on the aggregation strategy. Access to the pre-aggregation dataset would also be extremely helpful: Boltz2-style affinity training relies on assay-level information for batch construction and for components of the loss function, so assay granularity is important for reproducibility and methodological comparisons.
If the data were not aggregated, could you help clarify the apparent differences in dataset size relative to Boltz2, as well as the interpretation of the pIC50_IQR column?
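To make the question concrete, here is the kind of aggregation we are guessing at. It is purely our assumption, not something documented for this dataset: each protein–ligand pair is collapsed to the median pIC50, with the interquartile range stored as `pIC50_IQR`. The function name, the input tuple layout, the `pIC50` and `n_assays` keys, and the toy values are all hypothetical.

```python
from collections import defaultdict
from statistics import median, quantiles


def aggregate_pic50(measurements):
    """Collapse assay-level measurements to one record per protein-ligand pair.

    `measurements` is an iterable of (protein_id, ligand_id, assay_id, pic50).
    This is a guess at a possible strategy, not the dataset's actual pipeline.
    """
    grouped = defaultdict(list)
    for protein_id, ligand_id, _assay_id, pic50 in measurements:
        # The assay ID is dropped here, which is the information loss we ask about.
        grouped[(protein_id, ligand_id)].append(pic50)

    aggregated = {}
    for pair, values in grouped.items():
        if len(values) >= 2:
            # Quartiles with linear interpolation over the observed range.
            q1, _q2, q3 = quantiles(values, n=4, method="inclusive")
            iqr = q3 - q1
        else:
            iqr = None  # spread is undefined for a single measurement
        aggregated[pair] = {
            "pIC50": median(values),
            "pIC50_IQR": iqr,
            "n_assays": len(values),
        }
    return aggregated


# Toy measurements for illustration only.
toy = [
    ("P1", "L1", "A1", 6.0),
    ("P1", "L1", "A2", 7.0),
    ("P1", "L1", "A3", 8.0),
    ("P2", "L3", "A4", 5.5),
]
result = aggregate_pic50(toy)
print(result)
```

If something like this was done, knowing the exact statistic (median or mean), the quantile convention, and how single-assay pairs are handled would let us interpret `pIC50_IQR` correctly.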
More generally, we look forward to the release of the SQL queries and preprocessing scripts, which will help us better understand and reproduce the dataset construction process.
Thank you in advance for your clarification!