OpenADMET ExpansionRx Challenge - Methodology Report
Account Name: WakuwakuADMET
1. Model Description
Algorithm:
Training Strategy:
2. External Data
- External Data Sources: In-house (not public)
| Endpoint | External Data Use |
| --- | --- |
| LogD | Yes |
| KSOL | Yes |
| MLM CLint | No |
| HLM CLint | No |
| Caco-2 Permeability Efflux | Yes |
| Caco-2 Permeability Papp A>B | No |
| MPPB | No |
| MBPB | No |
| MGMB | No |
3. Performance Comments
- Performance is consistent between the training and validation sets.
- The use of external datasets was effective for LogD and Caco-2 Permeability Efflux.
- The binding tasks (MPPB, MBPB, and MGMB) exhibit moderate to strong correlations with each other. Transfer learning for the MGMB task, using the MBPB prediction value as an input feature, was therefore effective.
- QM representation on molecular graphs and 3D graph neural networks were effective for binding tasks.
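
The correlation between binding endpoints noted above can be checked with a few lines of code. The following is a minimal sketch, not the analysis actually run for this submission: it assumes the correlation is a Pearson coefficient computed on molecules measured in both assays, and the helper names and all numeric values are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def paired_pearson_r(a, b):
    """Correlate two endpoints using only molecules measured in both (None = missing)."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    xs, ys = zip(*pairs)
    return pearson_r(xs, ys)


# Hypothetical log10-transformed values for five molecules; None marks a missing measurement.
mbpb = [0.5, 1.0, None, 1.5, 2.0]
mgmb = [0.4, 1.2, 0.9, None, 1.9]

r_mbpb_mgmb = paired_pearson_r(mbpb, mgmb)
print(f"Pearson r (MBPB vs MGMB): {r_mbpb_mgmb:.3f}")
```

A high coefficient on the overlapping molecules is the kind of evidence that would motivate feeding one endpoint's prediction into the model for another.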
4. Ensemble Strategy
- Aggregation Method: Simple averaging
- Model Diversity: Different CV folds
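
As an illustration of the aggregation described above, the sketch below averages per-molecule predictions from models trained on different CV folds. The function name and the prediction values are hypothetical; the sketch only assumes that every fold model predicts the same molecules in the same order.

```python
from statistics import fmean


def average_fold_predictions(fold_predictions):
    """Simple (unweighted) average of predictions from several fold models.

    fold_predictions: list of equal-length lists, one list per fold model.
    Returns one averaged prediction per molecule.
    """
    if not fold_predictions:
        raise ValueError("need predictions from at least one fold model")
    n_molecules = len(fold_predictions[0])
    if any(len(p) != n_molecules for p in fold_predictions):
        raise ValueError("all fold models must predict the same molecules")
    # zip(*...) groups the predictions for each molecule across fold models.
    return [fmean(per_molecule) for per_molecule in zip(*fold_predictions)]


# Hypothetical predictions from five fold models for two molecules.
fold_preds = [
    [1.0, 2.0],
    [2.0, 3.0],
    [3.0, 4.0],
    [4.0, 5.0],
    [5.0, 6.0],
]
ensemble_pred = average_fold_predictions(fold_preds)
print(ensemble_pred)
```

If the models are trained on log10-transformed targets, averaging would typically happen in that transformed space before converting back; whether the submission did so is not stated here.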
5. Additional Features / Molecular Representations
- Fingerprints/Descriptors: Not used
- Learned Embeddings: QM representations on molecular graphs were obtained using AIMNet2 (MPPB, MBPB).
- Prediction Value: The MBPB prediction value was used as an input feature for the MGMB prediction task because of the strong correlation between MBPB and MGMB in the training dataset.
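
One simple way to use an upstream prediction as an input feature is to append it to each molecule's feature vector. The sketch below shows that idea only; how the MBPB prediction was actually fed into the MGMB model is not specified in this report, and the function name, embedding values, and prediction values are all hypothetical.

```python
def append_prediction_feature(feature_rows, upstream_predictions):
    """Append one upstream prediction (e.g. predicted MBPB) to each feature row."""
    if len(feature_rows) != len(upstream_predictions):
        raise ValueError("need exactly one upstream prediction per molecule")
    return [list(row) + [pred] for row, pred in zip(feature_rows, upstream_predictions)]


# Hypothetical learned embeddings for three molecules (two dimensions each).
embeddings = [
    [0.1, 0.2],
    [0.3, 0.4],
    [0.5, 0.6],
]
# Hypothetical MBPB predictions (log10 scale) for the same three molecules.
mbpb_pred = [1.5, 0.7, 1.1]

mgmb_inputs = append_prediction_feature(embeddings, mbpb_pred)
print(mgmb_inputs[0])
```

When building such features for training molecules, a common precaution is to use predictions from a model that did not see those molecules during training (for example, out-of-fold predictions), so the downstream model is not fitted on overly optimistic inputs.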
6. Data Preprocessing
- Target Transformation: log10 for all endpoints except LogD
- Zero/Missing Value Handling: Replace zeros with the smallest non-zero value (LogD, KSOL, Caco-2 Permeability Efflux, Caco-2 Permeability Papp A>B); exclude zeros (HLM CLint, MLM CLint, MPPB, MBPB, MGMB)
- SMILES Standardization: RDKit canonicalization
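
The two zero-handling rules and the log10 target transformation listed above can be sketched as follows. This is a minimal illustration, assuming strictly non-negative measurements; the helper names and values are hypothetical, and SMILES canonicalization is not shown.

```python
import math


def replace_zeros_with_min_nonzero(values):
    """Replace each 0 with the smallest non-zero value in the same list."""
    nonzero = [v for v in values if v != 0]
    if not nonzero:
        raise ValueError("no non-zero value available as a replacement")
    floor = min(nonzero)  # assumes all measurements are non-negative
    return [floor if v == 0 else v for v in values]


def exclude_zeros(values):
    """Drop zero measurements entirely."""
    return [v for v in values if v != 0]


def log10_transform(values):
    """Apply log10 to strictly positive values."""
    return [math.log10(v) for v in values]


# Hypothetical KSOL measurements: zeros are replaced, then log10 is applied.
ksol_raw = [0.0, 10.0, 100.0]
ksol_target = log10_transform(replace_zeros_with_min_nonzero(ksol_raw))

# Hypothetical MPPB measurements: zeros are excluded, then log10 is applied.
mppb_raw = [0.0, 1.0, 1000.0]
mppb_target = log10_transform(exclude_zeros(mppb_raw))

print(ksol_target, mppb_target)
```

Predictions made in log10 space are mapped back to the original scale with `10 ** y`. LogD is already a logarithmic quantity, which is why it is left untransformed.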
7. Loss Function / Validation / Split Strategy
- Loss Type: MSE
- Cross-Validation: 5-fold CV
- Split Method: Random split
- Early Stopping: Yes
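
The validation setup above can be sketched in plain Python: a random 5-fold split, an MSE loss, and a patience-based early-stopping rule. The report does not state the early-stopping criterion, so the patience value, the seed, and all function names below are assumptions made for illustration.

```python
import random


def random_kfold_indices(n_samples, n_folds=5, seed=0):
    """Yield (train_indices, val_indices) for a random k-fold split."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_folds nearly equal folds.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        val = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, val


def mse(y_true, y_pred):
    """Mean squared error between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def early_stopping_epoch(val_losses, patience=3):
    """Return the epoch with the best validation loss seen before stopping.

    Training is considered stopped once `patience` consecutive epochs
    pass without a new best validation loss.
    """
    best_epoch, best_loss = 0, float("inf")
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            break
    return best_epoch


splits = list(random_kfold_indices(n_samples=10, n_folds=5, seed=0))

# Hypothetical per-epoch validation losses for one fold.
val_curve = [1.0, 0.8, 0.7, 0.75, 0.72, 0.71, 0.6]
kept_epoch = early_stopping_epoch(val_curve, patience=3)
print(len(splits), kept_epoch)
```

In this example the rule stops after three epochs without improvement and keeps epoch 2, so the later, lower loss at epoch 6 is never reached. That trade-off is inherent to patience-based stopping.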
8. Negative Results / What Didn't Work
- For some endpoints (Caco-2 Permeability Papp A>B, MPPB, HLM CLint), our in-house data did not improve the validation R-squared, whether used as additional training data or as pretraining data. We suspect this is driven by differences in wet-lab assay protocols, which lead to a distribution shift between datasets.
9. References