---
language:
- en
---

# MolE -- Antimicrobial Prediction

## Short description

MolE learns task-independent molecular representations of chemicals via Graph Isomorphism Networks (GINs). Combined with an XGBoost classifier, it estimates the probability of a compound inhibiting bacterial growth. The model was developed by Roberto Olayo Alarcon et al.; more information can be found in the [GitHub repository](https://github.com/rolayoalarcon/MolE) and the [accompanying paper](https://www.nature.com/articles/s41467-025-58804-4).

## Model versions

- **MolE Antimicrobial Prediction** (March 2024)\
  Pretrained representation model + XGBoost classifier trained on antimicrobial screening data (Maier et al., 2018).\
  Repository: <https://github.com/rolayoalarcon/mole_antimicrobial_potential>\
  Hugging Face Hub: `virtual-human-chc/MolE`

## Long description

MolE integrates molecular graph-based representation learning with gradient-boosted decision trees to predict antimicrobial potential. The approach involves two steps:

1. **Representation learning:** A graph neural network (GINet) is trained on 100,000 randomly sampled compounds to derive molecular embeddings from SMILES strings.
2. **Prediction:** These embeddings are used as input to an **XGBoost** model that predicts antimicrobial activity scores across 40 bacterial strains, based on data from *Maier et al., 2018*.

The model was developed by **Roberto Olayo Alarcon et al.**
Further information is available in the [paper](https://www.nature.com/articles/s41467-025-58804-4).

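As background for step 1, a toy sketch only (not the actual GINet architecture): a GIN layer updates each atom's feature vector by combining it with the sum of its neighbors' features, then passing the result through a learned MLP. The graph, feature values, and function below are illustrative; the MLP is omitted.

``` python
# Toy molecular graph: adjacency list for a 4-atom chain (indices are atoms).
adjacency = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
features = {0: [1.0], 1: [0.5], 2: [0.25], 3: [2.0]}  # made-up 1-d atom features


def gin_sum_aggregate(adj, feats, eps=0.0):
    """One GIN-style aggregation: h_v <- (1 + eps) * h_v + sum of neighbor h_u.

    The real network then applies a learned MLP to each aggregated vector;
    that step is omitted in this dependency-free sketch.
    """
    out = {}
    for v, nbrs in adj.items():
        agg = [(1 + eps) * x for x in feats[v]]
        for u in nbrs:
            agg = [a + b for a, b in zip(agg, feats[u])]
        out[v] = agg
    return out


print(gin_sum_aggregate(adjacency, features))
```

Stacking several such layers and pooling the atom vectors yields a fixed-size embedding per molecule, which is what the XGBoost classifier consumes in step 2.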
## Metadata

### Input

- **Description:** Table of chemicals with their SMILES representations.
- **Shape:** `[n, 2]`
- **Data format:**
  - `chem_name` *(str)*: Name of the molecule (e.g., *Halicin*).
  - `smiles` *(str)*: SMILES representation of the molecule (e.g., `C1=C(SC(=N1)SC2=NN=C(S2)N)[N+](=O)[O-]`).
- **Example input file:** `examples/input/examples_molecules.tsv`

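A two-column TSV matching this schema can be produced with the standard library; the row below reuses the Halicin example from this README, and any further rows are up to the user.

``` python
import csv
import io

# Illustrative rows following the [n, 2] input schema above.
rows = [
    {"chem_name": "Halicin", "smiles": "C1=C(SC(=N1)SC2=NN=C(S2)N)[N+](=O)[O-]"},
]

# Write a tab-separated table with the chem_name/smiles header.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["chem_name", "smiles"],
                        delimiter="\t", lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
tsv_text = buf.getvalue()
print(tsv_text)
```

Writing `tsv_text` to a file gives an input in the same format as `examples/input/examples_molecules.tsv`.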
### Model

- **Modality:** Probability of microbial growth inhibition.
- **Scale:** Combination of bacterial strain and compound.
- **Description:**
  - The model computes antimicrobial predictive probabilities for the **40 bacterial strains** contained in [Maier et al., 2018](https://www.nature.com/articles/nature25979), using a two-step process:
    1. Generate molecular embeddings with a pre-trained **GINet** representation model (`model.pth`, `config.yaml`).
    2. Predict antimicrobial properties with a **trained XGBoost** classifier (`MolE-XGBoost-08.03.2024_14.20.pkl`).
  - The scores reflect the likelihood that a compound inhibits the growth of a given bacterial strain.
- **Training data:** 100,000 randomly sampled compounds from ChemBERTa for the pretraining of MolE, and data from *Maier et al., 2018* on the influence of 1,197 marketed drugs on the growth of 40 bacterial strains for the XGBoost classifier.
- **Publication:** [Nature Communications (2025)](https://www.nature.com/articles/s41467-025-58804-4)

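The two-step process implies one prediction per compound-strain pair. A minimal sketch of that cross join (the compound and strain names here are placeholders; the real strain list comes from the Maier et al. screening data):

``` python
from itertools import product

compounds = ["Halicin", "CompoundB"]          # illustrative compound names
strains = [f"Strain {i}" for i in range(40)]  # placeholders for the 40 strains

# One identifier per compound-strain combination, mirroring the
# "<molecule>:<strain>" pred_id format used in the output table.
pred_ids = [f"{c}:{s}" for c, s in product(compounds, strains)]
print(len(pred_ids))  # 80 for the two compounds above
```

This is why the output table has `n × 40` rows for `n` input compounds.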
### Output

- **Description:** For each compound, the model predicts growth inhibition scores for 40 different bacterial strains.
- **Output format:** table
- **Shape:** `[n × 40, 2]`
- **Columns:**
  - `pred_id` *(str)*: Combination of the molecule name and bacterial strain (e.g., `Halicin:Akkermansia muciniphila (NT5021)`).
  - `antimicrobial_predictive_probability` *(float)*: Predicted probability that the compound inhibits microbial growth.
- **Example output file:** `examples/output/example_molecules_prediction.tsv`

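When post-processing the output table, the `pred_id` column can be split back into its molecule and strain parts. The helper below is a sketch; it assumes the molecule name itself contains no `:`, which holds for the shipped examples but is worth checking for arbitrary chemical names.

``` python
def split_pred_id(pred_id: str) -> tuple[str, str]:
    """Split a '<molecule>:<strain>' identifier into its two parts.

    Splits only on the first ':' so that strain names containing
    parentheses or other punctuation survive intact.
    """
    molecule, strain = pred_id.split(":", 1)
    return molecule, strain


molecule, strain = split_pred_id("Halicin:Akkermansia muciniphila (NT5021)")
print(molecule, "|", strain)
```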
## Installation

Install the conda environment with all dependencies:

``` bash
# Create the conda environment called virtual-human-chc-mole
conda env create -f environment.yaml

# Activate the environment
conda activate virtual-human-chc-mole
```

## Example

### Minimal inference example

``` python
import pickle

import pandas as pd
import torch
import yaml
from huggingface_hub import hf_hub_download
from mole_package import ginet_concat, mole_antimicrobial_prediction, mole_representation, dataset_representation


class MolE:
    def __init__(self, device="auto"):
        repo = "virtual-human-chc/MolE"
        self.device = "cuda:0" if device == "auto" and torch.cuda.is_available() else "cpu"

        # Download the pre-trained representation model and the XGBoost classifier
        cfg = yaml.safe_load(open(hf_hub_download(repo, "config.yaml")))
        self.model = ginet_concat.GINet(**cfg["model"]).to(self.device)
        self.model.load_state_dict(torch.load(hf_hub_download(repo, "model.pth"), map_location=self.device))
        self.model.eval()  # inference only
        self.xgb = pickle.load(open(hf_hub_download(repo, "MolE-XGBoost-08.03.2024_14.20.pkl"), "rb"))

    def predict_from_smiles(self, smiles_tsv):
        # Step 1: embed the molecules with the pre-trained GINet
        smiles_df = mole_representation.read_smiles(smiles_tsv, "smiles", "chem_name")
        emb = dataset_representation.batch_representation(smiles_df, self.model, "smiles", "chem_name", device=self.device)

        # Step 2: pair each embedding with the 40 strains and score with XGBoost
        X_input = mole_antimicrobial_prediction.add_strains(
            emb, "data/01.prepare_training_data/maier_screening_results.tsv.gz"
        )
        probs = self.xgb.predict_proba(X_input)[:, 1]
        return pd.DataFrame(
            {"antimicrobial_predictive_probability": probs},
            index=X_input.index,
        )


# Run inference
mole = MolE()
pred = mole.predict_from_smiles("examples/input/examples_molecules.tsv")
print(pred)
```

## References

1. Roberto Olayo Alarcon et al., *MolE: Graph-based molecular representation learning for antimicrobial discovery*, [Nature Communications (2025)](https://www.nature.com/articles/s41467-025-58804-4).
2. Maier et al., *Extensive impact of non-antibiotic drugs on human gut bacteria*, *Nature* (2018), <https://www.nature.com/articles/nature25979>.
3. MolE GitHub repository: <https://github.com/rolayoalarcon/MolE>.
4. Antimicrobial-potential GitHub repository: <https://github.com/rolayoalarcon/mole_antimicrobial_potential>.
5. Model weights (Zenodo DOI): <https://doi.org/10.5281/zenodo.10803099>.

## Copyright

Code derived from <https://github.com/rolayoalarcon/MolE> is licensed under the **MIT License**, © 2024 Roberto Olayo Alarcon.
The [model weights](https://doi.org/10.5281/zenodo.10803099) are licensed under [**Creative Commons Attribution 4.0 International (CC BY 4.0)**](https://creativecommons.org/licenses/by/4.0/legalcode), © 2024 Roberto Olayo Alarcon.
All other code is licensed under the **MIT License**, © 2025 Maksim Pavlov.