---
tags:
- pytorch
- pyg
- graph-neural-networks
- machine-learning
- barlow-twins
- graph-isomorphism-network
- molecular-biology
- computational-biology
- antibiotics
- antimicrobial-discovery
- high-throughput-screening
- virtual-drug-screening
- haicu
- virtual-human-chc
library_name: pytorch
language:
- en
---
# MolE - Antimicrobial Prediction
## Short description
MolE learns task-independent molecular representations of chemicals via Graph Isomorphism Networks (GINs). Combined with an XGBoost classifier, it estimates the probability of a compound inhibiting bacterial growth. The model was developed by Roberto Olayo Alarcon et al.; more information can be found in the [GitHub repository](https://github.com/rolayoalarcon/MolE) and the [accompanying paper](https://www.nature.com/articles/s41467-025-58804-4).
## Model versions
**MolE Antimicrobial Prediction** (March 2024): Pretrained representation model + XGBoost classifier trained on antimicrobial screening data (Maier et al., 2018).
## Long description
**MolE** is a non-contrastive, self-supervised Graph Neural Network (GNN) framework that leverages unlabeled chemical structures to learn task-independent molecular representations. By combining a pre-trained MolE representation with experimentally validated compound-bacteria activity data, the project builds an antimicrobial prediction model that re-discovers recently reported growth-inhibitory compounds that are structurally distinct from current antibiotics. Using the model as a compound prioritization strategy, three human-targeted drugs were identified and experimentally confirmed as growth inhibitors of *Staphylococcus aureus*, highlighting MolE's potential to accelerate the discovery of new antibiotics.
## Metadata
### Input
- **Description:** Table of chemicals with their SMILES representations.
- **Input format:**
- **Shape:** `[n, 2]`, where `n` is the number of chemical compounds
- **Columns:**
- `chem_name` *(str)*: Name of the molecule (e.g., *Halicin*).
  - `smiles` *(str)*: SMILES representation of the molecule (e.g., `C1=C(SC(=N1)SC2=NN=C(S2)N)[N+](=O)[O-]`).
- **Example input file:** `input/examples_molecules.tsv`
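A minimal sketch of assembling such an input table with pandas, assuming the example file is tab-separated with exactly these two columns (the Aspirin row and the output file name are illustrative; Halicin's SMILES is taken from the example above):

```python
import pandas as pd

# Two molecules in the expected [n, 2] layout
molecules = pd.DataFrame(
    {
        "chem_name": ["Halicin", "Aspirin"],
        "smiles": [
            "C1=C(SC(=N1)SC2=NN=C(S2)N)[N+](=O)[O-]",
            "CC(=O)OC1=CC=CC=C1C(=O)O",
        ],
    }
)

# Write a tab-separated file, matching the example input format
molecules.to_csv("examples_molecules.tsv", sep="\t", index=False)
```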
### Model
- **Modality:** String representations of chemical compounds in SMILES format
- **Scale:** Per bacterial strain and chemical compound.
- **Description:**
- The model computes antimicrobial predictive probabilities for **40 bacterial strains** contained in [Maier et al., 2018](https://www.nature.com/articles/nature25979), using a two-step process:
1. Generate molecular embeddings with a pre-trained **GINet** representation model (`model.pth`, `config.yaml`).
2. Predict antimicrobial properties with a **trained XGBoost** classifier (`MolE-XGBoost-08.03.2024_14.20.pkl`). The scores reflect the likelihood that a compound inhibits the growth of a given bacterial strain.
- **Training data:** 100,000 randomly sampled compounds from ChemBERTa for the pretraining of MolE, and data from *Maier et al., 2018* describing the effect of 1,197 marketed drugs on the growth of 40 bacterial strains for the XGBoost classifier.
- **Publication:** [Nature Communications (2025)](https://www.nature.com/articles/s41467-025-58804-4)
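Because one embedding is scored against every strain, the prediction table grows from `n` compounds to `n × 40` rows. A conceptual sketch of that pairing as a pandas cross-join (this is *not* the package's actual `add_strains` implementation; embedding values and the second strain name are illustrative):

```python
import pandas as pd

# Toy embeddings: one row per compound (real MolE embeddings are much wider)
emb = pd.DataFrame(
    {"dim0": [0.1, 0.4], "dim1": [0.7, 0.2]},
    index=["Halicin", "Aspirin"],
)

# Two illustrative strains standing in for the 40 from Maier et al., 2018
strains = pd.DataFrame(
    {"strain": ["Akkermansia muciniphila (NT5021)", "Strain B (illustrative)"]}
)

# Cross-join: every compound is paired with every strain, so the
# classifier scores n_compounds * n_strains rows in one pass
rows = (
    emb.reset_index()
    .rename(columns={"index": "chem_name"})
    .merge(strains, how="cross")
)
rows.index = rows["chem_name"] + ":" + rows["strain"]
```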
### Output
- **Description:** For each compound, the model predicts growth inhibition scores for 40 different bacterial strains.
- **Output format:** table
- **Shape:** `[n × 40, 2]`, where `n` is the number of chemical compounds
- **Columns:**
- `pred_id` *(str)*: Combination of the molecule name and bacterial strain (e.g., `Halicin:Akkermansia muciniphila (NT5021)`).
- `antimicrobial_predictive_probability` *(float)*: Predicted probability that the compound inhibits microbial growth.
- **Example output file:** `output/example_molecules_prediction.tsv`
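Since `pred_id` encodes both the molecule and the strain, downstream analysis typically splits it back apart. A hedged sketch of post-processing the output table, assuming the `:` separator shown above (the probability values and the 0.5 threshold are illustrative, not part of the model):

```python
import pandas as pd

# Example rows in the documented output layout; probabilities are made up
pred = pd.DataFrame(
    {
        "pred_id": [
            "Halicin:Akkermansia muciniphila (NT5021)",
            "Aspirin:Akkermansia muciniphila (NT5021)",
        ],
        "antimicrobial_predictive_probability": [0.91, 0.07],
    }
)

# Split pred_id on the first ':' to recover molecule and strain
pred[["chem_name", "strain"]] = pred["pred_id"].str.split(":", n=1, expand=True)

# Keep only confident growth-inhibition calls (threshold is a user choice)
hits = pred[pred["antimicrobial_predictive_probability"] >= 0.5]
print(hits[["chem_name", "strain"]])
```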
## Installation
Install the conda environment with all dependencies:
```bash
# Create the conda environment called virtual-human-chc-mole
conda env create -f environment.yaml
# Activate the environment
conda activate virtual-human-chc-mole
```
## Example
### Minimal inference example
```python
import pickle

import pandas as pd
import torch
import yaml
from huggingface_hub import hf_hub_download
from mole_package import (
    dataset_representation,
    ginet_concat,
    mole_antimicrobial_prediction,
    mole_representation,
)


class MolE:
    def __init__(self, device="auto"):
        repo = "virtual-human-chc/MolE"
        self.device = "cuda:0" if device == "auto" and torch.cuda.is_available() else "cpu"

        # Download and load the pretrained GINet representation model
        with open(hf_hub_download(repo, "config.yaml")) as f:
            cfg = yaml.safe_load(f)
        self.model = ginet_concat.GINet(**cfg["model"]).to(self.device)
        self.model.load_state_dict(
            torch.load(hf_hub_download(repo, "model.pth"), map_location=self.device)
        )

        # Download and load the trained XGBoost classifier
        with open(hf_hub_download(repo, "MolE-XGBoost-08.03.2024_14.20.pkl"), "rb") as f:
            self.xgb = pickle.load(f)

    def predict_from_smiles(self, smiles_tsv):
        # Parse the input table of molecule names and SMILES strings
        smiles_df = mole_representation.read_smiles(smiles_tsv, "smiles", "chem_name")

        # Step 1: compute molecular embeddings with the pretrained GINet
        emb = dataset_representation.batch_representation(
            smiles_df, self.model, "smiles", "chem_name", device=self.device
        )

        # Step 2: pair each embedding with the 40 bacterial strains and classify
        X_input = mole_antimicrobial_prediction.add_strains(
            emb, "input/maier_screening_results.tsv.gz"
        )
        probs = self.xgb.predict_proba(X_input)[:, 1]
        return pd.DataFrame(
            {"antimicrobial_predictive_probability": probs},
            index=X_input.index,
        )


# Run inference on the example molecules
mole = MolE()
pred = mole.predict_from_smiles("input/examples_molecules.tsv")
print(pred)
```
## References
1. Roberto Olayo Alarcon et al., *MolE: Graph-based molecular representation learning for antimicrobial discovery*, [Nature Communications (2025)](https://www.nature.com/articles/s41467-025-58804-4).
2. Maier et al., *Extensive antimicrobial screening of small molecules*, *Nature* (2018), <https://www.nature.com/articles/nature25979>.
3. GitHub repository: <https://github.com/rolayoalarcon/MolE>.
4. GitHub repository: <https://github.com/rolayoalarcon/mole_antimicrobial_potential>.
5. Model weights (Zenodo DOI): <https://doi.org/10.5281/zenodo.10803099>.
## Copyright
Code derived from <https://github.com/rolayoalarcon/MolE> is licensed under the **MIT License**, © 2024 Roberto Olayo Alarcon. Model weights are licensed under **Creative Commons Attribution 4.0 International (CC BY 4.0)**, © 2024 Roberto Olayo Alarcon. Additional code © 2025 Maksim Pavlov, licensed under MIT.