---
tags:
- pytorch
- pyg
- graph-neural-networks
- machine-learning
- barlow-twins
- graph-isomorphism-network
- molecular-biology
- computational-biology
- antibiotics
- antimicrobial-discovery
- high-throughput-screening
- virtual-drug-screening
- haicu
- virtual-human-chc
library_name: pytorch
language:
- en
---

# MolE - Antimicrobial Prediction

## Short description

MolE learns task-independent molecular representations of chemicals via Graph Isomorphism Networks (GINs). Combined with an XGBoost classifier, it estimates the probability that a compound inhibits bacterial growth. The model was developed by Roberto Olayo Alarcon et al.; more information can be found in the [GitHub repository](https://github.com/rolayoalarcon/MolE) and the [accompanying paper](https://www.nature.com/articles/s41467-025-58804-4).

## Model versions

**MolE Antimicrobial Prediction** (March 2024): Pretrained representation model + XGBoost classifier trained on antimicrobial screening data (Maier et al., 2018).

## Long description

**MolE** is a non-contrastive self-supervised Graph Neural Network (GNN) framework that leverages unlabeled chemical structures to learn task-independent molecular representations. By combining a pre-trained MolE representation with experimentally validated compound-bacteria activity data, the project builds an antimicrobial prediction model that re-discovers recently reported growth-inhibitory compounds that are structurally distinct from current antibiotics. Using the model as a compound prioritization strategy, three human-targeted drugs were identified and experimentally confirmed as growth inhibitors of *Staphylococcus aureus*, highlighting MolE's potential to accelerate the discovery of new antibiotics.
## Metadata

### Input

- **Description:** Table of chemicals with their SMILES representations.
- **Input format:** tab-separated table (TSV)
  - **Shape:** `[n, 2]`, where `n` is the number of chemical compounds
  - **Columns:**
    - `chem_name` *(str)*: Name of the molecule (e.g., *Halicin*).
    - `smiles` *(str)*: SMILES representation of the molecule (e.g., `C1=C(SC(=N1)SC2=NN=C(S2)N)[N+](=O)[O-]`).
- **Example input file:** `input/examples_molecules.tsv`
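
For illustration, an input table with this layout can be written with pandas. This is a sketch: the column names and the Halicin example come from this card, while the output file name is illustrative.

```python
import pandas as pd

# Two-column table, one row per compound, as described above
molecules = pd.DataFrame(
    {
        "chem_name": ["Halicin"],
        "smiles": ["C1=C(SC(=N1)SC2=NN=C(S2)N)[N+](=O)[O-]"],
    }
)

# MolE expects a tab-separated file
molecules.to_csv("examples_molecules.tsv", sep="\t", index=False)
```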

### Model

- **Modality:** String representations of chemical compounds in SMILES format
- **Scale:** Per bacterial strain and chemical compound.
- **Description:** The model computes antimicrobial predictive probabilities for the **40 bacterial strains** screened in [Maier et al., 2018](https://www.nature.com/articles/nature25979), using a two-step process:
  1. Generate molecular embeddings with a pre-trained **GINet** representation model (`model.pth`, `config.yaml`).
  2. Predict antimicrobial properties with a **trained XGBoost** classifier (`MolE-XGBoost-08.03.2024_14.20.pkl`). The scores reflect the likelihood that a compound inhibits the growth of a given bacterial strain.
- **Training data:** 100,000 compounds randomly sampled from the ChemBERTa dataset for the pre-training of MolE, and data from *Maier et al., 2018* describing the effect of 1,197 marketed drugs on the growth of 40 bacterial strains for the XGBoost classifier.
- **Publication:** [Nature Communications (2025)](https://www.nature.com/articles/s41467-025-58804-4)

### Output

- **Description:** For each compound, the model predicts growth inhibition scores for 40 different bacterial strains.
- **Output format:** tab-separated table (TSV)
  - **Shape:** `[n × 40, 2]`, where `n` is the number of chemical compounds
  - **Columns:**
    - `pred_id` *(str)*: Combination of the molecule name and bacterial strain (e.g., `Halicin:Akkermansia muciniphila (NT5021)`).
    - `antimicrobial_predictive_probability` *(float)*: Predicted probability that the compound inhibits microbial growth.
- **Example output file:** `output/example_molecules_prediction.tsv`
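
Since `pred_id` joins the compound and strain with a colon, the output table can be split back into separate columns for downstream ranking. A sketch, using a small stand-in table with the columns described above (the probability values are illustrative):

```python
import pandas as pd

# Minimal stand-in for the prediction table described above
pred = pd.DataFrame(
    {
        "pred_id": [
            "Halicin:Akkermansia muciniphila (NT5021)",
            "Halicin:Bacteroides fragilis (NT5033)",
        ],
        "antimicrobial_predictive_probability": [0.91, 0.12],
    }
)

# Split "compound:strain" into separate columns; strain names can contain
# spaces and parentheses, so split only on the first colon
pred[["chem_name", "strain"]] = pred["pred_id"].str.split(":", n=1, expand=True)

# Rank compound-strain pairs by predicted inhibition probability
top_hits = pred.sort_values("antimicrobial_predictive_probability", ascending=False)
print(top_hits[["chem_name", "strain", "antimicrobial_predictive_probability"]])
```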

## Installation

Install the conda environment with all dependencies:

```bash
# Create the conda environment called virtual-human-chc-mole
conda env create -f environment.yaml

# Activate the environment
conda activate virtual-human-chc-mole
```

## Example

### Minimal inference example

```python
import pickle

import pandas as pd
import torch
import yaml
from huggingface_hub import hf_hub_download
from mole_package import (
    dataset_representation,
    ginet_concat,
    mole_antimicrobial_prediction,
    mole_representation,
)


class MolE:
    def __init__(self, device="auto"):
        repo = "virtual-human-chc/MolE"
        self.device = "cuda:0" if device == "auto" and torch.cuda.is_available() else "cpu"

        # Download the pretrained GINet representation model and the XGBoost classifier
        with open(hf_hub_download(repo, "config.yaml")) as fh:
            cfg = yaml.safe_load(fh)
        self.model = ginet_concat.GINet(**cfg["model"]).to(self.device)
        self.model.load_state_dict(torch.load(hf_hub_download(repo, "model.pth"), map_location=self.device))
        self.model.eval()  # inference mode
        with open(hf_hub_download(repo, "MolE-XGBoost-08.03.2024_14.20.pkl"), "rb") as fh:
            self.xgb = pickle.load(fh)

    def predict_from_smiles(self, smiles_tsv):
        # Read the input table and embed each molecule with the GINet model
        smiles_df = mole_representation.read_smiles(smiles_tsv, "smiles", "chem_name")
        emb = dataset_representation.batch_representation(smiles_df, self.model, "smiles", "chem_name", device=self.device)

        # Pair each embedding with the screened strains, then score with XGBoost
        X_input = mole_antimicrobial_prediction.add_strains(emb, "input/maier_screening_results.tsv.gz")
        probs = self.xgb.predict_proba(X_input)[:, 1]
        return pd.DataFrame(
            {"antimicrobial_predictive_probability": probs},
            index=X_input.index,
        )


# Run inference
mole = MolE()
pred = mole.predict_from_smiles("input/examples_molecules.tsv")
print(pred)
```

## References

1. Roberto Olayo Alarcon et al., *MolE: Graph-based molecular representation learning for antimicrobial discovery*, [Nature Communications (2025)](https://www.nature.com/articles/s41467-025-58804-4).
2. Maier et al., *Extensive antimicrobial screening of small molecules*, *Nature* (2018), <https://www.nature.com/articles/nature25979>.
3. GitHub repository (MolE): <https://github.com/rolayoalarcon/MolE>.
4. GitHub repository (antimicrobial potential): <https://github.com/rolayoalarcon/mole_antimicrobial_potential>.
5. Model weights (Zenodo DOI): <https://doi.org/10.5281/zenodo.10803099>.

## Copyright

Code derived from <https://github.com/rolayoalarcon/MolE> is licensed under the **MIT License**, © 2024 Roberto Olayo Alarcon. Model weights are licensed under **Creative Commons Attribution 4.0 International (CC BY 4.0)**, © 2024 Roberto Olayo Alarcon. Additional code © 2025 Maksim Pavlov, licensed under MIT.