---
tags:
- pytorch
- pyg
- graph-neural-networks
- machine-learning
- barlow-twins
- graph-isomorphism-network
- molecular-biology
- computational-biology
- antibiotics
- antimicrobial-discovery
- high-throughput-screening
- virtual-drug-screening
- haicu
- virtual-human-chc
library_name: pytorch
language:
- en
---
# MolE - Antimicrobial Prediction
## Short description
MolE learns task-independent molecular representations of chemicals via Graph Isomorphism Networks (GINs). Combined with an XGBoost classifier, it estimates the probability that a compound inhibits bacterial growth. The model was developed by Roberto Olayo Alarcon et al.; more information can be found in the [GitHub repository](https://github.com/rolayoalarcon/MolE) and the [accompanying paper](https://www.nature.com/articles/s41467-025-58804-4).
## Model versions
**MolE Antimicrobial Prediction** (March 2024): Pretrained representation model + XGBoost classifier trained on antimicrobial screening data (Maier et al., 2018).
## Long description
**MolE** is a non-contrastive self-supervised Graph Neural Network (GNN) framework that leverages unlabeled chemical structures to learn task-independent molecular representations. By combining a pre-trained MolE representation with experimentally validated compound-bacteria activity data, the project builds an antimicrobial prediction model that re-discovers recently reported growth-inhibitory compounds structurally distinct from current antibiotics. Used as a compound prioritization strategy, the model identified three human-targeted drugs that were experimentally confirmed as growth inhibitors of *Staphylococcus aureus*, highlighting MolE's potential to accelerate the discovery of new antibiotics.
## Metadata
### Input
- **Description:** Table of chemicals with their SMILES representations.
- **Input format:**
- **Shape:** `[n, 2]`, where `n` is the number of chemical compounds
- **Columns:**
  - `chem_name` *(str)*: Name of the molecule (e.g., *Halicin*).
  - `smiles` *(str)*: SMILES representation of the molecule (e.g., `C1=C(SC(=N1)SC2=NN=C(S2)N)[N+](=O)[O-]`).
- **Example input file:** `input/examples_molecules.tsv`
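For reference, here is a minimal sketch of assembling such an input table with pandas (the output path is illustrative; the column names follow the specification above):
``` python
import pandas as pd

# Build an input table matching the [n, 2] specification above.
molecules = pd.DataFrame(
    {
        "chem_name": ["Halicin"],
        "smiles": ["C1=C(SC(=N1)SC2=NN=C(S2)N)[N+](=O)[O-]"],
    }
)

# The example input file is tab-separated, so write the table as TSV.
molecules.to_csv("input/my_molecules.tsv", sep="\t", index=False)
```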
### Model
- **Modality:** String representations of chemical compounds in SMILES format
- **Scale:** Per bacterial strain and chemical compound.
- **Description:**
- The model computes antimicrobial predictive probabilities for **40 bacterial strains** contained in [Maier et al., 2018](https://www.nature.com/articles/nature25979), using a two-step process:
1. Generate molecular embeddings with a pre-trained **GINet** representation model (`model.pth`, `config.yaml`).
2. Predict antimicrobial properties with a **trained XGBoost** classifier (`MolE-XGBoost-08.03.2024_14.20.pkl`). The scores reflect the likelihood that a compound inhibits the growth of a given bacterial strain.
- **Training data:** 100,000 compounds randomly sampled from the ChemBERTa pretraining dataset for MolE pretraining; data from *Maier et al., 2018* on the effect of 1,197 marketed drugs on the growth of 40 bacterial strains for the XGBoost classifier.
- **Publication:** [Nature Communications (2025)](https://www.nature.com/articles/s41467-025-58804-4)
### Output
- **Description:** For each compound, the model predicts growth inhibition scores for 40 different bacterial strains.
- **Output format:** table
- **Shape:** `[n × 40, 2]`, where `n` is the number of chemical compounds
- **Columns:**
- `pred_id` *(str)*: Combination of the molecule name and bacterial strain (e.g., `Halicin:Akkermansia muciniphila (NT5021)`).
- `antimicrobial_predictive_probability` *(float)*: Predicted probability that the compound inhibits microbial growth.
- **Example output file:** `output/example_molecules_prediction.tsv`
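Since `pred_id` encodes both the molecule and the strain, the long-format output can be reshaped into an `n × 40` matrix. A minimal sketch, assuming the `name:strain` separator shown above and a file in the documented output format:
``` python
import pandas as pd

# Load a prediction table in the documented output format.
pred = pd.read_csv("output/example_molecules_prediction.tsv", sep="\t")

# Split "pred_id" ("<molecule>:<strain>") at the first colon.
pred[["chem_name", "strain"]] = pred["pred_id"].str.split(":", n=1, expand=True)

# Pivot to one row per compound and one column per bacterial strain.
matrix = pred.pivot(
    index="chem_name",
    columns="strain",
    values="antimicrobial_predictive_probability",
)
print(matrix.shape)  # (n, 40)
```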
## Installation
Install the conda environment with all dependencies:
``` bash
# Create the conda environment called virtual-human-chc-mole
conda env create -f environment.yaml
# Activate the environment
conda activate virtual-human-chc-mole
```
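After activation, a quick sanity check that the core dependencies resolve (the package set is assumed from the repository tags and `environment.yaml`):
``` python
# Sanity check: core dependencies implied by the repository tags.
# (Assumes environment.yaml pins torch, torch_geometric, and xgboost.)
import torch
import torch_geometric
import xgboost

print(torch.__version__, torch_geometric.__version__, xgboost.__version__)
```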
## Example
### Minimal inference example
``` python
import pickle

import pandas as pd
import torch
import yaml
from huggingface_hub import hf_hub_download
from mole_package import (
    dataset_representation,
    ginet_concat,
    mole_antimicrobial_prediction,
    mole_representation,
)


class MolE:
    def __init__(self, device="auto"):
        repo = "virtual-human-chc/MolE"
        if device == "auto":
            self.device = "cuda:0" if torch.cuda.is_available() else "cpu"
        else:
            self.device = device

        # Download the pretrained GINet representation model and load its weights.
        with open(hf_hub_download(repo, "config.yaml")) as f:
            cfg = yaml.safe_load(f)
        self.model = ginet_concat.GINet(**cfg["model"]).to(self.device)
        self.model.load_state_dict(
            torch.load(hf_hub_download(repo, "model.pth"), map_location=self.device)
        )
        self.model.eval()  # inference mode for the representation model

        # Download and load the trained XGBoost classifier.
        with open(hf_hub_download(repo, "MolE-XGBoost-08.03.2024_14.20.pkl"), "rb") as f:
            self.xgb = pickle.load(f)

    def predict_from_smiles(self, smiles_tsv):
        # Read the input table and embed each molecule with the GINet model.
        smiles_df = mole_representation.read_smiles(smiles_tsv, "smiles", "chem_name")
        emb = dataset_representation.batch_representation(
            smiles_df, self.model, "smiles", "chem_name", device=self.device
        )

        # Pair each embedding with the 40 bacterial strains from the screen.
        X_input = mole_antimicrobial_prediction.add_strains(
            emb, "input/maier_screening_results.tsv.gz"
        )

        # Probability of the positive (growth-inhibiting) class.
        probs = self.xgb.predict_proba(X_input)[:, 1]
        return pd.DataFrame(
            {"antimicrobial_predictive_probability": probs},
            index=X_input.index,
        )


# Run inference on the example molecules.
mole = MolE()
pred = mole.predict_from_smiles("input/examples_molecules.tsv")
print(pred)
```
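To persist the predictions in the documented output format, or to shortlist candidates, the returned frame can be written to TSV and filtered. A small usage sketch (the output path and the 0.9 cutoff are illustrative choices, not from the paper):
``` python
# Save predictions as TSV, mirroring output/example_molecules_prediction.tsv
# (assumes the frame's index carries the pred_id values).
pred.rename_axis("pred_id").reset_index().to_csv(
    "output/my_molecules_prediction.tsv", sep="\t", index=False
)

# Shortlist compound-strain pairs with a high predicted probability.
hits = pred[pred["antimicrobial_predictive_probability"] > 0.9]
print(hits.sort_values("antimicrobial_predictive_probability", ascending=False))
```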
## References
1. Roberto Olayo Alarcon et al., *Pre-trained molecular representations enable antimicrobial discovery*, [Nature Communications (2025)](https://www.nature.com/articles/s41467-025-58804-4).
2. Maier et al., *Extensive impact of non-antibiotic drugs on human gut bacteria*, *Nature* (2018), <https://www.nature.com/articles/nature25979>.
3. MolE representation learning code: <https://github.com/rolayoalarcon/MolE>.
4. Antimicrobial prediction code: <https://github.com/rolayoalarcon/mole_antimicrobial_potential>.
5. Model weights (Zenodo DOI): <https://doi.org/10.5281/zenodo.10803099>.
## Copyright
Code derived from <https://github.com/rolayoalarcon/MolE> is licensed under the **MIT License**, © 2024 Roberto Olayo Alarcon. Model weights are licensed under **Creative Commons Attribution 4.0 International (CC BY 4.0)**, © 2024 Roberto Olayo Alarcon. Additional code © 2025 Maksim Pavlov, licensed under MIT.