---
tags:
- pytorch
- pyg
- graph-neural-networks
- machine-learning
- barlow-twins
- graph-isomorphism-network
- molecular-biology
- computational-biology
- antibiotics
- antimicrobial-discovery
- high-throughput-screening
- virtual-drug-screening
- haicu
- virtual-human-chc
library_name: pytorch
language:
- en
---

# MolE - Antimicrobial Prediction

## Short description

MolE learns task-independent molecular representations of chemicals via Graph Isomorphism Networks (GINs). Combined with an XGBoost classifier, it estimates the probability that a compound inhibits bacterial growth. The model was developed by Roberto Olayo Alarcon et al.; more information can be found in the [GitHub repository](https://github.com/rolayoalarcon/MolE) and the [accompanying paper](https://www.nature.com/articles/s41467-025-58804-4).

## Model versions

**MolE Antimicrobial Prediction** (March 2024): Pretrained representation model + XGBoost classifier trained on antimicrobial screening data (Maier et al., 2018).

## Long description

**MolE** is a non-contrastive self-supervised Graph Neural Network (GNN) framework that leverages unlabeled chemical structures to learn task-independent molecular representations. By combining a pre-trained MolE representation with experimentally validated compound-bacteria activity data, the project builds an antimicrobial prediction model that re-discovers recently reported growth-inhibitory compounds that are structurally distinct from current antibiotics. Using the model as a compound prioritization strategy, three human-targeted drugs were identified and experimentally confirmed as growth inhibitors of *Staphylococcus aureus*, highlighting MolE's potential to accelerate the discovery of new antibiotics.

## Metadata

### Input

- **Description:** Table of chemicals with their SMILES representations.
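As an illustration, such an input table can be assembled and saved with pandas. The SMILES string is the Halicin example used in this card; the output filename `my_molecules.tsv` is an arbitrary choice:

```python
import pandas as pd

# Build a minimal input table: one row per compound,
# with its name and its SMILES string.
molecules = pd.DataFrame(
    {
        "chem_name": ["Halicin"],
        "smiles": ["C1=C(SC(=N1)SC2=NN=C(S2)N)[N+](=O)[O-]"],
    }
)

# MolE reads a tab-separated file with these two columns.
molecules.to_csv("my_molecules.tsv", sep="\t", index=False)
```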
- **Input format:**
  - **Shape:** `[n, 2]`, where `n` is the number of chemical compounds
  - **Columns:**
    - `chem_name` *(str)*: Name of the molecule (e.g., *Halicin*).
    - `smiles` *(str)*: SMILES representation of the molecule (e.g., `C1=C(SC(=N1)SC2=NN=C(S2)N)[N+](=O)[O-]`).
- **Example input file:** `input/examples_molecules.tsv`

### Model

- **Modality:** String representations of chemical compounds in SMILES format
- **Scale:** Per bacterial strain and chemical compound.
- **Description:** The model computes antimicrobial predictive probabilities for **40 bacterial strains** from [Maier et al., 2018](https://www.nature.com/articles/nature25979), using a two-step process:
  1. Generate molecular embeddings with the pre-trained **GINet** representation model (`model.pth`, `config.yaml`).
  2. Predict antimicrobial properties with the **trained XGBoost** classifier (`MolE-XGBoost-08.03.2024_14.20.pkl`).

  The scores reflect the likelihood that a compound inhibits the growth of a given bacterial strain.
- **Training data:** 100,000 randomly sampled compounds from the ChemBERTa dataset for the pretraining of MolE, and data from *Maier et al., 2018* on the influence of 1,197 marketed drugs on the growth of 40 bacterial strains for the XGBoost classifier.
- **Publication:** [Nature Communications (2025)](https://www.nature.com/articles/s41467-025-58804-4)

### Output

- **Description:** For each compound, the model predicts growth inhibition scores for 40 different bacterial strains.
- **Output format:** table
  - **Shape:** `[n × 40, 2]`, where `n` is the number of chemical compounds
  - **Columns:**
    - `pred_id` *(str)*: Combination of the molecule name and bacterial strain (e.g., `Halicin:Akkermansia muciniphila (NT5021)`).
    - `antimicrobial_predictive_probability` *(float)*: Predicted probability that the compound inhibits microbial growth.
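Because `pred_id` joins the molecule name and the strain with a colon, the long output table can be reshaped into a compound-by-strain matrix. A minimal sketch with a toy two-row table (the second strain name is illustrative, not taken from the screening data):

```python
import pandas as pd

# Toy stand-in for the model output: pred_id is "<molecule>:<strain>".
pred = pd.DataFrame(
    {
        "pred_id": [
            "Halicin:Akkermansia muciniphila (NT5021)",
            "Halicin:Bacteroides fragilis (NT5033)",
        ],
        "antimicrobial_predictive_probability": [0.91, 0.12],
    }
)

# Split pred_id on the first colon only, so strain names keep any punctuation.
pred[["chem_name", "strain"]] = pred["pred_id"].str.split(":", n=1, expand=True)

# Pivot to a wide matrix: one row per compound, one column per strain.
wide = pred.pivot(index="chem_name", columns="strain",
                  values="antimicrobial_predictive_probability")
print(wide)
```

On the real output this yields the full `[n, 40]` probability matrix, which is convenient for ranking compounds per strain.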
- **Example output file:** `output/example_molecules_prediction.tsv`

## Installation

Install the conda environment with all dependencies:

```bash
# Create the conda environment called virtual-human-chc-mole
conda env create -f environment.yaml

# Activate the environment
conda activate virtual-human-chc-mole
```

## Example

### Minimal inference example

```python
import torch
import yaml
import pickle
import pandas as pd
from huggingface_hub import hf_hub_download
from mole_package import ginet_concat, mole_antimicrobial_prediction, mole_representation, dataset_representation


class MolE:
    def __init__(self, device="auto"):
        repo = "virtual-human-chc/MolE"
        self.device = "cuda:0" if device == "auto" and torch.cuda.is_available() else "cpu"

        # Download and load the pre-trained GINet representation model
        cfg = yaml.safe_load(open(hf_hub_download(repo, "config.yaml")))
        self.model = ginet_concat.GINet(**cfg["model"]).to(self.device)
        self.model.load_state_dict(torch.load(hf_hub_download(repo, "model.pth"), map_location=self.device))

        # Load the trained XGBoost classifier
        self.xgb = pickle.load(open(hf_hub_download(repo, "MolE-XGBoost-08.03.2024_14.20.pkl"), "rb"))

    def predict_from_smiles(self, smiles_tsv):
        # Read the input table and compute molecular embeddings
        smiles_df = mole_representation.read_smiles(smiles_tsv, "smiles", "chem_name")
        emb = dataset_representation.batch_representation(
            smiles_df, self.model, "smiles", "chem_name", device=self.device
        )

        # Pair each embedding with the 40 bacterial strains and predict
        X_input = mole_antimicrobial_prediction.add_strains(
            emb, "input/maier_screening_results.tsv.gz"
        )
        probs = self.xgb.predict_proba(X_input)[:, 1]
        return pd.DataFrame(
            {"antimicrobial_predictive_probability": probs},
            index=X_input.index,
        )


# Run inference
mole = MolE()
pred = mole.predict_from_smiles("input/examples_molecules.tsv")
print(pred)
```

## References

1. Roberto Olayo Alarcon et al., *MolE: Graph-based molecular representation learning for antimicrobial discovery*, [Nature Communications (2025)](https://www.nature.com/articles/s41467-025-58804-4).
2. Maier et al., *Extensive impact of non-antibiotic drugs on human gut bacteria*, [*Nature* (2018)](https://www.nature.com/articles/nature25979).
3.
GitHub repository: [https://github.com/rolayoalarcon/MolE](https://github.com/rolayoalarcon/MolE).
4. Model weights (Zenodo DOI): .

## Copyright

Code derived from the [MolE GitHub repository](https://github.com/rolayoalarcon/MolE) is licensed under the **MIT License**, © 2024 Roberto Olayo Alarcon.

Model weights are licensed under **Creative Commons Attribution 4.0 International (CC BY 4.0)**, © 2024 Roberto Olayo Alarcon.

Additional code © 2025 Maksim Pavlov, licensed under MIT.