---
tags:
- pytorch
- pyg
- graph-neural-networks
- machine-learning
- barlow-twins
- graph-isomorphism-network
- molecular-biology
- computational-biology
- antibiotics
- antimicrobial-discovery
- high-throughput-screening
- virtual-drug-screening
- haicu
- virtual-human-chc
library_name: pytorch
language:
- en
---

# MolE - Antimicrobial Prediction

## Short description

MolE learns task-independent molecular representations of chemicals via Graph Isomorphism Networks (GINs). Combined with an XGBoost classifier, it estimates the probability that a compound inhibits bacterial growth. The model was developed by Roberto Olayo Alarcon et al.; more information can be found in the [GitHub repository](https://github.com/rolayoalarcon/MolE) and the [accompanying paper](https://www.nature.com/articles/s41467-025-58804-4).

## Model versions

**MolE Antimicrobial Prediction** (March 2024): Pretrained representation model + XGBoost classifier trained on antimicrobial screening data (Maier et al., 2018).

## Long description

**MolE** is a non-contrastive self-supervised Graph Neural Network (GNN) framework that leverages unlabeled chemical structures to learn task-independent molecular representations. By combining a pre-trained MolE representation with experimentally validated compound-bacteria activity data, the project builds an antimicrobial prediction model that re-discovers recently reported growth-inhibitory compounds that are structurally distinct from current antibiotics. Using the model as a compound prioritization strategy, three human-targeted drugs were identified and experimentally confirmed as growth inhibitors of *Staphylococcus aureus*, highlighting MolE's potential to accelerate the discovery of new antibiotics.

## Metadata

### Input

- **Description:** Table of chemicals with their SMILES representations.
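As an illustration, such an input table can be assembled and saved with pandas. The SMILES string is the Halicin example used in this card; the output filename `my_molecules.tsv` is an arbitrary choice:

```python
import pandas as pd

# Build a minimal input table: one row per compound,
# with its name and its SMILES string.
molecules = pd.DataFrame(
    {
        "chem_name": ["Halicin"],
        "smiles": ["C1=C(SC(=N1)SC2=NN=C(S2)N)[N+](=O)[O-]"],
    }
)

# MolE reads a tab-separated file with these two columns.
molecules.to_csv("my_molecules.tsv", sep="\t", index=False)
```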
- **Input format:**
  - **Shape:** `[n, 2]`, where `n` is the number of chemical compounds
  - **Columns:**
    - `chem_name` *(str)*: Name of the molecule (e.g., *Halicin*).
    - `smiles` *(str)*: SMILES representation of the molecule (e.g., `C1=C(SC(=N1)SC2=NN=C(S2)N)[N+](=O)[O-]`).
- **Example input file:** `input/examples_molecules.tsv`

### Model

- **Modality:** String representations of chemical compounds in SMILES format
- **Scale:** Per bacterial strain and chemical compound.
- **Description:** The model computes antimicrobial predictive probabilities for **40 bacterial strains** from [Maier et al., 2018](https://www.nature.com/articles/nature25979), using a two-step process:
  1. Generate molecular embeddings with the pre-trained **GINet** representation model (`model.pth`, `config.yaml`).
  2. Predict antimicrobial properties with the **trained XGBoost** classifier (`MolE-XGBoost-08.03.2024_14.20.pkl`).

  The scores reflect the likelihood that a compound inhibits the growth of a given bacterial strain.
- **Training data:** 100,000 randomly sampled compounds from the ChemBERTa dataset for the pretraining of MolE, and data from *Maier et al., 2018* on the influence of 1,197 marketed drugs on the growth of 40 bacterial strains for the XGBoost classifier.
- **Publication:** [Nature Communications (2025)](https://www.nature.com/articles/s41467-025-58804-4)

### Output

- **Description:** For each compound, the model predicts growth inhibition scores for 40 different bacterial strains.
- **Output format:** table
  - **Shape:** `[n × 40, 2]`, where `n` is the number of chemical compounds
  - **Columns:**
    - `pred_id` *(str)*: Combination of the molecule name and bacterial strain (e.g., `Halicin:Akkermansia muciniphila (NT5021)`).
    - `antimicrobial_predictive_probability` *(float)*: Predicted probability that the compound inhibits microbial growth.
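Because `pred_id` joins the molecule name and the strain with a colon, the long output table can be reshaped into a compound-by-strain matrix. A minimal sketch with a toy two-row table (the second strain name is illustrative, not taken from the screening data):

```python
import pandas as pd

# Toy stand-in for the model output: pred_id is "<molecule>:<strain>".
pred = pd.DataFrame(
    {
        "pred_id": [
            "Halicin:Akkermansia muciniphila (NT5021)",
            "Halicin:Bacteroides fragilis (NT5033)",
        ],
        "antimicrobial_predictive_probability": [0.91, 0.12],
    }
)

# Split pred_id on the first colon only, so strain names keep any punctuation.
pred[["chem_name", "strain"]] = pred["pred_id"].str.split(":", n=1, expand=True)

# Pivot to a wide matrix: one row per compound, one column per strain.
wide = pred.pivot(index="chem_name", columns="strain",
                  values="antimicrobial_predictive_probability")
print(wide)
```

On the real output this yields the full `[n, 40]` probability matrix, which is convenient for ranking compounds per strain.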
- **Example output file:** `output/example_molecules_prediction.tsv`

## Installation

Install the conda environment with all dependencies:

```bash
# Create the conda environment called virtual-human-chc-mole
conda env create -f environment.yaml

# Activate the environment
conda activate virtual-human-chc-mole
```

## Example

### Minimal inference example

```python
import torch
import yaml
import pickle
import pandas as pd
from huggingface_hub import hf_hub_download
from mole_package import ginet_concat, mole_antimicrobial_prediction, mole_representation, dataset_representation


class MolE:
    def __init__(self, device="auto"):
        repo = "virtual-human-chc/MolE"
        self.device = "cuda:0" if device == "auto" and torch.cuda.is_available() else "cpu"

        # Download and load the pre-trained GINet representation model
        cfg = yaml.safe_load(open(hf_hub_download(repo, "config.yaml")))
        self.model = ginet_concat.GINet(**cfg["model"]).to(self.device)
        self.model.load_state_dict(torch.load(hf_hub_download(repo, "model.pth"), map_location=self.device))

        # Load the trained XGBoost classifier
        self.xgb = pickle.load(open(hf_hub_download(repo, "MolE-XGBoost-08.03.2024_14.20.pkl"), "rb"))

    def predict_from_smiles(self, smiles_tsv):
        # Read the input table and compute molecular embeddings
        smiles_df = mole_representation.read_smiles(smiles_tsv, "smiles", "chem_name")
        emb = dataset_representation.batch_representation(
            smiles_df, self.model, "smiles", "chem_name", device=self.device
        )

        # Pair each embedding with the 40 bacterial strains and predict
        X_input = mole_antimicrobial_prediction.add_strains(
            emb, "input/maier_screening_results.tsv.gz"
        )
        probs = self.xgb.predict_proba(X_input)[:, 1]
        return pd.DataFrame(
            {"antimicrobial_predictive_probability": probs},
            index=X_input.index,
        )


# Run inference
mole = MolE()
pred = mole.predict_from_smiles("input/examples_molecules.tsv")
print(pred)
```

## References

1. Roberto Olayo Alarcon et al., *MolE: Graph-based molecular representation learning for antimicrobial discovery*, [Nature Communications (2025)](https://www.nature.com/articles/s41467-025-58804-4).
2. Maier et al., *Extensive impact of non-antibiotic drugs on human gut bacteria*, [*Nature* (2018)](https://www.nature.com/articles/nature25979).
3.
GitHub repository: [https://github.com/rolayoalarcon/MolE](https://github.com/rolayoalarcon/MolE).
4. Model weights (Zenodo DOI): .

## Copyright

Code derived from the [MolE GitHub repository](https://github.com/rolayoalarcon/MolE) is licensed under the **MIT License**, © 2024 Roberto Olayo Alarcon.

Model weights are licensed under **Creative Commons Attribution 4.0 International (CC BY 4.0)**, © 2024 Roberto Olayo Alarcon.

Additional code © 2025 Maksim Pavlov, licensed under MIT.