Update README.md
README.md
CHANGED
@@ -13,6 +13,7 @@ tags:
 - high-throughput-screening
 - virtual-drug-screening
 - haicu
+- virtual-human-chc
 library_name: pytorch
 language:
 - en
@@ -26,9 +27,7 @@ MolE learns task-independent molecular representations of chemicals via Graph Is
 
 ## Model versions
 
-
-- Repository: <https://github.com/rolayoalarcon/mole_antimicrobial_potential>
-- Hugging Face Hub: `virtual-human-chc/MolE`
+**MolE Antimicrobial Prediction** (March 2024): Pretrained representation model + XGBoost classifier trained on antimicrobial screening data (Maier et al., 2018).
 
 ## Long description
 
@@ -42,16 +41,17 @@ molecular embeddings from SMILES strings.
 ### Input
 
 - **Description:** Table of chemicals with their SMILES representations.
-- **
-- **
-
-
+- **Input format:**
+  - **Shape:** `[n, 2]`
+  - **Columns:**
+    - `chem_name` *(str)*: Name of the molecule (e.g., *Halicin*).
+    - `smiles` *(str)*: SMILES representation of the molecule (e.g., `C1=C(SC(=N1)SC2=NN=C(S2)N)[N+](=O)[O-]`).
 - **Example input file:** `input/examples_molecules.tsv`
 
 ### Model
 
 - **Modality:** Probability of microbial growth inhibition.
-- **Scale:**
+- **Scale:** Per bacterial strain and chemical compound.
 - **Description:**
   The model computes antimicrobial predictive probabilities for **40 bacterial strains** contained in [Maier et al., 2018](https://www.nature.com/articles/nature25979), using a two-step process:
   1. Generate molecular embeddings with a pre-trained **GINet** representation model (`model.pth`, `config.yaml`).
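The two-column input table described in this hunk can be sanity-checked in a few lines. This is an illustrative sketch only: the in-memory TSV below stands in for the real `input/examples_molecules.tsv` file, and `pandas` is an assumed dependency.

```python
import io

import pandas as pd

# A one-row stand-in for the input TSV described above: one molecule
# per row, with `chem_name` and `smiles` columns (shape [n, 2]).
tsv = (
    "chem_name\tsmiles\n"
    "Halicin\tC1=C(SC(=N1)SC2=NN=C(S2)N)[N+](=O)[O-]\n"
)
df = pd.read_csv(io.StringIO(tsv), sep="\t")
print(df.shape)  # (1, 2)
```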
@@ -132,80 +132,4 @@ print(pred)
 
 ## Copyright
 
 Code derived from <https://github.com/rolayoalarcon/MolE> is licensed under the **MIT License**, © 2024 Roberto Olayo Alarcon. Model weights are licensed under **Creative Commons Attribution 4.0 International (CC BY 4.0)**, © 2024 Roberto Olayo Alarcon. Additional code © 2025 Maksim Pavlov, licensed under MIT.
-
-<!-- # MolE - Antimicrobial Prediction
-
-This model uses MolE's pre-trained representation to train XGBoost models to predict the antimicrobial activity of compounds based on their molecular structure. The model was developed by Roberto Olayo Alarcon et al. and more information can be found in the [GitHub repository](https://github.com/rolayoalarcon/MolE) and the [accompanying paper](https://www.nature.com/articles/s41467-025-58804-4).
-
-## Files:
-
-- `model.pth` - the pre-trained representation model's weights
-- `config.yaml` - model configuration
-- `MolE-XGBoost-08.03.2024_14.20.pkl` - pretrained XGBoost model
-
-## Usage
-
-### Inference Example
-
-Below is a minimal example showing how to load and run inference with **MolE** directly from the Hugging Face Hub.
-
-Necessary dependancies:
-
-```bash
-pip install torch pyyaml pandas huggingface-hub mole_package
-```
-
-```python
-import torch
-import yaml
-import pickle
-import pandas as pd
-from huggingface_hub import hf_hub_download
-from mole_package import ginet_concat, mole_antimicrobial_prediction, mole_representation, dataset_representation
-
-class MolE:
-    def __init__(self, device='auto'):
-        repo = "virtual-human-chc/MolE"
-        self.device = "cuda:0" if device == "auto" and torch.cuda.is_available() else "cpu"
-
-        # Download + load
-        cfg = yaml.safe_load(open(hf_hub_download(repo, "config.yaml")))
-        self.model = ginet_concat.GINet(**cfg["model"]).to(self.device)
-        self.model.load_state_dict(torch.load(hf_hub_download(repo, "model.pth"), map_location=self.device))
-        self.xgb = pickle.load(open(hf_hub_download(repo, "MolE-XGBoost-08.03.2024_14.20.pkl"), "rb"))
-
-    def predict_from_smiles(self, smiles_tsv):
-        smiles_df = mole_representation.read_smiles(smiles_tsv, "smiles", "chem_name")
-        emb = dataset_representation.batch_representation(smiles_df, self.model, "smiles", "chem_name", device=self.device)
-        X_input = mole_antimicrobial_prediction.add_strains(
-            emb, "data/01.prepare_training_data/maier_screening_results.tsv.gz"
-        )
-        probs = self.xgb.predict_proba(X_input)[:, 1]
-        return pd.DataFrame(
-            {"antimicrobial_predictive_probability": probs},
-            index=X_input.index
-        )
-```
-
-### Run inference:
-
-```python
-mole = MolE()
-pred = mole.predict_from_smiles("examples/input/examples_molecules.tsv")
-print(pred)
-```
-
-## Metadata
-
-### Input
-
-The input is a TSV file with two columns: `chem_name` and `smiles`. The column 'chem_name' contains the name of the molecule from PubChem, e.g. Halicin, and the column 'smiles' contains the chemical formula in SMILES format, e.g. `C1=C(SC(=N1)SC2=NN=C(S2)N)[N+](=O)[O-]`. An example input is the file `examples\input\example_molecules.tsv`.
-
-### Output
-
-The output is a TSV file with two columns: `pred_id` and `antimicrobial_predictive_probability`. The column `pred_id` contains a given molecule and a bacteria, e.g. Halicin:Akkermansia muciniphila (NT5021), and the column `antimicrobial_predictive_probability` contains antimicrobial potential (AP) scores for
-molecule prioritization, reflecting the chance of the given molecule having growth inhibition effect on the corresponding bacteria, e.g. 0.021192694. An example output is `examples/output/example_molecules_prediction.tsv`.
-
-## Copyright
-
-Code derived from https://github.com/rolayoalarcon/MolE is licensed under the MIT license, Copyright (c) 2024 Roberto Olayo Alarcon. The [model weights](https://doi.org/10.5281/zenodo.10803099) are licensed under [Creative Commons Attribution 4.0 International](https://creativecommons.org/licenses/by/4.0/legalcode), Copyright (c) 2024 Roberto Olayo Alarcon. The other code is licensed under the MIT license, Copyright (c) 2025 Maksim Pavlov. -->
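The two-step flow described in the README (generate an embedding, then classify it) can be sketched generically. Both `embed` and `score` below are toy stand-ins, not MolE's real GINet model or XGBoost classifier, and the weights are arbitrary placeholders.

```python
import math

def embed(smiles: str, dim: int = 8) -> list[float]:
    # Toy stand-in for step 1 (GINet embedding): character-count
    # features, NOT the learned representation in model.pth.
    vec = [0.0] * dim
    for i, ch in enumerate(smiles):
        vec[i % dim] += ord(ch)
    return [v / len(smiles) for v in vec]

def score(embedding: list[float], weights: list[float]) -> float:
    # Toy stand-in for step 2 (the XGBoost classifier): a logistic
    # model mapping an embedding to an inhibition probability.
    z = sum(e * w for e, w in zip(embedding, weights))
    return 1.0 / (1.0 + math.exp(-z))

smiles = "C1=C(SC(=N1)SC2=NN=C(S2)N)[N+](=O)[O-]"  # Halicin, from the README
p = score(embed(smiles), [0.01] * 8)
print(0.0 <= p <= 1.0)  # True: one probability per (molecule, strain) pair
```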