pavm595 committed
Commit a85dd1c · verified · 1 Parent(s): eac0f5d

Update README.md

Files changed (1):
  1. README.md +9 -85

README.md CHANGED
@@ -13,6 +13,7 @@ tags:
  - high-throughput-screening
  - virtual-drug-screening
  - haicu
+ - virtual-human-chc
  library_name: pytorch
  language:
  - en
@@ -26,9 +27,7 @@ MolE learns task-independent molecular representations of chemicals via Graph Is

  ## Model versions

- - **MolE Antimicrobial Prediction** (March 2024): Pretrained representation model + XGBoost classifier trained on antimicrobial screening data (Maier et al., 2018).
-   - Repository: <https://github.com/rolayoalarcon/mole_antimicrobial_potential>
-   - Hugging Face Hub: `virtual-human-chc/MolE`
+ **MolE Antimicrobial Prediction** (March 2024): Pretrained representation model + XGBoost classifier trained on antimicrobial screening data (Maier et al., 2018).

  ## Long description

@@ -42,16 +41,17 @@ molecular embeddings from SMILES strings.
  ### Input

  - **Description:** Table of chemicals with their SMILES representations.
- - **Shape:** `[n, 2]`
- - **Data format:**
-   - `chem_name` *(str)*: Name of the molecule (e.g., *Halicin*).
-   - `smiles` *(str)*: SMILES representation of the molecule (e.g., `C1=C(SC(=N1)SC2=NN=C(S2)N)[N+](=O)[O-]`).
+ - **Input format:**
+   - **Shape:** `[n, 2]`
+   - **Columns:**
+     - `chem_name` *(str)*: Name of the molecule (e.g., *Halicin*).
+     - `smiles` *(str)*: SMILES representation of the molecule (e.g., `C1=C(SC(=N1)SC2=NN=C(S2)N)[N+](=O)[O-]`).
  - **Example input file:** `input/examples_molecules.tsv`

  ### Model

  - **Modality:** Probability of microbial growth inhibition.
- - **Scale:** Combination of bacterial strain and compound.
+ - **Scale:** Per bacterial strain and chemical compound.
  - **Description:**
    - The model computes antimicrobial predictive probabilities for **40 bacterial strains** contained in [Maier et al., 2018](https://www.nature.com/articles/nature25979), using a two-step process:
      1. Generate molecular embeddings with a pre-trained **GINet** representation model (`model.pth`, `config.yaml`).
@@ -132,80 +132,4 @@ print(pred)

  ## Copyright

- Code derived from <https://github.com/rolayoalarcon/MolE> is licensed under the **MIT License**, © 2024 Roberto Olayo Alarcon. Model weights are licensed under **Creative Commons Attribution 4.0 International (CC BY 4.0)**, © 2024 Roberto Olayo Alarcon. Additional code © 2025 Maksim Pavlov, licensed under MIT.
-
- <!-- # MolE - Antimicrobial Prediction
-
- This model uses MolE's pre-trained representation to train XGBoost models to predict the antimicrobial activity of compounds based on their molecular structure. The model was developed by Roberto Olayo Alarcon et al. and more information can be found in the [GitHub repository](https://github.com/rolayoalarcon/MolE) and the [accompanying paper](https://www.nature.com/articles/s41467-025-58804-4).
-
- ## Files:
-
- - `model.pth` - the pre-trained representation model's weights
- - `config.yaml` - model configuration
- - `MolE-XGBoost-08.03.2024_14.20.pkl` - pretrained XGBoost model
-
- ## Usage
-
- ### Inference Example
-
- Below is a minimal example showing how to load and run inference with **MolE** directly from the Hugging Face Hub.
-
- Necessary dependancies:
-
- ```bash
- pip install torch pyyaml pandas huggingface-hub mole_package
- ```
-
- ```python
- import torch
- import yaml
- import pickle
- import pandas as pd
- from huggingface_hub import hf_hub_download
- from mole_package import ginet_concat, mole_antimicrobial_prediction, mole_representation, dataset_representation
-
- class MolE:
-     def __init__(self, device='auto'):
-         repo = "virtual-human-chc/MolE"
-         self.device = "cuda:0" if device == "auto" and torch.cuda.is_available() else "cpu"
-
-         # Download + load
-         cfg = yaml.safe_load(open(hf_hub_download(repo, "config.yaml")))
-         self.model = ginet_concat.GINet(**cfg["model"]).to(self.device)
-         self.model.load_state_dict(torch.load(hf_hub_download(repo, "model.pth"), map_location=self.device))
-         self.xgb = pickle.load(open(hf_hub_download(repo, "MolE-XGBoost-08.03.2024_14.20.pkl"), "rb"))
-
-     def predict_from_smiles(self, smiles_tsv):
-         smiles_df = mole_representation.read_smiles(smiles_tsv, "smiles", "chem_name")
-         emb = dataset_representation.batch_representation(smiles_df, self.model, "smiles", "chem_name", device=self.device)
-         X_input = mole_antimicrobial_prediction.add_strains(
-             emb, "data/01.prepare_training_data/maier_screening_results.tsv.gz"
-         )
-         probs = self.xgb.predict_proba(X_input)[:, 1]
-         return pd.DataFrame(
-             {"antimicrobial_predictive_probability": probs},
-             index=X_input.index
-         )
- ```
-
- ### Run inference:
-
- ```python
- mole = MolE()
- pred = mole.predict_from_smiles("examples/input/examples_molecules.tsv")
- print(pred)
- ```
-
- ## Metadata
-
- ### Input
-
- The input is a TSV file with two columns: `chem_name` and `smiles`. The column 'chem_name' contains the name of the molecule from PubChem, e.g. Halicin, and the column 'smiles' contains the chemical formula in SMILES format, e.g. `C1=C(SC(=N1)SC2=NN=C(S2)N)[N+](=O)[O-]`. An example input is the file `examples\input\example_molecules.tsv`.
-
- ### Output
-
- The output is a TSV file with two columns: `pred_id` and `antimicrobial_predictive_probability`. The column `pred_id` contains a given molecule and a bacteria, e.g. Halicin:Akkermansia muciniphila (NT5021), and the column `antimicrobial_predictive_probability` contains antimicrobial potential (AP) scores for molecule prioritization, reflecting the chance of the given molecule having growth inhibition effect on the corresponding bacteria, e.g. 0.021192694. An example output is `examples/output/example_molecules_prediction.tsv`.
-
- ## Copyright
-
- Code derived from https://github.com/rolayoalarcon/MolE is licensed under the MIT license, Copyright (c) 2024 Roberto Olayo Alarcon. The [model weights](https://doi.org/10.5281/zenodo.10803099) are licensed under [Creative Commons Attribution 4.0 International](https://creativecommons.org/licenses/by/4.0/legalcode), Copyright (c) 2024 Roberto Olayo Alarcon. The other code is licensed under the MIT license, Copyright (c) 2025 Maksim Pavlov. -->
+ Code derived from <https://github.com/rolayoalarcon/MolE> is licensed under the **MIT License**, © 2024 Roberto Olayo Alarcon. Model weights are licensed under **Creative Commons Attribution 4.0 International (CC BY 4.0)**, © 2024 Roberto Olayo Alarcon. Additional code © 2025 Maksim Pavlov, licensed under MIT.
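
The Input section added by this commit specifies an `[n, 2]` TSV with `chem_name` and `smiles` columns. A minimal sketch of parsing and validating such a table with only the Python standard library is below; the `EXAMPLE_TSV` content and the `validate_input_table` helper are illustrative, not part of the MolE repository.

```python
import csv
import io

# Illustrative [n, 2] input table mirroring the README's Input spec;
# the Halicin row reuses the example values from the README.
EXAMPLE_TSV = (
    "chem_name\tsmiles\n"
    "Halicin\tC1=C(SC(=N1)SC2=NN=C(S2)N)[N+](=O)[O-]\n"
)

def validate_input_table(tsv_text: str) -> list[dict]:
    """Parse a TSV and check it has exactly the chem_name/smiles columns."""
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    if not rows or any(set(row) != {"chem_name", "smiles"} for row in rows):
        raise ValueError("expected an [n, 2] table with chem_name and smiles columns")
    return rows

molecules = validate_input_table(EXAMPLE_TSV)
```

The same check would apply unchanged to a file such as `input/examples_molecules.tsv` read from disk.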
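
The two-step process described in the Model section (GINet embeddings, then per-strain probabilities) can be sketched in shape as follows. `toy_embed`, `toy_classifier`, and the strain list are placeholders for the pretrained GINet and XGBoost components, not the real MolE code; only the `name:strain` keying of predictions follows the README's `pred_id` convention.

```python
import hashlib

# Placeholder strain list; the real model covers 40 strains from Maier et al., 2018.
STRAINS = ["Akkermansia muciniphila (NT5021)", "Escherichia coli"]

def toy_embed(smiles: str, dim: int = 8) -> list[float]:
    """Step 1 stand-in: a deterministic pseudo-embedding of a SMILES string."""
    digest = hashlib.sha256(smiles.encode()).digest()
    return [b / 255 for b in digest[:dim]]

def toy_classifier(embedding: list[float]) -> float:
    """Step 2 stand-in: map an embedding to a probability in [0, 1]."""
    return sum(embedding) / len(embedding)

def predict(chem_name: str, smiles: str) -> dict[str, float]:
    """One probability per (compound, strain) pair, keyed like the README's pred_id."""
    emb = toy_embed(smiles)
    return {f"{chem_name}:{strain}": toy_classifier(emb) for strain in STRAINS}

preds = predict("Halicin", "C1=C(SC(=N1)SC2=NN=C(S2)N)[N+](=O)[O-]")
```

The real pipeline replaces both stand-ins with the downloaded `model.pth`/`config.yaml` GINet and the `MolE-XGBoost-08.03.2024_14.20.pkl` classifier, but the input/output shape (SMILES in, one probability per strain out) is the same.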