pavm595 committed
Commit 7df6373 · verified · 1 Parent(s): afda5c3

Update README.md

Files changed (1)
  1. README.md +26 -35
README.md CHANGED
@@ -32,52 +32,43 @@ MolE learns task-independent molecular representations of chemicals via Graph Is
32
 
33
  ## Long description
34
 
35
- MolE integrates molecular graph-based representation learning with gradient-boosted decision trees for predicting antimicrobial potential. The approach involves: 1. **Representation learning:** A graph neural network (GINet) trained on 100,000 randomly sampled compounds to derive
36
- molecular embeddings from SMILES strings. 2. **Prediction:** These embeddings are used as input to an **XGBoost** model that predicts antimicrobial activity scores across 40 bacterial strains, based on data from *Maier et al., 2018*. The model was developed by **Roberto Olayo Alarcon et al.**.
37
- Further information is available in the [paper](https://www.nature.com/articles/s41467-025-58804-4).
 
38
 
39
  ## Metadata
40
 
41
  ### Input
42
 
43
- - **Description:** Table of chemicals with their SMILES
44
- representations.
45
- - **Shape:** `[n, 2]`
46
- - **Data format:**
47
- - `chem_name` *(str)*: Name of the molecule (e.g., *Halicin*).
48
- - `smiles` *(str)*: SMILES representation of the molecule (e.g.,
49
- `C1=C(SC(=N1)SC2=NN=C(S2)N)[N+](=O)[O-]`).
50
- - **Example input file:** `examples/input/examples_molecules.tsv`
51
-
52
 
53
  ### Model
54
 
55
- - **Modality:** Probability of microbial growth inhibition.
56
- - **Scale:** Combination of bacterial strain and compound.
57
- - **Description:**
58
- - The model computes antimicrobial predictive probabilities for **40 bacterial strains** contained in [Maier et al., 2018](https://www.nature.com/articles/nature25979), using a two-step process:
59
- 1. Generate molecular embeddings with a pre-trained **GINet**
60
- representation model (`model.pth`, `config.yaml`).
61
- 2. Predict antimicrobial properties with a **trained XGBoost**
62
- classifier (`MolE-XGBoost-08.03.2024_14.20.pkl`).
63
- - The scores reflect the likelihood that a compound inhibits the
64
- growth of a given bacterial strain.
65
- - **Training data:** 100000 randomly sampled compounds from ChemBERTa for the pretraining of MolE and data from *Maier et al., 2018* containing the influence of 1197 marketed drugs on the growth of 40 bacterial strains for the XGBoost classifier
66
- - **Publication:** [Nature Communications (2025)](https://www.nature.com/articles/s41467-025-58804-4)
67
 
68
  ### Output
69
 
70
- - **Description:** For each compound, the model predicts growth inhibition scores for 40 different bacterial strains.
71
- - **Output format:** table
72
- - **Shape:** `[n × 40, 2]`
73
- - **Columns**
74
- - `pred_id` *(str)*: Combination of the molecule name and
75
- bacterial strain (e.g.,
76
- `Halicin:Akkermansia muciniphila (NT5021)`).
77
- - `antimicrobial_predictive_probability` *(float)*: Predicted
78
- probability that the compound inhibits microbial growth.
79
- - **Example output file:**
80
- `examples/output/example_molecules_prediction.tsv`
81
 
82
  ## Installation
83
 
 
32
 
33
  ## Long description
34
 
35
+ MolE integrates molecular graph-based representation learning with gradient-boosted decision trees for predicting antimicrobial potential. The approach involves:
36
+ 1. **Representation learning:** A graph neural network (GINet) trained on 100,000 randomly sampled compounds to derive
37
+ molecular embeddings from SMILES strings.
38
+ 2. **Prediction:** These embeddings are used as input to an **XGBoost** model that predicts antimicrobial activity scores across 40 bacterial strains, based on data from *Maier et al., 2018*. The model was developed by **Roberto Olayo Alarcon et al.** Further information is available in the [paper](https://www.nature.com/articles/s41467-025-58804-4).
39
 
40
  ## Metadata
41
 
42
  ### Input
43
 
44
+ - **Description:** Table of chemicals with their SMILES representations.
45
+ - **Shape:** `[n, 2]`
46
+ - **Data format:**
47
+ - `chem_name` *(str)*: Name of the molecule (e.g., *Halicin*).
48
+ - `smiles` *(str)*: SMILES representation of the molecule (e.g., `C1=C(SC(=N1)SC2=NN=C(S2)N)[N+](=O)[O-]`).
49
+ - **Example input file:** `examples/input/examples_molecules.tsv`
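
As an illustration of the input format described in the lines above, the following minimal Python sketch writes a two-column table with the `chem_name` and `smiles` fields. It assumes the file is tab-separated with a header row, which the `.tsv` extension suggests but the README does not state. `write_input_tsv` is a hypothetical helper and is not part of the MolE repository.

```python
import csv
import io

# Column names taken from the README's "Data format" list.
INPUT_COLUMNS = ["chem_name", "smiles"]


def write_input_tsv(molecules, handle):
    """Write (name, SMILES) pairs as a tab-separated table.

    Assumption: the model expects a header row followed by one row per
    molecule; check the example input file before relying on this layout.
    """
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(INPUT_COLUMNS)
    for name, smiles in molecules:
        writer.writerow([name, smiles])


# Build the table in memory using the Halicin example from the README.
buffer = io.StringIO()
write_input_tsv(
    [("Halicin", "C1=C(SC(=N1)SC2=NN=C(S2)N)[N+](=O)[O-]")],
    buffer,
)
tsv_text = buffer.getvalue()
```

To write to disk instead, pass a handle opened with `open(path, "w", newline="")` in place of the in-memory buffer.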
 
 
 
50
 
51
  ### Model
52
 
53
+ - **Modality:** Probability of microbial growth inhibition.
54
+ - **Scale:** Combination of bacterial strain and compound.
55
+ - **Description:**
56
+ - The model computes antimicrobial predictive probabilities for **40 bacterial strains** contained in [Maier et al., 2018](https://www.nature.com/articles/nature25979), using a two-step process:
57
+ 1. Generate molecular embeddings with a pre-trained **GINet** representation model (`model.pth`, `config.yaml`).
58
+ 2. Predict antimicrobial properties with a **trained XGBoost** classifier (`MolE-XGBoost-08.03.2024_14.20.pkl`).
59
+ - The scores reflect the likelihood that a compound inhibits the growth of a given bacterial strain.
60
+ - **Training data:** 100,000 randomly sampled compounds from ChemBERTa for pretraining MolE, and data from *Maier et al., 2018* on the effect of 1,197 marketed drugs on the growth of 40 bacterial strains for training the XGBoost classifier.
61
+ - **Publication:** [Nature Communications (2025)](https://www.nature.com/articles/s41467-025-58804-4)
 
 
 
62
 
63
  ### Output
64
 
65
+ - **Description:** For each compound, the model predicts growth inhibition scores for 40 different bacterial strains.
66
+ - **Output format:** table
67
+ - **Shape:** `[n × 40, 2]`
68
+ - **Columns:**
69
+ - `pred_id` *(str)*: Combination of the molecule name and bacterial strain (e.g., `Halicin:Akkermansia muciniphila (NT5021)`).
70
+ - `antimicrobial_predictive_probability` *(float)*: Predicted probability that the compound inhibits microbial growth.
71
+ - **Example output file:** `examples/output/example_molecules_prediction.tsv`
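
The output layout described in the lines above (one row per compound and strain pair, keyed by `pred_id`) can be unpacked with a short Python sketch like the one below. It assumes that the molecule name and the strain are joined by the first colon in `pred_id`, as in the README's example, and that molecule names contain no colon themselves. `split_pred_id` and `expected_rows` are hypothetical helpers, not part of the MolE repository.

```python
# Number of bacterial strains stated in the README.
N_STRAINS = 40


def split_pred_id(pred_id):
    """Split a pred_id into (molecule name, strain).

    Assumption: the separator is the first colon, so molecule names
    must not contain a colon.
    """
    name, sep, strain = pred_id.partition(":")
    if not sep:
        raise ValueError(f"no ':' separator in pred_id: {pred_id!r}")
    return name, strain


def expected_rows(n_compounds):
    """Row count implied by the documented output shape [n x 40, 2]."""
    return n_compounds * N_STRAINS


name, strain = split_pred_id("Halicin:Akkermansia muciniphila (NT5021)")
```

Comparing `expected_rows(n)` with the number of rows in a prediction file is a quick sanity check that every compound received a score for every strain.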
 
 
 
 
72
 
73
  ## Installation
74