anoushka2000 committed on
Commit
9f072eb
·
verified ·
1 Parent(s): 21201da

make instruction on org card more explicit

Files changed (1)
  1. README.md +93 -82
README.md CHANGED
@@ -27,7 +27,39 @@ The models were pre-trained on SMILES strings from the [Enamine REAL Space](http
27
  - **Tokenization**: [``Smirk``](https://eeg.engin.umich.edu/smirk/) tokenizer
28
 
29
 
30
- ### Quick Start
31
 
32
  ```python
33
  from transformers import AutoModel
@@ -35,7 +67,7 @@ from smirk import SmirkTokenizerFast
35
 
36
  # Load the model
37
  model = AutoModel.from_pretrained(
38
- "path/to/model",
39
  trust_remote_code=True
40
  )
41
 
@@ -48,82 +80,59 @@ smiles_batch = [
48
  results = model.predict(smiles_batch)
49
  ```
50
 
51
- ### Setting Up Your Environment
52
-
53
- Create a virtual environment and install dependencies:
54
-
55
- ```bash
56
- python -m venv .venv
57
- source .venv/bin/activate # On Windows: .venv\Scripts\activate
58
- pip install -r requirements.txt
59
- ```
60
-
61
- > **Note**: SMIRK tokenizers require Rust to be installed. See the [Rust installation guide](https://www.rust-lang.org/tools/install) for details.
62
-
63
-
64
- ## Model Inputs and Outputs
65
-
66
- ### Inputs
67
- - **SMILES strings**: Standard SMILES notation for molecular structures
68
- - **Batch size**: Variable, automatically padded during inference
69
-
70
- ### Outputs
71
- - **Predictions**: Task-specific numerical or categorical predictions
72
- - **Format**: Dictionary with channel names and predicted values (if channels are configured), or raw tensor output
73
-
74
-
75
  ## Provided Models
76
 
77
  ### Pre-trained
78
- - `mist-1.8B-dh61satt`: Flagship MIST model (MIST-1.8B)
79
- - `mist-28M-ti624ev1`: Smaller MIST model (MIST-28M).
80
 
81
  Below is a full list of finetuned variants hosted on HuggingFace:
82
  ### MoleculeNet Benchmark Models
83
 
84
- | Folder | Encoder | Dataset |
85
- | ---------------------------- | :------: | ------------------------------------ |
86
- | mist-1.8B-fbdn8e35-bbbp | MIST-1.8B| MoleculeNet BBBP |
87
- | mist-1.8B-1a4puhg2-hiv | MIST-1.8B| MoleculeNet HIV |
88
- | mist-1.8B-m50jgolp-bace | MIST-1.8B| MoleculeNet BACE |
89
- | mist-1.8B-uop1z0dc-tox21 | MIST-1.8B| MoleculeNet Tox21 |
90
- | mist-1.8B-lu1l5ieh-clintox | MIST-1.8B| MoleculeNet ClinTox |
91
- | mist-1.8B-l1wfo7oa-sider * | MIST-1.8B| MoleculeNet SIDER |
92
- | mist-1.8B-hxiygjsm-esol * | MIST-1.8B| MoleculeNet ESOL |
93
- | mist-1.8B-iwqj2cld-freesolv | MIST-1.8B| MoleculeNet FreeSolv |
94
- | mist-1.8B-jvt4azpz-lipo | MIST-1.8B| MoleculeNet Lipophilicity |
95
- | mist-1.8B-8nd1ot5j-qm8 | MIST-1.8B| MoleculeNet QM8 |
96
- | mist-28M-3xpfhv48-bbbp | MIST-28M | MoleculeNet BBBP |
97
- | mist-28M-8fh43gke-hiv | MIST-28M | MoleculeNet HIV |
98
- | mist-28M-8loj3bab-bace | MIST-28M | MoleculeNet BACE |
99
- | mist-28M-kw4ks27p-tox21 | MIST-28M | MoleculeNet Tox21 |
100
- | mist-28M-97vfcykk-clintox | MIST-28M | MoleculeNet ClinTox |
101
- | mist-28M-z8qo16uy-sider | MIST-28M | MoleculeNet SIDER |
102
- | mist-28M-kcwb9le5-esol | MIST-28M | MoleculeNet ESOL |
103
- | mist-28M-0uiq7o7m-freesolv * | MIST-28M | MoleculeNet FreeSolv |
104
- | mist-28M-xzr5ulva-lipo | MIST-28M | MoleculeNet Lipophilicity |
105
- | mist-28M-gzwqzpcr-qm8 | MIST-28M | MoleculeNet QM8 |
106
- | mist-26.9M-kkgx0omx-qm9 | MIST-28M | MoleculeNet QM9 |
 
107
 
108
  `*` Indicates models currently not available on Hugging Face due to storage limits
109
 
110
  #### QM9 Benchmark Models
111
  The single target (MIST-1.8B encoder) models for properties in QM9 are available.
112
 
113
- | Folder | Encoder | Target |
114
- | ---------------------------- | :------: | ----------------------------------------------------------------- |
115
- | mist-1.8B-ez05expv-mu | MIST-1.8B| μ - Dipole moment (unit: D) |
116
- | mist-1.8B-rcwary93-alpha * | MIST-1.8B| α - Isotropic polarizability (unit: Bohr^3) |
117
- | mist-1.8B-jmjosq12-homo * | MIST-1.8B| HOMO - Highest occupied molecular orbital energy (unit: Hartree) |
118
- | mist-1.8B-n14wshc9-lumo * | MIST-1.8B| LUMO - Lowest unoccupied molecular orbital energy (unit: Hartree) |
119
- | mist-1.8B-kayun6v3-gap * | MIST-1.8B| Gap - Gap between HOMO and LUMO (unit: Hartree) |
120
- | mist-1.8B-xxe7t35e-r2 * | MIST-1.8B| \<R2\> - Electronic spatial extent (unit: Bohr^2) |
121
- | mist-1.8B-6nmcwyrp-zpve | MIST-1.8B| ZPVE - Zero point vibrational energy (unit: Hartree) |
122
- | mist-1.8B-a7akimjj-u0 | MIST-1.8B| U0 - Internal energy at 0K (unit: Hartree) |
123
- | mist-1.8B-85f24xkj-u298 | MIST-1.8B| U298 - Internal energy at 298.15K (unit: Hartree) |
124
- | mist-1.8B-3fbbz4is-h298 | MIST-1.8B| H298 - Enthalpy at 298.15K (unit: Hartree) |
125
- | mist-1.8B-09sntn03-g298 | MIST-1.8B| G298 - Free energy at 298.15K (unit: Hartree) |
126
- | mist-1.8B-j356b3nf-cv | MIST-1.8B| Cv - Heat capacity at 298.15K (unit: cal/(mol*K)) |
127
 
128
  `*` Indicates models currently not available on Hugging Face due to storage limits
129
 
@@ -131,30 +140,32 @@ The single target (MIST-1.8B encoder) models for properties in QM9 are available
131
 
132
  These models consist of a MIST-encoder and task network finetuned on a single dataset used in the applications demonstrated in the manuscript.
133
 
134
- | Folder | Encoder | Dataset |
135
- | ------------------------- | :------: | ----------------------------------------------------------- |
136
- | mist-26.9M-48kpooqf-odour | MIST-28M | Olfaction |
137
- | mist-26.9M-6hk5coof-dn | MIST-28M | Donor Number |
138
- | mist-26.9M-0vxdbm36-kt | MIST-28M | Kamlet-Taft Solvatochromic Parameters |
139
- | mist-26.9M-b302p09x-bp | MIST-28M | Boiling Point (Part of Characteristic Temperatures Dataset) |
140
- | mist-26.9M-cyuo2xb6-fp | MIST-28M | Flash Point (Part of Characteristic Temperatures Dataset) |
141
- | mist-26.9M-y3ge5pf9-mp | MIST-28M | Melting Point (Part of Characteristic Temperatures Dataset) |
142
 
143
  ### Finetuned Multi-Task Models
144
  These are additional multi-target finetuned models consisting of a MIST encoder and task network.
145
- | Folder | Encoder | Dataset |
146
- | -------------------------- | :------: | ----------------------------------------------------------- |
147
- | mist-26.9M-kkgx0omx-qm9 | MIST-28M | QM9 Dataset with SMILES randomization |
148
- | mist-28M-ttqcvt6fs-toxcast | MIST-28M | ToxCast |
149
- | mist-28M-yr1urd2c-muv | MIST-28M | Maximum Unbiased Validation (MUV) |
 
150
 
151
  ### Finetuned Mixture Models
152
 
153
  These models consist of a MIST-encoder and physics informed task network for mixture property prediction.
154
- | Folder | Encoder | Dataset |
155
- | -------------------------------- | :------: | ----------------------------------------------------------- |
156
- | mist-conductivity-28M-2mpg8dcd | MIST-28M | Ionic Conductivity |
157
- | mist-mixtures-zffffbex | MIST-28M | Excess Density, Molar Volume and Molar Enthalpy |
 
158
 
159
  ## Citation
160
 
 
27
  - **Tokenization**: [``Smirk``](https://eeg.engin.umich.edu/smirk/) tokenizer
28
 
29
 
30
+
31
+ ## Model Inputs and Outputs
32
+
33
+ ### Inputs
34
+ - **SMILES strings**: Standard SMILES notation for molecular structures
35
+ - **Batch size**: Variable, automatically padded during inference
36
+
37
+ ### Outputs
38
+ - **Predictions**: Task-specific numerical or categorical predictions
39
+ - **Format**: Dictionary with channel names and predicted values (if channels are configured), or raw tensor output
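Because the output may be either a dictionary keyed by channel name or a raw tensor, downstream code may want a single shape to work with. The following is a minimal sketch, not part of the MIST codebase: `normalize_predictions`, `to_list`, and the `"output"` placeholder key are hypothetical names, and it assumes only that raw outputs expose `.tolist()` (as PyTorch tensors and NumPy arrays do) or are plain sequences.

```python
from collections.abc import Mapping


def to_list(values):
    """Convert a tensor-like object or a sequence to a plain Python list."""
    # Tensors and NumPy arrays expose .tolist(); plain sequences are copied.
    if hasattr(values, "tolist"):
        return values.tolist()
    return list(values)


def normalize_predictions(results, default_channel="output"):
    """Return predictions as {channel_name: list_of_values} (hypothetical helper)."""
    if isinstance(results, Mapping):
        # Channels are configured: keep the channel names as keys.
        return {name: to_list(values) for name, values in results.items()}
    # No channels configured: wrap the raw output under one placeholder key.
    return {default_channel: to_list(results)}
```

With this helper, `normalize_predictions(model.predict(smiles_batch))` would yield a dictionary in both cases, assuming `predict` returns one of the two forms described above.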
40
+
41
+
42
+ ## Quick Start
43
+
44
+ Tutorials are available in Google Colab:
45
+ - [Inference](https://colab.research.google.com/github/BattModels/mist-demo/blob/main/tutorials/molecular_property_prediction.ipynb)
46
+ - [Finetuning](https://colab.research.google.com/github/BattModels/mist-demo/blob/main/tutorials/run_finetuning.ipynb)
47
+
48
+ #### Running Locally
49
+
50
+ To run the model locally, create a virtual environment and install dependencies:
51
+
52
+ ```bash
53
+ python -m venv .venv
54
+ source .venv/bin/activate # On Windows: .venv\Scripts\activate
55
+ pip install -r requirements.txt
56
+ ```
57
+ > **Note**: SMIRK tokenizers require Rust to be installed. See the [Rust installation guide](https://www.rust-lang.org/tools/install) for details.
58
+
59
+
60
+ Use the model!
61
+ For a full list of model IDs and properties, see the list of provided models below.
62
+ For details on the specific input and output formats of each model variant, see its model card.
63
 
64
  ```python
65
  from transformers import AutoModel
 
67
 
68
  # Load the model
69
  model = AutoModel.from_pretrained(
70
+ "mist-models/mist-{size}-{model_id}-{property}",
71
  trust_remote_code=True
72
  )
73
 
 
80
  results = model.predict(smiles_batch)
81
  ```
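The placeholder `mist-models/mist-{size}-{model_id}-{property}` can be filled in from the entries in the tables of provided models. As a rough illustration, the helper below assembles a repository ID from its parts; `mist_repo_id` is a hypothetical name introduced here, not a function from `transformers` or `smirk`.

```python
def mist_repo_id(size, model_id, prop=None, org="mist-models"):
    """Build a Hub repository ID of the form org/mist-{size}-{model_id}[-{property}]."""
    name = f"mist-{size}-{model_id}"
    if prop:
        # Finetuned variants carry a trailing property or dataset suffix.
        name += f"-{prop}"
    return f"{org}/{name}"


# IDs taken from the tables of provided models.
bbbp_id = mist_repo_id("1.8B", "fbdn8e35", "bbbp")
pretrained_id = mist_repo_id("28M", "ti624ev1")
print(bbbp_id)
print(pretrained_id)
```

Not every listed model follows this pattern (for example `mist-conductivity-28M-2mpg8dcd`), so copying the folder name directly from the tables is the safer option for those.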
82
 
83
  ## Provided Models
84
 
85
  ### Pre-trained
86
+ - [`mist-1.8B-dh61satt`](https://huggingface.co/mist-models/mist-1.8B-dh61satt): Flagship MIST model (MIST-1.8B)
87
+ - [`mist-28M-ti624ev1`](https://huggingface.co/mist-models/mist-28M-ti624ev1): Smaller MIST model (MIST-28M).
88
 
89
  Below is a full list of finetuned variants hosted on HuggingFace:
90
  ### MoleculeNet Benchmark Models
91
 
92
+ | Folder | Encoder | Dataset |
93
+ | ---------------------------------------------------------------------- | :-------: | ------------------------- |
94
+ | [mist-1.8B-fbdn8e35-bbbp](https://huggingface.co/mist-models/mist-1.8B-fbdn8e35-bbbp) | MIST-1.8B | MoleculeNet BBBP |
95
+ | [mist-1.8B-1a4puhg2-hiv](https://huggingface.co/mist-models/mist-1.8B-1a4puhg2-hiv) | MIST-1.8B | MoleculeNet HIV |
96
+ | [mist-1.8B-m50jgolp-bace](https://huggingface.co/mist-models/mist-1.8B-m50jgolp-bace) | MIST-1.8B | MoleculeNet BACE |
97
+ | [mist-1.8B-uop1z0dc-tox21](https://huggingface.co/mist-models/mist-1.8B-uop1z0dc-tox21) | MIST-1.8B | MoleculeNet Tox21 |
98
+ | [mist-1.8B-lu1l5ieh-clintox](https://huggingface.co/mist-models/mist-1.8B-lu1l5ieh-clintox) | MIST-1.8B | MoleculeNet ClinTox |
99
+ | mist-1.8B-l1wfo7oa-sider * | MIST-1.8B | MoleculeNet SIDER |
100
+ | mist-1.8B-hxiygjsm-esol * | MIST-1.8B | MoleculeNet ESOL |
101
+ | [mist-1.8B-iwqj2cld-freesolv](https://huggingface.co/mist-models/mist-1.8B-iwqj2cld-freesolv) | MIST-1.8B | MoleculeNet FreeSolv |
102
+ | [mist-1.8B-jvt4azpz-lipo](https://huggingface.co/mist-models/mist-1.8B-jvt4azpz-lipo) | MIST-1.8B | MoleculeNet Lipophilicity |
103
+ | [mist-1.8B-8nd1ot5j-qm8](https://huggingface.co/mist-models/mist-1.8B-8nd1ot5j-qm8) | MIST-1.8B | MoleculeNet QM8 |
104
+ | [mist-28M-3xpfhv48-bbbp](https://huggingface.co/mist-models/mist-28M-3xpfhv48-bbbp) | MIST-28M | MoleculeNet BBBP |
105
+ | [mist-28M-8fh43gke-hiv](https://huggingface.co/mist-models/mist-28M-8fh43gke-hiv) | MIST-28M | MoleculeNet HIV |
106
+ | [mist-28M-8loj3bab-bace](https://huggingface.co/mist-models/mist-28M-8loj3bab-bace) | MIST-28M | MoleculeNet BACE |
107
+ | [mist-28M-kw4ks27p-tox21](https://huggingface.co/mist-models/mist-28M-kw4ks27p-tox21) | MIST-28M | MoleculeNet Tox21 |
108
+ | [mist-28M-97vfcykk-clintox](https://huggingface.co/mist-models/mist-28M-97vfcykk-clintox) | MIST-28M | MoleculeNet ClinTox |
109
+ | [mist-28M-z8qo16uy-sider](https://huggingface.co/mist-models/mist-28M-z8qo16uy-sider) | MIST-28M | MoleculeNet SIDER |
110
+ | [mist-28M-kcwb9le5-esol](https://huggingface.co/mist-models/mist-28M-kcwb9le5-esol) | MIST-28M | MoleculeNet ESOL |
111
+ | mist-28M-0uiq7o7m-freesolv * | MIST-28M | MoleculeNet FreeSolv |
112
+ | [mist-28M-xzr5ulva-lipo](https://huggingface.co/mist-models/mist-28M-xzr5ulva-lipo) | MIST-28M | MoleculeNet Lipophilicity |
113
+ | [mist-28M-gzwqzpcr-qm8](https://huggingface.co/mist-models/mist-28M-gzwqzpcr-qm8) | MIST-28M | MoleculeNet QM8 |
114
+ | [mist-26.9M-kkgx0omx-qm9](https://huggingface.co/mist-models/mist-26.9M-kkgx0omx-qm9) | MIST-28M | MoleculeNet QM9 |
115
+
116
 
117
  `*` Indicates models currently not available on Hugging Face due to storage limits
118
 
119
  #### QM9 Benchmark Models
120
  The single target (MIST-1.8B encoder) models for properties in QM9 are available.
121
 
122
+ | Folder | Encoder | Target |
123
+ | ---------------------------------------------------------------------- | :-------: | ----------------------------------------------------------------- |
124
+ | [mist-1.8B-ez05expv-mu](https://huggingface.co/mist-models/mist-1.8B-ez05expv-mu) | MIST-1.8B | μ - Dipole moment (unit: D) |
125
+ | mist-1.8B-rcwary93-alpha * | MIST-1.8B | α - Isotropic polarizability (unit: Bohr^3) |
126
+ | mist-1.8B-jmjosq12-homo * | MIST-1.8B | HOMO - Highest occupied molecular orbital energy (unit: Hartree) |
127
+ | mist-1.8B-n14wshc9-lumo * | MIST-1.8B | LUMO - Lowest unoccupied molecular orbital energy (unit: Hartree) |
128
+ | mist-1.8B-kayun6v3-gap * | MIST-1.8B | Gap - Gap between HOMO and LUMO (unit: Hartree) |
129
+ | mist-1.8B-xxe7t35e-r2 * | MIST-1.8B | \<R2\> - Electronic spatial extent (unit: Bohr^2) |
130
+ | [mist-1.8B-6nmcwyrp-zpve](https://huggingface.co/mist-models/mist-1.8B-6nmcwyrp-zpve) | MIST-1.8B | ZPVE - Zero point vibrational energy (unit: Hartree) |
131
+ | [mist-1.8B-a7akimjj-u0](https://huggingface.co/mist-models/mist-1.8B-a7akimjj-u0) | MIST-1.8B | U0 - Internal energy at 0K (unit: Hartree) |
132
+ | [mist-1.8B-85f24xkj-u298](https://huggingface.co/mist-models/mist-1.8B-85f24xkj-u298) | MIST-1.8B | U298 - Internal energy at 298.15K (unit: Hartree) |
133
+ | [mist-1.8B-3fbbz4is-h298](https://huggingface.co/mist-models/mist-1.8B-3fbbz4is-h298) | MIST-1.8B | H298 - Enthalpy at 298.15K (unit: Hartree) |
134
+ | [mist-1.8B-09sntn03-g298](https://huggingface.co/mist-models/mist-1.8B-09sntn03-g298) | MIST-1.8B | G298 - Free energy at 298.15K (unit: Hartree) |
135
+ | [mist-1.8B-j356b3nf-cv](https://huggingface.co/mist-models/mist-1.8B-j356b3nf-cv) | MIST-1.8B | Cv - Heat capacity at 298.15K (unit: cal/(mol*K)) |
136
 
137
  `*` Indicates models currently not available on Hugging Face due to storage limits
138
 
 
140
 
141
  These models consist of a MIST-encoder and task network finetuned on a single dataset used in the applications demonstrated in the manuscript.
142
 
143
+ | Folder | Encoder | Dataset |
144
+ | ---------------------------------------------------------------------- | :------: | ----------------------------------------------------------- |
145
+ | [mist-26.9M-48kpooqf-odour](https://huggingface.co/mist-models/mist-26.9M-48kpooqf-odour) | MIST-28M | Olfaction |
146
+ | [mist-26.9M-6hk5coof-dn](https://huggingface.co/mist-models/mist-26.9M-6hk5coof-dn) | MIST-28M | Donor Number |
147
+ | [mist-26.9M-0vxdbm36-kt](https://huggingface.co/mist-models/mist-26.9M-0vxdbm36-kt) | MIST-28M | Kamlet-Taft Solvatochromic Parameters |
148
+ | [mist-26.9M-b302p09x-bp](https://huggingface.co/mist-models/mist-26.9M-b302p09x-bp) | MIST-28M | Boiling Point (Part of Characteristic Temperatures Dataset) |
149
+ | [mist-26.9M-cyuo2xb6-fp](https://huggingface.co/mist-models/mist-26.9M-cyuo2xb6-fp) | MIST-28M | Flash Point (Part of Characteristic Temperatures Dataset) |
150
+ | [mist-26.9M-y3ge5pf9-mp](https://huggingface.co/mist-models/mist-26.9M-y3ge5pf9-mp) | MIST-28M | Melting Point (Part of Characteristic Temperatures Dataset) |
151
 
152
  ### Finetuned Multi-Task Models
153
  These are additional multi-target finetuned models consisting of a MIST encoder and task network.
154
+
155
+ | Folder | Encoder | Dataset |
156
+ | ---------------------------------------------------------------------- | :------: | ------------------------------------- |
157
+ | [mist-26.9M-kkgx0omx-qm9](https://huggingface.co/mist-models/mist-26.9M-kkgx0omx-qm9) | MIST-28M | QM9 Dataset with SMILES randomization |
158
+ | [mist-28M-ttqcvt6fs-toxcast](https://huggingface.co/mist-models/mist-28M-ttqcvt6fs-toxcast) | MIST-28M | ToxCast |
159
+ | [mist-28M-yr1urd2c-muv](https://huggingface.co/mist-models/mist-28M-yr1urd2c-muv) | MIST-28M | Maximum Unbiased Validation (MUV) |
160
 
161
  ### Finetuned Mixture Models
162
 
163
  These models consist of a MIST-encoder and physics informed task network for mixture property prediction.
164
+
165
+ | Folder | Encoder | Dataset |
166
+ | ---------------------------------------------------------------------- | :------: | ----------------------------------------------- |
167
+ | [mist-conductivity-28M-2mpg8dcd](https://huggingface.co/mist-models/mist-conductivity-28M-2mpg8dcd) | MIST-28M | Ionic Conductivity |
168
+ | [mist-mixtures-zffffbex](https://huggingface.co/mist-models/mist-mixtures-zffffbex) | MIST-28M | Excess Density, Molar Volume and Molar Enthalpy |
169
 
170
  ## Citation
171