anoushka2000 committed on
Commit
49240cf
·
verified ·
1 Parent(s): bab843a

Update README.md

Files changed (1)
  1. README.md +197 -5
README.md CHANGED
@@ -1,10 +1,202 @@
1
  ---
2
- title: README
3
- emoji: 🏢
4
- colorFrom: purple
5
  colorTo: purple
6
- sdk: static
7
  pinned: false
 
 
8
  ---
 
9
 
10
- Edit this `README.md` markdown file to author your organization card.
1
  ---
2
+ language: en
3
+ library_name: transformers
4
+ license: apache-2.0
5
+ tags:
6
+ - mist
7
+ - chemistry
8
+ - molecular-property-prediction
9
+ title: 'MIST: Molecular Insight SMILES Transformers'
10
+ sdk: streamlit
11
+ emoji: 🚀
12
+ colorFrom: indigo
13
  colorTo: purple
 
14
  pinned: false
15
+ thumbnail: >-
16
+ https://cdn-uploads.huggingface.co/production/uploads/672fec35d68675461e02d9ab/NqT2SG2Ox5Z1bpGKNjXBP.png
17
  ---
18
+ # MIST: Molecular Insight SMILES Transformers
19
 
20
+ MIST is a family of molecular foundation models for molecular property prediction.
21
+ The models were pre-trained on SMILES strings from the [Enamine REAL Space](https://enamine.net/compound-collections/real-compounds/real-space-navigator) dataset using the Masked Language Modeling (MLM) objective, then fine-tuned for downstream prediction tasks.
22
+
23
+ ## Model Details
24
+
25
+ - **Architecture**: Encoder-only transformer [``RoBERTa-PreLayerNorm``](https://huggingface.co/docs/transformers/en/model_doc/roberta-prelayernorm)
26
+ - **Pre-training**: Masked Language Modeling on molecular SMILES
27
+ - **Tokenization**: [``Smirk``](https://eeg.engin.umich.edu/smirk/) tokenizer
28
+
29
+
30
+
31
+ ## Model Inputs and Outputs
32
+
33
+ ### Inputs
34
+ - **SMILES strings**: Standard SMILES notation for molecular structures
35
+ - **Batch size**: Variable, automatically padded during inference
36
+
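Because the batch size is variable, a long list of SMILES strings can be split into smaller chunks before inference to limit memory use. The following is a minimal sketch: `iter_batches` is a hypothetical helper, not part of MIST, and it only slices the input list. Each chunk would then be passed to the model separately.

```python
from typing import Iterator, List, Sequence


def iter_batches(smiles: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `batch_size` SMILES strings.

    Hypothetical helper for illustration; not part of the MIST API.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(smiles), batch_size):
        yield list(smiles[start:start + batch_size])


molecules = ["CCO", "CC(=O)O", "C1=CC=CC=C1"]
for batch in iter_batches(molecules, batch_size=2):
    print(batch)
```

The last chunk may be shorter than `batch_size`; no padding is done here, since padding is handled during inference.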
37
+ ### Outputs
38
+ - **Predictions**: Task-specific numerical or categorical predictions
39
+ - **Format**: Dictionary with channel names and predicted values (if channels are configured), or raw tensor output
40
+
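The two output forms described above (a dictionary keyed by channel name, or a raw tensor) can be handled uniformly with a small wrapper. This is a minimal sketch under stated assumptions: `normalize_predictions` and the placeholder key `"raw"` are hypothetical names, not part of MIST, and the code assumes only that raw outputs expose a `tolist()` method, as PyTorch tensors and NumPy arrays do.

```python
from typing import Any, Dict


def normalize_predictions(output: Any) -> Dict[str, Any]:
    """Return predictions as a plain dict keyed by channel name.

    Hypothetical helper (not part of MIST). Dict outputs are assumed to map
    channel names to values; any other output is treated as a raw
    tensor-like object and stored under the placeholder key "raw".
    """

    def to_python(value: Any) -> Any:
        # PyTorch tensors and NumPy arrays both provide tolist().
        return value.tolist() if hasattr(value, "tolist") else value

    if isinstance(output, dict):
        return {name: to_python(value) for name, value in output.items()}
    return {"raw": to_python(output)}
```

Downstream code can then iterate over `normalize_predictions(results).items()` without checking which form a given model variant returns.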
41
+
42
+ ## Quick Start
43
+
44
+ Tutorials are available in Google Colab:
45
+ - [Inference](https://colab.research.google.com/github/BattModels/mist-demo/blob/main/tutorials/molecular_property_prediction.ipynb)
46
+ - [Finetuning](https://colab.research.google.com/github/BattModels/mist-demo/blob/main/tutorials/run_finetuning.ipynb)
47
+
48
+ #### Running Locally
49
+
50
+ To run the model locally, create a virtual environment and install dependencies:
51
+
52
+ ```bash
53
+ python -m venv .venv
54
+ source .venv/bin/activate # On Windows: .venv\Scripts\activate
55
+ pip install -r requirements.txt
56
+ ```
57
+ > **Note**: SMIRK tokenizers require Rust to be installed. See the [Rust installation guide](https://www.rust-lang.org/tools/install) for details.
58
+
59
+
60
+ Use the model!
61
+ For a full list of model IDs and properties, see the list of provided models below.
62
+ For details on the specific input and output formats of each model variant, see its model card.
63
+
64
+ ```python
65
+ from transformers import AutoModel
66
+ from smirk import SmirkTokenizerFast
67
+
68
+ # Load the model
69
+ model = AutoModel.from_pretrained(
70
+ "mist-models/mist-{size}-{model_id}-{property}",
71
+ trust_remote_code=True
72
+ )
73
+
74
+ # Make predictions
75
+ smiles_batch = [
76
+ "CCO", # Ethanol
77
+ "CC(=O)O", # Acetic acid
78
+ "C1=CC=CC=C1" # Benzene
79
+ ]
80
+ results = model.predict(smiles_batch)
81
+ ```
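The placeholder `mist-{size}-{model_id}-{property}` in the example above can be filled in programmatically. The sketch below is illustrative only: `build_repo_id` is a hypothetical helper, not part of MIST, and the values are copied from the model tables in this README. Some repositories (for example the mixture models) do not follow this naming pattern and should be referenced by their full name.

```python
from typing import Optional

ORG = "mist-models"  # Hugging Face organization used in the links of this README


def build_repo_id(size: str, model_id: str, prop: Optional[str] = None) -> str:
    """Fill the `mist-{size}-{model_id}-{property}` template.

    Hypothetical helper; pre-trained encoders omit the property suffix.
    """
    name = f"mist-{size}-{model_id}"
    if prop:
        name += f"-{prop}"
    return f"{ORG}/{name}"


# Values taken from the model tables in this README.
esol_repo = build_repo_id("28M", "kcwb9le5", "esol")
base_repo = build_repo_id("28M", "ti624ev1")
print(esol_repo)  # mist-models/mist-28M-kcwb9le5-esol
print(base_repo)  # mist-models/mist-28M-ti624ev1
```

The resulting string can be passed as the first argument to `AutoModel.from_pretrained`, as in the example above.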
82
+
83
+ ## Provided Models
84
+
85
+ ### Pre-trained
86
+ - [`mist-1.8B-dh61satt`](https://huggingface.co/mist-models/mist-1.8B-dh61satt): Flagship MIST model (MIST-1.8B)
87
+ - [`mist-28M-ti624ev1`](https://huggingface.co/mist-models/mist-28M-ti624ev1): Smaller MIST model (MIST-28M)
88
+
89
+ Below is a full list of finetuned variants hosted on Hugging Face:
90
+ ### MoleculeNet Benchmark Models
91
+
92
+ | Folder | Encoder | Dataset |
93
+ | ---------------------------------------------------------------------- | :-------: | ------------------------- |
94
+ | [mist-1.8B-fbdn8e35-bbbp](https://huggingface.co/mist-models/mist-1.8B-fbdn8e35-bbbp) | MIST-1.8B | MoleculeNet BBBP |
95
+ | [mist-1.8B-1a4puhg2-hiv](https://huggingface.co/mist-models/mist-1.8B-1a4puhg2-hiv) | MIST-1.8B | MoleculeNet HIV |
96
+ | [mist-1.8B-m50jgolp-bace](https://huggingface.co/mist-models/mist-1.8B-m50jgolp-bace) | MIST-1.8B | MoleculeNet BACE |
97
+ | [mist-1.8B-uop1z0dc-tox21](https://huggingface.co/mist-models/mist-1.8B-uop1z0dc-tox21) | MIST-1.8B | MoleculeNet Tox21 |
98
+ | [mist-1.8B-lu1l5ieh-clintox](https://huggingface.co/mist-models/mist-1.8B-lu1l5ieh-clintox) | MIST-1.8B | MoleculeNet ClinTox |
99
+ | mist-1.8B-l1wfo7oa-sider * | MIST-1.8B | MoleculeNet SIDER |
100
+ | mist-1.8B-hxiygjsm-esol * | MIST-1.8B | MoleculeNet ESOL |
101
+ | [mist-1.8B-iwqj2cld-freesolv](https://huggingface.co/mist-models/mist-1.8B-iwqj2cld-freesolv) | MIST-1.8B | MoleculeNet FreeSolv |
102
+ | [mist-1.8B-jvt4azpz-lipo](https://huggingface.co/mist-models/mist-1.8B-jvt4azpz-lipo) | MIST-1.8B | MoleculeNet Lipophilicity |
103
+ | [mist-1.8B-8nd1ot5j-qm8](https://huggingface.co/mist-models/mist-1.8B-8nd1ot5j-qm8) | MIST-1.8B | MoleculeNet QM8 |
104
+ | [mist-28M-3xpfhv48-bbbp](https://huggingface.co/mist-models/mist-28M-3xpfhv48-bbbp) ** | MIST-28M | MoleculeNet BBBP |
105
+ | [mist-28M-8fh43gke-hiv](https://huggingface.co/mist-models/mist-28M-8fh43gke-hiv) ** | MIST-28M | MoleculeNet HIV |
106
+ | [mist-28M-8loj3bab-bace](https://huggingface.co/mist-models/mist-28M-8loj3bab-bace) ** | MIST-28M | MoleculeNet BACE |
107
+ | [mist-28M-kw4ks27p-tox21](https://huggingface.co/mist-models/mist-28M-kw4ks27p-tox21) ** | MIST-28M | MoleculeNet Tox21 |
108
+ | [mist-28M-97vfcykk-clintox](https://huggingface.co/mist-models/mist-28M-97vfcykk-clintox) ** | MIST-28M | MoleculeNet ClinTox |
109
+ | [mist-28M-z8qo16uy-sider](https://huggingface.co/mist-models/mist-28M-z8qo16uy-sider) ** | MIST-28M | MoleculeNet SIDER |
110
+ | [mist-28M-kcwb9le5-esol](https://huggingface.co/mist-models/mist-28M-kcwb9le5-esol) ** | MIST-28M | MoleculeNet ESOL |
111
+ | mist-28M-0uiq7o7m-freesolv * | MIST-28M | MoleculeNet FreeSolv |
112
+ | [mist-28M-xzr5ulva-lipo](https://huggingface.co/mist-models/mist-28M-xzr5ulva-lipo) ** | MIST-28M | MoleculeNet Lipophilicity |
113
+ | [mist-28M-gzwqzpcr-qm8](https://huggingface.co/mist-models/mist-28M-gzwqzpcr-qm8) ** | MIST-28M | MoleculeNet QM8 |
114
+ | [mist-26.9M-kkgx0omx-qm9](https://huggingface.co/mist-models/mist-26.9M-kkgx0omx-qm9) ** | MIST-28M | MoleculeNet QM9 |
115
+
116
+
117
+ `**` Indicates publicly released models.
118
+ `*` Indicates models currently not available on Hugging Face due to storage limits.
119
+
120
+ #### QM9 Benchmark Models
121
+ Single-target models (MIST-1.8B encoder) are available for the following QM9 properties.
122
+
123
+ | Folder | Encoder | Target |
124
+ | ---------------------------------------------------------------------- | :-------: | ----------------------------------------------------------------- |
125
+ | [mist-1.8B-ez05expv-mu](https://huggingface.co/mist-models/mist-1.8B-ez05expv-mu) | MIST-1.8B | μ - Dipole moment (unit: D) |
126
+ | mist-1.8B-rcwary93-alpha * | MIST-1.8B | α - Isotropic polarizability (unit: Bohr^3) |
127
+ | mist-1.8B-jmjosq12-homo * | MIST-1.8B | HOMO - Highest occupied molecular orbital energy (unit: Hartree) |
128
+ | mist-1.8B-n14wshc9-lumo * | MIST-1.8B | LUMO - Lowest unoccupied molecular orbital energy (unit: Hartree) |
129
+ | mist-1.8B-kayun6v3-gap * | MIST-1.8B | Gap - Gap between HOMO and LUMO (unit: Hartree) |
130
+ | mist-1.8B-xxe7t35e-r2 * | MIST-1.8B | \<R2\> - Electronic spatial extent (unit: Bohr^2) |
131
+ | [mist-1.8B-6nmcwyrp-zpve](https://huggingface.co/mist-models/mist-1.8B-6nmcwyrp-zpve) | MIST-1.8B | ZPVE - Zero point vibrational energy (unit: Hartree) |
132
+ | [mist-1.8B-a7akimjj-u0](https://huggingface.co/mist-models/mist-1.8B-a7akimjj-u0) | MIST-1.8B | U0 - Internal energy at 0K (unit: Hartree) |
133
+ | [mist-1.8B-85f24xkj-u298](https://huggingface.co/mist-models/mist-1.8B-85f24xkj-u298) | MIST-1.8B | U298 - Internal energy at 298.15K (unit: Hartree) |
134
+ | [mist-1.8B-3fbbz4is-h298](https://huggingface.co/mist-models/mist-1.8B-3fbbz4is-h298) | MIST-1.8B | H298 - Enthalpy at 298.15K (unit: Hartree) |
135
+ | [mist-1.8B-09sntn03-g298](https://huggingface.co/mist-models/mist-1.8B-09sntn03-g298) | MIST-1.8B | G298 - Free energy at 298.15K (unit: Hartree) |
136
+ | [mist-1.8B-j356b3nf-cv](https://huggingface.co/mist-models/mist-1.8B-j356b3nf-cv) | MIST-1.8B | Cv - Heat capacity at 298.15K (unit: cal/(mol*K)) |
137
+
138
+ `*` Indicates models currently not available on Hugging Face due to storage limits.
139
+
140
+ ### Finetuned Single Task Models
141
+
142
+ Each model consists of a MIST encoder and a task network finetuned on a single dataset. These datasets are used in the applications demonstrated in the manuscript.
143
+
144
+ | Folder | Encoder | Dataset |
145
+ | ---------------------------------------------------------------------- | :------: | ----------------------------------------------------------- |
146
+ | [mist-26.9M-48kpooqf-odour](https://huggingface.co/mist-models/mist-26.9M-48kpooqf-odour) | MIST-28M | Olfaction |
147
+ | [mist-26.9M-6hk5coof-dn](https://huggingface.co/mist-models/mist-26.9M-6hk5coof-dn) | MIST-28M | Donor Number |
148
+ | [mist-26.9M-0vxdbm36-kt](https://huggingface.co/mist-models/mist-26.9M-0vxdbm36-kt) | MIST-28M | Kamlet-Taft Solvatochromic Parameters |
149
+ | [mist-26.9M-b302p09x-bp](https://huggingface.co/mist-models/mist-26.9M-b302p09x-bp) | MIST-28M | Boiling Point (Part of Characteristic Temperatures Dataset) |
150
+ | [mist-26.9M-cyuo2xb6-fp](https://huggingface.co/mist-models/mist-26.9M-cyuo2xb6-fp) | MIST-28M | Flash Point (Part of Characteristic Temperatures Dataset) |
151
+ | [mist-26.9M-y3ge5pf9-mp](https://huggingface.co/mist-models/mist-26.9M-y3ge5pf9-mp) | MIST-28M | Melting Point (Part of Characteristic Temperatures Dataset) |
152
+
153
+ ### Finetuned Multi-Task Models
154
+ These are additional multi-target finetuned models consisting of a MIST encoder and task network.
155
+
156
+ | Folder | Encoder | Dataset |
157
+ | ---------------------------------------------------------------------- | :------: | ------------------------------------- |
158
+ | [mist-26.9M-kkgx0omx-qm9](https://huggingface.co/mist-models/mist-26.9M-kkgx0omx-qm9) | MIST-28M | QM9 Dataset with SMILES randomization |
159
+ | [mist-28M-ttqcvt6fs-toxcast](https://huggingface.co/mist-models/mist-28M-ttqcvt6fs-toxcast) | MIST-28M | ToxCast |
160
+ | [mist-28M-yr1urd2c-muv](https://huggingface.co/mist-models/mist-28M-yr1urd2c-muv) | MIST-28M | Maximum Unbiased Validation (MUV) |
161
+ | [mist-28M-ggd8iisr-tmQM](https://huggingface.co/mist-models/mist-28M-ggd8iisr-tmQM) ** | MIST-28M | QM properties of transition metal organometallics |
162
+
163
+ `**` Indicates publicly released models.
164
+
165
+ ### Finetuned Mixture Models
166
+
167
+ These models consist of a MIST encoder and a physics-informed task network for mixture property prediction.
168
+
169
+ | Folder | Encoder | Dataset |
170
+ | ---------------------------------------------------------------------- | :------: | ----------------------------------------------- |
171
+ | [mist-conductivity-28M-2mpg8dcd](https://huggingface.co/mist-models/mist-conductivity-28M-2mpg8dcd) | MIST-28M | Ionic Conductivity |
172
+ | [mist-mixtures-zffffbex](https://huggingface.co/mist-models/mist-mixtures-zffffbex) | MIST-28M | Excess Density, Molar Volume and Molar Enthalpy |
173
+
174
+ ## Citation
175
+
176
+ If you use this model in your research, please cite:
177
+
178
+ ```bibtex
179
+ @online{MIST,
180
+ title = {Foundation Models for Discovery and Exploration in Chemical Space},
181
+ author = {Wadell, Alexius and Bhutani, Anoushka and Azumah, Victor and Ellis-Mohr, Austin R. and Kelly, Celia and Zhao, Hancheng and Nayak, Anuj K. and Hegazy, Kareem and Brace, Alexander and Lin, Hongyi and Emani, Murali and Vishwanath, Venkatram and Gering, Kevin and Alkan, Melisa and Gibbs, Tom and Wells, Jack and Varshney, Lav R. and Ramsundar, Bharath and Duraisamy, Karthik and Mahoney, Michael W. and Ramanathan, Arvind and Viswanathan, Venkatasubramanian},
182
+ date = {2025-10-20},
183
+ eprint = {2510.18900},
184
+ eprinttype = {arXiv},
185
+ eprintclass = {physics},
186
+ doi = {10.48550/arXiv.2510.18900},
187
+ url = {http://arxiv.org/abs/2510.18900},
188
+ }
189
+ ```
190
+
191
+ ## License and Notice
192
+
193
+ Model weights are provided as-is for research purposes only, without guarantees of correctness, fitness for purpose, or warranties of any kind.
194
+
195
+ **Restrictions:**
196
+ - Research use only
197
+ - No redistribution without permission
198
+ - No commercial use without licensing agreement
199
+
200
+ For questions, issues, or licensing inquiries, please contact [venkvis@umich.edu](mailto:venkvis@umich.edu).
201
+
202
+ <hr>