---
language: en
library_name: transformers
license: apache-2.0
tags:
- mist
- chemistry
- molecular-property-prediction
title: 'MIST: Molecular Insight SMILES Transformers'
sdk: streamlit
emoji: 🚀
colorFrom: indigo
colorTo: purple
pinned: false
thumbnail: >-
  https://cdn-uploads.huggingface.co/production/uploads/672fec35d68675461e02d9ab/NqT2SG2Ox5Z1bpGKNjXBP.png
---
# MIST: Molecular Insight SMILES Transformers

MIST is a family of molecular foundation models for molecular property prediction.
The models were pre-trained on SMILES strings from the [Enamine REAL Space](https://enamine.net/compound-collections/real-compounds/real-space-navigator) dataset using the Masked Language Modeling (MLM) objective, then fine-tuned for downstream prediction tasks.

## Model Details

- **Architecture**: Encoder-only transformer [``RoBERTa-PreLayerNorm``](https://huggingface.co/docs/transformers/en/model_doc/roberta-prelayernorm)
- **Pre-training**: Masked Language Modeling on molecular SMILES
- **Tokenization**: [``Smirk``](https://eeg.engin.umich.edu/smirk/) tokenizer

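As a rough illustration of the Masked Language Modeling objective, the sketch below hides a random subset of tokens in a SMILES string and records the originals as prediction targets. It uses a naive character-level split rather than the Smirk tokenizer, and the 15% masking rate and `[MASK]` symbol are assumptions borrowed from common BERT-style setups, not values confirmed for MIST.

```python
import random

MASK_TOKEN = "[MASK]"  # assumed placeholder symbol, not taken from the MIST vocabulary


def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Return (masked_tokens, targets) for a toy MLM example.

    `targets` maps each masked position to the original token that the
    model would be trained to recover.
    """
    rng = random.Random(seed)
    # Always mask at least one token so the example has a target.
    n_mask = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = masked[pos]
        masked[pos] = MASK_TOKEN
    return masked, targets


# Naive character-level split; the real Smirk tokenizer works differently.
tokens = list("CC(=O)O")  # acetic acid
masked, targets = mask_tokens(tokens)
print(masked)
print(targets)
```

During pre-training, the encoder sees the masked sequence and is scored on how well it predicts the tokens stored in `targets`.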


## Model Inputs and Outputs

### Inputs
- **SMILES strings**: Standard SMILES notation for molecular structures
- **Batch size**: Variable, automatically padded during inference

### Outputs
- **Predictions**: Task-specific numerical or categorical predictions
- **Format**: Dictionary with channel names and predicted values (if channels are configured), or raw tensor output

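Because the output may be either a dictionary keyed by channel name or a raw tensor, downstream code may want to handle both cases. The helper below is a minimal sketch that uses plain Python lists in place of tensors; `to_channel_dict` and the fallback names `channel_0`, `channel_1`, and so on are hypothetical and not part of the MIST API.

```python
def to_channel_dict(output, fallback_prefix="channel"):
    """Normalize a prediction output to {channel_name: [value per molecule]}.

    `output` is either a dict (channels configured) or a list of rows,
    one row per molecule, standing in for a raw 2-D tensor.
    """
    if isinstance(output, dict):
        return {name: list(values) for name, values in output.items()}
    # Raw output: transpose rows (molecules) into columns (channels).
    n_channels = len(output[0]) if output else 0
    return {
        f"{fallback_prefix}_{i}": [row[i] for row in output]
        for i in range(n_channels)
    }


# Made-up numbers, for illustration only.
raw = [[0.1, 0.9], [0.4, 0.6], [0.7, 0.3]]
print(to_channel_dict(raw))
print(to_channel_dict({"property_a": [1.0, 2.0, 3.0]}))
```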

## Quick Start

Tutorials are available in Google Colab:
- [Inference](https://colab.research.google.com/github/BattModels/mist-demo/blob/main/tutorials/molecular_property_prediction.ipynb)
- [Finetuning](https://colab.research.google.com/github/BattModels/mist-demo/blob/main/tutorials/run_finetuning.ipynb)
  
#### Running Locally

To run the model locally, create a virtual environment and install dependencies:

```bash
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
pip install -r requirements.txt
```
> **Note**: The Smirk tokenizer requires Rust to be installed. See the [Rust installation guide](https://www.rust-lang.org/tools/install) for details.


Then load a model and run predictions.
For a full list of model IDs and properties, see the Provided Models section below.
For details on the input and output formats of each model variant, see its model card.

```python
from transformers import AutoModel
from smirk import SmirkTokenizerFast

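# Load the model. Replace the {size}, {model_id}, and {property} placeholders
# with the values of a model listed under "Provided Models".
model = AutoModel.from_pretrained(
    "mist-models/mist-{size}-{model_id}-{property}",
    trust_remote_code=True,
)

# Make predictions
smiles_batch = [
    "CCO",          # Ethanol
    "CC(=O)O",      # Acetic acid
    "C1=CC=CC=C1",  # Benzene
]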
results = model.predict(smiles_batch)
```
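
The placeholder in the model ID above follows the `mist-{size}-{model_id}-{property}` pattern used by most entries in the tables below. A small helper, sketched here under that assumption, fills it in. `build_repo_id` is a hypothetical name, and a few repositories (for example the mixture models) do not follow this pattern.

```python
def build_repo_id(size, model_id, prop=None, org="mist-models"):
    """Assemble a repo ID of the form org/mist-size-id[-property].

    Pre-trained encoders carry no property suffix, so `prop` is optional.
    """
    parts = ["mist", size, model_id]
    if prop:
        parts.append(prop)
    return f"{org}/{'-'.join(parts)}"


print(build_repo_id("28M", "kcwb9le5", "esol"))
print(build_repo_id("1.8B", "dh61satt"))
```

The resulting string can be passed to `AutoModel.from_pretrained` in place of the template.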

## Provided Models

### Pre-trained
- [`mist-1.8B-dh61satt`](https://huggingface.co/mist-models/mist-1.8B-dh61satt): Flagship MIST model (MIST-1.8B)
- [`mist-28M-ti624ev1`](https://huggingface.co/mist-models/mist-28M-ti624ev1) **: Smaller MIST model (MIST-28M).

`**` Indicates publicly released models.

Below is a full list of finetuned variants hosted on Hugging Face:

### MoleculeNet Benchmark Models

| Folder                                                                 | Encoder   | Dataset                   |
| ---------------------------------------------------------------------- | :-------: | ------------------------- |
| [mist-1.8B-fbdn8e35-bbbp](https://huggingface.co/mist-models/mist-1.8B-fbdn8e35-bbbp)      | MIST-1.8B | MoleculeNet BBBP          |
| [mist-1.8B-1a4puhg2-hiv](https://huggingface.co/mist-models/mist-1.8B-1a4puhg2-hiv)            | MIST-1.8B | MoleculeNet HIV           |
| [mist-1.8B-m50jgolp-bace](https://huggingface.co/mist-models/mist-1.8B-m50jgolp-bace)          | MIST-1.8B | MoleculeNet BACE          |
| [mist-1.8B-uop1z0dc-tox21](https://huggingface.co/mist-models/mist-1.8B-uop1z0dc-tox21)        | MIST-1.8B | MoleculeNet Tox21         |
| [mist-1.8B-lu1l5ieh-clintox](https://huggingface.co/mist-models/mist-1.8B-lu1l5ieh-clintox)    | MIST-1.8B | MoleculeNet ClinTox       |
| mist-1.8B-l1wfo7oa-sider *                                             | MIST-1.8B | MoleculeNet SIDER         |
| mist-1.8B-hxiygjsm-esol *        | MIST-1.8B | MoleculeNet ESOL          |
| [mist-1.8B-iwqj2cld-freesolv](https://huggingface.co/mist-models/mist-1.8B-iwqj2cld-freesolv)  | MIST-1.8B | MoleculeNet FreeSolv      |
| [mist-1.8B-jvt4azpz-lipo](https://huggingface.co/mist-models/mist-1.8B-jvt4azpz-lipo)          | MIST-1.8B | MoleculeNet Lipophilicity |
| [mist-1.8B-8nd1ot5j-qm8](https://huggingface.co/mist-models/mist-1.8B-8nd1ot5j-qm8)            | MIST-1.8B | MoleculeNet QM8           |
| [mist-28M-3xpfhv48-bbbp](https://huggingface.co/mist-models/mist-28M-3xpfhv48-bbbp)  **         | MIST-28M  | MoleculeNet BBBP          |
| [mist-28M-8fh43gke-hiv](https://huggingface.co/mist-models/mist-28M-8fh43gke-hiv) **            | MIST-28M  | MoleculeNet HIV           |
| [mist-28M-8loj3bab-bace](https://huggingface.co/mist-models/mist-28M-8loj3bab-bace)  **         | MIST-28M  | MoleculeNet BACE          |
| [mist-28M-kw4ks27p-tox21](https://huggingface.co/mist-models/mist-28M-kw4ks27p-tox21)  **       | MIST-28M  | MoleculeNet Tox21         |
| [mist-28M-97vfcykk-clintox](https://huggingface.co/mist-models/mist-28M-97vfcykk-clintox)  **   | MIST-28M  | MoleculeNet ClinTox       |
| [mist-28M-z8qo16uy-sider](https://huggingface.co/mist-models/mist-28M-z8qo16uy-sider) **       | MIST-28M  | MoleculeNet SIDER         |
| [mist-28M-kcwb9le5-esol](https://huggingface.co/mist-models/mist-28M-kcwb9le5-esol)  **          | MIST-28M  | MoleculeNet ESOL          |
| [mist-28M-0uiq7o7m-freesolv](https://huggingface.co/mist-models/mist-28M-0uiq7o7m-freesolv) **  | MIST-28M  | MoleculeNet FreeSolv      |
| [mist-28M-xzr5ulva-lipo](https://huggingface.co/mist-models/mist-28M-xzr5ulva-lipo)  **         | MIST-28M  | MoleculeNet Lipophilicity |
| [mist-28M-gzwqzpcr-qm8](https://huggingface.co/mist-models/mist-28M-gzwqzpcr-qm8) **            | MIST-28M  | MoleculeNet QM8           |
| [mist-26.9M-kkgx0omx-qm9](https://huggingface.co/mist-models/mist-26.9M-kkgx0omx-qm9)  **       | MIST-28M  | MoleculeNet QM9           |


`**` Indicates publicly released models.

`*` Indicates models currently not available on Hugging Face due to storage limits.

#### QM9 Benchmark Models
Single-target models (MIST-1.8B encoder) are available for the following QM9 properties.

| Folder                                                                 | Encoder   | Target                                                            |
| ---------------------------------------------------------------------- | :-------: | ----------------------------------------------------------------- |
| [mist-1.8B-ez05expv-mu](https://huggingface.co/mist-models/mist-1.8B-ez05expv-mu)               | MIST-1.8B | μ - Dipole moment (unit: D)                                       |
| mist-1.8B-rcwary93-alpha *                                             | MIST-1.8B | α - Isotropic polarizability (unit: Bohr^3)                       |
| mist-1.8B-jmjosq12-homo *                                              | MIST-1.8B | HOMO - Highest occupied molecular orbital energy (unit: Hartree)  |
| mist-1.8B-n14wshc9-lumo *                                              | MIST-1.8B | LUMO - Lowest unoccupied molecular orbital energy (unit: Hartree) |
| mist-1.8B-kayun6v3-gap *                                               | MIST-1.8B | Gap - Gap between HOMO and LUMO (unit: Hartree)                   |
| mist-1.8B-xxe7t35e-r2 *                                                | MIST-1.8B | \<R2\> - Electronic spatial extent (unit: Bohr^2)                 |
| [mist-1.8B-6nmcwyrp-zpve](https://huggingface.co/mist-models/mist-1.8B-6nmcwyrp-zpve)          | MIST-1.8B | ZPVE - Zero point vibrational energy (unit: Hartree)              |
| [mist-1.8B-a7akimjj-u0](https://huggingface.co/mist-models/mist-1.8B-a7akimjj-u0)              | MIST-1.8B | U0 - Internal energy at 0K (unit: Hartree)                        |
| [mist-1.8B-85f24xkj-u298](https://huggingface.co/mist-models/mist-1.8B-85f24xkj-u298)          | MIST-1.8B | U298 - Internal energy at 298.15K (unit: Hartree)                 |
| [mist-1.8B-3fbbz4is-h298](https://huggingface.co/mist-models/mist-1.8B-3fbbz4is-h298)          | MIST-1.8B | H298 - Enthalpy at 298.15K (unit: Hartree)                        |
| [mist-1.8B-09sntn03-g298](https://huggingface.co/mist-models/mist-1.8B-09sntn03-g298)          | MIST-1.8B | G298 - Free energy at 298.15K (unit: Hartree)                     |
| [mist-1.8B-j356b3nf-cv](https://huggingface.co/mist-models/mist-1.8B-j356b3nf-cv)              | MIST-1.8B | Cv - Heat capacity at 298.15K (unit: cal/(mol*K))                 |

`*` Indicates models currently not available on Hugging Face due to storage limits.

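Several QM9 targets above are reported in Hartree. If another energy unit is more convenient, predictions can be converted after the fact. The sketch below uses the CODATA 2018 conversion factors (1 Hartree ≈ 27.211386 eV ≈ 627.5095 kcal/mol); the function names are illustrative and the input value is made up.

```python
HARTREE_TO_EV = 27.211386245988           # CODATA 2018
HARTREE_TO_KCAL_PER_MOL = 627.5094740631  # CODATA 2018


def hartree_to_ev(energy_hartree):
    """Convert an energy from Hartree to electronvolts."""
    return energy_hartree * HARTREE_TO_EV


def hartree_to_kcal_per_mol(energy_hartree):
    """Convert an energy from Hartree to kcal/mol."""
    return energy_hartree * HARTREE_TO_KCAL_PER_MOL


# Made-up HOMO-LUMO gap, for illustration only.
gap_hartree = 0.25
print(f"{hartree_to_ev(gap_hartree):.4f} eV")
print(f"{hartree_to_kcal_per_mol(gap_hartree):.2f} kcal/mol")
```
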
### Finetuned Single Task Models

These models consist of a MIST encoder and a task network, finetuned on a single dataset from the applications demonstrated in the manuscript.

| Folder                                                                 | Encoder  | Dataset                                                     |
| ---------------------------------------------------------------------- | :------: | ----------------------------------------------------------- |
| [mist-26.9M-48kpooqf-odour](https://huggingface.co/mist-models/mist-26.9M-48kpooqf-odour)       | MIST-28M | Olfaction                                                   |
| [mist-26.9M-6hk5coof-dn](https://huggingface.co/mist-models/mist-26.9M-6hk5coof-dn)             | MIST-28M | Donor Number                                                |
| [mist-26.9M-0vxdbm36-kt](https://huggingface.co/mist-models/mist-26.9M-0vxdbm36-kt)             | MIST-28M | Kamlet-Taft Solvatochromic Parameters                       |
| [mist-26.9M-b302p09x-bp](https://huggingface.co/mist-models/mist-26.9M-b302p09x-bp)             | MIST-28M | Boiling Point (Part of Characteristic Temperatures Dataset) |
| [mist-26.9M-cyuo2xb6-fp](https://huggingface.co/mist-models/mist-26.9M-cyuo2xb6-fp)             | MIST-28M | Flash Point (Part of Characteristic Temperatures Dataset)   |
| [mist-26.9M-y3ge5pf9-mp](https://huggingface.co/mist-models/mist-26.9M-y3ge5pf9-mp)             | MIST-28M | Melting Point (Part of Characteristic Temperatures Dataset) |

### Finetuned Multi-Task Models
These are additional multi-target finetuned models consisting of a MIST encoder and task network.

| Folder                                                                 | Encoder  | Dataset                               |
| ---------------------------------------------------------------------- | :------: | ------------------------------------- |
| [mist-26.9M-kkgx0omx-qm9](https://huggingface.co/mist-models/mist-26.9M-kkgx0omx-qm9)            | MIST-28M | QM9 Dataset with SMILES randomization |
| [mist-28M-ttqcvt6fs-toxcast](https://huggingface.co/mist-models/mist-28M-ttqcvt6fs-toxcast)      | MIST-28M | ToxCast                               |
| [mist-28M-yr1urd2c-muv](https://huggingface.co/mist-models/mist-28M-yr1urd2c-muv)                | MIST-28M | Maximum Unbiased Validation (MUV)     |
| [mist-28M-ggd8iisr-tmQM](https://huggingface.co/mist-models/mist-models/mist-28M-ggd8iisr-tmQM) ** | MIST-28M | QM properties of transition metal organometallics |

`**` Indicates publicly released models.

### Finetuned Mixture Models

These models consist of a MIST encoder and a physics-informed task network for mixture property prediction.

| Folder                                                                 | Encoder  | Dataset                                         |
| ---------------------------------------------------------------------- | :------: | ----------------------------------------------- |
| [mist-conductivity-28M-2mpg8dcd](https://huggingface.co/mist-models/mist-conductivity-28M-2mpg8dcd) | MIST-28M | Ionic Conductivity                              |
| [mist-mixtures-zffffbex](https://huggingface.co/mist-models/mist-mixtures-zffffbex)              | MIST-28M | Excess Density, Molar Volume and Molar Enthalpy |

## Citation

If you use this model in your research, please cite:

```bibtex
@online{MIST,
  title = {Foundation Models for Discovery and Exploration in Chemical Space},
  author = {Wadell, Alexius and Bhutani, Anoushka and Azumah, Victor and Ellis-Mohr, Austin R. and Kelly, Celia and Zhao, Hancheng and Nayak, Anuj K. and Hegazy, Kareem and Brace, Alexander and Lin, Hongyi and Emani, Murali and Vishwanath, Venkatram and Gering, Kevin and Alkan, Melisa and Gibbs, Tom and Wells, Jack and Varshney, Lav R. and Ramsundar, Bharath and Duraisamy, Karthik and Mahoney, Michael W. and Ramanathan, Arvind and Viswanathan, Venkatasubramanian},
  date = {2025-10-20},
  eprint = {2510.18900},
  eprinttype = {arXiv},
  eprintclass = {physics},
  doi = {10.48550/arXiv.2510.18900},
  url = {http://arxiv.org/abs/2510.18900},  
}
```

## License and Notice

Model weights are provided as-is for research purposes only, without guarantees of correctness, fitness for purpose, or warranties of any kind.

**Restrictions:**
- Research use only
- No redistribution without permission
- No commercial use without licensing agreement

For questions, issues, or licensing inquiries, please contact [venkvis@umich.edu](mailto:venkvis@umich.edu).

<hr>