---
language: en
library_name: transformers
license: apache-2.0
tags:
- mist
- chemistry
- molecular-property-prediction
title: 'MIST: Molecular Insight SMILES Transformers'
sdk: streamlit
emoji: 🚀
colorFrom: indigo
colorTo: purple
pinned: false
thumbnail: >-
https://cdn-uploads.huggingface.co/production/uploads/672fec35d68675461e02d9ab/NqT2SG2Ox5Z1bpGKNjXBP.png
---
# MIST: Molecular Insight SMILES Transformers
MIST is a family of molecular foundation models for molecular property prediction.
The models were pre-trained on SMILES strings from the [Enamine REAL Space](https://enamine.net/compound-collections/real-compounds/real-space-navigator) dataset using the Masked Language Modeling (MLM) objective, then fine-tuned for downstream prediction tasks.
## Model Details
- **Architecture**: Encoder-only transformer [``RoBERTa-PreLayerNorm``](https://huggingface.co/docs/transformers/en/model_doc/roberta-prelayernorm)
- **Pre-training**: Masked Language Modeling on molecular SMILES
- **Tokenization**: [``Smirk``](https://eeg.engin.umich.edu/smirk/) tokenizer
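To make the pre-training objective concrete, here is a toy sketch of masked language modeling on a SMILES string. It assumes a simplified regex tokenizer, a `[MASK]` placeholder, and an arbitrary masking probability; none of these are taken from MIST, which uses the Smirk tokenizer and whose masking settings are not stated here. The names `tokenize_smiles` and `mask_tokens` are hypothetical.

```python
import random
import re

# Simplified atom-level SMILES pattern. This is an assumption for
# illustration only and is NOT the Smirk tokenizer.
SMILES_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|[A-Za-z]|\d|[=#()+\-/\\.@%:~*$]")


def tokenize_smiles(smiles):
    """Split a SMILES string into coarse tokens."""
    return SMILES_PATTERN.findall(smiles)


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Replace each token with [MASK] with probability `mask_prob`.

    Returns (inputs, labels); labels hold the original token at masked
    positions and None elsewhere, so a loss would only cover masked positions.
    """
    rng = random.Random(seed)
    inputs, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            inputs.append("[MASK]")
            labels.append(token)
        else:
            inputs.append(token)
            labels.append(None)
    return inputs, labels


tokens = tokenize_smiles("CC(=O)O")  # acetic acid
inputs, labels = mask_tokens(tokens, mask_prob=0.3, seed=1)
print(tokens)
print(inputs)
```

During pre-training, the model sees `inputs` and is trained to recover the original tokens stored in `labels`.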
## Model Inputs and Outputs
### Inputs
- **SMILES strings**: Standard SMILES notation for molecular structures
- **Batch size**: Variable, automatically padded during inference
### Outputs
- **Predictions**: Task-specific numerical or categorical predictions
- **Format**: Dictionary with channel names and predicted values (if channels are configured), or raw tensor output
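Because the output can be either a dictionary keyed by channel name or a raw tensor, downstream code may want to normalize both cases. The sketch below assumes that a dictionary maps each channel name to one value per input molecule and that a raw output exposes `.tolist()` (as PyTorch tensors and NumPy arrays do); the actual structure for each model is described on its model card. The helper names `to_records` and `_as_list` are hypothetical.

```python
def _as_list(values):
    """Convert a tensor/array (anything with .tolist()) or iterable to a list."""
    return values.tolist() if hasattr(values, "tolist") else list(values)


def to_records(smiles, results):
    """Return one dict per molecule, whichever output format was returned."""
    if isinstance(results, dict):
        # Assumed layout: {channel_name: [value for each molecule]}
        columns = {name: _as_list(values) for name, values in results.items()}
        return [
            {"smiles": s, **{name: col[i] for name, col in columns.items()}}
            for i, s in enumerate(smiles)
        ]
    # Raw output: one row of predictions per molecule.
    rows = _as_list(results)
    return [{"smiles": s, "prediction": row} for s, row in zip(smiles, rows)]


# Stand-in values for illustration only, not real model predictions.
print(to_records(["CCO", "CC(=O)O"], {"channel_a": [0.1, -0.2]}))
print(to_records(["CCO"], [[1.0, 2.0]]))
```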
## Quick Start
Tutorials are available in Google Colab:
- [Inference](https://colab.research.google.com/github/BattModels/mist-demo/blob/main/tutorials/molecular_property_prediction.ipynb)
- [Finetuning](https://colab.research.google.com/github/BattModels/mist-demo/blob/main/tutorials/run_finetuning.ipynb)
### Running Locally
To run the model locally, create a virtual environment and install dependencies:
```bash
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
pip install -r requirements.txt
```
> **Note**: The Smirk tokenizer requires Rust to be installed. See the [Rust installation guide](https://www.rust-lang.org/tools/install) for details.
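Since the tokenizer depends on Rust, it can help to confirm that the toolchain is on `PATH` before installing. The following is a minimal sketch using only the Python standard library; the helper name `missing_tools` is hypothetical, and checking for `rustc` and `cargo` specifically is an assumption about what the build needs.

```python
import shutil


def missing_tools(tools=("rustc", "cargo")):
    """Return the subset of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_tools()
if missing:
    print("Missing from PATH:", ", ".join(missing))
else:
    print("Rust toolchain found.")
```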
Once the dependencies are installed, load a model and call `predict` on a batch of SMILES strings.
For the full list of model IDs and properties, see the Provided Models section below.
For the input and output formats of each model variant, see its model card.
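```python
from transformers import AutoModel
from smirk import SmirkTokenizerFast  # noqa: F401 (unused directly; fails early if smirk is not installed)

# Load the model. Repository IDs follow the pattern
# "mist-models/mist-{size}-{model_id}-{property}"; the MIST-28M ESOL model
# from the list of provided models is used here as a concrete example.
model = AutoModel.from_pretrained(
    "mist-models/mist-28M-kcwb9le5-esol",
    trust_remote_code=True,
)

# Make predictions
smiles_batch = [
    "CCO",          # Ethanol
    "CC(=O)O",      # Acetic acid
    "C1=CC=CC=C1",  # Benzene
]
results = model.predict(smiles_batch)
```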
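For long lists of molecules, one option is to call `predict` on fixed-size chunks to bound memory use. The sketch below assumes only that `predict` accepts a list of SMILES strings, as in the example above; the helper `predict_in_chunks` and its `chunk_size` default are hypothetical, and how per-chunk results should be merged depends on the output format of the specific model.

```python
def predict_in_chunks(predict, smiles, chunk_size=64):
    """Yield (chunk, result) pairs, calling `predict` once per chunk."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, len(smiles), chunk_size):
        chunk = smiles[start:start + chunk_size]
        yield chunk, predict(chunk)


# Stand-in predictor so the sketch runs without downloading a model;
# in practice, pass `model.predict` instead.
def fake_predict(batch):
    return [len(s) for s in batch]


molecules = ["CCO", "CC(=O)O", "C1=CC=CC=C1"]
for chunk, result in predict_in_chunks(fake_predict, molecules, chunk_size=2):
    print(chunk, result)
```

Yielding pairs rather than concatenating results keeps the helper independent of whether a model returns a dictionary or a raw tensor.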
## Provided Models
### Pre-trained
- [`mist-1.8B-dh61satt`](https://huggingface.co/mist-models/mist-1.8B-dh61satt): Flagship MIST model (MIST-1.8B)
- [`mist-28M-ti624ev1`](https://huggingface.co/mist-models/mist-28M-ti624ev1): Smaller MIST model (MIST-28M)
The sections below list the finetuned variants hosted on Hugging Face.
### MoleculeNet Benchmark Models
| Folder | Encoder | Dataset |
| ---------------------------------------------------------------------- | :-------: | ------------------------- |
| [mist-1.8B-fbdn8e35-bbbp](https://huggingface.co/mist-models/mist-1.8B-fbdn8e35-bbbp) | MIST-1.8B | MoleculeNet BBBP |
| [mist-1.8B-1a4puhg2-hiv](https://huggingface.co/mist-models/mist-1.8B-1a4puhg2-hiv) | MIST-1.8B | MoleculeNet HIV |
| [mist-1.8B-m50jgolp-bace](https://huggingface.co/mist-models/mist-1.8B-m50jgolp-bace) | MIST-1.8B | MoleculeNet BACE |
| [mist-1.8B-uop1z0dc-tox21](https://huggingface.co/mist-models/mist-1.8B-uop1z0dc-tox21) | MIST-1.8B | MoleculeNet Tox21 |
| [mist-1.8B-lu1l5ieh-clintox](https://huggingface.co/mist-models/mist-1.8B-lu1l5ieh-clintox) | MIST-1.8B | MoleculeNet ClinTox |
| mist-1.8B-l1wfo7oa-sider * | MIST-1.8B | MoleculeNet SIDER |
| mist-1.8B-hxiygjsm-esol * | MIST-1.8B | MoleculeNet ESOL |
| [mist-1.8B-iwqj2cld-freesolv](https://huggingface.co/mist-models/mist-1.8B-iwqj2cld-freesolv) | MIST-1.8B | MoleculeNet FreeSolv |
| [mist-1.8B-jvt4azpz-lipo](https://huggingface.co/mist-models/mist-1.8B-jvt4azpz-lipo) | MIST-1.8B | MoleculeNet Lipophilicity |
| [mist-1.8B-8nd1ot5j-qm8](https://huggingface.co/mist-models/mist-1.8B-8nd1ot5j-qm8) | MIST-1.8B | MoleculeNet QM8 |
| [mist-28M-3xpfhv48-bbbp](https://huggingface.co/mist-models/mist-28M-3xpfhv48-bbbp) ** | MIST-28M | MoleculeNet BBBP |
| [mist-28M-8fh43gke-hiv](https://huggingface.co/mist-models/mist-28M-8fh43gke-hiv) ** | MIST-28M | MoleculeNet HIV |
| [mist-28M-8loj3bab-bace](https://huggingface.co/mist-models/mist-28M-8loj3bab-bace) ** | MIST-28M | MoleculeNet BACE |
| [mist-28M-kw4ks27p-tox21](https://huggingface.co/mist-models/mist-28M-kw4ks27p-tox21) ** | MIST-28M | MoleculeNet Tox21 |
| [mist-28M-97vfcykk-clintox](https://huggingface.co/mist-models/mist-28M-97vfcykk-clintox) ** | MIST-28M | MoleculeNet ClinTox |
| [mist-28M-z8qo16uy-sider](https://huggingface.co/mist-models/mist-28M-z8qo16uy-sider) ** | MIST-28M | MoleculeNet SIDER |
| [mist-28M-kcwb9le5-esol](https://huggingface.co/mist-models/mist-28M-kcwb9le5-esol) ** | MIST-28M | MoleculeNet ESOL |
| mist-28M-0uiq7o7m-freesolv * | MIST-28M | MoleculeNet FreeSolv |
| [mist-28M-xzr5ulva-lipo](https://huggingface.co/mist-models/mist-28M-xzr5ulva-lipo) ** | MIST-28M | MoleculeNet Lipophilicity |
| [mist-28M-gzwqzpcr-qm8](https://huggingface.co/mist-models/mist-28M-gzwqzpcr-qm8) ** | MIST-28M | MoleculeNet QM8 |
| [mist-26.9M-kkgx0omx-qm9](https://huggingface.co/mist-models/mist-26.9M-kkgx0omx-qm9) ** | MIST-28M | MoleculeNet QM9 |
`**` Indicates publicly released models.
`*` Indicates models not currently available on Hugging Face due to storage limits.
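Most folder names in the table follow the `mist-{size}-{model_id}-{property}` pattern, so they can be split programmatically. The sketch below is one way to do that; the regular expression and the helper names `parse_folder` and `repo_id` are assumptions. Names that do not fit the pattern, such as the pre-trained `mist-1.8B-dh61satt` or `mist-mixtures-zffffbex`, return `None`.

```python
import re

# Assumed naming pattern: mist-<size>-<model_id>-<property>
FOLDER_RE = re.compile(
    r"^mist-(?P<size>[\d.]+[MB])-(?P<model_id>[a-z0-9]+)-(?P<property>[A-Za-z0-9]+)$"
)


def parse_folder(name):
    """Split a folder name into size, model_id and property, or return None."""
    match = FOLDER_RE.match(name)
    return match.groupdict() if match else None


def repo_id(name, org="mist-models"):
    """Build the Hugging Face repository ID for a folder name."""
    return f"{org}/{name}"


print(parse_folder("mist-1.8B-fbdn8e35-bbbp"))
print(repo_id("mist-28M-kcwb9le5-esol"))
```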
### QM9 Benchmark Models
Single-target models (MIST-1.8B encoder) are available for the following QM9 properties.
| Folder | Encoder | Target |
| ---------------------------------------------------------------------- | :-------: | ----------------------------------------------------------------- |
| [mist-1.8B-ez05expv-mu](https://huggingface.co/mist-models/mist-1.8B-ez05expv-mu) | MIST-1.8B | μ - Dipole moment (unit: D) |
| mist-1.8B-rcwary93-alpha * | MIST-1.8B | α - Isotropic polarizability (unit: Bohr^3) |
| mist-1.8B-jmjosq12-homo * | MIST-1.8B | HOMO - Highest occupied molecular orbital energy (unit: Hartree) |
| mist-1.8B-n14wshc9-lumo * | MIST-1.8B | LUMO - Lowest unoccupied molecular orbital energy (unit: Hartree) |
| mist-1.8B-kayun6v3-gap * | MIST-1.8B | Gap - Gap between HOMO and LUMO (unit: Hartree) |
| mist-1.8B-xxe7t35e-r2 * | MIST-1.8B | \<R2\> - Electronic spatial extent (unit: Bohr^2) |
| [mist-1.8B-6nmcwyrp-zpve](https://huggingface.co/mist-models/mist-1.8B-6nmcwyrp-zpve) | MIST-1.8B | ZPVE - Zero point vibrational energy (unit: Hartree) |
| [mist-1.8B-a7akimjj-u0](https://huggingface.co/mist-models/mist-1.8B-a7akimjj-u0) | MIST-1.8B | U0 - Internal energy at 0K (unit: Hartree) |
| [mist-1.8B-85f24xkj-u298](https://huggingface.co/mist-models/mist-1.8B-85f24xkj-u298) | MIST-1.8B | U298 - Internal energy at 298.15K (unit: Hartree) |
| [mist-1.8B-3fbbz4is-h298](https://huggingface.co/mist-models/mist-1.8B-3fbbz4is-h298) | MIST-1.8B | H298 - Enthalpy at 298.15K (unit: Hartree) |
| [mist-1.8B-09sntn03-g298](https://huggingface.co/mist-models/mist-1.8B-09sntn03-g298) | MIST-1.8B | G298 - Free energy at 298.15K (unit: Hartree) |
| [mist-1.8B-j356b3nf-cv](https://huggingface.co/mist-models/mist-1.8B-j356b3nf-cv) | MIST-1.8B | Cv - Heat capacity at 298.15K (unit: cal/(mol*K)) |
`*` Indicates models not currently available on Hugging Face due to storage limits.
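Several of the QM9 targets are listed in Hartree, which is often converted to eV or kcal/mol for comparison with other sources. The sketch below uses the standard conversion factors (1 Hartree ≈ 27.2114 eV ≈ 627.509 kcal/mol); the helper names are hypothetical, and the example value is made up for illustration rather than taken from a model.

```python
HARTREE_TO_EV = 27.211386
HARTREE_TO_KCAL_PER_MOL = 627.5095


def hartree_to_ev(value):
    """Convert an energy from Hartree to electronvolts."""
    return value * HARTREE_TO_EV


def hartree_to_kcal_per_mol(value):
    """Convert an energy from Hartree to kcal/mol."""
    return value * HARTREE_TO_KCAL_PER_MOL


gap_hartree = 0.25  # illustrative value only, not a prediction
print(f"{gap_hartree} Ha = {hartree_to_ev(gap_hartree):.3f} eV")
print(f"{gap_hartree} Ha = {hartree_to_kcal_per_mol(gap_hartree):.2f} kcal/mol")
```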
### Finetuned Single Task Models
These models consist of a MIST encoder and a task network finetuned on a single dataset. They were used in the applications demonstrated in the manuscript.
| Folder | Encoder | Dataset |
| ---------------------------------------------------------------------- | :------: | ----------------------------------------------------------- |
| [mist-26.9M-48kpooqf-odour](https://huggingface.co/mist-models/mist-26.9M-48kpooqf-odour) | MIST-28M | Olfaction |
| [mist-26.9M-6hk5coof-dn](https://huggingface.co/mist-models/mist-26.9M-6hk5coof-dn) | MIST-28M | Donor Number |
| [mist-26.9M-0vxdbm36-kt](https://huggingface.co/mist-models/mist-26.9M-0vxdbm36-kt) | MIST-28M | Kamlet-Taft Solvatochromic Parameters |
| [mist-26.9M-b302p09x-bp](https://huggingface.co/mist-models/mist-26.9M-b302p09x-bp) | MIST-28M | Boiling Point (Part of Characteristic Temperatures Dataset) |
| [mist-26.9M-cyuo2xb6-fp](https://huggingface.co/mist-models/mist-26.9M-cyuo2xb6-fp) | MIST-28M | Flash Point (Part of Characteristic Temperatures Dataset) |
| [mist-26.9M-y3ge5pf9-mp](https://huggingface.co/mist-models/mist-26.9M-y3ge5pf9-mp) | MIST-28M | Melting Point (Part of Characteristic Temperatures Dataset) |
### Finetuned Multi-Task Models
These are additional multi-target finetuned models consisting of a MIST encoder and task network.
| Folder | Encoder | Dataset |
| ---------------------------------------------------------------------- | :------: | ------------------------------------- |
| [mist-26.9M-kkgx0omx-qm9](https://huggingface.co/mist-models/mist-26.9M-kkgx0omx-qm9) | MIST-28M | QM9 Dataset with SMILES randomization |
| [mist-28M-ttqcvt6fs-toxcast](https://huggingface.co/mist-models/mist-28M-ttqcvt6fs-toxcast) | MIST-28M | ToxCast |
| [mist-28M-yr1urd2c-muv](https://huggingface.co/mist-models/mist-28M-yr1urd2c-muv) | MIST-28M | Maximum Unbiased Validation (MUV) |
| [mist-28M-ggd8iisr-tmQM](https://huggingface.co/mist-models/mist-28M-ggd8iisr-tmQM) ** | MIST-28M | QM properties of transition metal organometallics |
`**` Indicates publicly released models.
### Finetuned Mixture Models
These models consist of a MIST encoder and a physics-informed task network for mixture property prediction.
| Folder | Encoder | Dataset |
| ---------------------------------------------------------------------- | :------: | ----------------------------------------------- |
| [mist-conductivity-28M-2mpg8dcd](https://huggingface.co/mist-models/mist-conductivity-28M-2mpg8dcd) | MIST-28M | Ionic Conductivity |
| [mist-mixtures-zffffbex](https://huggingface.co/mist-models/mist-mixtures-zffffbex) | MIST-28M | Excess Density, Molar Volume and Molar Enthalpy |
## Citation
If you use this model in your research, please cite:
```bibtex
@online{MIST,
title = {Foundation Models for Discovery and Exploration in Chemical Space},
author = {Wadell, Alexius and Bhutani, Anoushka and Azumah, Victor and Ellis-Mohr, Austin R. and Kelly, Celia and Zhao, Hancheng and Nayak, Anuj K. and Hegazy, Kareem and Brace, Alexander and Lin, Hongyi and Emani, Murali and Vishwanath, Venkatram and Gering, Kevin and Alkan, Melisa and Gibbs, Tom and Wells, Jack and Varshney, Lav R. and Ramsundar, Bharath and Duraisamy, Karthik and Mahoney, Michael W. and Ramanathan, Arvind and Viswanathan, Venkatasubramanian},
date = {2025-10-20},
eprint = {2510.18900},
eprinttype = {arXiv},
eprintclass = {physics},
doi = {10.48550/arXiv.2510.18900},
url = {http://arxiv.org/abs/2510.18900},
}
```
## License and Notice
Model weights are provided as-is for research purposes only, without guarantees of correctness, fitness for purpose, or warranties of any kind.
**Restrictions:**
- Research use only
- No redistribution without permission
- No commercial use without licensing agreement
For questions, issues, or licensing inquiries, please contact [venkvis@umich.edu](mailto:venkvis@umich.edu).
<hr>