Update README.md
README.md
CHANGED
|
@@ -1,10 +1,202 @@
|
|
| 1 |
---
|
| 2 |
-
|
| 3 |
-
|
| 4 |
-
|
| 5 |
colorTo: purple
|
| 6 |
-
sdk: static
|
| 7 |
pinned: false
|
| 8 |
---
|
| 9 |
|
| 10 |
-
|
| 1 |
---
|
| 2 |
+
language: en
|
| 3 |
+
library_name: transformers
|
| 4 |
+
license: apache-2.0
|
| 5 |
+
tags:
|
| 6 |
+
- mist
|
| 7 |
+
- chemistry
|
| 8 |
+
- molecular-property-prediction
|
| 9 |
+
title: 'MIST: Molecular Insight SMILES Transformers'
|
| 10 |
+
sdk: streamlit
|
| 11 |
+
emoji: 🚀
|
| 12 |
+
colorFrom: indigo
|
| 13 |
colorTo: purple
|
| 14 |
pinned: false
|
| 15 |
+
thumbnail: >-
|
| 16 |
+
https://cdn-uploads.huggingface.co/production/uploads/672fec35d68675461e02d9ab/NqT2SG2Ox5Z1bpGKNjXBP.png
|
| 17 |
---
|
| 18 |
+
# MIST: Molecular Insight SMILES Transformers
|
| 19 |
|
| 20 |
+
MIST is a family of molecular foundation models for molecular property prediction.
|
| 21 |
+
The models were pre-trained on SMILES strings from the [Enamine REAL Space](https://enamine.net/compound-collections/real-compounds/real-space-navigator) dataset using the Masked Language Modeling (MLM) objective, then fine-tuned for downstream prediction tasks.
|
| 22 |
+
|
| 23 |
+
## Model Details
|
| 24 |
+
|
| 25 |
+
- **Architecture**: Encoder-only transformer [``RoBERTa-PreLayerNorm``](https://huggingface.co/docs/transformers/en/model_doc/roberta-prelayernorm)
|
| 26 |
+
- **Pre-training**: Masked Language Modeling on molecular SMILES
|
| 27 |
+
- **Tokenization**: [``Smirk``](https://eeg.engin.umich.edu/smirk/) tokenizer
|
| 28 |
+
|
| 29 |
+
|
| 30 |
+
|
| 31 |
+
## Model Inputs and Outputs
|
| 32 |
+
|
| 33 |
+
### Inputs
|
| 34 |
+
- **SMILES strings**: Standard SMILES notation for molecular structures
|
| 35 |
+
- **Batch size**: Variable, automatically padded during inference
|
| 36 |
+
|
| 37 |
+
### Outputs
|
| 38 |
+
- **Predictions**: Task-specific numerical or categorical predictions
|
| 39 |
+
- **Format**: Dictionary with channel names and predicted values (if channels are configured), or raw tensor output
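
Because the output can be either a channel-name dictionary or a raw tensor, downstream code may need to handle both shapes. The following is a minimal sketch, not part of the MIST API: `to_records` is a hypothetical helper, and it assumes that a dictionary maps each channel name to one value per input molecule and that a raw output either exposes `.tolist()` (as PyTorch tensors and NumPy arrays do) or is already a nested list.

```python
from typing import Any


def to_records(smiles: list[str], results: Any) -> list[dict[str, Any]]:
    """Pair each SMILES string with its predictions (hypothetical helper)."""
    if isinstance(results, dict):
        # Assumed layout: {channel_name: one value per molecule}.
        columns = {
            name: (values.tolist() if hasattr(values, "tolist") else list(values))
            for name, values in results.items()
        }
        return [
            {"smiles": s, **{name: col[i] for name, col in columns.items()}}
            for i, s in enumerate(smiles)
        ]
    # Raw output: tensors and arrays expose .tolist(); plain lists pass through.
    rows = results.tolist() if hasattr(results, "tolist") else list(results)
    return [{"smiles": s, "prediction": row} for s, row in zip(smiles, rows)]
```

With this shape, each record can be written to a CSV file or a data frame regardless of whether channels are configured for the model.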
|
| 40 |
+
|
| 41 |
+
|
| 42 |
+
## Quick Start
|
| 43 |
+
|
| 44 |
+
Tutorials are available in Google Colab:
|
| 45 |
+
- [Inference](https://colab.research.google.com/github/BattModels/mist-demo/blob/main/tutorials/molecular_property_prediction.ipynb)
|
| 46 |
+
- [Finetuning](https://colab.research.google.com/github/BattModels/mist-demo/blob/main/tutorials/run_finetuning.ipynb)
|
| 47 |
+
|
| 48 |
+
### Running Locally
|
| 49 |
+
|
| 50 |
+
To run the model locally, create a virtual environment and install dependencies:
|
| 51 |
+
|
| 52 |
+
```bash
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
pip install -r requirements.txt
```
|
| 57 |
+
> **Note**: SMIRK tokenizers require Rust to be installed. See the [Rust installation guide](https://www.rust-lang.org/tools/install) for details.
|
| 58 |
+
|
| 59 |
+
|
| 60 |
+
Use the model!
|
| 61 |
+
For a full list of model IDs and properties, see the Provided Models section below.
|
| 62 |
+
For details on the input and output formats of each model variant, see its model card.
|
| 63 |
+
|
| 64 |
+
```python
from transformers import AutoModel
from smirk import SmirkTokenizerFast

# Load the model.
# Replace the {size}, {model_id}, and {property} placeholders with a name
# from the Provided Models tables, e.g. "mist-models/mist-28M-kcwb9le5-esol".
model = AutoModel.from_pretrained(
    "mist-models/mist-{size}-{model_id}-{property}",
    trust_remote_code=True,
)

# Make predictions
smiles_batch = [
    "CCO",          # Ethanol
    "CC(=O)O",      # Acetic acid
    "C1=CC=CC=C1",  # Benzene
]
results = model.predict(smiles_batch)
```
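
Most repository names listed in this README follow the `mist-{size}-{model_id}-{property}` pattern used in the placeholder above. As a small illustration, the hypothetical helper `repo_id` (not part of `transformers` or `smirk`) fills in that pattern; the `mist-models` organization prefix is taken from the links in this README. Some hosted models, such as the mixture models, use different names, so treat the result as a convenience rather than a guarantee.

```python
def repo_id(size: str, model_id: str, prop: str = "", org: str = "mist-models") -> str:
    """Build a Hugging Face repo ID such as 'mist-models/mist-28M-kcwb9le5-esol'."""
    parts = ["mist", size, model_id]
    if prop:  # pre-trained encoders have no property suffix
        parts.append(prop)
    return f"{org}/{'-'.join(parts)}"


print(repo_id("28M", "kcwb9le5", "esol"))  # mist-models/mist-28M-kcwb9le5-esol
print(repo_id("1.8B", "dh61satt"))         # mist-models/mist-1.8B-dh61satt
```

The returned string can be passed to `AutoModel.from_pretrained` together with `trust_remote_code=True`, as in the snippet above.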
|
| 82 |
+
|
| 83 |
+
## Provided Models
|
| 84 |
+
|
| 85 |
+
### Pre-trained
|
| 86 |
+
- [`mist-1.8B-dh61satt`](https://huggingface.co/mist-models/mist-1.8B-dh61satt): Flagship MIST model (MIST-1.8B)
|
| 87 |
+
- [`mist-28M-ti624ev1`](https://huggingface.co/mist-models/mist-28M-ti624ev1): Smaller MIST model (MIST-28M)
|
| 88 |
+
|
| 89 |
+
Below is a full list of finetuned variants hosted on Hugging Face:
|
| 90 |
+
### MoleculeNet Benchmark Models
|
| 91 |
+
|
| 92 |
+
| Folder | Encoder | Dataset |
| --- | :-------: | --- |
| [mist-1.8B-fbdn8e35-bbbp](https://huggingface.co/mist-models/mist-1.8B-fbdn8e35-bbbp) | MIST-1.8B | MoleculeNet BBBP |
| [mist-1.8B-1a4puhg2-hiv](https://huggingface.co/mist-models/mist-1.8B-1a4puhg2-hiv) | MIST-1.8B | MoleculeNet HIV |
| [mist-1.8B-m50jgolp-bace](https://huggingface.co/mist-models/mist-1.8B-m50jgolp-bace) | MIST-1.8B | MoleculeNet BACE |
| [mist-1.8B-uop1z0dc-tox21](https://huggingface.co/mist-models/mist-1.8B-uop1z0dc-tox21) | MIST-1.8B | MoleculeNet Tox21 |
| [mist-1.8B-lu1l5ieh-clintox](https://huggingface.co/mist-models/mist-1.8B-lu1l5ieh-clintox) | MIST-1.8B | MoleculeNet ClinTox |
| mist-1.8B-l1wfo7oa-sider * | MIST-1.8B | MoleculeNet SIDER |
| mist-1.8B-hxiygjsm-esol * | MIST-1.8B | MoleculeNet ESOL |
| [mist-1.8B-iwqj2cld-freesolv](https://huggingface.co/mist-models/mist-1.8B-iwqj2cld-freesolv) | MIST-1.8B | MoleculeNet FreeSolv |
| [mist-1.8B-jvt4azpz-lipo](https://huggingface.co/mist-models/mist-1.8B-jvt4azpz-lipo) | MIST-1.8B | MoleculeNet Lipophilicity |
| [mist-1.8B-8nd1ot5j-qm8](https://huggingface.co/mist-models/mist-1.8B-8nd1ot5j-qm8) | MIST-1.8B | MoleculeNet QM8 |
| [mist-28M-3xpfhv48-bbbp](https://huggingface.co/mist-models/mist-28M-3xpfhv48-bbbp) ** | MIST-28M | MoleculeNet BBBP |
| [mist-28M-8fh43gke-hiv](https://huggingface.co/mist-models/mist-28M-8fh43gke-hiv) ** | MIST-28M | MoleculeNet HIV |
| [mist-28M-8loj3bab-bace](https://huggingface.co/mist-models/mist-28M-8loj3bab-bace) ** | MIST-28M | MoleculeNet BACE |
| [mist-28M-kw4ks27p-tox21](https://huggingface.co/mist-models/mist-28M-kw4ks27p-tox21) ** | MIST-28M | MoleculeNet Tox21 |
| [mist-28M-97vfcykk-clintox](https://huggingface.co/mist-models/mist-28M-97vfcykk-clintox) ** | MIST-28M | MoleculeNet ClinTox |
| [mist-28M-z8qo16uy-sider](https://huggingface.co/mist-models/mist-28M-z8qo16uy-sider) ** | MIST-28M | MoleculeNet SIDER |
| [mist-28M-kcwb9le5-esol](https://huggingface.co/mist-models/mist-28M-kcwb9le5-esol) ** | MIST-28M | MoleculeNet ESOL |
| mist-28M-0uiq7o7m-freesolv * | MIST-28M | MoleculeNet FreeSolv |
| [mist-28M-xzr5ulva-lipo](https://huggingface.co/mist-models/mist-28M-xzr5ulva-lipo) ** | MIST-28M | MoleculeNet Lipophilicity |
| [mist-28M-gzwqzpcr-qm8](https://huggingface.co/mist-models/mist-28M-gzwqzpcr-qm8) ** | MIST-28M | MoleculeNet QM8 |
| [mist-26.9M-kkgx0omx-qm9](https://huggingface.co/mist-models/mist-26.9M-kkgx0omx-qm9) ** | MIST-28M | MoleculeNet QM9 |
| 115 |
+
|
| 116 |
+
|
| 117 |
+
`**` Indicates publicly released models.
|
| 118 |
+
`*` Indicates models currently not available on Hugging Face due to storage limits.
|
| 119 |
+
|
| 120 |
+
### QM9 Benchmark Models
|
| 121 |
+
Single-target models with the MIST-1.8B encoder are available for the following QM9 properties.
|
| 122 |
+
|
| 123 |
+
| Folder | Encoder | Target |
| --- | :-------: | --- |
| [mist-1.8B-ez05expv-mu](https://huggingface.co/mist-models/mist-1.8B-ez05expv-mu) | MIST-1.8B | μ - Dipole moment (unit: D) |
| mist-1.8B-rcwary93-alpha * | MIST-1.8B | α - Isotropic polarizability (unit: Bohr^3) |
| mist-1.8B-jmjosq12-homo * | MIST-1.8B | HOMO - Highest occupied molecular orbital energy (unit: Hartree) |
| mist-1.8B-n14wshc9-lumo * | MIST-1.8B | LUMO - Lowest unoccupied molecular orbital energy (unit: Hartree) |
| mist-1.8B-kayun6v3-gap * | MIST-1.8B | Gap - Gap between HOMO and LUMO (unit: Hartree) |
| mist-1.8B-xxe7t35e-r2 * | MIST-1.8B | \<R2\> - Electronic spatial extent (unit: Bohr^2) |
| [mist-1.8B-6nmcwyrp-zpve](https://huggingface.co/mist-models/mist-1.8B-6nmcwyrp-zpve) | MIST-1.8B | ZPVE - Zero point vibrational energy (unit: Hartree) |
| [mist-1.8B-a7akimjj-u0](https://huggingface.co/mist-models/mist-1.8B-a7akimjj-u0) | MIST-1.8B | U0 - Internal energy at 0K (unit: Hartree) |
| [mist-1.8B-85f24xkj-u298](https://huggingface.co/mist-models/mist-1.8B-85f24xkj-u298) | MIST-1.8B | U298 - Internal energy at 298.15K (unit: Hartree) |
| [mist-1.8B-3fbbz4is-h298](https://huggingface.co/mist-models/mist-1.8B-3fbbz4is-h298) | MIST-1.8B | H298 - Enthalpy at 298.15K (unit: Hartree) |
| [mist-1.8B-09sntn03-g298](https://huggingface.co/mist-models/mist-1.8B-09sntn03-g298) | MIST-1.8B | G298 - Free energy at 298.15K (unit: Hartree) |
| [mist-1.8B-j356b3nf-cv](https://huggingface.co/mist-models/mist-1.8B-j356b3nf-cv) | MIST-1.8B | Cv - Heat capacity at 298.15K (unit: cal/(mol*K)) |
| 137 |
+
|
| 138 |
+
`*` Indicates models currently not available on Hugging Face due to storage limits.
|
| 139 |
+
|
| 140 |
+
### Finetuned Single Task Models
|
| 141 |
+
|
| 142 |
+
Each of these models consists of a MIST encoder and a task network, finetuned on a single dataset from the applications demonstrated in the manuscript.
|
| 143 |
+
|
| 144 |
+
| Folder | Encoder | Dataset |
| --- | :------: | --- |
| [mist-26.9M-48kpooqf-odour](https://huggingface.co/mist-models/mist-26.9M-48kpooqf-odour) | MIST-28M | Olfaction |
| [mist-26.9M-6hk5coof-dn](https://huggingface.co/mist-models/mist-26.9M-6hk5coof-dn) | MIST-28M | Donor Number |
| [mist-26.9M-0vxdbm36-kt](https://huggingface.co/mist-models/mist-26.9M-0vxdbm36-kt) | MIST-28M | Kamlet-Taft Solvatochromic Parameters |
| [mist-26.9M-b302p09x-bp](https://huggingface.co/mist-models/mist-26.9M-b302p09x-bp) | MIST-28M | Boiling Point (Part of Characteristic Temperatures Dataset) |
| [mist-26.9M-cyuo2xb6-fp](https://huggingface.co/mist-models/mist-26.9M-cyuo2xb6-fp) | MIST-28M | Flash Point (Part of Characteristic Temperatures Dataset) |
| [mist-26.9M-y3ge5pf9-mp](https://huggingface.co/mist-models/mist-26.9M-y3ge5pf9-mp) | MIST-28M | Melting Point (Part of Characteristic Temperatures Dataset) |
| 152 |
+
|
| 153 |
+
### Finetuned Multi-Task Models
|
| 154 |
+
These are additional multi-target finetuned models consisting of a MIST encoder and task network.
|
| 155 |
+
|
| 156 |
+
| Folder | Encoder | Dataset |
| --- | :------: | --- |
| [mist-26.9M-kkgx0omx-qm9](https://huggingface.co/mist-models/mist-26.9M-kkgx0omx-qm9) | MIST-28M | QM9 Dataset with SMILES randomization |
| [mist-28M-ttqcvt6fs-toxcast](https://huggingface.co/mist-models/mist-28M-ttqcvt6fs-toxcast) | MIST-28M | ToxCast |
| [mist-28M-yr1urd2c-muv](https://huggingface.co/mist-models/mist-28M-yr1urd2c-muv) | MIST-28M | Maximum Unbiased Validation (MUV) |
| [mist-models/mist-28M-ggd8iisr-tmQM](https://huggingface.co/mist-models/mist-models/mist-28M-ggd8iisr-tmQM) ** | MIST-28M | QM properties of transition metal organometallics |
| 162 |
+
|
| 163 |
+
`**` Indicates publicly released models.
|
| 164 |
+
|
| 165 |
+
### Finetuned Mixture Models
|
| 166 |
+
|
| 167 |
+
These models consist of a MIST encoder and a physics-informed task network for mixture property prediction.
|
| 168 |
+
|
| 169 |
+
| Folder | Encoder | Dataset |
| --- | :------: | --- |
| [mist-conductivity-28M-2mpg8dcd](https://huggingface.co/mist-models/mist-conductivity-28M-2mpg8dcd) | MIST-28M | Ionic Conductivity |
| [mist-mixtures-zffffbex](https://huggingface.co/mist-models/mist-mixtures-zffffbex) | MIST-28M | Excess Density, Molar Volume and Molar Enthalpy |
| 173 |
+
|
| 174 |
+
## Citation
|
| 175 |
+
|
| 176 |
+
If you use this model in your research, please cite:
|
| 177 |
+
|
| 178 |
+
```bibtex
@online{MIST,
  title = {Foundation Models for Discovery and Exploration in Chemical Space},
  author = {Wadell, Alexius and Bhutani, Anoushka and Azumah, Victor and Ellis-Mohr, Austin R. and Kelly, Celia and Zhao, Hancheng and Nayak, Anuj K. and Hegazy, Kareem and Brace, Alexander and Lin, Hongyi and Emani, Murali and Vishwanath, Venkatram and Gering, Kevin and Alkan, Melisa and Gibbs, Tom and Wells, Jack and Varshney, Lav R. and Ramsundar, Bharath and Duraisamy, Karthik and Mahoney, Michael W. and Ramanathan, Arvind and Viswanathan, Venkatasubramanian},
  date = {2025-10-20},
  eprint = {2510.18900},
  eprinttype = {arXiv},
  eprintclass = {physics},
  doi = {10.48550/arXiv.2510.18900},
  url = {http://arxiv.org/abs/2510.18900},
}
```
|
| 190 |
+
|
| 191 |
+
## License and Notice
|
| 192 |
+
|
| 193 |
+
Model weights are provided as-is for research purposes only, without guarantees of correctness, fitness for purpose, or warranties of any kind.
|
| 194 |
+
|
| 195 |
+
**Restrictions:**
|
| 196 |
+
- Research use only
|
| 197 |
+
- No redistribution without permission
|
| 198 |
+
- No commercial use without licensing agreement
|
| 199 |
+
|
| 200 |
+
For questions, issues, or licensing inquiries, please contact [venkvis@umich.edu](mailto:venkvis@umich.edu).
|
| 201 |
+
|
| 202 |
+
<hr>
|