Update README.md
README.md
CHANGED
@@ -1,10 +1,178 @@
---
language: en
library_name: transformers
license: gpl-3.0
tags:
- mist
- chemistry
- molecular-property-prediction
---

# MIST: Molecular Insight SMILES Transformers

MIST is a family of molecular foundation models for molecular property prediction.
The models were pre-trained on SMILES strings from the [Enamine REAL Space](https://enamine.net/compound-collections/real-compounds/real-space-navigator) dataset using the Masked Language Modeling (MLM) objective, then fine-tuned for downstream prediction tasks.

## Model Details

- **Architecture**: Encoder-only transformer [``RoBERTa-PreLayerNorm``](https://huggingface.co/docs/transformers/en/model_doc/roberta-prelayernorm)
- **Pre-training**: Masked Language Modeling on molecular SMILES
- **Tokenization**: [``Smirk``](https://eeg.engin.umich.edu/smirk/) tokenizer

### Quick Start

```python
from transformers import AutoModel

# Load the model
model = AutoModel.from_pretrained(
    "path/to/model",
    trust_remote_code=True
)

# Make predictions
smiles_batch = [
    "CCO",       # Ethanol
    "CC(=O)O",   # Acetic acid
    "c1ccccc1"   # Benzene
]
results = model.predict(smiles_batch)
```

### Setting Up Your Environment

Create a virtual environment and install dependencies:

```bash
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
pip install -r requirements.txt
```

> **Note**: The Smirk tokenizer requires Rust to be installed. See the [Rust installation guide](https://www.rust-lang.org/tools/install) for details.
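
Because the Rust requirement is easy to miss until `pip install` fails, it can help to check for the toolchain first. The snippet below is a minimal sketch using only the Python standard library. The function name `rust_toolchain_available` is hypothetical, and the check assumes that having `rustc` and `cargo` on `PATH` is sufficient for the tokenizer to build, which this README does not state explicitly.

```python
import shutil


def rust_toolchain_available() -> bool:
    """Return True if both `rustc` and `cargo` are found on PATH."""
    return all(shutil.which(tool) is not None for tool in ("rustc", "cargo"))


if rust_toolchain_available():
    print("Rust toolchain found; dependencies should be able to build.")
else:
    print("Rust toolchain not found; install it before running pip install.")
```
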

## Model Inputs and Outputs

### Inputs
- **SMILES strings**: Standard SMILES notation for molecular structures
- **Batch size**: Variable, automatically padded during inference

### Outputs
- **Predictions**: Task-specific numerical or categorical predictions
- **Format**: Dictionary with channel names and predicted values (if channels are configured), or raw tensor output
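
Since the output can be either a channel-keyed dictionary or a raw tensor, downstream code usually has to branch on the type. The helper below is a minimal sketch and not part of the MIST API: `to_records` is a hypothetical name, the channel name `logS` is made up for illustration, and the code assumes that a dictionary maps each channel name to one value per input molecule and that a raw output holds one row per molecule. Both assumptions should be checked against what `model.predict` actually returns.

```python
from typing import Any, Dict, List, Sequence


def to_records(smiles: Sequence[str], results: Any) -> List[Dict[str, Any]]:
    """Pair each input SMILES string with its prediction(s)."""
    if isinstance(results, dict):
        # Assumed layout: {"channel name": [one value per molecule, ...]}
        return [
            {"smiles": s, **{name: values[i] for name, values in results.items()}}
            for i, s in enumerate(smiles)
        ]
    # Raw output: tensors and arrays expose .tolist(); plain lists pass through.
    rows = results.tolist() if hasattr(results, "tolist") else list(results)
    return [{"smiles": s, "prediction": row} for s, row in zip(smiles, rows)]


# Illustrative values only; these are not real model outputs.
print(to_records(["CCO", "c1ccccc1"], {"logS": [0.1, -1.2]}))
print(to_records(["CCO", "c1ccccc1"], [[0.5], [0.7]]))
```

Flattening to one record per molecule keeps each SMILES string next to its predictions, which makes the results straightforward to write out as CSV or JSON.
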

## Provided Models

### Pre-trained
- `mist-1.8B-dh61satt`: Flagship MIST model (MIST-1.8B)
- `mist-28M-ti624ev1`: Smaller MIST model (MIST-28M)

Below is a full list of fine-tuned variants hosted on Hugging Face:

### MoleculeNet Benchmark Models

| Folder                      | Encoder   | Dataset                   |
| --------------------------- | :-------: | ------------------------- |
| mist-1.8B-fbdn8e35-bbbp     | MIST-1.8B | MoleculeNet BBBP          |
| mist-1.8B-1a4puhg2-hiv      | MIST-1.8B | MoleculeNet HIV           |
| mist-1.8B-m50jgolp-bace     | MIST-1.8B | MoleculeNet BACE          |
| mist-1.8B-uop1z0dc-tox21    | MIST-1.8B | MoleculeNet Tox21         |
| mist-1.8B-lu1l5ieh-clintox  | MIST-1.8B | MoleculeNet ClinTox       |
| mist-1.8B-l1wfo7oa-sider    | MIST-1.8B | MoleculeNet SIDER         |
| mist-1.8B-hxiygjsm-esol     | MIST-1.8B | MoleculeNet ESOL          |
| mist-1.8B-iwqj2cld-freesolv | MIST-1.8B | MoleculeNet FreeSolv      |
| mist-1.8B-jvt4azpz-lipo     | MIST-1.8B | MoleculeNet Lipophilicity |
| mist-1.8B-8nd1ot5j-qm8      | MIST-1.8B | MoleculeNet QM8           |
| mist-28M-3xpfhv48-bbbp      | MIST-28M  | MoleculeNet BBBP          |
| mist-28M-8fh43gke-hiv       | MIST-28M  | MoleculeNet HIV           |
| mist-28M-8loj3bab-bace      | MIST-28M  | MoleculeNet BACE          |
| mist-28M-kw4ks27p-tox21     | MIST-28M  | MoleculeNet Tox21         |
| mist-28M-97vfcykk-clintox   | MIST-28M  | MoleculeNet ClinTox       |
| mist-28M-z8qo16uy-sider     | MIST-28M  | MoleculeNet SIDER         |
| mist-28M-kcwb9le5-esol      | MIST-28M  | MoleculeNet ESOL          |
| mist-28M-0uiq7o7m-freesolv  | MIST-28M  | MoleculeNet FreeSolv      |
| mist-28M-xzr5ulva-lipo      | MIST-28M  | MoleculeNet Lipophilicity |
| mist-28M-gzwqzpcr-qm8       | MIST-28M  | MoleculeNet QM8           |
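
The folder names in this table appear to follow the pattern `mist-<size>-<run id>-<task>`, and the pre-trained folders the same pattern without a task suffix. That pattern is inferred from the names listed here, not documented by the authors, and some other folders (for example `mist-mixtures-zffffbex`) do not follow it. Under that assumption, a small parser can be sketched as follows; `Checkpoint` and `parse_folder` are hypothetical names.

```python
import re
from typing import NamedTuple, Optional


class Checkpoint(NamedTuple):
    size: str            # e.g. "1.8B" or "28M"
    run_id: str          # e.g. "fbdn8e35"
    task: Optional[str]  # e.g. "bbbp"; None for pre-trained folders


# Assumed naming pattern: mist-<size>-<run id>[-<task>]
_PATTERN = re.compile(
    r"^mist-(?P<size>[0-9.]+[MB])-(?P<run_id>[a-z0-9]+)(?:-(?P<task>[a-z0-9]+))?$"
)


def parse_folder(name: str) -> Checkpoint:
    """Split a folder name into encoder size, run id, and optional task."""
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognized folder name: {name!r}")
    return Checkpoint(match["size"], match["run_id"], match["task"])


print(parse_folder("mist-1.8B-fbdn8e35-bbbp"))
print(parse_folder("mist-28M-ti624ev1"))
```

Raising `ValueError` on names that do not match makes it obvious when a folder falls outside the assumed pattern, instead of silently producing a wrong split.
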

### QM9 Benchmark Models
Single-target models with the MIST-1.8B encoder are available for each of the following QM9 properties.

| Folder                   | Encoder   | Target                                                            |
| ------------------------ | :-------: | ----------------------------------------------------------------- |
| mist-1.8B-ez05expv-mu    | MIST-1.8B | μ - Dipole moment (unit: D)                                       |
| mist-1.8B-rcwary93-alpha | MIST-1.8B | α - Isotropic polarizability (unit: Bohr^3)                       |
| mist-1.8B-jmjosq12-homo  | MIST-1.8B | HOMO - Highest occupied molecular orbital energy (unit: Hartree)  |
| mist-1.8B-n14wshc9-lumo  | MIST-1.8B | LUMO - Lowest unoccupied molecular orbital energy (unit: Hartree) |
| mist-1.8B-kayun6v3-gap   | MIST-1.8B | Gap - Gap between HOMO and LUMO (unit: Hartree)                   |
| mist-1.8B-xxe7t35e-r2    | MIST-1.8B | \<R2\> - Electronic spatial extent (unit: Bohr^2)                 |
| mist-1.8B-6nmcwyrp-zpve  | MIST-1.8B | ZPVE - Zero point vibrational energy (unit: Hartree)              |
| mist-1.8B-a7akimjj-u0    | MIST-1.8B | U0 - Internal energy at 0K (unit: Hartree)                        |
| mist-1.8B-85f24xkj-u298  | MIST-1.8B | U298 - Internal energy at 298.15K (unit: Hartree)                 |
| mist-1.8B-3fbbz4is-h298  | MIST-1.8B | H298 - Enthalpy at 298.15K (unit: Hartree)                        |
| mist-1.8B-09sntn03-g298  | MIST-1.8B | G298 - Free energy at 298.15K (unit: Hartree)                     |
| mist-1.8B-j356b3nf-cv    | MIST-1.8B | Cv - Heat capacity at 298.15K (unit: cal/(mol*K))                 |
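
Several of these targets are listed in Hartree, while orbital energies and gaps are often quoted in electronvolts. The conversion can be sketched as below, using the CODATA 2018 factor of 27.211386245988 eV per Hartree. The function name and the sample HOMO and LUMO values are hypothetical and are not model outputs.

```python
# CODATA 2018: 1 Hartree expressed in electronvolts.
HARTREE_TO_EV = 27.211386245988


def hartree_to_ev(energy_hartree: float) -> float:
    """Convert an energy from Hartree to electronvolts."""
    return energy_hartree * HARTREE_TO_EV


# Hypothetical values, for illustration only.
homo_hartree = -0.25
lumo_hartree = 0.05
gap_ev = hartree_to_ev(lumo_hartree - homo_hartree)
print(f"HOMO-LUMO gap: {gap_ev:.3f} eV")
```
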

- `mist-ti624ev1-moleculenet`: Contains MoleculeNet benchmark MIST-28M models trained as part of doi:10.5281/zenodo.13761263


### Finetuned Single Task Models

Each of these models consists of a MIST encoder and a task network fine-tuned on a single dataset. They are used in the applications demonstrated in the manuscript.

| Folder                    | Encoder  | Dataset                                                     |
| ------------------------- | :------: | ----------------------------------------------------------- |
| mist-26.9M-48kpooqf-odour | MIST-28M | Olfaction                                                   |
| mist-26.9M-6hk5coof-dn    | MIST-28M | Donor Number                                                |
| mist-26.9M-0vxdbm36-kt    | MIST-28M | Kamlet-Taft Solvatochromic Parameters                       |
| mist-26.9M-b302p09x-bp    | MIST-28M | Boiling Point (Part of Characteristic Temperatures Dataset) |
| mist-26.9M-cyuo2xb6-fp    | MIST-28M | Flash Point (Part of Characteristic Temperatures Dataset)   |
| mist-26.9M-y3ge5pf9-mp    | MIST-28M | Melting Point (Part of Characteristic Temperatures Dataset) |

### Finetuned Multi-Task Models
These are additional multi-target fine-tuned models, each consisting of a MIST encoder and a task network.

| Folder                     | Encoder  | Dataset                               |
| -------------------------- | :------: | ------------------------------------- |
| mist-26.9M-kkgx0omx-qm9    | MIST-28M | QM9 Dataset with SMILES randomization |
| mist-28M-ttqcvt6fs-toxcast | MIST-28M | ToxCast                               |
| mist-28M-yr1urd2c-muv      | MIST-28M | Maximum Unbiased Validation (MUV)     |

### Finetuned Mixture Models

These models consist of a MIST encoder and a physics-informed task network for mixture property prediction.

| Folder                         | Encoder  | Dataset                                         |
| ------------------------------ | :------: | ----------------------------------------------- |
| mist-conductivity-28M-2mpg8dcd | MIST-28M | Ionic Conductivity                              |
| mist-mixtures-zffffbex         | MIST-28M | Excess Density, Molar Volume and Molar Enthalpy |

## Citation

If you use this model in your research, please cite:

```bibtex
@online{MIST,
  title = {Foundation Models for Discovery and Exploration in Chemical Space},
  author = {Wadell, Alexius and Bhutani, Anoushka and Azumah, Victor and Ellis-Mohr, Austin R. and Kelly, Celia and Zhao, Hancheng and Nayak, Anuj K. and Hegazy, Kareem and Brace, Alexander and Lin, Hongyi and Emani, Murali and Vishwanath, Venkatram and Gering, Kevin and Alkan, Melisa and Gibbs, Tom and Wells, Jack and Varshney, Lav R. and Ramsundar, Bharath and Duraisamy, Karthik and Mahoney, Michael W. and Ramanathan, Arvind and Viswanathan, Venkatasubramanian},
  date = {2025-10-20},
  eprint = {2510.18900},
  eprinttype = {arXiv},
  eprintclass = {physics},
  doi = {10.48550/arXiv.2510.18900},
  url = {http://arxiv.org/abs/2510.18900},
}
```

## License and Notice

Model weights are provided as-is for research purposes only, without guarantees of correctness, fitness for purpose, or warranties of any kind.

**Restrictions:**
- Research use only
- No redistribution without permission
- No commercial use without licensing agreement

For questions, issues, or licensing inquiries, please contact [venkvis@umich.edu](mailto:venkvis@umich.edu).

<hr>