anoushka2000 committed on
Commit 6b0b5a4 · verified · 1 Parent(s): 03eaca9

Update README.md

Files changed (1)
  1. README.md +175 -7
README.md CHANGED
@@ -1,10 +1,178 @@
  ---
- title: README
- emoji: 🏆
- colorFrom: green
- colorTo: yellow
- sdk: docker
- pinned: false
  ---

- Edit this `README.md` markdown file to author your organization card.
  ---
+ language: en
+ library_name: transformers
+ license: gpl-3.0
+ tags:
+ - mist
+ - chemistry
+ - molecular-property-prediction
  ---

+ # MIST: Molecular Insight SMILES Transformers
+
+ MIST is a family of molecular foundation models for molecular property prediction.
+ The models were pre-trained on SMILES strings from the [Enamine REAL Space](https://enamine.net/compound-collections/real-compounds/real-space-navigator) dataset using a masked language modeling (MLM) objective, then fine-tuned for downstream prediction tasks.
+
+ ## Model Details
+
+ - **Architecture**: Encoder-only transformer ([`RoBERTa-PreLayerNorm`](https://huggingface.co/docs/transformers/en/model_doc/roberta-prelayernorm))
+ - **Pre-training**: Masked language modeling on molecular SMILES
+ - **Tokenization**: [`Smirk`](https://eeg.engin.umich.edu/smirk/) tokenizer
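
To make the pre-training objective concrete, here is a minimal sketch of masked language modeling on a tokenized SMILES string. It is an illustration only, not the MIST training code: the `[MASK]` string, the masking probability, and the character-level split are assumptions (the README does not state them, and the Smirk tokenizer produces different tokens). Full BERT-style masking also sometimes substitutes random tokens, which this sketch omits.

```python
import random

MASK = "[MASK]"  # assumed mask token string, for illustration only


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels).

    labels[i] holds the original token where position i was masked,
    and None elsewhere, so a loss would only be computed on masked positions.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


# Toy character-level split of acetic acid; the real tokenizer differs.
tokens = list("CC(=O)O")
masked, labels = mask_tokens(tokens, mask_prob=0.3, seed=1)
```

During pre-training the model sees `masked` and is trained to recover the entries of `labels` that are not `None`.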
+
+
+ ### Quick Start
+
+ ```python
+ from transformers import AutoModel
+
+ # Load the model (replace "path/to/model" with the location of a MIST model)
+ model = AutoModel.from_pretrained(
+     "path/to/model",
+     trust_remote_code=True,  # loads the custom model code shipped with the checkpoint
+ )
+
+ # Make predictions
+ smiles_batch = [
+     "CCO",       # Ethanol
+     "CC(=O)O",   # Acetic acid
+     "c1ccccc1",  # Benzene
+ ]
+ results = model.predict(smiles_batch)
+ ```
+
+ ### Setting Up Your Environment
+
+ Create a virtual environment and install dependencies:
+
+ ```bash
+ python -m venv .venv
+ source .venv/bin/activate  # On Windows: .venv\Scripts\activate
+ pip install -r requirements.txt
+ ```
+
+ > **Note**: The Smirk tokenizer requires Rust to be installed. See the [Rust installation guide](https://www.rust-lang.org/tools/install) for details.
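
Since the note above makes Rust a prerequisite, a quick pre-flight check can save a failed install. The helper below is a hypothetical convenience, not part of the MIST repository; it only checks whether `rustc` and `cargo` are on `PATH`.

```python
import shutil


def rust_toolchain_available():
    """Return True if both `rustc` and `cargo` are found on PATH."""
    return shutil.which("rustc") is not None and shutil.which("cargo") is not None


if not rust_toolchain_available():
    print("Rust toolchain not found; install it before running pip install.")
```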
+
+
+ ## Model Inputs and Outputs
+
+ ### Inputs
+ - **SMILES strings**: Standard SMILES notation for molecular structures
+ - **Batch size**: Variable, automatically padded during inference
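
Padding is handled for you at inference time, but it may help to see what it means. The sketch below right-pads token ID sequences of different lengths to a common length and builds the matching attention mask. The token IDs and the `pad_id` value are made up for illustration, and this is not necessarily how MIST pads internally.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad lists of token IDs to a common length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return input_ids, attention_mask


# Hypothetical token IDs for three molecules of different lengths.
batch = [[5, 6, 7], [5, 6, 8, 9, 7, 7], [4]]
input_ids, attention_mask = pad_batch(batch)
```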
+
+ ### Outputs
+ - **Predictions**: Task-specific numerical or categorical predictions
+ - **Format**: Dictionary mapping channel names to predicted values (if channels are configured); otherwise a raw tensor
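
Because the output may be either a dictionary or a raw tensor, downstream code often wants one uniform shape. The sketch below pairs each input SMILES string with its predictions in both cases. It assumes that dictionary values are per-molecule sequences in batch order, which the README does not spell out; `to_rows` and the `"output"` default name are hypothetical.

```python
def _to_list(values):
    # torch.Tensor and numpy.ndarray both provide .tolist(); plain lists pass through.
    return values.tolist() if hasattr(values, "tolist") else list(values)


def to_rows(results, smiles, default_name="output"):
    """Pair each SMILES string with its predictions, for dict or tensor-like results."""
    if isinstance(results, dict):
        columns = {name: _to_list(values) for name, values in results.items()}
    else:
        columns = {default_name: _to_list(results)}
    return [
        {"smiles": smi, **{name: vals[i] for name, vals in columns.items()}}
        for i, smi in enumerate(smiles)
    ]
```

With this helper, `to_rows(model.predict(smiles_batch), smiles_batch)` would yield one record per molecule regardless of whether channels are configured.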
+
+
+ ## Provided Models
+
+ ### Pre-trained
+ - `mist-1.8B-dh61satt`: Flagship MIST model (MIST-1.8B)
+ - `mist-28M-ti624ev1`: Smaller MIST model (MIST-28M)
+
+ Below is a full list of fine-tuned variants hosted on Hugging Face:
+ ### MoleculeNet Benchmark Models
+
+ | Folder                      | Encoder   | Dataset                   |
+ | --------------------------- | :-------: | ------------------------- |
+ | mist-1.8B-fbdn8e35-bbbp     | MIST-1.8B | MoleculeNet BBBP          |
+ | mist-1.8B-1a4puhg2-hiv      | MIST-1.8B | MoleculeNet HIV           |
+ | mist-1.8B-m50jgolp-bace     | MIST-1.8B | MoleculeNet BACE          |
+ | mist-1.8B-uop1z0dc-tox21    | MIST-1.8B | MoleculeNet Tox21         |
+ | mist-1.8B-lu1l5ieh-clintox  | MIST-1.8B | MoleculeNet ClinTox       |
+ | mist-1.8B-l1wfo7oa-sider    | MIST-1.8B | MoleculeNet SIDER         |
+ | mist-1.8B-hxiygjsm-esol     | MIST-1.8B | MoleculeNet ESOL          |
+ | mist-1.8B-iwqj2cld-freesolv | MIST-1.8B | MoleculeNet FreeSolv      |
+ | mist-1.8B-jvt4azpz-lipo     | MIST-1.8B | MoleculeNet Lipophilicity |
+ | mist-1.8B-8nd1ot5j-qm8      | MIST-1.8B | MoleculeNet QM8           |
+ | mist-28M-3xpfhv48-bbbp      | MIST-28M  | MoleculeNet BBBP          |
+ | mist-28M-8fh43gke-hiv       | MIST-28M  | MoleculeNet HIV           |
+ | mist-28M-8loj3bab-bace      | MIST-28M  | MoleculeNet BACE          |
+ | mist-28M-kw4ks27p-tox21     | MIST-28M  | MoleculeNet Tox21         |
+ | mist-28M-97vfcykk-clintox   | MIST-28M  | MoleculeNet ClinTox       |
+ | mist-28M-z8qo16uy-sider     | MIST-28M  | MoleculeNet SIDER         |
+ | mist-28M-kcwb9le5-esol      | MIST-28M  | MoleculeNet ESOL          |
+ | mist-28M-0uiq7o7m-freesolv  | MIST-28M  | MoleculeNet FreeSolv      |
+ | mist-28M-xzr5ulva-lipo      | MIST-28M  | MoleculeNet Lipophilicity |
+ | mist-28M-gzwqzpcr-qm8       | MIST-28M  | MoleculeNet QM8           |
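
When scripting over several benchmarks, it can be convenient to look up a folder by encoder and task instead of copying names by hand. The mapping below is transcribed from the MoleculeNet table; the helper name `moleculenet_folder` is hypothetical, and how the folder maps to a loadable path or Hub ID is left open because the README does not state it.

```python
# Folder names transcribed from the MoleculeNet table in this README.
MOLECULENET_FOLDERS = {
    "MIST-1.8B": {
        "bbbp": "mist-1.8B-fbdn8e35-bbbp",
        "hiv": "mist-1.8B-1a4puhg2-hiv",
        "bace": "mist-1.8B-m50jgolp-bace",
        "tox21": "mist-1.8B-uop1z0dc-tox21",
        "clintox": "mist-1.8B-lu1l5ieh-clintox",
        "sider": "mist-1.8B-l1wfo7oa-sider",
        "esol": "mist-1.8B-hxiygjsm-esol",
        "freesolv": "mist-1.8B-iwqj2cld-freesolv",
        "lipo": "mist-1.8B-jvt4azpz-lipo",
        "qm8": "mist-1.8B-8nd1ot5j-qm8",
    },
    "MIST-28M": {
        "bbbp": "mist-28M-3xpfhv48-bbbp",
        "hiv": "mist-28M-8fh43gke-hiv",
        "bace": "mist-28M-8loj3bab-bace",
        "tox21": "mist-28M-kw4ks27p-tox21",
        "clintox": "mist-28M-97vfcykk-clintox",
        "sider": "mist-28M-z8qo16uy-sider",
        "esol": "mist-28M-kcwb9le5-esol",
        "freesolv": "mist-28M-0uiq7o7m-freesolv",
        "lipo": "mist-28M-xzr5ulva-lipo",
        "qm8": "mist-28M-gzwqzpcr-qm8",
    },
}


def moleculenet_folder(encoder, task):
    """Return the folder name for an encoder ("MIST-1.8B" or "MIST-28M") and task key."""
    try:
        return MOLECULENET_FOLDERS[encoder][task.lower()]
    except KeyError:
        raise KeyError(f"No model listed for encoder={encoder!r}, task={task!r}") from None


folder = moleculenet_folder("MIST-28M", "ESOL")
# The folder could then replace "model" in the Quick Start path, e.g.
# AutoModel.from_pretrained(f"path/to/{folder}", trust_remote_code=True)
```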
+
+
+ #### QM9 Benchmark Models
+ Single-target models (MIST-1.8B encoder) are available for the following QM9 properties.
+
+ | Folder                   | Encoder   | Target                                                             |
+ | ------------------------ | :-------: | ------------------------------------------------------------------ |
+ | mist-1.8B-ez05expv-mu    | MIST-1.8B | μ - Dipole moment (unit: D)                                        |
+ | mist-1.8B-rcwary93-alpha | MIST-1.8B | α - Isotropic polarizability (unit: Bohr^3)                        |
+ | mist-1.8B-jmjosq12-homo  | MIST-1.8B | HOMO - Highest occupied molecular orbital energy (unit: Hartree)   |
+ | mist-1.8B-n14wshc9-lumo  | MIST-1.8B | LUMO - Lowest unoccupied molecular orbital energy (unit: Hartree)  |
+ | mist-1.8B-kayun6v3-gap   | MIST-1.8B | Gap - Gap between HOMO and LUMO (unit: Hartree)                    |
+ | mist-1.8B-xxe7t35e-r2    | MIST-1.8B | \<R2\> - Electronic spatial extent (unit: Bohr^2)                  |
+ | mist-1.8B-6nmcwyrp-zpve  | MIST-1.8B | ZPVE - Zero point vibrational energy (unit: Hartree)               |
+ | mist-1.8B-a7akimjj-u0    | MIST-1.8B | U0 - Internal energy at 0 K (unit: Hartree)                        |
+ | mist-1.8B-85f24xkj-u298  | MIST-1.8B | U298 - Internal energy at 298.15 K (unit: Hartree)                 |
+ | mist-1.8B-3fbbz4is-h298  | MIST-1.8B | H298 - Enthalpy at 298.15 K (unit: Hartree)                        |
+ | mist-1.8B-09sntn03-g298  | MIST-1.8B | G298 - Free energy at 298.15 K (unit: Hartree)                     |
+ | mist-1.8B-j356b3nf-cv    | MIST-1.8B | Cv - Heat capacity at 298.15 K (unit: cal/(mol*K))                 |
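
Most of the energy targets above are reported in Hartree, while many chemistry workflows use eV or kcal/mol. The sketch below converts between them using CODATA 2018 constants; the function names and the example value are hypothetical and not taken from the models.

```python
HARTREE_TO_EV = 27.211386245988           # CODATA 2018
HARTREE_TO_KCAL_PER_MOL = 627.5094740631  # CODATA 2018, thermochemical calorie


def hartree_to_ev(energy_hartree):
    """Convert an energy from Hartree to electronvolts."""
    return energy_hartree * HARTREE_TO_EV


def hartree_to_kcal_per_mol(energy_hartree):
    """Convert an energy from Hartree to kcal/mol."""
    return energy_hartree * HARTREE_TO_KCAL_PER_MOL


gap_hartree = 0.25  # hypothetical HOMO-LUMO gap prediction, in Hartree
print(f"{hartree_to_ev(gap_hartree):.3f} eV")  # prints "6.803 eV"
```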
+
+ - `mist-ti624ev1-moleculenet`: Contains MoleculeNet benchmark MIST-28M models trained as part of doi:10.5281/zenodo.13761263
+
+
+ ### Fine-tuned Single-Task Models
+
+ Each model consists of a MIST encoder and a task network, fine-tuned on a single dataset used in the applications demonstrated in the manuscript.
+
+ | Folder                    | Encoder  | Dataset                                                     |
+ | ------------------------- | :------: | ----------------------------------------------------------- |
+ | mist-26.9M-48kpooqf-odour | MIST-28M | Olfaction                                                   |
+ | mist-26.9M-6hk5coof-dn    | MIST-28M | Donor Number                                                |
+ | mist-26.9M-0vxdbm36-kt    | MIST-28M | Kamlet-Taft Solvatochromic Parameters                       |
+ | mist-26.9M-b302p09x-bp    | MIST-28M | Boiling Point (Part of Characteristic Temperatures Dataset) |
+ | mist-26.9M-cyuo2xb6-fp    | MIST-28M | Flash Point (Part of Characteristic Temperatures Dataset)   |
+ | mist-26.9M-y3ge5pf9-mp    | MIST-28M | Melting Point (Part of Characteristic Temperatures Dataset) |
+
+ ### Fine-tuned Multi-Task Models
+ These are additional multi-target fine-tuned models, each consisting of a MIST encoder and a task network.
+ | Folder                     | Encoder  | Dataset                               |
+ | -------------------------- | :------: | ------------------------------------- |
+ | mist-26.9M-kkgx0omx-qm9    | MIST-28M | QM9 Dataset with SMILES randomization |
+ | mist-28M-ttqcvt6fs-toxcast | MIST-28M | ToxCast                               |
+ | mist-28M-yr1urd2c-muv      | MIST-28M | Maximum Unbiased Validation (MUV)     |
+
+ ### Fine-tuned Mixture Models
+
+ These models consist of a MIST encoder and a physics-informed task network for mixture property prediction.
+ | Folder                         | Encoder  | Dataset                                         |
+ | ------------------------------ | :------: | ----------------------------------------------- |
+ | mist-conductivity-28M-2mpg8dcd | MIST-28M | Ionic Conductivity                              |
+ | mist-mixtures-zffffbex         | MIST-28M | Excess Density, Molar Volume and Molar Enthalpy |
+
+ ## Citation
+
+ If you use this model in your research, please cite:
+
+ ```bibtex
+ @online{MIST,
+   title = {Foundation Models for Discovery and Exploration in Chemical Space},
+   author = {Wadell, Alexius and Bhutani, Anoushka and Azumah, Victor and Ellis-Mohr, Austin R. and Kelly, Celia and Zhao, Hancheng and Nayak, Anuj K. and Hegazy, Kareem and Brace, Alexander and Lin, Hongyi and Emani, Murali and Vishwanath, Venkatram and Gering, Kevin and Alkan, Melisa and Gibbs, Tom and Wells, Jack and Varshney, Lav R. and Ramsundar, Bharath and Duraisamy, Karthik and Mahoney, Michael W. and Ramanathan, Arvind and Viswanathan, Venkatasubramanian},
+   date = {2025-10-20},
+   eprint = {2510.18900},
+   eprinttype = {arXiv},
+   eprintclass = {physics},
+   doi = {10.48550/arXiv.2510.18900},
+   url = {http://arxiv.org/abs/2510.18900},
+ }
+ ```
+
+ ## License and Notice
+
+ Model weights are provided as-is for research purposes only, without guarantees of correctness, fitness for purpose, or warranties of any kind.
+
+ **Restrictions:**
+ - Research use only
+ - No redistribution without permission
+ - No commercial use without licensing agreement
+
+ For questions, issues, or licensing inquiries, please contact [venkvis@umich.edu](mailto:venkvis@umich.edu).
+
+ <hr>