MIST: Molecular Insight SMILES Transformers

MIST is a family of foundation models for molecular property prediction. The models were pre-trained on SMILES strings from the Enamine REAL Space dataset using the Masked Language Modeling (MLM) objective, then fine-tuned for downstream prediction tasks. Further information is available in our pre-print on arXiv.

Model Details

Model Description

This fine-tuned MIST variant consists of the MIST-28M encoder fine-tuned on the Excess Property Dataset curated for MIST fine-tuning. Fine-tuned MIST models consist of the pretrained MIST encoder followed by a task network. The architecture of our mixture excess property prediction task network was informed by the chemical thermodynamic framing of excess properties.

To learn smooth, physically consistent excess curves, we propose fine-tuning the model to predict $P_{\mathrm{E}}$ at multiple fixed mole ratios $\vec{x}' = [x'_1, \dots, x'_n]$. We preserve permutation invariance by taking the sum of $g_{\mathrm{E}}(\vec{e}_{12}, \vec{x}')$ and $\mathbb{J} \cdot g_{\mathrm{E}}(\vec{e}_{21}, 1 - \vec{x}')$, where $g_{\mathrm{E}}$ is an MLP, $\mathbb{J}$ is the exchange matrix, and $\vec{e}_{12}$ and $\vec{e}_{21}$ are embedding vectors computed using a permutation-equivariant fusion operation. This fusion operation $\mathrm{fusion}(\vec{e}_1, \vec{e}_2)$ combines the single-molecule embedding vectors $\vec{e}_1$ and $\vec{e}_2$ from MIST. We use the predicted excess property values at the control points $P_{\mathrm{E}}(\vec{x}')$ to construct an interpolating polynomial and evaluate the excess property at the desired composition $x_1$. The proposed architecture can be summarized as follows:

$$
\begin{align*}
\vec{e}_{12},\ \vec{e}_{21} &= \mathrm{fusion}(\vec{e}_1, \vec{e}_2) \\
P_{\mathrm{E}}(\vec{x}') &= g_{\mathrm{E}}(\vec{e}_{12}, \vec{x}') + \mathbb{J} \cdot g_{\mathrm{E}}(\vec{e}_{21}, 1 - \vec{x}') \\
\vec{c} &= \mathbf{B}^{-1} P_{\mathrm{E}}(\vec{x}') \\
P_{\mathrm{E}}(x_1) &= \sum_{j=0}^{N} B_j(x_1) \cdot c_j,
\end{align*}
$$

where $\mathbf{B}$ is the basis matrix for a degree-$N$ polynomial and $c_j$ are the evaluated coefficients. As with the Redlich-Kister polynomial, we explicitly enforce $P_{\mathrm{E}} = 0$ when $x_1 = 0$ or $x_2 = 0$ when computing the interpolating polynomial. A second MLP predicts the pure-compound properties $P_i$, which are then used to compute the linear mixing component of the mixture property.
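As a concrete illustration, the symmetrization step can be sketched with toy stand-ins. Everything below is made up for the demonstration: the random MLP weights, the concatenation-based fusion, and the control grid are illustrative choices, not the model's actual components. The sketch checks numerically that swapping the two molecules while complementing the compositions only reverses the order of the predicted control-point values.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, n = 8, 16, 4  # toy embedding dim, hidden width, and number of control points

# Toy stand-in weights for the g_E MLP (the real task network is learned):
W1 = rng.normal(size=(2 * D + n, H))
W2 = rng.normal(size=(H, n))

def g_E(e, x):
    """Toy g_E: maps a fused embedding and the control compositions to one
    excess-property value per control point."""
    return np.tanh(np.concatenate([e, x]) @ W1) @ W2

def fusion(e1, e2):
    """One simple permutation-equivariant fusion (concatenation in both orders):
    fusion(e2, e1) returns the same two vectors with their roles swapped."""
    return np.concatenate([e1, e2]), np.concatenate([e2, e1])

J = np.fliplr(np.eye(n))  # exchange matrix: reverses the component order

def P_E(e1, e2, x_ctrl):
    """Symmetrized excess-property predictions at the control compositions."""
    e12, e21 = fusion(e1, e2)
    return g_E(e12, x_ctrl) + J @ g_E(e21, 1 - x_ctrl)

e1, e2 = rng.normal(size=D), rng.normal(size=D)
x_ctrl = np.array([0.2, 0.4, 0.6, 0.8])

# Swapping the molecules and complementing the compositions reverses, but does
# not change, the predicted control-point values:
assert np.allclose(P_E(e2, e1, x_ctrl), J @ P_E(e1, e2, 1 - x_ctrl))
```

Because this identity holds at the control points, the interpolated excess curve inherits the same invariance under relabeling of the two molecules.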

  • Developed by: Electrochemical Energy Group, University of Michigan, Ann Arbor.
  • Model type: Self-supervised pre-trained MIST encoder with supervised finetuning.
  • License: GPL 3.0 (GNU General Public License version 3)
  • Finetuned from model: mist-28M-ti624ev1

Getting Started

Setting Up Your Environment

Create a virtual environment and install dependencies:

python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
pip install -r requirements.txt

Note: SMIRK tokenizers require Rust to be installed. See the Rust installation guide for details.

Property Prediction

from transformers import AutoModel
from smirk import SmirkTokenizerFast

model = AutoModel.from_pretrained(
    "mist-models/mist-mixtures-zffffbex",
    trust_remote_code=True
)

# Make predictions for binary mixture excess properties
smiles_batch = {
    "smiles_list": [["CCO", "CCOC(=O)C"]],
    "composition": [[0.5, 0.5]],
    "temperature": [298.15],
}

# Returns the absolute, linear mixing, and excess terms
results = model.predict(smiles_batch)

Use and Restrictions

Model weights are provided as-is for research purposes only, without guarantees of correctness, fitness for purpose, or warranties of any kind.

  • Research use only
  • No redistribution without permission
  • No commercial use without licensing agreement

Training Details

Training Data

Pretraining We use the Enamine REAL Space dataset to pretrain MIST models. At the time of writing, Enamine REAL Space is the largest database of commercially available compounds. The dataset was constructed using forward synthetic analysis: experimentally validated building blocks were converted into synthons annotated with reactivity features. Enamine REAL Space was selected as the pretraining dataset because it was the largest database of molecular SMILES at the time of training, it is easily accessible for academic use, and molecules relevant to downstream tasks, such as drug candidates, electrolytes, and fragrances, live in synthetically accessible regions of chemical space.

Finetuning A dataset of temperature- and composition-dependent mixture excess properties (density, molar enthalpy, and molar volume) curated from the literature. The dataset contains 888,045 sparse observations spanning 715 molecules and 1,519 unique binary mixtures. See the excess-properties dataset for a detailed description.

Training Procedure

Inputs

  • Inputs: Binary mixtures and measurement temperature (in Kelvin) defined as follows:
{
    "smiles_list": [["CCO", "CCOC(=O)C"]],  # Binary mixture
    "composition": [[0.5, 0.5]],
    "temperature": [298.15],
}

where the floats in composition correspond to the mole fractions of the first and second molecules in smiles_list, respectively.

  • Outputs:
    • value: Density [gram / centimeter ** 3], molar volume [centimeter ** 3 / mole], molar enthalpy [joule / mole]
    • linear: Linear mixing contributions for all three targets above
    • excess: Excess contributions for all three targets above
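The excess-property framing implies that the absolute output is the sum of the linear mixing and excess contributions. The numbers below are made up for illustration (they are not model outputs), assuming the outputs satisfy value = linear + excess:

```python
# Made-up numbers illustrating value = linear + excess for density:
x1, x2 = 0.5, 0.5                # mole fractions of the two molecules
rho1, rho2 = 0.789, 0.902        # hypothetical pure-compound densities [g/cm^3]
linear = x1 * rho1 + x2 * rho2   # linear mixing contribution
excess = -0.004                  # hypothetical excess density [g/cm^3]
value = linear + excess          # absolute mixture density reported as `value`
```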

Evaluation

Testing Data

The dataset was split 80/10/10 using a random split.
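A generic sketch of such a split (the seed and the use of NumPy here are illustrative, not the actual training code):

```python
import numpy as np

rng = np.random.default_rng(42)      # illustrative seed
n = 888_045                          # observations in the fine-tuning dataset
idx = rng.permutation(n)             # shuffle indices once
n_train, n_val = int(0.8 * n), int(0.1 * n)
train = idx[:n_train]                # 80% for training
val = idx[n_train:n_train + n_val]   # 10% for validation
test = idx[n_train + n_val:]         # remaining ~10% for testing
```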

Metrics

MAE (Mean Absolute Error)
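For reference, MAE is the mean of the absolute residuals; a minimal example on toy values:

```python
import numpy as np

# Mean Absolute Error on toy targets and predictions:
y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.1, 1.9, 3.3])
mae = np.abs(y_true - y_pred).mean()   # (0.1 + 0.1 + 0.3) / 3
```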

Technical Specifications

Model Architecture and Objective

  • Encoder: RoBERTa-PreLayerNorm encoder with 8 layers, a hidden size of 512, an intermediate size of 2048, 8 attention heads, and a maximum sequence length of 2048.
  • Task Network: Two-layer MLP (Multi-layer perceptron)
  • Objective
    • Pretraining: MLM (Masked Language Modeling)
    • Fine-tuning: Multi-channel regression
  • Loss:
    • Pretraining: Cross-Entropy Loss
    • Fine-tuning: Mean Squared Error (MSE), summed over the absolute and excess properties
  • Optimizer:
    • Pretraining: deepspeed.ops.lamb.FusedLAMB
    • Fine-tuning: torch.optim.AdamW
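A rough back-of-envelope check that the encoder configuration above is consistent with the ~28M in the model name. This counts transformer weight matrices only; embeddings, biases, and layer norms are ignored, and the tokenizer vocabulary size is not stated here, so the total is an underestimate.

```python
d_model, d_ffn, n_layers = 512, 2048, 8

attn = 4 * d_model * d_model     # Q, K, V, and output projections
ffn = 2 * d_model * d_ffn        # feed-forward up- and down-projections
per_layer = attn + ffn           # ~3.15M weights per encoder layer
total = n_layers * per_layer     # ~25.2M, close to the ~28M model name
```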

Compute Infrastructure

Hardware

This model was pre-trained on 2 NVIDIA A100-SXM4-80GB GPUs in 12 hours 15 minutes and fine-tuned on 1 NVIDIA A100 GPU.

Software

This model was trained with PyTorch Lightning using the DeepSpeed strategy for distributed data parallelism. Models are exported in the Safetensors format.

Citation

If you use this model in your research, please cite:

@online{MIST,
  title = {Foundation Models for Discovery and Exploration in Chemical Space},
  author = {Wadell, Alexius and Bhutani, Anoushka and Azumah, Victor and Ellis-Mohr, Austin R. and Kelly, Celia and Zhao, Hancheng and Nayak, Anuj K. and Hegazy, Kareem and Brace, Alexander and Lin, Hongyi and Emani, Murali and Vishwanath, Venkatram and Gering, Kevin and Alkan, Melisa and Gibbs, Tom and Wells, Jack and Varshney, Lav R. and Ramsundar, Bharath and Duraisamy, Karthik and Mahoney, Michael W. and Ramanathan, Arvind and Viswanathan, Venkatasubramanian},
  date = {2025-10-20},
  eprint = {2510.18900},
  eprinttype = {arXiv},
  eprintclass = {physics},
  doi = {10.48550/arXiv.2510.18900},
  url = {http://arxiv.org/abs/2510.18900},
}

Model Card Authors

Anoushka Bhutani, Alexius Wadell

Model Card Contact

For questions, issues, or licensing inquiries, please contact Venkat Viswanathan venkvis@umich.edu.

