Chromophore Spectral Property Predictor (7 Properties)
A Chemprop v2 multi-component MPNN model that predicts 7 spectroscopic properties of organic chromophores from molecular structure (SMILES) and solvent.
Model Description
- Architecture: MulticomponentMPNN (depth=4, hidden=400, FFN=400, dropout=0.15)
- Parameters: 1.1M
- Framework: Chemprop 2.2.1 (PyTorch Lightning)
- Inputs: Chromophore SMILES + Solvent SMILES
- Training data: 20,502 chromophore-solvent pairs from the experimental database of Joung et al. (Scientific Data, 2020)
- Training: 100 epochs on NVIDIA A100 GPU, masked loss for missing values
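The masked loss mentioned above can be illustrated with a minimal sketch. It assumes a mean-squared-error criterion and NaN as the missing-value marker, neither of which is stated in this README, and the function name `masked_mse` is hypothetical. It is not Chemprop's internal implementation, only an illustration of the idea that missing labels contribute nothing to the loss.

```python
import math


def masked_mse(preds, targets):
    """Mean squared error over the entries whose target is present.

    preds, targets: lists of rows, one value per property.
    A NaN target marks a missing measurement (an assumption of this sketch).
    """
    total, count = 0.0, 0
    for pred_row, target_row in zip(preds, targets):
        for p, t in zip(pred_row, target_row):
            if math.isnan(t):
                continue  # missing label: skipped, so it adds no error and no gradient
            total += (p - t) ** 2
            count += 1
    return total / count if count else 0.0


nan = float("nan")
# Two samples, two properties; the first sample has no second-property label.
preds = [[400.0, 0.5], [450.0, 0.7]]
targets = [[410.0, nan], [450.0, 0.9]]
loss = masked_mse(preds, targets)  # averaged over the 3 labelled entries only
```

Because the average runs over labelled entries only, a chromophore-solvent pair with a single measured property still contributes to training for that property.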
Predicted Properties
| Property | MAE | R² | Test Samples |
|---|---|---|---|
| Absorption max (nm) | 15.40 | 0.9503 | 1,739 |
| Emission max (nm) | 18.31 | 0.9212 | 1,847 |
| Quantum yield | 0.1232 | 0.6754 | 1,377 |
| Absorption FWHM (cm⁻¹) | 462.2 | 0.8247 | 664 |
| Emission FWHM (cm⁻¹) | 383.6 | 0.7576 | 1,091 |
| Molar absorptivity, log(ε / mol⁻¹ dm³ cm⁻¹) | 0.1403 | 0.8582 | 817 |
| Lifetime (ns) | 4.84 | 0.0834 | 685 |
Note: Lifetime predictions have a test R² of 0.08, so the model explains almost none of the variance in lifetime, and these predictions should not be relied upon.
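For readers who want to reproduce metrics of this kind on their own data, the following is a minimal sketch. It assumes the standard definitions of MAE and the coefficient of determination and treats NaN as a missing measurement; this README does not state how the reported numbers were computed, and the function name `mae_and_r2` is hypothetical. Skipping missing targets is also why the sample count differs from property to property.

```python
import math


def mae_and_r2(y_true, y_pred):
    """Return (MAE, R², n) using only entries with a measured target."""
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if not math.isnan(t)]
    n = len(pairs)
    mean_t = sum(t for t, _ in pairs) / n
    mae = sum(abs(t - p) for t, p in pairs) / n
    ss_res = sum((t - p) ** 2 for t, p in pairs)    # residual sum of squares
    ss_tot = sum((t - mean_t) ** 2 for t, _ in pairs)  # spread around the mean
    r2 = 1.0 - ss_res / ss_tot
    return mae, r2, n


# Toy absorption maxima in nm; the third target is missing and is ignored.
y_true = [400.0, 500.0, float("nan"), 600.0]
y_pred = [410.0, 490.0, 550.0, 600.0]
mae, r2, n = mae_and_r2(y_true, y_pred)
```

Under this definition, an R² close to 0 means the predictions are about as accurate as always predicting the mean of the measured values.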
Usage
Installation
pip install "chemprop>=2.0"
Prediction
chemprop predict \
-i input.csv \
--model-paths best.pt \
-o predictions.csv \
-s Chromophore Solvent
Input CSV format
Chromophore,Solvent
CCN(CC)c1ccc2c(C)cc(=O)oc2c1,CCO
Nc1ccc2c(C(F)(F)F)cc(=O)oc2c1,CC#N
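An input file in this format can also be generated programmatically. The sketch below uses only the Python standard library; the helper name `write_input_csv` is hypothetical, and the two rows are the examples shown above.

```python
import csv

# (chromophore SMILES, solvent SMILES) pairs taken from the example above.
rows = [
    ("CCN(CC)c1ccc2c(C)cc(=O)oc2c1", "CCO"),
    ("Nc1ccc2c(C(F)(F)F)cc(=O)oc2c1", "CC#N"),
]


def write_input_csv(path, pairs):
    """Write a two-column CSV with the header expected by `-s Chromophore Solvent`."""
    with open(path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["Chromophore", "Solvent"])
        writer.writerows(pairs)


write_input_csv("input.csv", rows)
```

The column names must match the names passed to `-s` in the prediction command.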
Python API
from chemprop.models import MulticomponentMPNN

# Load the model file (hyperparameters + weights).
# Assumes best.pt was saved by Chemprop v2; the model is multi-component,
# so it is loaded with MulticomponentMPNN rather than MPNN.
model = MulticomponentMPNN.load_from_file("best.pt", map_location="cpu")
model.eval()  # switch off dropout for inference
Training Details
- Dataset: DB for chromophore_Sci_Data_rev03.csv (20,836 raw entries, 20,502 after cleaning)
- Data cleaning: Removed 314 duplicates, 17 entries with negative Stokes shift, 20 entries with invalid solvent
- Missing values: Handled via masked loss (Chemprop native support)
- Split: Random 80/10/10 (train: 16,401 / val: 2,050 / test: 2,051)
- Optimizer: Adam with warmup (5 epochs) + cosine decay
- Hardware: NVIDIA A100-SXM4-40GB, RAIDEN HPC Cluster (RIKEN)
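Two of the details above can be made concrete with a short sketch. The first function shows one rounding convention (truncating the train and validation counts and giving the remainder to the test set) that reproduces the reported split sizes; the splitter actually used may work differently. The second shows a linear warmup followed by cosine decay, stepped once per epoch. The peak and final learning rates (`max_lr`, `min_lr`), the linear warmup shape, and both function names are assumptions for illustration and are not given in this README.

```python
import math


def split_sizes(n, train_frac=0.8, val_frac=0.1):
    """Train/val/test counts: truncate train and val, remainder goes to test."""
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    return n_train, n_val, n - n_train - n_val


def lr_at_epoch(epoch, total_epochs=100, warmup_epochs=5, max_lr=1e-3, min_lr=0.0):
    """Learning rate for a 0-indexed epoch (max_lr and min_lr are assumed values)."""
    if epoch < warmup_epochs:
        # Linear warmup: reaches max_lr on the last warmup epoch.
        return max_lr * (epoch + 1) / warmup_epochs
    # Cosine decay from max_lr towards min_lr over the remaining epochs.
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


sizes = split_sizes(20502)
schedule = [lr_at_epoch(e) for e in range(100)]
```

Here `sizes` matches the train, validation, and test counts listed above, and `schedule` rises for the first 5 epochs and then decreases for the remaining 95.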
Limitations
- Quantum yield predictions have moderate accuracy (MAE=0.12, R²=0.68), likely due to measurement noise and the limited structural information available from SMILES
- Lifetime predictions are unreliable (R²<0.1); this property likely depends on excited-state information that SMILES alone does not provide, such as results from quantum mechanical calculations
- Best performance on common solvents (DCM, acetonitrile, toluene); rare solvents may have higher error
- Training data is from literature compilations with varying experimental conditions
Citation
If you use this model, please cite:
@article{joung2020experimental,
title={Experimental database of optical properties of organic compounds},
author={Joung, Joonyoung F and Han, Minhi and Jeong, Minseok and Park, Sungnam},
journal={Scientific Data},
volume={7},
pages={295},
year={2020}
}
@article{heid2024chemprop,
title={Chemprop: A Machine Learning Package for Chemical Property Prediction},
author={Heid, Esther and others},
journal={Journal of Chemical Information and Modeling},
volume={64},
number={1},
pages={9--17},
year={2024}
}