Molecular Odor Predictor
A PyTorch MLP that predicts odor descriptors from a molecule's SMILES string using Morgan (ECFP4) fingerprints. Given any molecule whose SMILES string RDKit can parse, the model outputs a probability for each of 50 odor categories.
Model description
| Property | Detail |
|---|---|
| Architecture | MLP: 2048 → 512 → 256 → 50 |
| Regularisation | BatchNorm + Dropout (0.4) per hidden layer |
| Input | 2048-bit Morgan fingerprint (ECFP4, radius=2) |
| Output | 50-label multi-label probabilities (sigmoid applied to the output logits at inference) |
| Loss | BCEWithLogitsLoss |
| Optimiser | Adam, lr=1e-3 with ReduceLROnPlateau |
| Epochs | 80 (best checkpoint at epoch 22) |
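The Output and Loss rows fit together as follows: the network emits raw logits, BCEWithLogitsLoss consumes those logits during training, and a sigmoid is applied only at inference to turn each logit into a per-label probability. The sketch below illustrates that relationship for a single label in plain Python. The function names are hypothetical and this is not the training code used for the model.

```python
import math

def sigmoid(logit):
    """Map a raw logit to a probability in (0, 1) without overflowing."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    e = math.exp(logit)
    return e / (1.0 + e)

def bce_with_logits(logit, target):
    """Binary cross-entropy computed directly from a logit (stable form).

    Algebraically equal to -[t*log(p) + (1-t)*log(1-p)] with p = sigmoid(logit).
    """
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

def naive_bce(logit, target):
    """Same loss via an explicit sigmoid; breaks down for extreme logits."""
    p = sigmoid(logit)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))

# A confident, correct prediction gives a small loss; a confident, wrong one a large loss.
print(round(bce_with_logits(2.0, 1.0), 4), round(bce_with_logits(2.0, 0.0), 4))
```

For a multi-label model the per-label losses are averaged, which is why each of the 50 outputs can be read as an independent probability after the sigmoid.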
Performance
Evaluated on a held-out test set of 545 molecules:
| Metric | Score |
|---|---|
| Macro F1 | 0.421 |
| Micro F1 | 0.498 |
| Hamming loss | 0.080 |
A RandomForest baseline reaches a macro F1 of 0.374, so the MLP's 0.421 is a relative improvement of about 13%.
Top-performing labels: sulfurous (0.76), fruity (0.67), balsamic (0.61), floral (0.61), fatty (0.60)
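For readers less familiar with the multi-label metrics reported above, the sketch below computes macro F1, micro F1 and Hamming loss from binary label matrices in plain Python. It runs on a tiny made-up example, not the model's test set, and the function names are hypothetical. Libraries such as scikit-learn provide equivalent, better-tested implementations.

```python
def f1(tp, fp, fn):
    """F1 from counts; defined here as 0.0 when there are no positives at all."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def multilabel_metrics(y_true, y_pred):
    """y_true, y_pred: lists of equal-length 0/1 rows (samples x labels)."""
    n_labels = len(y_true[0])
    counts = [[0, 0, 0] for _ in range(n_labels)]  # [tp, fp, fn] per label
    wrong = 0
    for t_row, p_row in zip(y_true, y_pred):
        for j, (t, p) in enumerate(zip(t_row, p_row)):
            if t and p:
                counts[j][0] += 1
            elif p:
                counts[j][1] += 1
            elif t:
                counts[j][2] += 1
            if t != p:
                wrong += 1
    # Macro: average the per-label F1 scores, so every label counts equally.
    macro = sum(f1(*c) for c in counts) / n_labels
    # Micro: pool the counts over all labels first, so frequent labels dominate.
    micro = f1(*[sum(c[k] for c in counts) for k in range(3)])
    # Hamming loss: fraction of individual label decisions that are wrong.
    hamming = wrong / (len(y_true) * n_labels)
    return macro, micro, hamming

# Made-up labels for four samples and three labels.
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 1, 1]]
macro, micro, hamming = multilabel_metrics(y_true, y_pred)
print(f"macro F1={macro:.3f}  micro F1={micro:.3f}  Hamming loss={hamming:.3f}")
```

Because macro F1 weights every label equally, a macro score below the micro score (0.421 versus 0.498 here) is what one would expect if rarer labels are predicted less well than common ones.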
Labels (50)
fruity, green, sweet, floral, herbal, woody, fatty, fresh, waxy, spicy, citrus, sulfurous, tropical, oily, nutty, earthy, rose, balsamic, apple, vegetable, meaty, ethereal, roasted, caramellic, winey, pineapple, musty, pungent, creamy, cheesy, minty, phenolic, onion, burnt, powdery, berry, aldehydic, camphoreous, honey, pear, melon, fermented, buttery, metallic, leafy, savory, animal, alliaceous, cocoa, dairy
Usage
import json
import numpy as np
import torch
import torch.nn as nn
from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator
from huggingface_hub import hf_hub_download

REPO = "Hari5115/molecular-odor-predictor"

class OdorMLP(nn.Module):
    def __init__(self, n_inputs, n_outputs, dropout=0.4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_inputs, 512), nn.BatchNorm1d(512), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(512, 256), nn.BatchNorm1d(256), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(256, n_outputs),
        )

    def forward(self, x):
        # Returns raw logits; apply a sigmoid to get probabilities.
        return self.net(x)

# Load
labels = json.load(open(hf_hub_download(REPO, "labels.json")))
model = OdorMLP(2048, len(labels))
model.load_state_dict(torch.load(hf_hub_download(REPO, "best_model.pt"), map_location="cpu"))
model.eval()

# Predict
fp_gen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)
mol = Chem.MolFromSmiles("COc1cc(C=O)ccc1O")  # vanillin
if mol is None:  # MolFromSmiles returns None when the SMILES cannot be parsed
    raise ValueError("Could not parse SMILES string")
fp = torch.tensor([fp_gen.GetFingerprint(mol)], dtype=torch.float32)
with torch.no_grad():
    probs = torch.sigmoid(model(fp)).squeeze().numpy()

# Print every label with a probability of at least 30%, highest first
for label, prob in sorted(zip(labels, probs), key=lambda x: -x[1]):
    if prob >= 0.3:
        print(f"{label}: {prob:.0%}")
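The final loop above can be wrapped in a small helper so that the threshold is explicit and reusable. This is a sketch in plain Python: `select_labels` is a hypothetical name, the 0.3 default simply mirrors the snippet above, and the probabilities in the example are made up for illustration rather than real model output.

```python
def select_labels(labels, probs, threshold=0.3):
    """Return (label, probability) pairs at or above threshold, highest first."""
    if len(labels) != len(probs):
        raise ValueError("labels and probs must have the same length")
    pairs = [(label, float(p)) for label, p in zip(labels, probs) if p >= threshold]
    return sorted(pairs, key=lambda pair: -pair[1])

# Made-up probabilities, purely for illustration.
example = select_labels(["sweet", "woody", "creamy", "green"], [0.81, 0.12, 0.30, 0.45])
for label, prob in example:
    print(f"{label}: {prob:.0%}")
```

Lowering the threshold returns more descriptors at the cost of more false positives; the source does not say whether 0.3 was tuned, so treat it as a starting point.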
Training data
5,308 unique molecules from the GoodScents and Leffingwell databases, split 80/10/10 into train/val/test using stratified multi-label splitting (iterative_train_test_split).
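The function name `iterative_train_test_split` matches the one in scikit-multilearn, which implements iterative stratification. To show the idea without that dependency, the sketch below is a simplified greedy variant in plain Python: it places samples with the rarest labels first and sends each one to the part that is still most short of that label. It is an illustration under those assumptions, not the library's algorithm and not the code used to build this dataset; `greedy_multilabel_split` and the toy data are hypothetical.

```python
import random
from collections import Counter

def greedy_multilabel_split(label_sets, fractions=(0.8, 0.1, 0.1), seed=0):
    """Split sample indices into len(fractions) parts, balancing each label.

    label_sets: one set of label names per sample.
    Returns a list of index lists, one per fraction.
    """
    n = len(label_sets)
    label_counts = Counter(label for labels in label_sets for label in labels)
    # How many samples, and how many positives of each label, each part still wants.
    want_total = [f * n for f in fractions]
    want_label = [{label: f * c for label, c in label_counts.items()} for f in fractions]

    def rarest(i):
        # Least frequent label of sample i (None if the sample has no labels).
        return min(label_sets[i], key=lambda label: (label_counts[label], label), default=None)

    order = list(range(n))
    random.Random(seed).shuffle(order)
    # Place samples with the rarest labels first; unlabeled samples go last.
    order.sort(key=lambda i: label_counts.get(rarest(i), n + 1))

    parts = [[] for _ in fractions]
    for i in order:
        rare = rarest(i)
        # Prefer the part most short of this label, then of samples overall.
        best = max(
            range(len(fractions)),
            key=lambda k: (want_label[k].get(rare, 0.0), want_total[k]),
        )
        parts[best].append(i)
        want_total[best] -= 1
        for label in label_sets[i]:
            want_label[best][label] -= 1
    return parts

# Toy data: every sample is "common"; every tenth sample is also "rare".
toy_labels = [{"common", "rare"} if i % 10 == 0 else {"common"} for i in range(200)]
train_idx, val_idx, test_idx = greedy_multilabel_split(toy_labels)
print(len(train_idx), len(val_idx), len(test_idx))
```

A purely random 80/10/10 split could leave a rare descriptor with no positive examples in the validation or test set, which makes per-label metrics undefined; stratifying on labels avoids that.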
Data sources & credits
This model was trained on data from two public olfactory databases, accessed via the Pyrfume open-science library:
- The Good Scents Company (TGSC), GoodScents database: odor descriptors and molecular identifiers from goodscentscompany.com
- Leffingwell & Associates, Leffingwell Flavor & Fragrance database: odor descriptors from leffingwell.com
- Pyrfume: open-science library for standardising and publishing olfactory data. Mainland JD, et al. Pyrfume: A window to the world's olfactory data. github.com/pyrfume/pyrfume
Molecular SMILES strings sourced from PubChem via Pyrfume.
Limitations
- The model predicts odor descriptors (how people describe the smell), not physical odor intensity or threshold.
- Performance is lower on rare or subjective descriptors (buttery, powdery, metallic), which lack distinct molecular fingerprint patterns.
- Trained only on molecules with PubChem entries; novel or proprietary molecules may be out-of-distribution.
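
One common heuristic for gauging whether a query molecule is out-of-distribution is its maximum Tanimoto similarity to the training fingerprints: a low value suggests the prediction deserves less trust. The sketch below is an illustration in plain Python and is not part of this model's pipeline. Fingerprints are represented as sets of on-bit indices, `max_tanimoto` is a hypothetical helper, and any cut-off would have to be chosen empirically.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    union = len(bits_a | bits_b)
    return len(bits_a & bits_b) / union if union else 0.0

def max_tanimoto(query_bits, training_bits):
    """Highest similarity between the query and any training fingerprint."""
    return max((tanimoto(query_bits, ref) for ref in training_bits), default=0.0)

# Toy on-bit sets standing in for 2048-bit Morgan fingerprints.
training = [{1, 5, 9, 42}, {3, 5, 7}]
query = {1, 5, 9, 100}  # shares 3 of 5 distinct bits with the first entry
print(max_tanimoto(query, training))
```

With RDKit, the on-bit indices of a bit-vector fingerprint can be obtained with `GetOnBits()`, so the same check could be run on the fingerprints produced in the Usage section.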
Demo
Try the live demo: Hari5115/molecular-odor-demo