license: cc-by-nc-4.0
library_name: pytorch
tags:
- chemistry
- molecule-generation
- generative-model
- vae
- selfies
- rdkit
- drug-discovery
- electrolyte
- batteries
- cheminformatics
MolForge — a conditional SELFIES-VAE for de-novo molecule & battery-electrolyte design
MolForge is a 42M-parameter conditional variational autoencoder over SELFIES string representations of molecules. It learns a smooth 256-dimensional latent space you can sample, traverse, and optimize, and because it decodes SELFIES, close to 100% of generated strings are valid molecules. It was built for de-novo design of drug-like molecules and, in particular, battery-electrolyte solvents / additives across chemistries (Li / Na / K / Mg / Zn / …).
- Code / library: https://github.com/NealKapadia/molforge
- Weights (this repo): checkpoints/best.pt
- Architecture: embedding 512 → bidirectional GRU encoder (1024 × 2 layers) → latent 256 → GRU decoder (1024 × 2 layers), conditioned on 11 RDKit descriptors, with an auxiliary latent→property head. Decoder word-dropout 0.25 (Bowman et al.) for a meaningful latent. SELFIES robust alphabet, 79 tokens, max length 120.
- Training data: the Molport "All Stock" compound catalog (~6.1M drug-like molecules after filtering: heavy atoms 3–60, organic element set). Validity-critical fix: trained with the default SELFIES constraints (S=6 / P=5 allowed) so sulfonyl/phosphate electrolyte motifs round-trip.
- Selected checkpoint: best.pt = early-stopped epoch (~12M molecules seen), selected by val_token_acc + 0.25·valid_rate.
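The checkpoint selection rule can be sketched as a small scoring function. Only the formula (val_token_acc + 0.25·valid_rate) comes from this card; the `EpochMetrics` record, its field names, and the helper names are hypothetical and are not part of the molforge library.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class EpochMetrics:
    # Hypothetical per-epoch validation record (field names are illustrative).
    epoch: int
    val_token_acc: float  # fraction of validation tokens reconstructed correctly
    valid_rate: float     # fraction of sampled strings that decode to valid molecules


def selection_score(m: EpochMetrics, valid_weight: float = 0.25) -> float:
    """Score from the card: val_token_acc + 0.25 * valid_rate."""
    return m.val_token_acc + valid_weight * m.valid_rate


def pick_best(history: Sequence[EpochMetrics]) -> EpochMetrics:
    """Return the entry with the highest score (the first entry wins ties)."""
    return max(history, key=selection_score)
```

Weighting validity at 0.25 keeps reconstruction accuracy as the dominant term while still penalizing an epoch whose samples decode poorly.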
Conditioning properties (fixed order)
MolWt, MolLogP, TPSA, QED, NumHDonors, NumHAcceptors, NumRotatableBonds, NumAromaticRings, NumRings, FractionCSP3, HeavyAtomCount
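Because the order is fixed, a partial target such as {"MolWt": 250, "QED": 0.8} has to be expanded into an 11-element vector before it reaches the model. The sketch below shows one way to do that. It assumes descriptor_stats.json maps each property name to a mean and std, and that an unspecified property defaults to the training mean (a z-score of 0). Both are assumptions for illustration; the library's actual schema and fallback may differ.

```python
PROPERTY_ORDER = [
    "MolWt", "MolLogP", "TPSA", "QED", "NumHDonors", "NumHAcceptors",
    "NumRotatableBonds", "NumAromaticRings", "NumRings", "FractionCSP3",
    "HeavyAtomCount",
]


def build_condition(spec, stats):
    """Turn a partial spec dict into an 11-element z-scored vector in fixed order.

    `stats` is assumed to look like {"MolWt": {"mean": ..., "std": ...}, ...};
    the real descriptor_stats.json schema may differ.
    """
    unknown = set(spec) - set(PROPERTY_ORDER)
    if unknown:
        raise KeyError(f"unknown properties: {sorted(unknown)}")
    vector = []
    for name in PROPERTY_ORDER:
        if name in spec:
            mean, std = stats[name]["mean"], stats[name]["std"]
            vector.append((spec[name] - mean) / std)
        else:
            vector.append(0.0)  # assumption: unspecified target = training mean
    return vector
```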
Evaluation (best.pt, 5,000 samples @ temperature 0.9)
| Metric | Value |
|---|---|
| Validity | 1.000 |
| Uniqueness | 0.998 |
| Novelty (vs. training set) | 0.995 |
| Internal diversity | 0.894 |
| Reconstruction (exact) | 0.945 |
| Reconstruction (token acc) | 0.998 |
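The card does not spell out how the first three metrics are computed. A common convention, assumed here, is validity over all samples, uniqueness over valid samples, and novelty over unique samples. The function name and input format below are hypothetical, and canonicalization (for example with RDKit) is assumed to happen before this step.

```python
def generation_metrics(samples, training_set):
    """Compute validity, uniqueness, and novelty for a batch of samples.

    `samples` holds one canonical SMILES per generated string, or None when
    the string did not decode to a valid molecule. `training_set` is a
    collection of canonical SMILES from the training data.
    """
    valid = [s for s in samples if s is not None]
    unique = set(valid)
    novel = unique - set(training_set)
    return {
        "validity": len(valid) / len(samples) if samples else 0.0,
        "uniqueness": len(unique) / len(valid) if valid else 0.0,
        "novelty": len(novel) / len(unique) if unique else 0.0,
    }
```

Internal diversity is left out because it needs molecular fingerprints; it is typically reported as one minus the mean pairwise Tanimoto similarity.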
Latent→property head R² (held-out): MolWt 0.994, TPSA 0.977, MolLogP 0.962, NumHAcceptors / NumRotatableBonds 0.969, QED 0.926, NumHDonors 0.922.
On the standard generative benchmark columns (validity / uniqueness / novelty / diversity) this model is competitive with — and on several columns exceeds — the autoregressive ElectrolyteGPT (Kim et al., JACS Au, 2026, 6, 2288–2302). The structural advantage is the latent space: smooth interpolation and gradient-based property optimization, which a left-to-right token model does not offer.
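Interpolation in the latent space amounts to walking a path between two encoded molecules and decoding each point. The sketch below covers only the path construction and treats latent codes as flat sequences of floats. Linear interpolation is chosen for simplicity; it is not necessarily what the library uses, and the `interpolate` helper is hypothetical.

```python
def interpolate(z_start, z_end, steps):
    """Return `steps` evenly spaced points from z_start to z_end (inclusive)."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    if len(z_start) != len(z_end):
        raise ValueError("latent vectors must have the same length")
    path = []
    for i in range(steps):
        t = i / (steps - 1)  # t runs from 0.0 to 1.0
        path.append([(1 - t) * a + t * b for a, b in zip(z_start, z_end)])
    return path
```

With the API shown under Usage, the endpoints would come from mf.encode and each point would be passed to mf.decode. Whether those calls accept and return plain Python lists is an assumption, so a conversion step may be needed.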
Files
checkpoints/best.pt # the SELFIES-VAE generator weights
checkpoints/electrolyte_model.pt # optional: formulation property model (conductivity etc.)
processed/vocab.json # SELFIES token vocabulary
processed/descriptor_stats.json # descriptor normalization (mean/std)
processed/meta.json # vocab size, max length, property order, constraints
This is exactly the layout the molforge library expects under MOLVAE_ART_DIR.
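Before loading the model, it can help to confirm that a download matches this layout. The check below is a minimal sketch using only the standard library. The `missing_artifacts` helper is hypothetical, and it treats electrolyte_model.pt as optional, as the listing indicates.

```python
from pathlib import Path

# Files the listing marks as required (electrolyte_model.pt is optional).
REQUIRED_FILES = [
    "checkpoints/best.pt",
    "processed/vocab.json",
    "processed/descriptor_stats.json",
    "processed/meta.json",
]


def missing_artifacts(art_dir):
    """List required files (relative paths) that are absent under art_dir."""
    root = Path(art_dir)
    return [rel for rel in REQUIRED_FILES if not (root / rel).is_file()]
```

An empty return value means every required file was found.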
Usage
pip install "git+https://github.com/NealKapadia/molforge.git"
from huggingface_hub import snapshot_download
from molforge import MolForge
art = snapshot_download("NealKapadia/Molforge") # downloads checkpoints/ + processed/
mf = MolForge(device="cpu", artifacts_dir=art) # or device="cuda"
mf.generate(10) # 10 sampled SMILES (validity and novelty are high but not guaranteed)
mf.generate(5, spec={"MolWt": 250, "QED": 0.8}) # property-targeted
z = mf.encode("OCCN(CCO)CCO"); mf.decode(z) # latent round-trip
mf.properties("CCO") # RDKit descriptors
Or set the path through the MOLVAE_ART_DIR environment variable instead of passing artifacts_dir=:
export MOLVAE_ART_DIR=/path/to/download
(On Windows PowerShell: $env:MOLVAE_ART_DIR="...")
Limitations & intended use
- Research / educational use for molecular design and screening — not a substitute for experimental validation, synthesis feasibility, or safety assessment.
- Soft conditioning: spec targets nudge generation toward a value; they are not exact. For hard constraints, over-generate and filter by RDKit-computed properties.
- The base generator is drug-like; very small electrolyte molecules (EC/DEC/MeCN) are somewhat out-of-distribution for the base checkpoint.
- The electrolyte property model has labeled data for Li / Na / K only; the generator proposes candidates for any chemistry, but quantitative ranking beyond Li/Na/K needs additional labeled data.
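The over-generate-and-filter approach suggested for hard constraints can be sketched as a loop around two callables. The `generate_within` helper is hypothetical and not part of molforge. It assumes the generator returns an iterable of SMILES strings and that the property function returns a dict keyed by descriptor name. The return shape of mf.properties is not documented on this card.

```python
def generate_within(generate, properties, bounds, n_wanted, batch=64, max_batches=20):
    """Over-generate and keep molecules whose properties fall inside `bounds`.

    generate(n)     -> iterable of SMILES strings (e.g. a wrapper around mf.generate)
    properties(smi) -> dict of descriptor name -> value (assumed shape)
    bounds          -> {"MolWt": (200.0, 300.0), ...}, inclusive (low, high) ranges
    """
    if n_wanted <= 0:
        return []
    kept, seen = [], set()
    for _ in range(max_batches):
        for smi in generate(batch):
            if smi in seen:
                continue  # skip duplicates across batches
            seen.add(smi)
            props = properties(smi)
            if all(lo <= props[name] <= hi for name, (lo, hi) in bounds.items()):
                kept.append(smi)
                if len(kept) == n_wanted:
                    return kept
    return kept  # may be shorter than n_wanted if the budget runs out
```

Soft conditioning and hard filtering combine naturally: passing lambda n: mf.generate(n, spec={"MolWt": 250}) as the generator should raise the share of candidates that survive a MolWt filter, so fewer batches are needed.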
Training data & license
Trained on structures from the Molport "All Stock" catalog, which is licensed CC BY-NC 4.0 (Attribution–NonCommercial). Because these weights are a derivative of that data, they are released under the same CC BY-NC 4.0 license:
- Attribution: you must credit Molport as the source of the training data.
- NonCommercial: these weights may not be used for commercial purposes.
The MolForge source code (https://github.com/NealKapadia/molforge) is a separate work and is offered under its own (permissive) license; only the weights carry the CC BY-NC 4.0 restriction inherited from the training data.
Citation
If you use MolForge, please cite this repository and the SELFIES paper (Krenn et al., Mach. Learn.: Sci. Technol. 2020).