license: cc-by-nc-4.0
library_name: pytorch
tags:
  - chemistry
  - molecule-generation
  - generative-model
  - vae
  - selfies
  - rdkit
  - drug-discovery
  - electrolyte
  - batteries
  - cheminformatics

MolForge — a conditional SELFIES-VAE for de-novo molecule & battery-electrolyte design

MolForge is a 42M-parameter conditional variational autoencoder (VAE) over SELFIES string representations of molecules. It learns a smooth 256-dimensional latent space that you can sample, traverse, and optimize. Because it decodes SELFIES, close to 100% of generated strings are valid molecules. It was built for de-novo design of drug-like molecules and, in particular, of battery-electrolyte solvents and additives across chemistries (Li / Na / K / Mg / Zn / …).

Code / library: https://github.com/NealKapadia/molforge
Weights (this repo): checkpoints/best.pt
Architecture: embedding 512 → bidirectional GRU encoder (1024 × 2 layers) → latent 256 → GRU decoder (1024 × 2 layers), conditioned on 11 RDKit descriptors, with an auxiliary latent→property head. Decoder word-dropout of 0.25 (Bowman et al.) keeps the decoder from ignoring the latent code. Vocabulary: SELFIES robust alphabet, 79 tokens, max length 120.
Training data: the Molport "All Stock" compound catalog (~6.1M drug-like molecules after filtering to 3–60 heavy atoms and an organic element set). Training used the default SELFIES valence constraints (S up to 6 bonds, P up to 5), which is critical for validity: it lets sulfonyl and phosphate electrolyte motifs round-trip.
Selected checkpoint: best.pt is the early-stopped epoch (~12M molecules seen), selected by the score val_token_acc + 0.25·valid_rate.
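
Two of the training details above can be illustrated in plain Python. This is a minimal sketch, not the MolForge training code: the `<sos>` and `<unk>` token names, the choice to keep the first token, and the `history` format are assumptions made for illustration. Word dropout follows the idea in Bowman et al. of replacing decoder-input tokens with a placeholder; the score is the formula quoted above.

```python
import random


def word_dropout(tokens, unk_token, rate=0.25, rng=None):
    """Replace decoder-input tokens with `unk_token` with probability `rate`.

    The first token is kept unchanged (assumed to be a start-of-sequence marker).
    """
    rng = rng or random.Random()
    first = tokens[:1]
    rest = [unk_token if rng.random() < rate else tok for tok in tokens[1:]]
    return first + rest


def selection_score(val_token_acc, valid_rate, weight=0.25):
    """Checkpoint score quoted in the text: val_token_acc + 0.25 * valid_rate."""
    return val_token_acc + weight * valid_rate


def best_epoch(history):
    """Index of the epoch with the highest selection score.

    `history` is a list of (val_token_acc, valid_rate) tuples (hypothetical format).
    """
    scores = [selection_score(acc, rate) for acc, rate in history]
    return max(range(len(scores)), key=scores.__getitem__)
```

With a dropout rate of 0.25, roughly one in four decoder inputs is hidden during training, so the decoder has to rely on the latent vector rather than only on the previous tokens.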

Conditioning properties (fixed order)

MolWt, MolLogP, TPSA, QED, NumHDonors, NumHAcceptors, NumRotatableBonds, NumAromaticRings, NumRings, FractionCSP3, HeavyAtomCount
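
Because the order is fixed, a conditioning vector has to be assembled in exactly this sequence. The sketch below shows one way to do that with z-score normalization. It assumes a statistics mapping of the form `{name: {"mean": ..., "std": ...}}`; the actual layout of processed/descriptor_stats.json is not documented here, so treat that structure and the `condition_vector` helper as hypothetical.

```python
PROPERTY_ORDER = [
    "MolWt", "MolLogP", "TPSA", "QED", "NumHDonors", "NumHAcceptors",
    "NumRotatableBonds", "NumAromaticRings", "NumRings", "FractionCSP3",
    "HeavyAtomCount",
]


def condition_vector(props, stats, order=PROPERTY_ORDER):
    """Return z-scored descriptor values in the fixed property order.

    props: {name: raw value}
    stats: {name: {"mean": float, "std": float}}  (assumed layout)
    """
    missing = [name for name in order if name not in props]
    if missing:
        raise KeyError(f"missing descriptors: {missing}")
    return [(props[n] - stats[n]["mean"]) / stats[n]["std"] for n in order]
```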

Evaluation (best.pt, 5,000 samples @ temperature 0.9)

Metric	Value
Validity	1.000
Uniqueness	0.998
Novelty (vs. training set)	0.995
Internal diversity	0.894
Reconstruction (exact)	0.945
Reconstruction (token acc)	0.998
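
The exact metric definitions are not spelled out here, so the following is a sketch under commonly used conventions: uniqueness is the fraction of distinct molecules among valid samples, novelty is the fraction of distinct samples absent from the training set, and internal diversity is one minus the mean pairwise Tanimoto similarity. Inputs are assumed to be canonical strings and fingerprints represented as sets of on-bits; in practice both would come from RDKit.

```python
from itertools import combinations


def uniqueness(valid):
    """Fraction of distinct entries among valid (canonical) strings."""
    return len(set(valid)) / len(valid)


def novelty(valid, training_set):
    """Fraction of distinct generated strings that are not in the training set."""
    unique = set(valid)
    return len(unique - set(training_set)) / len(unique)


def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of fingerprint bits."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def internal_diversity(fingerprints):
    """One minus the mean pairwise Tanimoto similarity (needs >= 2 items)."""
    pairs = list(combinations(fingerprints, 2))
    return 1.0 - sum(tanimoto(a, b) for a, b in pairs) / len(pairs)
```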

Latent→property head R² (held-out): MolWt 0.994, TPSA 0.977, MolLogP 0.962, NumHAcceptors / NumRotatableBonds 0.969, QED 0.926, NumHDonors 0.922.
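
For reference, the sketch below computes R² assuming the usual coefficient of determination, 1 − SS_res / SS_tot; the evaluation script itself is not shown here, so this is an illustration of the metric rather than the code that produced the numbers above.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```

A value of 1.0 means perfect prediction, and 0.0 means the head does no better than always predicting the held-out mean.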

On the standard generative-benchmark metrics (validity, uniqueness, novelty, diversity), this model is competitive with the autoregressive ElectrolyteGPT (Kim et al., JACS Au, 2026, 6, 2288–2302) and exceeds it on several of them. Its structural advantage is the latent space, which supports smooth interpolation and gradient-based property optimization; a left-to-right token model offers neither.
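
To make the two latent-space operations concrete, here is a toy sketch on plain Python lists. It assumes latent vectors can be treated as flat lists of floats, and it uses a linear stand-in for the property head, whose gradient with respect to z is simply the weight vector. The real head is a neural network and would need autograd; `interpolate`, `linear_property`, and `ascend` are illustrative names, not part of the molforge API.

```python
def interpolate(z_start, z_end, steps):
    """Return `steps` evenly spaced latent points, including both endpoints."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    path = []
    for i in range(steps):
        t = i / (steps - 1)
        path.append([(1 - t) * a + t * b for a, b in zip(z_start, z_end)])
    return path


def linear_property(z, weights, bias=0.0):
    """Toy property head: w . z + b (stand-in for the learned head)."""
    return sum(w * x for w, x in zip(weights, z)) + bias


def ascend(z, weights, lr=0.1, steps=10):
    """Gradient ascent on the toy head; d(w . z + b)/dz equals w."""
    z = list(z)
    for _ in range(steps):
        z = [x + lr * w for x, w in zip(z, weights)]
    return z
```

Each point along an interpolation path, or each optimized vector, would then be passed through the decoder to obtain a candidate molecule.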

Files

checkpoints/best.pt              # the SELFIES-VAE generator weights
checkpoints/electrolyte_model.pt # optional: formulation property model (conductivity etc.)
processed/vocab.json             # SELFIES token vocabulary
processed/descriptor_stats.json  # descriptor normalization (mean/std)
processed/meta.json              # vocab size, max length, property order, constraints

This is exactly the layout the molforge library expects under MOLVAE_ART_DIR.
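
Before loading the model, it can be useful to confirm that a download matches this layout. The helper below is a small sketch using only the file names listed above; `missing_artifacts` is a hypothetical name, and the optional electrolyte model is deliberately not required.

```python
from pathlib import Path

REQUIRED_FILES = [
    "checkpoints/best.pt",
    "processed/vocab.json",
    "processed/descriptor_stats.json",
    "processed/meta.json",
]
OPTIONAL_FILES = ["checkpoints/electrolyte_model.pt"]


def missing_artifacts(art_dir):
    """Return the required files that are absent under `art_dir`."""
    root = Path(art_dir)
    return [rel for rel in REQUIRED_FILES if not (root / rel).is_file()]
```

An empty list means every required file is present.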

Usage

pip install "git+https://github.com/NealKapadia/molforge.git"

from huggingface_hub import snapshot_download
from molforge import MolForge

art = snapshot_download("NealKapadia/Molforge")   # downloads checkpoints/ + processed/
mf = MolForge(device="cpu", artifacts_dir=art)    # or device="cuda"

mf.generate(10)                                   # 10 valid, novel SMILES
mf.generate(5, spec={"MolWt": 250, "QED": 0.8})   # property-targeted
z = mf.encode("OCCN(CCO)CCO")                     # encode a SMILES string to a latent vector
mf.decode(z)                                      # latent round-trip
mf.properties("CCO")                              # RDKit descriptors

Alternatively, instead of passing artifacts_dir, set the MOLVAE_ART_DIR environment variable to the download location: export MOLVAE_ART_DIR=/path/to/download (Windows PowerShell: $env:MOLVAE_ART_DIR="...").
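
The same variable can also be set from Python. This sketch assumes only that the library reads MOLVAE_ART_DIR from the process environment; whether it does so at import time or when the model is constructed is not stated here, so the safest order is to call the helper before importing molforge. `use_artifacts_dir` is a hypothetical helper name.

```python
import os


def use_artifacts_dir(path):
    """Point MOLVAE_ART_DIR at `path` (made absolute) for this process."""
    resolved = os.path.abspath(os.path.expanduser(path))
    os.environ["MOLVAE_ART_DIR"] = resolved
    return resolved
```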

Limitations & intended use

Research and educational use for molecular design and screening; not a substitute for experimental validation, synthetic-feasibility analysis, or safety assessment.
Soft conditioning: spec targets nudge generation toward a value; they are not exact. For hard constraints, over-generate and filter by RDKit-computed properties.
The base generator is trained on drug-like molecules, so very small electrolyte molecules such as ethylene carbonate (EC), diethyl carbonate (DEC), and acetonitrile (MeCN) are somewhat out-of-distribution for the base checkpoint.
The electrolyte property model has labeled data for Li / Na / K only; the generator proposes candidates for any chemistry, but quantitative ranking beyond Li/Na/K needs additional labeled data.
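
The over-generate-and-filter approach suggested above for hard constraints can be sketched as follows. The generator and property functions are passed in as callables so the example runs on its own; `generate_within` is a hypothetical helper. Plugging in `mf.generate` and `mf.properties` would additionally assume that they return a list of SMILES strings and a name-to-value mapping respectively, which this README does not spell out.

```python
def generate_within(generate_fn, properties_fn, bounds, n_wanted,
                    batch_size=50, max_batches=20):
    """Sample in batches and keep molecules whose properties fall within bounds.

    generate_fn(n)     -> iterable of SMILES strings (assumed)
    properties_fn(smi) -> {property name: value} (assumed)
    bounds             -> {property name: (low, high)}, inclusive
    Returns up to `n_wanted` distinct molecules; fewer if the budget runs out.
    """
    kept, seen = [], set()
    for _ in range(max_batches):
        for smiles in generate_fn(batch_size):
            if smiles in seen:
                continue  # skip duplicates across batches
            seen.add(smiles)
            props = properties_fn(smiles)
            if all(low <= props[name] <= high
                   for name, (low, high) in bounds.items()):
                kept.append(smiles)
                if len(kept) == n_wanted:
                    return kept
    return kept
```

Soft `spec` targets and this filter combine naturally: conditioning raises the hit rate, and the filter enforces the actual limits.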

Training data & license

Trained on structures from the Molport "All Stock" catalog, which is licensed CC BY-NC 4.0 (Attribution–NonCommercial). Because these weights are a derivative of that data, they are released under the same CC BY-NC 4.0 license:

Attribution: you must credit Molport as the source of the training data.
NonCommercial: these weights may not be used for commercial purposes.

The MolForge source code (https://github.com/NealKapadia/molforge) is a separate work and is offered under its own (permissive) license; only the weights carry the CC BY-NC 4.0 restriction inherited from the training data.

Citation

If you use MolForge, please cite this repository and the SELFIES paper (Krenn et al., Mach. Learn.: Sci. Technol. 2020).