---
license: cc-by-nc-4.0
library_name: pytorch
tags:
- chemistry
- molecule-generation
- generative-model
- vae
- selfies
- rdkit
- drug-discovery
- electrolyte
- batteries
- cheminformatics
---

# MolForge: a conditional SELFIES-VAE for de-novo molecule & battery-electrolyte design

MolForge is a 42M-parameter **conditional variational autoencoder** over [SELFIES](https://github.com/aspuru-guzik-group/selfies) string representations of molecules. It learns a smooth 256-dimensional latent space you can **sample, traverse, and optimize**, and because it decodes SELFIES, close to **100% of generated strings are valid molecules**. It was built for de-novo design of drug-like molecules and, in particular, **battery-electrolyte solvents / additives** across chemistries (Li / Na / K / Mg / Zn / …).

- **Code / library:** https://github.com/NealKapadia/molforge
- **Weights (this repo):** `checkpoints/best.pt`
- **Architecture:** embedding 512 → bidirectional GRU encoder (1024 × 2 layers) → latent 256 → GRU decoder (1024 × 2 layers), conditioned on **11 RDKit descriptors**, with an auxiliary latent→property head. Decoder word-dropout 0.25 (Bowman et al.) for a meaningful latent. SELFIES robust alphabet, 79 tokens, max length 120.
- **Training data:** the Molport "All Stock" compound catalog (~6.1M drug-like molecules after filtering: heavy atoms 3–60, organic element set). Validity-critical fix: trained with the **default** SELFIES constraints (S=6 / P=5 allowed) so sulfonyl/phosphate electrolyte motifs round-trip.
- **Selected checkpoint:** `best.pt` = early-stopped epoch (~12M molecules seen), selected by `val_token_acc + 0.25·valid_rate`.

## Conditioning properties (fixed order)

`MolWt, MolLogP, TPSA, QED, NumHDonors, NumHAcceptors, NumRotatableBonds, NumAromaticRings, NumRings, FractionCSP3, HeavyAtomCount`

## Evaluation (best.pt, 5,000 samples @ temperature 0.9)

| Metric | Value |
|---|---|
| Validity | **1.000** |
| Uniqueness | 0.998 |
| Novelty (vs. training set) | 0.995 |
| Internal diversity | 0.894 |
| Reconstruction (exact) | 0.945 |
| Reconstruction (token acc) | 0.998 |

Latent→property head R² (held-out): MolWt 0.994, TPSA 0.977, MolLogP 0.962, NumHAcceptors / NumRotatableBonds 0.969, QED 0.926, NumHDonors 0.922.

> On the standard generative benchmark columns (validity / uniqueness / novelty /
> diversity) this model is competitive with, and on several columns exceeds, the
> autoregressive **ElectrolyteGPT** (Kim et al., *JACS Au*, 2026, 6, 2288–2302). The
> structural advantage is the **latent space**: smooth interpolation and gradient-based
> property optimization, which a left-to-right token model does not offer.

## Files

```
checkpoints/best.pt               # the SELFIES-VAE generator weights
checkpoints/electrolyte_model.pt  # optional: formulation property model (conductivity etc.)
processed/vocab.json              # SELFIES token vocabulary
processed/descriptor_stats.json   # descriptor normalization (mean/std)
processed/meta.json               # vocab size, max length, property order, constraints
```

This is exactly the layout the `molforge` library expects under `MOLVAE_ART_DIR`.

## Usage

```bash
pip install "git+https://github.com/NealKapadia/molforge.git"
```

```python
from huggingface_hub import snapshot_download
from molforge import MolForge

art = snapshot_download("NealKapadia/Molforge")   # downloads checkpoints/ + processed/
mf = MolForge(device="cpu", artifacts_dir=art)    # or device="cuda"

mf.generate(10)                                   # 10 valid, novel SMILES
mf.generate(5, spec={"MolWt": 250, "QED": 0.8})   # property-targeted
z = mf.encode("OCCN(CCO)CCO"); mf.decode(z)       # latent round-trip
mf.properties("CCO")                              # RDKit descriptors
```

Or set the path manually instead of `artifacts_dir=`: `export MOLVAE_ART_DIR=/path/to/download` (Windows: `$env:MOLVAE_ART_DIR="..."`).

## Limitations & intended use

- **Research / educational use** for molecular design and screening; **not** a substitute for experimental validation, synthesis feasibility, or safety assessment.
- **Soft conditioning:** `spec` targets nudge generation toward a value; they are not exact. For hard constraints, over-generate and filter by RDKit-computed properties.
- The base generator is **drug-like**; very small electrolyte molecules (EC/DEC/MeCN) are somewhat out-of-distribution for the base checkpoint.
- The electrolyte property model has labeled data for **Li / Na / K** only; the generator proposes candidates for any chemistry, but quantitative ranking beyond Li/Na/K needs additional labeled data.

## Training data & license

Trained on structures from the **Molport "All Stock"** catalog, which is licensed **CC BY-NC 4.0** (Attribution–NonCommercial). Because these weights are a derivative of that data, they are released under the **same CC BY-NC 4.0** license:

- **Attribution:** you must credit Molport as the source of the training data.
- **NonCommercial:** these weights may not be used for commercial purposes.

The MolForge *source code* (https://github.com/NealKapadia/molforge) is a separate work and is offered under its own (permissive) license; only the **weights** carry the CC BY-NC 4.0 restriction inherited from the training data.

## Citation

If you use MolForge, please cite this repository and the SELFIES paper (Krenn et al., *Mach. Learn.: Sci. Technol.* 2020).
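
## Example: hard constraints by over-generating and filtering

The Limitations section suggests over-generating and filtering when a property must fall inside a hard range, because `spec` targets are only soft. The sketch below is one possible way to do that, not part of the `molforge` library. The helper names `within_bounds` and `generate_filtered` are hypothetical. It assumes that the generator returns an iterable of SMILES strings and that the property function returns a dict keyed by the descriptor names listed under "Conditioning properties"; neither return type is confirmed by this card. Both are passed in as plain callables, so the block runs without the model installed.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical type alias: descriptor name -> (min, max), both inclusive.
Bounds = Dict[str, Tuple[float, float]]


def within_bounds(props: Dict[str, float], bounds: Bounds) -> bool:
    """Return True if every bounded descriptor is present and inside its range."""
    return all(
        name in props and lo <= props[name] <= hi
        for name, (lo, hi) in bounds.items()
    )


def generate_filtered(
    generate: Callable[[int], Iterable[str]],
    properties: Callable[[str], Dict[str, float]],
    bounds: Bounds,
    n_wanted: int,
    batch_size: int = 64,
    max_batches: int = 20,
) -> List[str]:
    """Sample in batches and keep unique SMILES whose properties pass `bounds`.

    Stops as soon as `n_wanted` molecules are kept or the batch budget is spent.
    """
    kept: List[str] = []
    seen = set()
    for _ in range(max_batches):
        for smi in generate(batch_size):
            if smi in seen:          # skip exact duplicate strings
                continue
            seen.add(smi)
            if within_bounds(properties(smi), bounds):
                kept.append(smi)
                if len(kept) == n_wanted:
                    return kept
    return kept  # may be shorter than n_wanted if the budget runs out
```

With the `mf` object from the Usage section, and under the assumptions above, a call might look like `generate_filtered(lambda n: mf.generate(n, spec={"MolWt": 250}), mf.properties, {"MolWt": (200, 300)}, n_wanted=20)`. Keeping the soft `spec` target inside the hard range should raise the acceptance rate, and the `max_batches` cap keeps the loop from running indefinitely when the range is rarely hit.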