---
license: cc-by-nc-4.0
library_name: pytorch
tags:
- chemistry
- molecule-generation
- generative-model
- vae
- selfies
- rdkit
- drug-discovery
- electrolyte
- batteries
- cheminformatics
---

# MolForge: a conditional SELFIES-VAE for de-novo molecule & battery-electrolyte design

MolForge is a **conditional variational autoencoder** over
[SELFIES](https://github.com/aspuru-guzik-group/selfies) representations of molecules, with
about 42 million parameters (41,966,682), **trained on 7,116,053 molecules curated from five
public chemistry databases** (Molport, ChEMBL, and ZINC for broad chemical coverage, plus
electrolyte data from OEDB and CALiSol-23). It learns a smooth 256-dimensional latent space you
can **sample, traverse, and optimize**, and because it decodes SELFIES, **essentially 100% of
generated strings are valid molecules** (measured validity 1.000 on 5,000 samples). It is
purpose-built for **de-novo battery-electrolyte design**: generating candidate solvents and
additives across chemistries (Li / Na / K / Mg / Zn / …) and ranking them with a paired
electrolyte property model grounded in real electrolyte data (labeled data currently cover
Li / Na / K only).

- **Code / library:** https://github.com/NealKapadia/molforge
- **Weights (this repo):** `checkpoints/best.pt`
- **Architecture:** embedding 512 → bidirectional GRU encoder (2 layers, hidden size 1024) →
  latent 256 → GRU decoder (2 layers, hidden size 1024), conditioned on **11 RDKit descriptors**,
  with an auxiliary latent→property head. Decoder word-dropout of 0.25 (Bowman et al.) keeps the
  latent code informative. SELFIES robust alphabet, 79 tokens, max length 120.
- **Training data: 7,116,053 molecules from five public databases** (filtered to 3–60 heavy
  atoms and an organic element set, then de-duplicated):

| Database | Molecules | Role |
|---|---|---|
| Molport "All Stock" | 6,088,143 | core corpus of purchasable molecules |
| ChEMBL-37 (sample) | 800,000 | bioactive chemical diversity |
| ZINC | 227,902 | additional lead-like diversity |
| OEDB + CALiSol-23 (solvents) | 8 | electrolyte solvents in the generator |
| **Total** | **7,116,053** | generative training set |

OEDB and CALiSol-23 additionally provide the electrolyte solvents and **18,918 electrolyte
formulations** (conductivity, coordination, viscosity) that train the separate property model.
The generator was trained with the **default** SELFIES constraints (S valence up to 6 and
P valence up to 5 allowed) so that sulfonyl and phosphate electrolyte motifs round-trip.

- **Selected checkpoint:** `best.pt`, selected by the score `val_token_acc + 0.25·valid_rate`.
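
As a sanity check on the stated parameter count, the arithmetic below tallies one layer layout that is consistent with the sizes listed above. The wiring is an assumption and is not taken from the `molforge` source: PyTorch-style GRU parameter shapes, mean and log-variance heads that read the concatenated final forward and backward encoder states, a decoder that receives the token embedding concatenated with the latent vector and the 11 descriptors at every step, a linear map from latent plus descriptors to the decoder's initial hidden state, and a property head with one hidden layer of width 256. Under those assumptions the total comes out to 41,966,682, although other layouts could give the same number.

```python
# Sizes quoted in the model card.
VOCAB, EMB, HID, LATENT, N_PROPS, LAYERS = 79, 512, 1024, 256, 11, 2


def linear(n_in, n_out):
    """Weights plus biases of a dense layer."""
    return n_in * n_out + n_out


def gru_layer(n_in, hidden):
    """One GRU layer in one direction (PyTorch convention):
    3 gates, input and recurrent weight matrices, two bias vectors."""
    return 3 * hidden * (n_in + hidden) + 2 * 3 * hidden


def count_parameters():
    """Per-component parameter counts for the ASSUMED layout described above."""
    return {
        "embedding": VOCAB * EMB,
        # Bidirectional: two directions per layer; layer 2 sees both directions of layer 1.
        "encoder": 2 * gru_layer(EMB, HID) + 2 * gru_layer(2 * HID, HID),
        # Mean and log-variance heads on the concatenated final states (assumed).
        "mu_logvar": 2 * linear(2 * HID, LATENT),
        # Latent + descriptors -> initial hidden state for both decoder layers (assumed).
        "latent_to_h0": linear(LATENT + N_PROPS, LAYERS * HID),
        # Decoder input = embedding + latent + descriptors at every step (assumed).
        "decoder": gru_layer(EMB + LATENT + N_PROPS, HID) + gru_layer(HID, HID),
        "output": linear(HID, VOCAB),
        # Property head: latent -> 256 -> 11 descriptors (assumed width).
        "property_head": linear(LATENT, 256) + linear(256, N_PROPS),
    }


parts = count_parameters()
total = sum(parts.values())
print(total)  # 41966682
```

The encoder accounts for roughly two thirds of the total under this layout, which is typical for a two-layer bidirectional recurrent encoder of this width.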
## Conditioning properties (fixed order)

`MolWt, MolLogP, TPSA, QED, NumHDonors, NumHAcceptors, NumRotatableBonds, NumAromaticRings, NumRings, FractionCSP3, HeavyAtomCount`
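
The order is fixed because the descriptors reach the model as a vector, so each position has to mean the same thing at training and at sampling time. A minimal sketch of turning a partial `spec` into a z-scored vector follows. It assumes that `descriptor_stats.json` holds a mean and standard deviation per descriptor and that unspecified descriptors fall back to the training mean (a normalized value of 0). Both points are assumptions about the library, and the statistics in the example are placeholders, not the shipped values.

```python
PROPERTY_ORDER = [
    "MolWt", "MolLogP", "TPSA", "QED", "NumHDonors", "NumHAcceptors",
    "NumRotatableBonds", "NumAromaticRings", "NumRings", "FractionCSP3",
    "HeavyAtomCount",
]


def condition_vector(spec, stats, order=PROPERTY_ORDER):
    """Build a z-scored conditioning vector in the fixed property order.

    spec  : dict of descriptor name -> target value (may be partial)
    stats : dict of descriptor name -> {"mean": float, "std": float}
            (hypothetical layout for descriptor_stats.json)
    """
    unknown = set(spec) - set(order)
    if unknown:
        raise KeyError(f"unknown descriptors: {sorted(unknown)}")
    vector = []
    for name in order:
        if name in spec:
            vector.append((spec[name] - stats[name]["mean"]) / stats[name]["std"])
        else:
            vector.append(0.0)  # assumed fallback: the training mean
    return vector


# Placeholder statistics for illustration only.
stats = {name: {"mean": 0.0, "std": 1.0} for name in PROPERTY_ORDER}
stats["MolWt"] = {"mean": 300.0, "std": 100.0}
stats["QED"] = {"mean": 0.5, "std": 0.2}

vec = condition_vector({"MolWt": 250, "QED": 0.8}, stats)
print(len(vec), vec[0])  # 11 -0.5
```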
## Evaluation (`best.pt`, 5,000 samples @ temperature 0.9)

| Metric | Value |
|---|---|
| Validity | **1.000** |
| Uniqueness | 0.998 |
| Novelty (vs. training set) | 0.995 |
| Internal diversity | 0.894 |
| Reconstruction (exact) | 0.945 |
| Reconstruction (token acc) | 0.998 |

Latent→property head R² (held-out): MolWt 0.994, TPSA 0.977, MolLogP 0.962,
NumHAcceptors / NumRotatableBonds 0.969, QED 0.926, NumHDonors 0.922.
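
The card does not spell out how the first three table metrics are defined. The sketch below uses the common MOSES-style conventions as an assumption: validity over all samples, uniqueness over valid samples, and novelty over unique valid samples. The `is_valid` callback is a hypothetical stand-in for an RDKit parse, and the strings are assumed to be canonicalized already.

```python
def generation_metrics(samples, training_set, is_valid):
    """Validity, uniqueness, and novelty under assumed MOSES-style definitions.

    samples      : list of generated (already canonical) strings
    training_set : iterable of canonical training strings
    is_valid     : callable returning True for a parseable molecule (hypothetical)
    """
    valid = [s for s in samples if is_valid(s)]
    unique = set(valid)
    novel = unique - set(training_set)
    return {
        "validity": len(valid) / len(samples) if samples else 0.0,
        "uniqueness": len(unique) / len(valid) if valid else 0.0,
        "novelty": len(novel) / len(unique) if unique else 0.0,
    }


# Toy data: the empty string plays the role of an invalid sample.
samples = ["CCO", "CCO", "CCN", "c1ccccc1", ""]
metrics = generation_metrics(samples, training_set={"CCO"}, is_valid=bool)
print(metrics["validity"], metrics["uniqueness"])  # 0.8 0.75
```

Internal diversity is left out because it depends on a fingerprint and similarity choice that the card does not state.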
> On the standard generative benchmark columns (validity / uniqueness / novelty /
> diversity) this model is competitive with the autoregressive **ElectrolyteGPT**
> (Kim et al., *JACS Au*, 2026, 6, 2288–2302), and exceeds it on several columns. The
> structural advantage is the **latent space**: smooth interpolation and gradient-based
> property optimization, which a left-to-right token model does not offer.
## How MolForge differs from existing models

- **A latent space, not left-to-right text generation.** Autoregressive models (ElectrolyteGPT,
  MolGPT) emit one token at a time. MolForge's VAE provides a continuous latent space you can
  **interpolate** and **optimize with gradients** (e.g. "increase molecular weight by 10 while
  keeping everything else fixed"), which a token-by-token model does not offer.
- **Validity by construction.** Decoding SELFIES yields essentially **100% valid** molecules
  (measured 1.000), whereas SMILES-based models can emit invalid strings.
- **A full inverse-design system, not just a generator.** The generator is paired with a
  predictive model (Optuna-tuned), an electrolyte property model, optional LLM guidance, and
  literature-grounded retrieval, forming an end-to-end loop from a plain-English request to a
  ranked, scored candidate list.
- **Electrolyte-formulation awareness.** Conductivity, coordination, and viscosity are *system*
  properties; MolForge models them at the formulation level (multi-cation), grounded in OEDB and
  CALiSol-23 data. Most molecule generators ignore this.
- **Multi-database breadth.** Trained across five public databases, not a single catalog.
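
Interpolation in the latent space, mentioned in the first bullet, can be sketched as a straight line between two latent vectors. Plain Python lists stand in for latent vectors here; the return type of `mf.encode` is not documented in this card, so converting to and from it is left as an assumption.

```python
def interpolate(z_start, z_end, steps):
    """Return `steps` evenly spaced points on the line from z_start to z_end,
    endpoints included."""
    if len(z_start) != len(z_end):
        raise ValueError("latent vectors must have the same dimension")
    if steps < 2:
        raise ValueError("need at least 2 steps to include both endpoints")
    path = []
    for i in range(steps):
        t = i / (steps - 1)  # runs from 0.0 to 1.0
        path.append([(1 - t) * a + t * b for a, b in zip(z_start, z_end)])
    return path


# Two-dimensional toy vectors; the real latent space has 256 dimensions.
path = interpolate([0.0, 2.0], [1.0, 0.0], steps=5)
print(path[2])  # [0.5, 1.0]
```

Each intermediate vector would then be decoded with `mf.decode`, which accepts whatever `mf.encode` returns according to the round-trip shown under Usage.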
## Files

```
checkpoints/best.pt                # the SELFIES-VAE generator weights
checkpoints/electrolyte_model.pt   # optional: formulation property model (conductivity etc.)
processed/vocab.json               # SELFIES token vocabulary
processed/descriptor_stats.json    # descriptor normalization (mean/std)
processed/meta.json                # vocab size, max length, property order, constraints
```

This is exactly the layout the `molforge` library expects under `MOLVAE_ART_DIR`.
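
Before pointing the library at a manually downloaded directory, it can help to confirm that the layout is complete. The helper below is a small standalone sketch and not part of `molforge`. It treats every file in the listing as required except `electrolyte_model.pt`, which the listing marks as optional; that split is an assumption.

```python
from pathlib import Path

# Files from the listing above; the required/optional split is assumed.
REQUIRED = [
    "checkpoints/best.pt",
    "processed/vocab.json",
    "processed/descriptor_stats.json",
    "processed/meta.json",
]
OPTIONAL = ["checkpoints/electrolyte_model.pt"]


def missing_artifacts(root):
    """Return the required files (relative paths) that are absent under `root`."""
    root = Path(root)
    return [rel for rel in REQUIRED if not (root / rel).is_file()]


# Example: missing_artifacts("/path/to/download") returns [] when the layout is complete.
```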
## Usage

```bash
pip install "git+https://github.com/NealKapadia/molforge.git"
```

```python
from huggingface_hub import snapshot_download
from molforge import MolForge

art = snapshot_download("NealKapadia/Molforge")   # downloads checkpoints/ + processed/
mf = MolForge(device="cpu", artifacts_dir=art)    # or device="cuda"

mf.generate(10)                                   # 10 generated SMILES
mf.generate(5, spec={"MolWt": 250, "QED": 0.8})   # property-targeted (soft targets)
z = mf.encode("OCCN(CCO)CCO")                     # molecule -> latent vector
mf.decode(z)                                      # latent vector -> molecule (round-trip)
mf.properties("CCO")                              # RDKit descriptors
```

Or set the path manually instead of passing `artifacts_dir=`:
`export MOLVAE_ART_DIR=/path/to/download` (Windows PowerShell: `$env:MOLVAE_ART_DIR="..."`).
## Limitations & intended use

- **Research / educational use** for molecular design and screening. It is **not** a substitute
  for experimental validation, synthesis feasibility, or safety assessment.
- **Soft conditioning:** `spec` targets nudge generation toward a value; they are not exact.
  For hard constraints, over-generate and filter by RDKit-computed properties.
- The generator covers a broad space of **small-to-medium organic molecules**; very small
  electrolyte molecules (EC/DEC/MeCN) sit at the edge of that distribution, so for tight
  electrolyte focus, specialize via fine-tuning plus the electrolyte property model.
- The electrolyte property model has labeled data for **Li / Na / K** only; the generator
  proposes candidates for any chemistry, but quantitative ranking beyond Li/Na/K needs
  additional labeled data.
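
The over-generate-and-filter pattern suggested for hard constraints can be sketched as below. `generate` and `properties` are passed in as callables so that the sketch does not depend on the exact return types of `mf.generate` and `mf.properties`, which this card does not document. The stand-in functions in the example are purely illustrative.

```python
import itertools


def generate_with_constraints(generate, properties, windows, n_wanted,
                              batch=64, max_rounds=10):
    """Sample in batches and keep molecules whose properties fall inside every window.

    generate   : callable(n) -> list of molecule strings
    properties : callable(molecule) -> dict of descriptor name -> value
    windows    : dict of descriptor name -> (low, high), inclusive bounds
    """
    seen, kept = set(), []
    for _ in range(max_rounds):
        for mol in generate(batch):
            if mol in seen:
                continue  # skip duplicates across and within batches
            seen.add(mol)
            props = properties(mol)
            if all(low <= props[name] <= high for name, (low, high) in windows.items()):
                kept.append(mol)
        if len(kept) >= n_wanted:
            break
    return kept[:n_wanted]


# Illustrative stand-ins: carbon chains of growing length, "properties" = chain length.
_lengths = itertools.count(1)


def fake_generate(n):
    return ["C" * next(_lengths) for _ in range(n)]


def fake_properties(mol):
    return {"HeavyAtomCount": len(mol)}


hits = generate_with_constraints(
    fake_generate, fake_properties, {"HeavyAtomCount": (3, 5)}, n_wanted=3, batch=4
)
print(hits)  # ['CCC', 'CCCC', 'CCCCC']
```

With the real model one would presumably pass `mf.generate` (optionally wrapped to include a `spec`) and `mf.properties`, assuming the latter returns a mapping from descriptor name to value.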