---
license: cc-by-nc-4.0
library_name: pytorch
tags:
- chemistry
- molecule-generation
- generative-model
- vae
- selfies
- rdkit
- drug-discovery
- electrolyte
- batteries
- cheminformatics
---
# MolForge — a conditional SELFIES-VAE for de-novo molecule & battery-electrolyte design
MolForge is a **conditional variational autoencoder** over
[SELFIES](https://github.com/aspuru-guzik-group/selfies) representations of molecules, with
about 42 million parameters (41,966,682), **trained on 7,116,053 molecules curated from five
public chemistry databases** (Molport, ChEMBL, and ZINC for broad chemical coverage, plus
electrolyte data from OEDB and CALiSol-23). It learns a smooth 256-dimensional latent space you
can **sample, traverse, and optimize**, and because it decodes SELFIES, **essentially 100% of
generated strings are valid molecules** (measured validity 1.000). It is purpose-built for
**de-novo battery-electrolyte design** — generating candidate solvents and additives across
chemistries (Li / Na / K / Mg / Zn / …) and ranking them with a paired electrolyte property
model grounded in real electrolyte data.
- **Code / library:** https://github.com/NealKapadia/molforge
- **Weights (this repo):** `checkpoints/best.pt`
- **Architecture:** embedding 512 → bidirectional GRU encoder (1024 × 2 layers) →
latent 256 → GRU decoder (1024 × 2 layers), conditioned on **11 RDKit descriptors**,
with an auxiliary latent→property head. Decoder word-dropout of 0.25 (Bowman et al.)
discourages the decoder from ignoring the latent code. SELFIES robust alphabet, 79 tokens, max length 120.
- **Training data — 7,116,053 molecules from five public databases** (filtered to 3–60 heavy
atoms and an organic element set, then de-duplicated):
| Database | Molecules | Role |
|---|---|---|
| Molport "All Stock" | 6,088,143 | core corpus of purchasable molecules |
| ChEMBL-37 (sample) | 800,000 | bioactive chemical diversity |
| ZINC | 227,902 | additional lead-like diversity |
| OEDB + CALiSol-23 (solvents) | 8 | electrolyte solvents in the generator |
| **Total** | **7,116,053** | generative training set |
OEDB and CALiSol-23 additionally provide the electrolyte solvents and **18,918 electrolyte
formulations** (conductivity, coordination, viscosity) that train the separate property model.
Trained with the **default** SELFIES constraints (S=6 / P=5 allowed) so sulfonyl/phosphate
electrolyte motifs round-trip.
- **Selected checkpoint:** `best.pt`, chosen by the validation score `val_token_acc + 0.25·valid_rate`.
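The checkpoint-selection score above is simple enough to sketch directly. This is a minimal illustration, not the training code: the function names, the checkpoint file names, and the metric values are all made up, and it assumes the checkpoint with the highest score is the one kept.

```python
def selection_score(val_token_acc, valid_rate):
    """Score from the model card: val_token_acc + 0.25 * valid_rate."""
    return val_token_acc + 0.25 * valid_rate


def pick_best(checkpoints):
    """Return the name of the checkpoint with the highest selection score.

    `checkpoints` maps a checkpoint name to a dict with the keys
    "val_token_acc" and "valid_rate" (a hypothetical layout).
    """
    return max(
        checkpoints,
        key=lambda name: selection_score(
            checkpoints[name]["val_token_acc"],
            checkpoints[name]["valid_rate"],
        ),
    )


# Made-up validation history, for illustration only.
history = {
    "epoch_01.pt": {"val_token_acc": 0.990, "valid_rate": 1.000},
    "epoch_02.pt": {"val_token_acc": 0.996, "valid_rate": 0.980},
    "epoch_03.pt": {"val_token_acc": 0.995, "valid_rate": 1.000},
}
best = pick_best(history)
```

In this toy history the second checkpoint has the best token accuracy, but its lower validity rate costs more than the accuracy gains, so the third one is selected.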
## Conditioning properties (fixed order)
`MolWt, MolLogP, TPSA, QED, NumHDonors, NumHAcceptors, NumRotatableBonds, NumAromaticRings, NumRings, FractionCSP3, HeavyAtomCount`
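Because the order is fixed, a conditioning vector can be assembled from a partial `spec` by placing each target in its slot. The sketch below is an assumption about how that might look, not the library's code: `build_condition_vector` is a hypothetical helper, the `{"mean": ..., "std": ...}` layout of the statistics is assumed, the statistics themselves are invented, and filling unspecified properties with 0.0 (the training mean after z-scoring) is one plausible choice among several.

```python
PROPERTY_ORDER = [
    "MolWt", "MolLogP", "TPSA", "QED", "NumHDonors", "NumHAcceptors",
    "NumRotatableBonds", "NumAromaticRings", "NumRings", "FractionCSP3",
    "HeavyAtomCount",
]


def build_condition_vector(spec, stats):
    """Z-score the requested targets and place them in the fixed order.

    Assumption: properties missing from `spec` are set to 0.0, i.e. the
    training mean in normalized units.
    """
    unknown = set(spec) - set(PROPERTY_ORDER)
    if unknown:
        raise KeyError(f"unknown properties: {sorted(unknown)}")
    vector = []
    for name in PROPERTY_ORDER:
        if name in spec:
            mean, std = stats[name]["mean"], stats[name]["std"]
            vector.append((spec[name] - mean) / std)
        else:
            vector.append(0.0)
    return vector


# Invented statistics, for illustration only.
toy_stats = {name: {"mean": 0.0, "std": 1.0} for name in PROPERTY_ORDER}
toy_stats["MolWt"] = {"mean": 350.0, "std": 100.0}
toy_stats["QED"] = {"mean": 0.6, "std": 0.2}

cond = build_condition_vector({"MolWt": 250, "QED": 0.8}, toy_stats)
```

With these toy numbers, a `MolWt` target of 250 lands one standard deviation below the mean in slot 0, and the `QED` target lands about one standard deviation above the mean in slot 3.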
## Evaluation (best.pt, 5,000 samples @ temperature 0.9)
| Metric | Value |
|---|---|
| Validity | **1.000** |
| Uniqueness | 0.998 |
| Novelty (vs. training set) | 0.995 |
| Internal diversity | 0.894 |
| Reconstruction (exact) | 0.945 |
| Reconstruction (token acc) | 0.998 |
Latent→property head R² (held-out): MolWt 0.994, TPSA 0.977, MolLogP 0.962,
NumHAcceptors / NumRotatableBonds 0.969, QED 0.926, NumHDonors 0.922.
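For readers unfamiliar with the first three columns of the table, the sketch below shows one common convention for computing them. It is an assumption that the reported numbers follow exactly these definitions; `generation_metrics` is a hypothetical helper, and the validity check is passed in as a callable so the example stays free of chemistry dependencies (in practice it would typically be an RDKit parse check on canonical SMILES).

```python
def generation_metrics(generated, training_set, is_valid):
    """Validity, uniqueness, and novelty under one common convention.

    validity   = valid samples / all samples
    uniqueness = distinct valid samples / valid samples
    novelty    = distinct valid samples not in training / distinct valid samples
    Strings are compared exactly, so they should be canonicalized first.
    """
    valid = [s for s in generated if is_valid(s)]
    unique = set(valid)
    novel = unique - set(training_set)
    return {
        "validity": len(valid) / len(generated) if generated else 0.0,
        "uniqueness": len(unique) / len(valid) if valid else 0.0,
        "novelty": len(novel) / len(unique) if unique else 0.0,
    }


# Toy data: an empty string stands in for an invalid sample.
samples = ["CCO", "CCO", "CCN", "c1ccccc1", ""]
train = {"CCO"}
metrics = generation_metrics(samples, train, is_valid=lambda s: bool(s))
```

Here four of five samples are valid, three of those four are distinct, and two of the three distinct molecules are absent from the toy training set.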
> On the standard generative benchmark columns (validity / uniqueness / novelty /
> diversity) this model is competitive with — and on several columns exceeds — the
> autoregressive **ElectrolyteGPT** (Kim et al., *JACS Au*, 2026, 6, 2288–2302). The
> structural advantage is the **latent space**: smooth interpolation and gradient-based
> property optimization, which a left-to-right token model does not offer.
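Latent interpolation, as mentioned above, amounts to walking a straight line between two latent vectors and decoding each point. The following is a minimal sketch on plain Python lists; `lerp_path` is a hypothetical helper, and the three-dimensional vectors stand in for the model's 256-dimensional latents.

```python
def lerp_path(z_start, z_end, steps):
    """Return `steps` evenly spaced points from z_start to z_end, inclusive."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    if len(z_start) != len(z_end):
        raise ValueError("latent vectors must have the same length")
    path = []
    for i in range(steps):
        t = i / (steps - 1)  # runs from 0.0 to 1.0
        path.append([(1 - t) * a + t * b for a, b in zip(z_start, z_end)])
    return path


# Toy latents; real ones would come from encoding two molecules.
z_a = [0.0, 1.0, -2.0]
z_b = [1.0, 3.0, 2.0]
path = lerp_path(z_a, z_b, 5)
```

Each point on the path could then be passed to a decoder to see how structures change between the two endpoints. Whether the library's latents are lists, arrays, or tensors is not stated here, so the arithmetic may need adapting.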
## How MolForge differs from existing models
- **A latent space, not left-to-right text generation.** Autoregressive models (ElectrolyteGPT,
MolGPT) emit one token at a time. MolForge's VAE provides a continuous latent space you can
**interpolate** and **optimize with gradients** (e.g. "increase molecular weight by 10 while
keeping everything else fixed"), which a left-to-right token model does not offer.
- **Validity by construction.** Decoding SELFIES yields essentially **100% valid** molecules
(measured 1.000), whereas SMILES-based models can emit invalid strings.
- **A full inverse-design system, not just a generator.** The generator is paired with a
predictive model (Optuna-tuned), an electrolyte property model, optional LLM guidance, and
literature-grounded retrieval — an end-to-end loop from a plain-English request to a ranked,
scored candidate list.
- **Electrolyte-formulation awareness.** Conductivity, coordination, and viscosity are *system*
properties; MolForge models them at the formulation level (multi-cation), grounded in OEDB and
CALiSol-23 data — most molecule generators ignore this.
- **Multi-database breadth.** Trained across five public databases, not a single catalog.
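The gradient-based property optimization named in the first bullet can be illustrated on a toy problem. This is a minimal sketch, not MolForge's implementation: the property head is replaced by a hypothetical linear function `w·z + b`, the names `predict` and `optimize_latent` are illustrative, and the weights are invented. It moves a single property toward a target and does nothing to hold the other properties fixed.

```python
def predict(z, w, b):
    """Toy linear property head: p(z) = w . z + b (stand-in for a neural head)."""
    return sum(wi * zi for wi, zi in zip(w, z)) + b


def optimize_latent(z, w, b, target, lr=0.1, steps=50):
    """Gradient descent on the squared error (p(z) - target)^2 with respect to z.

    For the linear head, the gradient is 2 * (p(z) - target) * w.
    """
    z = list(z)
    for _ in range(steps):
        error = predict(z, w, b) - target
        z = [zi - lr * 2.0 * error * wi for zi, wi in zip(z, w)]
    return z


# Invented weights and starting latent, in arbitrary units.
w = [0.5, -0.25, 1.0]
b = 0.0
z_start = [0.0, 0.0, 0.0]

# "Increase the predicted property by 10" relative to the starting point.
target = predict(z_start, w, b) + 10.0
z_opt = optimize_latent(z_start, w, b, target)
```

With a real model, the same loop would use automatic differentiation through the latent→property head, and the optimized latent would be decoded back to a molecule. Keeping other properties unchanged would need extra penalty terms that this sketch omits.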
## Files
```
checkpoints/best.pt # the SELFIES-VAE generator weights
checkpoints/electrolyte_model.pt # optional: formulation property model (conductivity etc.)
processed/vocab.json # SELFIES token vocabulary
processed/descriptor_stats.json # descriptor normalization (mean/std)
processed/meta.json # vocab size, max length, property order, constraints
```
This is exactly the layout the `molforge` library expects under `MOLVAE_ART_DIR`.
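Before loading the model, it can help to confirm that a download matches this layout. The check below is a small sketch using only the standard library; `missing_artifacts` is a hypothetical helper, and treating every file except `electrolyte_model.pt` as required is an assumption based on the "optional" note in the listing.

```python
import tempfile
from pathlib import Path

# Relative paths taken from the file listing above.
REQUIRED = [
    "checkpoints/best.pt",
    "processed/vocab.json",
    "processed/descriptor_stats.json",
    "processed/meta.json",
]
OPTIONAL = ["checkpoints/electrolyte_model.pt"]


def missing_artifacts(root):
    """Return the required relative paths that are not files under `root`."""
    root = Path(root)
    return [rel for rel in REQUIRED if not (root / rel).is_file()]


# Demo on a throwaway directory that only contains the generator weights.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "checkpoints").mkdir()
    (Path(tmp) / "checkpoints" / "best.pt").touch()
    demo_missing = missing_artifacts(tmp)
```

An empty list means the directory is ready to be passed as the artifacts directory; anything else names the files that still need to be downloaded.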
## Usage
```bash
pip install "git+https://github.com/NealKapadia/molforge.git"
```
```python
from huggingface_hub import snapshot_download
from molforge import MolForge
art = snapshot_download("NealKapadia/Molforge") # downloads checkpoints/ + processed/
mf = MolForge(device="cpu", artifacts_dir=art) # or device="cuda"
mf.generate(10) # 10 valid, novel SMILES
mf.generate(5, spec={"MolWt": 250, "QED": 0.8}) # property-targeted
z = mf.encode("OCCN(CCO)CCO") # molecule -> latent vector
mf.decode(z) # latent round-trip
mf.properties("CCO") # RDKit descriptors
```
Alternatively, set the `MOLVAE_ART_DIR` environment variable instead of passing `artifacts_dir=`:
`export MOLVAE_ART_DIR=/path/to/download` (Windows PowerShell: `$env:MOLVAE_ART_DIR="..."`).
## Limitations & intended use
- **Research / educational use** for molecular design and screening; it is **not** a substitute
for experimental validation, synthetic-feasibility analysis, or safety assessment.
- **Soft conditioning:** `spec` targets nudge generation toward a value; they are not exact.
For hard constraints, over-generate and filter by RDKit-computed properties.
- The generator covers a broad space of **small-to-medium organic molecules**; very small
electrolyte molecules (EC/DEC/MeCN) sit at the edge of that distribution, so for tight
electrolyte focus, specialize via fine-tuning plus the electrolyte property model.
- The electrolyte property model has labeled data for **Li / Na / K** only; the generator
proposes candidates for any chemistry, but quantitative ranking beyond Li/Na/K needs
additional labeled data.
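The over-generate-and-filter approach suggested for hard constraints can be sketched as a small loop. This is an illustration under stated assumptions: `generate_within` is a hypothetical helper, the generator and property functions are passed in as callables, and the property function is assumed to return a dict keyed by descriptor name. The stub functions and molecular weights below exist only to make the example self-contained.

```python
def generate_within(generate_fn, property_fn, bounds, n_wanted,
                    batch_size=64, max_batches=20):
    """Sample in batches and keep distinct molecules whose properties fall in bounds.

    bounds: {"MolWt": (low, high), ...}, inclusive on both ends.
    Returns up to n_wanted molecules; fewer if max_batches is exhausted.
    """
    kept, seen = [], set()
    for _ in range(max_batches):
        for smiles in generate_fn(batch_size):
            if smiles in seen:
                continue
            seen.add(smiles)
            props = property_fn(smiles)
            if all(low <= props[name] <= high
                   for name, (low, high) in bounds.items()):
                kept.append(smiles)
                if len(kept) == n_wanted:
                    return kept
    return kept


# Stubs standing in for a real generator and descriptor calculator.
_fake_pool = {"CCO": 46.07, "CCCCO": 74.12, "c1ccccc1": 78.11, "CC(=O)O": 60.05}


def fake_generate(n):
    pool = list(_fake_pool)
    return [pool[i % len(pool)] for i in range(n)]


def fake_properties(smiles):
    return {"MolWt": _fake_pool[smiles]}


hits = generate_within(fake_generate, fake_properties,
                       {"MolWt": (55.0, 76.0)}, n_wanted=2,
                       batch_size=8, max_batches=2)
```

With the library, the two callables might wrap `mf.generate` (optionally with a `spec` to bias sampling toward the window) and `mf.properties`, provided the latter returns a mapping from descriptor name to value.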