---
license: cc-by-nc-4.0
library_name: pytorch
tags:
  - chemistry
  - molecule-generation
  - generative-model
  - vae
  - selfies
  - rdkit
  - drug-discovery
  - electrolyte
  - batteries
  - cheminformatics
---

# MolForge — a conditional SELFIES-VAE for de-novo molecule & battery-electrolyte design

MolForge is a **conditional variational autoencoder** over
[SELFIES](https://github.com/aspuru-guzik-group/selfies) representations of molecules, with
about 42 million parameters (41,966,682), **trained on 7,116,053 molecules curated from five
public chemistry databases** (Molport, ChEMBL, and ZINC for broad chemical coverage, plus
electrolyte data from OEDB and CALiSol-23). It learns a smooth 256-dimensional latent space you
can **sample, traverse, and optimize**, and because it decodes SELFIES, **essentially 100% of
generated strings are valid molecules** (measured validity 1.000). It is purpose-built for
**de-novo battery-electrolyte design** — generating candidate solvents and additives across
chemistries (Li / Na / K / Mg / Zn / …) and ranking them with a paired electrolyte property
model grounded in real electrolyte data.

- **Code / library:** https://github.com/NealKapadia/molforge
- **Weights (this repo):** `checkpoints/best.pt`
- **Architecture:** embedding 512 → bidirectional GRU encoder (1024 × 2 layers) →
  latent 256 → GRU decoder (1024 × 2 layers), conditioned on **11 RDKit descriptors**,
  with an auxiliary latent→property head. Decoder word-dropout of 0.25 (Bowman et al.) keeps the
  decoder from ignoring the latent code. SELFIES robust alphabet, 79 tokens, max length 120.
- **Training data — 7,116,053 molecules from five public databases** (filtered to 3–60 heavy
  atoms and an organic element set, then de-duplicated):

  | Database | Molecules | Role |
  |---|---|---|
  | Molport "All Stock" | 6,088,143 | core corpus of purchasable molecules |
  | ChEMBL-37 (sample) | 800,000 | bioactive chemical diversity |
  | ZINC | 227,902 | additional lead-like diversity |
  | OEDB + CALiSol-23 (solvents) | 8 | electrolyte solvents in the generator |
  | **Total** | **7,116,053** | generative training set |

  OEDB and CALiSol-23 additionally provide the electrolyte solvents and **18,918 electrolyte
  formulations** (conductivity, coordination, viscosity) that train the separate property model.
  Trained with the **default** SELFIES constraints (S=6 / P=5 allowed) so sulfonyl/phosphate
  electrolyte motifs round-trip.
- **Selected checkpoint:** `best.pt`, selected by `val_token_acc + 0.25·valid_rate`.
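
The selection rule above can be sketched in a few lines. This is a minimal illustration, not the training script: the checkpoint names and metric values below are made up, and only the formula `val_token_acc + 0.25·valid_rate` comes from this card.

```python
# Hypothetical per-checkpoint validation metrics; names and numbers are invented.
checkpoints = {
    "epoch_08.pt": {"val_token_acc": 0.990, "valid_rate": 1.000},
    "epoch_09.pt": {"val_token_acc": 0.996, "valid_rate": 0.980},
    "epoch_10.pt": {"val_token_acc": 0.995, "valid_rate": 1.000},
}


def selection_score(metrics, valid_weight=0.25):
    """Score from the model card: val_token_acc + 0.25 * valid_rate."""
    return metrics["val_token_acc"] + valid_weight * metrics["valid_rate"]


# Pick the checkpoint with the highest combined score.
best_name = max(checkpoints, key=lambda name: selection_score(checkpoints[name]))
print(best_name, round(selection_score(checkpoints[best_name]), 4))
```

The validity term acts as a tie-breaker here: a checkpoint with slightly higher token accuracy can still lose if it decodes fewer valid molecules.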

## Conditioning properties (fixed order)

`MolWt, MolLogP, TPSA, QED, NumHDonors, NumHAcceptors, NumRotatableBonds, NumAromaticRings, NumRings, FractionCSP3, HeavyAtomCount`
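
As a rough sketch of why the order matters, the snippet below turns a partial `spec` dictionary into a fixed-length, z-scored conditioning vector. The statistics, the `{"mean": ..., "std": ...}` layout, and the choice to fill unspecified properties with 0.0 (the normalized mean) are all assumptions for illustration; the actual schema of `processed/descriptor_stats.json` and the library's handling of missing targets may differ.

```python
PROPERTY_ORDER = [
    "MolWt", "MolLogP", "TPSA", "QED", "NumHDonors", "NumHAcceptors",
    "NumRotatableBonds", "NumAromaticRings", "NumRings", "FractionCSP3",
    "HeavyAtomCount",
]


def build_condition(spec, stats):
    """Return a z-scored vector in PROPERTY_ORDER.

    Properties absent from `spec` are set to 0.0, i.e. the normalized mean
    (an assumption, not necessarily what the library does).
    """
    unknown = set(spec) - set(PROPERTY_ORDER)
    if unknown:
        raise KeyError(f"unknown properties: {sorted(unknown)}")
    vec = []
    for name in PROPERTY_ORDER:
        if name in spec:
            mean, std = stats[name]["mean"], stats[name]["std"]
            vec.append((spec[name] - mean) / std)
        else:
            vec.append(0.0)
    return vec


# Made-up normalization statistics, for illustration only.
example_stats = {name: {"mean": 0.0, "std": 1.0} for name in PROPERTY_ORDER}
example_stats["MolWt"] = {"mean": 350.0, "std": 100.0}
example_stats["QED"] = {"mean": 0.6, "std": 0.2}

cond = build_condition({"MolWt": 250, "QED": 0.8}, example_stats)
print(cond)
```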

## Evaluation (best.pt, 5,000 samples @ temperature 0.9)

| Metric | Value |
|---|---|
| Validity | **1.000** |
| Uniqueness | 0.998 |
| Novelty (vs. training set) | 0.995 |
| Internal diversity | 0.894 |
| Reconstruction (exact) | 0.945 |
| Reconstruction (token acc) | 0.998 |
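
For readers unfamiliar with the first three rows, here is a minimal sketch of how such set-based metrics are commonly computed. The denominators follow the usual MOSES-style convention (uniqueness over valid samples, novelty over unique samples), which is an assumption; this card does not state the exact definitions used. Invalid samples are represented as `None`, and strings are assumed to be canonicalized already.

```python
def sampling_metrics(generated, training_set):
    """Compute validity, uniqueness, and novelty.

    generated: list of canonical SMILES, with None marking an invalid sample.
    training_set: iterable of canonical SMILES seen during training.
    """
    valid = [s for s in generated if s is not None]
    unique = set(valid)
    novel = unique - set(training_set)
    return {
        "validity": len(valid) / len(generated) if generated else 0.0,
        "uniqueness": len(unique) / len(valid) if valid else 0.0,
        "novelty": len(novel) / len(unique) if unique else 0.0,
    }


# Toy data: five samples, one invalid, one duplicate, one already in training.
samples = ["CCO", "CCO", "CCN", "c1ccccc1", None]
train = {"CCO", "CCC"}
metrics = sampling_metrics(samples, train)
print(metrics)
```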

Latent→property head R² (held-out): MolWt 0.994, TPSA 0.977, MolLogP 0.962,
NumHAcceptors / NumRotatableBonds 0.969, QED 0.926, NumHDonors 0.922.

> On the standard generative benchmark columns (validity / uniqueness / novelty /
> diversity) this model is competitive with — and on several columns exceeds — the
> autoregressive **ElectrolyteGPT** (Kim et al., *JACS Au*, 2026, 6, 2288–2302). The
> structural advantage is the **latent space**: smooth interpolation and gradient-based
> property optimization, which a left-to-right token model does not offer.
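
To make "smooth interpolation" concrete, the sketch below builds a straight-line path between two latent vectors. It uses plain Python lists and 2-dimensional toy vectors; with the real model one would presumably obtain the endpoints from `mf.encode(...)` and pass each intermediate point to `mf.decode(...)`, but the tensor type those calls use is not specified here, so that wiring is left out.

```python
def interpolate(z_start, z_end, num_points):
    """Return num_points latent vectors evenly spaced from z_start to z_end, inclusive."""
    if num_points < 2:
        raise ValueError("num_points must be at least 2")
    if len(z_start) != len(z_end):
        raise ValueError("latent vectors must have the same length")
    path = []
    for k in range(num_points):
        t = k / (num_points - 1)  # t runs from 0.0 to 1.0
        path.append([(1 - t) * a + t * b for a, b in zip(z_start, z_end)])
    return path


# Toy 2-dimensional latents; the real latent space has 256 dimensions.
path = interpolate([0.0, 2.0], [4.0, 0.0], 5)
for point in path:
    print(point)
```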

## How MolForge differs from existing models

- **A latent space, not left-to-right text generation.** Autoregressive models (ElectrolyteGPT,
  MolGPT) emit one token at a time. MolForge's VAE provides a continuous latent space you can
  **interpolate** and **optimize with gradients** (e.g. "increase molecular weight by 10 while
  keeping everything else fixed"), which a token-by-token model does not offer directly.
- **Validity by construction.** Decoding SELFIES yields essentially **100% valid** molecules
  (measured 1.000), whereas SMILES-based models can emit invalid strings.
- **A full inverse-design system, not just a generator.** The generator is paired with a
  predictive model (Optuna-tuned), an electrolyte property model, optional LLM guidance, and
  literature-grounded retrieval — an end-to-end loop from a plain-English request to a ranked,
  scored candidate list.
- **Electrolyte-formulation awareness.** Conductivity, coordination, and viscosity are *system*
  properties; MolForge models them at the formulation level (multi-cation), grounded in OEDB and
  CALiSol-23 data — most molecule generators ignore this.
- **Multi-database breadth.** Trained across five public databases, not a single catalog.
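
The "increase one property while keeping the others fixed" idea from the first bullet can be sketched as gradient descent on a squared error in latent space. Everything below is a toy under stated assumptions: the property head is taken to be linear (`W z + b`) with invented weights, the latent has 3 dimensions instead of 256, and the gradient is written out by hand. With the actual model, one would more likely differentiate through the trained latent→property head using autograd.

```python
def predict(W, b, z):
    """Linear stand-in for the latent->property head: returns W z + b."""
    return [sum(w * zj for w, zj in zip(row, z)) + bi for row, bi in zip(W, b)]


def optimize_latent(W, b, z, target, lr=0.1, steps=200):
    """Gradient descent on sum((W z + b - target)^2) with respect to z."""
    z = list(z)
    for _ in range(steps):
        residual = [p - t for p, t in zip(predict(W, b, z), target)]
        # d/dz of the squared error is 2 * W^T residual.
        grad = [
            2.0 * sum(W[i][j] * residual[i] for i in range(len(W)))
            for j in range(len(z))
        ]
        z = [zj - lr * gj for zj, gj in zip(z, grad)]
    return z


# Invented weights: row 0 plays the role of MolWt, row 1 of a second property.
W = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.0]]
b = [0.0, 0.0]
z0 = [1.0, 2.0, 0.0]

start = predict(W, b, z0)
target = [start[0] + 10.0, start[1]]  # raise property 0 by 10, hold property 1
z_opt = optimize_latent(W, b, z0, target)
final = predict(W, b, z_opt)
print(start, final)
```

Putting every property in the target, not just the one being changed, is what penalizes drift in the others.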

## Files

```
checkpoints/best.pt              # the SELFIES-VAE generator weights
checkpoints/electrolyte_model.pt # optional: formulation property model (conductivity etc.)
processed/vocab.json             # SELFIES token vocabulary
processed/descriptor_stats.json  # descriptor normalization (mean/std)
processed/meta.json              # vocab size, max length, property order, constraints
```

This is exactly the layout the `molforge` library expects under `MOLVAE_ART_DIR`.
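
Before loading the model, it can help to confirm that a directory matches this layout. The helper below is a small sketch, not part of the `molforge` library. It treats `electrolyte_model.pt` as optional, as listed above, and assumes the other four files are required.

```python
import os
from pathlib import Path

# Files from the listing above; the required/optional split is an assumption
# except for electrolyte_model.pt, which the listing marks as optional.
REQUIRED = [
    "checkpoints/best.pt",
    "processed/vocab.json",
    "processed/descriptor_stats.json",
    "processed/meta.json",
]
OPTIONAL = ["checkpoints/electrolyte_model.pt"]


def missing_artifacts(art_dir):
    """Return the required relative paths that are not present under art_dir."""
    root = Path(art_dir)
    return [rel for rel in REQUIRED if not (root / rel).is_file()]


# Check the directory named by MOLVAE_ART_DIR, falling back to the current one.
art_dir = os.environ.get("MOLVAE_ART_DIR", ".")
print("missing:", missing_artifacts(art_dir))
```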

## Usage

```bash
pip install "git+https://github.com/NealKapadia/molforge.git"
```

```python
from huggingface_hub import snapshot_download
from molforge import MolForge

art = snapshot_download("NealKapadia/Molforge")   # downloads checkpoints/ + processed/
mf = MolForge(device="cpu", artifacts_dir=art)    # or device="cuda"

mf.generate(10)                                   # 10 SMILES, valid by construction
mf.generate(5, spec={"MolWt": 250, "QED": 0.8})   # property-targeted
z = mf.encode("OCCN(CCO)CCO"); mf.decode(z)       # latent round-trip
mf.properties("CCO")                              # RDKit descriptors
```

Or set the path manually instead of passing `artifacts_dir=`:
`export MOLVAE_ART_DIR=/path/to/download` (Windows PowerShell: `$env:MOLVAE_ART_DIR="..."`).

## Limitations & intended use

- **Research / educational use** for molecular design and screening; **not** a substitute
  for experimental validation, synthetic-feasibility analysis, or safety assessment.
- **Soft conditioning:** `spec` targets nudge generation toward a value; they are not exact.
  For hard constraints, over-generate and filter by RDKit-computed properties.
- The generator covers a broad space of **small-to-medium organic molecules**; very small
  electrolyte molecules (EC/DEC/MeCN) sit at the edge of that distribution, so for tight
  electrolyte focus, specialize via fine-tuning plus the electrolyte property model.
- The electrolyte property model has labeled data for **Li / Na / K** only; the generator
  proposes candidates for any chemistry, but quantitative ranking beyond Li/Na/K needs
  additional labeled data.
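
The over-generate-and-filter approach suggested under soft conditioning can be sketched as a small loop. The function takes two callables so it stays independent of the library: with the real model, `generate` could plausibly be `lambda n: mf.generate(n, spec={...})` and `properties` could be `mf.properties`, assuming the latter returns a dict keyed by descriptor name (its return type is not documented here). The stand-in functions below are fake and exist only to make the example runnable.

```python
import itertools


def generate_with_constraints(generate, properties, n_wanted, bounds,
                              batch_size=64, max_batches=20):
    """Sample in batches and keep unique molecules whose properties fall in bounds.

    generate(n) -> list of SMILES; properties(smiles) -> dict of name -> value.
    bounds maps a property name to an inclusive (low, high) range.
    """
    kept, seen = [], set()
    for _ in range(max_batches):
        for smi in generate(batch_size):
            if smi in seen:
                continue
            seen.add(smi)
            props = properties(smi)
            if all(lo <= props[name] <= hi for name, (lo, hi) in bounds.items()):
                kept.append(smi)
                if len(kept) == n_wanted:
                    return kept
    return kept  # may be shorter than n_wanted if the budget runs out


# Fake stand-ins so the sketch runs without the model or RDKit.
_counter = itertools.count()


def fake_generate(n):
    return [f"MOL{next(_counter) % 50}" for _ in range(n)]


def fake_properties(smi):
    return {"MolWt": 200.0 + 5.0 * int(smi[3:])}


hits = generate_with_constraints(
    fake_generate, fake_properties, 3, {"MolWt": (240.0, 260.0)}, batch_size=10
)
print(hits)
```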