---
license: cc-by-nc-4.0
library_name: pytorch
tags:
  - chemistry
  - molecule-generation
  - generative-model
  - vae
  - selfies
  - rdkit
  - drug-discovery
  - electrolyte
  - batteries
  - cheminformatics
---

# MolForge — a conditional SELFIES-VAE for de-novo molecule & battery-electrolyte design

MolForge is a 42M-parameter **conditional variational autoencoder** over
[SELFIES](https://github.com/aspuru-guzik-group/selfies) string representations of
molecules. It learns a smooth 256-dimensional latent space you can **sample, traverse,
and optimize**, and because it decodes SELFIES, close to **100% of generated strings are valid
molecules**. It was built for de-novo design of drug-like molecules and, in particular,
**battery-electrolyte solvents / additives** across chemistries (Li / Na / K / Mg / Zn / …).

- **Code / library:** https://github.com/NealKapadia/molforge
- **Weights (this repo):** `checkpoints/best.pt`
- **Architecture:** embedding 512 → bidirectional GRU encoder (hidden size 1024, 2 layers) →
  latent 256 → GRU decoder (hidden size 1024, 2 layers), conditioned on **11 RDKit descriptors**,
  with an auxiliary latent→property head. Decoder word-dropout of 0.25 (Bowman et al.)
  pushes the decoder to rely on the latent code instead of ignoring it. The vocabulary is
  the SELFIES robust alphabet: 79 tokens, max length 120.
- **Training data:** the Molport "All Stock" compound catalog (~6.1M drug-like molecules
  after filtering: heavy atoms 3–60, organic element set). The model is trained with the
  **default** SELFIES constraints, which allow a maximum valence of 6 for S and 5 for P.
  This is critical for validity: sulfonyl and phosphate electrolyte motifs only round-trip
  when those valences are permitted.
- **Selected checkpoint:** `best.pt` = early-stopped epoch (~12M molecules seen), selected by
  `val_token_acc + 0.25·valid_rate`.
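The decoder word-dropout mentioned in the architecture bullet can be sketched in a few lines. This is a minimal, framework-free illustration and not MolForge's training code: the function name, the choice of an unknown-token id as the replacement, and the decision to keep the first (start) token are all assumptions made for the example.

```python
import random


def word_dropout(tokens, unk_id, rate=0.25, rng=None, keep_first=True):
    """Randomly replace decoder-input tokens with `unk_id`.

    Hypothetical helper: each token is replaced independently with
    probability `rate`, so the decoder cannot rely purely on the
    previous ground-truth token and has to use the latent code.
    """
    rng = rng or random.Random()
    out = []
    for i, tok in enumerate(tokens):
        if keep_first and i == 0:
            out.append(tok)          # assumption: never drop the start token
        elif rng.random() < rate:
            out.append(unk_id)       # token hidden from the decoder
        else:
            out.append(tok)          # token passed through unchanged
    return out
```

Only the decoder *inputs* are corrupted in this scheme; the reconstruction targets stay intact, which is what makes the latent code the only reliable source of the missing information.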

## Conditioning properties (fixed order)

`MolWt, MolLogP, TPSA, QED, NumHDonors, NumHAcceptors, NumRotatableBonds, NumAromaticRings, NumRings, FractionCSP3, HeavyAtomCount`
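Because the order is fixed, a partial `spec` has to be expanded into an 11-slot vector before it can condition the model. The sketch below shows one plausible way to do that with z-score normalization. The layout of the statistics dictionary (`{name: {"mean": ..., "std": ...}}`), the mask for unspecified properties, and the function name are assumptions for illustration; the actual format of `processed/descriptor_stats.json` and the library's handling of missing targets may differ.

```python
PROPERTY_ORDER = [
    "MolWt", "MolLogP", "TPSA", "QED", "NumHDonors", "NumHAcceptors",
    "NumRotatableBonds", "NumAromaticRings", "NumRings", "FractionCSP3",
    "HeavyAtomCount",
]


def build_condition(spec, stats):
    """Expand a partial spec into (values, mask) in the fixed property order.

    Hypothetical helper. `stats` is assumed to map each property name to
    {"mean": float, "std": float}. Specified properties are z-scored;
    unspecified ones get value 0.0 and mask 0.0.
    """
    unknown = set(spec) - set(PROPERTY_ORDER)
    if unknown:
        raise KeyError(f"unknown properties: {sorted(unknown)}")
    values, mask = [], []
    for name in PROPERTY_ORDER:
        if name in spec:
            mean, std = stats[name]["mean"], stats[name]["std"]
            values.append((spec[name] - mean) / std)
            mask.append(1.0)
        else:
            values.append(0.0)
            mask.append(0.0)
    return values, mask
```

Rejecting unknown keys early is a deliberate choice here: a misspelled property name would otherwise be ignored without any warning and the target would have no effect.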

## Evaluation (best.pt, 5,000 samples @ temperature 0.9)

| Metric | Value |
|---|---|
| Validity | **1.000** |
| Uniqueness | 0.998 |
| Novelty (vs. training set) | 0.995 |
| Internal diversity | 0.894 |
| Reconstruction (exact) | 0.945 |
| Reconstruction (token acc) | 0.998 |
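
For readers who want to reproduce the first three rows on their own samples, here is a minimal sketch of how validity, uniqueness, and novelty are commonly computed. The denominators follow a widespread convention (uniqueness over valid samples, novelty over unique valid samples); this card does not state which convention was used, so treat the definitions and the function name as assumptions. Inputs are expected to be canonical SMILES, with `None` marking a sample that failed to parse.

```python
def generation_metrics(canonical_smiles, training_set):
    """Compute validity, uniqueness and novelty for a batch of samples.

    Hypothetical helper. `canonical_smiles` is a list where invalid
    samples are None; `training_set` is an iterable of canonical SMILES.
    """
    total = len(canonical_smiles)
    valid = [s for s in canonical_smiles if s is not None]
    unique = set(valid)
    novel = unique - set(training_set)
    return {
        # fraction of all samples that parsed to a molecule
        "validity": len(valid) / total if total else 0.0,
        # fraction of valid samples that are distinct
        "uniqueness": len(unique) / len(valid) if valid else 0.0,
        # fraction of distinct valid samples absent from the training set
        "novelty": len(novel) / len(unique) if unique else 0.0,
    }
```

Internal diversity is left out of the sketch because it requires molecular fingerprints and a pairwise similarity measure.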

Latent→property head R² (held-out): MolWt 0.994, TPSA 0.977, MolLogP 0.962,
NumHAcceptors / NumRotatableBonds 0.969, QED 0.926, NumHDonors 0.922.

> On the standard generative benchmark columns (validity / uniqueness / novelty /
> diversity) this model is competitive with — and on several columns exceeds — the
> autoregressive **ElectrolyteGPT** (Kim et al., *JACS Au*, 2026, 6, 2288–2302). The
> structural advantage is the **latent space**: smooth interpolation and gradient-based
> property optimization, which a left-to-right token model does not offer.

## Files

```
checkpoints/best.pt              # the SELFIES-VAE generator weights
checkpoints/electrolyte_model.pt # optional: formulation property model (conductivity etc.)
processed/vocab.json             # SELFIES token vocabulary
processed/descriptor_stats.json  # descriptor normalization (mean/std)
processed/meta.json              # vocab size, max length, property order, constraints
```

This is exactly the layout the `molforge` library expects under `MOLVAE_ART_DIR`.

## Usage

```bash
pip install "git+https://github.com/NealKapadia/molforge.git"
```

```python
from huggingface_hub import snapshot_download
from molforge import MolForge

art = snapshot_download("NealKapadia/Molforge")   # downloads checkpoints/ + processed/
mf = MolForge(device="cpu", artifacts_dir=art)    # or device="cuda"

mf.generate(10)                                   # 10 valid, novel SMILES
mf.generate(5, spec={"MolWt": 250, "QED": 0.8})   # property-targeted
z = mf.encode("OCCN(CCO)CCO")                     # SMILES → latent
mf.decode(z)                                      # latent → SMILES (round-trip)
mf.properties("CCO")                              # RDKit descriptors
```

Or set the path through an environment variable instead of passing `artifacts_dir=`:
`export MOLVAE_ART_DIR=/path/to/download` (Windows PowerShell: `$env:MOLVAE_ART_DIR="..."`).
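
The encode/decode round-trip extends naturally to latent traversal: encode two molecules, walk a straight line between their latent vectors, and decode each intermediate point. The helper below is a hedged, library-independent sketch of the linear interpolation step only. It works on plain Python sequences of floats; whether `mf.encode` returns a tensor, an array, or a list is not documented here, so any conversion to and from that type is left to the caller.

```python
def lerp_latents(z_a, z_b, steps):
    """Return `steps` evenly spaced points from z_a to z_b (inclusive).

    Hypothetical helper operating on plain sequences of floats.
    """
    if steps < 2:
        raise ValueError("steps must be at least 2")
    if len(z_a) != len(z_b):
        raise ValueError("latent vectors must have the same length")
    path = []
    for i in range(steps):
        t = i / (steps - 1)  # 0.0 at z_a, 1.0 at z_b
        path.append([(1 - t) * a + t * b for a, b in zip(z_a, z_b)])
    return path
```

Each point on the returned path would then be passed to `mf.decode`. Neighbouring points can decode to the same molecule, so deduplicating the decoded SMILES is usually worthwhile.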

## Limitations & intended use

- **Research / educational use** for molecular design and screening only. It is **not** a
  substitute for experimental validation, synthesis-feasibility analysis, or safety assessment.
- **Soft conditioning:** `spec` targets nudge generation toward a value; they are not exact.
  For hard constraints, over-generate and filter by RDKit-computed properties.
- The base generator is **drug-like**; very small electrolyte molecules (e.g. ethylene
  carbonate (EC), diethyl carbonate (DEC), acetonitrile (MeCN)) are somewhat
  out-of-distribution for the base checkpoint.
- The electrolyte property model has labeled data for **Li / Na / K** only; the generator
  proposes candidates for any chemistry, but quantitative ranking beyond Li/Na/K needs
  additional labeled data.
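
The over-generate-and-filter advice for hard constraints can be sketched as a small loop. The function below is an illustration and not part of the `molforge` API: it takes the generator and the property calculator as plain callables, so the only assumptions about MolForge are that `generate(n)` yields SMILES strings and `properties(smiles)` returns a mapping keyed by the descriptor names listed above.

```python
def generate_with_hard_constraints(generate, properties, n_wanted, bounds,
                                   batch_size=64, max_batches=20):
    """Sample in batches and keep molecules whose properties fall in `bounds`.

    Hypothetical helper.
      generate(n)        -> iterable of SMILES strings
      properties(smiles) -> dict of descriptor name -> value
      bounds             -> {name: (low, high)}, both ends inclusive
    Returns at most `n_wanted` distinct SMILES; fewer if the batch budget
    runs out first.
    """
    kept, seen = [], set()
    for _ in range(max_batches):
        if len(kept) >= n_wanted:
            break
        for smi in generate(batch_size):
            if smi in seen:
                continue              # skip duplicates across batches
            seen.add(smi)
            props = properties(smi)
            if all(lo <= props[name] <= hi
                   for name, (lo, hi) in bounds.items()):
                kept.append(smi)
                if len(kept) >= n_wanted:
                    break
    return kept
```

With the objects from the Usage section, a call might look like `generate_with_hard_constraints(lambda n: mf.generate(n, spec={"MolWt": 250}), mf.properties, 5, {"MolWt": (230, 270)})`, assuming the return types described above. Combining a soft `spec` target with a hard filter keeps the rejection rate low.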

## Training data & license

Trained on structures from the **Molport "All Stock"** catalog, which is licensed
**CC BY-NC 4.0** (Attribution–NonCommercial). Because these weights are a derivative of that
data, they are released under the **same CC BY-NC 4.0** license:

- **Attribution:** you must credit Molport as the source of the training data.
- **NonCommercial:** these weights may not be used for commercial purposes.

The MolForge *source code* (https://github.com/NealKapadia/molforge) is a separate work and
is offered under its own (permissive) license; only the **weights** carry the CC BY-NC 4.0
restriction inherited from the training data.

## Citation

If you use MolForge, please cite this repository and the SELFIES paper
(Krenn et al., *Mach. Learn.: Sci. Technol.* 2020).