---
license: cc-by-nc-4.0
library_name: pytorch
tags:
  - chemistry
  - molecule-generation
  - generative-model
  - vae
  - selfies
  - rdkit
  - drug-discovery
  - electrolyte
  - batteries
  - cheminformatics
---

# MolForge — a conditional SELFIES-VAE for de-novo molecule & battery-electrolyte design

MolForge is a **conditional variational autoencoder** over
[SELFIES](https://github.com/aspuru-guzik-group/selfies) representations of molecules, with
about 42 million parameters (41,966,682), **trained on 7,116,053 molecules curated from five
public chemistry databases** (Molport, ChEMBL, and ZINC for broad chemical coverage, plus
electrolyte data from OEDB and CALiSol-23). It learns a smooth 256-dimensional latent space you
can **sample, traverse, and optimize**, and because it decodes SELFIES, **essentially 100% of
generated strings are valid molecules** (measured validity 1.000). It is purpose-built for
**de-novo battery-electrolyte design** — generating candidate solvents and additives across
chemistries (Li / Na / K / Mg / Zn / …) and ranking them with a paired electrolyte property
model grounded in real electrolyte data.

- **Code / library:** https://github.com/NealKapadia/molforge
- **Weights (this repo):** `checkpoints/best.pt`
- **Architecture:** embedding 512 → bidirectional GRU encoder (1024 × 2 layers) →
  latent 256 → GRU decoder (1024 × 2 layers), conditioned on **11 RDKit descriptors**,
  with an auxiliary latent→property head. Decoder word-dropout of 0.25 (Bowman et al.) keeps
  the decoder from ignoring the latent code. SELFIES robust alphabet: 79 tokens, max sequence
  length 120.
- **Training data — 7,116,053 molecules from five public databases** (filtered to 3–60 heavy
  atoms and an organic element set, then de-duplicated):

  | Database | Molecules | Role |
  |---|---|---|
  | Molport "All Stock" | 6,088,143 | core corpus of purchasable molecules |
  | ChEMBL-37 (sample) | 800,000 | bioactive chemical diversity |
  | ZINC | 227,902 | additional lead-like diversity |
  | OEDB + CALiSol-23 (solvents) | 8 | electrolyte solvents in the generator |
  | **Total** | **7,116,053** | generative training set |

  OEDB and CALiSol-23 additionally provide the electrolyte solvents and **18,918 electrolyte
  formulations** (conductivity, coordination, viscosity) that train the separate property model.
  Trained with the **default** SELFIES constraints (S=6 / P=5 allowed) so sulfonyl/phosphate
  electrolyte motifs round-trip.
- **Selected checkpoint:** `best.pt`, chosen by the score `val_token_acc + 0.25·valid_rate`.

## Conditioning properties (fixed order)

`MolWt, MolLogP, TPSA, QED, NumHDonors, NumHAcceptors, NumRotatableBonds, NumAromaticRings, NumRings, FractionCSP3, HeavyAtomCount`
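
Because the order is fixed, a conditioning vector has to be assembled by name rather than by whatever order a dictionary happens to have. The sketch below shows one way to do that with z-score normalization. The `{"mean": ..., "std": ...}` layout of the statistics is an assumption about `processed/descriptor_stats.json`, not a documented schema, and `conditioning_vector` is a hypothetical helper.

```python
PROPERTY_ORDER = [
    "MolWt", "MolLogP", "TPSA", "QED", "NumHDonors", "NumHAcceptors",
    "NumRotatableBonds", "NumAromaticRings", "NumRings", "FractionCSP3",
    "HeavyAtomCount",
]


def conditioning_vector(descriptors, stats):
    """Z-score each descriptor and emit the values in the fixed property order.

    `stats` maps a property name to {"mean": ..., "std": ...}. That layout is
    an assumption made for this sketch.
    """
    missing = [name for name in PROPERTY_ORDER if name not in descriptors]
    if missing:
        raise KeyError("missing descriptors: %s" % missing)
    vector = []
    for name in PROPERTY_ORDER:
        mean, std = stats[name]["mean"], stats[name]["std"]
        # Guard against a zero standard deviation in the statistics file.
        vector.append((descriptors[name] - mean) / (std if std > 0 else 1.0))
    return vector


# Illustrative descriptor values and placeholder statistics (not computed here).
descriptors = {
    "MolWt": 46.07, "MolLogP": 0.0, "TPSA": 20.23, "QED": 0.4,
    "NumHDonors": 1, "NumHAcceptors": 1, "NumRotatableBonds": 0,
    "NumAromaticRings": 0, "NumRings": 0, "FractionCSP3": 1.0,
    "HeavyAtomCount": 3,
}
stats = {name: {"mean": 0.0, "std": 1.0} for name in PROPERTY_ORDER}
stats["MolWt"] = {"mean": 300.0, "std": 100.0}
vec = conditioning_vector(descriptors, stats)
print(len(vec), vec[0])
```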

## Evaluation (best.pt, 5,000 samples @ temperature 0.9)

| Metric | Value |
|---|---|
| Validity | **1.000** |
| Uniqueness | 0.998 |
| Novelty (vs. training set) | 0.995 |
| Internal diversity | 0.894 |
| Reconstruction (exact) | 0.945 |
| Reconstruction (token acc) | 0.998 |

Latent→property head R² (held-out): MolWt 0.994, TPSA 0.977, MolLogP 0.962,
NumHAcceptors / NumRotatableBonds 0.969, QED 0.926, NumHDonors 0.922.
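
For readers who want to reproduce the validity, uniqueness, and novelty rows on their own samples, the sketch below uses one common set of definitions: validity over all samples, uniqueness over valid samples, novelty over unique samples. These denominators are an assumption and may differ from the evaluation script used for the table. The inputs are assumed to be canonical SMILES strings, with `None` marking a failed decode.

```python
def generation_metrics(samples, training_set):
    """Compute validity, uniqueness, and novelty for a batch of samples.

    `samples` holds canonical SMILES strings, with None for a failed decode.
    `training_set` holds canonical SMILES of the training molecules.
    """
    valid = [s for s in samples if s is not None]
    unique = set(valid)
    novel = unique - set(training_set)
    return {
        "validity": len(valid) / len(samples) if samples else 0.0,
        "uniqueness": len(unique) / len(valid) if valid else 0.0,
        "novelty": len(novel) / len(unique) if unique else 0.0,
    }


# Toy inputs, for illustration only.
samples = ["CCO", "CCO", "CCN", "c1ccccc1", None]
training_set = {"CCO"}
metrics = generation_metrics(samples, training_set)
print(metrics)
```

Comparing strings only works if both sides are canonicalized the same way, for example with the same RDKit version and settings.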

> On the standard generative benchmark columns (validity / uniqueness / novelty /
> diversity) this model is competitive with — and on several columns exceeds — the
> autoregressive **ElectrolyteGPT** (Kim et al., *JACS Au*, 2026, 6, 2288–2302). The
> structural advantage is the **latent space**: smooth interpolation and gradient-based
> property optimization, which a left-to-right token model does not offer.

## How MolForge differs from existing models

- **A latent space, not left-to-right text generation.** Autoregressive models (ElectrolyteGPT,
  MolGPT) emit one token at a time. MolForge's VAE provides a continuous latent space you can
  **interpolate** and **optimize with gradients** (e.g. "increase molecular weight by 10 while
  keeping everything else fixed"), which a token-by-token model does not offer.
- **Validity by construction.** Decoding SELFIES yields essentially **100% valid** molecules
  (measured 1.000), whereas SMILES-based models can emit syntactically or chemically invalid
  strings.
- **A full inverse-design system, not just a generator.** The generator is paired with a
  predictive model (Optuna-tuned), an electrolyte property model, optional LLM guidance, and
  literature-grounded retrieval — an end-to-end loop from a plain-English request to a ranked,
  scored candidate list.
- **Electrolyte-formulation awareness.** Conductivity, coordination, and viscosity are *system*
  properties; MolForge models them at the formulation level (multi-cation), grounded in OEDB and
  CALiSol-23 data — most molecule generators ignore this.
- **Multi-database breadth.** Trained across five public databases, not a single catalog.
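
The two latent-space operations named in the first bullet, interpolation and gradient-based property shifts, can be illustrated on toy vectors. This is a sketch under strong assumptions: it uses a 4-dimensional latent instead of 256, and a hypothetical *linear* property head `w·z + b`, which is not claimed to match MolForge's actual latent→property head. All function names are illustrative.

```python
def interpolate(z_a, z_b, steps):
    """Return `steps` points on the straight line from z_a to z_b, endpoints included."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    return [
        [a + (b - a) * i / (steps - 1) for a, b in zip(z_a, z_b)]
        for i in range(steps)
    ]


def predict(z, weights, bias=0.0):
    """Hypothetical linear property head: w . z + b."""
    return sum(zi * wi for zi, wi in zip(z, weights)) + bias


def shift_property(z, weights, delta):
    """Move z the shortest distance that changes the linear head's output by `delta`.

    The gradient of w . z + b with respect to z is w, so a step of
    delta * w / ||w||^2 changes the prediction by exactly `delta`.
    """
    norm_sq = sum(w * w for w in weights)
    if norm_sq == 0:
        raise ValueError("weights must be non-zero")
    return [zi + delta * wi / norm_sq for zi, wi in zip(z, weights)]


# Toy 4-dimensional latents and head weights, for illustration only.
path = interpolate([0.0, 0.0, 0.0, 0.0], [1.0, 2.0, 3.0, 4.0], steps=5)
weights = [2.0, 0.0, -1.0, 0.5]
z = [0.1, 0.2, 0.3, 0.4]
z_shifted = shift_property(z, weights, delta=10.0)
print(path[2], predict(z_shifted, weights) - predict(z, weights))
```

With a nonlinear head, the same idea would need several small gradient steps computed by automatic differentiation. Holding other properties fixed would additionally require removing the components of the step that lie along those properties' gradients.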

## Files

```
checkpoints/best.pt              # the SELFIES-VAE generator weights
checkpoints/electrolyte_model.pt # optional: formulation property model (conductivity etc.)
processed/vocab.json             # SELFIES token vocabulary
processed/descriptor_stats.json  # descriptor normalization (mean/std)
processed/meta.json              # vocab size, max length, property order, constraints
```

This is exactly the layout the `molforge` library expects under `MOLVAE_ART_DIR`.
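
Before loading the model, it can help to confirm that a downloaded directory matches this layout. The helper below is a hypothetical convenience, not part of `molforge`. It treats every file except the one marked optional above as required, which is an assumption.

```python
import os
from pathlib import Path

# Files listed in this card; only electrolyte_model.pt is marked optional.
EXPECTED_FILES = [
    "checkpoints/best.pt",
    "processed/vocab.json",
    "processed/descriptor_stats.json",
    "processed/meta.json",
]
OPTIONAL_FILES = ["checkpoints/electrolyte_model.pt"]


def missing_artifacts(art_dir=None):
    """List required files that are absent from the artifacts directory.

    Falls back to the MOLVAE_ART_DIR environment variable, then to the
    current directory, when art_dir is None.
    """
    root = Path(art_dir or os.environ.get("MOLVAE_ART_DIR", "."))
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]
```

Calling `missing_artifacts("/path/to/download")` returns an empty list when all required files are present.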

## Usage

```bash
pip install "git+https://github.com/NealKapadia/molforge.git"
```

```python
from huggingface_hub import snapshot_download
from molforge import MolForge

art = snapshot_download("NealKapadia/Molforge")   # downloads checkpoints/ + processed/
mf = MolForge(device="cpu", artifacts_dir=art)    # or device="cuda"

smiles = mf.generate(10)                                     # 10 generated SMILES
targeted = mf.generate(5, spec={"MolWt": 250, "QED": 0.8})   # soft property targets
z = mf.encode("OCCN(CCO)CCO")                                # molecule -> latent
roundtrip = mf.decode(z)                                     # latent -> molecule
props = mf.properties("CCO")                                 # RDKit descriptors
```

Alternatively, set the path through an environment variable instead of passing `artifacts_dir=`:
`export MOLVAE_ART_DIR=/path/to/download` (Windows PowerShell: `$env:MOLVAE_ART_DIR="..."`).

## Limitations & intended use

- **Research / educational use** for molecular design and screening — **not** a substitute
  for experimental validation, synthesis feasibility, or safety assessment.
- **Soft conditioning:** `spec` targets nudge generation toward a value; they are not exact.
  For hard constraints, over-generate and filter by RDKit-computed properties.
- The generator covers a broad space of **small-to-medium organic molecules**; very small
  electrolyte molecules such as ethylene carbonate (EC), diethyl carbonate (DEC), and
  acetonitrile (MeCN) sit at the edge of that distribution, so for tight electrolyte focus,
  specialize via fine-tuning plus the electrolyte property model.
- The electrolyte property model has labeled data for **Li / Na / K** only; the generator
  proposes candidates for any chemistry, but quantitative ranking beyond Li/Na/K needs
  additional labeled data.
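
The over-generate-and-filter approach suggested for hard constraints can be sketched as follows. `filter_by_bounds` is a hypothetical helper, and the property lookup is a small stand-in table so the example runs without RDKit. In practice the property function would compute descriptors with RDKit.

```python
def filter_by_bounds(candidates, property_fn, bounds):
    """Keep candidates whose properties fall inside every [low, high] bound.

    `property_fn` maps a SMILES string to a dict of descriptor values.
    `bounds` maps a descriptor name to a (low, high) tuple.
    """
    kept = []
    for smiles in dict.fromkeys(candidates):  # de-duplicate, keep first-seen order
        props = property_fn(smiles)
        if all(low <= props[name] <= high for name, (low, high) in bounds.items()):
            kept.append(smiles)
    return kept


# Stand-in for RDKit-computed properties so the sketch runs on its own.
TOY_PROPERTIES = {
    "CCO": {"MolWt": 46.07},
    "CCOC(=O)OCC": {"MolWt": 118.13},
    "c1ccccc1": {"MolWt": 78.11},
}
candidates = ["CCO", "CCOC(=O)OCC", "c1ccccc1", "CCO"]
kept = filter_by_bounds(candidates, TOY_PROPERTIES.__getitem__, {"MolWt": (70.0, 120.0)})
print(kept)
```

With the library, the same helper could be called as `filter_by_bounds(mf.generate(200, spec={"MolWt": 250}), mf.properties, {"MolWt": (240, 260)})`. That call assumes `mf.generate` returns a list of SMILES and `mf.properties` returns a dict keyed by descriptor name, which this card does not confirm.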