---
license: cc-by-nc-4.0
library_name: pytorch
tags:
- chemistry
- molecule-generation
- generative-model
- vae
- selfies
- rdkit
- drug-discovery
- electrolyte
- batteries
- cheminformatics
---
# MolForge — a conditional SELFIES-VAE for de-novo molecule & battery-electrolyte design
MolForge is a **conditional variational autoencoder** over
[SELFIES](https://github.com/aspuru-guzik-group/selfies) representations of molecules, with
about 42 million parameters (41,966,682), **trained on 7,116,053 molecules curated from five
public chemistry databases** (Molport, ChEMBL, and ZINC for broad chemical coverage, plus
electrolyte data from OEDB and CALiSol-23). It learns a smooth 256-dimensional latent space you
can **sample, traverse, and optimize**, and because it decodes SELFIES, **essentially 100% of
generated strings are valid molecules** (measured validity 1.000). It is purpose-built for
**de-novo battery-electrolyte design** — generating candidate solvents and additives across
chemistries (Li / Na / K / Mg / Zn / …) and ranking them with a paired electrolyte property
model grounded in real electrolyte data.
- **Code / library:** https://github.com/NealKapadia/molforge
- **Weights (this repo):** `checkpoints/best.pt`
- **Architecture:** embedding 512 → bidirectional GRU encoder (hidden size 1024, 2 layers) →
latent 256 → GRU decoder (hidden size 1024, 2 layers), conditioned on **11 RDKit descriptors**,
with an auxiliary latent→property head. Decoder word dropout of 0.25 (Bowman et al.) keeps the
decoder from ignoring the latent code. SELFIES robust alphabet, 79 tokens, max length 120.
- **Training data — 7,116,053 molecules from five public databases** (filtered to 3–60 heavy
atoms and an organic element set, then de-duplicated):
| Database | Molecules | Role |
|---|---|---|
| Molport "All Stock" | 6,088,143 | core corpus of purchasable molecules |
| ChEMBL-37 (sample) | 800,000 | bioactive chemical diversity |
| ZINC | 227,902 | additional lead-like diversity |
| OEDB + CALiSol-23 (solvents) | 8 | electrolyte solvents in the generator |
| **Total** | **7,116,053** | generative training set |

OEDB and CALiSol-23 additionally provide the electrolyte solvents and **18,918 electrolyte
formulations** (conductivity, coordination, viscosity) that train the separate property model.
The generator was trained with the **default** SELFIES constraints, which allow hexavalent
sulfur and pentavalent phosphorus (S=6 / P=5), so sulfonyl and phosphate electrolyte motifs
survive an encode/decode round trip.
- **Selected checkpoint:** `best.pt`, selected by `val_token_acc + 0.25·valid_rate`.
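
The checkpoint criterion above is a weighted sum of two validation metrics. A minimal sketch of how such a selection could look is shown below; the `EpochMetrics` container, the function names, and the numbers in `history` are illustrative assumptions, not taken from the MolForge training code.

```python
from dataclasses import dataclass


@dataclass
class EpochMetrics:
    """Hypothetical per-epoch record; field names mirror the README formula."""
    epoch: int
    val_token_acc: float  # token-level reconstruction accuracy on validation data
    valid_rate: float     # fraction of sampled molecules that are valid


def selection_score(m: EpochMetrics, valid_weight: float = 0.25) -> float:
    """Score from the README: val_token_acc + 0.25 * valid_rate."""
    return m.val_token_acc + valid_weight * m.valid_rate


def pick_best(history):
    """Return the epoch record with the highest selection score."""
    return max(history, key=selection_score)


# Made-up numbers, only to show the trade-off between the two terms.
history = [
    EpochMetrics(epoch=1, val_token_acc=0.990, valid_rate=1.000),
    EpochMetrics(epoch=2, val_token_acc=0.996, valid_rate=0.980),
    EpochMetrics(epoch=3, val_token_acc=0.995, valid_rate=1.000),
]
best = pick_best(history)
print(best.epoch, round(selection_score(best), 3))
```

In this toy history, epoch 2 has the highest token accuracy but loses to epoch 3 because its lower validity rate costs more than the accuracy gain.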
## Conditioning properties (fixed order)
`MolWt, MolLogP, TPSA, QED, NumHDonors, NumHAcceptors, NumRotatableBonds, NumAromaticRings, NumRings, FractionCSP3, HeavyAtomCount`
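
Because the order is fixed, a property `spec` has to be mapped onto an 11-slot vector before it can condition the model. The sketch below shows one plausible way to do that with z-score normalization. The layout of the statistics dictionary (`{"name": {"mean": ..., "std": ...}}`), the mask for unspecified properties, and the function name are all assumptions; the actual format of `processed/descriptor_stats.json` and the library's handling of missing targets may differ.

```python
PROPERTY_ORDER = [
    "MolWt", "MolLogP", "TPSA", "QED", "NumHDonors", "NumHAcceptors",
    "NumRotatableBonds", "NumAromaticRings", "NumRings", "FractionCSP3",
    "HeavyAtomCount",
]


def build_condition(spec, stats):
    """Map a partial spec onto the fixed property order.

    Returns (values, mask): z-scored targets and a 0/1 mask marking which
    slots were actually specified. Unspecified slots get 0.0, i.e. the mean
    after normalization (an assumption made for this sketch).
    """
    unknown = set(spec) - set(PROPERTY_ORDER)
    if unknown:
        raise KeyError(f"unknown properties: {sorted(unknown)}")
    values, mask = [], []
    for name in PROPERTY_ORDER:
        if name in spec:
            mean, std = stats[name]["mean"], stats[name]["std"]
            values.append((spec[name] - mean) / std)
            mask.append(1.0)
        else:
            values.append(0.0)
            mask.append(0.0)
    return values, mask


# Made-up statistics, for illustration only.
example_stats = {
    "MolWt": {"mean": 350.0, "std": 100.0},
    "QED": {"mean": 0.6, "std": 0.2},
}
values, mask = build_condition({"MolWt": 250, "QED": 0.8}, example_stats)
```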
## Evaluation (best.pt, 5,000 samples @ temperature 0.9)
| Metric | Value |
|---|---|
| Validity | **1.000** |
| Uniqueness | 0.998 |
| Novelty (vs. training set) | 0.995 |
| Internal diversity | 0.894 |
| Reconstruction (exact) | 0.945 |
| Reconstruction (token acc) | 0.998 |
Latent→property head R² (held-out): MolWt 0.994, TPSA 0.977, MolLogP 0.962,
NumHAcceptors / NumRotatableBonds 0.969, QED 0.926, NumHDonors 0.922.
> On the standard generative benchmark columns (validity / uniqueness / novelty /
> diversity) this model is competitive with — and on several columns exceeds — the
> autoregressive **ElectrolyteGPT** (Kim et al., *JACS Au*, 2026, 6, 2288–2302). The
> structural advantage is the **latent space**: smooth interpolation and gradient-based
> property optimization, which a left-to-right token model does not offer.
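
For readers comparing these numbers with other models, the sketch below shows one common way to compute validity, uniqueness, and novelty. The README does not state the exact denominators, so the convention here (uniqueness over valid samples, novelty over unique valid samples) is an assumption, as is the function name. Inputs are expected to be canonical SMILES already, with `None` marking an invalid sample. Internal diversity is left out because it needs molecular fingerprints.

```python
def generation_metrics(samples, training_set):
    """Compute validity / uniqueness / novelty for a batch of generated molecules.

    samples:      list of canonical SMILES strings, or None for invalid samples
    training_set: set of canonical SMILES strings seen during training
    """
    valid = [s for s in samples if s is not None]
    unique = set(valid)
    novel = unique - training_set
    return {
        "validity": len(valid) / len(samples) if samples else 0.0,
        "uniqueness": len(unique) / len(valid) if valid else 0.0,
        "novelty": len(novel) / len(unique) if unique else 0.0,
    }


# Toy batch: one duplicate, one invalid sample, one molecule from the training set.
metrics = generation_metrics(
    ["CCO", "CCO", "CCN", "c1ccccc1", None],
    training_set={"CCO"},
)
print(metrics)
```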
## How MolForge differs from existing models
- **A latent space, not left-to-right text generation.** Autoregressive models (ElectrolyteGPT,
MolGPT) emit one token at a time. MolForge's VAE provides a continuous latent space that you can
**interpolate** and **optimize with gradients** (e.g. "increase molecular weight by 10 while
keeping everything else fixed"), which a token-by-token model does not offer.
- **Validity by construction.** Decoding SELFIES yields essentially **100% valid** molecules
(measured 1.000), whereas SMILES-based models can emit invalid strings.
- **A full inverse-design system, not just a generator.** The generator is paired with a
predictive model (Optuna-tuned), an electrolyte property model, optional LLM guidance, and
literature-grounded retrieval — an end-to-end loop from a plain-English request to a ranked,
scored candidate list.
- **Electrolyte-formulation awareness.** Conductivity, coordination, and viscosity are *system*
properties; MolForge models them at the formulation level (multi-cation), grounded in OEDB and
CALiSol-23 data — most molecule generators ignore this.
- **Multi-database breadth.** Trained across five public databases, not a single catalog.
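
The latent-space point in the first bullet can be made concrete with linear interpolation between two latent vectors. This is a generic sketch on plain Python lists; the helper names are hypothetical, and the README does not say how `mf.encode` represents a latent vector (tensor, array, or list), so any conversion to and from that type is left to the reader.

```python
def lerp(z_start, z_end, t):
    """Linear interpolation between two latent vectors at fraction t in [0, 1]."""
    if len(z_start) != len(z_end):
        raise ValueError("latent vectors must have the same length")
    return [(1.0 - t) * a + t * b for a, b in zip(z_start, z_end)]


def interpolation_path(z_start, z_end, steps):
    """Return `steps` evenly spaced points from z_start to z_end, inclusive."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    return [lerp(z_start, z_end, i / (steps - 1)) for i in range(steps)]


# Two-dimensional toy vectors; MolForge latents are 256-dimensional.
path = interpolation_path([0.0, 2.0], [1.0, 0.0], steps=5)
for z in path:
    print(z)
```

Decoding each point on such a path is what produces a gradual morph from one molecule to another; whether neighboring points decode to chemically similar molecules depends on how smooth the learned latent space is.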
## Files
```
checkpoints/best.pt # the SELFIES-VAE generator weights
checkpoints/electrolyte_model.pt # optional: formulation property model (conductivity etc.)
processed/vocab.json # SELFIES token vocabulary
processed/descriptor_stats.json # descriptor normalization (mean/std)
processed/meta.json # vocab size, max length, property order, constraints
```
This is exactly the layout the `molforge` library expects under `MOLVAE_ART_DIR`.
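
Before loading the model, it can help to confirm that a download actually matches this layout. The helper below is a small sketch using only the standard library. Treating `electrolyte_model.pt` as optional follows the listing above; treating the other four files as strictly required is an assumption, and `missing_artifacts` is a hypothetical name, not part of the `molforge` API.

```python
from pathlib import Path

# Paths taken from the file listing above.
REQUIRED = [
    "checkpoints/best.pt",
    "processed/vocab.json",
    "processed/descriptor_stats.json",
    "processed/meta.json",
]
OPTIONAL = ["checkpoints/electrolyte_model.pt"]


def missing_artifacts(root):
    """Return the required relative paths that are not present under `root`."""
    root = Path(root)
    return [rel for rel in REQUIRED if not (root / rel).is_file()]


# Example: check a local directory (replace "." with your download path).
missing = missing_artifacts(".")
if missing:
    print("missing files:", ", ".join(missing))
```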
## Usage
```bash
pip install "git+https://github.com/NealKapadia/molforge.git"
```
```python
from huggingface_hub import snapshot_download
from molforge import MolForge
art = snapshot_download("NealKapadia/Molforge") # downloads checkpoints/ + processed/
mf = MolForge(device="cpu", artifacts_dir=art) # or device="cuda"
mf.generate(10)                                   # 10 generated SMILES (valid by construction)
mf.generate(5, spec={"MolWt": 250, "QED": 0.8})   # property-targeted (soft conditioning)
z = mf.encode("OCCN(CCO)CCO")                     # molecule -> latent vector
mf.decode(z)                                      # latent vector -> molecule
mf.properties("CCO") # RDKit descriptors
```
Or set the path through the environment instead of passing `artifacts_dir=`:
`export MOLVAE_ART_DIR=/path/to/download` (Windows PowerShell: `$env:MOLVAE_ART_DIR="..."`).
## Limitations & intended use
- **Research / educational use** for molecular design and screening — **not** a substitute
for experimental validation, synthesis feasibility, or safety assessment.
- **Soft conditioning:** `spec` targets nudge generation toward a value; they are not exact.
For hard constraints, over-generate and filter by RDKit-computed properties.
- The generator covers a broad space of **small-to-medium organic molecules**; very small
electrolyte molecules (EC/DEC/MeCN) sit at the edge of that distribution, so for
electrolyte-focused work, specialize the generator by fine-tuning and use it together with the
electrolyte property model.
- The electrolyte property model has labeled data for **Li / Na / K** only; the generator
proposes candidates for any chemistry, but quantitative ranking beyond Li/Na/K needs
additional labeled data.
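
The over-generate-and-filter advice in the soft-conditioning bullet can be sketched as a small loop. The function below takes the generator and the property calculator as plain callables so that it runs without MolForge installed. It assumes the generator returns a list of SMILES strings and the property function returns a dict keyed by descriptor name; neither return type is confirmed by this README, and `generate_filtered` is a hypothetical helper.

```python
def generate_filtered(generate, properties, n_wanted, windows,
                      batch_size=64, max_rounds=20):
    """Sample in batches and keep molecules whose properties fall in all windows.

    generate:   callable(n) -> list of SMILES strings
    properties: callable(smiles) -> dict of descriptor name -> value
    windows:    dict of descriptor name -> (low, high), inclusive bounds
    Stops after `max_rounds` batches even if fewer than n_wanted were found.
    """
    kept, seen = [], set()
    for _ in range(max_rounds):
        for smiles in generate(batch_size):
            if smiles in seen:  # skip duplicates across batches
                continue
            seen.add(smiles)
            props = properties(smiles)
            if all(lo <= props[name] <= hi for name, (lo, hi) in windows.items()):
                kept.append(smiles)
                if len(kept) == n_wanted:
                    return kept
    return kept


# Possible wiring with MolForge (assumes the return types described above):
# hits = generate_filtered(
#     lambda n: mf.generate(n, spec={"MolWt": 250}),
#     mf.properties,
#     n_wanted=10,
#     windows={"MolWt": (240, 260)},
# )
```

The `max_rounds` cap matters when the window is narrow or sits at the edge of the training distribution, where the loop could otherwise run for a long time.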