	---
	license: cc-by-nc-4.0
	library_name: pytorch
	tags:
	- chemistry
	- molecule-generation
	- generative-model
	- vae
	- selfies
	- rdkit
	- drug-discovery
	- electrolyte
	- batteries
	- cheminformatics
	---

	# MolForge — a conditional SELFIES-VAE for de-novo molecule & battery-electrolyte design

	MolForge is a conditional variational autoencoder over
	[SELFIES](https://github.com/aspuru-guzik-group/selfies) representations of molecules, with
	about 42 million parameters (41,966,682), **trained on 7,116,053 molecules curated from five
	public chemistry databases** (Molport, ChEMBL, and ZINC for broad chemical coverage, plus
	electrolyte data from OEDB and CALiSol-23). It learns a smooth 256-dimensional latent space you
	can sample, traverse, and optimize, and because it decodes SELFIES, **essentially 100% of
	generated strings are valid molecules** (measured validity 1.000). It is purpose-built for
	de-novo battery-electrolyte design — generating candidate solvents and additives across
	chemistries (Li / Na / K / Mg / Zn / …) and ranking them with a paired electrolyte property
	model grounded in real electrolyte data.

	- Code / library: https://github.com/NealKapadia/molforge
	- Weights (this repo): `checkpoints/best.pt`
	- Architecture: embedding 512 → bidirectional GRU encoder (1024 × 2 layers) →
	latent 256 → GRU decoder (1024 × 2 layers), conditioned on 11 RDKit descriptors,
	with an auxiliary latent→property head. Decoder word-dropout of 0.25 (Bowman et al.) keeps the
	decoder from ignoring the latent code. SELFIES robust alphabet, 79 tokens, max length 120.
	- Training data — 7,116,053 molecules from five public databases (filtered to 3–60 heavy
	atoms and an organic element set, then de-duplicated):

	| Database | Molecules | Role |
	|---|---|---|
	| Molport "All Stock" | 6,088,143 | core corpus of purchasable molecules |
	| ChEMBL-37 (sample) | 800,000 | bioactive chemical diversity |
	| ZINC | 227,902 | additional lead-like diversity |
	| OEDB + CALiSol-23 (solvents) | 8 | electrolyte solvents in the generator |
	| Total | 7,116,053 | generative training set |

	OEDB and CALiSol-23 additionally provide the electrolyte solvents and **18,918 electrolyte
	formulations** (conductivity, coordination, viscosity) that train the separate property model.
	The generator is trained with the default SELFIES constraints, which allow sulfur a valence of 6
	and phosphorus a valence of 5, so sulfonyl and phosphate electrolyte motifs round-trip.
	- Selected checkpoint: `best.pt`, chosen by the validation score `val_token_acc + 0.25·valid_rate`.
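
The checkpoint-selection rule in the last bullet can be written out in a few lines of Python. This is a minimal sketch, not the training code: the `EpochMetrics` record and the example numbers are hypothetical, and it assumes that the checkpoint with the highest score is the one kept.

```python
from dataclasses import dataclass


@dataclass
class EpochMetrics:
    """Hypothetical per-epoch validation record (names are illustrative)."""
    epoch: int
    val_token_acc: float  # fraction of validation tokens reconstructed correctly
    valid_rate: float     # fraction of sampled strings that decode to valid molecules


def selection_score(m: EpochMetrics, valid_weight: float = 0.25) -> float:
    # Score from the model card: val_token_acc + 0.25 * valid_rate
    return m.val_token_acc + valid_weight * m.valid_rate


def pick_best(history):
    # Assumption: higher is better, so keep the epoch with the largest score.
    return max(history, key=selection_score)


# Made-up numbers, for illustration only.
history = [
    EpochMetrics(epoch=1, val_token_acc=0.90, valid_rate=1.00),
    EpochMetrics(epoch=2, val_token_acc=0.95, valid_rate=0.90),
    EpochMetrics(epoch=3, val_token_acc=0.97, valid_rate=1.00),
]
best = pick_best(history)
print(best.epoch, round(selection_score(best), 3))
```

Weighting `valid_rate` at 0.25 means reconstruction accuracy dominates the score, with validity acting mainly as a tie-breaker between checkpoints of similar accuracy.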

	## Conditioning properties (fixed order)

	`MolWt, MolLogP, TPSA, QED, NumHDonors, NumHAcceptors, NumRotatableBonds, NumAromaticRings, NumRings, FractionCSP3, HeavyAtomCount`
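
Because the order is fixed, a conditioning vector has to be assembled in exactly this sequence. The sketch below shows one way to turn a partial `spec` into a normalized vector. It is an illustration only: the `stats` layout (`{name: {"mean": ..., "std": ...}}`), the made-up statistics, and the choice to fill unspecified properties with the mean (z-score 0) are assumptions, not the documented behaviour of `molforge` or the actual format of `processed/descriptor_stats.json`.

```python
PROPERTY_ORDER = [
    "MolWt", "MolLogP", "TPSA", "QED", "NumHDonors", "NumHAcceptors",
    "NumRotatableBonds", "NumAromaticRings", "NumRings", "FractionCSP3",
    "HeavyAtomCount",
]


def build_condition(spec, stats):
    """Return z-scored targets in PROPERTY_ORDER; unspecified entries become 0.0."""
    unknown = set(spec) - set(PROPERTY_ORDER)
    if unknown:
        raise KeyError(f"unknown properties: {sorted(unknown)}")
    vector = []
    for name in PROPERTY_ORDER:
        if name in spec:
            mean, std = stats[name]["mean"], stats[name]["std"]
            vector.append((spec[name] - mean) / std)
        else:
            vector.append(0.0)  # assumption: fall back to the dataset mean
    return vector


# Hypothetical statistics, NOT the values shipped in descriptor_stats.json.
stats = {name: {"mean": 0.0, "std": 1.0} for name in PROPERTY_ORDER}
stats["MolWt"] = {"mean": 300.0, "std": 100.0}
stats["QED"] = {"mean": 0.6, "std": 0.2}

cond = build_condition({"MolWt": 250, "QED": 0.8}, stats)
print(len(cond), cond[0])
```

Rejecting unknown keys up front catches misspelled descriptor names, which would otherwise be silently ignored.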

	## Evaluation (best.pt, 5,000 samples @ temperature 0.9)

	| Metric | Value |
	|---|---|
	| Validity | 1.000 |
	| Uniqueness | 0.998 |
	| Novelty (vs. training set) | 0.995 |
	| Internal diversity | 0.894 |
	| Reconstruction (exact) | 0.945 |
	| Reconstruction (token acc) | 0.998 |

	Latent→property head R² (held-out): MolWt 0.994, TPSA 0.977, NumHAcceptors and
	NumRotatableBonds 0.969 each, MolLogP 0.962, QED 0.926, NumHDonors 0.922.
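
The card does not spell out how validity, uniqueness, and novelty in the table above were computed. The sketch below uses definitions that are common for generative-chemistry benchmarks and should be read as an assumption: validity is valid / total, uniqueness is distinct valid / valid, and novelty is distinct valid molecules absent from the training set / distinct valid. Inputs are taken to be canonical SMILES, with `None` marking a failed decode.

```python
def generation_metrics(samples, training_set):
    """samples: canonical SMILES strings, or None where decoding failed.
    training_set: set of canonical SMILES seen during training."""
    valid = [s for s in samples if s is not None]
    unique = set(valid)
    novel = unique - training_set
    return {
        "validity": len(valid) / len(samples) if samples else 0.0,
        "uniqueness": len(unique) / len(valid) if valid else 0.0,
        "novelty": len(novel) / len(unique) if unique else 0.0,
    }


# Toy data, for illustration only.
training_set = {"CCO", "CC(=O)O"}
samples = ["CCO", "CCN", "CCN", "C1CCOC1", None]
metrics = generation_metrics(samples, training_set)
print(metrics)
```

Comparing strings only works if both the samples and the training set are canonicalized the same way, since one molecule can be written as many different SMILES.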

	> On the standard generative benchmark columns (validity / uniqueness / novelty /
	> diversity) this model is competitive with — and on several columns exceeds — the
	> autoregressive ElectrolyteGPT (Kim et al., JACS Au, 2026, 6, 2288–2302). The
	> structural advantage is the latent space: smooth interpolation and gradient-based
	> property optimization, which a left-to-right token model does not offer.

	## How MolForge differs from existing models

	- A latent space, not left-to-right text generation. Autoregressive models (ElectrolyteGPT,
	MolGPT) emit one token at a time. MolForge's VAE provides a continuous latent space that you can
	interpolate and optimize with gradients (e.g. "increase molecular weight by 10 while
	keeping everything else fixed"), which a token-by-token model does not offer.
	- Validity by construction. Decoding SELFIES yields essentially 100% valid molecules
	(measured 1.000), whereas SMILES-based models can emit invalid strings.
	- A full inverse-design system, not just a generator. The generator is paired with a
	predictive model (Optuna-tuned), an electrolyte property model, optional LLM guidance, and
	literature-grounded retrieval — an end-to-end loop from a plain-English request to a ranked,
	scored candidate list.
	- Electrolyte-formulation awareness. Conductivity, coordination, and viscosity are system
	properties; MolForge models them at the formulation level (multi-cation), grounded in OEDB and
	CALiSol-23 data — most molecule generators ignore this.
	- Multi-database breadth. Trained across five public databases, not a single catalog.

	## Files

	```
	checkpoints/best.pt # the SELFIES-VAE generator weights
	checkpoints/electrolyte_model.pt # optional: formulation property model (conductivity etc.)
	processed/vocab.json # SELFIES token vocabulary
	processed/descriptor_stats.json # descriptor normalization (mean/std)
	processed/meta.json # vocab size, max length, property order, constraints
	```

	This is exactly the layout the `molforge` library expects under `MOLVAE_ART_DIR`.
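
Before constructing the model it can help to confirm that a download matches this layout. The helper below is a hypothetical convenience, not part of `molforge`. It only checks that the files listed above exist under a given directory, and it treats `electrolyte_model.pt` as optional, as the listing does.

```python
import tempfile
from pathlib import Path

REQUIRED = [
    "checkpoints/best.pt",
    "processed/vocab.json",
    "processed/descriptor_stats.json",
    "processed/meta.json",
]
OPTIONAL = ["checkpoints/electrolyte_model.pt"]


def missing_artifacts(root):
    """Return the required files that are absent under `root`."""
    root = Path(root)
    return [rel for rel in REQUIRED if not (root / rel).is_file()]


def has_electrolyte_model(root):
    """True if the optional formulation property model is present."""
    return all((Path(root) / rel).is_file() for rel in OPTIONAL)


# Demo on a throwaway directory holding only the first two required files.
with tempfile.TemporaryDirectory() as tmp:
    for rel in REQUIRED[:2]:
        path = Path(tmp) / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch()
    demo_missing = missing_artifacts(tmp)
    demo_optional = has_electrolyte_model(tmp)

print(demo_missing, demo_optional)
```

An empty list from `missing_artifacts` means every required file is in place.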

	## Usage

	```bash
	pip install "git+https://github.com/NealKapadia/molforge.git"
	```

	```python
	from huggingface_hub import snapshot_download
	from molforge import MolForge

	art = snapshot_download("NealKapadia/Molforge") # downloads checkpoints/ + processed/
	mf = MolForge(device="cpu", artifacts_dir=art) # or device="cuda"

	mf.generate(10) # 10 valid, novel SMILES
	mf.generate(5, spec={"MolWt": 250, "QED": 0.8}) # property-targeted
	z = mf.encode("OCCN(CCO)CCO") # encode triethanolamine into the latent space
	mf.decode(z) # decode it back (latent round-trip)
	mf.properties("CCO") # RDKit descriptors
	```

	Or set the path through an environment variable instead of passing `artifacts_dir=`:
	`export MOLVAE_ART_DIR=/path/to/download` (PowerShell on Windows: `$env:MOLVAE_ART_DIR="..."`).

	## Limitations & intended use

	- Research / educational use for molecular design and screening. It is not a substitute
	for experimental validation, synthesis-feasibility checks, or safety assessment.
	- Soft conditioning: `spec` targets nudge generation toward a value; they are not exact.
	For hard constraints, over-generate and filter by RDKit-computed properties.
	- The generator covers a broad space of small-to-medium organic molecules; very small
	electrolyte molecules (EC/DEC/MeCN) sit at the edge of that distribution, so for tight
	electrolyte focus, specialize via fine-tuning plus the electrolyte property model.
	- The electrolyte property model has labeled data for Li / Na / K only; the generator
	proposes candidates for any chemistry, but quantitative ranking beyond Li/Na/K needs
	additional labeled data.
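
The over-generate-and-filter approach suggested for hard constraints can be sketched as follows. The function takes the generator and the property calculator as plain callables so that it runs standalone. In practice these would presumably be `mf.generate` and `mf.properties`, under the assumption (not confirmed by this card) that `generate` returns a list of SMILES and `properties` returns a dict keyed by descriptor name. The function name, the `oversample` factor, and the tolerance format are hypothetical.

```python
def generate_within(generate_fn, props_fn, n, spec, tol, oversample=10):
    """Over-generate with soft targets, then keep molecules whose computed
    properties fall within tol[name] of every target in spec."""
    candidates = generate_fn(n * oversample, spec=spec)
    kept = []
    for smi in dict.fromkeys(candidates):  # drop duplicates, keep order
        props = props_fn(smi)
        if all(abs(props[name] - target) <= tol[name]
               for name, target in spec.items()):
            kept.append(smi)
        if len(kept) == n:
            break
    return kept  # may hold fewer than n if too few candidates pass


# Stand-ins so the sketch runs without the model; all values are made up.
fake_props = {"A": {"MolWt": 251.0}, "B": {"MolWt": 310.0}, "C": {"MolWt": 245.0}}


def fake_generate(count, spec=None):
    return (["A", "B", "A", "C"] * count)[:count]


hits = generate_within(fake_generate, fake_props.__getitem__, n=2,
                       spec={"MolWt": 250}, tol={"MolWt": 10.0})
print(hits)
```

If the filter returns fewer molecules than requested, raising `oversample` or widening the tolerances are the two obvious knobs to turn.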