---
license: cc-by-nc-nd-4.0
extra_gated_prompt: "By submitting any personal information (e.g., name, contact details), you agree to the collection and processing of this data
for the purpose of evaluating access requests for this model. Repository authors will store this data securely and will not share it with third parties
without your explicit consent. You retain all rights to your personal information and may request its deletion at any time.\n\n
By accessing the repository you agree not to use this model in experiments which may result in harm to human or animal subjects.
"
extra_gated_fields:
  Date of Agreement: date_picker
  I accept the terms of the license and I agree not to use this model for commercial purposes or profit generation: checkbox
tags:
- molecular-generation
- diffusion-models
- cheminformatics
- 3D-conformer
- rdkit
- non-commercial
language: en
library_name: mlconfgen
datasets:
- ChEMBL
metrics:
- shape-tanimoto
- validity
- uniqueness
- novelty
- Fréchet Distance
model-index:
- name: ML Conformer Generator
  results:
  - task:
      type: molecular-generation
      name: 3D Conformer Generation
    dataset:
      name: ChEMBL (filtered)
      type: molecules
    metrics:
    - name: Valid molecules
      type: validity
      value: 48-93%
    - name: Chemical novelty
      type: novelty
      value: 99.84%
    - name: Shape Tanimoto Similarity (avg)
      type: shape-tanimoto
      value: 53.32%
    - name: Shape Tanimoto Similarity (max)
      type: shape-tanimoto
      value: 99.69%
    - name: Average Synthesis Access score
      type: sa_score
      value: 3.18
    - name: Unique molecules
      type: uniqueness
      value: 99.94%
    - name: Fréchet Fingerprint Distance
      type: Fréchet Distance
      value: 4.13
---
# ML Conformer Generator
[![DOI](https://img.shields.io/badge/DOI-10.1039%2FD5DD00318K-blue)](https://doi.org/10.1039/D5DD00318K)
<img src="./mlconfgen_logo.png" width="200" style="display: block; margin: 0 10%;">
**ML Conformer Generator** is a shape-constrained molecule generation model that combines
an Equivariant Diffusion Model (EDM) and Graph Convolutional Network (GCN). It generates 3D conformations
that are chemically valid and geometrically aligned with a reference shape.
---
## 📦 Model Summary
- **Architecture**: Equivariant Diffusion Model (EDM) + Graph Convolutional Network (GCN)
- **Training Data**: 1.6 million ChEMBL compounds, filtered for molecules with 15–39 heavy atoms
- **Post-Processing**: Deterministic standardization pipeline using RDKit with constrained MMFF94 geometry optimization
- **Primary Metric**: Shape Tanimoto Similarity
- **Developed by:** Denis Sapegin
---
## 🚀 Intended Use
- Non-commercial research in 3D molecular generation
- Academic/educational use
- Generation of molecules similar in shape to a reference conformer
- Generation of molecules that fit an arbitrary reference shape
---
## 🚫 Out of Scope / Limitations
- **Commercial Use**: Not licensed for commercial use without explicit permission.
- **Training Bias**: Trained on ChEMBL data — results may be biased toward drug-like molecules and chemistries.
- **Elements Supported**: Only the following elements are supported for generation: `H`, `C`, `N`, `O`, `F`, `P`, `S`, `Cl`, `Br`.
- **Molecular Size Limitations**:
  - Trained on molecules containing **15–39 heavy atoms**.
  - By architectural design, the model can **only generate molecules with up to 42 heavy atoms**.
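
The element and size limits above can be turned into a quick pre-flight check. The sketch below is a hypothetical helper, not part of `mlconfgen`: `check_molecule` and its constants are illustrative names, and it works on plain element symbols so it runs without any dependencies. It assumes that a molecule outside the 15–39 heavy-atom training range is worth a warning rather than an error.

```python
# Limits copied from the model card; the helper itself is illustrative only.
SUPPORTED_ELEMENTS = {"H", "C", "N", "O", "F", "P", "S", "Cl", "Br"}
TRAINED_HEAVY_ATOMS = range(15, 40)  # 15-39 heavy atoms, inclusive
MAX_HEAVY_ATOMS = 42


def check_molecule(symbols):
    """Return a list of warnings for a molecule given as element symbols."""
    warnings = []
    unsupported = sorted(set(symbols) - SUPPORTED_ELEMENTS)
    if unsupported:
        warnings.append(f"unsupported elements: {', '.join(unsupported)}")
    heavy = sum(1 for s in symbols if s != "H")
    if heavy > MAX_HEAVY_ATOMS:
        warnings.append(
            f"{heavy} heavy atoms exceeds the generation limit of {MAX_HEAVY_ATOMS}"
        )
    elif heavy not in TRAINED_HEAVY_ATOMS:
        warnings.append(f"{heavy} heavy atoms is outside the 15-39 training range")
    return warnings


# With RDKit, the symbols could be collected as:
#   symbols = [atom.GetSymbol() for atom in mol.GetAtoms()]
print(check_molecule(["C"] * 20 + ["O", "N"] + ["H"] * 30))
print(check_molecule(["C"] * 10 + ["Si"]))
```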
---
## 🧪 Evaluation Metrics (100,000 requested samples, 100 denoising steps)
- **Valid molecules (post-standardization, % from requested)**: 48%
- 🧬 **Chemical novelty**: 99.84%
- 📐 **Avg Shape Tanimoto**: 53.32%
- 🎯 **Max Shape Tanimoto**: 99.69%
- 🔁 **Unique molecules**: 99.94%
- **Generation speed**: 4.18 valid molecules/sec (NVIDIA H100)
- 💾 **Memory (per thread)**: up to 4.0 GB
- 🧬 **Fréchet Fingerprint Distance (to ChEMBL)**: 4.13
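
As a rough illustration of how the validity, uniqueness, and novelty figures above relate to each other, the sketch below computes them from lists of canonical SMILES strings. Only validity ("% from requested") is defined in this card. The uniqueness and novelty formulas are common conventions assumed here, and `generation_metrics` is a hypothetical helper, not an `mlconfgen` function.

```python
def generation_metrics(requested, valid_smiles, training_smiles):
    """Compute validity, uniqueness, and novelty as fractions.

    Assumed definitions (only validity is stated in the model card):
      validity   = valid molecules / requested samples
      uniqueness = distinct valid molecules / valid molecules
      novelty    = distinct valid molecules not in training / distinct valid molecules
    """
    unique = set(valid_smiles)
    novel = unique - set(training_smiles)
    return {
        "validity": len(valid_smiles) / requested,
        "uniqueness": len(unique) / len(valid_smiles) if valid_smiles else 0.0,
        "novelty": len(novel) / len(unique) if unique else 0.0,
    }


# Toy data: 10 samples requested, 5 survived standardization, one duplicate,
# and one molecule that already occurs in the (toy) training set.
metrics = generation_metrics(
    requested=10,
    valid_smiles=["CCO", "CCO", "c1ccccc1", "CCN", "CC(=O)O"],
    training_smiles={"CCO"},
)
print(metrics)
```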
---
## 🧠 How It Works
### Core Components:
- **EDM** generates atom coordinates and types under shape constraints
- **GCN** predicts adjacency matrices (bonding)
- **RDKit** pipeline enforces valence, performs sanitization, and optimizes geometry
### Shape Alignment:
Evaluated using **Gaussian molecular volume overlap** and **Shape Tanimoto Similarity**.
Hydrogens are excluded from similarity computation.
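
For intuition, Shape Tanimoto can be written as `V_AB / (V_AA + V_BB - V_AB)`, where `V_XY` is the overlap volume between two molecules. The sketch below is a minimal first-order Gaussian approximation, not the implementation used by `mlconfgen`. It assumes the two heavy-atom coordinate sets are already aligned, and it gives every atom the same Gaussian width `alpha` and amplitude `p`, both of which are illustrative placeholder values.

```python
import math


def gaussian_overlap(coords_a, coords_b, alpha=0.8, p=2.7):
    """First-order Gaussian overlap volume between two sets of atom centres.

    Each atom is modelled as p * exp(-alpha * |r - R|^2). The overlap integral
    of two such Gaussians with equal alpha, at squared distance d2, is
    p^2 * (pi / (2 * alpha))^(3/2) * exp(-alpha * d2 / 2).
    """
    prefactor = p * p * (math.pi / (2.0 * alpha)) ** 1.5
    total = 0.0
    for a in coords_a:
        for b in coords_b:
            d2 = sum((x - y) ** 2 for x, y in zip(a, b))
            total += prefactor * math.exp(-0.5 * alpha * d2)
    return total


def shape_tanimoto(coords_a, coords_b):
    """Shape Tanimoto from pairwise Gaussian overlaps (pre-aligned inputs)."""
    v_ab = gaussian_overlap(coords_a, coords_b)
    v_aa = gaussian_overlap(coords_a, coords_a)
    v_bb = gaussian_overlap(coords_b, coords_b)
    return v_ab / (v_aa + v_bb - v_ab)


# Heavy-atom coordinates only (in angstroms), hydrogens left out.
mol_a = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
mol_b = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (2.2, 1.3, 0.0)]
print(shape_tanimoto(mol_a, mol_a))  # identical shapes overlap fully
print(shape_tanimoto(mol_a, mol_b))  # partial overlap, between 0 and 1
```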
---
## 💾 Access & Licensing
The **Python package and inference code are available on GitHub** under the Apache 2.0 License:
> https://github.com/Membrizard/ml_conformer_generator

The trained model **weights** are available on Hugging Face under the CC BY-NC-ND 4.0 license:
> https://huggingface.co/Membrizard/ml_conformer_generator

Use of the trained weights for any profit-generating activity is restricted.
For commercial licensing and inference-as-a-service, contact:
[Denis Sapegin](https://github.com/Membrizard)
---
## Citation
If you use **MLConfGen** in your research, please cite:
Denis Sapegin, Fedor Bakharev, Dmitry Krupenya, Azamat Gafurov, Konstantin Pildish, and Joseph C. Bear.
*Moment of inertia as a simple shape descriptor for diffusion-based shape-constrained molecular generation.*
Digital Discovery, 2025.
DOI: [10.1039/D5DD00318K](https://doi.org/10.1039/D5DD00318K)
---
## Installation
1. Install the package:
`pip install mlconfgen`
2. Download the weights from Hugging Face:
> https://huggingface.co/Membrizard/ml_conformer_generator
**PyTorch**
`edm_moi_chembl_15_39.pt`
`adj_mat_seer_chembl_15_39.pt`
**ONNX**
`edm_moi_chembl_15_39.onnx`
`adj_mat_seer_chembl_15_39.onnx`
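
The weights can also be fetched programmatically. The sketch below uses `hf_hub_download` from the `huggingface_hub` package, which is a separate install and not stated as a dependency of `mlconfgen`. It assumes that access to the gated repository has already been granted and that you are authenticated with Hugging Face. `download_weights` is a hypothetical helper name, and only the PyTorch file names listed above are used.

```python
REPO_ID = "Membrizard/ml_conformer_generator"
PYTORCH_WEIGHTS = ("edm_moi_chembl_15_39.pt", "adj_mat_seer_chembl_15_39.pt")


def download_weights(filenames=PYTORCH_WEIGHTS, local_dir="."):
    """Download the given weight files and return their local paths."""
    # Imported lazily so this module loads even without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    return [
        hf_hub_download(repo_id=REPO_ID, filename=name, local_dir=local_dir)
        for name in filenames
    ]


# Example call (requires network access and an accepted access request):
#   edm_path, adj_mat_path = download_weights()
```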
---
## 🐍 Python API
**PyTorch**
```python
from rdkit import Chem
from mlconfgen import MLConformerGenerator, evaluate_samples

# Load the PyTorch weights downloaded from Hugging Face
model = MLConformerGenerator(
    edm_weights="./edm_moi_chembl_15_39.pt",
    adj_mat_seer_weights="./adj_mat_seer_chembl_15_39.pt",
    diffusion_steps=100,
)

# Reference conformer that defines the target shape
reference = Chem.MolFromMolFile('ceyyag.mol')

# Request 20 samples constrained to the shape of the reference
samples = model.generate_conformers(reference_conformer=reference, n_samples=20, variance=2)

# Evaluate the generated samples against the reference
aligned_reference, std_samples = evaluate_samples(reference, samples)
```
---
**ONNX**
```python
from mlconfgen import MLConformerGeneratorONNX
from rdkit import Chem

# Load the ONNX weights downloaded from Hugging Face
model = MLConformerGeneratorONNX(
    egnn_onnx="./egnn_chembl_15_39.onnx",
    adj_mat_seer_onnx="./adj_mat_seer_chembl_15_39.onnx",
    diffusion_steps=100,
)

# Reference conformer that defines the target shape
reference = Chem.MolFromMolFile('ceyyag.mol')

# Request 20 samples constrained to the shape of the reference
samples = model.generate_conformers(reference_conformer=reference, n_samples=20, variance=2)
```