---
license: mit
---
## MorganGen
To use it, first clone the Hugging Face Git repository so that the helper module `utils` used in the examples below is available:
```commandline
git lfs install # Only once if not done already
git clone https://huggingface.co/lamthuy/MorganGen
```
A generative model trained on 120 million SMILES strings from the ZINC database. The model takes as input a sequence of indices representing the active bits in a 2048-bit Morgan fingerprint. Each index corresponds to a bit set to 1, while all other bits are 0. For example, the sequence
```
s = [12][184][1200]
```
represents a fingerprint where only bits 12, 184, and 1200 are set to 1, and the remaining bits are 0.
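As a minimal illustration of this format, the conversion can be sketched in plain Python (a hypothetical helper shown here for clarity; the repository's `morgan_fingerprint_to_text` may differ in its details):

```python
# Hypothetical helper mirroring the bracketed-index format described above.
def fingerprint_to_text(bits):
    """Render a fingerprint (an iterable of 0/1 values) as bracketed active-bit indices."""
    return "".join(f"[{i}]" for i, b in enumerate(bits) if b)

# A 2048-bit fingerprint with only bits 12, 184, and 1200 set
fp = [0] * 2048
for idx in (12, 184, 1200):
    fp[idx] = 1

print(fingerprint_to_text(fp))  # [12][184][1200]
```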
## Running example
The following snippet, also available in the repository notebook, demonstrates how to load the model from its checkpoint and generate a new SMILES string conditioned on a given input SMILES.
```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from utils import MorganFingerprint, morgan_fingerprint_to_text
# Load the checkpoint and the tokenizer
checkpoint_path = "lamthuy/MorganGen"
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint_path)
tokenizer = AutoTokenizer.from_pretrained(checkpoint_path)
# Given a SMILES string, compute its Morgan fingerprint
smiles = "CC(=O)OC1=CC=CC=C1C(=O)O"
m = MorganFingerprint()
mf = m.smiles_to_morgan(smiles)
# Convert it to the bracketed-index text format
s = morgan_fingerprint_to_text(mf)
# Encode the text into token IDs
input_ids = tokenizer.encode(s, return_tensors="pt")
# Generate output sequence
output_ids = model.generate(input_ids, max_length=64, num_beams=5)
# Decode the generated output
output_smiles = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(output_smiles)
```
## Reference
```bibtex
@inproceedings{hoang2024morgangen,
  title={MorganGen: Generative Modeling of SMILES Using Morgan Fingerprint Features},
  author={Hoang, Lam Thanh and D{\'\i}az, Ra{\'u}l Fern{\'a}ndez and Lopez, Vanessa},
  booktitle={American Chemical Society (ACS) Fall Meeting},
  year={2024}
}
```