---
license: mit
---

## MorganGen

A generative model trained on 120 million SMILES strings from the ZINC database. The model takes as input a sequence of indices representing the active bits in a 2048-bit Morgan fingerprint; each index corresponds to a bit set to 1, while all other bits are 0. For example,

```
s = [12][184][1200]
```

represents a fingerprint where only bits 12, 184, and 1200 are set to 1 and all remaining bits are 0.

To use the model, clone the Hugging Face Git repo before running the following examples:

```commandline
git lfs install  # only needed once per machine
git clone https://huggingface.co/lamthuy/MorganGen
```

# Running example

The following code snippet demonstrates how to load the model from a checkpoint and generate a new SMILES string conditioned on a given input SMILES.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from utils import MorganFingerprint, morgan_fingerprint_to_text

# Load the checkpoint and the tokenizer
checkpoint_path = "lamthuy/MorganGen"
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint_path)
tokenizer = AutoTokenizer.from_pretrained(checkpoint_path)

# Given a SMILES string, compute its Morgan fingerprint
smiles = "CC(=O)OC1=CC=CC=C1C(=O)O"  # aspirin
m = MorganFingerprint()
mf = m.smiles_to_morgan(smiles)

# Convert the fingerprint to the bracketed-indices text format
s = morgan_fingerprint_to_text(mf)

# Encode the text into token ids
input_ids = tokenizer.encode(s, return_tensors="pt")

# Generate an output sequence with beam search
output_ids = model.generate(input_ids, max_length=64, num_beams=5)

# Decode the generated output into a SMILES string
output_smiles = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(output_smiles)
```

# Reference

```
@inproceedings{hoang2024morgangen,
  title={MorganGen: Generative Modeling of SMILES Using Morgan Fingerprint Features},
  author={Hoang, Lam Thanh and D{\'\i}az, Ra{\'u}l Fern{\'a}ndez and Lopez, Vanessa},
  booktitle={American Chemical Society (ACS) Fall Meeting},
  year={2024}
}
```
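# Note: the bracketed-indices format

The bracketed-indices input format described above is easy to produce and parse without the repo's `utils` module. The sketch below uses hypothetical helper names (`indices_to_text`, `text_to_indices`) that are not part of the repo; the actual `morgan_fingerprint_to_text` implementation may differ.

```python
import re

def indices_to_text(indices):
    """Format active Morgan-fingerprint bit indices as the model's
    bracketed input string, e.g. [12, 184, 1200] -> "[12][184][1200]".
    (Hypothetical helper, not part of the MorganGen repo.)"""
    return "".join(f"[{i}]" for i in sorted(indices))

def text_to_indices(s):
    """Parse a bracketed string back into a list of bit indices."""
    return [int(t) for t in re.findall(r"\[(\d+)\]", s)]

# Round trip for the example fingerprint from this card
s = indices_to_text([12, 184, 1200])
print(s)                   # [12][184][1200]
print(text_to_indices(s))  # [12, 184, 1200]
```

Only the indices of set bits are serialized, so a sparse 2048-bit fingerprint becomes a short token sequence.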