---

license: mit
---

## MorganGen
Clone the Hugging Face repository before running the examples below:

```commandline
git lfs install  # Only once if not done already
git clone https://huggingface.co/lamthuy/MorganGen
```

MorganGen is a generative model trained on 120 million SMILES strings from the ZINC database. The model takes as input a sequence of indices representing the active bits in a 2048-bit Morgan fingerprint. Each index corresponds to a bit set to 1; all other bits are 0. For example,
```
s = [12][184][1200]
```
represents a fingerprint in which only bits 12, 184, and 1200 are set to 1 and the remaining 2045 bits are 0.
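
The index format described above can be illustrated without any chemistry dependencies. The following is a minimal sketch, not the repository's own code: `indices_to_text` and `text_to_bits` are hypothetical helpers introduced here for illustration, and sorting the indices in ascending order is an assumption about the expected input order.

```python
import re

N_BITS = 2048  # fingerprint length stated in the description above


def indices_to_text(indices):
    """Format active-bit indices as a bracketed string such as [12][184][1200].

    Hypothetical helper; ascending order is an assumption.
    """
    unique = sorted(set(indices))
    if any(i < 0 or i >= N_BITS for i in unique):
        raise ValueError(f"bit indices must lie in [0, {N_BITS})")
    return "".join(f"[{i}]" for i in unique)


def text_to_bits(text):
    """Expand a bracketed index string into a 0/1 list of length N_BITS."""
    bits = [0] * N_BITS
    for token in re.findall(r"\[(\d+)\]", text):
        index = int(token)
        if index >= N_BITS:
            raise ValueError(f"bit index {index} is out of range")
        bits[index] = 1
    return bits


s = indices_to_text([1200, 12, 184])
bits = text_to_bits(s)
print(s)          # [12][184][1200]
print(sum(bits))  # 3
```

Round-tripping through `text_to_bits` shows that the text form carries the same information as the full 2048-bit vector while staying short for sparse fingerprints.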
# Running example
The following snippet loads the model from a checkpoint and generates a new SMILES string conditioned on a given input SMILES.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from utils import MorganFingerprint, morgan_fingerprint_to_text

# Load the checkpoint and the tokenizer
checkpoint_path = "lamthuy/MorganGen"
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint_path)
tokenizer = AutoTokenizer.from_pretrained(checkpoint_path)

# Given a SMILES, get its fingerprint
smiles = "CC(=O)OC1=CC=CC=C1C(=O)O"
m = MorganFingerprint()
mf = m.smiles_to_morgan(smiles)

# Convert it to the indices text format
s = morgan_fingerprint_to_text(mf)

# Encode the text as input token IDs
input_ids = tokenizer.encode(s, return_tensors="pt")

# Generate the output sequence with beam search
output_ids = model.generate(input_ids, max_length=64, num_beams=5)

# Decode the generated output
output_smiles = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(output_smiles)
```
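
One way to check a generated molecule is to recompute its fingerprint with the same helpers used in the snippet above and compare it with the input fingerprint. The sketch below assumes both fingerprints are already in the bracketed index text format and scores their overlap with the Tanimoto coefficient (shared active bits divided by all active bits). `parse_indices` and `tanimoto` are hypothetical helpers, not part of the repository, and the two example strings are made up for illustration.

```python
import re


def parse_indices(text):
    """Return the set of active-bit indices in a bracketed index string."""
    return {int(token) for token in re.findall(r"\[(\d+)\]", text)}


def tanimoto(text_a, text_b):
    """Tanimoto coefficient between two fingerprints in index text format."""
    a, b = parse_indices(text_a), parse_indices(text_b)
    union = a | b
    if not union:
        # Assumption: two empty fingerprints are treated as identical.
        return 1.0
    return len(a & b) / len(union)


target = "[12][184][1200]"     # fingerprint given to the model (made up)
generated = "[12][184][1999]"  # fingerprint of the generated SMILES (made up)
score = tanimoto(target, generated)
print(f"{score:.2f}")  # 0.50: two shared bits out of four distinct bits
```

A score of 1.0 means the generated molecule reproduces the input fingerprint exactly; lower values indicate that some active bits differ.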

# Reference
```
@inproceedings{hoang2024morgangen,
  title={MorganGen: Generative Modeling of SMILES Using Morgan Fingerprint Features},
  author={Hoang, Lam Thanh and D{\'\i}az, Ra{\'u}l Fern{\'a}ndez and Lopez, Vanessa},
  booktitle={American Chemical Society (ACS) Fall Meeting},
  year={2024}
}
```