---
license: mit
tags:
- chemistry
- molecule
- drug
---

# Model Card for Roberta Zinc 480m

### Model Description

`roberta_zinc_480m` is a ~102m parameter RoBERTa-style masked language model trained on ~480m SMILES 
strings from the [ZINC database](https://zinc.docking.org/). This model is useful for 
generating embeddings from SMILES strings.

- **Developed by:** Karl Heyer
- **License:** MIT


### Direct Use

Usage examples. Note that input SMILES strings should be canonicalized.
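
Canonicalization can be done with RDKit (not shown in the original examples; this is a minimal sketch assuming RDKit is installed, and `canonicalize` is a hypothetical helper name):

```python
from rdkit import Chem

def canonicalize(smiles: str) -> str:
    """Return the RDKit canonical form of a SMILES string."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Invalid SMILES: {smiles}")
    return Chem.MolToSmiles(mol)

print(canonicalize("C1=CC=CC=C1"))  # benzene -> "c1ccccc1"
```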

With the Transformers library:

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("entropy/roberta_zinc_480m")
roberta_zinc = AutoModel.from_pretrained("entropy/roberta_zinc_480m", 
                                         add_pooling_layer=False) # model was not trained with a pooler

# smiles should be canonicalized
smiles = [
    "Brc1cc2c(NCc3ccccc3)ncnc2s1",
    "Brc1cc2c(NCc3ccccn3)ncnc2s1",
    "Brc1cc2c(NCc3cccs3)ncnc2s1",
    "Brc1cc2c(NCc3ccncc3)ncnc2s1",
    "Brc1cc2c(Nc3ccccc3)ncnc2s1"
]

batch = tokenizer(smiles, return_tensors='pt', padding=True, pad_to_multiple_of=8)

# mean pooling over token embeddings, excluding padding positions
outputs = roberta_zinc(**batch)
full_embeddings = outputs.last_hidden_state
mask = batch['attention_mask']
embeddings = (full_embeddings * mask.unsqueeze(-1)).sum(1) / mask.sum(-1).unsqueeze(-1)
```
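
The masked mean-pooling step can be sanity-checked in isolation with toy tensors (shapes here are for illustration only, not the model's real dimensions):

```python
import torch

# toy hidden states: batch of 2 sequences, 4 tokens each, hidden size 3
hidden = torch.arange(24, dtype=torch.float32).reshape(2, 4, 3)
mask = torch.tensor([[1, 1, 1, 1],
                     [1, 1, 0, 0]])  # second sequence has 2 padding tokens

# masked mean: zero out padding positions, divide by real token count
pooled = (hidden * mask.unsqueeze(-1)).sum(1) / mask.sum(-1, keepdim=True)
print(pooled[1])  # mean of the first two token vectors only
```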

With Sentence Transformers:

```python
from sentence_transformers import models, SentenceTransformer

transformer = models.Transformer("entropy/roberta_zinc_480m", 
                                 max_seq_length=256, 
                                 model_args={"add_pooling_layer": False})

pooling = models.Pooling(transformer.get_word_embedding_dimension(), 
                         pooling_mode="mean")

model = SentenceTransformer(modules=[transformer, pooling])

# smiles should be canonicalized
smiles = [
    "Brc1cc2c(NCc3ccccc3)ncnc2s1",
    "Brc1cc2c(NCc3ccccn3)ncnc2s1",
    "Brc1cc2c(NCc3cccs3)ncnc2s1",
    "Brc1cc2c(NCc3ccncc3)ncnc2s1",
    "Brc1cc2c(Nc3ccccc3)ncnc2s1"
]

embeddings = model.encode(smiles, convert_to_tensor=True)
```
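
Embeddings produced by either route can then be compared with cosine similarity. A minimal sketch, using random tensors as stand-ins for real model output:

```python
import torch
import torch.nn.functional as F

# stand-in for model output: 5 embeddings of dimension 768
embeddings = torch.randn(5, 768)

# pairwise cosine similarity: normalize rows, then take dot products
normed = F.normalize(embeddings, dim=-1)
sims = normed @ normed.T  # (5, 5), diagonal ~1.0
```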

### Training Procedure

#### Preprocessing

~480m SMILES strings were randomly sampled from the [ZINC database](https://zinc.docking.org/), 
weighted by tranche size (i.e., more SMILES were sampled from larger tranches). The SMILES were 
canonicalized, then used to train the tokenizer.

#### Training Hyperparameters

The model was trained with a cross-entropy loss for 150,000 iterations with a batch size of 
4096. The model achieved a validation loss of ~0.122.
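
As a rough intuition, a masked-LM cross-entropy loss converts to perplexity via `exp(loss)`; at the reported validation loss:

```python
import math

val_loss = 0.122  # reported validation loss
perplexity = math.exp(val_loss)
print(round(perplexity, 3))  # ~1.13, i.e. near-certain token predictions
```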

### Downstream Models

#### Decoder

There is a [decoder model](https://huggingface.co/entropy/roberta_zinc_decoder) trained to reconstruct 
inputs from embeddings generated with this model.

#### Compression Encoder

There is a [compression encoder model](https://huggingface.co/entropy/roberta_zinc_compression_encoder) 
trained to compress embeddings generated by this model from the native size of 768 to 
smaller sizes (512, 256, 128, 64, 32) while preserving cosine similarity between embeddings.
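
One way to gauge how well a compression preserves cosine similarity is to correlate the pairwise similarity matrices before and after. A sketch with stand-in data and a random projection in place of the learned encoder (the real compression encoder is a trained model, not a random matrix):

```python
import torch
import torch.nn.functional as F

def pairwise_cos(x):
    x = F.normalize(x, dim=-1)
    return x @ x.T

# stand-in data with cluster structure: 10 groups of 10 similar vectors
torch.manual_seed(0)
base = torch.randn(10, 768).repeat_interleave(10, dim=0)
full = base + 0.1 * torch.randn(100, 768)

# stand-in for a learned 768 -> 64 compression (here: a random projection)
proj = torch.randn(768, 64) / 768 ** 0.5
small = full @ proj

# correlation between full-size and compressed similarity matrices
corr = torch.corrcoef(torch.stack([pairwise_cos(full).flatten(),
                                   pairwise_cos(small).flatten()]))[0, 1]
print(float(corr))  # close to 1 when similarity structure is preserved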

#### Decomposer

There is an [embedding decomposer model](https://huggingface.co/entropy/roberta_zinc_enamine_decomposer) 
trained to "decompose" a roberta-zinc embedding into two building block embeddings from the Enamine 
library.


**BibTeX:**

```bibtex
@misc{heyer2023roberta,
  title={Roberta-zinc-480m},
  author={Heyer, Karl},
  year={2023}
}
```

**APA:**

Heyer, K. (2023). Roberta-zinc-480m.


## Model Card Authors

Karl Heyer

## Model Card Contact

karl@darmatterai.xyz
