---
license: mit
tags:
- chemistry
- molecule
- drug
---
# Model Card for Roberta Zinc 480m
### Model Description
`roberta_zinc_480m` is a ~102M parameter RoBERTa-style masked language model trained on
~480M SMILES strings from the [ZINC database](https://zinc.docking.org/). The model is
useful for generating embeddings from SMILES strings.
- **Developed by:** Karl Heyer
- **License:** MIT
### Direct Use
Usage examples. Note that input SMILES strings should be canonicalized.
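Canonicalization can be done with RDKit (an assumed dependency for this step, not something this model ships with):

```python
from rdkit import Chem

# canonicalize a SMILES string with RDKit before tokenizing
raw = "C1=CC=CC=C1O"  # phenol written in a non-canonical form
canonical = Chem.MolToSmiles(Chem.MolFromSmiles(raw))
```

Canonicalization is idempotent, so re-canonicalizing `canonical` returns the same string.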
With the Transformers library:
```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("entropy/roberta_zinc_480m")

# the model was not trained with a pooler, so skip the pooling layer
roberta_zinc = AutoModel.from_pretrained("entropy/roberta_zinc_480m",
                                         add_pooling_layer=False)

# smiles should be canonicalized
smiles = [
    "Brc1cc2c(NCc3ccccc3)ncnc2s1",
    "Brc1cc2c(NCc3ccccn3)ncnc2s1",
    "Brc1cc2c(NCc3cccs3)ncnc2s1",
    "Brc1cc2c(NCc3ccncc3)ncnc2s1",
    "Brc1cc2c(Nc3ccccc3)ncnc2s1",
]
batch = tokenizer(smiles, return_tensors="pt", padding=True, pad_to_multiple_of=8)

outputs = roberta_zinc(**batch, output_hidden_states=True)
full_embeddings = outputs.hidden_states[-1]  # (batch, seq_len, hidden_size)

# mean pooling: average token embeddings, excluding padding positions
mask = batch["attention_mask"]
embeddings = (full_embeddings * mask.unsqueeze(-1)).sum(1) / mask.sum(-1).unsqueeze(-1)
```
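The masked mean pooling above can be sanity-checked on a toy tensor, with no model or tokenizer needed (dummy values chosen for illustration):

```python
import torch

# one sequence of three tokens, hidden size 2; the last token is padding
hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]])
mask = torch.tensor([[1, 1, 0]])

# same math as the snippet above: sum real tokens, divide by their count
pooled = (hidden * mask.unsqueeze(-1)).sum(1) / mask.sum(-1).unsqueeze(-1)
print(pooled)  # tensor([[2., 3.]]): the padded token is excluded from the mean
```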
With Sentence Transformers:
```python
from sentence_transformers import models, SentenceTransformer

transformer = models.Transformer(
    "entropy/roberta_zinc_480m",
    max_seq_length=256,
    model_args={"add_pooling_layer": False},
)
pooling = models.Pooling(transformer.get_word_embedding_dimension(), pooling_mode="mean")
model = SentenceTransformer(modules=[transformer, pooling])
# smiles should be canonicalized
smiles = [
"Brc1cc2c(NCc3ccccc3)ncnc2s1",
"Brc1cc2c(NCc3ccccn3)ncnc2s1",
"Brc1cc2c(NCc3cccs3)ncnc2s1",
"Brc1cc2c(NCc3ccncc3)ncnc2s1",
"Brc1cc2c(Nc3ccccc3)ncnc2s1"
]
embeddings = model.encode(smiles, convert_to_tensor=True)
```
### Training Procedure
#### Preprocessing
~480M SMILES strings were randomly sampled from the [ZINC database](https://zinc.docking.org/),
weighted by tranche size (i.e., more SMILES were sampled from larger tranches). The SMILES
were canonicalized and then used to train the tokenizer.
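Tranche-weighted sampling can be sketched with `random.choices`; the tranche names and sizes below are hypothetical, not actual ZINC tranches:

```python
import random

random.seed(0)

# hypothetical tranche sizes; real ZINC tranches are binned by molecular properties
tranche_sizes = {"tranche_a": 1_000, "tranche_b": 10_000, "tranche_c": 100_000}

# draw tranche assignments with probability proportional to tranche size
draws = random.choices(list(tranche_sizes), weights=list(tranche_sizes.values()), k=10_000)

# larger tranches contribute proportionally more samples
print(draws.count("tranche_c") > draws.count("tranche_a"))  # True
```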
#### Training Hyperparameters
The model was trained with cross entropy loss for 150,000 iterations with a batch size of
4096. The model achieved a validation loss of ~0.122.
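For context, a masked-token cross entropy loss of ~0.122 corresponds to a perplexity of roughly 1.13, since perplexity is the exponential of the loss:

```python
import math

# perplexity is the exponential of the cross entropy loss
val_loss = 0.122
perplexity = math.exp(val_loss)
print(round(perplexity, 2))  # 1.13
```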
### Downstream Models
#### Decoder
There is a [decoder model](https://huggingface.co/entropy/roberta_zinc_decoder) trained to reconstruct
inputs from embeddings generated with this model.
#### Compression Encoder
There is a [compression encoder model](https://huggingface.co/entropy/roberta_zinc_compression_encoder)
trained to compress embeddings generated by this model from the native size of 768 to
smaller sizes (512, 256, 128, 64, 32) while preserving cosine similarity between embeddings.
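Cosine similarity, the quantity the compression encoder is trained to preserve, can be computed as:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# vectors pointing the same way give 1.0; orthogonal vectors give 0.0
print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```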
#### Decomposer
There is an [embedding decomposer model](https://huggingface.co/entropy/roberta_zinc_enamine_decomposer)
trained to "decompose" a roberta-zinc embedding into two building block embeddings from the Enamine
library.
**BibTeX:**
```bibtex
@misc{heyer2023roberta,
  title={Roberta-zinc-480m},
  author={Heyer, Karl},
  year={2023}
}
```
**APA:**
Heyer, K. (2023). Roberta-zinc-480m.
## Model Card Authors
Karl Heyer
## Model Card Contact
karl@darmatterai.xyz