---
license: mit
tags:
- chemistry
- molecule
- drug
---

# Model Card for Roberta Zinc 480m

### Model Description

`roberta_zinc_480m` is a ~102m parameter Roberta-style masked language model trained on ~480m SMILES
strings from the [ZINC database](https://zinc.docking.org/). This model is useful for
generating embeddings from SMILES strings.

- **Developed by:** Karl Heyer
- **License:** MIT

### Direct Use

Usage examples. Note that input SMILES strings should be canonicalized.
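Canonicalization is typically done with RDKit. A minimal sketch (assuming RDKit is installed; the example molecule is arbitrary):

```python
from rdkit import Chem

def canonicalize(smiles):
    # round-trip through an RDKit Mol to get the canonical form;
    # returns None for unparseable SMILES
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol) if mol is not None else None

print(canonicalize("C1=CC=CC=C1O"))  # phenol, in canonical form
```

Different SMILES for the same molecule map to the same canonical string, so embeddings are not sensitive to how the input was written.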

With the Transformers library:

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("entropy/roberta_zinc_480m")
roberta_zinc = AutoModel.from_pretrained("entropy/roberta_zinc_480m",
                                         add_pooling_layer=False)  # model was not trained with a pooler

# smiles should be canonicalized
smiles = [
    "Brc1cc2c(NCc3ccccc3)ncnc2s1",
    "Brc1cc2c(NCc3ccccn3)ncnc2s1",
    "Brc1cc2c(NCc3cccs3)ncnc2s1",
    "Brc1cc2c(NCc3ccncc3)ncnc2s1",
    "Brc1cc2c(Nc3ccccc3)ncnc2s1",
]

batch = tokenizer(smiles, return_tensors='pt', padding=True, pad_to_multiple_of=8)

# mean pooling over the final hidden states, ignoring padding
outputs = roberta_zinc(**batch, output_hidden_states=True)
full_embeddings = outputs[1][-1]  # hidden_states[-1]: final-layer token embeddings
mask = batch['attention_mask']
embeddings = ((full_embeddings * mask.unsqueeze(-1)).sum(1) / mask.sum(-1).unsqueeze(-1))
```
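The mean pooling above averages token embeddings while skipping padding positions. A framework-agnostic numpy sketch of the same computation, with hypothetical toy shapes:

```python
import numpy as np

# toy batch: 2 sequences, 4 token positions, hidden size 3
full_embeddings = np.arange(24, dtype=np.float64).reshape(2, 4, 3)
# attention mask: second sequence has 2 padding positions
mask = np.array([[1, 1, 1, 1],
                 [1, 1, 0, 0]], dtype=np.float64)

# zero out padded positions, then divide by the number of real tokens
summed = (full_embeddings * mask[..., None]).sum(axis=1)
embeddings = summed / mask.sum(axis=-1, keepdims=True)

print(embeddings.shape)  # (2, 3): one pooled vector per sequence
```

Dividing by `mask.sum(...)` rather than the sequence length is what keeps padding from diluting the average.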

With Sentence Transformers:

```python
from sentence_transformers import models, SentenceTransformer

transformer = models.Transformer("entropy/roberta_zinc_480m",
                                 max_seq_length=256,
                                 model_args={"add_pooling_layer": False})

pooling = models.Pooling(transformer.get_word_embedding_dimension(),
                         pooling_mode="mean")

model = SentenceTransformer(modules=[transformer, pooling])

# smiles should be canonicalized
smiles = [
    "Brc1cc2c(NCc3ccccc3)ncnc2s1",
    "Brc1cc2c(NCc3ccccn3)ncnc2s1",
    "Brc1cc2c(NCc3cccs3)ncnc2s1",
    "Brc1cc2c(NCc3ccncc3)ncnc2s1",
    "Brc1cc2c(Nc3ccccc3)ncnc2s1",
]

embeddings = model.encode(smiles, convert_to_tensor=True)
```

### Training Procedure

#### Preprocessing

~480m SMILES strings were randomly sampled from the [ZINC database](https://zinc.docking.org/),
weighted by tranche size (i.e. more SMILES were sampled from larger tranches). SMILES were
canonicalized, then used to train the tokenizer.
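Tranche-weighted sampling can be sketched with the standard library; the tranche names and sizes below are hypothetical stand-ins:

```python
import random

# hypothetical tranche sizes (number of SMILES per ZINC tranche)
tranche_sizes = {"AAAA": 1_000_000, "BBBB": 250_000, "CCCC": 50_000}

random.seed(0)
# pick tranches proportionally to their size; in practice each draw
# would then pull a SMILES string from the chosen tranche
draws = random.choices(list(tranche_sizes),
                       weights=tranche_sizes.values(),
                       k=10_000)

counts = {t: draws.count(t) for t in tranche_sizes}
print(counts)  # larger tranches dominate the sample
```

Sampling proportionally to tranche size means the training set mirrors ZINC's overall composition rather than over-representing small tranches.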

#### Training Hyperparameters

The model was trained with cross entropy loss for 150,000 iterations with a batch size of
4096. The model achieved a validation loss of ~0.122.
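Back-of-the-envelope, those hyperparameters work out to roughly 1.3 passes over the ~480m-string corpus (assuming every batch element is a distinct sequence):

```python
iterations = 150_000
batch_size = 4096
corpus_size = 480_000_000  # approximate

sequences_seen = iterations * batch_size
epochs = sequences_seen / corpus_size
print(sequences_seen, round(epochs, 2))  # 614400000 1.28
```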

### Downstream Models

#### Decoder

There is a [decoder model](https://huggingface.co/entropy/roberta_zinc_decoder) trained to reconstruct
inputs from embeddings generated with this model.

#### Compression Encoder

There is a [compression encoder model](https://huggingface.co/entropy/roberta_zinc_compression_encoder)
trained to compress embeddings generated by this model from the native size of 768 to
smaller sizes (512, 256, 128, 64, 32) while preserving cosine similarity between embeddings.
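"Preserving cosine similarity" means the pairwise similarity structure of the compressed embeddings should track that of the 768-d originals. A numpy sketch of how that could be measured, using a random projection as a hypothetical stand-in for the trained encoder:

```python
import numpy as np

def cosine_sim_matrix(x):
    # row-normalize, then take pairwise dot products
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    return x @ x.T

rng = np.random.default_rng(0)
full = rng.normal(size=(16, 768))              # stand-in for model embeddings
proj = rng.normal(size=(768, 64)) / np.sqrt(64)
compressed = full @ proj                        # hypothetical 64-d compression

sim_full = cosine_sim_matrix(full)
sim_small = cosine_sim_matrix(compressed)

# agreement between the two similarity matrices (off-diagonal entries)
i, j = np.triu_indices(16, k=1)
corr = np.corrcoef(sim_full[i, j], sim_small[i, j])[0, 1]
print(round(corr, 3))
```

The trained compression encoder optimizes this agreement directly, so it should score far better than the random projection used here for illustration.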

#### Decomposer

There is an [embedding decomposer model](https://huggingface.co/entropy/roberta_zinc_enamine_decomposer)
trained to "decompose" a roberta-zinc embedding into two building block embeddings from the Enamine
library.

## Citation

**BibTeX:**

```
@misc{heyer2023roberta,
  title={Roberta-zinc-480m},
  author={Heyer, Karl},
  year={2023}
}
```

**APA:**

Heyer, K. (2023). Roberta-zinc-480m.

## Model Card Authors

Karl Heyer

## Model Card Contact

karl@darmatterai.xyz