ChEmbed v0.1 - Chemical Embeddings
This prototype is a sentence-transformers model based on MiniLM-L6-H384-uncased, fine-tuned on around 1 million pairs of valid natural compounds' SELFIES (Krenn et al. 2020) taken from COCONUTDB (Sorokina et al. 2021). It maps compounds' Self-Referencing Embedded Strings (SELFIES) into a 768-dimensional dense vector space and can potentially be used for chemical similarity estimation, similarity search, classification, clustering, and more.
I am planning to train this model for more epochs on the current dataset before moving on to a larger dataset of 6 million pairs generated from ChEMBL34. However, this will take some time due to computational and financial constraints. A future project of mine is to develop a custom model specifically for cheminformatics, to address the biases and optimization issues that come with repurposing an embedding model designed for NLP tasks.
Update
This model won't be trained further on the current natural products dataset nor on ChEMBL34, since for the past two weeks I have been working on pre-training a BERT-like base model that operates on SELFIES with a custom tokenizer. The base model was scheduled for release this week, but due to mistakes in parsing some SELFIES notations, pre-training has been halted while I work intensely to correct these issues and resume training. The base model will hopefully be released next week. Following this, I plan to fine-tune a sentence transformer and a classifier model built on top of that base model. The timeline for these tasks depends on the availability of a compute server and my own time constraints, as I also need to finish my undergraduate thesis. Thank you for checking out this model.
The base model is now available here. A new version of this model is now available here.
Disclaimer: For Academic Purposes Only
The information and model provided are for academic purposes only. They are intended for educational and research use and should not be used for any commercial or legal purposes. The author does not guarantee the accuracy, completeness, or reliability of the information.
Usage
Tutorial for using on a Large Dataset
See this Jupyter Notebook tutorial on using the model to embed ~407K natural product molecular SELFIES representations (from COCONUTDB) and on utilizing Meta's FAISS to index them and run fast searches for structurally similar compounds based on one or more inputs.
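For reference, below is a minimal sketch of the embed-and-index pattern the tutorial follows. This is a sketch only: it assumes `faiss-cpu` is installed, and the corpus list and example query are hypothetical placeholders, not the tutorial's actual data.

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("gbyuvd/ChemEmbed-v01")

# Hypothetical corpus of SELFIES strings to index (the tutorial uses ~407K from COCONUTDB)
corpus = [
    "[O][=C][Branch1][#Branch1][C][=C][C][C][C][=C][C]",
    "[O][=C][O][C][C][C]",
]

# Encode and L2-normalize so that inner product equals cosine similarity
embeddings = model.encode(corpus).astype(np.float32)
faiss.normalize_L2(embeddings)

# Exact inner-product index over the 768-dimensional embeddings
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(embeddings)

# Search for the compounds most similar to a query SELFIES
query = model.encode(["[O][=C][Branch1][#Branch1][C][=C][C][C][C][=C][C]"]).astype(np.float32)
faiss.normalize_L2(query)
scores, ids = index.search(query, k=2)  # top-2 nearest neighbours
print(ids, scores)
```

IndexFlatIP performs exact search; for very large corpora it could be swapped for an approximate index such as IndexIVFFlat.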
Direct Usage (Sentence Transformers)
First install the Sentence Transformers library:
```bash
pip install -U sentence-transformers
```
Then you can load this model and run inference.
```python
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("gbyuvd/ChemEmbed-v01")

# Run inference
sentences = [
    '[O][=C][Branch1][#Branch1][C][=C][C][C][C][=C][C]',
    '[O][=C][Branch1][C][O][C][C][C][C][C][C][C][C][Branch1][C][O][C][=C][C][#C][C][=C][C][C][C]',
    '[O][C][C][O][C][Branch2][=Branch2][Ring1][O][C][Branch1][C][C][C][C][C][Branch1][C][O][O][C][C][C][C][C][C][C][C][C][Branch2][Branch1][N][O][C][O][C][Branch1][Ring1][C][O][C][Branch2][Ring2][#Branch1][O][C][O][C][Branch1][Ring1][C][O][C][Branch1][C][O][C][Branch1][C][O][C][Ring1][#Branch2][O][C][O][C][Branch1][C][C][C][Branch1][C][O][C][Branch1][C][O][C][Ring1][=Branch2][O][C][Branch1][C][O][C][Ring2][Ring1][#C][O][C][C][C][Ring2][Ring2][#Branch1][Branch1][C][C][C][Ring2][Ring2][N][C][C][C][Ring2][Ring2][S][Branch1][C][C][C][Ring2][Branch1][Ring2][C][Ring2][Branch1][Branch2][C][C][Branch1][C][O][C][Branch1][C][O][C][Ring2][=Branch1][=Branch1][O]',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
```
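Note that the model expects SELFIES, not SMILES. If your data is in SMILES, you can convert it first with the selfies package; a minimal sketch, assuming it is installed via `pip install selfies`:

```python
import selfies as sf

# Benzene as an illustrative example
benzene_sf = sf.encoder("c1ccccc1")
print(benzene_sf)  # [C][=C][C][=C][C][=C][Ring1][=Branch1]

# SELFIES round-trips back to SMILES
print(sf.decoder(benzene_sf))  # C1=CC=CC=C1
```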
Model Details
Model Description
- Model Type: Sentence Transformer
- Base model: MiniLM-L6-H384-uncased
- Maximum Sequence Length: 512 tokens
- Output Dimensionality: 768 dimensions
- Similarity Function: Cosine Similarity
- Training Dataset: SELFIES pairs generated from COCONUTDB
- Language: SELFIES
- License: CC BY-NC 4.0
Full Model Architecture
```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': True, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': False})
)
```
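Note how a 384-dimensional base model yields 768-dimensional outputs: the pooling layer applies both mean and max pooling over the token embeddings and concatenates the two 384-dimensional vectors. A minimal numpy sketch of the idea (ignoring the attention mask, which the real pooling layer uses to exclude padding tokens):

```python
import numpy as np

# Hypothetical token embeddings from the transformer: (seq_len, 384)
token_embeddings = np.random.rand(20, 384)

mean_pooled = token_embeddings.mean(axis=0)  # (384,)
max_pooled = token_embeddings.max(axis=0)    # (384,)

# Concatenating mean- and max-pooled vectors gives the 768-dim sentence embedding
sentence_embedding = np.concatenate([mean_pooled, max_pooled])
print(sentence_embedding.shape)  # (768,)
```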
Dataset
| Dataset | Reference | Number of Pairs |
|---|---|---|
| COCONUTDB (0.8:0.1:0.1 split) | (Sorokina et al. 2021) | 1,183,174 |
Evaluation
Metrics
Semantic Similarity
- Dataset: NP-isotest
- Number of test pairs: 118,318
- Evaluated with EmbeddingSimilarityEvaluator
| Metric | Value |
|---|---|
| pearson_cosine | 0.9367 |
| spearman_cosine | 0.9303 |
| pearson_manhattan | 0.8263 |
| spearman_manhattan | 0.8452 |
| pearson_euclidean | 0.8654 |
| spearman_euclidean | 0.9243 |
| pearson_dot | 0.9232 |
| spearman_dot | 0.9367 |
| pearson_max | 0.9303 |
| spearman_max | 0.8961 |
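To run this kind of evaluation on your own held-out pairs, a minimal sketch using sentence-transformers' EmbeddingSimilarityEvaluator (the pairs and gold scores below are hypothetical placeholders):

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

model = SentenceTransformer("gbyuvd/ChemEmbed-v01")

# Hypothetical test pairs: two lists of SELFIES plus gold similarity scores in [0, 1]
selfies_a = ["[C][C][O]", "[O][=C][Branch1][#Branch1][C][=C][C][C][C][=C][C]"]
selfies_b = ["[C][C][C][O]", "[O][=C][O]"]
gold_scores = [0.9, 0.1]

evaluator = EmbeddingSimilarityEvaluator(selfies_a, selfies_b, gold_scores, name="my-testset")
print(evaluator(model))  # Pearson/Spearman correlations for several distance metrics
```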
Limitations
For now, the model might be ineffective at embedding synthetic drugs, since it has been trained only on natural products. Also, the tokenizer used is still an uncustomized one, inherited from the NLP base model rather than adapted to SELFIES.
Testing Generated Embeddings' Clusters
The plot below shows how the model's embeddings (at this stage) cluster different classes of compounds, compared to using MACCS fingerprints.
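As a rough illustration, a plot like this can be produced by projecting the embeddings to two dimensions, e.g. with PCA; a sketch under assumed data, where the compounds and class labels are hypothetical placeholders:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("gbyuvd/ChemEmbed-v01")

# Hypothetical SELFIES grouped by compound class
selfies = ["[C][C][O]", "[C][C][C][O]", "[O][=C][O]", "[O][=C][C][O]"]
labels = np.array(["alcohol", "alcohol", "acid", "acid"])

# Project the 768-dim embeddings onto their first two principal components
coords = PCA(n_components=2).fit_transform(model.encode(selfies))

for cls in np.unique(labels):
    pts = coords[labels == cls]
    plt.scatter(pts[:, 0], pts[:, 1], label=cls)
plt.legend()
plt.title("ChemEmbed embeddings, 2-D PCA projection")
plt.show()
```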
Framework Versions
- Python: 3.9.13
- Sentence Transformers: 3.0.1
- Transformers: 4.41.2
- PyTorch: 2.3.1+cu121
- Accelerate: 0.31.0
- Datasets: 2.20.0
- Tokenizers: 0.19.1
Acknowledgments
Sentence Transformers
```bibtex
@inproceedings{reimers-2019-sentence-bert,
  title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
  author = "Reimers, Nils and Gurevych, Iryna",
  booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
  month = "11",
  year = "2019",
  publisher = "Association for Computational Linguistics",
  url = "https://arxiv.org/abs/1908.10084",
}
```
COCONUTDB
```bibtex
@article{sorokina2021coconut,
  title={COCONUT online: Collection of Open Natural Products database},
  author={Sorokina, Maria and Merseburger, Peter and Rajan, Kohulan and Yirik, Mehmet Aziz and Steinbeck, Christoph},
  journal={Journal of Cheminformatics},
  volume={13},
  number={1},
  pages={2},
  year={2021},
  doi={10.1186/s13321-020-00478-9}
}
```
SELFIES
- For more information on SELFIES, you can read this blog post or check out their GitHub.
```bibtex
@article{krenn2020selfies,
  title={Self-referencing embedded strings (SELFIES): A 100\% robust molecular string representation},
  author={Krenn, Mario and H{\"a}se, Florian and Nigam, AkshatKumar and Friederich, Pascal and Aspuru-Guzik, Alan},
  journal={Machine Learning: Science and Technology},
  volume={1},
  number={4},
  pages={045024},
  year={2020},
  doi={10.1088/2632-2153/aba947}
}
```
Contact
G Bayu (gbyuvd@proton.me)