---
license: mit
tags:
- sentence-transformers
- chemistry
- molecular-similarity
- cheminformatics
- ssl
- smiles
- feature-extraction
pipeline_tag: sentence-similarity
library_name: sentence-transformers
---

# miniChembed-prototype

This is an experimental **self-supervised molecular embedding** model trained using the **Barlow Twins** objective on approximately **24K unlabeled SMILES strings**. If validated as effective, it will be scaled to 2.1M molecules. The training data were compiled from public sources including:

- **ChEMBL34** (Zdrazil et al., 2023)
- **COCONUTDB** (Sorokina et al., 2021)
- **SuperNatural3** (Gallo et al., 2023)

The model maps SMILES strings to a **320-dimensional dense vector space**, optimized for **molecular similarity search, clustering, and scaffold analysis without any supervision from bioactivity, property labels, or precomputed fingerprints**.

Unlike fixed fingerprints (e.g., ECFP4), this model learns representations directly from **stochastic SMILES augmentations**, encouraging invariance to syntactic variation while potentially maximizing representational diversity across molecules.
The Barlow Twins objective explicitly minimizes redundancy between embedding dimensions, promoting structured, non-collapsed representations.

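
As a rough illustration of the objective described above, the sketch below computes a Barlow Twins-style loss for two batches of embeddings in plain Python. It follows the formulation from the original Barlow Twins paper: standardize each dimension over the batch, build the cross-correlation matrix between the two views, then pull the diagonal toward 1 and the off-diagonal toward 0. The function name, the toy inputs, and the exact normalization are assumptions; the actual training script may scale the two terms differently.

```python
import math

def barlow_twins_loss(view_a, view_b, lam=5.0):
    """Barlow Twins-style loss for two aligned batches (lists of equal-length vectors)."""
    n, dim = len(view_a), len(view_a[0])

    def standardize(batch):
        # Return one list per embedding dimension, with zero mean and unit variance.
        columns = []
        for column in zip(*batch):
            mean = sum(column) / n
            std = math.sqrt(sum((v - mean) ** 2 for v in column) / n) or 1.0
            columns.append([(v - mean) / std for v in column])
        return columns

    cols_a, cols_b = standardize(view_a), standardize(view_b)
    on_diag = off_diag = 0.0
    for i in range(dim):
        for j in range(dim):
            # Entry (i, j) of the cross-correlation matrix between the two views.
            c_ij = sum(a * b for a, b in zip(cols_a[i], cols_b[j])) / n
            if i == j:
                on_diag += (1.0 - c_ij) ** 2  # invariance term
            else:
                off_diag += c_ij ** 2         # redundancy-reduction term
    return on_diag + lam * off_diag

# Toy batches (hypothetical values, not model outputs).
decorrelated = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
redundant = [[1.0, 1.0], [-1.0, -1.0]]
loss_decorrelated = barlow_twins_loss(decorrelated, decorrelated)  # independent dimensions
loss_redundant = barlow_twins_loss(redundant, redundant)           # duplicated dimension
```

With identical views and independent dimensions the loss is 0, while a duplicated dimension is penalized only through the off-diagonal term (here 2 × λ = 10).
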
> Note: This is an experimental prototype.
> Feel free to experiment with and edit the training script as you wish!
> Correcting my mistakes, tweaking augmentations, loss weights, optimizer settings, or network architecture could lead to even better representations.

---

## Model Details

### Architecture & Training

| Attribute | Value |
|----------|-------|
| **Base architecture** | Custom RoBERTa-style transformer (6 layers, 320 hidden dim, 4 attention heads, ~8M params) |
| **Initialization** | Random (not pretrained on text or chemistry) |
| **Training objective** | **Barlow Twins**, redundancy-reduction via cross-correlation matrix |
| **Augmentation** | Stochastic SMILES enumeration (`MolToSmiles(..., doRandom=True)`) |
| **Training data** | ~24K unique molecules → augmented into positive pairs |
| **Sequence length** | 514 tokens |
| **Embedding dimension** | 320 |
| **Projection head** | 3-layer MLP with BatchNorm (2048 → 2048 → 2048) |
| **Pooling** | Mean pooling over token embeddings |
| **Similarity metric** | Cosine similarity |
| **Effective batch size** | 64 (physical batch: 16, gradient accumulation: 4×) |
| **Learning rate** | 1e-4 |
| **Optimizer** | **Ranger21** (with warmup/warmdown scheduling) |
| **Weight decay** | 0.01 (applied selectively: no decay on bias/LayerNorm) |
| **Barlow λ** | 5.0 (stronger off-diagonal penalty) |
| **Training duration** | 5 epochs |
| **Hardware** | Single NVIDIA 930MX GPU |

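
The augmentation row above can be sketched as follows: each molecule is parsed once and written out twice with a random atom ordering, which gives two different SMILES strings for the same structure. This is a minimal sketch that assumes RDKit is installed; `random_smiles_pair` is a hypothetical helper name, and the actual training script may build its positive pairs differently.

```python
try:
    from rdkit import Chem  # RDKit provides the SMILES enumeration
except ImportError:  # keeps the snippet importable when RDKit is missing
    Chem = None

def random_smiles_pair(smiles):
    """Return two randomized SMILES for one molecule, or None if it cannot be parsed."""
    if Chem is None:
        raise RuntimeError("RDKit is required for SMILES enumeration")
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    # canonical=False with doRandom=True asks RDKit for a random atom ordering.
    view_a = Chem.MolToSmiles(mol, canonical=False, doRandom=True)
    view_b = Chem.MolToSmiles(mol, canonical=False, doRandom=True)
    return view_a, view_b
```

Both strings describe the same molecule, so the pair can serve as the two views that the Barlow Twins objective compares.
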
### Architecture (SentenceTransformer format)
```python
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False, 'architecture': 'RobertaModel'})
  (1): Pooling({'word_embedding_dimension': 320, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
```

> Note: The model was not initialized from a language model; it was trained from scratch on SMILES using only the Barlow Twins objective.

---

## Usage

### Installation
```bash
pip install -U sentence-transformers rdkit
```

### Direct Usage (Sentence Transformers)
```python
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("gbyuvd/miniChembed-prototype")
# Run inference
sentences = [
    'O=C1/C=C\\C=C2/N1C[C@@H]3CNC[C@H]2C3',  # Cytisine
    "n1c2cc3c(cc2ncc1)[C@@H]4CNC[C@H]3C4",  # Varenicline
    "c1ncccc1[C@@H]2CCCN2C",  # Nicotine
    'Nc1nc2cncc-2co1',  # CID: 162789184
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (4, 320)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[ 1.0000,  0.2279, -0.1979, -0.3754],
#         [ 0.2279,  1.0000,  0.7371,  0.6745],
#         [-0.1979,  0.7371,  1.0000,  0.9803],
#         [-0.3754,  0.6745,  0.9803,  1.0000]])
```

High cosine similarity suggests structural or topological relatedness learned purely from SMILES variation rather than from explicit chemical knowledge or labels.

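
To make the matrix above easier to read, the following sketch picks the most similar other molecule for each row. It reuses the printed scores as a plain nested list; `nearest_neighbors` and the label list are illustrative names, and with the real tensor one would presumably pass `similarities.tolist()` instead.

```python
names = ["Cytisine", "Varenicline", "Nicotine", "CID 162789184"]
similarities = [
    [1.0000, 0.2279, -0.1979, -0.3754],
    [0.2279, 1.0000, 0.7371, 0.6745],
    [-0.1979, 0.7371, 1.0000, 0.9803],
    [-0.3754, 0.6745, 0.9803, 1.0000],
]

def nearest_neighbors(matrix, labels):
    """Map each label to its most similar other label and the score."""
    result = {}
    for i, row in enumerate(matrix):
        # Skip the diagonal, which is always the molecule compared with itself.
        candidates = [(score, labels[j]) for j, score in enumerate(row) if j != i]
        best_score, best_label = max(candidates)
        result[labels[i]] = (best_label, best_score)
    return result

for name, (neighbor, score) in nearest_neighbors(similarities, names).items():
    print(f"{name} -> {neighbor} ({score:.4f})")
# Cytisine -> Varenicline (0.2279)
# Varenicline -> Nicotine (0.7371)
# Nicotine -> CID 162789184 (0.9803)
# CID 162789184 -> Nicotine (0.9803)
```
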
### Testing Similarity Search
> Tip: For large-scale similarity search, index the embeddings with Meta's FAISS.

For an example FAISS indexing pipeline, see `./examples/faiss.ipynb`.

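
For small collections, the same kind of search can be done by brute force, which also shows what an exact index computes. The sketch below is plain Python with toy vectors and is not taken from the notebook; `top_k` and `normalize` are hypothetical helpers. An exact inner-product index over L2-normalized vectors (for example FAISS's `IndexFlatIP`) should return the same ranking, only much faster.

```python
import heapq
import math

def normalize(vector):
    """Scale a vector to unit length so dot products equal cosine similarities."""
    norm = math.sqrt(sum(v * v for v in vector)) or 1.0
    return [v / norm for v in vector]

def top_k(query, index_vectors, k=5):
    """Return the k (score, position) pairs with the highest cosine similarity."""
    q = normalize(query)
    scored = (
        (sum(a * b for a, b in zip(q, normalize(vec))), pos)
        for pos, vec in enumerate(index_vectors)
    )
    return heapq.nlargest(k, scored)

# Toy 3-dimensional vectors standing in for 320-dimensional embeddings.
index_vectors = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [-1.0, 0.0, 0.0]]
hits = top_k([1.0, 0.0, 0.0], index_vectors, k=2)
for score, pos in hits:
    print(f"position {pos}: cosine = {score:.4f}")
```
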
Cytisine as the query against the 24K-molecule embedded index:


```
Rank 1: SMILES = O=C1OC2C(O)CC1C1C2N(Cc2ccc(F)cc2)C(=S)N1CC1CCCCC1, Cosine Similarity = 0.9944
Rank 2: SMILES = CN1C(CCC(=O)N2CCC(O)CC2)CNC(=O)C2C1CCN2Cc1ncc[nH]1, Cosine Similarity = 0.9940
Rank 3: SMILES = CC1C(=O)OC2C1CCC1(C)Cc3sc(NC(=O)Nc4cccc(F)c4)nc3C(C)C21, Cosine Similarity = 0.9938
Rank 4: SMILES = Cc1ccc(NC(=O)Nc2nc3c(s2)CC2(C)CCC4C(C)C(=O)OC4C2C3C)cc1, Cosine Similarity = 0.9938
Rank 5: SMILES = O=C(CC1CC2OC(CNC3Cc4ccccc4C3)C(O)C2O1)N1CCC(F)(F)C1, Cosine Similarity = 0.9929
```

## Comparison to Traditional Fingerprints
### Overview
| Feature | ECFP4 / MACCS | miniChembed-prototype |
|--------|----------------|------------------------|
| **Representation** | Hand-crafted binary fingerprint | Learned dense embedding |
| **Training data** | None (rule-based) | ~24K unlabeled SMILES |
| **Global semantics** | Captures only local substructures | Learns global invariances via augmentation |
| **Redundancy control** | Not applicable | Explicitly minimized (Barlow objective) |

### Clustering

Preliminary clustering evaluation vs. ECFP4 on 64 molecules with 4 classes:



```
ARI (Embeddings)  : 0.084
ARI (ECFP4)       : 0.024
Silhouette (Embeddings) : 0.398
Silhouette (ECFP4)      : 0.025
```

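
For readers unfamiliar with the Adjusted Rand Index (ARI) reported above, the sketch below implements its standard definition in plain Python: it counts item pairs grouped together in both labelings and corrects for chance agreement. The card does not state which implementation produced the numbers, so this is only an illustration; the label lists are hypothetical toy data, not the 64-molecule benchmark.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two labelings of the same items."""
    n = len(labels_true)
    pair_counts = Counter(zip(labels_true, labels_pred))
    # Pairs that share a group in both labelings, in the reference, and in the prediction.
    sum_cells = sum(comb(c, 2) for c in pair_counts.values())
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_true * sum_pred / comb(n, 2)
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:  # degenerate case, e.g. a single cluster on both sides
        return 1.0
    return (sum_cells - expected) / (max_index - expected)

classes = [0, 0, 1, 1, 2, 2]  # hypothetical ground-truth classes
perfect = [1, 1, 0, 0, 2, 2]  # same grouping, different cluster ids
mixed = [0, 1, 0, 1, 2, 2]    # two classes scrambled
ari_perfect = adjusted_rand_index(classes, perfect)  # 1.0: groupings agree exactly
ari_mixed = adjusted_rand_index(classes, mixed)      # about 0.167: mostly chance-level
```
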
---

## Training Summary

- **Objective**: Minimize off-diagonal terms in the cross-correlation matrix of augmented views.
- **Key metric**: Barlow Health Score = `mean(same-molecule cosine) - mean(cross-molecule cosine)`
  → Higher = better separation between intra- and inter-molecular similarity.
- **Validation**: Evaluated at every 25% of training; best checkpoint selected by health score.
- **Final health**: 0.891 at step 1885, indicating that augmented views of the same molecule embed far closer together than different molecules do.

```
Step 1885 | Alignment=0.017 | Uniformity=-1.338
Same-mol cos: 0.983±0.032 | Pairwise: 0.093±0.518
Barlow Health: 0.891
```

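
The health score defined above is simple enough to check by hand. The sketch below is a minimal version that takes two lists of cosine similarities; `barlow_health` is an illustrative name, and feeding it the rounded means from the log is only a sanity check, not a reproduction of the evaluation code.

```python
from statistics import mean

def barlow_health(same_mol_cosines, cross_mol_cosines):
    """Mean same-molecule cosine minus mean cross-molecule cosine."""
    return mean(same_mol_cosines) - mean(cross_mol_cosines)

# Plugging in the rounded means from the step-1885 log line.
health = barlow_health([0.983], [0.093])
print(f"{health:.3f}")  # 0.890, which matches the logged 0.891 up to rounding of the inputs
```
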
---

## Limitations

- Trained on **drug-like organic molecules**; performance on inorganics, salts, or polymers is unknown.
- Input must be **valid SMILES**; invalid strings may produce erratic embeddings.
- **Not trained on bioactivity data**, so similarity reflects SMILES-level structural patterns, not biological function.
- Small-scale prototype (~24K molecules); the final version will scale to 2.1M molecules if the approach proves effective.

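
Because invalid SMILES can yield erratic embeddings, it may help to filter inputs before calling `model.encode`. The sketch below is one possible pre-check that assumes RDKit is installed; `split_valid_smiles` is a hypothetical helper and is not part of this repository.

```python
try:
    from rdkit import Chem
except ImportError:  # keeps the snippet importable when RDKit is missing
    Chem = None

def split_valid_smiles(smiles_list):
    """Separate strings RDKit can parse from those it cannot."""
    if Chem is None:
        raise RuntimeError("RDKit is required to validate SMILES")
    valid, invalid = [], []
    for smiles in smiles_list:
        # MolFromSmiles returns None when parsing or sanitization fails.
        mol = Chem.MolFromSmiles(smiles)
        (valid if mol is not None else invalid).append(smiles)
    return valid, invalid
```
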
---

## Reproducibility

This model was trained using a custom script based on Sentence Transformers v5.1.0, with the following environment:

- Python: 3.13.0
- Transformers: 4.56.2
- PyTorch: 2.6.0+cu126
- Accelerate: 1.10.1
- Datasets: 4.0.0
- Tokenizers: 0.22.0

Training code, config, and evaluation are available in this repo under `./train/trainbarlow.py` and `./train/config.yaml`.

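
When trying to reproduce the run, it can be useful to compare the local environment against the versions listed above. The sketch below uses only the standard library; the distribution names are the usual PyPI ones and are an assumption, as is the idea that an exact match is needed at all.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions copied from the environment list; distribution names are assumed.
EXPECTED = {
    "sentence-transformers": "5.1.0",
    "transformers": "4.56.2",
    "torch": "2.6.0+cu126",
    "accelerate": "1.10.1",
    "datasets": "4.0.0",
    "tokenizers": "0.22.0",
}

def compare_versions(expected):
    """Return {package: (installed_or_None, expected)} for every mismatch."""
    mismatches = {}
    for package, wanted in expected.items():
        try:
            installed = version(package)
        except PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != wanted:
            mismatches[package] = (installed, wanted)
    return mismatches

for package, (installed, wanted) in compare_versions(EXPECTED).items():
    print(f"{package}: installed={installed}, card lists {wanted}")
```
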
---

## Reference
Note that the method used here does not use a target network; instead, it relies on RDKit-augmented enumeration of each molecule's SMILES.

```
@misc{çağatan2024unseeunsupervisednoncontrastivesentence,
  title={UNSEE: Unsupervised Non-contrastive Sentence Embeddings},
  author={Ömer Veysel Çağatan},
  year={2024},
  eprint={2401.15316},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2401.15316},
}
```

---

## Citation

If you use this model, please cite:

```bibtex
% SBERT
@inproceedings{reimers-2019-sentence-bert,
  title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
  author = "Reimers, Nils and Gurevych, Iryna",
  booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
  year = "2019",
  url = "https://arxiv.org/abs/1908.10084"
}

% Tokenizer
@misc{chithrananda2020chembertalargescaleselfsupervisedpretraining,
  title={ChemBERTa: Large-Scale Self-Supervised Pretraining for Molecular Property Prediction},
  author={Seyone Chithrananda and Gabriel Grand and Bharath Ramsundar},
  year={2020},
  eprint={2010.09885},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2010.09885},
}

% Data
@article{sorokina2021coconut,
  title={COCONUT online: Collection of Open Natural Products database},
  author={Sorokina, Maria and Merseburger, Peter and Rajan, Kohulan and Yirik, Mehmet Aziz and Steinbeck, Christoph},
  journal={Journal of Cheminformatics},
  volume={13},
  number={1},
  pages={2},
  year={2021},
  doi={10.1186/s13321-020-00478-9}
}

@article{zdrazil2023chembl,
  title={The ChEMBL Database in 2023: a drug discovery platform spanning multiple bioactivity data types and time periods},
  author={Zdrazil, Barbara and Felix, Eloy and Hunter, Fiona and Manners, Emma J and Blackshaw, James and Corbett, Sybilla and de Veij, Marleen and Ioannidis, Harris and Lopez, David Mendez and Mosquera, Juan F and Magarinos, Maria Paula and Bosc, Nicolas and Arcila, Ricardo and Kizil{\"o}ren, Tevfik and Gaulton, Anna and Bento, A Patr{\'i}cia and Adasme, Melissa F and Monecke, Peter and Landrum, Gregory A and Leach, Andrew R},
  journal={Nucleic Acids Research},
  year={2023},
  volume={gkad1004},
  doi={10.1093/nar/gkad1004}
}

@misc{chembl34,
  title={ChEMBL34},
  year={2023},
  doi={10.6019/CHEMBL.database.34}
}

@article{Gallo2023,
  author = {Gallo, K and Kemmler, E and Goede, A and Becker, F and Dunkel, M and Preissner, R and Banerjee, P},
  title = {{SuperNatural 3.0-a database of natural products and natural product-based derivatives}},
  journal = {Nucleic Acids Research},
  year = {2023},
  month = jan,
  day = {6},
  volume = {51},
  number = {D1},
  pages = {D654-D659},
  doi = {10.1093/nar/gkac1008}
}

% Optimizer
@article{wright2021ranger21,
  title={Ranger21: a synergistic deep learning optimizer},
  author={Wright, Less and Demeure, Nestor},
  year={2021},
  journal={arXiv preprint arXiv:2106.13731},
}
```