---
license: mit
tags:
- sentence-transformers
- chemistry
- molecular-similarity
- cheminformatics
- ssl
- smiles
- feature-extraction
pipeline_tag: sentence-similarity
library_name: sentence-transformers
---
# miniChembed-prototype
This is an experimental **self-supervised molecular embedding** model trained using the **Barlow Twins** objective on approximately **24K unlabeled SMILES strings**. If validated as effective, it will be scaled to 2.1M molecules. The training data were compiled from public sources including:
- **ChEMBL34** (Zdrazil et al., 2023)
- **COCONUTDB** (Sorokina et al., 2021)
- **SuperNatural3** (Gallo et al., 2023)
The model maps SMILES strings to a **320-dimensional dense vector space**, optimized for **molecular similarity search, clustering, and scaffold analysis without any supervision from bioactivity, property labels, or precomputed fingerprints**.
Unlike fixed fingerprints (e.g., ECFP4), this model learns representations directly from **stochastic SMILES augmentations**, encouraging invariance to syntactic variation while potentially maximizing representational diversity across molecules.
The Barlow Twins objective explicitly minimizes redundancy between embedding dimensions, promoting structured, non-collapsed representations.
> Note: This is an experimental prototype.
> Feel free to experiment with and edit the training script as you wish!
> Correcting my mistakes, tweaking augmentations, loss weights, optimizer settings, or network architecture could lead to even better representations.
---
## Model Details
### Architecture & Training
| Attribute | Value |
|----------|-------|
| **Base architecture** | Custom RoBERTa-style transformer (6 layers, 320 hidden dim, 4 attention heads, ~8M params) |
| **Initialization** | Random (not pretrained on text or chemistry) |
| **Training objective** | **Barlow Twins**, redundancy-reduction via cross-correlation matrix |
| **Augmentation** | Stochastic SMILES enumeration (`MolToSmiles(..., doRandom=True)`); see the sketch below the table |
| **Training data** | ~24K unique molecules → augmented into positive pairs |
| **Sequence length** | 512 tokens (RoBERTa-style `max_position_embeddings` = 514) |
| **Embedding dimension** | 320 |
| **Projection head** | 3-layer MLP with BatchNorm (2048 → 2048 → 2048) |
| **Pooling** | Mean pooling over token embeddings |
| **Similarity metric** | Cosine similarity |
| **Effective batch size** | 64 (physical batch: 16, gradient accumulation: 4×) |
| **Learning rate** | 1e-4 |
| **Optimizer** | **Ranger21** (with warmup/warmdown scheduling) |
| **Weight decay** | 0.01 (applied selectively: no decay on bias/LayerNorm) |
| **Barlow λ** | 5.0 (stronger off-diagonal penalty) |
| **Training duration** | 5 epochs |
| **Hardware** | Single NVIDIA 930MX GPU |
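The stochastic enumeration listed in the table can be reproduced with RDKit alone. The sketch below is an illustration, not the exact training code: each call yields a different, equally valid SMILES for the same molecule, and two such calls form one Barlow Twins positive pair.

```python
from rdkit import Chem

def random_smiles(smiles: str) -> str:
    # Stochastic SMILES enumeration: a different atom ordering
    # (and hence a different string) for the same molecule on each call.
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol, canonical=False, doRandom=True)

# Two randomized views of nicotine -> one positive pair
nicotine = "c1ncccc1[C@@H]2CCCN2C"
view_a, view_b = random_smiles(nicotine), random_smiles(nicotine)
print(view_a, view_b)
```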
### Architecture (SentenceTransformer format)
```python
SentenceTransformer(
(0): Transformer({'max_seq_length': 512, 'do_lower_case': False, 'architecture': 'RobertaModel'})
(1): Pooling({'word_embedding_dimension': 320, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
```
> Note: The model was not initialized from a pretrained language model; it is trained from scratch on SMILES using only the Barlow Twins objective.
---
## Usage
### Installation
```bash
pip install -U sentence-transformers rdkit
```
### Direct Usage (Sentence Transformers)
```python
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("gbyuvd/miniChembed-prototype")
# Run inference
sentences = [
    "O=C1/C=C\\C=C2/N1C[C@@H]3CNC[C@H]2C3",  # Cytisine
    "n1c2cc3c(cc2ncc1)[C@@H]4CNC[C@H]3C4",   # Varenicline
    "c1ncccc1[C@@H]2CCCN2C",                 # Nicotine
    "Nc1nc2cncc-2co1",                       # CID: 162789184
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (4, 320)
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[ 1.0000, 0.2279, -0.1979, -0.3754],
# [ 0.2279, 1.0000, 0.7371, 0.6745],
# [-0.1979, 0.7371, 1.0000, 0.9803],
# [-0.3754, 0.6745, 0.9803, 1.0000]])
```
High cosine similarity suggests structural or topological relatedness learned purely from SMILES variation and not from explicit chemical knowledge/labeling.
### Testing Similarity Search
> Tip: For large-scale similarity search, integrate embeddings with Meta's FAISS.
For an example of a FAISS indexing pipeline, see `./examples/faiss.ipynb`.
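Below is a minimal sketch of such a pipeline, assuming `faiss-cpu` is installed; the tiny demo corpus stands in for the 24K-molecule index used here.

```python
import faiss  # pip install faiss-cpu
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("gbyuvd/miniChembed-prototype")

corpus_smiles = [  # tiny demo corpus; replace with your own molecules
    "n1c2cc3c(cc2ncc1)[C@@H]4CNC[C@H]3C4",  # varenicline
    "c1ncccc1[C@@H]2CCCN2C",                # nicotine
    "Nc1nc2cncc-2co1",                      # CID 162789184
]

# Unit-normalize so that inner product equals cosine similarity
emb = model.encode(corpus_smiles, normalize_embeddings=True).astype(np.float32)
index = faiss.IndexFlatIP(emb.shape[1])  # exact 320-dim inner-product index
index.add(emb)

query = model.encode(
    ["O=C1/C=C\\C=C2/N1C[C@@H]3CNC[C@H]2C3"],  # cytisine
    normalize_embeddings=True,
).astype(np.float32)

scores, ids = index.search(query, 3)
for rank, (i, s) in enumerate(zip(ids[0], scores[0]), start=1):
    print(f"Rank {rank}: SMILES = {corpus_smiles[i]}, Cosine Similarity = {s:.4f}")
```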
Using cytisine as the query against the 24K-molecule embedded index:
![image](https://cdn-uploads.huggingface.co/production/uploads/667da868d653c0b02d6a2399/kZciikiDjFOCXJrCzb1Lh.png)
```
Rank 1: SMILES = O=C1OC2C(O)CC1C1C2N(Cc2ccc(F)cc2)C(=S)N1CC1CCCCC1, Cosine Similarity = 0.9944
Rank 2: SMILES = CN1C(CCC(=O)N2CCC(O)CC2)CNC(=O)C2C1CCN2Cc1ncc[nH]1, Cosine Similarity = 0.9940
Rank 3: SMILES = CC1C(=O)OC2C1CCC1(C)Cc3sc(NC(=O)Nc4cccc(F)c4)nc3C(C)C21, Cosine Similarity = 0.9938
Rank 4: SMILES = Cc1ccc(NC(=O)Nc2nc3c(s2)CC2(C)CCC4C(C)C(=O)OC4C2C3C)cc1, Cosine Similarity = 0.9938
Rank 5: SMILES = O=C(CC1CC2OC(CNC3Cc4ccccc4C3)C(O)C2O1)N1CCC(F)(F)C1, Cosine Similarity = 0.9929
```
## Comparison to Traditional Fingerprints
### Overview
| Feature | ECFP4 / MACCS | miniChembed-prototype |
|--------|----------------|------------------------|
| **Representation** | Hand-crafted binary fingerprint | Learned dense embedding |
| **Training data** | None (rule-based) | ~24K unlabeled SMILES |
| **Global semantics** | Captures only local substructures | Learns global invariances via augmentation |
| **Redundancy control** | Not applicable | Explicitly minimized (Barlow objective) |
### Clustering
Preliminary clustering evaluation vs. ECFP4 on 64 molecules with 4 classes:
![image](https://cdn-uploads.huggingface.co/production/uploads/667da868d653c0b02d6a2399/SNH7u0tegdzmYGFbJ9F-0.png)
```
ARI (Embeddings) : 0.084
ARI (ECFP4) : 0.024
Silhouette (Embeddings) : 0.398
Silhouette (ECFP4) : 0.025
```
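This comparison can be reproduced roughly along the following lines. The sketch assumes KMeans with k = 4 and scikit-learn metrics, which may differ from the exact evaluation protocol; `smiles` and `labels` are placeholders for the 64-molecule set and its ground-truth classes.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, silhouette_score
from sentence_transformers import SentenceTransformer

smiles: list[str] = [...]  # placeholder: the 64 evaluation molecules
labels: list[int] = [...]  # placeholder: their 4 ground-truth class ids

model = SentenceTransformer("gbyuvd/miniChembed-prototype")
emb = model.encode(smiles)

# ECFP4 = Morgan fingerprint with radius 2
fps = np.array(
    [list(AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(s), 2, nBits=2048))
     for s in smiles],
    dtype=np.float32,
)

for name, X in [("Embeddings", emb), ("ECFP4", fps)]:
    pred = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
    print(f"ARI ({name}): {adjusted_rand_score(labels, pred):.3f}")
    print(f"Silhouette ({name}): {silhouette_score(X, pred):.3f}")
```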
---
## Training Summary
- **Objective**: Minimize off-diagonal terms in the cross-correlation matrix of augmented views.
- **Key metric**: Barlow Health Score = `mean(same-molecule cosine) - mean(cross-molecule cosine)` (sketched after the log below)
→ Higher = better separation between intra- and inter-molecular similarity.
- **Validation**: Evaluated every 25% of training; best checkpoint selected by health score.
- **Final health**: 0.891 at step 1885, indicating strong disentanglement.
```
Step 1885 | Alignment=0.017 | Uniformity=-1.338
Same-mol cos: 0.983±0.032 | Pairwise: 0.093±0.518
Barlow Health: 0.891
```
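For reference, here is a minimal PyTorch sketch of the Barlow Twins loss (λ = 5.0) and the health metric described above. It is an illustration, not the exact training code; see `./train/trainbarlow.py` for that.

```python
import torch
import torch.nn.functional as F

def barlow_twins_loss(z1: torch.Tensor, z2: torch.Tensor, lam: float = 5.0) -> torch.Tensor:
    # z1, z2: (batch, dim) projections of two augmented SMILES views
    n = z1.shape[0]
    z1 = (z1 - z1.mean(0)) / (z1.std(0) + 1e-6)  # standardize each dimension
    z2 = (z2 - z2.mean(0)) / (z2.std(0) + 1e-6)
    c = (z1.T @ z2) / n                          # (dim, dim) cross-correlation matrix
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()              # pull diagonal toward 1
    off_diag = c.pow(2).sum() - torch.diagonal(c).pow(2).sum()  # push off-diagonal toward 0
    return on_diag + lam * off_diag

def barlow_health(z1: torch.Tensor, z2: torch.Tensor) -> float:
    # mean(same-molecule cosine) - mean(cross-molecule cosine)
    sims = F.normalize(z1, dim=1) @ F.normalize(z2, dim=1).T  # (n, n) cosine matrix
    n = sims.shape[0]
    same = sims.diagonal().mean()
    cross = (sims.sum() - sims.diagonal().sum()) / (n * (n - 1))
    return (same - cross).item()
```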
---
## Limitations
- Trained on **drug-like organic molecules**; performance on inorganics, salts, or polymers is unknown.
- Input must be **valid SMILES**; invalid strings may produce erratic embeddings. A pre-filtering sketch follows this list.
- **Not trained on bioactivity data**, so similarity indicates structural syntax, not biological function.
- Small-scale prototype (~24K); final version will scale to 2.1M molecules if proven effective.
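As noted above, a quick RDKit validity check before encoding avoids feeding the model unparsable strings. A minimal pre-filtering sketch (`raw_inputs` is a demo placeholder):

```python
from rdkit import Chem

def is_valid_smiles(s: str) -> bool:
    # RDKit returns None when a SMILES string cannot be parsed
    return Chem.MolFromSmiles(s) is not None

raw_inputs = ["c1ncccc1[C@@H]2CCCN2C", "not-a-smiles"]  # demo inputs
clean = [s for s in raw_inputs if is_valid_smiles(s)]   # keeps only the valid SMILES
```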
---
## Reproducibility
This model was trained using a custom script based on Sentence Transformers v5.1.0, with the following environment:
- Python: 3.13.0
- Transformers: 4.56.2
- PyTorch: 2.6.0+cu126
- Accelerate: 1.10.1
- Datasets: 4.0.0
- Tokenizers: 0.22.0
Training code, config, and evaluation are available in this repo under `./train/trainbarlow.py` and `./train/config.yaml`.
---
## Reference
Note that the method used here does not use a target network; instead, positive pairs are generated by RDKit-based random enumeration of each molecule's SMILES.
```bibtex
@misc{çağatan2024unseeunsupervisednoncontrastivesentence,
title={UNSEE: Unsupervised Non-contrastive Sentence Embeddings},
author={Ömer Veysel Çağatan},
year={2024},
eprint={2401.15316},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2401.15316},
}
```
---
## Citation
If you use this model, please cite:
```bibtex
SBERT:
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
year = "2019",
url = "https://arxiv.org/abs/1908.10084"
}
Tokenizer:
@misc{chithrananda2020chembertalargescaleselfsupervisedpretraining,
title={ChemBERTa: Large-Scale Self-Supervised Pretraining for Molecular Property Prediction},
author={Seyone Chithrananda and Gabriel Grand and Bharath Ramsundar},
year={2020},
eprint={2010.09885},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2010.09885},
}
Data:
@article{sorokina2021coconut,
title={COCONUT online: Collection of Open Natural Products database},
author={Sorokina, Maria and Merseburger, Peter and Rajan, Kohulan and Yirik, Mehmet Aziz and Steinbeck, Christoph},
journal={Journal of Cheminformatics},
volume={13},
number={1},
pages={2},
year={2021},
doi={10.1186/s13321-020-00478-9}
}
@article{zdrazil2023chembl,
title={The ChEMBL Database in 2023: a drug discovery platform spanning multiple bioactivity data types and time periods},
author={Zdrazil, Barbara and Felix, Eloy and Hunter, Fiona and Manners, Emma J and Blackshaw, James and Corbett, Sybilla and de Veij, Marleen and Ioannidis, Harris and Lopez, David Mendez and Mosquera, Juan F and Magarinos, Maria Paula and Bosc, Nicolas and Arcila, Ricardo and Kizil{\"o}ren, Tevfik and Gaulton, Anna and Bento, A Patr{\'i}cia and Adasme, Melissa F and Monecke, Peter and Landrum, Gregory A and Leach, Andrew R},
journal={Nucleic Acids Research},
year={2023},
volume={gkad1004},
doi={10.1093/nar/gkad1004}
}
@misc{chembl34,
title={ChEMBL34},
year={2023},
doi={10.6019/CHEMBL.database.34}
}
@article{Gallo2023,
author = {Gallo, K and Kemmler, E and Goede, A and Becker, F and Dunkel, M and Preissner, R and Banerjee, P},
title = {{SuperNatural 3.0-a database of natural products and natural product-based derivatives}},
journal = {Nucleic Acids Research},
year = {2023},
month = jan,
day = {6},
volume = {51},
number = {D1},
pages = {D654-D659},
doi = {10.1093/nar/gkac1008}
}
Optimizer:
@article{wright2021ranger21,
title={Ranger21: a synergistic deep learning optimizer},
author={Wright, Less and Demeure, Nestor},
year={2021},
journal={arXiv preprint arXiv:2106.13731},
}
```