---
tags:
- sentence-transformers
- smiles-similarity
- feature-extraction
- molecular-similarity
pipeline_tag: sentence-similarity
library_name: sentence-transformers
metrics:
- spearmanr
license: apache-2.0
new_version: Derify/ChemMRL
---

# Chem-MRL (SentenceTransformer)

This is a [Chem-MRL](https://github.com/emapco/chem-mrl) model trained with [sentence-transformers](https://www.SBERT.net). It maps SMILES strings to a 1024-dimensional dense vector space and can be used for molecular similarity, semantic search, database indexing, molecular classification, clustering, and more.

## Model Details

### Model Description

- **Model Type:** Sentence Transformer
- **Maximum Sequence Length:** 128 tokens
- **Output Dimensionality:** 1024 dimensions
- **Similarity Function:** Cosine Similarity
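
Because the architecture listed in this card ends with a `Normalize()` module, the output embeddings have unit length, so cosine similarity reduces to a plain dot product. The following is a minimal, dependency-free sketch of that relationship. The 4-dimensional vectors are made-up stand-ins for real 1024-dimensional embeddings, and the helper names are illustrative, not part of any library.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy stand-ins for model outputs (assumption: real embeddings are 1024-dimensional).
u = l2_normalize([0.2, -1.0, 0.5, 0.3])
v = l2_normalize([0.1, -0.8, 0.7, 0.0])

# For unit-length vectors, the dot product and the cosine similarity agree.
dot_uv = sum(x * y for x, y in zip(u, v))
cos_uv = cosine_similarity(u, v)
print(abs(dot_uv - cos_uv) < 1e-9)
```

In practice this means a vector index configured for inner-product search should rank these embeddings the same way as one configured for cosine similarity.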

### Model Sources

- **Repository:** [Chem-MRL on GitHub](https://github.com/emapco/chem-mrl)
- **Demo App Repository:** [Chem-MRL-demo on GitHub](https://github.com/emapco/chem-mrl-demo)

### Full Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: RobertaModel (ChemBERTa)
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
```
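
The pooling configuration above enables only `pooling_mode_mean_tokens`, and the `Normalize()` module follows it. As a rough illustration of what those two stages compute, here is a pure-Python sketch of masked mean pooling followed by L2 normalization. The function names and the toy 3-dimensional token vectors are assumptions made for the example. The library's own implementation operates on batched tensors.

```python
import math


def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, skipping positions whose mask value is 0 (padding)."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            summed = [s + x for s, x in zip(summed, vec)]
            count += 1
    count = max(count, 1)  # guard against an all-padding sequence
    return [s / count for s in summed]


def l2_normalize(vec):
    """Scale a vector to unit length; leave an all-zero vector unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


# Three token vectors; the last one is padding and is masked out.
tokens = [[1.0, 2.0, 2.0], [3.0, 0.0, 2.0], [9.0, 9.0, 9.0]]
mask = [1, 1, 0]

pooled = mean_pool(tokens, mask)     # mean of the first two vectors
embedding = l2_normalize(pooled)     # unit-length sentence embedding
print(pooled, embedding)
```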

## Usage

### Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

```bash
pip install -U sentence-transformers
```

Then you can load this model and run inference.

```python
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("Derify/ChemMRL-alpha")

# Run inference on SMILES strings
sentences = [
    "CCO",      # ethanol
    "CC(C)O",   # isopropanol
    "CC(=O)O",  # acetic acid
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 1024)

# Get the pairwise similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])
```
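
The model name and the Matryoshka Representation Learning reference in the citations suggest that the leading dimensions of each embedding can be used on their own. This card does not list which truncated sizes were trained, so the sketch below treats the target dimension as a hypothetical choice. It shows the usual recipe on toy 8-dimensional vectors: keep the first `dim` components, then re-normalize so that dot products are again cosine similarities.

```python
import math


def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` components, then rescale to unit length."""
    head = list(embedding[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm > 0 else head


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


# Toy unit-length vectors standing in for full-size embeddings (assumption).
full_a = [0.5, 0.5, 0.5, 0.5, 0.0, 0.0, 0.0, 0.0]
full_b = [0.5, 0.5, 0.0, 0.0, 0.5, 0.5, 0.0, 0.0]

# Hypothetical truncated size; choose one the model was actually trained for.
small_a = truncate_and_normalize(full_a, 4)
small_b = truncate_and_normalize(full_b, 4)

print(dot(full_a, full_b))    # similarity at full size
print(dot(small_a, small_b))  # similarity after truncation
```

Similarity values generally shift after truncation, so thresholds tuned at full size should be re-checked. Sentence Transformers also exposes a `truncate_dim` argument on the `SentenceTransformer` constructor for the same purpose.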

## Training Details

### Framework Versions

- Python: 3.12.9
- Sentence Transformers: 4.0.1
- Transformers: 4.48.2
- PyTorch: 2.6.0+cu124
- Accelerate: 1.4.0
- Datasets: 3.3.2
- Tokenizers: 0.21.0

## Citation

- Chithrananda, Seyone, et al. "ChemBERTa: Large-Scale Self-Supervised Pretraining for Molecular Property Prediction." _arXiv [Cs.LG]_, 2020. [Link](http://arxiv.org/abs/2010.09885).
- Ahmad, Walid, et al. "ChemBERTa-2: Towards Chemical Foundation Models." _arXiv [Cs.LG]_, 2022. [Link](http://arxiv.org/abs/2209.01712).
- Kusupati, Aditya, et al. "Matryoshka Representation Learning." _arXiv [Cs.LG]_, 2022. [Link](https://arxiv.org/abs/2205.13147).
- Li, Xianming, et al. "2D Matryoshka Sentence Embeddings." _arXiv [Cs.CL]_, 2024. [Link](http://arxiv.org/abs/2402.14776).
- Bajusz, Dávid, et al. "Why is the Tanimoto Index an Appropriate Choice for Fingerprint-Based Similarity Calculations?" _J Cheminform_, 7, 20 (2015). [Link](https://doi.org/10.1186/s13321-015-0069-3).
- Li, Xiaoya, et al. "Dice Loss for Data-imbalanced NLP Tasks." _arXiv [Cs.CL]_, 2020. [Link](https://arxiv.org/abs/1911.02855).
- Reimers, Nils, and Gurevych, Iryna. "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks." _Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing_, 2019. [Link](https://arxiv.org/abs/1908.10084).

## Model Card Authors

[@eacortes](https://huggingface.co/eacortes)

## Model Card Contact

Manny Cortes (manny@derifyai.com)