Upload README.md
README.md
CHANGED
|
@@ -1,4 +1,5 @@
|
|
| 1 |
---
|
|
|
|
| 2 |
tags:
|
| 3 |
- sentence-transformers
|
| 4 |
- molecular-similarity
|
|
@@ -43,7 +44,7 @@ library_name: sentence-transformers
|
|
| 43 |
metrics:
|
| 44 |
- spearman
|
| 45 |
model-index:
|
| 46 |
-
- name:
|
| 47 |
results:
|
| 48 |
- task:
|
| 49 |
type: semantic-similarity
|
|
@@ -55,10 +56,10 @@ model-index:
|
|
| 55 |
- type: spearman
|
| 56 |
value: 0.9932120589500998
|
| 57 |
name: Spearman
|
| 58 |
-
|
| 59 |
---
|
| 60 |
|
| 61 |
-
#
|
| 62 |
|
| 63 |
This is a [Chem-MRL](https://github.com/emapco/chem-mrl) ([sentence-transformers](https://www.SBERT.net)) model finetuned from [Derify/ChemBERTa-druglike](https://huggingface.co/Derify/ChemBERTa-druglike) on the [pubchem_10m_genmol_similarity](https://huggingface.co/datasets/Derify/pubchem_10m_genmol_similarity) dataset. It maps SMILES to a 1024-dimensional dense vector space and can be used for molecular similarity, semantic search, database indexing, molecular classification, clustering, and more.
|
| 64 |
|
|
@@ -72,7 +73,7 @@ This is a [Chem-MRL](https://github.com/emapco/chem-mrl) ([sentence-transformers
|
|
| 72 |
- **Similarity Function:** Tanimoto
|
| 73 |
- **Training Dataset:**
|
| 74 |
- [pubchem_10m_genmol_similarity](https://huggingface.co/datasets/Derify/pubchem_10m_genmol_similarity)
|
| 75 |
-
- **License:**
|
| 76 |
|
| 77 |
### Model Sources
|
| 78 |
|
|
@@ -104,32 +105,32 @@ Then you can load this model and run inference.
|
|
| 104 |
from chem_mrl import ChemMRL
|
| 105 |
|
| 106 |
# Download from the 🤗 Hub
|
| 107 |
-
|
| 108 |
# Run inference
|
| 109 |
sentences = [
|
| 110 |
"Clc1nccc(C#CCCc2nc3ccccc3o2)n1",
|
| 111 |
"O=Cc1nc2ccccc2o1",
|
| 112 |
"O[C@H]1CN(C(Cc2ccccc2)c2ccccc2)C[C@@H]1Cc1cnc[nH]1",
|
| 113 |
]
|
| 114 |
-
embeddings =
|
| 115 |
print(embeddings.shape)
|
| 116 |
# [3, 1024]
|
| 117 |
|
| 118 |
# Get the similarity scores for the embeddings
|
| 119 |
-
similarities =
|
| 120 |
print(similarities)
|
| 121 |
# tensor([[1.0000, 0.3200, 0.1209],
|
| 122 |
# [0.3200, 1.0000, 0.0950],
|
| 123 |
# [0.1209, 0.0950, 1.0000]])
|
| 124 |
|
| 125 |
# Load the model with half precision
|
| 126 |
-
|
| 127 |
sentences = [
|
| 128 |
"Clc1nccc(C#CCCc2nc3ccccc3o2)n1",
|
| 129 |
"O=Cc1nc2ccccc2o1",
|
| 130 |
"O[C@H]1CN(C(Cc2ccccc2)c2ccccc2)C[C@@H]1Cc1cnc[nH]1",
|
| 131 |
]
|
| 132 |
-
embeddings =
|
| 133 |
print(embeddings.shape)
|
| 134 |
# [3, 1024]
|
| 135 |
```
|
|
@@ -148,10 +149,10 @@ print(embeddings.shape)
|
|
| 148 |
}
|
| 149 |
```
|
| 150 |
|
| 151 |
-
| Split
|
| 152 |
-
|
|
| 153 |
| **validation** | **spearman** | **0.993212** |
|
| 154 |
-
| **test**
|
| 155 |
|
| 156 |
## Training Details
|
| 157 |
|
|
@@ -166,7 +167,7 @@ print(embeddings.shape)
|
|
| 166 |
| | smiles_a | smiles_b | label |
|
| 167 |
| :------ | :---------------------------------------------------------------------------------- | :---------------------------------------------------------------------------------- | :-------------------------------------------------------------- |
|
| 168 |
| type | string | string | float |
|
| 169 |
-
| details | <ul><li>min: 17 tokens</li><li>mean: 39.66 tokens</li><li>max: 119 tokens</li></ul> | <ul><li>min: 11 tokens</li><li>mean: 38.29 tokens</li><li>max: 115 tokens</li></ul> | <ul><li>min: 0.02</li><li>mean: 0.57</li><li>max: 1.0</li></ul> |
|
| 170 |
* Loss: [<code>Matryoshka2dLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#matryoshka2dloss) with these parameters:
|
| 171 |
<details><summary>Click to expand</summary>
|
| 172 |
|
|
@@ -492,12 +493,12 @@ print(embeddings.shape)
|
|
| 492 |
|
| 493 |
#### TanimotoSentLoss
|
| 494 |
```bibtex
|
| 495 |
-
@online{
|
| 496 |
title={TanimotoSentLoss: Tanimoto Loss for SMILES Embeddings},
|
| 497 |
author={Emmanuel Cortes},
|
| 498 |
year={2025},
|
| 499 |
month={Jan},
|
| 500 |
-
url={https://github.com/emapco/chem-mrl
|
| 501 |
}
|
| 502 |
```
|
| 503 |
|
|
@@ -507,4 +508,4 @@ print(embeddings.shape)
|
|
| 507 |
|
| 508 |
## Model Card Contact
|
| 509 |
|
| 510 |
-
Manny Cortes (manny@derifyai.com)
|
|
|
|
| 1 |
---
|
| 2 |
+
license: apache-2.0
|
| 3 |
tags:
|
| 4 |
- sentence-transformers
|
| 5 |
- molecular-similarity
|
|
|
|
| 44 |
metrics:
|
| 45 |
- spearman
|
| 46 |
model-index:
|
| 47 |
+
- name: 'ChemMRL: SMILES Matryoshka Representation Learning Embedding Transformer'
|
| 48 |
results:
|
| 49 |
- task:
|
| 50 |
type: semantic-similarity
|
|
|
|
| 56 |
- type: spearman
|
| 57 |
value: 0.9932120589500998
|
| 58 |
name: Spearman
|
| 59 |
+
new_version: Derify/ChemMRL
|
| 60 |
---
|
| 61 |
|
| 62 |
+
# ChemMRL: SMILES Matryoshka Representation Learning Embedding Transformer
|
| 63 |
|
| 64 |
This is a [Chem-MRL](https://github.com/emapco/chem-mrl) ([sentence-transformers](https://www.SBERT.net)) model finetuned from [Derify/ChemBERTa-druglike](https://huggingface.co/Derify/ChemBERTa-druglike) on the [pubchem_10m_genmol_similarity](https://huggingface.co/datasets/Derify/pubchem_10m_genmol_similarity) dataset. It maps SMILES to a 1024-dimensional dense vector space and can be used for molecular similarity, semantic search, database indexing, molecular classification, clustering, and more.
|
| 65 |
|
|
|
|
| 73 |
- **Similarity Function:** Tanimoto
|
| 74 |
- **Training Dataset:**
|
| 75 |
- [pubchem_10m_genmol_similarity](https://huggingface.co/datasets/Derify/pubchem_10m_genmol_similarity)
|
| 76 |
+
- **License:** apache-2.0
|
| 77 |
|
| 78 |
### Model Sources
|
| 79 |
|
|
|
|
| 105 |
from chem_mrl import ChemMRL
|
| 106 |
|
| 107 |
# Download from the 🤗 Hub
|
| 108 |
+
model = ChemMRL("Derify/ChemMRL-beta")
|
| 109 |
# Run inference
|
| 110 |
sentences = [
|
| 111 |
"Clc1nccc(C#CCCc2nc3ccccc3o2)n1",
|
| 112 |
"O=Cc1nc2ccccc2o1",
|
| 113 |
"O[C@H]1CN(C(Cc2ccccc2)c2ccccc2)C[C@@H]1Cc1cnc[nH]1",
|
| 114 |
]
|
| 115 |
+
embeddings = model.backbone.encode(sentences)
|
| 116 |
print(embeddings.shape)
|
| 117 |
# [3, 1024]
|
| 118 |
|
| 119 |
# Get the similarity scores for the embeddings
|
| 120 |
+
similarities = model.backbone.similarity(embeddings, embeddings)
|
| 121 |
print(similarities)
|
| 122 |
# tensor([[1.0000, 0.3200, 0.1209],
|
| 123 |
# [0.3200, 1.0000, 0.0950],
|
| 124 |
# [0.1209, 0.0950, 1.0000]])
|
| 125 |
|
| 126 |
# Load the model with half precision
|
| 127 |
+
model = ChemMRL("Derify/ChemMRL-beta", use_half_precision=True)
|
| 128 |
sentences = [
|
| 129 |
"Clc1nccc(C#CCCc2nc3ccccc3o2)n1",
|
| 130 |
"O=Cc1nc2ccccc2o1",
|
| 131 |
"O[C@H]1CN(C(Cc2ccccc2)c2ccccc2)C[C@@H]1Cc1cnc[nH]1",
|
| 132 |
]
|
| 133 |
+
embeddings = model.embed(sentences) # Use the embed method for half precision
|
| 134 |
print(embeddings.shape)
|
| 135 |
# [3, 1024]
|
| 136 |
```
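
The card lists Tanimoto as the similarity function behind the scores printed above. As a rough illustration of what such a score measures, here is a minimal pure-Python sketch. It assumes the common continuous form T(a, b) = a·b / (‖a‖² + ‖b‖² − a·b); both that formula and the `tanimoto` helper are illustrative assumptions, not code taken from the chem-mrl or sentence-transformers sources.

```python
from typing import Sequence


def tanimoto(a: Sequence[float], b: Sequence[float]) -> float:
    """Continuous Tanimoto similarity between two equal-length vectors.

    Hypothetical helper for illustration; the library's own implementation
    may differ in details such as batching or numerical safeguards.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a)  # squared L2 norm of a
    norm_b = sum(y * y for y in b)  # squared L2 norm of b
    denom = norm_a + norm_b - dot
    # Two all-zero vectors are treated as identical to avoid dividing by zero.
    return 1.0 if denom == 0.0 else dot / denom


if __name__ == "__main__":
    print(tanimoto([1.0, 1.0, 0.0], [1.0, 0.0, 0.0]))
```

Under this definition a vector compared with itself scores 1.0 and orthogonal vectors score 0.0, which matches the unit diagonal of the similarity matrix in the example output.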
|
|
|
|
| 149 |
}
|
| 150 |
```
|
| 151 |
|
| 152 |
+
| Split | Metric | Value |
|
| 153 |
+
| :------------- | :----------- | :----------- |
|
| 154 |
| **validation** | **spearman** | **0.993212** |
|
| 155 |
+
| **test** | **spearman** | **0.993243** |
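
The table reports Spearman rank correlation, presumably between the model's predicted similarities and the reference labels. For readers unfamiliar with the metric, the sketch below computes it in pure Python as the Pearson correlation of average ranks. The `average_ranks` and `spearman` helpers are hypothetical names written for this illustration and are not the evaluation code used to produce the numbers above.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the full run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


if __name__ == "__main__":
    predicted = [0.1, 0.4, 0.3, 0.9]  # made-up scores for illustration
    labels = [0.2, 0.5, 0.6, 0.8]     # made-up labels for illustration
    print(spearman(predicted, labels))
```

Because only the ordering of the scores matters, a value near 1 means the predicted similarities rank molecule pairs in almost the same order as the labels, regardless of the absolute score values.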
|
| 156 |
|
| 157 |
## Training Details
|
| 158 |
|
|
|
|
| 167 |
| | smiles_a | smiles_b | label |
|
| 168 |
| :------ | :---------------------------------------------------------------------------------- | :---------------------------------------------------------------------------------- | :-------------------------------------------------------------- |
|
| 169 |
| type | string | string | float |
|
| 170 |
+
| details | <ul><li>min: 17 tokens</li><li>mean: 39.66 tokens</li><li>max: 119 tokens</li></ul> | <ul><li>min: 11 tokens</li><li>mean: 38.29 tokens</li><li>max: 115 tokens</li></ul> | <ul><li>min: 0.02</li><li>mean: 0.57</li><li>max: 1.0</li></ul> | | <code>0.7123287916183472</code> |
|
| 171 |
* Loss: [<code>Matryoshka2dLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#matryoshka2dloss) with these parameters:
|
| 172 |
<details><summary>Click to expand</summary>
|
| 173 |
|
|
|
|
| 493 |
|
| 494 |
#### TanimotoSentLoss
|
| 495 |
```bibtex
|
| 496 |
+
@online{cortes-2025-tanimotosentloss,
|
| 497 |
title={TanimotoSentLoss: Tanimoto Loss for SMILES Embeddings},
|
| 498 |
author={Emmanuel Cortes},
|
| 499 |
year={2025},
|
| 500 |
month={Jan},
|
| 501 |
+
url={https://github.com/emapco/chem-mrl},
|
| 502 |
}
|
| 503 |
```
|
| 504 |
|
|
|
|
| 508 |
|
| 509 |
## Model Card Contact
|
| 510 |
|
| 511 |
+
Manny Cortes (manny@derifyai.com)
|