Update README.md

README.md CHANGED

@@ -1,13 +1,10 @@
 ---
-language:
-- en
 tags:
 - sentence-transformers
+- molecular-similarity
 - feature-extraction
 - dense
 - generated_from_trainer
-- dataset_size:19692766
 - loss:Matryoshka2dLoss
 - loss:MatryoshkaLoss
 - loss:TanimotoSentLoss
@@ -36,7 +33,8 @@ widget:
 - source_sentence: Clc1nccc(C#CCCc2nc3ccccc3o2)n1
   sentences:
   - O=Cc1nc2ccccc2o1
-  - O=C([O-])COc1ccc(CCCS(=O)(=O)c2ccc(Cl)cc2)cc1NC(=O)c1cccc(C=Cc2nc3ccccc3s2)c1
+  - >-
+    O=C([O-])COc1ccc(CCCS(=O)(=O)c2ccc(Cl)cc2)cc1NC(=O)c1cccc(C=Cc2nc3ccccc3s2)c1
   - O[C@H]1CN(C(Cc2ccccc2)c2ccccc2)C[C@@H]1Cc1cnc[nH]1
 datasets:
 - Derify/pubchem_10m_genmol_similarity
@@ -57,11 +55,12 @@ model-index:
 - type: spearman
   value: 0.9932120589500998
   name: Spearman
+license: apache-2.0
 ---
 
 # SentenceTransformer based on Derify/ChemBERTa-druglike
 
-This is a [Chem-MRL](https://github.com/emapco/chem-mrl) ([sentence-transformers](https://www.SBERT.net)) model finetuned from [Derify/ChemBERTa-druglike](https://huggingface.co/Derify/ChemBERTa-druglike) on the [pubchem_10m_genmol_similarity](https://huggingface.co/datasets/Derify/pubchem_10m_genmol_similarity) dataset. It maps SMILES to a 1024-dimensional dense vector space and can be used for
+This is a [Chem-MRL](https://github.com/emapco/chem-mrl) ([sentence-transformers](https://www.SBERT.net)) model finetuned from [Derify/ChemBERTa-druglike](https://huggingface.co/Derify/ChemBERTa-druglike) on the [pubchem_10m_genmol_similarity](https://huggingface.co/datasets/Derify/pubchem_10m_genmol_similarity) dataset. It maps SMILES to a 1024-dimensional dense vector space and can be used for molecular similarity, semantic search, database indexing, molecular classification, clustering, and more.
 
 ## Model Details
 
@@ -73,7 +72,6 @@ This is a [Chem-MRL](https://github.com/emapco/chem-mrl) ([sentence-transformers
 - **Similarity Function:** Tanimoto
 - **Training Dataset:**
   - [pubchem_10m_genmol_similarity](https://huggingface.co/datasets/Derify/pubchem_10m_genmol_similarity)
-- **Language:** en
 - **License:** [Apache-2.0](https://huggingface.co/Derify/ChemBERTa-druglike/blob/main/LICENSE)
 
 ### Model Sources
@@ -93,12 +91,12 @@ SentenceTransformer(
 
 ## Usage
 
-### Direct Usage (
+### Direct Usage (Chem-MRL)
 
-First install the
+First install the Chem-MRL library:
 
 ```bash
-pip install -U
+pip install -U chem-mrl>=0.7.3
 ```
 
 Then you can load this model and run inference.
@@ -106,23 +104,34 @@
 from chem_mrl import ChemMRL
 
 # Download from the 🤗 Hub
+chem_mrl = ChemMRL("Derify/ChemMRL-beta")
 # Run inference
 sentences = [
     "Clc1nccc(C#CCCc2nc3ccccc3o2)n1",
     "O=Cc1nc2ccccc2o1",
     "O[C@H]1CN(C(Cc2ccccc2)c2ccccc2)C[C@@H]1Cc1cnc[nH]1",
 ]
-embeddings =
+embeddings = chem_mrl.backbone.encode(sentences)
 print(embeddings.shape)
 # [3, 1024]
 
 # Get the similarity scores for the embeddings
-similarities =
+similarities = chem_mrl.backbone.similarity(embeddings, embeddings)
 print(similarities)
-# tensor([[1.0000, 0.
-# [0.
-# [0.
+# tensor([[1.0000, 0.3200, 0.1209],
+#         [0.3200, 1.0000, 0.0950],
+#         [0.1209, 0.0950, 1.0000]])
+
+# Load the model with half precision
+chem_mrl = ChemMRL("Derify/ChemMRL-beta", use_half_precision=True)
+sentences = [
+    "Clc1nccc(C#CCCc2nc3ccccc3o2)n1",
+    "O=Cc1nc2ccccc2o1",
+    "O[C@H]1CN(C(Cc2ccccc2)c2ccccc2)C[C@@H]1Cc1cnc[nH]1",
+]
+embeddings = chem_mrl.embed(sentences)  # Use the embed method for half precision
+print(embeddings.shape)
+# [3, 1024]
 ```
 
 ## Evaluation
@@ -139,21 +148,10 @@ print(similarities)
 }
 ```
 
-| Metric | Value |
-| :----------- | :--------- |
-| **spearman** | **0.
-
-<!--
-## Bias, Risks and Limitations
-
-*What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
--->
-
-<!--
-### Recommendations
-
-*What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
--->
+| Split | Metric | Value |
+| :--------- | :----------- | :--------- |
+| **validation** | **spearman** | **0.993212** |
+| **test** | **spearman** | **0.993243** |
 
 ## Training Details
 
@@ -176,6 +174,8 @@
 | <code>OCCN1CC[NH+](Cc2ccccc2OC2CC2)CC1</code> | <code>OCCN1CC[NH+](Cc2ccccc2On2cccn2)CC1</code> | <code>0.6615384817123413</code> |
 | <code>CC1CN(C(=O)C2CC[NH+](Cc3cccc(C(N)=O)c3)CC2)CC(C)O1</code> | <code>CC1CN(C(=O)C2CC[NH+](Cc3ccccc3)CC2)CC(C)O1</code> | <code>0.7123287916183472</code> |
 * Loss: [<code>Matryoshka2dLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#matryoshka2dloss) with these parameters:
+  <details><summary>Click to expand</summary>
+
 ```json
 {
   "loss": "TanimotoSentLoss",
@@ -207,6 +207,7 @@
   "n_dims_per_step": -1
 }
 ```
+  </details>
 
 ### Training Hyperparameters
 #### Non-Default Hyperparameters
@@ -512,4 +513,4 @@
 
 ## Model Card Contact
 
-Manny Cortes (manny@derifyai.com)
+Manny Cortes (manny@derifyai.com)
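The card names Tanimoto as the model's similarity function (and its training losses include TanimotoSentLoss). As background, the classic Tanimoto coefficient over binary fingerprint bits, and the continuous form commonly used for real-valued embeddings, can be sketched in plain Python. These helpers are illustrative only and are not the chem-mrl implementation:

```python
def tanimoto_sets(a: set, b: set) -> float:
    # Classic Tanimoto over the "on" bits of two binary fingerprints:
    # |A & B| / |A | B|
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def tanimoto_dense(u, v) -> float:
    # Continuous Tanimoto for real-valued vectors:
    # (u . v) / (|u|^2 + |v|^2 - u . v)
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (sum(x * x for x in u) + sum(y * y for y in v) - dot)
```

For identical vectors the continuous form evaluates to 1, matching the diagonal of the similarity matrix printed in the usage snippet above.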
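Because the card lists Matryoshka-style losses (Matryoshka2dLoss/MatryoshkaLoss), the 1024-dimensional embeddings should remain usable when truncated to a shorter prefix. A minimal sketch of that truncate-and-renormalize step, assuming plain Python lists (hypothetical helper, not part of the chem-mrl API):

```python
import math


def truncate_matryoshka(embedding: list, dim: int) -> list:
    # Keep the first `dim` components of a Matryoshka-style embedding
    # and L2-normalize so cosine-style similarities stay comparable.
    head = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head
    return [x / norm for x in head]
```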
|