eacortes committed · Commit 362013e (verified) · 1 parent: aa361a7

Update README.md

Files changed (1):
  1. README.md +33 -32
README.md CHANGED
@@ -1,13 +1,10 @@
 ---
-language:
-- en
 tags:
 - sentence-transformers
-- sentence-similarity
 - feature-extraction
 - dense
 - generated_from_trainer
-- dataset_size:19692766
 - loss:Matryoshka2dLoss
 - loss:MatryoshkaLoss
 - loss:TanimotoSentLoss
@@ -36,7 +33,8 @@ widget:
 - source_sentence: Clc1nccc(C#CCCc2nc3ccccc3o2)n1
   sentences:
   - O=Cc1nc2ccccc2o1
-  - O=C([O-])COc1ccc(CCCS(=O)(=O)c2ccc(Cl)cc2)cc1NC(=O)c1cccc(C=Cc2nc3ccccc3s2)c1
   - O[C@H]1CN(C(Cc2ccccc2)c2ccccc2)C[C@@H]1Cc1cnc[nH]1
 datasets:
 - Derify/pubchem_10m_genmol_similarity
@@ -57,11 +55,12 @@ model-index:
     - type: spearman
       value: 0.9932120589500998
       name: Spearman
 ---

 # SentenceTransformer based on Derify/ChemBERTa-druglike

-This is a [Chem-MRL](https://github.com/emapco/chem-mrl) ([sentence-transformers](https://www.SBERT.net)) model finetuned from [Derify/ChemBERTa-druglike](https://huggingface.co/Derify/ChemBERTa-druglike) on the [pubchem_10m_genmol_similarity](https://huggingface.co/datasets/Derify/pubchem_10m_genmol_similarity) dataset. It maps SMILES to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, database indexing, molecular classification, clustering, and more.

 ## Model Details

@@ -73,7 +72,6 @@ This is a [Chem-MRL](https://github.com/emapco/chem-mrl) ([sentence-transformers
 - **Similarity Function:** Tanimoto
 - **Training Dataset:**
   - [pubchem_10m_genmol_similarity](https://huggingface.co/datasets/Derify/pubchem_10m_genmol_similarity)
-- **Language:** en
 - **License:** [Apache-2.0](https://huggingface.co/Derify/ChemBERTa-druglike/blob/main/LICENSE)

 ### Model Sources
@@ -93,12 +91,12 @@ SentenceTransformer(

 ## Usage

-### Direct Usage (Sentence Transformers)

-First install the Sentence Transformers library:

 ```bash
-pip install -U sentence-transformers
 ```

 Then you can load this model and run inference.
@@ -106,23 +104,34 @@ Then you can load this model and run inference.
 from chem_mrl import ChemMRL

 # Download from the 🤗 Hub
-model = ChemMRL("Derify/ChemMRL-beta")
 # Run inference
 sentences = [
     "Clc1nccc(C#CCCc2nc3ccccc3o2)n1",
     "O=Cc1nc2ccccc2o1",
     "O[C@H]1CN(C(Cc2ccccc2)c2ccccc2)C[C@@H]1Cc1cnc[nH]1",
 ]
-embeddings = model.encode(sentences)
 print(embeddings.shape)
 # [3, 1024]

 # Get the similarity scores for the embeddings
-similarities = model.similarity(embeddings, embeddings)
 print(similarities)
-# tensor([[1.0000, 0.4848, 0.2158],
-#         [0.4848, 1.0000, 0.1735],
-#         [0.2158, 0.1735, 1.0000]])
 ```

 ## Evaluation
@@ -139,21 +148,10 @@ print(similarities)
 }
 ```

-| Metric       | Value      |
-| :----------- | :--------- |
-| **spearman** | **0.9932** |
-
-<!--
-## Bias, Risks and Limitations
-
-*What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
--->
-
-<!--
-### Recommendations
-
-*What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
--->

 ## Training Details

@@ -176,6 +174,8 @@ print(similarities)
 | <code>OCCN1CC[NH+](Cc2ccccc2OC2CC2)CC1</code> | <code>OCCN1CC[NH+](Cc2ccccc2On2cccn2)CC1</code> | <code>0.6615384817123413</code> |
 | <code>CC1CN(C(=O)C2CC[NH+](Cc3cccc(C(N)=O)c3)CC2)CC(C)O1</code> | <code>CC1CN(C(=O)C2CC[NH+](Cc3ccccc3)CC2)CC(C)O1</code> | <code>0.7123287916183472</code> |
 * Loss: [<code>Matryoshka2dLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#matryoshka2dloss) with these parameters:
 ```json
 {
     "loss": "TanimotoSentLoss",
@@ -207,6 +207,7 @@ print(similarities)
     "n_dims_per_step": -1
 }
 ```

 ### Training Hyperparameters
 #### Non-Default Hyperparameters
@@ -512,4 +513,4 @@ print(similarities)

 ## Model Card Contact

-Manny Cortes (manny@derifyai.com)
 
 ---
 tags:
 - sentence-transformers
+- molecular-similarity
 - feature-extraction
 - dense
 - generated_from_trainer
 - loss:Matryoshka2dLoss
 - loss:MatryoshkaLoss
 - loss:TanimotoSentLoss

 - source_sentence: Clc1nccc(C#CCCc2nc3ccccc3o2)n1
   sentences:
   - O=Cc1nc2ccccc2o1
+  - >-
+    O=C([O-])COc1ccc(CCCS(=O)(=O)c2ccc(Cl)cc2)cc1NC(=O)c1cccc(C=Cc2nc3ccccc3s2)c1
   - O[C@H]1CN(C(Cc2ccccc2)c2ccccc2)C[C@@H]1Cc1cnc[nH]1
 datasets:
 - Derify/pubchem_10m_genmol_similarity

     - type: spearman
       value: 0.9932120589500998
       name: Spearman
+license: apache-2.0
 ---

 # SentenceTransformer based on Derify/ChemBERTa-druglike

+This is a [Chem-MRL](https://github.com/emapco/chem-mrl) ([sentence-transformers](https://www.SBERT.net)) model finetuned from [Derify/ChemBERTa-druglike](https://huggingface.co/Derify/ChemBERTa-druglike) on the [pubchem_10m_genmol_similarity](https://huggingface.co/datasets/Derify/pubchem_10m_genmol_similarity) dataset. It maps SMILES to a 1024-dimensional dense vector space and can be used for molecular similarity, semantic search, database indexing, molecular classification, clustering, and more.

 ## Model Details

 - **Similarity Function:** Tanimoto
 - **Training Dataset:**
   - [pubchem_10m_genmol_similarity](https://huggingface.co/datasets/Derify/pubchem_10m_genmol_similarity)
 - **License:** [Apache-2.0](https://huggingface.co/Derify/ChemBERTa-druglike/blob/main/LICENSE)

 ### Model Sources

 ## Usage

+### Direct Usage (Chem-MRL)

+First install the Chem-MRL library:

 ```bash
+pip install -U "chem-mrl>=0.7.3"
 ```

 Then you can load this model and run inference.

 from chem_mrl import ChemMRL

 # Download from the 🤗 Hub
+chem_mrl = ChemMRL("Derify/ChemMRL-beta")
 # Run inference
 sentences = [
     "Clc1nccc(C#CCCc2nc3ccccc3o2)n1",
     "O=Cc1nc2ccccc2o1",
     "O[C@H]1CN(C(Cc2ccccc2)c2ccccc2)C[C@@H]1Cc1cnc[nH]1",
 ]
+embeddings = chem_mrl.backbone.encode(sentences)
 print(embeddings.shape)
 # [3, 1024]

 # Get the similarity scores for the embeddings
+similarities = chem_mrl.backbone.similarity(embeddings, embeddings)
 print(similarities)
+# tensor([[1.0000, 0.3200, 0.1209],
+#         [0.3200, 1.0000, 0.0950],
+#         [0.1209, 0.0950, 1.0000]])
+
+# Load the model with half precision
+chem_mrl = ChemMRL("Derify/ChemMRL-beta", use_half_precision=True)
+sentences = [
+    "Clc1nccc(C#CCCc2nc3ccccc3o2)n1",
+    "O=Cc1nc2ccccc2o1",
+    "O[C@H]1CN(C(Cc2ccccc2)c2ccccc2)C[C@@H]1Cc1cnc[nH]1",
+]
+embeddings = chem_mrl.embed(sentences)  # Use the embed method for half precision
+print(embeddings.shape)
+# [3, 1024]
 ```
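The card lists Tanimoto as the model's similarity function. For dense, real-valued embeddings, the usual continuous generalization is T(a, b) = ⟨a, b⟩ / (‖a‖² + ‖b‖² − ⟨a, b⟩), which reduces to the classic Tanimoto/Jaccard index on binary fingerprints. A minimal sketch of that formula in plain NumPy (toy stand-in vectors, not the chem-mrl API; `tanimoto_similarity` is my own name for illustration):

```python
import numpy as np

def tanimoto_similarity(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Continuous Tanimoto similarity between all rows of a and all rows of b.

    T(x, y) = <x, y> / (||x||^2 + ||y||^2 - <x, y>)
    """
    dot = a @ b.T                                # pairwise inner products
    sq_a = (a * a).sum(axis=1, keepdims=True)    # ||x||^2 as a column vector
    sq_b = (b * b).sum(axis=1, keepdims=True).T  # ||y||^2 as a row vector
    return dot / (sq_a + sq_b - dot)

# Toy 3 x 4 "embeddings" (not real model output)
emb = np.array([
    [1.0, 0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
])
sim = tanimoto_similarity(emb, emb)
print(np.round(sim, 4))  # diagonal is 1.0; sim[0, 1] = 1/3
```

Like cosine similarity, the measure is symmetric and equals 1 for identical vectors, which is why the similarity matrices in the usage example above have a unit diagonal.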

 ## Evaluation

 }
 ```

+| Split          | Metric       | Value        |
+| :------------- | :----------- | :----------- |
+| **validation** | **spearman** | **0.993212** |
+| **test**       | **spearman** | **0.993243** |
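The reported metric is Spearman's rank correlation between the model's predicted similarities and the reference Tanimoto scores: it compares rankings rather than raw values, so a model that orders molecule pairs correctly scores 1.0 even if its similarity scale differs. A small self-contained sketch with made-up numbers (not the actual evaluation data), implemented directly in NumPy:

```python
import numpy as np

def spearman_rho(x: np.ndarray, y: np.ndarray) -> float:
    """Spearman correlation: Pearson correlation of the ranks (no ties assumed)."""
    rx = np.argsort(np.argsort(x)).astype(float)  # rank of each x value
    ry = np.argsort(np.argsort(y)).astype(float)  # rank of each y value
    rx -= rx.mean()
    ry -= ry.mean()
    return float((rx @ ry) / np.sqrt((rx @ rx) * (ry @ ry)))

# Hypothetical reference Tanimoto scores vs. model-predicted similarities
reference = np.array([0.12, 0.45, 0.33, 0.88, 0.67])
predicted = np.array([0.10, 0.50, 0.30, 0.92, 0.70])

print(spearman_rho(reference, predicted))  # 1.0: the rankings match exactly
```

Here the predicted values differ from the references, but the ordering is identical, so rho is exactly 1.0; reversing the ordering would give -1.0.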

 ## Training Details

 | <code>OCCN1CC[NH+](Cc2ccccc2OC2CC2)CC1</code> | <code>OCCN1CC[NH+](Cc2ccccc2On2cccn2)CC1</code> | <code>0.6615384817123413</code> |
 | <code>CC1CN(C(=O)C2CC[NH+](Cc3cccc(C(N)=O)c3)CC2)CC(C)O1</code> | <code>CC1CN(C(=O)C2CC[NH+](Cc3ccccc3)CC2)CC(C)O1</code> | <code>0.7123287916183472</code> |
 * Loss: [<code>Matryoshka2dLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#matryoshka2dloss) with these parameters:
+<details><summary>Click to expand</summary>
+
 ```json
 {
     "loss": "TanimotoSentLoss",

     "n_dims_per_step": -1
 }
 ```
+</details>
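Because the model is trained with a Matryoshka loss, the leading components of each embedding are optimized to be useful on their own, so embeddings can be truncated to a prefix for cheaper storage and faster search. A hedged sketch of that downstream trick in plain NumPy (random stand-in vectors and a hypothetical 256-dim target; the actual matryoshka dimensions used in training are not shown in this excerpt, and re-normalization is the convention for cosine-style retrieval):

```python
import numpy as np

def truncate_embeddings(emb: np.ndarray, dim: int) -> np.ndarray:
    """Keep only the first `dim` components of each row and re-normalize.

    Matryoshka-trained models are optimized so these prefixes remain
    meaningful on their own, trading some accuracy for a smaller index.
    """
    truncated = emb[:, :dim]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / np.clip(norms, 1e-12, None)

rng = np.random.default_rng(0)
full = rng.normal(size=(3, 1024))        # stand-in for 1024-dim model output
small = truncate_embeddings(full, 256)   # hypothetical smaller Matryoshka dim
print(small.shape)  # (3, 256)
```

A 256-dim index is a quarter the size of the full 1024-dim one; how much ranking quality survives the truncation depends on which dimensions the loss was actually trained on.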

 ### Training Hyperparameters
 #### Non-Default Hyperparameters

 ## Model Card Contact

+Manny Cortes (manny@derifyai.com)