eacortes committed on
Commit 8e08327 · verified · 1 Parent(s): aac158e

Upload README.md

Files changed (1)
  1. README.md +17 -16
README.md CHANGED
@@ -1,4 +1,5 @@
  ---
  tags:
  - sentence-transformers
  - molecular-similarity
@@ -43,7 +44,7 @@ library_name: sentence-transformers
  metrics:
  - spearman
  model-index:
- - name: SentenceTransformer based on Derify/ChemBERTa-druglike
  results:
  - task:
  type: semantic-similarity
@@ -55,10 +56,10 @@ model-index:
  - type: spearman
  value: 0.9932120589500998
  name: Spearman
- license: apache-2.0
  ---
 
- # SentenceTransformer based on Derify/ChemBERTa-druglike
 
  This is a [Chem-MRL](https://github.com/emapco/chem-mrl) ([sentence-transformers](https://www.SBERT.net)) model finetuned from [Derify/ChemBERTa-druglike](https://huggingface.co/Derify/ChemBERTa-druglike) on the [pubchem_10m_genmol_similarity](https://huggingface.co/datasets/Derify/pubchem_10m_genmol_similarity) dataset. It maps SMILES to a 1024-dimensional dense vector space and can be used for molecular similarity, semantic search, database indexing, molecular classification, clustering, and more.
 
@@ -72,7 +73,7 @@ This is a [Chem-MRL](https://github.com/emapco/chem-mrl) ([sentence-transformers
  - **Similarity Function:** Tanimoto
  - **Training Dataset:**
  - [pubchem_10m_genmol_similarity](https://huggingface.co/datasets/Derify/pubchem_10m_genmol_similarity)
- - **License:** [Apache-2.0](https://huggingface.co/Derify/ChemBERTa-druglike/blob/main/LICENSE)
 
  ### Model Sources
 
@@ -104,32 +105,32 @@ Then you can load this model and run inference.
  from chem_mrl import ChemMRL
 
  # Download from the 🤗 Hub
- chem_mrl = ChemMRL("Derify/ChemMRL-beta")
  # Run inference
  sentences = [
  "Clc1nccc(C#CCCc2nc3ccccc3o2)n1",
  "O=Cc1nc2ccccc2o1",
  "O[C@H]1CN(C(Cc2ccccc2)c2ccccc2)C[C@@H]1Cc1cnc[nH]1",
  ]
- embeddings = chem_mrl.backbone.encode(sentences)
  print(embeddings.shape)
  # [3, 1024]
 
  # Get the similarity scores for the embeddings
- similarities = chem_mrl.backbone.similarity(embeddings, embeddings)
  print(similarities)
  # tensor([[1.0000, 0.3200, 0.1209],
  # [0.3200, 1.0000, 0.0950],
  # [0.1209, 0.0950, 1.0000]])
 
  # Load the model with half precision
- chem_mrl = ChemMRL("Derify/ChemMRL-beta", use_half_precision=True)
  sentences = [
  "Clc1nccc(C#CCCc2nc3ccccc3o2)n1",
  "O=Cc1nc2ccccc2o1",
  "O[C@H]1CN(C(Cc2ccccc2)c2ccccc2)C[C@@H]1Cc1cnc[nH]1",
  ]
- embeddings = chem_mrl.embed(sentences) # Use the embed method for half precision
  print(embeddings.shape)
  # [3, 1024]
  ```
@@ -148,10 +149,10 @@ print(embeddings.shape)
  }
  ```
 
- | Split | Metric | Value |
- | :--------- | :----------- | :--------- |
  | **validation** | **spearman** | **0.993212** |
- | **test** | **spearman** | **0.993243** |
 
  ## Training Details
 
@@ -166,7 +167,7 @@ print(embeddings.shape)
  | | smiles_a | smiles_b | label |
  | :------ | :---------------------------------------------------------------------------------- | :---------------------------------------------------------------------------------- | :-------------------------------------------------------------- |
  | type | string | string | float |
- | details | <ul><li>min: 17 tokens</li><li>mean: 39.66 tokens</li><li>max: 119 tokens</li></ul> | <ul><li>min: 11 tokens</li><li>mean: 38.29 tokens</li><li>max: 115 tokens</li></ul> | <ul><li>min: 0.02</li><li>mean: 0.57</li><li>max: 1.0</li></ul> | | <code>0.7123287916183472</code> |
  * Loss: [<code>Matryoshka2dLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#matryoshka2dloss) with these parameters:
  <details><summary>Click to expand</summary>
 
@@ -492,12 +493,12 @@ print(embeddings.shape)
 
  #### TanimotoSentLoss
  ```bibtex
- @online{emapco-chem-mrl-tanimotosentloss,
  title={TanimotoSentLoss: Tanimoto Loss for SMILES Embeddings},
  author={Emmanuel Cortes},
  year={2025},
  month={Jan},
- url={https://github.com/emapco/chem-mrl/blob/main/chem_mrl/losses/TanimotoLoss.py},
  }
  ```
 
@@ -507,4 +508,4 @@ print(embeddings.shape)
 
  ## Model Card Contact
 
- Manny Cortes (manny@derifyai.com)
 
  ---
+ license: apache-2.0
  tags:
  - sentence-transformers
  - molecular-similarity
 
  metrics:
  - spearman
  model-index:
+ - name: 'ChemMRL: SMILES Matryoshka Representation Learning Embedding Transformer'
  results:
  - task:
  type: semantic-similarity
 
  - type: spearman
  value: 0.9932120589500998
  name: Spearman
+ new_version: Derify/ChemMRL
  ---
 
+ # ChemMRL: SMILES Matryoshka Representation Learning Embedding Transformer
 
  This is a [Chem-MRL](https://github.com/emapco/chem-mrl) ([sentence-transformers](https://www.SBERT.net)) model finetuned from [Derify/ChemBERTa-druglike](https://huggingface.co/Derify/ChemBERTa-druglike) on the [pubchem_10m_genmol_similarity](https://huggingface.co/datasets/Derify/pubchem_10m_genmol_similarity) dataset. It maps SMILES to a 1024-dimensional dense vector space and can be used for molecular similarity, semantic search, database indexing, molecular classification, clustering, and more.
 
 
  - **Similarity Function:** Tanimoto
  - **Training Dataset:**
  - [pubchem_10m_genmol_similarity](https://huggingface.co/datasets/Derify/pubchem_10m_genmol_similarity)
+ - **License:** apache-2.0
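
Reviewer note: the card lists Tanimoto as the similarity function and describes the model as a Matryoshka embedding model. The sketch below illustrates both ideas in plain Python. It assumes the continuous (real-valued) form of the Tanimoto coefficient, `a·b / (|a|² + |b|² − a·b)`, and that Matryoshka-style embeddings can be shortened by keeping a prefix of the vector. The helper names `tanimoto_similarity` and `truncate` and the toy vectors are hypothetical and are not part of the `chem_mrl` API.

```python
def tanimoto_similarity(a, b):
    """Continuous Tanimoto coefficient: a.b / (|a|^2 + |b|^2 - a.b)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b) - dot
    # The denominator is zero only when both vectors are all zeros;
    # treat that case as identical by convention (an assumption of this sketch).
    return 1.0 if denom == 0 else dot / denom


def truncate(vec, dim):
    """Keep only the first `dim` components of an embedding."""
    return vec[:dim]


# Toy 4-dimensional vectors standing in for real 1024-dimensional embeddings.
emb_a = [0.5, 0.1, 0.3, 0.2]
emb_b = [0.4, 0.2, 0.1, 0.6]

full = tanimoto_similarity(emb_a, emb_b)
short = tanimoto_similarity(truncate(emb_a, 2), truncate(emb_b, 2))
print(f"full: {full:.4f}, truncated to 2 dims: {short:.4f}")
```

In practice the embeddings would come from the model calls shown in the README's usage snippet rather than from hand-written lists.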
 
  ### Model Sources
 
 
  from chem_mrl import ChemMRL
 
  # Download from the 🤗 Hub
+ model = ChemMRL("Derify/ChemMRL-beta")
  # Run inference
  sentences = [
  "Clc1nccc(C#CCCc2nc3ccccc3o2)n1",
  "O=Cc1nc2ccccc2o1",
  "O[C@H]1CN(C(Cc2ccccc2)c2ccccc2)C[C@@H]1Cc1cnc[nH]1",
  ]
+ embeddings = model.backbone.encode(sentences)
  print(embeddings.shape)
  # [3, 1024]
 
  # Get the similarity scores for the embeddings
+ similarities = model.backbone.similarity(embeddings, embeddings)
  print(similarities)
  # tensor([[1.0000, 0.3200, 0.1209],
  # [0.3200, 1.0000, 0.0950],
  # [0.1209, 0.0950, 1.0000]])
 
  # Load the model with half precision
+ model = ChemMRL("Derify/ChemMRL-beta", use_half_precision=True)
  sentences = [
  "Clc1nccc(C#CCCc2nc3ccccc3o2)n1",
  "O=Cc1nc2ccccc2o1",
  "O[C@H]1CN(C(Cc2ccccc2)c2ccccc2)C[C@@H]1Cc1cnc[nH]1",
  ]
+ embeddings = model.embed(sentences) # Use the embed method for half precision
  print(embeddings.shape)
  # [3, 1024]
  ```
 
  }
  ```
 
+ | Split | Metric | Value |
+ | :------------- | :----------- | :----------- |
  | **validation** | **spearman** | **0.993212** |
+ | **test** | **spearman** | **0.993243** |
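
Reviewer note: the table reports Spearman correlation on the validation and test splits. As a rough illustration of what that metric measures, the sketch below computes Spearman's rank correlation as the Pearson correlation of the rank vectors, using only the standard library. The `labels` and `predicted` values are made up for illustration and are not taken from the dataset. The function names are hypothetical, and the sketch assumes neither input is constant.

```python
def ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of equal values that starts at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation applied to the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical reference similarities and model-predicted similarities.
labels = [0.10, 0.35, 0.57, 0.80, 1.00]
predicted = [0.12, 0.30, 0.61, 0.75, 0.98]
print(spearman(labels, predicted))
```

Because the metric depends only on ranks, it rewards a model for ordering molecule pairs correctly even when the predicted similarity values are not calibrated to the labels.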
 
  ## Training Details
 
 
  | | smiles_a | smiles_b | label |
  | :------ | :---------------------------------------------------------------------------------- | :---------------------------------------------------------------------------------- | :-------------------------------------------------------------- |
  | type | string | string | float |
+ | details | <ul><li>min: 17 tokens</li><li>mean: 39.66 tokens</li><li>max: 119 tokens</li></ul> | <ul><li>min: 11 tokens</li><li>mean: 38.29 tokens</li><li>max: 115 tokens</li></ul> | <ul><li>min: 0.02</li><li>mean: 0.57</li><li>max: 1.0</li></ul> | | <code>0.7123287916183472</code> |
  * Loss: [<code>Matryoshka2dLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#matryoshka2dloss) with these parameters:
  <details><summary>Click to expand</summary>
 
 
 
  #### TanimotoSentLoss
  ```bibtex
+ @online{cortes-2025-tanimotosentloss,
  title={TanimotoSentLoss: Tanimoto Loss for SMILES Embeddings},
  author={Emmanuel Cortes},
  year={2025},
  month={Jan},
+ url={https://github.com/emapco/chem-mrl},
  }
  ```
 
 
 
  ## Model Card Contact
 
+ Manny Cortes (manny@derifyai.com)