Update README.md

README.md CHANGED

@@ -1,13 +1,10 @@
 ---
-language:
-- en
 tags:
 - sentence-transformers
+- molecular-similarity
 - feature-extraction
 - dense
 - generated_from_trainer
-- dataset_size:19692766
 - loss:Matryoshka2dLoss
 - loss:MatryoshkaLoss
 - loss:TanimotoSentLoss
@@ -36,7 +33,8 @@ widget:
 - source_sentence: Clc1nccc(C#CCCc2nc3ccccc3o2)n1
   sentences:
   - O=Cc1nc2ccccc2o1
-  - O=C([O-])COc1ccc(CCCS(=O)(=O)c2ccc(Cl)cc2)cc1NC(=O)c1cccc(C=Cc2nc3ccccc3s2)c1
+  - >-
+    O=C([O-])COc1ccc(CCCS(=O)(=O)c2ccc(Cl)cc2)cc1NC(=O)c1cccc(C=Cc2nc3ccccc3s2)c1
   - O[C@H]1CN(C(Cc2ccccc2)c2ccccc2)C[C@@H]1Cc1cnc[nH]1
 datasets:
 - Derify/pubchem_10m_genmol_similarity
@@ -57,11 +55,12 @@ model-index:
 - type: spearman
   value: 0.9932120589500998
   name: Spearman
+license: apache-2.0
 ---
 
 # SentenceTransformer based on Derify/ChemBERTa-druglike
 
-This is a [Chem-MRL](https://github.com/emapco/chem-mrl) ([sentence-transformers](https://www.SBERT.net)) model finetuned from [Derify/ChemBERTa-druglike](https://huggingface.co/Derify/ChemBERTa-druglike) on the [pubchem_10m_genmol_similarity](https://huggingface.co/datasets/Derify/pubchem_10m_genmol_similarity) dataset. It maps SMILES to a 1024-dimensional dense vector space and can be used for
+This is a [Chem-MRL](https://github.com/emapco/chem-mrl) ([sentence-transformers](https://www.SBERT.net)) model finetuned from [Derify/ChemBERTa-druglike](https://huggingface.co/Derify/ChemBERTa-druglike) on the [pubchem_10m_genmol_similarity](https://huggingface.co/datasets/Derify/pubchem_10m_genmol_similarity) dataset. It maps SMILES to a 1024-dimensional dense vector space and can be used for molecular similarity, semantic search, database indexing, molecular classification, clustering, and more.
 
 ## Model Details
 
@@ -73,7 +72,6 @@ This is a [Chem-MRL](https://github.com/emapco/chem-mrl) ([sentence-transformers
 - **Similarity Function:** Tanimoto
 - **Training Dataset:**
   - [pubchem_10m_genmol_similarity](https://huggingface.co/datasets/Derify/pubchem_10m_genmol_similarity)
-- **Language:** en
 - **License:** [Apache-2.0](https://huggingface.co/Derify/ChemBERTa-druglike/blob/main/LICENSE)
 
 ### Model Sources
@@ -93,12 +91,12 @@ SentenceTransformer(
 
 ## Usage
 
-### Direct Usage (
+### Direct Usage (Chem-MRL)
 
-First install the
+First install the Chem-MRL library:
 
 ```bash
-pip install -U
+pip install -U chem-mrl>=0.7.3
 ```
 
 Then you can load this model and run inference.
@@ -106,23 +104,34 @@
 from chem_mrl import ChemMRL
 
 # Download from the 🤗 Hub
+chem_mrl = ChemMRL("Derify/ChemMRL-beta")
 # Run inference
 sentences = [
     "Clc1nccc(C#CCCc2nc3ccccc3o2)n1",
     "O=Cc1nc2ccccc2o1",
     "O[C@H]1CN(C(Cc2ccccc2)c2ccccc2)C[C@@H]1Cc1cnc[nH]1",
 ]
-embeddings =
+embeddings = chem_mrl.backbone.encode(sentences)
 print(embeddings.shape)
 # [3, 1024]
 
 # Get the similarity scores for the embeddings
-similarities =
+similarities = chem_mrl.backbone.similarity(embeddings, embeddings)
 print(similarities)
-# tensor([[1.0000, 0.
-# [0.
-# [0.
+# tensor([[1.0000, 0.3200, 0.1209],
+#         [0.3200, 1.0000, 0.0950],
+#         [0.1209, 0.0950, 1.0000]])
+
+# Load the model with half precision
+chem_mrl = ChemMRL("Derify/ChemMRL-beta", use_half_precision=True)
+sentences = [
+    "Clc1nccc(C#CCCc2nc3ccccc3o2)n1",
+    "O=Cc1nc2ccccc2o1",
+    "O[C@H]1CN(C(Cc2ccccc2)c2ccccc2)C[C@@H]1Cc1cnc[nH]1",
+]
+embeddings = chem_mrl.embed(sentences)  # Use the embed method for half precision
+print(embeddings.shape)
+# [3, 1024]
 ```
 
 ## Evaluation
@@ -139,21 +148,10 @@ print(similarities)
 }
 ```
 
-| Metric | Value |
-| :----------- | :--------- |
-| **spearman** | **0.
-
-<!--
-## Bias, Risks and Limitations
-
-*What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
--->
-
-<!--
-### Recommendations
-
-*What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
--->
+| Split | Metric | Value |
+| :--------- | :----------- | :--------- |
+| **validation** | **spearman** | **0.993212** |
+| **test** | **spearman** | **0.993243** |
 
 ## Training Details
 
@@ -176,6 +174,8 @@
 | <code>OCCN1CC[NH+](Cc2ccccc2OC2CC2)CC1</code> | <code>OCCN1CC[NH+](Cc2ccccc2On2cccn2)CC1</code> | <code>0.6615384817123413</code> |
 | <code>CC1CN(C(=O)C2CC[NH+](Cc3cccc(C(N)=O)c3)CC2)CC(C)O1</code> | <code>CC1CN(C(=O)C2CC[NH+](Cc3ccccc3)CC2)CC(C)O1</code> | <code>0.7123287916183472</code> |
 * Loss: [<code>Matryoshka2dLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#matryoshka2dloss) with these parameters:
+  <details><summary>Click to expand</summary>
+
 ```json
 {
   "loss": "TanimotoSentLoss",
@@ -207,6 +207,7 @@
   "n_dims_per_step": -1
 }
 ```
+  </details>
 
 ### Training Hyperparameters
 #### Non-Default Hyperparameters
@@ -512,4 +513,4 @@
 
 ## Model Card Contact
 
-Manny Cortes (manny@derifyai.com)
+Manny Cortes (manny@derifyai.com)
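The card names Tanimoto as the model's similarity function (and its training losses include TanimotoSentLoss). As background, the classic Tanimoto coefficient over binary fingerprint bits, and the continuous form commonly used for real-valued embeddings, can be sketched in plain Python. These helpers are illustrative only and are not the chem-mrl implementation:

```python
def tanimoto_sets(a: set, b: set) -> float:
    # Classic Tanimoto over the "on" bits of two binary fingerprints:
    # |A & B| / |A | B|
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def tanimoto_dense(u, v) -> float:
    # Continuous Tanimoto for real-valued vectors:
    # (u . v) / (|u|^2 + |v|^2 - u . v)
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (sum(x * x for x in u) + sum(y * y for y in v) - dot)
```

For identical vectors the continuous form evaluates to 1, matching the diagonal of the similarity matrix printed in the usage snippet above.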
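Because the card lists Matryoshka-style losses (Matryoshka2dLoss/MatryoshkaLoss), the 1024-dimensional embeddings should remain usable when truncated to a shorter prefix. A minimal sketch of that truncate-and-renormalize step, assuming plain Python lists (hypothetical helper, not part of the chem-mrl API):

```python
import math


def truncate_matryoshka(embedding: list, dim: int) -> list:
    # Keep the first `dim` components of a Matryoshka-style embedding
    # and L2-normalize so cosine-style similarities stay comparable.
    head = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head
    return [x / norm for x in head]
```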
|