gbyuvd committed on
Commit
9e83e7f
·
verified ·
1 Parent(s): d49570e

Add tutorial and ack

Files changed (1)
  1. README.md +72 -23
README.md CHANGED
@@ -124,30 +124,11 @@ New version of this model is now available [here](https://huggingface.co/gbyuvd/c
124
  ### Disclaimer: For Academic Purposes Only
125
  The information and model provided are for academic purposes only. They are intended for educational and research use and should not be used for any commercial or legal purposes. The author does not guarantee the accuracy, completeness, or reliability of the information.
126
 
127
- ## Model Details
128
-
129
- ### Model Description
130
- - **Model Type:** Sentence Transformer
131
- - **Base model:** [MiniLM-L6-H384-uncased](https://huggingface.co/nreimers/MiniLM-L6-H384-uncased)
132
- - **Maximum Sequence Length:** 512 tokens
133
- - **Output Dimensionality:** 768 dimensions
134
- - **Similarity Function:** Cosine Similarity
135
- - **Training Dataset:** SELFIES pairs generated from COCONUTDB
136
- - **Language:** SELFIES
137
- - **License:** CC BY-NC 4.0
138
-
139
-
140
- ### Full Model Architecture
141
-
142
- ```
143
- SentenceTransformer(
144
- (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
145
- (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': True, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': False})
146
- )
147
- ```
148
-
149
  ## Usage
150
 
151
  ### Direct Usage (Sentence Transformers)
152
 
153
  First install the Sentence Transformers library:
@@ -178,6 +159,29 @@ print(similarities.shape)
178
  # [3, 3]
179
  ```
180
 
181
  ## Dataset
182
 
183
  | Dataset | Reference | Number of Pairs |
@@ -226,6 +230,51 @@ The plot below shows how the model's embeddings (at this stage) cluster differen
226
  - Datasets: 2.20.0
227
  - Tokenizers: 0.19.1
228
 
229
  ## Contact
230
 
231
- G Bayu (gbyuvd@proton.me)
 
 
 
124
  ### Disclaimer: For Academic Purposes Only
125
  The information and model provided are for academic purposes only. They are intended for educational and research use and should not be used for any commercial or legal purposes. The author does not guarantee the accuracy, completeness, or reliability of the information.
126
 
127
  ## Usage
128
 
129
+ ### [Tutorial: Using the Model on a Large Dataset](https://github.com/gbyuvd/chemembed-faiss-demo)
130
+ [See this Jupyter Notebook tutorial](https://github.com/gbyuvd/chemembed-faiss-demo) for using this model to embed ~407K natural product molecules represented as SELFIES (from [COCONUTDB](https://coconut.naturalproducts.net/)) and using [Meta's FAISS](https://github.com/facebookresearch/faiss) to index the embeddings and run fast searches for structurally similar compounds based on one or more query molecules.
131
+
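The tutorial paragraph above describes indexing embeddings and searching them for structurally similar compounds. As a rough illustration of the underlying idea only, the sketch below runs an exact cosine-similarity top-k search in plain Python. It does not use FAISS and is not the notebook's code: the toy 3-dimensional vectors and the helper names `l2_normalize` and `top_k_similar` are assumptions made for this example. In practice the vectors would be the model's embeddings, and an index library such as FAISS would replace the loop.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def top_k_similar(query, corpus, k=2):
    """Return (index, cosine similarity) pairs for the k corpus vectors closest to query."""
    q = l2_normalize(query)
    scored = []
    for idx, vec in enumerate(corpus):
        v = l2_normalize(vec)
        scored.append((idx, sum(a * b for a, b in zip(q, v))))
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional vectors standing in for real molecule embeddings (assumption).
corpus = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
]
hits = top_k_similar([1.0, 0.05, 0.0], corpus, k=2)
```

Here `hits` holds the two nearest corpus entries by cosine similarity, ordered from most to least similar.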
132
  ### Direct Usage (Sentence Transformers)
133
 
134
  First install the Sentence Transformers library:
 
159
  # [3, 3]
160
  ```
161
 
162
+
163
+ ## Model Details
164
+
165
+ ### Model Description
166
+ - **Model Type:** Sentence Transformer
167
+ - **Base model:** [MiniLM-L6-H384-uncased](https://huggingface.co/nreimers/MiniLM-L6-H384-uncased)
168
+ - **Maximum Sequence Length:** 512 tokens
169
+ - **Output Dimensionality:** 768 dimensions
170
+ - **Similarity Function:** Cosine Similarity
171
+ - **Training Dataset:** SELFIES pairs generated from COCONUTDB
172
+ - **Language:** SELFIES
173
+ - **License:** CC BY-NC 4.0
174
+
175
+
176
+ ### Full Model Architecture
177
+
178
+ ```
179
+ SentenceTransformer(
180
+ (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
181
+ (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': True, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': False})
182
+ )
183
+ ```
184
+
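The pooling configuration above enables both mean and max pooling over 384-dimensional token embeddings, which is consistent with a 768-dimensional sentence embedding (384 + 384). The sketch below shows that concatenation on toy data in plain Python. It is a simplified illustration, not the library's implementation: the function name `mean_max_pool` is hypothetical, padding masks are ignored, and the concatenation order shown here may differ from the one Sentence Transformers uses.

```python
def mean_max_pool(token_embeddings):
    """Concatenate the per-dimension mean and max over all tokens."""
    n_tokens = len(token_embeddings)
    # Transpose so each tuple holds one dimension's values across tokens.
    columns = list(zip(*token_embeddings))
    mean_part = [sum(col) / n_tokens for col in columns]
    max_part = [max(col) for col in columns]
    return mean_part + max_part


# Toy input: 3 tokens with 4-dimensional embeddings (the real model uses 384).
tokens = [
    [1.0, 2.0, 3.0, 4.0],
    [3.0, 0.0, 1.0, 4.0],
    [2.0, 4.0, 2.0, 1.0],
]
sentence_embedding = mean_max_pool(tokens)
```

The result has twice the token embedding width, so 4-dimensional tokens give an 8-dimensional vector, just as 384-dimensional tokens would give 768.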
185
  ## Dataset
186
 
187
  | Dataset | Reference | Number of Pairs |
 
230
  - Datasets: 2.20.0
231
  - Tokenizers: 0.19.1
232
 
233
+ ## Acknowledgments
234
+
235
+ #### Sentence Transformers
236
+ ```bibtex
237
+ @inproceedings{reimers-2019-sentence-bert,
238
+ title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
239
+ author = "Reimers, Nils and Gurevych, Iryna",
240
+ booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
241
+ month = "11",
242
+ year = "2019",
243
+ publisher = "Association for Computational Linguistics",
244
+ url = "https://arxiv.org/abs/1908.10084",
245
+ }
246
+ ```
247
+
248
+ #### COCONUTDB
249
+ ```bibtex
250
+ @article{sorokina2021coconut,
251
+ title={COCONUT online: Collection of Open Natural Products database},
252
+ author={Sorokina, Maria and Merseburger, Peter and Rajan, Kohulan and Yirik, Mehmet Aziz and Steinbeck, Christoph},
253
+ journal={Journal of Cheminformatics},
254
+ volume={13},
255
+ number={1},
256
+ pages={2},
257
+ year={2021},
258
+ doi={10.1186/s13321-020-00478-9}
259
+ }
260
+ ```
261
+
262
+ #### SELFIES
263
+ - For more information on SELFIES, see this [blog post](https://aspuru.substack.com/p/molecular-graph-representations-and) or the [SELFIES GitHub repository](https://github.com/aspuru-guzik-group/selfies).
264
+ ```bibtex
265
+ @article{krenn2020selfies,
266
+ title={Self-referencing embedded strings (SELFIES): A 100\% robust molecular string representation},
267
+ author={Krenn, Mario and H{\"a}se, Florian and Nigam, AkshatKumar and Friederich, Pascal and Aspuru-Guzik, Alan},
268
+ journal={Machine Learning: Science and Technology},
269
+ volume={1},
270
+ number={4},
271
+ pages={045024},
272
+ year={2020},
273
+ doi={10.1088/2632-2153/aba947}
274
+ }
275
+ ```
276
  ## Contact
277
 
278
+ G Bayu (gbyuvd@proton.me)
279
+
280
+ [![ko-fi](https://ko-fi.com/img/githubbutton_sm.svg)](https://ko-fi.com/O4O710GFBZ)