Add tutorial and ack

README.md CHANGED
@@ -124,30 +124,11 @@ New version of this model is now available [here](https://huggingface.co/gbyuvd/c
 ### Disclaimer: For Academic Purposes Only
 The information and model provided is for academic purposes only. It is intended for educational and research use, and should not be used for any commercial or legal purposes. The author does not guarantee the accuracy, completeness, or reliability of the information.
 
-## Model Details
-
-### Model Description
-- **Model Type:** Sentence Transformer
-- **Base model:** [MiniLM-L6-H384-uncased](https://huggingface.co/nreimers/MiniLM-L6-H384-uncased)
-- **Maximum Sequence Length:** 512 tokens
-- **Output Dimensionality:** 768 dimensions
-- **Similarity Function:** Cosine Similarity
-- **Training Dataset:** SELFIES pairs generated from COCONUTDB
-- **Language:** SELFIES
-- **License:** CC BY-NC 4.0
-
-### Full Model Architecture
-
-```
-SentenceTransformer(
-  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
-  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': True, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': False})
-)
-```
-
 ## Usage
 
+### [Tutorial for using on a Large Dataset](https://github.com/gbyuvd/chemembed-faiss-demo)
+[See this Jupyter Notebook tutorial](https://github.com/gbyuvd/chemembed-faiss-demo) for using the above model to embed ~407K natural product molecular SELFIES representations (from [COCONUTDB](https://coconut.naturalproducts.net/)) and to use [Meta's FAISS](https://github.com/facebookresearch/faiss) to index and run fast searches for structurally similar compounds based on one or more inputs.
+
 ### Direct Usage (Sentence Transformers)
 
 First install the Sentence Transformers library:
@@ -178,6 +159,29 @@ print(similarities.shape)
 # [3, 3]
 ```
 
+## Model Details
+
+### Model Description
+- **Model Type:** Sentence Transformer
+- **Base model:** [MiniLM-L6-H384-uncased](https://huggingface.co/nreimers/MiniLM-L6-H384-uncased)
+- **Maximum Sequence Length:** 512 tokens
+- **Output Dimensionality:** 768 dimensions
+- **Similarity Function:** Cosine Similarity
+- **Training Dataset:** SELFIES pairs generated from COCONUTDB
+- **Language:** SELFIES
+- **License:** CC BY-NC 4.0
+
+### Full Model Architecture
+
+```
+SentenceTransformer(
+  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
+  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': True, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': False})
+)
+```
+
 ## Dataset
 
 | Dataset | Reference | Number of Pairs |
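The Pooling module above enables both mean and max pooling over 384-dimensional token embeddings; Sentence Transformers concatenates the outputs of the enabled pooling modes, which is how a 384-dim base model yields the 768-dim sentence embedding listed in the card. A minimal NumPy sketch of that arithmetic (toy random data, not this model's internals):

```python
import numpy as np

# Toy illustration of the pooling arithmetic: mean- and max-pooling over
# 384-dim token embeddings each give a 384-dim vector, and concatenating
# them yields the 768-dim sentence embedding.
rng = np.random.default_rng(42)
token_embeddings = rng.normal(size=(12, 384))  # 12 tokens, hidden size 384

mean_pooled = token_embeddings.mean(axis=0)    # shape (384,)
max_pooled = token_embeddings.max(axis=0)      # shape (384,)
sentence_embedding = np.concatenate([mean_pooled, max_pooled])
print(sentence_embedding.shape)  # (768,)

# Cosine similarity (the card's similarity function) between two embeddings:
def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(sentence_embedding, sentence_embedding))  # 1.0 for identical vectors
```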
@@ -226,6 +230,51 @@ The plot below shows how the model's embeddings (at this stage) cluster differen
 - Datasets: 2.20.0
 - Tokenizers: 0.19.1
 
+## Acknowledgments
+
+#### Sentence Transformers
+```bibtex
+@inproceedings{reimers-2019-sentence-bert,
+  title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
+  author = "Reimers, Nils and Gurevych, Iryna",
+  booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
+  month = "11",
+  year = "2019",
+  publisher = "Association for Computational Linguistics",
+  url = "https://arxiv.org/abs/1908.10084",
+}
+```
+
+#### COCONUTDB
+```bibtex
+@article{sorokina2021coconut,
+  title={COCONUT online: Collection of Open Natural Products database},
+  author={Sorokina, Maria and Merseburger, Peter and Rajan, Kohulan and Yirik, Mehmet Aziz and Steinbeck, Christoph},
+  journal={Journal of Cheminformatics},
+  volume={13},
+  number={1},
+  pages={2},
+  year={2021},
+  doi={10.1186/s13321-020-00478-9}
+}
+```
+
+#### SELFIES
+- For more information on SELFIES, you can read this [blogpost](https://aspuru.substack.com/p/molecular-graph-representations-and) or check out [their GitHub](https://github.com/aspuru-guzik-group/selfies).
+```bibtex
+@article{krenn2020selfies,
+  title={Self-referencing embedded strings (SELFIES): A 100\% robust molecular string representation},
+  author={Krenn, Mario and H{\"a}se, Florian and Nigam, AkshatKumar and Friederich, Pascal and Aspuru-Guzik, Alan},
+  journal={Machine Learning: Science and Technology},
+  volume={1},
+  number={4},
+  pages={045024},
+  year={2020},
+  doi={10.1088/2632-2153/aba947}
+}
+```
 ## Contact
 
 G Bayu (gbyuvd@proton.me)
+
+[](https://ko-fi.com/O4O710GFBZ)