Update README.md
Trained on ~2.7M valid SMILES and SELFIES built and curated from the ChEMBL34 (Zdrazil _et al._ 2023), COCONUTDB (Sorokina _et al._ 2021), and Supernatural3 (Gallo _et al._ 2023) datasets; the resulting 76K n-grams were pruned to **1,238 tokens**, including backbone/tail motifs and special tokens.
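The exact selection and pruning criteria used here are not spelled out in this excerpt, so the following is only a frequency-based sketch of how a candidate n-gram pool can be pruned to a fixed-size vocabulary; `build_ngram_vocab` is a hypothetical helper, not this project's code:

```python
from collections import Counter

def build_ngram_vocab(smiles_corpus, max_n=4, vocab_size=16):
    """Count every character n-gram (1..max_n) in the corpus and keep
    the most frequent ones, reserving slots for special tokens."""
    counts = Counter()
    for smi in smiles_corpus:
        for n in range(1, max_n + 1):
            for i in range(len(smi) - n + 1):
                counts[smi[i:i + n]] += 1
    specials = ["<pad>", "<unk>", "<bos>", "<eos>"]
    kept = [tok for tok, _ in counts.most_common(vocab_size - len(specials))]
    return specials + kept

vocab = build_ngram_vocab(["CCO", "CCN", "c1ccccc1", "CC(=O)O"])
print(len(vocab))  # 16 tokens: 4 specials + 12 frequent n-grams
```

The real pipeline applies the same idea at scale: many candidate n-grams in, a small high-coverage vocabulary out.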

The "comb_smi.csv" dataset can be downloaded [here](https://huggingface.co/datasets/gbyuvd/bioactives-naturals-smiles-molgen).

## ⚡ Performance Highlights

For code and tutorial check this [github project](https://github.com/gbyuvd/Fast

| **Avg sequence length** | **21.49 tokens** | 41.99 tokens | 50.57 tokens |
| **Throughput** | **12,448/sec** | 6,326/sec | 10,658/sec |
| **Peak memory usage** | **17.08 MB** | 259.45 MB | 387.43 MB |
| **UNK token rate** | **0.0000%** | 0.0000% | ~0.0000% (non-zero) |
| **1000 encodes (benchmark)** | **0.0029s** | 1.6598s | 0.5491s |

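The headline ratios can be re-derived from the table above. Note that the table itself does not state how the "~19x" memory figure was computed; taking the mean of the two peak-memory ratios reproduces it, but that reading is an assumption:

```python
# Figures taken from the benchmark table above
throughput = {"this": 12448, "chemberta": 6326, "gen_mlm": 10658}
peak_mem_mb = {"this": 17.08, "chemberta": 259.45, "gen_mlm": 387.43}

# Throughput speedup over ChemBERTa
print(f'{throughput["this"] / throughput["chemberta"]:.2f}x')  # 1.97x

# Mean of the two peak-memory ratios (one assumed reading of "~19x")
mem_ratios = [peak_mem_mb[k] / peak_mem_mb["this"] for k in ("chemberta", "gen_mlm")]
print(round(sum(mem_ratios) / 2))  # 19
```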
✅ **1.97x faster** than ChemBERTa
✅ **1.50x faster** than gen-mlm-cismi-bert
✅ **~19x memory saving** compared to both of the above tokenizers
✅ **No indexing errors** (avoids >512 token sequences)
✅ **Zero unknown tokens** on validation set

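Shorter sequences and a zero-UNK rate are what a motif-level vocabulary buys: multi-character tokens cover an input in fewer pieces, and the fallback token is only needed when nothing in the vocabulary matches. Whether this tokenizer actually uses greedy longest-match is not stated in this excerpt, so treat the sketch below as purely illustrative:

```python
def encode(text, vocab, max_len=8):
    """Greedy longest-match: at each position take the longest vocab
    entry that matches; fall back to <unk> for a single character."""
    tokens, i = [], 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + n] in vocab:
                tokens.append(text[i:i + n])
                i += n
                break
        else:  # no n-gram matched at this position
            tokens.append("<unk>")
            i += 1
    return tokens

vocab = {"CC", "C", "O", "N", "(=O)", "c1ccccc1"}
print(encode("CC(=O)O", vocab))  # ['CC', '(=O)', 'O'] -> 3 tokens, no <unk>
print(encode("CC[Se]", vocab))   # unseen characters each fall back to <unk>
```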
This project is an ongoing **experiment** – all contributions are welcome!

## ⏳ On-going

- [>] Redo evaluation with proper metrics and CI
- [>] Validation on VAE and Causal LM Transformer
- [x] Finish vocab construction on SELFIES
- [ ] Write technical report on methods, results

## 📜 License

Apache 2.0

doi = {10.1093/nar/gkac1008}
}
```