gbyuvd committed on
Commit 0253e7c · verified · 1 parent: 097a367

Update README.md

Files changed (1):
  1. README.md (+6 -4)
README.md CHANGED
@@ -19,7 +19,7 @@ tags:
 
 Trained on ~2.7M valid SMILES and SELFIES built and curated from the ChEMBL34 (Zdrazil _et al._ 2023), COCONUTDB (Sorokina _et al._ 2021), and Supernatural3 (Gallo _et al._ 2023) datasets; the resulting 76K n-grams were pruned to **1,238 tokens**, including backbone/tail motifs and special tokens.
 
-For code and tutorial check this [github project](https://github.com/gbyuvd/FastChemTokenizer)
+The "comb_smi.csv" dataset can be downloaded [here](https://huggingface.co/datasets/gbyuvd/bioactives-naturals-smiles-molgen).
 
 ## ⚡ Performance Highlights
 
@@ -30,12 +30,12 @@ For code and tutorial check this [github project](https://github.com/gbyuvd/Fast
 | **Avg sequence length** | **21.49 tokens** | 41.99 tokens | 50.57 tokens |
 | **Throughput** | **12,448/sec** | 6,326/sec | 10,658/sec |
 | **Peak memory usage** | **17.08 MB** | 259.45 MB | 387.43 MB |
-| **UNK token rate** | **0.0000%** | 0.0000% | ~0.0000% (non-zero) |
+| **UNK token rate** | **0.0000%** | 0.0000% | ~0.0000% (non-zero) |
 | **1000 encodes (benchmark)** | **0.0029s** | 1.6598s | 0.5491s |
 
 ✅ **1.97x faster** than ChemBERTa
 ✅ **1.50x faster** than gen-mlm-cismi-bert
-✅ **~19x memory saving** compared to both of the above tokenizer
+✅ **~19x memory saving** compared to both of the above tokenizers
 ✅ **No indexing errors** (avoids >512 token sequences)
 ✅ **Zero unknown tokens** on validation set
 
@@ -150,8 +150,9 @@ This project is an ongoing **experiment** — all contributions are welcome!
 >
 
 ## ✍️ On-going
+- [>] Redo evaluation with proper metrics and CI
 - [>] Validation on VAE and Causal LM Transformer
-- [>] Finish vocab construction on SELFIES
+- [x] Finish vocab construction on SELFIES
 - [ ] Write technical report on methods, results
 
 ## 📄 License
@@ -214,3 +215,4 @@ Apache 2.0
 doi = {10.1093/nar/gkac1008}
 }
 ```
+
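
The README above describes pruning ~76K candidate n-grams down to a 1,238-token vocabulary and reports a zero UNK rate on validation. The project's actual construction code lives in the linked FastChemTokenizer repo; the sketch below only illustrates the general idea of frequency-based n-gram pruning plus greedy longest-match encoding, at toy scale. All names (`SPECIALS`, `build_pruned_vocab`, `encode_greedy`) and parameters are hypothetical, not the project's API.

```python
from collections import Counter

SPECIALS = ("<pad>", "<unk>", "<bos>", "<eos>")  # hypothetical special tokens

def build_pruned_vocab(smiles_corpus, max_n=4, vocab_size=16):
    """Count character n-grams (1..max_n) over the corpus, then keep the most
    frequent ones until the vocabulary reaches `vocab_size`. Toy stand-in for
    the README's 76K-n-gram -> 1,238-token pruning, not the real method."""
    counts = Counter()
    for smi in smiles_corpus:
        for n in range(1, max_n + 1):
            counts.update(smi[i:i + n] for i in range(len(smi) - n + 1))
    kept = [tok for tok, _ in counts.most_common(vocab_size - len(SPECIALS))]
    return list(SPECIALS) + kept

def encode_greedy(text, vocab):
    """Greedy longest-match tokenization; falls back to <unk> per character."""
    vocab_set, max_len = set(vocab), max(len(t) for t in vocab)
    tokens, i = [], 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + n] in vocab_set:
                tokens.append(text[i:i + n])
                i += n
                break
        else:  # no n-gram matched, not even a single character
            tokens.append("<unk>")
            i += 1
    return tokens

vocab = build_pruned_vocab(["CCO", "CC(=O)O", "c1ccccc1"])
tokens = encode_greedy("c1ccccc1", vocab)
print(len(vocab), tokens)
```

Because frequent multi-character motifs (here `cc`, `cccc`, `c1`) survive pruning, common substructures encode as single tokens, which is the same mechanism behind the short average sequence lengths reported in the table.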