Update README.md
Trained on ~2.7M valid SMILES and SELFIES built and curated from the ChEMBL34 (Zdrazil _et al._ 2023), COCONUTDB (Sorokina _et al._ 2021), and Supernatural3 (Gallo _et al._ 2023) datasets; the resulting 76K n-grams were pruned to **1,238 tokens**, including backbone/tail motifs and special tokens.
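The exact selection and pruning criteria used here are not spelled out in this excerpt, so the following is only a frequency-based sketch of how a candidate n-gram pool can be pruned to a fixed-size vocabulary; `build_ngram_vocab` is a hypothetical helper, not this project's code:

```python
from collections import Counter

def build_ngram_vocab(smiles_corpus, max_n=4, vocab_size=16):
    """Count every character n-gram (1..max_n) in the corpus and keep
    the most frequent ones, reserving slots for special tokens."""
    counts = Counter()
    for smi in smiles_corpus:
        for n in range(1, max_n + 1):
            for i in range(len(smi) - n + 1):
                counts[smi[i:i + n]] += 1
    specials = ["<pad>", "<unk>", "<bos>", "<eos>"]
    kept = [tok for tok, _ in counts.most_common(vocab_size - len(specials))]
    return specials + kept

vocab = build_ngram_vocab(["CCO", "CCN", "c1ccccc1", "CC(=O)O"])
print(len(vocab))  # 16 tokens: 4 specials + 12 frequent n-grams
```

The real pipeline applies the same idea at scale: many candidate n-grams in, a small high-coverage vocabulary out.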

The "comb_smi.csv" dataset can be downloaded [here](https://huggingface.co/datasets/gbyuvd/bioactives-naturals-smiles-molgen).

## ⚡ Performance Highlights

For code and tutorial check this [github project](https://github.com/gbyuvd/Fast

| **Avg sequence length** | **21.49 tokens** | 41.99 tokens | 50.57 tokens |
| **Throughput** | **12,448/sec** | 6,326/sec | 10,658/sec |
| **Peak memory usage** | **17.08 MB** | 259.45 MB | 387.43 MB |
| **UNK token rate** | **0.0000%** | 0.0000% | ~0.0000% (non-zero) |
| **1000 encodes (benchmark)** | **0.0029s** | 1.6598s | 0.5491s |

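The headline ratios can be re-derived from the table above. Note that the table itself does not state how the "~19x" memory figure was computed; taking the mean of the two peak-memory ratios reproduces it, but that reading is an assumption:

```python
# Figures taken from the benchmark table above
throughput = {"this": 12448, "chemberta": 6326, "gen_mlm": 10658}
peak_mem_mb = {"this": 17.08, "chemberta": 259.45, "gen_mlm": 387.43}

# Throughput speedup over ChemBERTa
print(f'{throughput["this"] / throughput["chemberta"]:.2f}x')  # 1.97x

# Mean of the two peak-memory ratios (one assumed reading of "~19x")
mem_ratios = [peak_mem_mb[k] / peak_mem_mb["this"] for k in ("chemberta", "gen_mlm")]
print(round(sum(mem_ratios) / 2))  # 19
```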
✅ **1.97x faster** than ChemBERTa
✅ **1.50x faster** than gen-mlm-cismi-bert
✅ **~19x memory saving** compared to both of the above tokenizers
✅ **No indexing errors** (avoids >512 token sequences)
✅ **Zero unknown tokens** on validation set

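Shorter sequences and a zero-UNK rate are what a motif-level vocabulary buys: multi-character tokens cover an input in fewer pieces, and the fallback token is only needed when nothing in the vocabulary matches. Whether this tokenizer actually uses greedy longest-match is not stated in this excerpt, so treat the sketch below as purely illustrative:

```python
def encode(text, vocab, max_len=8):
    """Greedy longest-match: at each position take the longest vocab
    entry that matches; fall back to <unk> for a single character."""
    tokens, i = [], 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + n] in vocab:
                tokens.append(text[i:i + n])
                i += n
                break
        else:  # no n-gram matched at this position
            tokens.append("<unk>")
            i += 1
    return tokens

vocab = {"CC", "C", "O", "N", "(=O)", "c1ccccc1"}
print(encode("CC(=O)O", vocab))  # ['CC', '(=O)', 'O'] -> 3 tokens, no <unk>
print(encode("CC[Se]", vocab))   # unseen characters each fall back to <unk>
```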
This project is an ongoing **experiment** – all contributions are welcome!

## ⏳ On-going

- [>] Redo evaluation with proper metrics and CI
- [>] Validation on VAE and Causal LM Transformer
- [x] Finish vocab construction on SELFIES
- [ ] Write technical report on methods, results

## 📜 License

Apache 2.0

doi = {10.1093/nar/gkac1008}
}
```