Update README.md

README.md (CHANGED)

@@ -8,6 +8,7 @@ tags:
- tokenizer
---

+
# 🧪 FastChemTokenizer — A High-Performance SMILES Tokenizer built via Info-Theoretic Motif Mining

> **Optimized for chemical language modeling. 2x faster, 50% shorter sequences, minimal memory. Built with entropy-guided n-gram selection.**
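
The tagline's "entropy-guided n-gram selection" is not spelled out in this diff. As a rough illustration only — the scoring rule, function names, and thresholds below are invented, not FastChemTokenizer's actual mining code — an entropy-guided motif miner can rank candidate n-grams by frequency times branching entropy and keep the top scorers as vocabulary entries:

```python
# Hypothetical sketch of entropy-guided n-gram mining over SMILES strings.
# Not the repo's algorithm: the frequency-times-entropy score is illustrative.
from collections import Counter, defaultdict
import math

def mine_motifs(smiles_corpus, max_n=4, top_k=500):
    """Rank character n-grams by count x entropy of the following character."""
    ngram_counts = Counter()
    successors = defaultdict(Counter)  # n-gram -> distribution of next char
    for smi in smiles_corpus:
        for n in range(2, max_n + 1):
            for i in range(len(smi) - n):
                gram = smi[i:i + n]
                ngram_counts[gram] += 1
                successors[gram][smi[i + n]] += 1
    scores = {}
    for gram, count in ngram_counts.items():
        total = sum(successors[gram].values())
        # High successor entropy suggests the n-gram ends at a natural boundary.
        entropy = -sum((c / total) * math.log2(c / total)
                       for c in successors[gram].values())
        scores[gram] = count * entropy
    return [g for g, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]]

print(mine_motifs(["CCO", "c1ccccc1O", "CC(=O)Oc1ccccc1C(=O)O"], top_k=5))
```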

@@ -26,10 +27,10 @@ The "comb_smi.csv" dataset can be downloaded [here](https://huggingface.co/datas

#### SMILES
| Metric | FastChemTokenizer | [ChemBERTa](https://huggingface.co/seyonec/ChemBERTa-zinc-base-v1/) Tokenizer | [gen-mlm-cismi-bert](https://huggingface.co/smostafanejad/gen-mlm-cismi-bert-wordpiece) |
|--------------------------------|-------------------|----------------------|---------------------|
-| **Avg time per SMILES** | **0.…
-| **Avg sequence length** | **21.…
-| **Throughput** | **…
-| **Peak memory usage** | **…
+| **Avg time per SMILES** | **0.0692 ± 0.0038 ms** | 0.1279 ± 0.0090 ms | 0.1029 ± 0.0038 ms |
+| **Avg sequence length** | **21.61 ± 0.70 tokens** | 42.23 ± 1.55 tokens | 50.86 ± 1.90 tokens |
+| **Throughput** | **14,448/sec** | 7,817/sec | 9,720/sec |
+| **Peak memory usage** | **12.92 MB** | 258.00 MB | 387.73 MB |
| **UNK token rate** | **0.0000%** | 0.0000% | ~0.0000% (non-zero) |
| **1000 encodes (benchmark)** | **0.0029s** | 1.6598s | 0.5491s |
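
The timing and memory rows above are the kind of numbers a small `time.perf_counter` + `tracemalloc` harness produces. A minimal sketch, assuming only a callable that maps a SMILES string to token ids (the `encode` argument and corpus variable are stand-ins, not the repo's benchmark script):

```python
# Hypothetical benchmark harness; the repo's actual measurement code may differ.
import time
import tracemalloc
from statistics import mean, stdev

def benchmark(encode, smiles_list, repeats=5):
    """Print avg encode time, avg sequence length, throughput, peak memory."""
    times, lengths = [], []
    tracemalloc.start()
    for _ in range(repeats):
        t0 = time.perf_counter()
        for smi in smiles_list:
            lengths.append(len(encode(smi)))
        times.append((time.perf_counter() - t0) / len(smiles_list))
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    print(f"Avg time per SMILES: {mean(times)*1e3:.4f} ± {stdev(times)*1e3:.4f} ms")
    print(f"Avg sequence length: {mean(lengths):.2f} tokens")
    print(f"Throughput: {1.0 / mean(times):,.0f}/sec")
    print(f"Peak memory usage: {peak / 2**20:.2f} MB")

# e.g. benchmark(lambda s: tokenizer.encode(s), smiles_from_comb_smi_csv)
```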

@@ -46,14 +47,15 @@ Core's vocab length = 781 (after pruning)
```
| Metric | FastChemTokenizer-WTails | FastChemTokenizer-Core | [opti-chemfie-experiment-1](https://huggingface.co/gbyuvd/bionat-selfies-gen-tokenizer-wordlevel) |
|--------------------------------|-------------------|----------------------|---------------------|
-| **Avg time per SMILES** | 0.…
-| **Avg sequence length** | **20.…
-| **Throughput** | …
-| **Peak memory usage** | **…
+| **Avg time per SMILES** | 0.1882 ± 0.0140 ms | 0.1674 ± 0.0093 ms | **0.1157 ± 0.0095 ms** |
+| **Avg sequence length** | **20.46 ± 1.21 tokens** | 33.41 ± 1.80 tokens | 54.29 ± 3.08 tokens |
+| **Throughput** | 5,313/sec | 5,973/sec | **8,642/sec** |
+| **Peak memory usage** | **9.32 MB** | 20.16 MB | 490.13 MB |
| **UNK token rate** | **0.0000%** | 0.0000% | 0.0000% |
| **1000 encodes (benchmark)** | **0.0081s** | 2.9020s | 2.9020s |

✅ Even though it is 1.32x slower, it produces **2.65x fewer tokens**
+- this slowdown could be related to searching across the many whitespace separators in the formatted SELFIES strings (see the sketch below)
✅ **~61x memory saving with tails** and **~25x** with core
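
To see why scanning whitespace-separated SELFIES symbols for multi-symbol motifs costs more than a plain word-level split, here is a toy sketch — the `MOTIFS` table and greedy merge loop are invented for illustration and are not FastChemTokenizer's implementation:

```python
# Hypothetical illustration: a word-level tokenizer stops at str.split(),
# while a motif tokenizer must also try multi-symbol merges at each position.
MOTIFS = {("[C]", "[C]"), ("[C]", "[=C]"), ("[C]", "[C]", "[O]")}  # invented

def merge_motifs(selfies_str, max_len=3):
    symbols = selfies_str.split()  # whitespace-formatted SELFIES symbols
    out, i = [], 0
    while i < len(symbols):
        # Greedy longest match: try the longest candidate motif first.
        for n in range(min(max_len, len(symbols) - i), 1, -1):
            if tuple(symbols[i:i + n]) in MOTIFS:
                out.append("".join(symbols[i:i + n]))
                i += n
                break
        else:
            out.append(symbols[i])
            i += 1
    return out

print(merge_motifs("[C] [C] [O] [=C]"))  # -> ['[C][C][O]', '[=C]']
```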
## 🧩 Vocabulary (SMILES)

@@ -150,7 +152,7 @@ This project is an ongoing **experiment** — all contributions are welcome!
>

## ✍️ On-going
+- [x] Redo evaluation with proper metrics and CI
- [>] Validation on VAE and Causal LM Transformer
- [x] Finish vocab construction on SELFIES
- [ ] Write technical report on methods, results

@@ -216,3 +218,16 @@ Apache 2.0
}
```

+
+
+
+
+
+
+
+
+
+
+
+
+