gbyuvd committed (verified)
Commit 926b879 · 1 Parent(s): 0253e7c

Update README.md

Files changed (1):
  1. README.md +25 -10
README.md CHANGED
@@ -8,6 +8,7 @@ tags:
 - tokenizer
 ---
 
+
 # 🧪 FastChemTokenizer — A High-Performance SMILES Tokenizer built via Info-Theoretic Motif Mining
 
 > **Optimized for chemical language modeling. 2x faster, 50% shorter sequences, minimal memory. Built with entropy-guided n-gram selection.**
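The headline names "entropy-guided n-gram selection" but the diff does not show the algorithm itself. Below is a minimal sketch of one way such info-theoretic motif mining can work: count SMILES n-grams and rank them by a frequency-weighted pointwise mutual information score. `mine_motifs`, its scoring, and all parameters are illustrative assumptions, not the project's actual method.

```python
# Hypothetical sketch of entropy-guided n-gram (motif) mining for SMILES.
# The README only names the technique; this frequency-weighted PMI scoring
# is an illustrative stand-in, not the project's actual algorithm.
import math
from collections import Counter

def mine_motifs(smiles_corpus, max_n=4, top_k=500):
    """Return candidate multi-character tokens ranked by an info-theoretic score."""
    char_counts = Counter()
    ngram_counts = Counter()
    for smi in smiles_corpus:
        char_counts.update(smi)
        for n in range(2, max_n + 1):
            for i in range(len(smi) - n + 1):
                ngram_counts[smi[i:i + n]] += 1

    total_chars = sum(char_counts.values())
    total_ngrams = sum(ngram_counts.values())
    scored = {}
    for gram, count in ngram_counts.items():
        p_gram = count / total_ngrams
        # Probability of the gram if its characters were independent.
        p_indep = math.prod(char_counts[c] / total_chars for c in gram)
        # Frequent, statistically "surprising" motifs score highest.
        scored[gram] = count * math.log(p_gram / p_indep)
    return [g for g, _ in sorted(scored.items(), key=lambda kv: -kv[1])[:top_k]]

print(mine_motifs(["CCO", "c1ccccc1O", "CC(=O)Nc1ccc(O)cc1"], top_k=10))
```

On a real corpus the candidate list would presumably then be pruned before being frozen into the tokenizer vocabulary; the README mentions a core vocab of 781 entries after pruning.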
@@ -26,10 +27,10 @@ The "comb_smi.csv" dataset can be downloaded [here](https://huggingface.co/datas
 #### SMILES
 | Metric | FastChemTokenizer | [ChemBERTa](https://huggingface.co/seyonec/ChemBERTa-zinc-base-v1/) Tokenizer | [gen-mlm-cismi-bert](https://huggingface.co/smostafanejad/gen-mlm-cismi-bert-wordpiece) |
 |--------------------------------|-------------------|----------------------|---------------------|
- | **Avg time per SMILES** | **0.0803 ms** | 0.1581 ms | 0.0938 ms |
- | **Avg sequence length** | **21.49 tokens** | 41.99 tokens | 50.57 tokens |
- | **Throughput** | **12,448/sec** | 6,326/sec | 10,658/sec |
- | **Peak memory usage** | **17.08 MB** | 259.45 MB | 387.43 MB |
+ | **Avg time per SMILES** | **0.0692 ± 0.0038 ms** | 0.1279 ± 0.0090 ms | 0.1029 ± 0.0038 ms |
+ | **Avg sequence length** | **21.61 ± 0.70 tokens** | 42.23 ± 1.55 tokens | 50.86 ± 1.90 tokens |
+ | **Throughput** | **14,448/sec** | 7,817/sec | 9,720/sec |
+ | **Peak memory usage** | **12.92 MB** | 258.00 MB | 387.73 MB |
 | **UNK token rate** | **0.0000%** | 0.0000% | ~0.0000% (non-zero) |
 | **1000 encodes (benchmark)** | **0.0029s** | 1.6598s | 0.5491s |
 
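The metrics in this table (average encode time, sequence length, throughput, peak memory) map directly onto a small timing harness. The sketch below shows how such numbers are commonly collected with `time.perf_counter` and `tracemalloc`; the project's actual benchmark script is not part of this diff, and `tokenizer.encode` is an assumed interface.

```python
# Sketch of a benchmark loop matching the table's metrics. The project's
# actual harness is not shown in this diff; `tokenizer.encode` is assumed
# to take a SMILES string and return a list of token ids.
import time
import tracemalloc

def benchmark(tokenizer, smiles_list):
    tracemalloc.start()
    t0 = time.perf_counter()
    total_tokens = 0
    for smi in smiles_list:
        total_tokens += len(tokenizer.encode(smi))
    elapsed = time.perf_counter() - t0
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return {
        "avg_ms_per_smiles": 1000 * elapsed / len(smiles_list),
        "avg_seq_len": total_tokens / len(smiles_list),
        "throughput_per_sec": len(smiles_list) / elapsed,
        "peak_memory_mb": peak / 2**20,
    }
```

Note that `tracemalloc` measures Python-level allocations only, which is one reason absolute MB figures can differ between harnesses.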
@@ -46,14 +47,15 @@ Core's vocab length = 781 (after pruning)
 ```
 | Metric | FastChemTokenizer-WTails | FastChemTokenizer-Core | [opti-chemfie-experiment-1](https://huggingface.co/gbyuvd/bionat-selfies-gen-tokenizer-wordlevel) |
 |--------------------------------|-------------------|----------------------|---------------------|
- | **Avg time per SMILES** | 0.1548 ms | 0.1700 ms | **0.1170 ms** |
- | **Avg sequence length** | **20.34 tokens** | 33.22 tokens | 53.98 tokens |
- | **Throughput** | 6,461/sec | 5,882/sec | **8,549/sec** |
- | **Peak memory usage** | **7.96 MB** | 19.77 MB | 488.03 MB |
+ | **Avg time per SMILES** | 0.1882 ± 0.0140 ms | 0.1674 ± 0.0093 ms | **0.1157 ± 0.0095 ms** |
+ | **Avg sequence length** | **20.46 ± 1.21 tokens** | 33.41 ± 1.80 tokens | 54.29 ± 3.08 tokens |
+ | **Throughput** | 5,313/sec | 5,973/sec | **8,642/sec** |
+ | **Peak memory usage** | **9.32 MB** | 20.16 MB | 490.13 MB |
 | **UNK token rate** | **0.0000%** | 0.0000% | 0.0000% |
 | **1000 encodes (benchmark)** | **0.0081s** | 2.9020s | 2.9020s |
 
- ✅ Even though 1.32x slower, it produces **2.65x less tokens**
+ ✅ Even though ~1.63x slower, it produces **2.65x fewer tokens**
+   - this slowdown may be related to scanning the many whitespace separators in the formatted SELFIES strings
 ✅ **~61x memory saving with tails** and **~25x** with core
 
 ## 🧩 Vocabulary (SMILES)
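The whitespace explanation in the added bullet above is consistent with a greedy longest-match encoder: space-separated SELFIES tokens lengthen the string the matcher must scan. A minimal sketch of greedy longest-match encoding follows; the diff does not show FastChemTokenizer's internals, so treat the strategy and the toy `vocab` as assumptions.

```python
# Minimal greedy longest-match encoder, a common strategy for motif-based
# tokenizers (assumed here; FastChemTokenizer's internals are not shown in
# this diff). Whitespace-separated SELFIES input makes `text` much longer,
# one plausible source of the slowdown noted above.
def greedy_encode(text, vocab, max_token_len=8):
    ids = []
    i = 0
    while i < len(text):
        # Try the longest vocabulary entry first, then back off.
        for n in range(min(max_token_len, len(text) - i), 0, -1):
            piece = text[i:i + n]
            if piece in vocab:
                ids.append(vocab[piece])
                i += n
                break
        else:
            i += 1  # skip characters not covered by the vocab (no UNK here)
    return ids

vocab = {"[C]": 0, "[O]": 1, "[=O]": 2, " ": 3}
print(greedy_encode("[C] [C] [=O] [O]", vocab))  # -> [0, 3, 0, 3, 2, 3, 1]
```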
@@ -150,7 +152,7 @@ This project is an ongoing **experiment** — all contributions are welcome!
 >
 
 ## ✍️ On-going
- - [>] Redo evaluation with proper metrics and CI
+ - [x] Redo evaluation with proper metrics and CI
 - [>] Validation on VAE and Causal LM Transformer
 - [x] Finish vocab construction on SELFIES
 - [ ] Write technical report on methods, results
@@ -216,3 +218,16 @@ Apache 2.0
 }
 ```
 
+
+
+
+
+
+
+
+
+
+
+
+
+