Update README.md

README.md (CHANGED)

@@ -8,6 +8,7 @@ tags:
- tokenizer
---

+
# 🧪 FastChemTokenizer — A High-Performance SMILES Tokenizer built via Info-Theoretic Motif Mining

> **Optimized for chemical language modeling. 2x faster, 50% shorter sequences, minimal memory. Built with entropy-guided n-gram selection.**
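
The tagline's "entropy-guided n-gram selection" is not spelled out in this diff. As a rough illustration only — the scoring rule, function names, and thresholds below are invented, not FastChemTokenizer's actual mining code — an entropy-guided motif miner can rank candidate n-grams by frequency times branching entropy and keep the top scorers as vocabulary entries:

```python
# Hypothetical sketch of entropy-guided n-gram mining over SMILES strings.
# Not the repo's algorithm: the frequency-times-entropy score is illustrative.
from collections import Counter, defaultdict
import math

def mine_motifs(smiles_corpus, max_n=4, top_k=500):
    """Rank character n-grams by count x entropy of the following character."""
    ngram_counts = Counter()
    successors = defaultdict(Counter)  # n-gram -> distribution of next char
    for smi in smiles_corpus:
        for n in range(2, max_n + 1):
            for i in range(len(smi) - n):
                gram = smi[i:i + n]
                ngram_counts[gram] += 1
                successors[gram][smi[i + n]] += 1
    scores = {}
    for gram, count in ngram_counts.items():
        total = sum(successors[gram].values())
        # High successor entropy suggests the n-gram ends at a natural boundary.
        entropy = -sum((c / total) * math.log2(c / total)
                       for c in successors[gram].values())
        scores[gram] = count * entropy
    return [g for g, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]]

print(mine_motifs(["CCO", "c1ccccc1O", "CC(=O)Oc1ccccc1C(=O)O"], top_k=5))
```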

@@ -26,10 +27,10 @@ The "comb_smi.csv" dataset can be downloaded [here](https://huggingface.co/datas

#### SMILES
| Metric | FastChemTokenizer | [ChemBERTa](https://huggingface.co/seyonec/ChemBERTa-zinc-base-v1/) Tokenizer | [gen-mlm-cismi-bert](https://huggingface.co/smostafanejad/gen-mlm-cismi-bert-wordpiece) |
|--------------------------------|-------------------|----------------------|---------------------|
-| **Avg time per SMILES** | **0.…
-| **Avg sequence length** | **21.…
-| **Throughput** | **…
-| **Peak memory usage** | **…
+| **Avg time per SMILES** | **0.0692 ± 0.0038 ms** | 0.1279 ± 0.0090 ms | 0.1029 ± 0.0038 ms |
+| **Avg sequence length** | **21.61 ± 0.70 tokens** | 42.23 ± 1.55 tokens | 50.86 ± 1.90 tokens |
+| **Throughput** | **14,448/sec** | 7,817/sec | 9,720/sec |
+| **Peak memory usage** | **12.92 MB** | 258.00 MB | 387.73 MB |
| **UNK token rate** | **0.0000%** | 0.0000% | ~0.0000% (non-zero) |
| **1000 encodes (benchmark)** | **0.0029s** | 1.6598s | 0.5491s |
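
The timing and memory rows above are the kind of numbers a small `time.perf_counter` + `tracemalloc` harness produces. A minimal sketch, assuming only a callable that maps a SMILES string to token ids (the `encode` argument and corpus variable are stand-ins, not the repo's benchmark script):

```python
# Hypothetical benchmark harness; the repo's actual measurement code may differ.
import time
import tracemalloc
from statistics import mean, stdev

def benchmark(encode, smiles_list, repeats=5):
    """Print avg encode time, avg sequence length, throughput, peak memory."""
    times, lengths = [], []
    tracemalloc.start()
    for _ in range(repeats):
        t0 = time.perf_counter()
        for smi in smiles_list:
            lengths.append(len(encode(smi)))
        times.append((time.perf_counter() - t0) / len(smiles_list))
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    print(f"Avg time per SMILES: {mean(times)*1e3:.4f} ± {stdev(times)*1e3:.4f} ms")
    print(f"Avg sequence length: {mean(lengths):.2f} tokens")
    print(f"Throughput: {1.0 / mean(times):,.0f}/sec")
    print(f"Peak memory usage: {peak / 2**20:.2f} MB")

# e.g. benchmark(lambda s: tokenizer.encode(s), smiles_from_comb_smi_csv)
```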

@@ -46,14 +47,15 @@ Core's vocab length = 781 (after pruning)
```
| Metric | FastChemTokenizer-WTails | FastChemTokenizer-Core | [opti-chemfie-experiment-1](https://huggingface.co/gbyuvd/bionat-selfies-gen-tokenizer-wordlevel) |
|--------------------------------|-------------------|----------------------|---------------------|
-| **Avg time per SMILES** | 0.…
-| **Avg sequence length** | **20.…
-| **Throughput** | …
-| **Peak memory usage** | **…
+| **Avg time per SMILES** | 0.1882 ± 0.0140 ms | 0.1674 ± 0.0093 ms | **0.1157 ± 0.0095 ms** |
+| **Avg sequence length** | **20.46 ± 1.21 tokens** | 33.41 ± 1.80 tokens | 54.29 ± 3.08 tokens |
+| **Throughput** | 5,313/sec | 5,973/sec | **8,642/sec** |
+| **Peak memory usage** | **9.32 MB** | 20.16 MB | 490.13 MB |
| **UNK token rate** | **0.0000%** | 0.0000% | 0.0000% |
| **1000 encodes (benchmark)** | **0.0081s** | 2.9020s | 2.9020s |

✅ Even though it is 1.32x slower, it produces **2.65x fewer tokens**
+- this slowdown could be related to searching across the many whitespace separators in the formatted SELFIES strings (see the sketch below)
✅ **~61x memory saving with tails** and **~25x** with core
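
To see why scanning whitespace-separated SELFIES symbols for multi-symbol motifs costs more than a plain word-level split, here is a toy sketch — the `MOTIFS` table and greedy merge loop are invented for illustration and are not FastChemTokenizer's implementation:

```python
# Hypothetical illustration: a word-level tokenizer stops at str.split(),
# while a motif tokenizer must also try multi-symbol merges at each position.
MOTIFS = {("[C]", "[C]"), ("[C]", "[=C]"), ("[C]", "[C]", "[O]")}  # invented

def merge_motifs(selfies_str, max_len=3):
    symbols = selfies_str.split()  # whitespace-formatted SELFIES symbols
    out, i = [], 0
    while i < len(symbols):
        # Greedy longest match: try the longest candidate motif first.
        for n in range(min(max_len, len(symbols) - i), 1, -1):
            if tuple(symbols[i:i + n]) in MOTIFS:
                out.append("".join(symbols[i:i + n]))
                i += n
                break
        else:
            out.append(symbols[i])
            i += 1
    return out

print(merge_motifs("[C] [C] [O] [=C]"))  # -> ['[C][C][O]', '[=C]']
```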
## 🧩 Vocabulary (SMILES)

@@ -150,7 +152,7 @@ This project is an ongoing **experiment** — all contributions are welcome!
>

## ✍️ On-going
+- [x] Redo evaluation with proper metrics and CI
- [>] Validation on VAE and Causal LM Transformer
- [x] Finish vocab construction on SELFIES
- [ ] Write technical report on methods, results

@@ -216,3 +218,16 @@ Apache 2.0
}
```

+
+
+
+
+
+
+
+
+
+
+
+
+