---
license: apache-2.0
pipeline_tag: feature-extraction
tags:
- chemistry
- tokenizer
---
# πŸ§ͺ FastChemTokenizer β€” A High-Performance SMILES Tokenizer built via Info-Theoretic Motif Mining
> **Optimized for chemical language modeling. 2x faster, 50% shorter sequences, minimal memory. Built with entropy-guided n-gram selection.**
## πŸš€ Overview
`FastChemTokenizer` is a **trie-based, longest-match-first tokenizer** designed for efficient tokenization of **SMILES and SELFIES strings** in molecular language modeling. Built from scratch for speed and compactness, it outperforms popular tokenizers such as [ChemBERTa](https://huggingface.co/seyonec/ChemBERTa-zinc-base-v1/)'s while maintaining a 0% UNK rate on a ~2.7M-molecule dataset and compatibility with Hugging Face `transformers`. For n-gram building, this project uses [seyonec/ChemBERTa](https://huggingface.co/seyonec/ChemBERTa-zinc-base-v1/)'s tokenizer to produce initial token IDs, applies information-theoretic filtering (entropy reduction, PMI, internal entropy) to extract statistically meaningful chemical motifs, and then balances 391 backbone (functional) and 391 tail fragments for structural coverage.
The vocabulary was trained on ~2.7M valid SMILES and SELFIES strings built and curated from the ChEMBL34 (Zdrazil _et al._ 2023), COCONUTDB (Sorokina _et al._ 2021), and SuperNatural3 (Gallo _et al._ 2023) datasets; the resulting ~76K n-grams were pruned to **1,238 tokens**, including backbone/tail motifs and special tokens.
The "comb_smi.csv" dataset can be downloaded [here](https://huggingface.co/datasets/gbyuvd/bioactives-naturals-smiles-molgen).
A tentative technical report can be read [here](https://amachinewithorgans.wordpress.com/2025/09/27/fastchemtokenizer-a-new-approach-to-chemical-language-processing-via-statistical-info-theoretic-motif-mining/).
## ⚑ Performance Highlights
#### SMILES
| Metric | FastChemTokenizer | [ChemBERTa](https://huggingface.co/seyonec/ChemBERTa-zinc-base-v1/) Tokenizer | [gen-mlm-cismi-bert](https://huggingface.co/smostafanejad/gen-mlm-cismi-bert-wordpiece) |
|--------------------------------|-------------------|----------------------|---------------------|
| **Avg time per SMILES** | **0.0692 Β± 0.0038 ms** | 0.1279 Β± 0.0090 ms | 0.1029 Β± 0.0038 ms |
| **Avg sequence length** | **21.61 Β± 0.70 tokens**| 42.23 Β± 1.55 tokens | 50.86 Β± 1.90 tokens |
| **Throughput** | **14,448/sec** | 7,817/sec | 9,720/sec |
| **Peak memory usage** | **12.92 MB** | 258.00 MB | 387.73 MB |
| **UNK token rate** | **0.0000%** | 0.0000% | ~0.0000% (non-zero) |
| **1000 encodes (benchmark)** | **0.0029s** | 1.6598s | 0.5491s |
βœ… **1.97x faster** than ChemBERTa
βœ… **1.50x faster** than gen-mlm-cismi-bert
βœ… **~19x memory saving** compared to both of the above tokenizers
βœ… **No indexing errors** (avoids >512 token sequences)
βœ… **Zero unknown tokens** on validation set
#### SELFIES
```
Core's vocab length = 781 (after pruning)
with tails = 1161 (after pruning)
```
| Metric | FastChemTokenizer-WTails | FastChemTokenizer-Core | [opti-chemfie-experiment-1](https://huggingface.co/gbyuvd/bionat-selfies-gen-tokenizer-wordlevel) |
|--------------------------------|-------------------|----------------------|---------------------|
| **Avg time per SELFIES** | 0.1882 Β± 0.0140 ms | 0.1674 Β± 0.0093 ms | **0.1157 Β± 0.0095 ms** |
| **Avg sequence length** | **20.46 Β± 1.21 tokens** | 33.41 Β± 1.80 tokens | 54.29 Β± 3.08 tokens |
| **Throughput** | 5,313/sec | 5,973/sec | **8,642/sec** |
| **Peak memory usage** | **9.32 MB** | 20.16 MB | 490.13 MB |
| **UNK token rate** | **0.0000%** | 0.0000% | 0.0000% |
| **1000 encodes (benchmark)** | **0.0081s** | 2.9020s | 2.9020s |
βœ… Even though it is 1.32x slower, it produces **2.65x fewer tokens**
- This slowdown is likely related to trie lookups over the many whitespace separators in the formatted SELFIES strings
βœ… **~61x memory saving with tails** and **~25x** with core
## 🧩 Vocabulary (SMILES)
- **Final vocab size**: 1,238 tokens
- **Includes**: 391 backbone motifs + 391 tail motifs + special tokens (`<s>`, `</s>`, `<pad>`, `<unk>`, `<mask>`)
- **Pruned**: 270 unused tokens (e.g., `'²'`, `'C@@H](O)['`, `'È'`)
- **Training corpus**: ~119M unigrams from ~3M SMILES sequences
- **Entropy-based filtering**: Internal entropy > 0.5, entropy reduction < 0.95
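For intuition, here is a hedged sketch of the scores referenced above, using the standard definitions of PMI and Shannon entropy (the exact formulation and counting used for motif mining are described in the technical report; the function names below are illustrative):
```python
# Illustrative only: standard PMI and internal-entropy scores for a candidate
# n-gram, assuming simple frequency counts. Thresholds like those above
# (internal entropy > 0.5, entropy reduction < 0.95) are then applied to keep
# or drop candidate motifs.
import math
from collections import Counter

def pmi(ngram_count, unigram_counts, total_ngrams, total_unigrams):
    """PMI of an n-gram vs. the independence assumption over its unigrams."""
    p_ngram = ngram_count / total_ngrams
    p_independent = 1.0
    for count in unigram_counts:            # one count per unigram in the n-gram
        p_independent *= count / total_unigrams
    return math.log2(p_ngram / p_independent)

def internal_entropy(token_ids):
    """Shannon entropy (bits) of the token distribution inside one n-gram."""
    counts = Counter(token_ids)
    n = len(token_ids)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```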
## πŸ› οΈ Implementation
- **Algorithm**: Trie-based longest-prefix-match (see the sketch below)
- **Caching**: `@lru_cache` for repeated string encoding
- **HF Compatible**: Implements `__call__`, `encode_plus`, `batch_encode_plus`, `save_pretrained`, `from_pretrained`
- **Memory Efficient**: Trie traversal and cache
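As referenced in the list above, here is a minimal sketch of longest-prefix-match encoding over a character trie with an LRU cache (illustrative only; `TrieSketch` and its toy vocab are hypothetical, not the repository code):
```python
# Minimal, illustrative trie-based longest-prefix-match tokenizer sketch.
from functools import lru_cache

class TrieSketch:
    def __init__(self, vocab):
        self.vocab = vocab                      # {token_string: token_id}
        self.trie = {}
        for token in vocab:
            node = self.trie
            for ch in token:
                node = node.setdefault(ch, {})
            node["_end"] = token                # marks a complete token

    @lru_cache(maxsize=100_000)                 # repeated strings encode for free
    def encode(self, text):
        ids, i = [], 0
        while i < len(text):
            node, match = self.trie, None
            for ch in text[i:]:                 # walk as deep as the trie allows
                if ch not in node:
                    break
                node = node[ch]
                if "_end" in node:
                    match = node["_end"]        # remember the longest match so far
            if match is None:
                raise ValueError(f"no vocab entry matches at position {i}")
            ids.append(self.vocab[match])
            i += len(match)
        return tuple(ids)                       # tuple so the result is cacheable

# Toy vocab mirroring the SMILES trace shown further below:
# TrieSketch({"c1ccc": 271, "cc": 474, "1": 840}).encode("c1ccccc1") -> (271, 474, 840)
```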
**For SMILES (core backbone vocab, without tails)**
For the version with tails, use `./smitok`.
If you want the Hugging Face-compatible tokenizer (still in development), use `FastChemTokenizerHF`.
```python
from FastChemTokenizer import FastChemTokenizer
tokenizer = FastChemTokenizer.from_pretrained("./smitok_core")
benzene = "c1ccccc1"
encoded = tokenizer.encode(benzene)
print("βœ… Encoded:", encoded)
decoded = tokenizer.decode(encoded)
print("βœ… Decoded:", decoded)
tokenizer.decode_with_trace(encoded)
# βœ… Encoded: [271, 474, 840]
# βœ… Decoded: c1ccccc1
#
# πŸ” Decoding 3 tokens:
# [000] ID= 271 β†’ 'c1ccc'
# [001] ID= 474 β†’ 'cc'
# [002] ID= 840 β†’ '1'
```
**For SELFIES**
Please do not use the old `FastChemTokenizer` for SELFIES; use the HF-compatible one:
```python
from FastChemTokenizerHF import FastChemTokenizerSelfies
tokenizer = FastChemTokenizerSelfies.from_pretrained("./selftok_core")  # *_core = core vocab (without tails)
benzene = "[C] [=C] [C] [=C] [C] [=C] [Ring1] [=Branch1]"  # make sure the input is whitespace-separated
encoded = tokenizer.encode(benzene)
print("βœ… Encoded:", encoded)
decoded = tokenizer.decode(encoded)
print("βœ… Decoded:", decoded)
tokenizer.decode_with_trace(encoded)
# βœ… Encoded: [0, 257, 640, 693, 402, 1]
# βœ… Decoded: <s> [C] [=C] [C] [=C] [C] [=C] [Ring1] [=Branch1] </s>
# πŸ” Decoding 6 tokens:
# [000] ID= 0 β†’ '<s>'
# [001] ID= 257 β†’ '[C] [=C] [C] [=C] [C]'
# [002] ID= 640 β†’ '[=C]'
# [003] ID= 693 β†’ '[Ring1]'
# [004] ID= 402 β†’ '[=Branch1]'
# [005] ID= 1 β†’ '</s>'
```
#### BigSMILES (experimental)
```python
from FastChemTokenizer import FastChemTokenizer
tokenizer = FastChemTokenizer.from_pretrained("./bigsmiles-proto")
testentry = "*CC(*)c1ccccc1C(=O)OCCCCCC"
encoded = tokenizer.encode(testentry)
print("βœ… Encoded:", encoded)
decoded = tokenizer.decode(encoded)
print("βœ… Decoded:", decoded)
tokenizer.decode_with_trace(encoded)
# βœ… Encoded: [186, 185, 723, 31, 439]
# βœ… Decoded: *CC(*)c1ccccc1C(=O)OCCCCCC
#
# πŸ” Decoding 5 tokens:
# [000] ID= 186 β†’ '*CC(*)'
# [001] ID= 185 β†’ 'c1cccc'
# [002] ID= 723 β†’ 'c1'
# [003] ID= 31 β†’ 'C(=O)OCC'
# [004] ID= 439 β†’ 'CCCC'
```
## πŸ“¦ Installation & Usage
0. Make sure you have all the required packages installed (other versions may also work)
1. Clone this repository to a directory
2. Load with:
```python
from FastChemTokenizer import FastChemTokenizer
tokenizer = FastChemTokenizer.from_pretrained("./smitok_core")
```
3. Use like any Hugging Face tokenizer:
```python
outputs = tokenizer.batch_encode_plus(smiles_list, padding=True, truncation=True, max_length=512)
```
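The padded batch can then be fed to a model. Below is a minimal usage sketch, assuming `batch_encode_plus` returns an HF-style dict with `input_ids` and `attention_mask` (the `torch` conversion is illustrative, not part of the tokenizer):
```python
# Hedged usage sketch: turn the padded batch into PyTorch tensors.
import torch

smiles_list = ["c1ccccc1", "CCO", "CC(=O)Oc1ccccc1C(=O)O"]
outputs = tokenizer.batch_encode_plus(
    smiles_list, padding=True, truncation=True, max_length=512
)
input_ids = torch.tensor(outputs["input_ids"])            # shape: (batch, max_seq_len)
attention_mask = torch.tensor(outputs["attention_mask"])  # 1 = real token, 0 = <pad>
print(input_ids.shape, attention_mask.shape)
```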
## πŸ“š Models using this tokenizer:
- [ChemMiniQ3-HoriFIE](https://github.com/gbyuvd/ChemMiniQ3-HoriFIE)
- [ChemMiniQ3-SAbRLo](https://huggingface.co/gbyuvd/ChemMiniQ3-SAbRLo)
## πŸ“š Early VAE Evaluation (vs. ChemBERTa's) [WIP for Scaling]
Using `benchmark_simpler.py`: 1st epoch, on ~13K samples with len(token_ids) <= 25; embed_dim=64, hidden_dim=128, latent_dim=64, num_layers=2; batch_size = 16 Γ— 4 (gradient accumulation)
Latent Space Visualization based on SMILES Interpolation Validity
![image](https://cdn-uploads.huggingface.co/production/uploads/667da868d653c0b02d6a2399/sfzBvmJR-ovjpe5F7vNR4.png)
using smitok (with tails)
![image](https://cdn-uploads.huggingface.co/production/uploads/667da868d653c0b02d6a2399/-TusjDSYv9J3K-pfb0hqu.png)
```text
Train: 13017
Val: 1627
Test: 1628
=== Benchmarking ChemBERTa ===
vocab_size : 767
avg_tokens_per_mol : 25.0359
compression_ratio : 1.3766
percent_unknown : 0.0000
encode_throughput_smiles_per_sec : 4585.2022
decode_throughput_smiles_per_sec : 18168.2779
decode_reconstruction_accuracy : 100.0000
=== Benchmarking FastChemTokenizerHF ===
vocab_size : 1238
avg_tokens_per_mol : 13.5668
compression_ratio : 2.5403
percent_unknown : 0.0000
encode_throughput_smiles_per_sec : 32005.8686
decode_throughput_smiles_per_sec : 29807.3610
decode_reconstruction_accuracy : 100.0000
```
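For reference, here is a hedged sketch of how metrics like `avg_tokens_per_mol`, `compression_ratio`, and encode throughput could be computed (assumed definitions; `benchmark_simpler.py` may differ in details such as character counting or unknown-token detection):
```python
# Assumed metric definitions, illustrative only.
import time

def quick_benchmark(tokenizer, smiles_list, unk_id=None):
    total_chars = sum(len(s) for s in smiles_list)

    start = time.perf_counter()
    encoded = [tokenizer.encode(s) for s in smiles_list]
    elapsed = time.perf_counter() - start

    total_tokens = sum(len(ids) for ids in encoded)
    unk_tokens = sum(ids.count(unk_id) for ids in encoded) if unk_id is not None else 0
    return {
        "avg_tokens_per_mol": total_tokens / len(smiles_list),
        "compression_ratio": total_chars / total_tokens,       # input chars per token
        "percent_unknown": 100.0 * unk_tokens / max(total_tokens, 1),
        "encode_throughput_smiles_per_sec": len(smiles_list) / elapsed,
    }
```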
## πŸ”§ Contributing
This project is an ongoing **experiment** β€” all contributions are welcome!
- 🧠 Have a better way to implement the methods?
- πŸ“Š Want to add evaluation metrics?
- ✨ Found a bug? Please open an issue!
πŸ‘‰ Please:
- Keep changes minimal and focused.
- Add comments if you change core logic.
## ⚠️ Disclaimer
> **This is NOT a production-ready tokenizer.**
>
> - Built during late-night prototyping sessions πŸŒ™
> - Not yet validated on downstream tasks
> - Some methods in fragment building are heuristic and unproven; the technical report and code for them will be released soon!
> - I’m still learning ML/AI~
>
## ✍️ On-going
- [x] Redo evaluation with proper metrics and CI
- [>] Validation on VAE and Causal LM Transformer
- [x] Finish vocab construction on SELFIES
- [>] Write technical report on methods, results
## πŸ“„ License
Apache 2.0
## πŸ™ Credits
- Inspired by [ChemFIE project](https://huggingface.co/gbyuvd/bionat-selfies-gen-tokenizer-wordlevel), [ChemBERTa](https://huggingface.co/seyonec/ChemBERTa-zinc-base-v1/), [gen-mlm-cismi-bert](https://huggingface.co/smostafanejad/gen-mlm-cismi-bert-wordpiece), and [Tseng _et al._ 2024](https://openreview.net/forum?id=eR9C6c76j5)
- Built for efficiency
- Code & fragments vocab by gbyuvd
## References
### BibTeX
#### COCONUTDB
```bibtex
@article{sorokina2021coconut,
title={COCONUT online: Collection of Open Natural Products database},
author={Sorokina, Maria and Merseburger, Peter and Rajan, Kohulan and Yirik, Mehmet Aziz and Steinbeck, Christoph},
journal={Journal of Cheminformatics},
volume={13},
number={1},
pages={2},
year={2021},
doi={10.1186/s13321-020-00478-9}
}
```
#### ChEMBL34
```bibtex
@article{zdrazil2023chembl,
title={The ChEMBL Database in 2023: a drug discovery platform spanning multiple bioactivity data types and time periods},
author={Zdrazil, Barbara and Felix, Eloy and Hunter, Fiona and Manners, Emma J and Blackshaw, James and Corbett, Sybilla and de Veij, Marleen and Ioannidis, Harris and Lopez, David Mendez and Mosquera, Juan F and Magarinos, Maria Paula and Bosc, Nicolas and Arcila, Ricardo and Kizil{\"o}ren, Tevfik and Gaulton, Anna and Bento, A Patr{\'i}cia and Adasme, Melissa F and Monecke, Peter and Landrum, Gregory A and Leach, Andrew R},
journal={Nucleic Acids Research},
year={2023},
volume={gkad1004},
doi={10.1093/nar/gkad1004}
}
@misc{chembl34,
title={ChemBL34},
year={2023},
doi={10.6019/CHEMBL.database.34}
}
```
#### SuperNatural3
```bibtex
@article{Gallo2023,
author = {Gallo, K and Kemmler, E and Goede, A and Becker, F and Dunkel, M and Preissner, R and Banerjee, P},
title = {{SuperNatural 3.0-a database of natural products and natural product-based derivatives}},
journal = {Nucleic Acids Research},
year = {2023},
month = jan,
day = {6},
volume = {51},
number = {D1},
pages = {D654-D659},
doi = {10.1093/nar/gkac1008}
}
```
---