---
license: apache-2.0
pipeline_tag: feature-extraction
tags:
- chemistry
- tokenizer
---

# 🧪 FastChemTokenizer — A High-Performance SMILES Tokenizer Built via Info-Theoretic Motif Mining

> **Optimized for chemical language modeling. 2x faster, 50% shorter sequences, minimal memory. Built with entropy-guided n-gram selection.**

## 🚀 Overview

`FastChemTokenizer` is a **trie-based, longest-match-first tokenizer** designed for efficient tokenization of **SMILES and SELFIES strings** in molecular language modeling. Built from scratch for speed and compactness, it outperforms popular tokenizers such as [ChemBERTa](https://huggingface.co/seyonec/ChemBERTa-zinc-base-v1/)'s while maintaining a 0% UNK rate on a ~2.7M-molecule dataset and compatibility with Hugging Face `transformers`.

For n-gram construction, this project uses [seyonec/ChemBERTa](https://huggingface.co/seyonec/ChemBERTa-zinc-base-v1/)'s tokenizer as a seed: n-grams are mined over its token IDs, then filtered with information-theoretic criteria (entropy reduction, PMI, internal entropy) to extract statistically meaningful chemical motifs. The vocabulary is then balanced between 391 backbone (functional) and 391 tail fragments for structural coverage. Training used ~2.7M valid SMILES and SELFIES built and curated from the ChEMBL34 (Zdrazil _et al._ 2023), COCONUTDB (Sorokina _et al._ 2021), and SuperNatural3 (Gallo _et al._ 2023) datasets; the resulting 76K n-grams were pruned to **1,238 tokens**, including backbone/tail motifs and special tokens. The `comb_smi.csv` dataset can be downloaded [here](https://huggingface.co/datasets/gbyuvd/bioactives-naturals-smiles-molgen), and a tentative technical report can be read [here](https://amachinewithorgans.wordpress.com/2025/09/27/fastchemtokenizer-a-new-approach-to-chemical-language-processing-via-statistical-info-theoretic-motif-mining/).

## ⚡ Performance Highlights

#### SMILES

| Metric | FastChemTokenizer | [ChemBERTa](https://huggingface.co/seyonec/ChemBERTa-zinc-base-v1/) Tokenizer | [gen-mlm-cismi-bert](https://huggingface.co/smostafanejad/gen-mlm-cismi-bert-wordpiece) |
|--------------------------------|------------------------|---------------------|---------------------|
| **Avg time per SMILES** | **0.0692 ± 0.0038 ms** | 0.1279 ± 0.0090 ms | 0.1029 ± 0.0038 ms |
| **Avg sequence length** | **21.61 ± 0.70 tokens** | 42.23 ± 1.55 tokens | 50.86 ± 1.90 tokens |
| **Throughput** | **14,448/sec** | 7,817/sec | 9,720/sec |
| **Peak memory usage** | **12.92 MB** | 258.00 MB | 387.73 MB |
| **UNK token rate** | **0.0000%** | 0.0000% | ~0.0000% (non-zero) |
| **1000 encodes (benchmark)** | **0.0029s** | 1.6598s | 0.5491s |

✅ **1.97x faster** than ChemBERTa
✅ **1.50x faster** than gen-mlm-cismi-bert
✅ **~19x memory saving** compared to both tokenizers above
✅ **No indexing errors** (avoids >512-token sequences)
✅ **Zero unknown tokens** on the validation set

#### SELFIES

```
Core's vocab length = 781 (after pruning)
with tails = 1161 (after pruning)
```

| Metric | FastChemTokenizer-WTails | FastChemTokenizer-Core | [opti-chemfie-experiment-1](https://huggingface.co/gbyuvd/bionat-selfies-gen-tokenizer-wordlevel) |
|--------------------------------|--------------------------|------------------------|---------------------|
| **Avg time per SELFIES** | 0.1882 ± 0.0140 ms | 0.1674 ± 0.0093 ms | **0.1157 ± 0.0095 ms** |
| **Avg sequence length** | **20.46 ± 1.21 tokens** | 33.41 ± 1.80 tokens | 54.29 ± 3.08 tokens |
| **Throughput** | 5,313/sec | 5,973/sec | **8,642/sec** |
| **Peak memory usage** | **9.32 MB** | 20.16 MB | 490.13 MB |
| **UNK token rate** | **0.0000%** | 0.0000% | 0.0000% |
| **1000 encodes (benchmark)** | **0.0081s** | 2.9020s | 2.9020s |

✅ Although 1.32x slower, it produces **2.65x fewer tokens**; the slowdown is likely related to matching across the many whitespace separators in the formatted SELFIES strings
✅ **~61x memory saving** with tails and **~25x** with core
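For reference, numbers like the ones above can be gathered with a small harness along these lines. This is a minimal sketch using only the standard library; `micro_benchmark`, `tokenizer`, and `smiles_list` are placeholders for illustration, not part of this repo:

```python
import time
import tracemalloc

def micro_benchmark(tokenizer, smiles_list):
    """Rough per-string latency, throughput, and peak-memory probe."""
    tracemalloc.start()
    t0 = time.perf_counter()
    for smi in smiles_list:
        tokenizer.encode(smi)
    elapsed = time.perf_counter() - t0
    _, peak = tracemalloc.get_traced_memory()  # (current, peak) in bytes
    tracemalloc.stop()

    n = len(smiles_list)
    print(f"avg time per string : {1000 * elapsed / n:.4f} ms")
    print(f"throughput          : {n / elapsed:,.0f}/sec")
    print(f"peak memory         : {peak / 1024**2:.2f} MB")
```

One caveat: since `FastChemTokenizer` caches repeated encodes via `@lru_cache` (see Implementation below), encoding the same string 1,000 times mostly measures cache hits, which likely explains why the "1000 encodes" row is far below 1,000 × the per-string average.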
## 🧩 Vocabulary (SMILES)

- **Final vocab size**: 1,238 tokens
- **Includes**: 391 backbone motifs + 391 tail motifs + 5 special tokens
- **Pruned**: 270 unused tokens (e.g., `'²'`, `'C@@H](O)['`, `'È'`)
- **Training corpus**: ~119M unigrams from ~3M SMILES sequences
- **Entropy-based filtering**: internal entropy > 0.5, entropy reduction < 0.95

## 🛠️ Implementation

- **Algorithm**: trie-based longest-prefix match (a toy sketch of this idea appears after the installation steps below)
- **Caching**: `@lru_cache` for repeated string encoding
- **HF compatible**: implements `__call__`, `encode_plus`, `batch_encode_plus`, `save_pretrained`, `from_pretrained`
- **Memory efficient**: trie traversal and caching

**For SMILES (core backbone vocab, without tails).** For the with-tails variant, use `./smitok`; if you want to use the HF-compatible tokenizer (still in development), please use `FastChemTokenizerHF`:

```python
from FastChemTokenizer import FastChemTokenizer

tokenizer = FastChemTokenizer.from_pretrained("../smitok_core")

benzene = "c1ccccc1"
encoded = tokenizer.encode(benzene)
print("✅ Encoded:", encoded)
decoded = tokenizer.decode(encoded)
print("✅ Decoded:", decoded)
tokenizer.decode_with_trace(encoded)

# ✅ Encoded: [271, 474, 840]
# ✅ Decoded: c1ccccc1
#
# 🔍 Decoding 3 tokens:
#   [000] ID= 271 → 'c1ccc'
#   [001] ID= 474 → 'cc'
#   [002] ID= 840 → '1'
```

**For SELFIES.** Please don't use the old `FastChemTokenizer` for SELFIES; use the HF one:

```python
from FastChemTokenizerHF import FastChemTokenizerSelfies

tokenizer = FastChemTokenizerSelfies.from_pretrained("../selftok_core")  # the *_core path is the without-tails variant

benzene = "[C] [=C] [C] [=C] [C] [=C] [Ring1] [=Branch1]"  # please make sure the input is whitespace-separated
encoded = tokenizer.encode(benzene)
print("✅ Encoded:", encoded)
decoded = tokenizer.decode(encoded)
print("✅ Decoded:", decoded)
tokenizer.decode_with_trace(encoded)

# ✅ Encoded: [0, 257, 640, 693, 402, 1]
# ✅ Decoded: [C] [=C] [C] [=C] [C] [=C] [Ring1] [=Branch1]
#
# 🔍 Decoding 6 tokens:
#   [000] ID=   0 → ''
#   [001] ID= 257 → '[C] [=C] [C] [=C] [C]'
#   [002] ID= 640 → '[=C]'
#   [003] ID= 693 → '[Ring1]'
#   [004] ID= 402 → '[=Branch1]'
#   [005] ID=   1 → ''
```

#### BigSMILES (experimental)

```python
from FastChemTokenizer import FastChemTokenizer

tokenizer = FastChemTokenizer.from_pretrained("./bigsmiles-proto")

testentry = "*CC(*)c1ccccc1C(=O)OCCCCCC"
encoded = tokenizer.encode(testentry)
print("✅ Encoded:", encoded)
decoded = tokenizer.decode(encoded)
print("✅ Decoded:", decoded)
tokenizer.decode_with_trace(encoded)

# ✅ Encoded: [186, 185, 723, 31, 439]
# ✅ Decoded: *CC(*)c1ccccc1C(=O)OCCCCCC
#
# 🔍 Decoding 5 tokens:
#   [000] ID= 186 → '*CC(*)'
#   [001] ID= 185 → 'c1cccc'
#   [002] ID= 723 → 'c1'
#   [003] ID=  31 → 'C(=O)OCC'
#   [004] ID= 439 → 'CCCC'
```

## 📦 Installation & Usage

0. Make sure you have all the required packages; other package versions may also work.
1. Clone this repository to a directory.
2. Load the tokenizer:

```python
from FastChemTokenizer import FastChemTokenizer

tokenizer = FastChemTokenizer.from_pretrained("./smitok_core")
```

3. Use it like any Hugging Face tokenizer:

```python
outputs = tokenizer.batch_encode_plus(smiles_list, padding=True, truncation=True, max_length=512)
```
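To make the trie-based longest-match-first strategy from the Implementation section concrete, here is a toy sketch. The four-token vocabulary below is made up for illustration (token IDs chosen to mirror the benzene trace above); this is a simplified illustration, not the shipped code:

```python
def build_trie(vocab):
    """Build a nested-dict trie; '_id' marks the end of a valid token."""
    trie = {}
    for token, tid in vocab.items():
        node = trie
        for ch in token:
            node = node.setdefault(ch, {})
        node["_id"] = tid
    return trie

def tokenize(text, trie):
    """Greedy longest-match-first: at each position, take the longest vocab token."""
    ids, i = [], 0
    while i < len(text):
        node, match, j = trie, None, i
        while j < len(text) and text[j] in node:  # walk as deep as the trie allows
            node = node[text[j]]
            j += 1
            if "_id" in node:
                match = (node["_id"], j)  # remember the longest match seen so far
        if match is None:
            raise ValueError(f"no token matches at position {i}")
        ids.append(match[0])
        i = match[1]  # resume right after the matched token
    return ids

toy_vocab = {"c1ccc": 271, "cc": 474, "1": 840, "c": 99}
print(tokenize("c1ccccc1", build_trie(toy_vocab)))  # [271, 474, 840], as in the trace above
```

The real implementation additionally layers `@lru_cache` over encoding, so repeated strings skip the trie walk entirely.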
## 📚 Models using this tokenizer

- [ChemMiniQ3-HoriFIE](https://github.com/gbyuvd/ChemMiniQ3-HoriFIE)
- [ChemMiniQ3-SAbRLo](https://huggingface.co/gbyuvd/ChemMiniQ3-SAbRLo)

## 📚 Early VAE Evaluation (vs. ChemBERTa's) [WIP for Scaling]

Using `benchmark_simpler.py`: 1st epoch, on ~13K samples with len(token_ids) <= 25; embed_dim=64, hidden_dim=128, latent_dim=64, num_layers=2; batch_size = 16 * 4 (grad. accum.).

Latent space visualization based on SMILES interpolation validity:

![image](https://cdn-uploads.huggingface.co/production/uploads/667da868d653c0b02d6a2399/sfzBvmJR-ovjpe5F7vNR4.png)

Using smitok (with tails):

![image](https://cdn-uploads.huggingface.co/production/uploads/667da868d653c0b02d6a2399/-TusjDSYv9J3K-pfb0hqu.png)

```text
Train: 13017 Val: 1627 Test: 1628

=== Benchmarking ChemBERTa ===
vocab_size                       : 767
avg_tokens_per_mol               : 25.0359
compression_ratio                : 1.3766
percent_unknown                  : 0.0000
encode_throughput_smiles_per_sec : 4585.2022
decode_throughput_smiles_per_sec : 18168.2779
decode_reconstruction_accuracy   : 100.0000

=== Benchmarking FastChemTokenizerHF ===
vocab_size                       : 1238
avg_tokens_per_mol               : 13.5668
compression_ratio                : 2.5403
percent_unknown                  : 0.0000
encode_throughput_smiles_per_sec : 32005.8686
decode_throughput_smiles_per_sec : 29807.3610
decode_reconstruction_accuracy   : 100.0000
```
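The two headline compression metrics can be computed along these lines. This is a sketch, not `benchmark_simpler.py` itself; it assumes `compression_ratio` means average characters per molecule divided by average tokens per molecule, which is consistent with the numbers reported above:

```python
def tokenizer_stats(tokenizer, smiles_list):
    """Average tokens per molecule and characters-per-token compression ratio."""
    n_tokens = [len(tokenizer.encode(s)) for s in smiles_list]
    avg_tokens_per_mol = sum(n_tokens) / len(n_tokens)
    avg_chars_per_mol = sum(len(s) for s in smiles_list) / len(smiles_list)
    return {
        "avg_tokens_per_mol": avg_tokens_per_mol,
        "compression_ratio": avg_chars_per_mol / avg_tokens_per_mol,
    }
```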
## 🔧 Contributing

This project is an ongoing **experiment**, and all contributions are welcome!

- 🧠 Have a better way to implement the methods?
- 📊 Want to add evaluation metrics?
- ✨ Found a bug? Please open an issue!

👉 Please:

- Keep changes minimal and focused.
- Add comments if you change core logic.

## ⚠️ Disclaimer

> **This is NOT a production-ready tokenizer.**
>
> - Built during late-night prototyping sessions 🌙
> - Not yet validated on downstream tasks
> - Some methods in fragment building are heuristic and unproven; the technical report and code for them will be released soon!
> - I'm still learning ML/AI~

## ✍️ Ongoing

- [x] Redo evaluation with proper metrics and CI
- [>] Validation on VAE and causal LM transformer
- [x] Finish vocab construction on SELFIES
- [>] Write technical report on methods and results

## 📄 License

Apache 2.0

## 🙏 Credits

- Inspired by the [ChemFIE project](https://huggingface.co/gbyuvd/bionat-selfies-gen-tokenizer-wordlevel), [ChemBERTa](https://huggingface.co/seyonec/ChemBERTa-zinc-base-v1/), [gen-mlm-cismi-bert](https://huggingface.co/smostafanejad/gen-mlm-cismi-bert-wordpiece), and [Tseng _et al._ 2024](https://openreview.net/forum?id=eR9C6c76j5)
- Built for efficiency
- Code & fragment vocab by gbyuvd

## References

### BibTeX

#### COCONUTDB

```bibtex
@article{sorokina2021coconut,
  title={COCONUT online: Collection of Open Natural Products database},
  author={Sorokina, Maria and Merseburger, Peter and Rajan, Kohulan and Yirik, Mehmet Aziz and Steinbeck, Christoph},
  journal={Journal of Cheminformatics},
  volume={13},
  number={1},
  pages={2},
  year={2021},
  doi={10.1186/s13321-020-00478-9}
}
```

#### ChEMBL34

```bibtex
@article{zdrazil2023chembl,
  title={The ChEMBL Database in 2023: a drug discovery platform spanning multiple bioactivity data types and time periods},
  author={Zdrazil, Barbara and Felix, Eloy and Hunter, Fiona and Manners, Emma J and Blackshaw, James and Corbett, Sybilla and de Veij, Marleen and Ioannidis, Harris and Lopez, David Mendez and Mosquera, Juan F and Magarinos, Maria Paula and Bosc, Nicolas and Arcila, Ricardo and Kizil{\"o}ren, Tevfik and Gaulton, Anna and Bento, A Patr{\'i}cia and Adasme, Melissa F and Monecke, Peter and Landrum, Gregory A and Leach, Andrew R},
  journal={Nucleic Acids Research},
  year={2023},
  volume={gkad1004},
  doi={10.1093/nar/gkad1004}
}

@misc{chembl34,
  title={ChEMBL34},
  year={2023},
  doi={10.6019/CHEMBL.database.34}
}
```

#### SuperNatural3

```bibtex
@article{Gallo2023,
  author = {Gallo, K and Kemmler, E and Goede, A and Becker, F and Dunkel, M and Preissner, R and Banerjee, P},
  title = {{SuperNatural 3.0 - a database of natural products and natural product-based derivatives}},
  journal = {Nucleic Acids Research},
  year = {2023},
  month = jan,
  day = {6},
  volume = {51},
  number = {D1},
  pages = {D654-D659},
  doi = {10.1093/nar/gkac1008}
}
```

---