---
license: apache-2.0
pipeline_tag: feature-extraction
tags:
- chemistry
- tokenizer
---
# 🧪 FastChemTokenizer: A High-Performance SMILES Tokenizer Built via Info-Theoretic Motif Mining
> **Optimized for chemical language modeling. 2x faster, 50% shorter sequences, minimal memory. Built with entropy-guided n-gram selection.**
## 🔍 Overview
`FastChemTokenizer` is a **trie-based, longest-match-first tokenizer** designed for efficient tokenization of **SMILES and SELFIES strings** in molecular language modeling. Built from scratch for speed and compactness, it outperforms popular tokenizers like [ChemBERTa](https://huggingface.co/seyonec/ChemBERTa-zinc-base-v1/)'s while maintaining a 0% UNK rate on a ~2.7M-molecule dataset and compatibility with Hugging Face `transformers`. To build the n-grams, this project first uses [seyonec/ChemBERTa](https://huggingface.co/seyonec/ChemBERTa-zinc-base-v1/)'s tokenizer as an early-stage tokenizer and mines n-grams over its token IDs, then applies information-theoretic filtering (entropy reduction, PMI, internal entropy) to extract statistically meaningful chemical motifs, and finally balances 391 backbone (functional) and 391 tail fragments for structural coverage.
Trained on ~2.7M valid SMILES and SELFIES built and curated from the ChEMBL34 (Zdrazil _et al._ 2023), COCONUTDB (Sorokina _et al._ 2021), and SuperNatural3 (Gallo _et al._ 2023) datasets; the resulting ~76K n-grams were pruned down to **1,238 tokens**, including backbone/tail motifs and special tokens.
The "comb_smi.csv" dataset can be downloaded [here](https://huggingface.co/datasets/gbyuvd/bioactives-naturals-smiles-molgen).
A tentative technical report can be read [here](https://amachinewithorgans.wordpress.com/2025/09/27/fastchemtokenizer-a-new-approach-to-chemical-language-processing-via-statistical-info-theoretic-motif-mining/)
## ⚡ Performance Highlights
#### SMILES
| Metric | FastChemTokenizer | [ChemBERTa](https://huggingface.co/seyonec/ChemBERTa-zinc-base-v1/) Tokenizer | [gen-mlm-cismi-bert](https://huggingface.co/smostafanejad/gen-mlm-cismi-bert-wordpiece) |
|--------------------------------|-------------------|----------------------|---------------------|
| **Avg time per SMILES** | **0.0692 ± 0.0038 ms** | 0.1279 ± 0.0090 ms | 0.1029 ± 0.0038 ms |
| **Avg sequence length** | **21.61 ± 0.70 tokens**| 42.23 ± 1.55 tokens | 50.86 ± 1.90 tokens |
| **Throughput** | **14,448/sec** | 7,817/sec | 9,720/sec |
| **Peak memory usage** | **12.92 MB** | 258.00 MB | 387.73 MB |
| **UNK token rate** | **0.0000%** | 0.0000% | ~0.0000% (non-zero) |
| **1000 encodes (benchmark)** | **0.0029s** | 1.6598s | 0.5491s |
✅ **1.97x faster** than ChemBERTa
✅ **1.50x faster** than gen-mlm-cismi-bert
✅ **~19x memory saving** compared to both of the above tokenizers
✅ **No indexing errors** (avoids >512-token sequences)
✅ **Zero unknown tokens** on the validation set
#### SELFIES
```
Core's vocab length = 781 (after pruning)
with tails = 1161 (after pruning)
```
| Metric | FastChemTokenizer-WTails | FastChemTokenizer-Core | [opti-chemfie-experiment-1](https://huggingface.co/gbyuvd/bionat-selfies-gen-tokenizer-wordlevel) |
|--------------------------------|-------------------|----------------------|---------------------|
| **Avg time per SELFIES** | 0.1882 ± 0.0140 ms| 0.1674 ± 0.0093 ms | **0.1157 ± 0.0095 ms**|
| **Avg sequence length** | **20.46 ± 1.21 tokens** | 33.41 ± 1.80 tokens | 54.29 ± 3.08 tokens |
| **Throughput** | 5,313/sec | 5,973/sec | **8,642/sec** |
| **Peak memory usage** | **9.32 MB** | 20.16 MB | 490.13 MB |
| **UNK token rate** | **0.0000%** | 0.0000% | 0.0000% |
| **1000 encodes (benchmark)** | **0.0081s** | 2.9020s | 2.9020s |
✅ Even though 1.32x slower, it produces **2.65x fewer tokens**
- the slowdown could be related to matching across the many whitespaces in the formatted SELFIES strings
✅ **~61x memory saving with tails** and **~25x** with core
## 🧩 Vocabulary (SMILES)
- **Final vocab size**: 1,238 tokens
- **Includes**: 391 backbone motifs + 391 tail motifs + special tokens (`<s>`, `</s>`, `<pad>`, `<unk>`, `<mask>`)
- **Pruned**: 270 unused tokens (e.g., `'²'`, `'C@@H](O)['`, `'Γ'`)
- **Training corpus**: ~119M unigrams from ~3M SMILES sequences
- **Entropy-based filtering**: internal entropy > 0.5, entropy reduction < 0.95 (a simplified sketch of this scoring follows this list)
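To make the filtering step concrete, here is a simplified sketch of this kind of n-gram scoring. Only the two thresholds above come from this card; the function names, formulas, and toy probabilities are illustrative assumptions rather than the project's actual code (entropy-reduction filtering is omitted for brevity):

```python
# Simplified sketch of entropy/PMI-style n-gram scoring (illustrative only).
import math
from collections import Counter

def internal_entropy(ngram):
    """Shannon entropy of the unigram distribution *inside* an n-gram.
    Low values flag degenerate repeats such as ('C', 'C', 'C', 'C')."""
    counts = Counter(ngram)
    n = len(ngram)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def pmi(ngram, p_unigram, p_ngram):
    """Pointwise mutual information: how much more often the n-gram occurs
    than expected if its unigrams were independent."""
    independent = math.prod(p_unigram[u] for u in ngram)
    return math.log2(p_ngram[ngram] / independent)

def keep(ngram, p_unigram, p_ngram, pmi_min=2.0):
    # internal entropy threshold (0.5) taken from the list above
    return internal_entropy(ngram) > 0.5 and pmi(ngram, p_unigram, p_ngram) >= pmi_min

# toy example: a cohesive motif vs. a degenerate repeat
p_uni = {"C": 0.5, "(": 0.2, "=O": 0.2, ")": 0.1}
p_ngm = {("C", "(", "=O", ")"): 0.05, ("C", "C", "C", "C"): 0.01}
print(keep(("C", "(", "=O", ")"), p_uni, p_ngm))  # True (diverse and cohesive)
print(keep(("C", "C", "C", "C"), p_uni, p_ngm))   # False (internal entropy = 0)
```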
## 🛠️ Implementation
- **Algorithm**: Trie-based longest-prefix-match (see the sketch after this list)
- **Caching**: `@lru_cache` for repeated string encoding
- **HF Compatible**: Implements `__call__`, `encode_plus`, `batch_encode_plus`, `save_pretrained`, `from_pretrained`
- **Memory Efficient**: Trie traversal and cache
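A minimal sketch of the longest-prefix-match idea with an `lru_cache` wrapper, assuming a plain dict vocabulary; names like `TrieNode` and `encode` are illustrative, not the project's actual API (the token IDs are taken from the decode trace in the SMILES example below):

```python
# Minimal sketch of trie-based longest-match tokenization (illustrative only).
from functools import lru_cache

class TrieNode:
    __slots__ = ("children", "token_id")
    def __init__(self):
        self.children = {}    # char -> TrieNode
        self.token_id = None  # set when a vocab token ends at this node

def build_trie(vocab):
    root = TrieNode()
    for token, token_id in vocab.items():
        node = root
        for ch in token:
            node = node.children.setdefault(ch, TrieNode())
        node.token_id = token_id
    return root

def encode(text, root, unk_id=3):
    ids, i = [], 0
    while i < len(text):
        node, best_id, best_end, j = root, None, i + 1, i
        # walk as deep as the text allows, remembering the deepest
        # node that ends a real vocab token (the longest match)
        while j < len(text) and text[j] in node.children:
            node = node.children[text[j]]
            j += 1
            if node.token_id is not None:
                best_id, best_end = node.token_id, j
        ids.append(best_id if best_id is not None else unk_id)
        i = best_end
    return ids

vocab = {"c1ccc": 271, "cc": 474, "c": 100, "1": 840}
root = build_trie(vocab)
print(encode("c1ccccc1", root))  # [271, 474, 840]

@lru_cache(maxsize=100_000)
def encode_cached(text):
    # repeated strings (common in chemistry corpora) skip the trie walk
    return tuple(encode(text, root))
```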
**For SMILES (core backbone vocab, without tails):**
For the version with tails, use `./smitok` instead.
If you want the HF-compatible tokenizer (still in development), please use `FastChemTokenizerHF`.
```python
from FastChemTokenizer import FastChemTokenizer

tokenizer = FastChemTokenizer.from_pretrained("../smitok_core")

benzene = "c1ccccc1"
encoded = tokenizer.encode(benzene)
print("✅ Encoded:", encoded)
decoded = tokenizer.decode(encoded)
print("✅ Decoded:", decoded)
tokenizer.decode_with_trace(encoded)
# ✅ Encoded: [271, 474, 840]
# ✅ Decoded: c1ccccc1
#
# 🔍 Decoding 3 tokens:
# [000] ID= 271 → 'c1ccc'
# [001] ID= 474 → 'cc'
# [002] ID= 840 → '1'
```
**For SELFIES:**
Please don't use the old `FastChemTokenizer` for SELFIES; use the HF one:
```python
from FastChemTokenizerHF import FastChemTokenizerSelfies

tokenizer = FastChemTokenizerSelfies.from_pretrained("../selftok_core")  # use *_core for the version w/o tails

benzene = "[C] [=C] [C] [=C] [C] [=C] [Ring1] [=Branch1]"  # input must be whitespace-separated
encoded = tokenizer.encode(benzene)
print("✅ Encoded:", encoded)
decoded = tokenizer.decode(encoded)
print("✅ Decoded:", decoded)
tokenizer.decode_with_trace(encoded)
# ✅ Encoded: [0, 257, 640, 693, 402, 1]
# ✅ Decoded: <s> [C] [=C] [C] [=C] [C] [=C] [Ring1] [=Branch1] </s>
#
# 🔍 Decoding 6 tokens:
# [000] ID= 0 → '<s>'
# [001] ID= 257 → '[C] [=C] [C] [=C] [C]'
# [002] ID= 640 → '[=C]'
# [003] ID= 693 → '[Ring1]'
# [004] ID= 402 → '[=Branch1]'
# [005] ID= 1 → '</s>'
```
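If your SELFIES come straight out of the `selfies` encoder, they will not contain the whitespace the tokenizer expects. A minimal preprocessing sketch, assuming the `selfies` package is installed and `tokenizer` is loaded as above:

```python
import selfies as sf

raw = sf.encoder("c1ccccc1")              # '[C][=C][C][=C][C][=C][Ring1][=Branch1]'
spaced = " ".join(sf.split_selfies(raw))  # '[C] [=C] [C] [=C] [C] [=C] [Ring1] [=Branch1]'
encoded = tokenizer.encode(spaced)
```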
#### BigSMILES (experimental)
```python
from FastChemTokenizer import FastChemTokenizer

tokenizer = FastChemTokenizer.from_pretrained("./bigsmiles-proto")

testentry = "*CC(*)c1ccccc1C(=O)OCCCCCC"
encoded = tokenizer.encode(testentry)
print("✅ Encoded:", encoded)
decoded = tokenizer.decode(encoded)
print("✅ Decoded:", decoded)
tokenizer.decode_with_trace(encoded)
# ✅ Encoded: [186, 185, 723, 31, 439]
# ✅ Decoded: *CC(*)c1ccccc1C(=O)OCCCCCC
#
# 🔍 Decoding 5 tokens:
# [000] ID= 186 → '*CC(*)'
# [001] ID= 185 → 'c1cccc'
# [002] ID= 723 → 'c1'
# [003] ID= 31 → 'C(=O)OCC'
# [004] ID= 439 → 'CCCC'
```
## 📦 Installation & Usage
0. Make sure you have all the required packages installed (it may also work with different versions)
1. Clone this repository to a directory
2. Load with:
```python
from FastChemTokenizer import FastChemTokenizer
tokenizer = FastChemTokenizer.from_pretrained("./smitok_core")
```
3. Use like any Hugging Face tokenizer:
```python
outputs = tokenizer.batch_encode_plus(smiles_list, padding=True, truncation=True, max_length=512)
```
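The returned batch follows the usual Hugging Face layout; a small usage sketch (the SMILES below are arbitrary examples, and the exact fields are assumed to mirror `transformers` conventions):

```python
smiles_list = ["c1ccccc1", "CCO", "CC(=O)Oc1ccccc1C(=O)O"]
outputs = tokenizer.batch_encode_plus(smiles_list, padding=True, truncation=True, max_length=512)
print(outputs["input_ids"])       # padded token-id sequences, one per molecule
print(outputs["attention_mask"])  # 1 for real tokens, 0 for padding
```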
## 📌 Models using this tokenizer:
- [ChemMiniQ3-HoriFIE](https://github.com/gbyuvd/ChemMiniQ3-HoriFIE)
- [ChemMiniQ3-SAbRLo](https://huggingface.co/gbyuvd/ChemMiniQ3-SAbRLo)
## 📊 Early VAE Evaluation (vs. ChemBERTa's) [WIP for Scaling]
Using `benchmark_simpler.py`: 1st epoch, on ~13K samples with `len(token_ids) <= 25`; embed_dim=64, hidden_dim=128, latent_dim=64, num_layers=2; batch_size = 16 × 4 (gradient accumulation).
Latent Space Visualization based on SMILES Interpolation Validity

using smitok (with tails)

```text
Train: 13017
Val: 1627
Test: 1628
=== Benchmarking ChemBERTa ===
vocab_size : 767
avg_tokens_per_mol : 25.0359
compression_ratio : 1.3766
percent_unknown : 0.0000
encode_throughput_smiles_per_sec : 4585.2022
decode_throughput_smiles_per_sec : 18168.2779
decode_reconstruction_accuracy : 100.0000
=== Benchmarking FastChemTokenizerHF ===
vocab_size : 1238
avg_tokens_per_mol : 13.5668
compression_ratio : 2.5403
percent_unknown : 0.0000
encode_throughput_smiles_per_sec : 32005.8686
decode_throughput_smiles_per_sec : 29807.3610
decode_reconstruction_accuracy : 100.0000
```
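For reference, the two headline metrics above are easy to recompute; a minimal sketch, assuming `compression_ratio` means input characters per emitted token (a reading consistent with the table, since both tokenizers then imply ~34.5 characters per molecule):

```python
def tokenizer_stats(tokenizer, smiles_list):
    """Average tokens per molecule and chars-per-token compression ratio."""
    total_chars = sum(len(s) for s in smiles_list)
    total_tokens = sum(len(tokenizer.encode(s)) for s in smiles_list)
    return {
        "avg_tokens_per_mol": total_tokens / len(smiles_list),
        "compression_ratio": total_chars / total_tokens,
    }
```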
## 🧠 Contributing
This project is an ongoing **experiment**, and all contributions are welcome!
- 🔧 Have a better way to implement the methods?
- 📊 Want to add evaluation metrics?
- ✨ Found a bug? Please open an issue!
🙏 Please:
- Keep changes minimal and focused.
- Add comments if you change core logic.
## ⚠️ Disclaimer
> **This is NOT a production-ready tokenizer.**
>
> - Built during late-night prototyping sessions 🌙
> - Not yet validated on downstream tasks
> - Some methods in fragment building are heuristic and unproven; the technical report and code for them will be released soon!
> - I'm still learning ML/AI~
>
## ✍️ Ongoing
- [x] Redo evaluation with proper metrics and CI
- [>] Validation on VAE and Causal LM Transformer
- [x] Finish vocab construction on SELFIES
- [>] Write technical report on methods, results
## 📜 License
Apache 2.0
## 🙏 Credits
- Inspired by [ChemFIE project](https://huggingface.co/gbyuvd/bionat-selfies-gen-tokenizer-wordlevel), [ChemBERTa](https://huggingface.co/seyonec/ChemBERTa-zinc-base-v1/), [gen-mlm-cismi-bert](https://huggingface.co/smostafanejad/gen-mlm-cismi-bert-wordpiece), and [Tseng _et al_. 2024](https://openreview.net/forum?id=eR9C6c76j5)
- Built for efficiency
- Code & fragments vocab by gbyuvd
## References
### BibTeX
#### COCONUTDB
```bibtex
@article{sorokina2021coconut,
  title={COCONUT online: Collection of Open Natural Products database},
  author={Sorokina, Maria and Merseburger, Peter and Rajan, Kohulan and Yirik, Mehmet Aziz and Steinbeck, Christoph},
  journal={Journal of Cheminformatics},
  volume={13},
  number={1},
  pages={2},
  year={2021},
  doi={10.1186/s13321-020-00478-9}
}
```
#### ChEMBL34
```bibtex
@article{zdrazil2023chembl,
  title={The ChEMBL Database in 2023: a drug discovery platform spanning multiple bioactivity data types and time periods},
  author={Zdrazil, Barbara and Felix, Eloy and Hunter, Fiona and Manners, Emma J and Blackshaw, James and Corbett, Sybilla and de Veij, Marleen and Ioannidis, Harris and Lopez, David Mendez and Mosquera, Juan F and Magarinos, Maria Paula and Bosc, Nicolas and Arcila, Ricardo and Kizil{\"o}ren, Tevfik and Gaulton, Anna and Bento, A Patr{\'i}cia and Adasme, Melissa F and Monecke, Peter and Landrum, Gregory A and Leach, Andrew R},
  journal={Nucleic Acids Research},
  year={2023},
  volume={gkad1004},
  doi={10.1093/nar/gkad1004}
}
@misc{chembl34,
  title={ChEMBL34},
  year={2023},
  doi={10.6019/CHEMBL.database.34}
}
```
#### SuperNatural3
```bibtex
@article{Gallo2023,
  author = {Gallo, K and Kemmler, E and Goede, A and Becker, F and Dunkel, M and Preissner, R and Banerjee, P},
  title = {{SuperNatural 3.0-a database of natural products and natural product-based derivatives}},
  journal = {Nucleic Acids Research},
  year = {2023},
  month = jan,
  day = {6},
  volume = {51},
  number = {D1},
  pages = {D654-D659},
  doi = {10.1093/nar/gkac1008}
}
```
---