---
license: apache-2.0
pipeline_tag: feature-extraction
tags:
- chemistry
- tokenizer
---

# 🧪 FastChemTokenizer: A High-Performance SMILES Tokenizer built via Info-Theoretic Motif Mining

> **Optimized for chemical language modeling. 2x faster, 50% shorter sequences, minimal memory. Built with entropy-guided n-gram selection.**


## 🚀 Overview

`FastChemTokenizer` is a **trie-based, longest-match-first tokenizer** designed for efficient tokenization of **SMILES and SELFIES strings** in molecular language modeling. Built from scratch for speed and compactness, it outperforms popular tokenizers such as [ChemBERTa](https://huggingface.co/seyonec/ChemBERTa-zinc-base-v1/)'s while maintaining a 0% UNK rate on a ~2.7M-molecule dataset and staying compatible with Hugging Face `transformers`. To build the vocabulary, the project first tokenizes the corpus with [seyonec/ChemBERTa](https://huggingface.co/seyonec/ChemBERTa-zinc-base-v1/)'s tokenizer and mines n-grams over its token_ids, then applies information-theoretic filtering (entropy reduction, PMI, internal entropy) to extract statistically meaningful chemical motifs, and finally balances 391 backbone (functional) and 391 tail fragments for structural coverage.

The tokenizer was trained on ~2.7M valid SMILES and SELFIES strings built and curated from the ChEMBL34 (Zdrazil _et al._ 2023), COCONUTDB (Sorokina _et al._ 2021), and SuperNatural3 (Gallo _et al._ 2023) datasets; the resulting ~76K candidate n-grams were pruned down to **1,238 tokens**, including backbone/tail motifs and special tokens.
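
To make the mining step concrete, here is a minimal sketch of the n-gram counting stage. It is an illustration only, not the released code: `base_tokenizer` is assumed to be the ChemBERTa tokenizer loaded via `transformers`, and the counted n-grams would subsequently be scored and filtered with the information-theoretic criteria described above.

```python
from collections import Counter
from transformers import AutoTokenizer

# Hypothetical sketch of the n-gram counting stage (not the project's actual script).
base_tokenizer = AutoTokenizer.from_pretrained("seyonec/ChemBERTa-zinc-base-v1")

def count_ngrams(smiles_corpus, n_max=8):
    """Count contiguous token-id n-grams (length 2..n_max) over a SMILES corpus."""
    counts = Counter()
    for smi in smiles_corpus:
        ids = base_tokenizer.encode(smi, add_special_tokens=False)
        for n in range(2, n_max + 1):
            for i in range(len(ids) - n + 1):
                counts[tuple(ids[i:i + n])] += 1
    return counts

ngram_counts = count_ngrams(["c1ccccc1", "CC(=O)Oc1ccccc1C(=O)O"])
# Surviving n-grams are scored (PMI, entropy criteria), decoded back to string
# fragments, and the best ones are added to the final vocabulary.
```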

The "comb_smi.csv" dataset can be downloaded [here](https://huggingface.co/datasets/gbyuvd/bioactives-naturals-smiles-molgen).

A tentative technical report can be read [here](https://amachinewithorgans.wordpress.com/2025/09/27/fastchemtokenizer-a-new-approach-to-chemical-language-processing-via-statistical-info-theoretic-motif-mining/)

## ⚡ Performance Highlights

#### SMILES
| Metric                          | FastChemTokenizer | [ChemBERTa](https://huggingface.co/seyonec/ChemBERTa-zinc-base-v1/) Tokenizer | [gen-mlm-cismi-bert](https://huggingface.co/smostafanejad/gen-mlm-cismi-bert-wordpiece) |
|--------------------------------|-------------------|----------------------|---------------------|
| **Avg time per SMILES**        | **0.0692 ± 0.0038 ms**  | 0.1279 ± 0.0090 ms   | 0.1029 ± 0.0038 ms |
| **Avg sequence length**        | **21.61 ± 0.70 tokens**| 42.23 ± 1.55 tokens  | 50.86 ± 1.90 tokens |
| **Throughput**                 | **14,448/sec**    | 7,817/sec            | 9,720/sec           |
| **Peak memory usage**          | **12.92 MB**      | 258.00 MB            | 387.73 MB           |
| **UNK token rate**             | **0.0000%**       | 0.0000%              | ~0.0000% (non-zero) |
| **1000 encodes (benchmark)**   | **0.0029s**       | 1.6598s              | 0.5491s             |

✅ **1.97x faster** than ChemBERTa  
✅ **1.50x faster** than gen-mlm-cismi-bert  
✅ **~19x memory saving** compared to both of the above tokenizers  
✅ **No indexing errors** (avoids >512 token sequences)  
✅ **Zero unknown tokens** on validation set

#### SELFIES
```
Core's vocab length = 781 (after pruning) 
        with tails = 1161 (after pruning) 
```
| Metric                         | FastChemTokenizer-WTails | FastChemTokenizer-Core | [opti-chemfie-experiment-1](https://huggingface.co/gbyuvd/bionat-selfies-gen-tokenizer-wordlevel) |
|--------------------------------|-------------------|----------------------|---------------------|
| **Avg time per SELFIES**       | 0.1882 ± 0.0140 ms| 0.1674 ± 0.0093 ms   | **0.1157 ± 0.0095 ms**|
| **Avg sequence length**        | **20.46 ± 1.21 tokens**  | 33.41 ± 1.80 tokens | 54.29 ± 3.08 tokens |
| **Throughput**                 | 5,313/sec         | 5,973/sec            | **8,642/sec**      |
| **Peak memory usage**          | **9.32 MB**       | 20.16 MB             | 490.13 MB           |
| **UNK token rate**             | **0.0000%**       | 0.0000%              | 0.0000%             |
| **1000 encodes (benchmark)**   | **0.0081s**       | 2.9020s              | 2.9020s             |

✅ Even though it is 1.32x slower, it produces **2.65x fewer tokens**   
        - this slowdown may be related to matching across the many whitespace separators in the formatted SELFIES strings  
✅ **~61x memory saving with tails** and **~25x** with core

## 🧩 Vocabulary (SMILES)

- **Final vocab size**: 1,238 tokens
- **Includes**: 391 backbone motifs + 391 tail motifs + special tokens (`<s>`, `</s>`, `<pad>`, `<unk>`, `<mask>`)
- **Pruned**: 270 unused tokens (e.g., `'²'`, `'C@@H](O)['`, `'È'`)
- **Training corpus**: ~119M unigrams from ~3M SMILES sequences
- **Entropy-based filtering**: Internal entropy > 0.5, entropy reduction < 0.95 (see the sketch below)
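
As a rough illustration of how the thresholds above could be applied, the sketch below scores a candidate n-gram by its internal entropy and by an entropy-reduction ratio. The exact formulations used by the project may differ, so treat the function definitions here as assumptions rather than the actual selection code.

```python
import math
from collections import Counter

def internal_entropy(ngram):
    """Shannon entropy (bits) of the token distribution inside one candidate motif.
    Very low values mean the motif is dominated by a single repeated symbol."""
    counts = Counter(ngram)
    total = len(ngram)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def entropy_reduction(ngram_count, part_counts, corpus_size):
    """Surprisal of the whole motif divided by the summed surprisal of its parts.
    Values well below 1 suggest the motif is cheaper to encode as a single token.
    (Assumed formulation, not necessarily the project's exact definition.)"""
    h_ngram = -math.log2(ngram_count / corpus_size)
    h_parts = sum(-math.log2(c / corpus_size) for c in part_counts)
    return h_ngram / h_parts

def keep_motif(ngram, ngram_count, unigram_counts, corpus_size):
    """Apply the two thresholds listed above to a single candidate n-gram."""
    part_counts = [unigram_counts[t] for t in ngram]
    return (internal_entropy(ngram) > 0.5
            and entropy_reduction(ngram_count, part_counts, corpus_size) < 0.95)
```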


## 🛠️ Implementation

- **Algorithm**: Trie-based longest-prefix-match (minimal sketch below)
- **Caching**: `@lru_cache` for repeated string encoding
- **HF Compatible**: Implements `__call__`, `encode_plus`, `batch_encode_plus`, `save_pretrained`, `from_pretrained`
- **Memory Efficient**: Trie traversal and cache
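
For intuition, here is a minimal, self-contained sketch of longest-prefix-match tokenization over a trie. It illustrates the general technique only; the actual `FastChemTokenizer` class differs in its data layout, caching, and special-token handling.

```python
# Minimal illustration of trie-based longest-match-first tokenization (not the real class).
class TrieNode:
    def __init__(self):
        self.children = {}
        self.token_id = None  # set when a vocabulary entry ends at this node

def build_trie(vocab):
    root = TrieNode()
    for token, token_id in vocab.items():
        node = root
        for ch in token:
            node = node.children.setdefault(ch, TrieNode())
        node.token_id = token_id
    return root

def tokenize(text, root):
    """Greedy longest-match-first: at each position take the longest vocab entry."""
    ids, i = [], 0
    while i < len(text):
        node, best_id, best_len = root, None, 0
        for j in range(i, len(text)):
            node = node.children.get(text[j])
            if node is None:
                break
            if node.token_id is not None:
                best_id, best_len = node.token_id, j - i + 1
        if best_id is None:   # no vocab entry matches: emit an <unk>-style id
            ids.append(-1)
            i += 1
        else:
            ids.append(best_id)
            i += best_len
    return ids

toy_vocab = {"c1ccc": 271, "cc": 474, "1": 840, "c": 10}
print(tokenize("c1ccccc1", build_trie(toy_vocab)))  # -> [271, 474, 840]
```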

**For SMILES (core backbone vocab, without tails):**

For the vocabulary with tails, use `./smitok` instead.

If you want the HF-compatible tokenizer (still in development), use `FastChemTokenizerHF`.

```python
from FastChemTokenizer import FastChemTokenizer

tokenizer = FastChemTokenizer.from_pretrained("../smitok_core")
benzene = "c1ccccc1"
encoded = tokenizer.encode(benzene)
print("βœ… Encoded:", encoded)
decoded = tokenizer.decode(encoded)
print("βœ… Decoded:", decoded)
tokenizer.decode_with_trace(encoded)

# βœ… Encoded: [271, 474, 840]
# βœ… Decoded: c1ccccc1
# 
# πŸ” Decoding 3 tokens:
#   [000] ID=  271 β†’ 'c1ccc'
#   [001] ID=  474 β†’ 'cc'
#   [002] ID=  840 β†’ '1'


```

**For SELFIES:**

Please don't use the old `FastChemTokenizer` for SELFIES; use the HF-compatible one instead.

```python
from FastChemTokenizerHF import FastChemTokenizerSelfies

tokenizer = FastChemTokenizerSelfies.from_pretrained("../selftok_core") # *_core = core vocab (without tails)
benzene = "[C] [=C] [C] [=C] [C] [=C] [Ring1] [=Branch1]" # input must be whitespace-separated
encoded = tokenizer.encode(benzene)
print("✅ Encoded:", encoded)
decoded = tokenizer.decode(encoded)
print("✅ Decoded:", decoded)
tokenizer.decode_with_trace(encoded)

# ✅ Encoded: [0, 257, 640, 693, 402, 1]
# ✅ Decoded: <s> [C] [=C] [C] [=C] [C] [=C] [Ring1] [=Branch1] </s>

# 🔍 Decoding 6 tokens:
#  [000] ID=    0 → '<s>'
#  [001] ID=  257 → '[C] [=C] [C] [=C] [C]'
#  [002] ID=  640 → '[=C]'
#  [003] ID=  693 → '[Ring1]'
#  [004] ID=  402 → '[=Branch1]'
#  [005] ID=    1 → '</s>'
```

#### BigSMILES (experimental)
```python
from FastChemTokenizer import FastChemTokenizer

tokenizer = FastChemTokenizer.from_pretrained("./bigsmiles-proto") 
testentry = "*CC(*)c1ccccc1C(=O)OCCCCCC"
encoded = tokenizer.encode(testentry)
print("βœ… Encoded:", encoded)
decoded = tokenizer.decode(encoded)
print("βœ… Decoded:", decoded)
tokenizer.decode_with_trace(encoded)

# βœ… Encoded: [186, 185, 723, 31, 439]
# βœ… Decoded: *CC(*)c1ccccc1C(=O)OCCCCCC
# 
# πŸ” Decoding 5 tokens:
#   [000] ID=  186 β†’ '*CC(*)'
#   [001] ID=  185 β†’ 'c1cccc'
#   [002] ID=  723 β†’ 'c1'
#   [003] ID=   31 β†’ 'C(=O)OCC'
#   [004] ID=  439 β†’ 'CCCC'
```

## 📦 Installation & Usage

0. Make sure you have all the required packages installed (other versions may also work).
1. Clone this repository to a directory
2. Load with:
```python
from FastChemTokenizer import FastChemTokenizer

tokenizer = FastChemTokenizer.from_pretrained("./smitok_core")
```
3. Use like any Hugging Face tokenizer:
```python
outputs = tokenizer.batch_encode_plus(smiles_list, padding=True, truncation=True, max_length=512)
```
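
If the batch output is going into a PyTorch model, the encoded lists can be turned into tensors along these lines. This is a sketch that assumes the output follows the usual Hugging Face-style dict layout with `input_ids` and `attention_mask` keys; adjust if the actual keys differ.

```python
import torch

smiles_list = ["c1ccccc1", "CC(=O)Oc1ccccc1C(=O)O"]
outputs = tokenizer.batch_encode_plus(smiles_list, padding=True, truncation=True, max_length=512)

# Assumed HF-style keys; the padded lists become (batch, seq_len) tensors.
input_ids = torch.tensor(outputs["input_ids"])
attention_mask = torch.tensor(outputs["attention_mask"])
```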

## 📚 Models using this tokenizer:
- [ChemMiniQ3-HoriFIE](https://github.com/gbyuvd/ChemMiniQ3-HoriFIE)
- [ChemMiniQ3-SAbRLo](https://huggingface.co/gbyuvd/ChemMiniQ3-SAbRLo)


## 📚 Early VAE Evaluation (vs. ChemBERTa's) [WIP for Scaling]
Using `benchmark_simpler.py`: 1st epoch, on ~13K samples with len(token_ids) <= 25; embed_dim=64, hidden_dim=128, latent_dim=64, num_layers=2; batch_size = 16 * 4 (gradient accumulation).

Latent Space Visualization based on SMILES Interpolation Validity   

![image](https://cdn-uploads.huggingface.co/production/uploads/667da868d653c0b02d6a2399/sfzBvmJR-ovjpe5F7vNR4.png)

using smitok (with tails)

![image](https://cdn-uploads.huggingface.co/production/uploads/667da868d653c0b02d6a2399/-TusjDSYv9J3K-pfb0hqu.png)

```text
Train: 13017
Val:   1627
Test:  1628

=== Benchmarking ChemBERTa ===
vocab_size                         : 767
avg_tokens_per_mol                 : 25.0359
compression_ratio                  : 1.3766
percent_unknown                    : 0.0000
encode_throughput_smiles_per_sec   : 4585.2022
decode_throughput_smiles_per_sec   : 18168.2779
decode_reconstruction_accuracy     : 100.0000

=== Benchmarking FastChemTokenizerHF ===
vocab_size                         : 1238
avg_tokens_per_mol                 : 13.5668
compression_ratio                  : 2.5403
percent_unknown                    : 0.0000
encode_throughput_smiles_per_sec   : 32005.8686
decode_throughput_smiles_per_sec   : 29807.3610
decode_reconstruction_accuracy     : 100.0000
```
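
For context, the token-count and throughput metrics above could be reproduced with a small harness like the one below. The metric definitions here (tokens per molecule, characters-per-token compression ratio, encodes per second) are my reading of the reported names and may not match `benchmark_simpler.py` exactly.

```python
import time
from FastChemTokenizer import FastChemTokenizer

def benchmark(tokenizer, smiles_list):
    """Rough re-implementation of the reported metrics (assumed definitions)."""
    start = time.perf_counter()
    encoded = [tokenizer.encode(s) for s in smiles_list]
    elapsed = time.perf_counter() - start

    n_tokens = sum(len(ids) for ids in encoded)
    n_chars = sum(len(s) for s in smiles_list)
    return {
        "avg_tokens_per_mol": n_tokens / len(smiles_list),
        "compression_ratio": n_chars / n_tokens,
        "encode_throughput_smiles_per_sec": len(smiles_list) / elapsed,
    }

tok = FastChemTokenizer.from_pretrained("./smitok_core")
print(benchmark(tok, ["c1ccccc1", "CC(=O)Oc1ccccc1C(=O)O"]))
```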

## 🔧 Contributing

This project is an ongoing **experiment**, and all contributions are welcome!

- 🧠 Have a better way to implement the methods?
- 📊 Want to add evaluation metrics?
- ✨ Found a bug? Please open an issue!

👉 Please:
- Keep changes minimal and focused.
- Add comments if you change core logic.

## ⚠️ Disclaimer

> **This is NOT a production-ready tokenizer.**  
>  
> - Built during late-night prototyping sessions 🌙  
> - Not yet validated on downstream tasks
> - Some methods in fragment building are heuristic and unproven; the technical report and code for them will be released soon!
> - I'm still learning ML/AI~ 
> 

## ✍️ On-going
- [x] Redo evaluation with proper metrics and CI
- [>] Validation on VAE and Causal LM Transformer
- [x] Finish vocab construction on SELFIES
- [>] Write technical report on methods, results

## 📄 License

Apache 2.0


## 🙏 Credits

- Inspired by [ChemFIE project](https://huggingface.co/gbyuvd/bionat-selfies-gen-tokenizer-wordlevel), [ChemBERTa](https://huggingface.co/seyonec/ChemBERTa-zinc-base-v1/), [gen-mlm-cismi-bert](https://huggingface.co/smostafanejad/gen-mlm-cismi-bert-wordpiece), and [Tseng _et al_. 2024](https://openreview.net/forum?id=eR9C6c76j5)
- Built for efficiency
- Code & fragments vocab by gbyuvd

## References
### BibTeX
#### COCONUTDB
```bibtex
@article{sorokina2021coconut,
  title={COCONUT online: Collection of Open Natural Products database},
  author={Sorokina, Maria and Merseburger, Peter and Rajan, Kohulan and Yirik, Mehmet Aziz and Steinbeck, Christoph},
  journal={Journal of Cheminformatics},
  volume={13},
  number={1},
  pages={2},
  year={2021},
  doi={10.1186/s13321-020-00478-9}
}
```

#### ChEMBL34
```bibtex
@article{zdrazil2023chembl,
  title={The ChEMBL Database in 2023: a drug discovery platform spanning multiple bioactivity data types and time periods},
  author={Zdrazil, Barbara and Felix, Eloy and Hunter, Fiona and Manners, Emma J and Blackshaw, James and Corbett, Sybilla and de Veij, Marleen and Ioannidis, Harris and Lopez, David Mendez and Mosquera, Juan F and Magarinos, Maria Paula and Bosc, Nicolas and Arcila, Ricardo and Kizil{\"o}ren, Tevfik and Gaulton, Anna and Bento, A Patr{\'i}cia and Adasme, Melissa F and Monecke, Peter and Landrum, Gregory A and Leach, Andrew R},
  journal={Nucleic Acids Research},
  year={2023},
  volume={gkad1004},
  doi={10.1093/nar/gkad1004}
}

@misc{chembl34,
  title={ChemBL34},
  year={2023},
  doi={10.6019/CHEMBL.database.34}
}
```

#### SuperNatural3
```bibtex
@article{Gallo2023,
  author = {Gallo, K and Kemmler, E and Goede, A and Becker, F and Dunkel, M and Preissner, R and Banerjee, P},
  title = {{SuperNatural 3.0-a database of natural products and natural product-based derivatives}},
  journal = {Nucleic Acids Research},
  year = {2023},
  month = jan,
  day = {6},
  volume = {51},
  number = {D1},
  pages = {D654-D659},
  doi = {10.1093/nar/gkac1008}
}
```

---