gbyuvd committed (verified) Β· f5b18a8 Β· 1 Parent(s): a3e71f1

Update README.md
Files changed (1): README.md (+178 -3)
README.md CHANGED
---
license: apache-2.0
language:
- smi
pipeline_tag: feature-extraction
tags:
- chemistry
- tokenizer
---

# πŸ§ͺ FastChemTokenizer β€” A High-Performance SMILES Tokenizer Built via Info-Theoretic Motif Mining

> **Optimized for chemical language modeling: ~2x faster, ~50% shorter sequences, minimal memory. Built with entropy-guided n-gram selection.**

## πŸš€ Overview

`FastChemTokenizer` is a **trie-based, longest-match-first tokenizer** designed for efficient tokenization of **SMILES strings** in molecular language modeling. Built from scratch for speed and compactness, it outperforms popular tokenizers such as [ChemBERTa](https://huggingface.co/seyonec/ChemBERTa-zinc-base-v1/)'s while maintaining a 0% UNK rate on a ~2.7M-SMILES dataset and compatibility with Hugging Face `transformers`. To build the vocabulary, the project first tokenizes the corpus with [seyonec/ChemBERTa](https://huggingface.co/seyonec/ChemBERTa-zinc-base-v1/)'s tokenizer and mines n-grams over the resulting token IDs, applies information-theoretic filtering (entropy reduction, PMI, internal entropy) to extract statistically meaningful chemical motifs (a sketch of this filtering step follows below), and then balances 391 backbone (functional) and 391 tail fragments for structural coverage.

Trained on ~2.7M valid SMILES curated from ChemBL34 (Zdrazil _et al._ 2023), COCONUTDB (Sorokina _et al._ 2021), and SuperNatural3 (Gallo _et al._ 2023), the mining stage yields ~76K candidate n-grams, which are pruned to a final vocabulary of **1,238 tokens** covering backbone/tail motifs and special tokens.
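
To make the filtering step concrete, here is a minimal sketch of the kind of scoring it describes. This is illustrative only: `internal_entropy`, `pmi`, and the toy corpus are hypothetical stand-ins, the thresholds mirror the ones listed under Vocabulary below, and none of this is the project's actual mining code.

```python
import math
from collections import Counter

def internal_entropy(ngram):
    """Shannon entropy (bits) of the token distribution inside one n-gram.
    Low values flag degenerate motifs such as ('c', 'c')."""
    counts = Counter(ngram)
    n = len(ngram)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def pmi(ngram, uni, total_uni, ngrams, total_ngrams):
    """Pointwise mutual information of the n-gram vs. independence of its tokens."""
    p_joint = ngrams[ngram] / total_ngrams
    p_indep = 1.0
    for tok in ngram:
        p_indep *= uni[tok] / total_uni
    return math.log2(p_joint / p_indep)

# Toy corpus of base-tokenizer outputs (stand-ins for ChemBERTa token IDs).
corpus = [["c1", "c", "c", "c", "c", "c1"], ["C", "C", "O"], ["c1", "c", "c", "c", "c", "c1"]]
uni = Counter(t for seq in corpus for t in seq)
bigrams = Counter(p for seq in corpus for p in zip(seq, seq[1:]))
total_uni, total_bi = sum(uni.values()), sum(bigrams.values())

# Keep motifs that are internally diverse and co-occur more often than chance;
# the real pipeline adds an entropy-reduction test (< 0.95) checking that
# merging the motif into one symbol actually compresses the corpus.
kept = [g for g in bigrams
        if internal_entropy(g) > 0.5
        and pmi(g, uni, total_uni, bigrams, total_bi) > 0]
print(kept)  # homogeneous runs like ('c', 'c') are filtered out
```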

## ⚑ Performance Highlights

| Metric | FastChemTokenizer | [ChemBERTa](https://huggingface.co/seyonec/ChemBERTa-zinc-base-v1/) Tokenizer | [gen-mlm-cismi-bert](https://huggingface.co/smostafanejad/gen-mlm-cismi-bert-wordpiece) |
|--------------------------------|-------------------|----------------------|---------------------|
| **Avg time per SMILES**        | **0.0803 ms**     | 0.1581 ms            | 0.0938 ms           |
| **Avg sequence length**        | **21.49 tokens**  | 41.99 tokens         | 50.57 tokens        |
| **Throughput**                 | **12,448/sec**    | 6,326/sec            | 10,658/sec          |
| **Peak memory usage**          | **17.08 MB**      | 259.45 MB            | 387.43 MB           |
| **UNK token rate**             | **0.0000%**       | 0.0000%              | ~0.0000% (non-zero) |
| **1000 encodes (benchmark)**   | **0.0029 s**      | 1.6598 s             | 0.5491 s            |

βœ… **1.97x faster** than ChemBERTa
βœ… **1.17x faster** than gen-mlm-cismi-bert (per-SMILES timing, from the table above)
βœ… **No indexing errors** (avoids >512-token sequences)
βœ… **Zero unknown tokens** on the validation set
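
A minimal sketch of how such numbers can be measured, assuming any tokenizer object with an `encode` method; this is a hypothetical harness, not the project's actual benchmark script:

```python
# Hypothetical benchmark harness: reports avg encode time, throughput,
# avg sequence length, and peak memory for a given tokenizer.
import time
import tracemalloc

def benchmark(tokenizer, smiles_list, n_repeat=10):
    tracemalloc.start()
    t0 = time.perf_counter()
    total_tokens = 0
    for _ in range(n_repeat):
        for smi in smiles_list:
            total_tokens += len(tokenizer.encode(smi))
    elapsed = time.perf_counter() - t0
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    n = n_repeat * len(smiles_list)
    return {
        "avg_ms_per_smiles": 1000 * elapsed / n,
        "throughput_per_sec": n / elapsed,
        "avg_seq_len": total_tokens / n,
        "peak_mem_mb": peak / 1e6,
    }
```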

## 🧩 Vocabulary

- **Final vocab size**: 1,238 tokens
- **Includes**: 391 backbone motifs + 391 tail motifs + special tokens (`<s>`, `</s>`, `<pad>`, `<unk>`, `<mask>`)
- **Pruned**: 270 unused tokens (e.g., `'²'`, `'C@@H](O)['`, `'È'`)
- **Training corpus**: ~119M unigrams from ~3M SMILES sequences
- **Entropy-based filtering**: internal entropy > 0.5, entropy reduction < 0.95

## πŸ› οΈ Implementation

- **Algorithm**: Trie-based longest-prefix-match (no regex, no BPE; see the sketch after this list)
- **Caching**: `@lru_cache` for repeated string encoding
- **HF Compatible**: Implements `__call__`, `encode_plus`, `batch_encode_plus`, `save_pretrained`, `from_pretrained`
- **Memory Efficient**: No token set β€” pure trie traversal
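For intuition, here is a minimal sketch of longest-prefix-match encoding over a character trie. It is illustrative only: `build_trie`, `encode_longest_match`, and the four-entry vocab are hypothetical (chosen so the output matches the example that follows), not the library's internals.

```python
# Minimal illustrative sketch of trie-based longest-match-first encoding.
def build_trie(vocab):
    """Map each token string into a nested-dict trie; '_id' marks a token end."""
    trie = {}
    for token, tok_id in vocab.items():
        node = trie
        for ch in token:
            node = node.setdefault(ch, {})
        node["_id"] = tok_id
    return trie

def encode_longest_match(text, trie):
    """Greedily take the longest vocab entry starting at each position."""
    ids, i = [], 0
    while i < len(text):
        node, match_id, match_len = trie, None, 0
        for j in range(i, len(text)):
            ch = text[j]
            if ch not in node:
                break
            node = node[ch]
            if "_id" in node:  # remember the deepest completed token so far
                match_id, match_len = node["_id"], j - i + 1
        if match_id is None:
            raise ValueError(f"no token matches at position {i}")
        ids.append(match_id)
        i += match_len
    return ids

vocab = {"c1ccc": 489, "cc1": 640, "c": 7, "1": 8}  # hypothetical IDs
print(encode_longest_match("c1ccccc1", build_trie(vocab)))  # [489, 640]
```

The packaged tokenizer exposes this through the familiar HF-style API: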
```python
from FastChemTokenizer import FastChemTokenizer

tokenizer = FastChemTokenizer.from_pretrained("./chemtok")
benzene = "c1ccccc1"
encoded = tokenizer.encode(benzene)
print("βœ… Encoded:", encoded)
decoded = tokenizer.decode(encoded)
print("βœ… Decoded:", decoded)
tokenizer.decode_with_trace(encoded)

# βœ… Encoded: [489, 640]
# βœ… Decoded: c1ccccc1

# πŸ” Decoding 2 tokens:
# [000] ID= 489 β†’ 'c1ccc'
# [001] ID= 640 β†’ 'cc1'
```

## πŸ“¦ Installation & Usage

1. Clone this repository to a directory.
2. Load the tokenizer:
   ```python
   from FastChemTokenizer import FastChemTokenizer

   tokenizer = FastChemTokenizer.from_pretrained("./chemtok")
   ```
3. Use it like any Hugging Face tokenizer (a fuller batch example follows below):
   ```python
   outputs = tokenizer.batch_encode_plus(smiles_list, padding=True, truncation=True, max_length=512)
   ```
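As a fuller batch example, a hypothetical walkthrough; the `input_ids`/`attention_mask` keys assume the standard Hugging Face return format that the README says is implemented:

```python
# Hypothetical batch-encoding walkthrough; smiles_list values are arbitrary.
from FastChemTokenizer import FastChemTokenizer

tokenizer = FastChemTokenizer.from_pretrained("./chemtok")
smiles_list = ["c1ccccc1", "CCO", "CC(=O)Oc1ccccc1C(=O)O"]

outputs = tokenizer.batch_encode_plus(
    smiles_list, padding=True, truncation=True, max_length=512
)
for smi, ids, mask in zip(smiles_list, outputs["input_ids"], outputs["attention_mask"]):
    print(f"{smi!r}: {sum(mask)} real tokens, padded to {len(ids)}")
```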

## πŸ”§ Contributing

This project is an ongoing **experiment** β€” all contributions are welcome!

- 🧠 Have a better way to implement the methods?
- πŸ“Š Want to add evaluation metrics?
- ✨ Found a bug? Please open an issue!

πŸ‘‰ Please:
- Keep changes minimal and focused.
- Add comments if you change core logic.

## ⚠️ Disclaimer

> **This is NOT a production-ready tokenizer.**
>
> - Built during late-night prototyping sessions πŸŒ™
> - Not yet validated on downstream tasks
> - Some of the fragment-building methods are heuristic and unproven; the technical report and code for them will be released soon!
> - I’m still learning ML/AI~

## ✍️ Ongoing

- [>] Validation on a VAE and a causal LM Transformer
- [>] Finish vocab construction for SELFIES
- [ ] Write a technical report on methods and results

## πŸ“„ License

Apache 2.0

## πŸ™ Credits

- Inspired by the [ChemFIE project](https://huggingface.co/gbyuvd/bionat-selfies-gen-tokenizer-wordlevel), [ChemBERTa](https://huggingface.co/seyonec/ChemBERTa-zinc-base-v1/), [gen-mlm-cismi-bert](https://huggingface.co/smostafanejad/gen-mlm-cismi-bert-wordpiece), and [Tseng _et al._ 2024](https://openreview.net/forum?id=eR9C6c76j5)
- Built for efficiency
- Code & fragment vocab by gbyuvd

## References

### BibTeX

#### COCONUTDB
```bibtex
@article{sorokina2021coconut,
  title={COCONUT online: Collection of Open Natural Products database},
  author={Sorokina, Maria and Merseburger, Peter and Rajan, Kohulan and Yirik, Mehmet Aziz and Steinbeck, Christoph},
  journal={Journal of Cheminformatics},
  volume={13},
  number={1},
  pages={2},
  year={2021},
  doi={10.1186/s13321-020-00478-9}
}
```

#### ChemBL34
```bibtex
@article{zdrazil2023chembl,
  title={The ChEMBL Database in 2023: a drug discovery platform spanning multiple bioactivity data types and time periods},
  author={Zdrazil, Barbara and Felix, Eloy and Hunter, Fiona and Manners, Emma J and Blackshaw, James and Corbett, Sybilla and de Veij, Marleen and Ioannidis, Harris and Lopez, David Mendez and Mosquera, Juan F and Magarinos, Maria Paula and Bosc, Nicolas and Arcila, Ricardo and Kizil{\"o}ren, Tevfik and Gaulton, Anna and Bento, A Patr{\'i}cia and Adasme, Melissa F and Monecke, Peter and Landrum, Gregory A and Leach, Andrew R},
  journal={Nucleic Acids Research},
  year={2023},
  pages={gkad1004},
  doi={10.1093/nar/gkad1004}
}

@misc{chembl34,
  title={ChemBL34},
  year={2023},
  doi={10.6019/CHEMBL.database.34}
}
```

#### SuperNatural3
```bibtex
@article{Gallo2023,
  author={Gallo, K and Kemmler, E and Goede, A and Becker, F and Dunkel, M and Preissner, R and Banerjee, P},
  title={{SuperNatural 3.0 -- a database of natural products and natural product-based derivatives}},
  journal={Nucleic Acids Research},
  year={2023},
  month=jan,
  day={6},
  volume={51},
  number={D1},
  pages={D654-D659},
  doi={10.1093/nar/gkac1008}
}
```