---
license: apache-2.0
language:
- smi
pipeline_tag: feature-extraction
tags:
- chemistry
- tokenizer
---

# 🧪 FastChemTokenizer: A High-Performance SMILES Tokenizer built via Info-Theoretic Motif Mining

> **Optimized for chemical language modeling: ~2x faster, ~50% shorter sequences, minimal memory. Built with entropy-guided n-gram selection.**

## 📌 Overview

`FastChemTokenizer` is a **trie-based, longest-match-first tokenizer** designed for efficient tokenization of **SMILES strings** in molecular language modeling. Built from scratch for speed and compactness, it outperforms popular tokenizers such as [ChemBERTa](https://huggingface.co/seyonec/ChemBERTa-zinc-base-v1/)'s while maintaining a 0% UNK rate on a ~2.7M-SMILES dataset and compatibility with Hugging Face `transformers`. To build the vocabulary, this project first uses [seyonec/ChemBERTa](https://huggingface.co/seyonec/ChemBERTa-zinc-base-v1/)'s tokenizer as a seed, mining candidate n-grams over its token IDs; it then applies information-theoretic filtering (entropy reduction, PMI, internal entropy) to extract meaningful statistical chemical motifs, and finally balances 391 backbone (functional) and 391 tail fragments for structural coverage. The longest-match idea is sketched below.

Trained on ~2.7M valid SMILES built and curated from the ChemBL34 (Zdrazil _et al._ 2023), COCONUTDB (Sorokina _et al._ 2021), and SuperNatural3 (Gallo _et al._ 2023) datasets; the resulting ~76K n-grams were pruned to **1,238 tokens**, including backbone/tail motifs and special tokens.

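To make "longest-match-first" concrete, here is a minimal sketch of trie-based tokenization. The `TrieNode`, `build_trie`, and `encode` names and the five-token toy vocabulary are illustrative assumptions for this example, not the shipped `FastChemTokenizer` internals:

```python
class TrieNode:
    __slots__ = ("children", "token_id")

    def __init__(self):
        self.children = {}    # char -> TrieNode
        self.token_id = None  # set when a vocab token ends at this node


def build_trie(vocab):
    """Insert every vocabulary token into a character trie."""
    root = TrieNode()
    for token, token_id in vocab.items():
        node = root
        for ch in token:
            node = node.children.setdefault(ch, TrieNode())
        node.token_id = token_id
    return root


def encode(text, root, unk_id=0):
    """Greedy longest-match-first tokenization by trie traversal."""
    ids, i = [], 0
    while i < len(text):
        node, match_id, match_len = root, None, 0
        for j in range(i, len(text)):       # walk as far as the trie allows
            node = node.children.get(text[j])
            if node is None:
                break
            if node.token_id is not None:   # remember the longest token seen
                match_id, match_len = node.token_id, j - i + 1
        if match_id is None:
            ids.append(unk_id)              # no vocab token starts here
            i += 1
        else:
            ids.append(match_id)
            i += match_len                  # consume the whole motif at once
    return ids


# Toy vocab: multi-character motifs win over single characters.
vocab = {"c": 5, "c1ccc": 489, "cc1": 640, "C": 6, "O": 7}
print(encode("c1ccccc1", build_trie(vocab)))  # [489, 640], as in the benzene example later on
```
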
## ⚡ Performance Highlights

| Metric | FastChemTokenizer | [ChemBERTa](https://huggingface.co/seyonec/ChemBERTa-zinc-base-v1/) Tokenizer | [gen-mlm-cismi-bert](https://huggingface.co/smostafanejad/gen-mlm-cismi-bert-wordpiece) |
|------------------------------|------------------|--------------|---------------------|
| **Avg time per SMILES**      | **0.0803 ms**    | 0.1581 ms    | 0.0938 ms           |
| **Avg sequence length**      | **21.49 tokens** | 41.99 tokens | 50.57 tokens        |
| **Throughput**               | **12,448/sec**   | 6,326/sec    | 10,658/sec          |
| **Peak memory usage**        | **17.08 MB**     | 259.45 MB    | 387.43 MB           |
| **UNK token rate**           | **0.0000%**      | 0.0000%      | ~0.0000% (non-zero) |
| **1000 encodes (benchmark)** | **0.0029 s**     | 1.6598 s     | 0.5491 s            |

✅ **1.97x faster** than ChemBERTa
✅ **1.50x faster** than gen-mlm-cismi-bert
✅ **No indexing errors** (avoids >512-token sequences)
✅ **Zero unknown tokens** on the validation set

## 🧩 Vocabulary

- **Final vocab size**: 1,238 tokens
- **Includes**: 391 backbone motifs + 391 tail motifs + special tokens (`<s>`, `</s>`, `<pad>`, `<unk>`, `<mask>`)
- **Pruned**: 270 unused tokens (e.g., `'²'`, `'C@@H](O)['`, `'Ã'`)
- **Training corpus**: ~119M unigrams from ~3M SMILES sequences
- **Entropy-based filtering**: internal entropy > 0.5, entropy reduction < 0.95 (see the sketch below)

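The filtering criteria can be pictured as follows. This is a hedged sketch with assumed definitions (the forthcoming technical report will give the exact ones): "internal entropy" is read here as the Shannon entropy of the character distribution inside a candidate motif, and PMI scores how much more often a motif's two halves co-occur than chance would predict:

```python
import math
from collections import Counter


def internal_entropy(ngram):
    """Shannon entropy (bits) of the character distribution inside a motif."""
    counts = Counter(ngram)
    n = len(ngram)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def pmi(p_xy, p_x, p_y):
    """Pointwise mutual information (bits) between a motif's two halves."""
    return math.log2(p_xy / (p_x * p_y))


# Keep motifs whose internal entropy exceeds 0.5 bits; degenerate repeats
# such as 'CCCC' score 0.0 bits and get dropped.
candidates = ["c1ccc", "CCCC", "C(=O)O", "cc1"]
print([g for g in candidates if internal_entropy(g) > 0.5])
# ['c1ccc', 'C(=O)O', 'cc1']

# PMI with made-up corpus frequencies: 'c1' followed by 'ccc' five times
# more often than chance suggests the merged motif is worth keeping.
print(round(pmi(p_xy=0.010, p_x=0.04, p_y=0.05), 2))  # 2.32 bits
```
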
## 🛠️ Implementation

- **Algorithm**: Trie-based longest-prefix match (no regex, no BPE)
- **Caching**: `@lru_cache` for repeated string encoding (see the sketch after the usage example below)
- **HF-compatible**: Implements `__call__`, `encode_plus`, `batch_encode_plus`, `save_pretrained`, `from_pretrained`
- **Memory-efficient**: No separate token set, just pure trie traversal

```python
from FastChemTokenizer import FastChemTokenizer

tokenizer = FastChemTokenizer.from_pretrained("./chemtok")
benzene = "c1ccccc1"
encoded = tokenizer.encode(benzene)
print("✅ Encoded:", encoded)
decoded = tokenizer.decode(encoded)
print("✅ Decoded:", decoded)
tokenizer.decode_with_trace(encoded)

# ✅ Encoded: [489, 640]
# ✅ Decoded: c1ccccc1

# 🔍 Decoding 2 tokens:
# [000] ID= 489 → 'c1ccc'
# [001] ID= 640 → 'cc1'
```
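
The caching mentioned above can be pictured like this: a minimal sketch that assumes the cache simply wraps the encode path (the helper name and `maxsize` value are assumptions, not the shipped code):

```python
from functools import lru_cache


def make_cached_encode(tokenizer, maxsize=100_000):
    """Wrap tokenizer.encode in an LRU cache keyed on the raw SMILES string."""
    @lru_cache(maxsize=maxsize)
    def _encode(smiles):
        # Tuples are hashable and immutable, so cached results stay safe to share.
        return tuple(tokenizer.encode(smiles))
    return lambda smiles: list(_encode(smiles))


cached_encode = make_cached_encode(tokenizer)
cached_encode("c1ccccc1")  # first call walks the trie
cached_encode("c1ccccc1")  # repeat call is just a dict lookup
```
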
## 📦 Installation & Usage

1. Clone this repository to a local directory.
2. Load the tokenizer:
```python
from FastChemTokenizer import FastChemTokenizer

tokenizer = FastChemTokenizer.from_pretrained("./chemtok")
```
3. Use it like any Hugging Face tokenizer:
```python
outputs = tokenizer.batch_encode_plus(smiles_list, padding=True, truncation=True, max_length=512)
```
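
If you need tensors for model training, the batch output converts as usual. A hypothetical follow-up, assuming the returned dict exposes the standard Hugging Face `input_ids`/`attention_mask` keys:

```python
import torch

smiles_list = ["c1ccccc1", "CC(=O)Oc1ccccc1C(=O)O"]  # benzene, aspirin
outputs = tokenizer.batch_encode_plus(
    smiles_list, padding=True, truncation=True, max_length=512
)
input_ids = torch.tensor(outputs["input_ids"])            # (batch, seq_len)
attention_mask = torch.tensor(outputs["attention_mask"])  # 1 = token, 0 = pad
```
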
## 🔧 Contributing

This project is an ongoing **experiment**; all contributions are welcome!

- 🧠 Have a better way to implement the methods?
- 📊 Want to add evaluation metrics?
- ✨ Found a bug? Please open an issue!

🙏 Please:
- Keep changes minimal and focused.
- Add comments if you change core logic.

## ⚠️ Disclaimer

> **This is NOT a production-ready tokenizer.**
>
> - Built during late-night prototyping sessions 🌙
> - Not yet validated on downstream tasks
> - Some of the fragment-building methods are heuristic and unproven; the technical report and code for them will be released soon!
> - I'm still learning ML/AI~

## ⚙️ Ongoing

- [>] Validation on VAE and Causal LM Transformer
- [>] Finish vocab construction on SELFIES
- [ ] Write technical report on methods and results

## 📜 License

Apache 2.0

## 🙌 Credits

- Inspired by the [ChemFIE project](https://huggingface.co/gbyuvd/bionat-selfies-gen-tokenizer-wordlevel), [ChemBERTa](https://huggingface.co/seyonec/ChemBERTa-zinc-base-v1/), [gen-mlm-cismi-bert](https://huggingface.co/smostafanejad/gen-mlm-cismi-bert-wordpiece), and [Tseng _et al._ 2024](https://openreview.net/forum?id=eR9C6c76j5)
- Built for efficiency
- Code & fragment vocab by gbyuvd

## References

### BibTeX

#### COCONUTDB
```bibtex
@article{sorokina2021coconut,
  title={COCONUT online: Collection of Open Natural Products database},
  author={Sorokina, Maria and Merseburger, Peter and Rajan, Kohulan and Yirik, Mehmet Aziz and Steinbeck, Christoph},
  journal={Journal of Cheminformatics},
  volume={13},
  number={1},
  pages={2},
  year={2021},
  doi={10.1186/s13321-020-00478-9}
}
```

#### ChemBL34
```bibtex
@article{zdrazil2023chembl,
  title={The ChEMBL Database in 2023: a drug discovery platform spanning multiple bioactivity data types and time periods},
  author={Zdrazil, Barbara and Felix, Eloy and Hunter, Fiona and Manners, Emma J and Blackshaw, James and Corbett, Sybilla and de Veij, Marleen and Ioannidis, Harris and Lopez, David Mendez and Mosquera, Juan F and Magarinos, Maria Paula and Bosc, Nicolas and Arcila, Ricardo and Kizil{\"o}ren, Tevfik and Gaulton, Anna and Bento, A Patr{\'i}cia and Adasme, Melissa F and Monecke, Peter and Landrum, Gregory A and Leach, Andrew R},
  journal={Nucleic Acids Research},
  year={2023},
  volume={gkad1004},
  doi={10.1093/nar/gkad1004}
}

@misc{chembl34,
  title={ChemBL34},
  year={2023},
  doi={10.6019/CHEMBL.database.34}
}
```

#### SuperNatural3
```bibtex
@article{Gallo2023,
  author={Gallo, K and Kemmler, E and Goede, A and Becker, F and Dunkel, M and Preissner, R and Banerjee, P},
  title={{SuperNatural 3.0-a database of natural products and natural product-based derivatives}},
  journal={Nucleic Acids Research},
  year={2023},
  month=jan,
  day={6},
  volume={51},
  number={D1},
  pages={D654-D659},
  doi={10.1093/nar/gkac1008}
}
```