## Overview

`FastChemTokenizer` is a **trie-based, longest-match-first tokenizer** designed for efficient tokenization of **SMILES and SELFIES strings** in molecular language modeling. Built from scratch for speed and compactness, it outperforms popular tokenizers such as [ChemBERTa](https://huggingface.co/seyonec/ChemBERTa-zinc-base-v1/)'s while maintaining a 0% UNK rate on a ~2.7M dataset and compatibility with Hugging Face `transformers`. For n-gram building, this project uses [seyonec/ChemBERTa](https://huggingface.co/seyonec/ChemBERTa-zinc-base-v1/) as an initial tokenizer to determine n-grams over its token IDs, then applies information-theoretic filtering (entropy reduction, PMI, internal entropy) to extract statistically meaningful chemical motifs, and finally balances 391 backbone (functional) and 391 tail fragments for structural coverage.
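Longest-match-first trie tokenization can be sketched in a few lines. This is an illustrative toy, not `FastChemTokenizer`'s actual implementation; the tiny vocabulary below is hypothetical:

```python
# Toy longest-match-first trie tokenizer (illustration only; not the real
# FastChemTokenizer vocabulary or API).

def build_trie(vocab):
    trie = {}
    for token in vocab:
        node = trie
        for ch in token:
            node = node.setdefault(ch, {})
        node["_end"] = token  # mark that a complete token ends at this node
    return trie

def tokenize(text, trie):
    tokens, i = [], 0
    while i < len(text):
        node, match, j = trie, None, i
        # Walk as deep as the trie allows, remembering the longest token seen.
        while j < len(text) and text[j] in node:
            node = node[text[j]]
            j += 1
            if "_end" in node:
                match = node["_end"]
        if match is None:
            tokens.append(text[i])  # fall back to a single character
            i += 1
        else:
            tokens.append(match)
            i += len(match)
    return tokens

trie = build_trie(["C", "c", "c1", "cc1", "(", ")", "=", "O", "c1ccccc1"])
print(tokenize("c1ccccc1C(=O)O", trie))
# -> ['c1ccccc1', 'C', '(', '=', 'O', ')', 'O']
```

Because the walk keeps going past shorter matches, the whole-ring motif `c1ccccc1` wins over the single-character tokens it contains.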

Trained on ~2.7M valid SMILES and SELFIES built and curated from the ChemBL34 (Zdrazil _et al._ 2023), COCONUTDB (Sorokina _et al._ 2021), and Supernatural3 (Gallo _et al._ 2023) datasets; the resulting 76K n-grams were pruned to **1,238 tokens**, including backbone/tail motifs and special tokens.
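The PMI part of that filtering can be illustrated with a toy bigram scorer. The token-ID sequences below are made up, and the project's real thresholds and entropy terms are not shown:

```python
import math
from collections import Counter

# Toy PMI scorer over token-ID sequences (illustration only; not the
# project's actual corpus, thresholds, or scoring pipeline).

def pmi_scores(sequences):
    unigrams, bigrams = Counter(), Counter()
    for seq in sequences:
        unigrams.update(seq)
        bigrams.update(zip(seq, seq[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (a, b), count in bigrams.items():
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        # High PMI: the pair co-occurs far more than chance predicts.
        scores[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return scores

# IDs 7 and 8 co-occur consistently, so (7, 8) is the strongest motif candidate.
seqs = [[7, 8, 3], [7, 8, 5], [2, 7, 8], [3, 5, 2]]
scores = pmi_scores(seqs)
print(max(scores, key=scores.get))  # -> (7, 8)
```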

For code and a tutorial, see the [GitHub project](https://github.com/gbyuvd/FastChemTokenizer).

## ⚡ Performance Highlights

#### SMILES

| Metric | FastChemTokenizer | [ChemBERTa](https://huggingface.co/seyonec/ChemBERTa-zinc-base-v1/) Tokenizer | [gen-mlm-cismi-bert](https://huggingface.co/smostafanejad/gen-mlm-cismi-bert-wordpiece) |
|--------------------------------|-------------------|----------------------|---------------------|
| **Avg time per SMILES** | **0.0803 ms** | 0.1581 ms | 0.0938 ms |

✅ **1.97x faster** than ChemBERTa
✅ **1.50x faster** than gen-mlm-cismi-bert
✅ **~19x memory saving** compared to both of the above tokenizers
✅ **No indexing errors** (avoids >512-token sequences)
✅ **Zero unknown tokens** on the validation set

#### SELFIES

```
Core's vocab length = 781 (after pruning)
with tails = 1161 (after pruning)
```

| Metric | FastChemTokenizer-WTails | FastChemTokenizer-Core | [opti-chemfie-experiment-1](https://huggingface.co/gbyuvd/bionat-selfies-gen-tokenizer-wordlevel) |
|--------------------------------|-------------------|----------------------|---------------------|
| **Avg time per SELFIES** | 0.1548 ms | 0.1700 ms | **0.1170 ms** |
| **Avg sequence length** | **20.34 tokens** | 33.22 tokens | 53.98 tokens |
| **Throughput** | 6,461/sec | 5,882/sec | **8,549/sec** |
| **Peak memory usage** | **7.96 MB** | 19.77 MB | 488.03 MB |
| **UNK token rate** | **0.0000%** | 0.0000% | 0.0000% |
| **1000 encodes (benchmark)** | **0.0081s** | 2.9020s | 2.9020s |

✅ Although ~1.32x slower than opti-chemfie-experiment-1, it produces **2.65x fewer tokens**
✅ **~61x memory saving** with tails and **~25x** with core

## 🧩 Vocabulary

- **HF Compatible**: Implements `__call__`, `encode_plus`, `batch_encode_plus`, `save_pretrained`, `from_pretrained`
- **Memory Efficient**: Trie traversal and cache

**For SMILES**

```python
from FastChemTokenizer import FastChemTokenizer

# …

# [001] ID= 640 → 'cc1'
```

**For SELFIES**

```python
from FastChemTokenizer import FastChemTokenizerSelfies

tokenizer = FastChemTokenizerSelfies.from_pretrained("./selftok_wtails")  # use *_core for the variant without tails
benzene = "[C] [=C] [C] [=C] [C] [=C] [Ring1] [=Branch1]"  # input must be whitespace-separated
encoded = tokenizer.encode(benzene)
print("✅ Encoded:", encoded)
decoded = tokenizer.decode(encoded)
print("✅ Decoded:", decoded)
tokenizer.decode_with_trace(encoded)

# ✅ Encoded: [70]
# ✅ Decoded: [C] [=C] [C] [=C] [C] [=C] [Ring1] [=Branch1]

# Decoding 1 tokens:
# [000] ID= 70 → '[C] [=C] [C] [=C] [C] [=C] [Ring1] [=Branch1]'
```
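If a SELFIES string comes without spaces, it can be normalized to the whitespace-separated form expected above. This stdlib-only sketch assumes every SELFIES symbol is a `[...]` group; it is not part of `FastChemTokenizer`:

```python
import re

def whitespace_selfies(s: str) -> str:
    # Extract each bracketed SELFIES symbol and rejoin with single spaces.
    symbols = re.findall(r"\[[^\]]*\]", s)
    return " ".join(symbols)

print(whitespace_selfies("[C][=C][C][=C][C][=C][Ring1][=Branch1]"))
# -> [C] [=C] [C] [=C] [C] [=C] [Ring1] [=Branch1]
```

The same call is a no-op on input that is already whitespace-separated, so it is safe to apply unconditionally.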

## 📦 Installation & Usage

0. Make sure you have all the required packages installed (it can possibly run with different versions)
1. Clone this repository to a directory
2. Load with:
```python