- tokenizer
---

# 🧪 FastChemTokenizer – A High-Performance SMILES Tokenizer built via Info-Theoretic Motif Mining

> **Optimized for chemical language modeling. 2x faster, 50% shorter sequences, minimal memory. Built with entropy-guided n-gram selection.**
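The tagline's "entropy-guided n-gram selection" refers to the motif-mining step. The actual mining procedure is not shown in this excerpt, so the following is only a generic, illustrative sketch of one common info-theoretic scoring scheme (pointwise mutual information over character n-grams); every name here is made up and none of it is the repository's code:

```python
import math
from collections import Counter

def pmi_ngrams(corpus, n=2, top_k=3):
    # Illustration only, NOT the repo's mining code: score character n-grams
    # by pointwise mutual information against their unigram composition.
    # Cohesive motifs (characters that co-occur far more than chance) score high.
    chars = Counter(ch for s in corpus for ch in s)
    grams = Counter(s[i:i + n] for s in corpus for i in range(len(s) - n + 1))
    total_c = sum(chars.values())
    total_g = sum(grams.values())

    def pmi(g):
        p_gram = grams[g] / total_g
        p_indep = math.prod(chars[ch] / total_c for ch in g)
        return math.log(p_gram / p_indep)

    return sorted(grams, key=pmi, reverse=True)[:top_k]

corpus = ["c1ccccc1", "C1CCCCC1", "c1ccncc1"]
print(pmi_ngrams(corpus, n=2))
```

A real miner would add frequency thresholds and pruning (the README mentions pruning the core vocab); raw PMI alone over-rewards rare n-grams.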
Core's vocab length = 781 (after pruning)

- **HF Compatible**: Implements `__call__`, `encode_plus`, `batch_encode_plus`, `save_pretrained`, `from_pretrained`
- **Memory Efficient**: Trie traversal and caching

**For SMILES (core backbone vocab, without tails):**

For the vocab with tails, use `./smitok`. To use the HF-compatible tokenizer (still in development), use `FastChemTokenizerHF`.

```python
from FastChemTokenizer import FastChemTokenizer

tokenizer = FastChemTokenizer.from_pretrained("../smitok_core")
benzene = "c1ccccc1"
encoded = tokenizer.encode(benzene)
print("✅ Encoded:", encoded)
decoded = tokenizer.decode(encoded)
print("✅ Decoded:", decoded)
tokenizer.decode_with_trace(encoded)

# ✅ Encoded: [271, 474, 840]
# ✅ Decoded: c1ccccc1
#
# 🔍 Decoding 3 tokens:
# [000] ID= 271 → 'c1ccc'
# [001] ID= 474 → 'cc'
# [002] ID= 840 → '1'
```
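The trace above reflects greedy longest-match encoding over the mined motif vocabulary. As a toy, self-contained sketch of the trie-based longest-match idea (a made-up five-token vocab and hypothetical helper names, not the shipped implementation):

```python
# Toy illustration of trie-based longest-match tokenization (NOT the shipped code).
def build_trie(vocab):
    trie = {}
    for tok_id, tok in enumerate(vocab):
        node = trie
        for ch in tok:
            node = node.setdefault(ch, {})
        node["_id"] = tok_id  # mark end of a valid token
    return trie

def longest_match_encode(s, trie):
    ids, i = [], 0
    while i < len(s):
        node, best, j = trie, None, i
        # Walk as deep as possible, remembering the longest complete token seen.
        while j < len(s) and s[j] in node:
            node = node[s[j]]
            j += 1
            if "_id" in node:
                best = (node["_id"], j)
        if best is None:
            raise ValueError(f"no token matches at position {i}")
        ids.append(best[0])
        i = best[1]
    return ids

vocab = ["c", "1", "cc", "c1ccc", "cc1"]
trie = build_trie(vocab)
print(longest_match_encode("c1ccccc1", trie))  # [3, 4], i.e. 'c1ccc' + 'cc1'
```

Because each character is visited once per match attempt, encoding is linear-ish in string length, which is where the speed and memory claims come from.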

**For SELFIES:**
```python
from FastChemTokenizerHF import FastChemTokenizerSelfies

tokenizer = FastChemTokenizerSelfies.from_pretrained("../selftok_core")  # use the *_core variant for the vocab without tails
benzene = "[C] [=C] [C] [=C] [C] [=C] [Ring1] [=Branch1]"  # input must be whitespace-separated
encoded = tokenizer.encode(benzene)
print("✅ Encoded:", encoded)
decoded = tokenizer.decode(encoded)
print("✅ Decoded:", decoded)
tokenizer.decode_with_trace(encoded)

# ✅ Encoded: [0, 257, 640, 693, 402, 1]
# ✅ Decoded: <s> [C] [=C] [C] [=C] [C] [=C] [Ring1] [=Branch1] </s>
#
# 🔍 Decoding 6 tokens:
# [000] ID=   0 → '<s>'
# [001] ID= 257 → '[C] [=C] [C] [=C] [C]'
# [002] ID= 640 → '[=C]'
# [003] ID= 693 → '[Ring1]'
# [004] ID= 402 → '[=Branch1]'
# [005] ID=   1 → '</s>'
```
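The SELFIES example requires whitespace-separated tokens. If your SELFIES strings come unspaced (e.g. straight out of a SELFIES encoder), a small regex split restores the expected format; `whitespace_selfies` is a hypothetical helper name, and the repo may ship its own:

```python
import re

def whitespace_selfies(s):
    # Split "[C][=C]..." into "[C] [=C] ..." by matching bracketed tokens.
    return " ".join(re.findall(r"\[[^\]]*\]", s))

print(whitespace_selfies("[C][=C][C][=C][C][=C][Ring1][=Branch1]"))
# [C] [=C] [C] [=C] [C] [=C] [Ring1] [=Branch1]
```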

## 📦 Installation & Usage
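The usage section calls `tokenizer.batch_encode_plus(smiles_list, padding=True, truncation=True, ...)` in the HF style. As a toy re-implementation of what those two flags do to ragged ID lists (pure Python; the function name, pad ID, and max length here are made up for illustration):

```python
def pad_and_truncate(batch_ids, pad_id=0, max_length=8):
    # Truncate each sequence to max_length, then right-pad all sequences
    # to the longest remaining length, with a matching attention mask.
    batch = [ids[:max_length] for ids in batch_ids]
    width = max(len(ids) for ids in batch)
    return {
        "input_ids": [ids + [pad_id] * (width - len(ids)) for ids in batch],
        "attention_mask": [[1] * len(ids) + [0] * (width - len(ids)) for ids in batch],
    }

out = pad_and_truncate([[271, 474, 840], [271, 474]])
print(out["input_ids"])       # [[271, 474, 840], [271, 474, 0]]
print(out["attention_mask"])  # [[1, 1, 1], [1, 1, 0]]
```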

## 🚀 Models using this tokenizer:

- [ChemMiniQ3-HoriFIE](https://github.com/gbyuvd/ChemMiniQ3-HoriFIE)

## 📊 Early VAE Evaluation (vs. ChemBERTa's) [WIP: still at 8K samples and 1 epoch]

1st epoch, on 8K samples; embed_dim=256, hidden_dim=512, latent_dim=128, num_layers=2; batch_size = 16 * 4 (grad. accum.)

Planned:
- 8K samples, 10 epochs
- Latent-space visualization based on SMILES interpolation validity

![NLL](nll.png)

![KLD](kld.png)
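For reference, the evaluation hyperparameters quoted above can be collected in one place; this dataclass and its field names are hypothetical and may not match the training script:

```python
from dataclasses import dataclass

@dataclass
class VAEConfig:
    # Values quoted from the evaluation note above (hypothetical field names).
    embed_dim: int = 256
    hidden_dim: int = 512
    latent_dim: int = 128
    num_layers: int = 2
    batch_size: int = 16        # per-step micro-batch
    grad_accum_steps: int = 4   # "16 * 4 (grad. accum.)"

cfg = VAEConfig()
print(cfg.batch_size * cfg.grad_accum_steps)  # effective batch size: 64
```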

```text
Loaded 8106 SMILES (assumed pre-canonicalized)
Validating SMILES with RDKit...
After RDKit filtering: 8106 valid SMILES
Train: 6484
Val: 811
Test: 811

=== Benchmarking ChemBERTa ===
vocab_size                       : 767
avg_tokens_per_mol               : 42.7383
compression_ratio                : 1.3739
percent_unknown                  : 0.0000
encode_throughput_smiles_per_sec : 3844.2028
decode_throughput_smiles_per_sec : 15993.9616
decode_reconstruction_accuracy   : 100.0000

=== Benchmarking FastChemTokenizer ===
vocab_size                       : 1238
avg_tokens_per_mol               : 21.8288
compression_ratio                : 2.6900
percent_unknown                  : 0.0000
encode_throughput_smiles_per_sec : 37341.6694
decode_throughput_smiles_per_sec : 101864.6384
decode_reconstruction_accuracy   : 100.0000
```
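Two quick sanity checks on the headline claims, computed directly from the logged numbers above:

```python
# avg_tokens_per_mol from the benchmark log above
chemberta_tokens = 42.7383
fastchem_tokens = 21.8288

# Sequence-length reduction: FastChemTokenizer needs ~49% fewer tokens per
# molecule on this run, consistent with the "50% shorter sequences" claim.
reduction = 1 - fastchem_tokens / chemberta_tokens
print(f"{reduction:.1%}")  # 48.9%

# Encode-throughput ratio (SMILES/sec) on the same run:
speedup = 37341.6694 / 3844.2028
print(f"{speedup:.1f}x")  # 9.7x
```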

## 🔧 Contributing

This project is an ongoing **experiment** – all contributions are welcome!
}
```

---