- tokenizer
---

# 🧪 FastChemTokenizer – A High-Performance SMILES Tokenizer built via Info-Theoretic Motif Mining

> **Optimized for chemical language modeling. 2x faster, 50% shorter sequences, minimal memory. Built with entropy-guided n-gram selection.**
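The tagline's "entropy-guided n-gram selection" refers to the motif-mining step. The actual mining procedure is not shown in this excerpt, so the following is only a generic, illustrative sketch of one common info-theoretic scoring scheme (pointwise mutual information over character n-grams); every name here is made up and none of it is the repository's code:

```python
import math
from collections import Counter

def pmi_ngrams(corpus, n=2, top_k=3):
    # Illustration only, NOT the repo's mining code: score character n-grams
    # by pointwise mutual information against their unigram composition.
    # Cohesive motifs (characters that co-occur far more than chance) score high.
    chars = Counter(ch for s in corpus for ch in s)
    grams = Counter(s[i:i + n] for s in corpus for i in range(len(s) - n + 1))
    total_c = sum(chars.values())
    total_g = sum(grams.values())

    def pmi(g):
        p_gram = grams[g] / total_g
        p_indep = math.prod(chars[ch] / total_c for ch in g)
        return math.log(p_gram / p_indep)

    return sorted(grams, key=pmi, reverse=True)[:top_k]

corpus = ["c1ccccc1", "C1CCCCC1", "c1ccncc1"]
print(pmi_ngrams(corpus, n=2))
```

A real miner would add frequency thresholds and pruning (the README mentions pruning the core vocab); raw PMI alone over-rewards rare n-grams.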
Core's vocab length = 781 (after pruning)

- **HF Compatible**: Implements `__call__`, `encode_plus`, `batch_encode_plus`, `save_pretrained`, `from_pretrained`
- **Memory Efficient**: Trie traversal and caching

**For SMILES (core backbone vocab, without tails):**

For the vocab with tails, use `./smitok`. To use the HF-compatible tokenizer (still in development), use `FastChemTokenizerHF`.

```python
from FastChemTokenizer import FastChemTokenizer

tokenizer = FastChemTokenizer.from_pretrained("../smitok_core")
benzene = "c1ccccc1"
encoded = tokenizer.encode(benzene)
print("✅ Encoded:", encoded)
decoded = tokenizer.decode(encoded)
print("✅ Decoded:", decoded)
tokenizer.decode_with_trace(encoded)

# ✅ Encoded: [271, 474, 840]
# ✅ Decoded: c1ccccc1
#
# 🔍 Decoding 3 tokens:
# [000] ID= 271 → 'c1ccc'
# [001] ID= 474 → 'cc'
# [002] ID= 840 → '1'
```
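The trace above reflects greedy longest-match encoding over the mined motif vocabulary. As a toy, self-contained sketch of the trie-based longest-match idea (a made-up five-token vocab and hypothetical helper names, not the shipped implementation):

```python
# Toy illustration of trie-based longest-match tokenization (NOT the shipped code).
def build_trie(vocab):
    trie = {}
    for tok_id, tok in enumerate(vocab):
        node = trie
        for ch in tok:
            node = node.setdefault(ch, {})
        node["_id"] = tok_id  # mark end of a valid token
    return trie

def longest_match_encode(s, trie):
    ids, i = [], 0
    while i < len(s):
        node, best, j = trie, None, i
        # Walk as deep as possible, remembering the longest complete token seen.
        while j < len(s) and s[j] in node:
            node = node[s[j]]
            j += 1
            if "_id" in node:
                best = (node["_id"], j)
        if best is None:
            raise ValueError(f"no token matches at position {i}")
        ids.append(best[0])
        i = best[1]
    return ids

vocab = ["c", "1", "cc", "c1ccc", "cc1"]
trie = build_trie(vocab)
print(longest_match_encode("c1ccccc1", trie))  # [3, 4], i.e. 'c1ccc' + 'cc1'
```

Because each character is visited once per match attempt, encoding is linear-ish in string length, which is where the speed and memory claims come from.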

**For SELFIES:**
```python
from FastChemTokenizerHF import FastChemTokenizerSelfies

tokenizer = FastChemTokenizerSelfies.from_pretrained("../selftok_core")  # use the *_core variant for the vocab without tails
benzene = "[C] [=C] [C] [=C] [C] [=C] [Ring1] [=Branch1]"  # input must be whitespace-separated
encoded = tokenizer.encode(benzene)
print("✅ Encoded:", encoded)
decoded = tokenizer.decode(encoded)
print("✅ Decoded:", decoded)
tokenizer.decode_with_trace(encoded)

# ✅ Encoded: [0, 257, 640, 693, 402, 1]
# ✅ Decoded: <s> [C] [=C] [C] [=C] [C] [=C] [Ring1] [=Branch1] </s>
#
# 🔍 Decoding 6 tokens:
# [000] ID=   0 → '<s>'
# [001] ID= 257 → '[C] [=C] [C] [=C] [C]'
# [002] ID= 640 → '[=C]'
# [003] ID= 693 → '[Ring1]'
# [004] ID= 402 → '[=Branch1]'
# [005] ID=   1 → '</s>'
```
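The SELFIES example requires whitespace-separated tokens. If your SELFIES strings come unspaced (e.g. straight out of a SELFIES encoder), a small regex split restores the expected format; `whitespace_selfies` is a hypothetical helper name, and the repo may ship its own:

```python
import re

def whitespace_selfies(s):
    # Split "[C][=C]..." into "[C] [=C] ..." by matching bracketed tokens.
    return " ".join(re.findall(r"\[[^\]]*\]", s))

print(whitespace_selfies("[C][=C][C][=C][C][=C][Ring1][=Branch1]"))
# [C] [=C] [C] [=C] [C] [=C] [Ring1] [=Branch1]
```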

## 📦 Installation & Usage
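The usage section calls `tokenizer.batch_encode_plus(smiles_list, padding=True, truncation=True, ...)` in the HF style. As a toy re-implementation of what those two flags do to ragged ID lists (pure Python; the function name, pad ID, and max length here are made up for illustration):

```python
def pad_and_truncate(batch_ids, pad_id=0, max_length=8):
    # Truncate each sequence to max_length, then right-pad all sequences
    # to the longest remaining length, with a matching attention mask.
    batch = [ids[:max_length] for ids in batch_ids]
    width = max(len(ids) for ids in batch)
    return {
        "input_ids": [ids + [pad_id] * (width - len(ids)) for ids in batch],
        "attention_mask": [[1] * len(ids) + [0] * (width - len(ids)) for ids in batch],
    }

out = pad_and_truncate([[271, 474, 840], [271, 474]])
print(out["input_ids"])       # [[271, 474, 840], [271, 474, 0]]
print(out["attention_mask"])  # [[1, 1, 1], [1, 1, 0]]
```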

## 🚀 Models using this tokenizer:

- [ChemMiniQ3-HoriFIE](https://github.com/gbyuvd/ChemMiniQ3-HoriFIE)

## 📊 Early VAE Evaluation (vs. ChemBERTa's) [WIP: still at 8K samples and 1 epoch]

1st epoch, on 8K samples; embed_dim=256, hidden_dim=512, latent_dim=128, num_layers=2; batch_size = 16 * 4 (grad. accum.)

Planned:
- 8K samples, 10 epochs
- Latent-space visualization based on SMILES interpolation validity

![NLL](nll.png)

![KLD](kld.png)
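For reference, the evaluation hyperparameters quoted above can be collected in one place; this dataclass and its field names are hypothetical and may not match the training script:

```python
from dataclasses import dataclass

@dataclass
class VAEConfig:
    # Values quoted from the evaluation note above (hypothetical field names).
    embed_dim: int = 256
    hidden_dim: int = 512
    latent_dim: int = 128
    num_layers: int = 2
    batch_size: int = 16        # per-step micro-batch
    grad_accum_steps: int = 4   # "16 * 4 (grad. accum.)"

cfg = VAEConfig()
print(cfg.batch_size * cfg.grad_accum_steps)  # effective batch size: 64
```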

```text
Loaded 8106 SMILES (assumed pre-canonicalized)
Validating SMILES with RDKit...
After RDKit filtering: 8106 valid SMILES
Train: 6484
Val: 811
Test: 811

=== Benchmarking ChemBERTa ===
vocab_size                       : 767
avg_tokens_per_mol               : 42.7383
compression_ratio                : 1.3739
percent_unknown                  : 0.0000
encode_throughput_smiles_per_sec : 3844.2028
decode_throughput_smiles_per_sec : 15993.9616
decode_reconstruction_accuracy   : 100.0000

=== Benchmarking FastChemTokenizer ===
vocab_size                       : 1238
avg_tokens_per_mol               : 21.8288
compression_ratio                : 2.6900
percent_unknown                  : 0.0000
encode_throughput_smiles_per_sec : 37341.6694
decode_throughput_smiles_per_sec : 101864.6384
decode_reconstruction_accuracy   : 100.0000
```
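Two quick sanity checks on the headline claims, computed directly from the logged numbers above:

```python
# avg_tokens_per_mol from the benchmark log above
chemberta_tokens = 42.7383
fastchem_tokens = 21.8288

# Sequence-length reduction: FastChemTokenizer needs ~49% fewer tokens per
# molecule on this run, consistent with the "50% shorter sequences" claim.
reduction = 1 - fastchem_tokens / chemberta_tokens
print(f"{reduction:.1%}")  # 48.9%

# Encode-throughput ratio (SMILES/sec) on the same run:
speedup = 37341.6694 / 3844.2028
print(f"{speedup:.1f}x")  # 9.7x
```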

## 🔧 Contributing

This project is an ongoing **experiment** – all contributions are welcome!
}
```

---