gbyuvd committed on
Commit 2389ec1 Β· verified Β· 1 Parent(s): 1c27d17

Update README.md

Files changed (1)
  1. README.md +57 -30
README.md CHANGED
@@ -8,7 +8,6 @@ tags:
  - tokenizer
  ---

-
  # πŸ§ͺ FastChemTokenizer β€” A High-Performance SMILES Tokenizer built via Info-Theoretic Motif Mining

  > **Optimized for chemical language modeling. 2x faster, 50% shorter sequences, minimal memory. Built with entropy-guided n-gram selection.**
@@ -74,11 +73,16 @@ Core's vocab length = 781 (after pruning)
  - **HF Compatible**: Implements `__call__`, `encode_plus`, `batch_encode_plus`, `save_pretrained`, `from_pretrained`
  - **Memory Efficient**: Trie traversal and cache

- **for SMILES**
  ```python
  from FastChemTokenizer import FastChemTokenizer

- tokenizer = FastChemTokenizer.from_pretrained("./chemtok")
  benzene = "c1ccccc1"
  encoded = tokenizer.encode(benzene)
  print("βœ… Encoded:", encoded)
@@ -86,19 +90,22 @@ decoded = tokenizer.decode(encoded)
  print("βœ… Decoded:", decoded)
  tokenizer.decode_with_trace(encoded)

- # βœ… Encoded: [489, 640]
  # βœ… Decoded: c1ccccc1

- # πŸ” Decoding 2 tokens:
- # [000] ID= 489 β†’ 'c1ccc'
- # [001] ID= 640 β†’ 'cc1'
  ```

  **for SELFIES**
  ```python
- from FastChemTokenizer import FastChemTokenizerSelfies

- tokenizer = FastChemTokenizerSelfies.from_pretrained("./selftok_wtails") # change to *_core for w/o tails
  benzene = "[C] [=C] [C] [=C] [C] [=C] [Ring1] [=Branch1]" # please make sure the input is whitespace-separated
  encoded = tokenizer.encode(benzene)
  print("βœ… Encoded:", encoded)
@@ -106,11 +113,16 @@ decoded = tokenizer.decode(encoded)
  print("βœ… Decoded:", decoded)
  tokenizer.decode_with_trace(encoded)

- # βœ… Encoded: [70]
- # βœ… Decoded: [C] [=C] [C] [=C] [C] [=C] [Ring1] [=Branch1]

- # πŸ” Decoding 1 tokens:
- # [000] ID= 70 β†’ '[C] [=C] [C] [=C] [C] [=C] [Ring1] [=Branch1]'
  ```

  ## πŸ“¦ Installation & Usage
@@ -129,20 +141,47 @@ outputs = tokenizer.batch_encode_plus(smiles_list, padding=True, truncation=True
  ```

  ## πŸ“š Models using this tokenizer:
- - [ChemMiniQ3-HoriFIE](https://huggingface.co/gbyuvd/ChemMiniQ3-HoriFIE)

  ## πŸ“š Early VAE Evaluation (vs. ChemBERTa's) [WIP: STILL AT 8K SAMPLES and 1 EPOCH]
  1st Epoch, on 8K samples; embed_dim=256, hidden_dim=512, latent_dim=128, num_layers=2; batch_size= 16 * 4 (grad acc)

- Planned: 50K samples, 2 epoch
-
- Latent Space Visualization based on SMILES Interpolation Validity

  ![image/png](https://cdn-uploads.huggingface.co/production/uploads/667da868d653c0b02d6a2399/k2a58YUA_gAEF-YBCTs9W.png)

  ![image/png](https://cdn-uploads.huggingface.co/production/uploads/667da868d653c0b02d6a2399/ZwhWS1sJ6MbMewTTC_rVI.png)

  ## πŸ”§ Contributing

  This project is an ongoing **experiment** β€” all contributions are welcome!
@@ -232,16 +271,4 @@ Apache 2.0
  }
  ```

-
-
-
-
-
-
-
-
-
-
-
-
-
 
  - **HF Compatible**: Implements `__call__`, `encode_plus`, `batch_encode_plus`, `save_pretrained`, `from_pretrained`
  - **Memory Efficient**: Trie traversal and cache

+ **for SMILES (core backbone vocab, without tails)**
+
+ For the variant with tails, use `./smitok`.
+
+ To use the HF-compatible tokenizer (still in development), use `FastChemTokenizerHF`.
+
  ```python
  from FastChemTokenizer import FastChemTokenizer

+ tokenizer = FastChemTokenizer.from_pretrained("../smitok_core")
  benzene = "c1ccccc1"
  encoded = tokenizer.encode(benzene)
  print("βœ… Encoded:", encoded)
 
  print("βœ… Decoded:", decoded)
  tokenizer.decode_with_trace(encoded)

+ # βœ… Encoded: [271, 474, 840]
  # βœ… Decoded: c1ccccc1
+ #
+ # πŸ” Decoding 3 tokens:
+ # [000] ID= 271 β†’ 'c1ccc'
+ # [001] ID= 474 β†’ 'cc'
+ # [002] ID= 840 β†’ '1'
+

  ```
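The feature list credits the tokenizer's speed and memory footprint to trie traversal and caching. The idea can be sketched as greedy longest-match encoding over a prefix trie; the vocabulary and token IDs below are invented for illustration and this is not the library's actual implementation:

```python
# Toy sketch: greedy longest-match tokenization over a prefix trie.
# The vocabulary and IDs are invented; the real FastChemTokenizer
# vocab is mined from data.

def build_trie(vocab):
    """Store each token string in a nested-dict trie; '_id' marks a token end."""
    root = {}
    for tok, tid in vocab.items():
        node = root
        for ch in tok:
            node = node.setdefault(ch, {})
        node["_id"] = tid
    return root

def encode(text, trie):
    """At each position, walk the trie as far as possible and emit the
    longest complete token seen (greedy longest match)."""
    ids, i = [], 0
    while i < len(text):
        node, j, last = trie, i, None
        while j < len(text) and text[j] in node:
            node = node[text[j]]
            j += 1
            if "_id" in node:
                last = (node["_id"], j)
        if last is None:
            raise ValueError(f"no vocab entry matches at position {i}")
        ids.append(last[0])
        i = last[1]
    return ids

vocab = {"c": 0, "1": 1, "c1": 2, "c1ccc": 3, "cc": 4}
trie = build_trie(vocab)
print(encode("c1ccccc1", trie))  # [3, 4, 1] -> 'c1ccc' + 'cc' + '1'
```

As in the trace above, benzene splits into one long backbone motif plus short remainders; each character is examined at most once per match attempt, and results can be cached per input.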

  **for SELFIES**
  ```python
+ from FastChemTokenizerHF import FastChemTokenizerSelfies

+ tokenizer = FastChemTokenizerSelfies.from_pretrained("../selftok_core") # change to *_wtails for the variant with tails
  benzene = "[C] [=C] [C] [=C] [C] [=C] [Ring1] [=Branch1]" # please make sure the input is whitespace-separated
  encoded = tokenizer.encode(benzene)
  print("βœ… Encoded:", encoded)
 
  print("βœ… Decoded:", decoded)
  tokenizer.decode_with_trace(encoded)

+ # βœ… Encoded: [0, 257, 640, 693, 402, 1]
+ # βœ… Decoded: <s> [C] [=C] [C] [=C] [C] [=C] [Ring1] [=Branch1] </s>

+ # πŸ” Decoding 6 tokens:
+ # [000] ID= 0 β†’ '<s>'
+ # [001] ID= 257 β†’ '[C] [=C] [C] [=C] [C]'
+ # [002] ID= 640 β†’ '[=C]'
+ # [003] ID= 693 β†’ '[Ring1]'
+ # [004] ID= 402 β†’ '[=Branch1]'
+ # [005] ID= 1 β†’ '</s>'
  ```
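Because SELFIES input is whitespace-separated, one vocab entry can cover several symbols at once, as the multi-symbol token in the trace above shows. A self-contained sketch with an invented motif vocabulary (not the real IDs):

```python
# Toy sketch: longest-match over whitespace-separated SELFIES symbols, so one
# vocab entry can span several symbols. Motifs and IDs here are invented.

def encode_selfies(selfies, vocab, max_len=8):
    syms = selfies.split()          # relies on whitespace-separated input
    ids, i = [], 0
    while i < len(syms):
        # try the longest candidate motif first, then shrink
        for n in range(min(max_len, len(syms) - i), 0, -1):
            motif = " ".join(syms[i:i + n])
            if motif in vocab:
                ids.append(vocab[motif])
                i += n
                break
        else:
            raise ValueError(f"unknown symbol: {syms[i]}")
    return ids

vocab = {"[C]": 10, "[=C]": 11, "[C] [=C] [C] [=C] [C]": 12,
         "[Ring1]": 13, "[=Branch1]": 14}
s = "[C] [=C] [C] [=C] [C] [=C] [Ring1] [=Branch1]"
print(encode_selfies(s, vocab))  # [12, 11, 13, 14]
```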

  ## πŸ“¦ Installation & Usage
 
  ```
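For the batched call in the usage snippet (`batch_encode_plus(..., padding=True, truncation=True`), this is roughly what padding produces; the pad id of 0 and the exact output dict are placeholders, not the library's documented behavior:

```python
# Sketch of what padding=True conceptually does in batch_encode_plus:
# right-pad every id sequence to the batch max length and build a 0/1
# attention mask. pad_id=0 is an arbitrary placeholder here.

def pad_batch(id_seqs, pad_id=0):
    max_len = max(len(s) for s in id_seqs)
    input_ids = [s + [pad_id] * (max_len - len(s)) for s in id_seqs]
    attention_mask = [[1] * len(s) + [0] * (max_len - len(s)) for s in id_seqs]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = pad_batch([[271, 474, 840], [271, 474]])
print(batch["input_ids"])       # [[271, 474, 840], [271, 474, 0]]
print(batch["attention_mask"])  # [[1, 1, 1], [1, 1, 0]]
```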

  ## πŸ“š Models using this tokenizer:
+ - [ChemMiniQ3-HoriFIE](https://github.com/gbyuvd/ChemMiniQ3-HoriFIE)
+

  ## πŸ“š Early VAE Evaluation (vs. ChemBERTa's) [WIP: STILL AT 8K SAMPLES and 1 EPOCH]
  1st Epoch, on 8K samples; embed_dim=256, hidden_dim=512, latent_dim=128, num_layers=2; batch_size= 16 * 4 (grad acc)

+ Planned: 8K samples, 10 epochs

+ Latent Space Visualization based on SMILES Interpolation Validity

  ![image/png](https://cdn-uploads.huggingface.co/production/uploads/667da868d653c0b02d6a2399/k2a58YUA_gAEF-YBCTs9W.png)

  ![image/png](https://cdn-uploads.huggingface.co/production/uploads/667da868d653c0b02d6a2399/ZwhWS1sJ6MbMewTTC_rVI.png)

+ ```text
+ Loaded 8106 SMILES (assumed pre-canonicalized)
+ Validating SMILES with RDKit...
+ After RDKit filtering: 8106 valid SMILES
+ Train: 6484
+ Val: 811
+ Test: 811
+
+ === Benchmarking ChemBERTa ===
+ vocab_size : 767
+ avg_tokens_per_mol : 42.7383
+ compression_ratio : 1.3739
+ percent_unknown : 0.0000
+ encode_throughput_smiles_per_sec : 3844.2028
+ decode_throughput_smiles_per_sec : 15993.9616
+ decode_reconstruction_accuracy : 100.0000
+
+ === Benchmarking FastChemTokenizer ===
+ vocab_size : 1238
+ avg_tokens_per_mol : 21.8288
+ compression_ratio : 2.6900
+ percent_unknown : 0.0000
+ encode_throughput_smiles_per_sec : 37341.6694
+ decode_throughput_smiles_per_sec : 101864.6384
+ decode_reconstruction_accuracy : 100.0000
+ ```
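Metrics of this kind can be computed as below; the formulas are assumptions about the benchmark script, with `compression_ratio` read as characters per token, which is consistent with both rows above (42.74 Γ— 1.37 β‰ˆ 21.83 Γ— 2.69 β‰ˆ 58.7 characters per molecule):

```python
# Sketch of how such tokenizer metrics can be computed. The formulas are
# assumptions about the repo's benchmark script, not taken from it.
import time

def benchmark(encode_fn, smiles_list):
    t0 = time.perf_counter()
    encodings = [encode_fn(s) for s in smiles_list]
    dt = max(time.perf_counter() - t0, 1e-9)  # guard against zero elapsed time
    n_tokens = sum(len(e) for e in encodings)
    n_chars = sum(len(s) for s in smiles_list)
    return {
        "avg_tokens_per_mol": n_tokens / len(smiles_list),
        "compression_ratio": n_chars / n_tokens,  # characters per token
        "encode_throughput_smiles_per_sec": len(smiles_list) / dt,
    }

# A character-level "tokenizer" (one token per character) gives ratio 1.0:
stats = benchmark(list, ["c1ccccc1", "CCO"])
print(stats["avg_tokens_per_mol"], stats["compression_ratio"])  # 5.5 1.0
```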
+

  ## πŸ”§ Contributing

  This project is an ongoing **experiment** β€” all contributions are welcome!
 
  }
  ```

+ ---
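As a closing illustration of the "entropy-guided n-gram selection" in the tagline: one common information-theoretic recipe scores candidate n-grams by count-weighted pointwise mutual information and keeps the top scorers as vocab motifs. This is a generic sketch; the actual criterion used to mine the FastChemTokenizer vocab may differ:

```python
# Generic sketch of info-theoretic motif mining: score character n-grams by
# count-weighted PMI and keep the best. Not the repo's exact method.
import math
from collections import Counter

def mine_motifs(corpus, n=2, top_k=3):
    unigrams = Counter(ch for s in corpus for ch in s)
    ngrams = Counter(s[i:i + n] for s in corpus for i in range(len(s) - n + 1))
    total_uni = sum(unigrams.values())
    total_ng = sum(ngrams.values())

    def pmi(g):
        p_joint = ngrams[g] / total_ng
        p_indep = math.prod(unigrams[c] / total_uni for c in g)
        return math.log2(p_joint / p_indep)

    # weight by count so frequent, strongly associated n-grams win
    scored = {g: c * pmi(g) for g, c in ngrams.items()}
    return sorted(scored, key=scored.get, reverse=True)[:top_k]

print(mine_motifs(["c1ccccc1", "c1ccncc1"], n=2))  # 'c1' ranks first here
```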