gbyuvd committed
Commit e52bcf6 · verified · 1 Parent(s): 70ecb45

Update README.md

Files changed (1):
  1. README.md +13 -16
README.md CHANGED
@@ -145,7 +145,7 @@ outputs = tokenizer.batch_encode_plus(smiles_list, padding=True, truncation=True
 
 
 ## 📚 Early VAE Evaluation (vs. ChemBERTa's) [WIP for Scaling]
-1st Epoch, on ~14K samples of len(token_ids)<=25; embed_dim=64, hidden_dim=128, latent_dim=64, num_layers=2; batch_size= 16 * 4 (grad acc)
+1st Epoch, on ~13K samples of len(token_ids)<=25; embed_dim=64, hidden_dim=128, latent_dim=64, num_layers=2; batch_size= 16 * 4 (grad acc)
 
 Latent Space Visualization based on SMILES Interpolation Validity
 
@@ -156,29 +156,26 @@ using smitok (with tails)
 ![image](https://cdn-uploads.huggingface.co/production/uploads/667da868d653c0b02d6a2399/-TusjDSYv9J3K-pfb0hqu.png)
 
 ```text
-Loaded 8106 SMILES (assumed pre-canonicalized)
-Validating SMILES with RDKit...
-After RDKit filtering: 8106 valid SMILES
-Train: 6484
-Val: 811
-Test: 811
+Train: 13017
+Val: 1627
+Test: 1628
 
 === Benchmarking ChemBERTa ===
 vocab_size : 767
-avg_tokens_per_mol : 42.7383
-compression_ratio : 1.3739
+avg_tokens_per_mol : 25.0359
+compression_ratio : 1.3766
 percent_unknown : 0.0000
-encode_throughput_smiles_per_sec : 3844.2028
-decode_throughput_smiles_per_sec : 15993.9616
+encode_throughput_smiles_per_sec : 4585.2022
+decode_throughput_smiles_per_sec : 18168.2779
 decode_reconstruction_accuracy : 100.0000
 
-=== Benchmarking FastChemTokenizer ===
+=== Benchmarking FastChemTokenizerHF ===
 vocab_size : 1238
-avg_tokens_per_mol : 21.8288
-compression_ratio : 2.6900
+avg_tokens_per_mol : 13.5668
+compression_ratio : 2.5403
 percent_unknown : 0.0000
-encode_throughput_smiles_per_sec : 37341.6694
-decode_throughput_smiles_per_sec : 101864.6384
+encode_throughput_smiles_per_sec : 32005.8686
+decode_throughput_smiles_per_sec : 29807.3610
 decode_reconstruction_accuracy : 100.0000
 ```
 
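Two details in the hunks above may be worth unpacking. First, `batch_size= 16 * 4 (grad acc)` denotes gradient accumulation: micro-batches of 16, accumulated over 4 steps, for an effective batch of 64. A minimal PyTorch-style sketch, not part of this commit (the `model`, `loader`, and `optimizer` names are hypothetical, and `model(batch)` is assumed to return the scalar VAE loss):

```python
# Illustrative sketch of "batch_size= 16 * 4 (grad acc)":
# micro-batches of 16, gradients accumulated over 4 steps -> effective batch 64.
ACCUM_STEPS = 4

def train_epoch(model, loader, optimizer):
    # `loader` is assumed to yield micro-batches of 16 token-id sequences,
    # and `model(batch)` to return the scalar training loss (hypothetical API).
    optimizer.zero_grad()
    for step, batch in enumerate(loader):
        loss = model(batch)
        (loss / ACCUM_STEPS).backward()  # scale so accumulated grads average out
        if (step + 1) % ACCUM_STEPS == 0:
            optimizer.step()
            optimizer.zero_grad()
```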
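Second, the metrics in the benchmark printout can be re-derived with a short harness. The sketch below is illustrative only and is not the script used in this repo; it assumes a tokenizer exposing HF-style `encode()`/`decode()` (a real Hugging Face tokenizer would likely need `decode(ids, skip_special_tokens=True)` for a clean round trip, and `percent_unknown` is omitted here). Both result blocks are numerically consistent with `compression_ratio` being mean characters per SMILES divided by mean tokens per SMILES: 25.0359 × 1.3766 ≈ 13.5668 × 2.5403 ≈ 34.5 characters.

```python
import time

def benchmark_tokenizer(tokenizer, smiles_list):
    """Illustrative re-derivation of the metrics printed above; assumes
    HF-style encode()/decode() and exact string round trips."""
    t0 = time.perf_counter()
    encoded = [tokenizer.encode(s) for s in smiles_list]
    encode_time = time.perf_counter() - t0

    t0 = time.perf_counter()
    decoded = [tokenizer.decode(ids) for ids in encoded]
    decode_time = time.perf_counter() - t0

    n = len(smiles_list)
    avg_tokens = sum(len(ids) for ids in encoded) / n
    avg_chars = sum(len(s) for s in smiles_list) / n
    return {
        "avg_tokens_per_mol": avg_tokens,
        # Inferred definition: characters per token (both result blocks
        # above satisfy ratio ~= avg_chars / avg_tokens).
        "compression_ratio": avg_chars / avg_tokens,
        "encode_throughput_smiles_per_sec": n / encode_time,
        "decode_throughput_smiles_per_sec": n / decode_time,
        # Percent of round trips that reproduce the input string exactly.
        "decode_reconstruction_accuracy": 100.0 * sum(
            d == s for d, s in zip(decoded, smiles_list)
        ) / n,
    }
```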