Update README.md
README.md CHANGED
@@ -145,7 +145,7 @@ outputs = tokenizer.batch_encode_plus(smiles_list, padding=True, truncation=True
 
 
 ## 📚 Early VAE Evaluation (vs. ChemBERTa's) [WIP for Scaling]
-1st Epoch, on ~
+1st Epoch, on ~13K samples of len(token_ids)<=25; embed_dim=64, hidden_dim=128, latent_dim=64, num_layers=2; batch_size= 16 * 4 (grad acc)
 
 Latent Space Visualization based on SMILES Interpolation Validity
 
@@ -156,29 +156,26 @@ using smitok (with tails)
 
 
 ```text
-
-
-
-Train: 6484
-Val: 811
-Test: 811
+Train: 13017
+Val: 1627
+Test: 1628
 
 === Benchmarking ChemBERTa ===
 vocab_size : 767
-avg_tokens_per_mol :
-compression_ratio : 1.
+avg_tokens_per_mol : 25.0359
+compression_ratio : 1.3766
 percent_unknown : 0.0000
-encode_throughput_smiles_per_sec :
-decode_throughput_smiles_per_sec :
+encode_throughput_smiles_per_sec : 4585.2022
+decode_throughput_smiles_per_sec : 18168.2779
 decode_reconstruction_accuracy : 100.0000
 
-=== Benchmarking
+=== Benchmarking FastChemTokenizerHF ===
 vocab_size : 1238
-avg_tokens_per_mol :
-compression_ratio : 2.
+avg_tokens_per_mol : 13.5668
+compression_ratio : 2.5403
 percent_unknown : 0.0000
-encode_throughput_smiles_per_sec :
-decode_throughput_smiles_per_sec :
+encode_throughput_smiles_per_sec : 32005.8686
+decode_throughput_smiles_per_sec : 29807.3610
 decode_reconstruction_accuracy : 100.0000
 ```
 
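To make the new config line concrete: below is a minimal sketch of a seq2seq VAE with the quoted dimensions (embed_dim=64, hidden_dim=128, latent_dim=64, num_layers=2). The GRU cells, the teacher-forced decoder, and every name here are assumptions for illustration; the diff does not show the repo's actual model code. `batch_size= 16 * 4 (grad acc)` reads as micro-batches of 16 with gradients accumulated over 4 steps before each optimizer update.

```python
import torch
import torch.nn as nn

class SmilesVAE(nn.Module):
    # Hypothetical architecture matching the README's quoted dims; the real
    # model (cell type, decoder conditioning, etc.) may differ.
    def __init__(self, vocab_size=1238, embed_dim=64, hidden_dim=128,
                 latent_dim=64, num_layers=2):
        super().__init__()
        self.num_layers, self.hidden_dim = num_layers, hidden_dim
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.encoder = nn.GRU(embed_dim, hidden_dim, num_layers, batch_first=True)
        self.to_mu = nn.Linear(hidden_dim, latent_dim)
        self.to_logvar = nn.Linear(hidden_dim, latent_dim)
        self.z_to_h0 = nn.Linear(latent_dim, hidden_dim * num_layers)
        self.decoder = nn.GRU(embed_dim, hidden_dim, num_layers, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token_ids):                     # token_ids: (B, T)
        x = self.embed(token_ids)
        _, h = self.encoder(x)                        # h: (num_layers, B, hidden)
        mu, logvar = self.to_mu(h[-1]), self.to_logvar(h[-1])
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        h0 = (self.z_to_h0(z)                         # seed decoder state from z
              .view(-1, self.num_layers, self.hidden_dim)
              .permute(1, 0, 2).contiguous())
        dec, _ = self.decoder(x, h0)                  # teacher forcing on the inputs
        return self.out(dec), mu, logvar              # logits + terms for the KL loss
```

Training would pair a cross-entropy loss on shifted targets with the KL term from `mu`/`logvar`, stepping the optimizer every 4 micro-batches to realize the 16 * 4 accumulation.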
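The "Latent Space Visualization based on SMILES Interpolation Validity" plot presumably scores points the way this sketch does: walk a straight line between two molecules' latent codes, decode each intermediate point, and count how many decodes parse as molecules. `vae.encode`/`vae.decode` are hypothetical names standing in for whatever the repo's VAE exposes; RDKit's `MolFromSmiles` returning `None` is the standard validity test.

```python
import numpy as np
from rdkit import Chem

def interpolation_validity(vae, smiles_a, smiles_b, steps=11):
    # Encode the two endpoints into latent vectors (hypothetical API).
    z_a, z_b = vae.encode(smiles_a), vae.encode(smiles_b)
    valid = 0
    for alpha in np.linspace(0.0, 1.0, steps):
        z = (1.0 - alpha) * z_a + alpha * z_b        # linear path in latent space
        smiles = vae.decode(z)                       # decode back to a SMILES string
        if Chem.MolFromSmiles(smiles) is not None:   # RDKit parse = chemically valid
            valid += 1
    return valid / steps  # fraction of interpolants that are valid molecules
```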
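For reference, the two benchmark blocks are internally consistent with compression_ratio meaning characters per token: 25.0359 × 1.3766 ≈ 13.5668 × 2.5403 ≈ 34.5 characters per SMILES. Below is a hedged sketch of how such a table could be produced; the repo's real benchmark script is not part of this diff, and the methods assumed here are the standard HuggingFace tokenizer ones.

```python
import time

def benchmark(tokenizer, smiles_list):
    # Illustrative re-creation of the metrics above for any HF-style tokenizer.
    encoded = [tokenizer.encode(s) for s in smiles_list]

    avg_tokens = sum(len(ids) for ids in encoded) / len(encoded)
    avg_chars = sum(len(s) for s in smiles_list) / len(smiles_list)

    unk = tokenizer.unk_token_id
    total = sum(len(ids) for ids in encoded)
    pct_unknown = 100.0 * sum(ids.count(unk) for ids in encoded) / total

    t0 = time.perf_counter()                      # encode throughput
    for s in smiles_list:
        tokenizer.encode(s)
    enc_tps = len(smiles_list) / (time.perf_counter() - t0)

    t0 = time.perf_counter()                      # decode throughput
    decoded = [tokenizer.decode(ids, skip_special_tokens=True) for ids in encoded]
    dec_tps = len(smiles_list) / (time.perf_counter() - t0)

    # Round-trip accuracy: decoded SMILES must match the input string exactly.
    acc = 100.0 * sum(d == s for d, s in zip(decoded, smiles_list)) / len(smiles_list)

    return {
        "vocab_size": tokenizer.vocab_size,
        "avg_tokens_per_mol": avg_tokens,
        "compression_ratio": avg_chars / avg_tokens,
        "percent_unknown": pct_unknown,
        "encode_throughput_smiles_per_sec": enc_tps,
        "decode_throughput_smiles_per_sec": dec_tps,
        "decode_reconstruction_accuracy": acc,
    }
```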