gbyuvd committed on
Commit 5b9a060 · verified · 1 Parent(s): bae6886

Update README.md

Files changed (1):
  1. README.md +39 -2

README.md CHANGED
@@ -15,14 +15,15 @@ tags:

## 🚀 Overview

- `FastChemTokenizer` is a **trie-based, longest-match-first tokenizer** specifically designed for efficient tokenization of **SMILES strings** in molecular language modeling. The tokenizer is built from scratch for speed and compactness, it outperforms popular tokenizers like [ChemBERTa](https://huggingface.co/seyonec/ChemBERTa-zinc-base-v1/)'s while maintaining 0% UNK rate on ~2.7M dataset and compatibility with Hugging Face `transformers`. In n-grams building, this project uses [seyonec/ChemBERTa](https://huggingface.co/seyonec/ChemBERTa-zinc-base-v1/)'s as early tokenizer for determining n-grams using its token_ids, then uses information-theoretic filtering (entropy reduction, PMI, internal entropy) to extract meaningful statistical chemical motifs — then balances 391 backbone (functional) and 391 tail fragments for structural coverage.

- Trained on ~2.7M valid SMILES built and curated from ChemBL34 (Zdrazil _et al._ 2023), COCONUTDB (Sorokina _et al._ 2021), and Supernatural3 (Gallo _et al._ 2023) dataset; from resulting 76K n-grams -> pruned to **1,238 tokens**, including backbone/tail motifs and special tokens.

For code and tutorial check this [github project](https://github.com/gbyuvd/FastChemTokenizer)

## ⚡ Performance Highlights

| Metric | FastChemTokenizer | [ChemBERTa](https://huggingface.co/seyonec/ChemBERTa-zinc-base-v1/) Tokenizer | [gen-mlm-cismi-bert](https://huggingface.co/smostafanejad/gen-mlm-cismi-bert-wordpiece) |
|--------------------------------|-------------------|----------------------|---------------------|
| **Avg time per SMILES** | **0.0803 ms** | 0.1581 ms | 0.0938 ms |

@@ -34,10 +35,26 @@ For code and tutorial check this [github project](https://github.com/gbyuvd/FastChemTokenizer)

✅ **1.97x faster** than ChemBERTa
✅ **1.50x faster** than gen-mlm-cismi-bert
✅ **No indexing errors** (avoids >512 token sequences)
✅ **Zero unknown tokens** on validation set

## 🧩 Vocabulary

@@ -55,6 +72,7 @@ For code and tutorial check this [github project](https://github.com/gbyuvd/FastChemTokenizer)

- **HF Compatible**: Implements `__call__`, `encode_plus`, `batch_encode_plus`, `save_pretrained`, `from_pretrained`
- **Memory Efficient**: Trie traversal and cache

```python
from FastChemTokenizer import FastChemTokenizer

@@ -74,9 +92,28 @@ tokenizer.decode_with_trace(encoded)

# [001] ID= 640 → 'cc1'
```

## 📦 Installation & Usage

1. Clone this repository to a directory
2. Load with:
```python
## 🚀 Overview

+ `FastChemTokenizer` is a **trie-based, longest-match-first tokenizer** designed for efficient tokenization of **SMILES and SELFIES strings** in molecular language modeling. Built from scratch for speed and compactness, it outperforms popular tokenizers such as [ChemBERTa](https://huggingface.co/seyonec/ChemBERTa-zinc-base-v1/)'s while maintaining a 0% UNK rate on a ~2.7M-sample dataset and staying compatible with Hugging Face `transformers`. To build n-grams, this project first tokenizes with [seyonec/ChemBERTa](https://huggingface.co/seyonec/ChemBERTa-zinc-base-v1/) and determines n-grams over its token IDs, then applies information-theoretic filtering (entropy reduction, PMI, internal entropy) to extract statistically meaningful chemical motifs, and finally balances 391 backbone (functional) and 391 tail fragments for structural coverage.

+ Trained on ~2.7M valid SMILES and SELFIES built and curated from the ChemBL34 (Zdrazil _et al._ 2023), COCONUTDB (Sorokina _et al._ 2021), and Supernatural3 (Gallo _et al._ 2023) datasets; the resulting 76K n-grams were pruned to **1,238 tokens**, including backbone/tail motifs and special tokens.

For code and a tutorial, see the [GitHub project](https://github.com/gbyuvd/FastChemTokenizer).
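The trie-based, longest-match-first matching described above can be sketched in a few lines of plain Python. This is an illustrative toy, not the library's implementation, and the tiny vocabulary below is made up; the real vocabulary has 1,238 entries, many of them multi-character motifs.

```python
# Toy sketch of trie-based, longest-match-first tokenization.
# The vocabulary here is hypothetical, for illustration only.

class TrieNode:
    def __init__(self):
        self.children = {}
        self.token_id = None  # set when a vocab entry ends at this node

def build_trie(vocab):
    root = TrieNode()
    for token, token_id in vocab.items():
        node = root
        for ch in token:
            node = node.children.setdefault(ch, TrieNode())
        node.token_id = token_id
    return root

def tokenize(text, root):
    """Greedy longest match: at each position, take the longest vocab entry."""
    ids, i = [], 0
    while i < len(text):
        node, best_id, best_len = root, None, 0
        for j in range(i, len(text)):
            node = node.children.get(text[j])
            if node is None:
                break
            if node.token_id is not None:
                best_id, best_len = node.token_id, j - i + 1
        if best_id is None:  # no match: a real tokenizer would emit <unk>
            i += 1
            continue
        ids.append(best_id)
        i += best_len
    return ids

vocab = {"C": 0, "c": 1, "cc1": 2, "c1": 3, "(": 4, ")": 5, "=O": 6}
trie = build_trie(vocab)
print(tokenize("Ccc1", trie))  # 'C' then the longest match 'cc1' -> [0, 2]
```

Because matching is greedy, `"Ccc1"` yields `['C', 'cc1']` rather than four single-character tokens, which is what keeps the average sequence length short.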
## ⚡ Performance Highlights

+ #### SMILES
| Metric | FastChemTokenizer | [ChemBERTa](https://huggingface.co/seyonec/ChemBERTa-zinc-base-v1/) Tokenizer | [gen-mlm-cismi-bert](https://huggingface.co/smostafanejad/gen-mlm-cismi-bert-wordpiece) |
|--------------------------------|-------------------|----------------------|---------------------|
| **Avg time per SMILES** | **0.0803 ms** | 0.1581 ms | 0.0938 ms |

✅ **1.97x faster** than ChemBERTa
✅ **1.50x faster** than gen-mlm-cismi-bert
+ ✅ **~19x memory saving** compared to both of the above tokenizers
✅ **No indexing errors** (avoids sequences longer than 512 tokens)
✅ **Zero unknown tokens** on the validation set

+ #### SELFIES
+ ```
+ Core vocab length = 781 (after pruning)
+ With tails = 1161 (after pruning)
+ ```
+ | Metric | FastChemTokenizer-WTails | FastChemTokenizer-Core | [opti-chemfie-experiment-1](https://huggingface.co/gbyuvd/bionat-selfies-gen-tokenizer-wordlevel) |
+ |--------------------------------|-------------------|----------------------|---------------------|
+ | **Avg time per SELFIES** | 0.1548 ms | 0.1700 ms | **0.1170 ms** |
+ | **Avg sequence length** | **20.34 tokens** | 33.22 tokens | 53.98 tokens |
+ | **Throughput** | 6,461/sec | 5,882/sec | **8,549/sec** |
+ | **Peak memory usage** | **7.96 MB** | 19.77 MB | 488.03 MB |
+ | **UNK token rate** | **0.0000%** | 0.0000% | 0.0000% |
+ | **1000 encodes (benchmark)** | **0.0081s** | 2.9020s | 2.9020s |

+ ✅ Although ~1.32x slower than opti-chemfie-experiment-1, the WTails tokenizer produces **2.65x fewer tokens**
+ ✅ **~61x memory saving** with tails and **~25x** with core
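As a quick sanity check on the table above, the throughput column is simply the reciprocal of the average encode time:

```python
# Throughput (strings/sec) is the reciprocal of average time per string.
# Numbers are taken from the SELFIES benchmark table above.
avg_ms = {
    "FastChemTokenizer-WTails": 0.1548,
    "FastChemTokenizer-Core": 0.1700,
    "opti-chemfie-experiment-1": 0.1170,
}

for name, ms in avg_ms.items():
    per_sec = 1000.0 / ms  # 1000 ms per second
    print(f"{name}: {per_sec:,.0f} strings/sec")
# WTails ~6,460/sec, Core ~5,882/sec, opti ~8,547/sec,
# agreeing with the table up to rounding.
```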
## 🧩 Vocabulary

- **HF Compatible**: Implements `__call__`, `encode_plus`, `batch_encode_plus`, `save_pretrained`, `from_pretrained`
- **Memory Efficient**: Trie traversal and cache

+ **For SMILES**
```python
from FastChemTokenizer import FastChemTokenizer

# [001] ID= 640 → 'cc1'
```
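The "cache" half of the memory-efficiency bullet can be illustrated with a memoized encode: repeated strings in a dataset hit the cache instead of being re-tokenized. This is a toy stand-in (single-character lookup instead of real trie matching), not FastChemTokenizer's internals.

```python
from functools import lru_cache

# Hypothetical mini-vocabulary; the encode logic is a stand-in,
# not the library's trie traversal.
VOCAB = {"C": 0, "c": 1, "1": 2, "(": 3, ")": 4}

@lru_cache(maxsize=100_000)
def encode_cached(smiles: str) -> tuple:
    # tuples are hashable and immutable, so results can be cached safely
    return tuple(VOCAB[ch] for ch in smiles)

print(encode_cached("c1ccc1"))          # computed on first call
print(encode_cached("c1ccc1"))          # served from the cache
print(encode_cached.cache_info().hits)  # -> 1
```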
+ **For SELFIES**
+ ```python
+ from FastChemTokenizer import FastChemTokenizerSelfies
+
+ tokenizer = FastChemTokenizerSelfies.from_pretrained("./selftok_wtails")  # use *_core for the variant without tails
+ benzene = "[C] [=C] [C] [=C] [C] [=C] [Ring1] [=Branch1]"  # input must be whitespace-separated
+ encoded = tokenizer.encode(benzene)
+ print("✅ Encoded:", encoded)
+ decoded = tokenizer.decode(encoded)
+ print("✅ Decoded:", decoded)
+ tokenizer.decode_with_trace(encoded)
+
+ # ✅ Encoded: [70]
+ # ✅ Decoded: [C] [=C] [C] [=C] [C] [=C] [Ring1] [=Branch1]
+
+ # 🔍 Decoding 1 tokens:
+ # [000] ID= 70 → '[C] [=C] [C] [=C] [C] [=C] [Ring1] [=Branch1]'
+ ```
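The whitespace requirement in the example above comes from matching over whole SELFIES symbols rather than characters. A hypothetical sketch (made-up vocabulary and IDs) shows how a frequent multi-symbol motif such as the full benzene ring can collapse to a single token ID:

```python
# Toy longest-match tokenization over whitespace-separated SELFIES
# symbols. The vocabulary and IDs are invented for illustration.

BENZENE = "[C] [=C] [C] [=C] [C] [=C] [Ring1] [=Branch1]"

vocab = {
    ("[C]",): 5,
    ("[=C]",): 6,
    ("[Ring1]",): 7,
    ("[=Branch1]",): 8,
    tuple(BENZENE.split()): 70,  # the whole ring motif as one token
}
max_len = max(len(k) for k in vocab)

def encode(selfies):
    symbols = tuple(selfies.split())  # whitespace-separated input required
    ids, i = [], 0
    while i < len(symbols):
        # try the widest symbol window first, shrink until a vocab hit
        for width in range(min(max_len, len(symbols) - i), 0, -1):
            token_id = vocab.get(symbols[i:i + width])
            if token_id is not None:
                ids.append(token_id)
                i += width
                break
        else:
            raise KeyError(f"unknown symbol: {symbols[i]}")
    return ids

print(encode(BENZENE))             # whole motif matches -> [70]
print(encode("[C] [=C] [Ring1]"))  # falls back to per-symbol IDs -> [5, 6, 7]
```

This is why a single ID (70 in the README's trace) can decode back to the entire eight-symbol ring.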
## 📦 Installation & Usage

+ 0. Make sure you have all the required packages installed; other package versions will likely work too
1. Clone this repository to a directory
2. Load with:
```python