gbyuvd committed · verified
Commit 819b92d · Parent(s): 4348e91

Update README.md

Files changed (1)
  1. README.md +25 -0
README.md CHANGED

@@ -102,6 +102,8 @@ tokenizer.decode_with_trace(encoded)
 ```
 
 **for SELFIES**
+Please don't use the old `FastChemTokenizer` for SELFIES, use the HF one
+
 ```python
 from FastChemTokenizerHF import FastChemTokenizerSelfies
 
@@ -125,6 +127,29 @@ tokenizer.decode_with_trace(encoded)
 # [005] ID= 1 → '</s>'
 ```
 
+#### BigSMILES (experimental)
+```python
+from FastChemTokenizer import FastChemTokenizer
+
+tokenizer = FastChemTokenizer.from_pretrained("./bigsmiles-proto")
+testentry = "*CC(*)c1ccccc1C(=O)OCCCCCC"
+encoded = tokenizer.encode(testentry)
+print("✅ Encoded:", encoded)
+decoded = tokenizer.decode(encoded)
+print("✅ Decoded:", decoded)
+tokenizer.decode_with_trace(encoded)
+
+# ✅ Encoded: [186, 185, 723, 31, 439]
+# ✅ Decoded: *CC(*)c1ccccc1C(=O)OCCCCCC
+#
+# 🔍 Decoding 5 tokens:
+# [000] ID= 186 → '*CC(*)'
+# [001] ID= 185 → 'c1cccc'
+# [002] ID= 723 → 'c1'
+# [003] ID= 31 → 'C(=O)OCC'
+# [004] ID= 439 → 'CCCC'
+```
+
 ## 📦 Installation & Usage
 
 0. Make sure you have all the reqs packages, possibly can be run with different versions
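The `decode_with_trace` output in the BigSMILES example above shows the test string splitting into multi-character vocabulary tokens. As an illustration only — the vocabulary and IDs below are copied from that trace, and this is not FastChemTokenizer's actual implementation — a plain greedy longest-match over those five tokens reproduces the same segmentation:

```python
# Minimal sketch: greedy longest-match tokenization over a fixed vocabulary.
# Tokens and IDs are taken from the trace in the diff above; this is an
# illustrative stand-in, NOT the library's real encoder.

VOCAB = {"*CC(*)": 186, "c1cccc": 185, "c1": 723, "C(=O)OCC": 31, "CCCC": 439}

def greedy_encode(s: str, vocab: dict) -> list:
    """At each position, consume the longest vocabulary entry that matches."""
    ids = []
    i = 0
    while i < len(s):
        match = max(
            (tok for tok in vocab if s.startswith(tok, i)),
            key=len,
            default=None,
        )
        if match is None:
            raise ValueError(f"no token matches at position {i}: {s[i:]!r}")
        ids.append(vocab[match])
        i += len(match)
    return ids

print(greedy_encode("*CC(*)c1ccccc1C(=O)OCCCCCC", VOCAB))
# [186, 185, 723, 31, 439]
```

Note how `c1cccc` wins over the shorter `c1` at position 6; longest-match is what makes the segmentation deterministic here.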