jsture committed
Commit fdd34dc · verified · 1 Parent(s): 891e224

Add APE SELFIES tokenizer with max length 256

Files changed (1)
  1. README.md +107 -0
README.md ADDED
@@ -0,0 +1,107 @@
+ ---
+ license: mit
+ library_name: transformers
+ tags:
+ - chemistry
+ - molecules
+ - selfies
+ - ape-tokenizer
+ - tokenizer
+ ---
+
+ # ApeTokenizer-SELFIES
+
+ ApeTokenizer-SELFIES is the **Atom Pair Encoding (APE)** tokenizer used by
+ [ModernMolBERT](https://github.com/HauserGroup/ModernMolBERT), a family of
+ compact encoder-only transformer models for small-molecule representation
+ learning, pre-trained on SELFIES strings from ChEMBL 36.
+
+ APE is a byte-pair-style merging scheme applied directly to SELFIES bracket
+ tokens, so every token boundary aligns with a chemically valid SELFIES
+ primitive. The vocabulary is derived from ~2M unique SELFIES strings from
+ ChEMBL 36.
+
+ ## Tokenizer Details
+
+ - **Developed by:** Hauser Group, Department of Drug Design and Pharmacology, University of Copenhagen
+ - **Input representation:** SELFIES (convert SMILES first; see below)
+ - **Algorithm:** Atom Pair Encoding (APE): pair merging over SELFIES bracket tokens
+ - **Vocabulary size:** 631
+ - **Max merge pieces:** 2
+ - **Min merge frequency:** 3000
+ - **Training corpus size:** ~2M unique SELFIES (ChEMBL 36)
+ - **License:** MIT
+ - **Repository:** https://github.com/HauserGroup/ModernMolBERT
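One way to read the two merge hyperparameters listed above is sketched below. This is a minimal illustration, not the ModernMolBERT training code: it assumes a single counting pass over adjacent bracket-token pairs, and the helper names (`split_primitives`, `frequent_pairs`), the toy corpus, and the toy threshold are all hypothetical.

```python
import re
from collections import Counter

# Matches one SELFIES bracket primitive such as "[C]" or "[=Branch1]".
PRIMITIVE = re.compile(r"\[[^\]]*\]")


def split_primitives(selfies_string):
    """Split a SELFIES string into its bracket primitives."""
    return PRIMITIVE.findall(selfies_string)


def frequent_pairs(corpus, min_frequency):
    """Return adjacent primitive pairs seen at least `min_frequency` times.

    Each kept pair spans exactly two primitives, which is how this sketch
    interprets "Max merge pieces: 2" (an assumption).
    """
    counts = Counter()
    for selfies_string in corpus:
        primitives = split_primitives(selfies_string)
        counts.update(zip(primitives, primitives[1:]))
    return {a + b for (a, b), n in counts.items() if n >= min_frequency}


# Toy corpus and threshold; the README reports a minimum merge frequency
# of 3000 over roughly 2M ChEMBL 36 strings.
toy_corpus = ["[C][C][O]", "[C][C][N]", "[C][C][=O]"]
merged = frequent_pairs(toy_corpus, min_frequency=3)
print(merged)
```

Because merges are built by concatenating whole primitives, a merged token can never split a bracket, which is the boundary property the description relies on.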
+
+ | Special token | ID |
+ |---------------|----|
+ | `<s>` (BOS)   | 0  |
+ | `<pad>`       | 1  |
+ | `</s>` (EOS)  | 2  |
+ | `<unk>`       | 3  |
+ | `<mask>`      | 4  |
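As a rough illustration of how these IDs frame a sequence, the sketch below wraps a list of token IDs with BOS and EOS and pads it to a fixed length. The function name `frame_ids`, the truncation rule, and right-padding are assumptions made for illustration. The length of 256 comes from the commit title, and the actual tokenizer may truncate or pad differently.

```python
BOS_ID, PAD_ID, EOS_ID = 0, 1, 2  # IDs from the special-token table


def frame_ids(token_ids, max_length=256):
    """Wrap IDs as BOS ... EOS, truncate to fit, then right-pad with PAD."""
    body = list(token_ids)[: max_length - 2]  # leave room for BOS and EOS
    framed = [BOS_ID] + body + [EOS_ID]
    attention_mask = [1] * len(framed)
    padding = max_length - len(framed)
    return framed + [PAD_ID] * padding, attention_mask + [0] * padding


# Token IDs from the aspirin example in this README, without BOS/EOS.
aspirin_ids = [334, 335, 370, 333, 333, 333, 338, 377, 511, 6]
input_ids, attention_mask = frame_ids(aspirin_ids)
print(input_ids[:13])
print(sum(attention_mask))
```

The first twelve IDs match the `input_ids` tensor shown in the usage example; everything after them is padding that the attention mask marks with zeros.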
+
+ ## How to Get Started
+
+ ```python
+ from transformers import AutoTokenizer
+
+ tokenizer = AutoTokenizer.from_pretrained(
+     "HauserGroup/ApeTokenizer-SELFIES",
+     trust_remote_code=True,
+     use_fast=False,
+ )
+
+ # A SELFIES string (here, aspirin).
+ selfies = "[C][C][=Branch1][C][=O][O][C][=C][C][=C][C][=C][Ring1][=Branch1][C][=Branch1][C][=O][O]"
+
+ tokens = tokenizer.tokenize(selfies)
+ print(tokens)
+ # ['[C][C]', '[=Branch1][C]', '[=O][O]', '[C][=C]', '[C][=C]', '[C][=C]', '[Ring1][=Branch1]', '[C][=Branch1]', '[C][=O]', '[O]']
+
+ inputs = tokenizer(selfies, return_tensors="pt")
+ print(inputs["input_ids"])
+ # tensor([[ 0, 334, 335, 370, 333, 333, 333, 338, 377, 511, 6, 2]])
+ ```
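The token list above is consistent with a greedy left-to-right scan that emits a merged pair whenever it is in the vocabulary and a single primitive otherwise. The sketch below reproduces that behaviour with a hand-written pair set. `greedy_pair_tokenize` and `TOY_PAIRS` are hypothetical names, the pair set contains only the merges visible in the aspirin output, and the real tokenizer's matching rule may differ.

```python
import re

# Only the merged pairs that appear in the aspirin output (toy vocabulary).
TOY_PAIRS = {
    "[C][C]", "[=Branch1][C]", "[=O][O]", "[C][=C]",
    "[Ring1][=Branch1]", "[C][=Branch1]", "[C][=O]",
}


def greedy_pair_tokenize(selfies_string, pairs=TOY_PAIRS):
    """Scan left to right, preferring a known two-primitive merge."""
    primitives = re.findall(r"\[[^\]]*\]", selfies_string)
    tokens, i = [], 0
    while i < len(primitives):
        pair = "".join(primitives[i:i + 2])
        if i + 1 < len(primitives) and pair in pairs:
            tokens.append(pair)
            i += 2
        else:
            tokens.append(primitives[i])
            i += 1
    return tokens


aspirin = (
    "[C][C][=Branch1][C][=O][O][C][=C][C][=C][C][=C]"
    "[Ring1][=Branch1][C][=Branch1][C][=O][O]"
)
print(greedy_pair_tokenize(aspirin))
```

Aspirin has 19 bracket primitives, so nine merged pairs plus the trailing `[O]` give the ten tokens reported above.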
+
+ If you start from SMILES, convert to SELFIES first with the `selfies` package:
+
+ ```python
+ import selfies
+
+ smi = "CC(=O)Oc1ccccc1C(=O)O"
+ sf = selfies.encoder(smi)  # '[C][C][=Branch1][C][=O][O][C]...'
+
+ # `tokenizer` is the object loaded in the previous snippet.
+ inputs = tokenizer(sf, return_tensors="pt")
+ ```
+
+ ### Using with ModernMolBERT models
+
+ This tokenizer is shared by all four ModernMolBERT checkpoints. Load it from
+ the model repo using `subfolder="ape_tokenizer"` to avoid routing
+ `AutoTokenizer` to the built-in fast ModernBERT tokenizer:
+
+ ```python
+ from transformers import AutoTokenizer
+
+ tokenizer = AutoTokenizer.from_pretrained(
+     "HauserGroup/ModernMolBERT-small",
+     subfolder="ape_tokenizer",
+     trust_remote_code=True,
+     use_fast=False,
+ )
+ ```
+
+ Alternatively, load this standalone repo directly as shown above; both
+ routes produce identical tokenizations.
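One way to check that claim locally is to compare outputs on a few strings. The helper below is a hypothetical sketch: it only assumes that both objects expose a `tokenize(text)` method, as the tokenizers loaded above do, and it is demonstrated with a stand-in class so that it runs without downloading anything.

```python
def tokenization_mismatches(tok_a, tok_b, samples):
    """Return the samples on which the two tokenizers disagree."""
    return [s for s in samples if tok_a.tokenize(s) != tok_b.tokenize(s)]


class BracketTokenizer:
    """Stand-in with a `tokenize` method; not the real APE tokenizer."""

    def tokenize(self, text):
        return [part + "]" for part in text.split("]") if part]


samples = ["[C][C][O]", "[C][=O]"]
mismatches = tokenization_mismatches(BracketTokenizer(), BracketTokenizer(), samples)
print(mismatches)  # an empty list means the outputs agree on these samples
```

With the real tokenizers, pass the two objects returned by `AutoTokenizer.from_pretrained` in place of the stand-ins.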
+
+ ## Citation
+
+ ```bibtex
+ @article{madsen_modernmolbert,
+   title  = {ModernMolBERT: A ModernBERT Encoder Family for SELFIES Molecular Language Modeling},
+   author = {Madsen, Jakob S. and Angelucci, Sara and Hauser, Alexander S.},
+   year   = {2026}
+ }
+ ```
+
+ The APE algorithm follows Leon et al., *Comparing SMILES and SELFIES
+ tokenization for enhanced chemical language modeling*, Sci. Rep. 14, 25016 (2024).