HauserGroup
/

ApeTokenizer-SELFIES

+---
+license: mit
+library_name: transformers
+tags:
+- chemistry
+- molecules
+- selfies
+- ape-tokenizer
+- tokenizer
+---
+# ApeTokenizer-SELFIES
+ApeTokenizer-SELFIES is the **Atom Pair Encoding (APE)** tokenizer used by
+[ModernMolBERT](https://github.com/HauserGroup/ModernMolBERT) — a family of
+compact encoder-only transformer models for small-molecule representation
+learning pre-trained on SELFIES strings from ChEMBL 36.
+APE is a byte-pair-style merging scheme applied directly to SELFIES bracket
+tokens, so every token boundary aligns with a chemically valid SELFIES
+primitive. The vocabulary is derived from ~2M unique
+SELFIES strings from ChEMBL 36.
+## Tokenizer Details
+- **Developed by:** Hauser Group, Department of Drug Design and Pharmacology, University of Copenhagen
+- **Input representation:** SELFIES (convert SMILES first; see below)
+- **Algorithm:** Atom Pair Encoding (APE) — pair merging over SELFIES bracket tokens
+- **Vocabulary size:** 631
+- **Max merge pieces:** 2
+- **Min merge frequency:** 3000
+- **Training corpus size:** 2M unique SELFIES (ChEMBL 36)
+- **License:** MIT
+- **Repository:** https://github.com/HauserGroup/ModernMolBERT
+| special token | id |
+|---------------|----|
+| `<s>` (BOS) | 0 |
+| `<pad>` | 1 |
+| `</s>` (EOS) | 2 |
+| `<unk>` | 3 |
+| `<mask>` | 4 |
+## How to Get Started
+```python
+from transformers import AutoTokenizer
+tokenizer = AutoTokenizer.from_pretrained(
+    "HauserGroup/ApeTokenizer-SELFIES",
+    trust_remote_code=True,
+    use_fast=False,
+)
+# A SELFIES string — here aspirin.
+selfies = "[C][C][=Branch1][C][=O][O][C][=C][C][=C][C][=C][Ring1][=Branch1][C][=Branch1][C][=O][O]"
+tokens = tokenizer.tokenize(selfies)
+print(tokens)
+# ['[C][C]', '[=Branch1][C]', '[=O][O]', '[C][=C]', '[C][=C]', '[C][=C]', '[Ring1][=Branch1]', '[C][=Branch1]', '[C][=O]', '[O]']
+inputs = tokenizer(selfies, return_tensors="pt")
+print(inputs["input_ids"])
+# tensor([[  0, 334, 335, 370, 333, 333, 333, 338, 377, 511,   6,   2]])
+```
+If you start from SMILES, convert first:
+```python
+import selfies
+smi = "CC(=O)Oc1ccccc1C(=O)O"
+sf = selfies.encoder(smi)   # '[C][C][=Branch1][C][=O][O][C]...'
+inputs = tokenizer(sf, return_tensors="pt")
+```
+### Using with ModernMolBERT models
+This tokenizer is shared by all four ModernMolBERT checkpoints. Load it from
+the model repo using `subfolder="ape_tokenizer"` to avoid routing
+`AutoTokenizer` to the built-in fast ModernBERT tokenizer:
+```python
+from transformers import AutoTokenizer
+tokenizer = AutoTokenizer.from_pretrained(
+    "HauserGroup/ModernMolBERT-small",
+    subfolder="ape_tokenizer",
+    trust_remote_code=True,
+    use_fast=False,
+)
+```
+Or load this standalone repo directly as shown above — both produce identical
+tokenizations.
+## Citation
+```bibtex
+@article{madsen_modernmolbert,
+  title  = {ModernMolBERT: A ModernBERT Encoder Family for SELFIES Molecular Language Modeling},
+  author = {Madsen, Jakob S. and Angelucci, Sara and Hauser, Alexander S.},
+  year   = {2026}
+}
+```
+The APE algorithm follows Leon et al., *Comparing SMILES and SELFIES
+tokenization for enhanced chemical language modeling*, Sci. Rep. 14, 25016 (2024).