---
license: mit
library_name: transformers
tags:
- chemistry
- molecules
- selfies
- ape-tokenizer
- tokenizer
---

# ApeTokenizer-SELFIES

ApeTokenizer-SELFIES is the **Atom Pair Encoding (APE)** tokenizer used by
[ModernMolBERT](https://github.com/HauserGroup/ModernMolBERT), a family of
compact encoder-only transformer models for small-molecule representation
learning, pre-trained on SELFIES strings from ChEMBL 36.

APE is a byte-pair-style merging scheme applied directly to SELFIES bracket
tokens, so every token boundary aligns with a chemically valid SELFIES
primitive. The vocabulary is derived from ~2M unique
SELFIES strings from ChEMBL 36.
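
To make the merging idea concrete, here is a minimal sketch; it is not the ModernMolBERT implementation. It assumes a greedy left-to-right match and a hand-picked toy set of merged pairs (`TOY_MERGES`), whereas the real tokenizer learns its pairs from ChEMBL 36 and may resolve overlapping candidates differently. The helper names `split_bracket_tokens` and `greedy_pair_merge` are hypothetical.

```python
import re

# One SELFIES primitive is a single bracketed symbol such as "[C]" or "[=Branch1]".
BRACKET = re.compile(r"\[[^\]]*\]")

# Toy set of merged pairs, chosen by hand for illustration only (assumption).
TOY_MERGES = {
    "[C][C]", "[=Branch1][C]", "[=O][O]", "[C][=C]",
    "[Ring1][=Branch1]", "[C][=Branch1]", "[C][=O]",
}


def split_bracket_tokens(selfies_string):
    """Split a SELFIES string into its bracketed primitives."""
    return BRACKET.findall(selfies_string)


def greedy_pair_merge(symbols, merges, max_pieces=2):
    """Scan left to right, emitting the longest known merge at each position.

    A merged token never spans more than `max_pieces` primitives, and a
    single primitive is always accepted, so the loop always advances.
    """
    out = []
    i = 0
    while i < len(symbols):
        for n in range(min(max_pieces, len(symbols) - i), 0, -1):
            candidate = "".join(symbols[i:i + n])
            if n == 1 or candidate in merges:
                out.append(candidate)
                i += n
                break
    return out


ASPIRIN = (
    "[C][C][=Branch1][C][=O][O][C][=C][C][=C][C][=C]"
    "[Ring1][=Branch1][C][=Branch1][C][=O][O]"
)
tokens = greedy_pair_merge(split_bracket_tokens(ASPIRIN), TOY_MERGES)
print(tokens)
```

Because merges only ever join whole bracket primitives, concatenating the output tokens recovers the input string exactly, and no token can start or end inside a bracket.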

## Tokenizer Details

- **Developed by:** Hauser Group, Department of Drug Design and Pharmacology, University of Copenhagen
- **Input representation:** SELFIES (convert SMILES first; see below)
- **Algorithm:** Atom Pair Encoding (APE), i.e. pair merging over SELFIES bracket tokens
- **Vocabulary size:** 631
- **Max merge pieces:** 2
- **Min merge frequency:** 3000
- **Training corpus size:** ~2M unique SELFIES (ChEMBL 36)
- **License:** MIT
- **Repository:** https://github.com/HauserGroup/ModernMolBERT
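
As a rough illustration of how the merge settings could interact, the sketch below counts adjacent pairs of bracket tokens in a toy corpus and keeps those at or above a frequency threshold. This is an assumption about what "max merge pieces" and "min merge frequency" mean, simplified to a single counting pass; the actual training procedure may iterate or break ties differently. The function name `frequent_pairs` and the toy corpus are hypothetical.

```python
import re
from collections import Counter

BRACKET = re.compile(r"\[[^\]]*\]")


def frequent_pairs(corpus, min_freq):
    """Count adjacent pairs of bracket tokens and keep the frequent ones.

    Only pairs are counted, mirroring a limit of two pieces per merged token.
    """
    counts = Counter()
    for selfies_string in corpus:
        symbols = BRACKET.findall(selfies_string)
        counts.update(a + b for a, b in zip(symbols, symbols[1:]))
    return {pair: n for pair, n in counts.items() if n >= min_freq}


# Four tiny SELFIES strings stand in for the ~2M-string corpus.
toy_corpus = ["[C][C][O]", "[C][C][N]", "[C][C][=O]", "[N][C][O]"]

# The listed threshold is 3000; 2 is used here so the toy corpus yields merges.
pairs = frequent_pairs(toy_corpus, min_freq=2)
print(pairs)
```

Raising `min_freq` shrinks the set of merged pairs, which is one way a high threshold such as 3000 would keep the vocabulary compact.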

| special token | id |
|---------------|----|
| `<s>` (BOS)   | 0  |
| `<pad>`       | 1  |
| `</s>` (EOS)  | 2  |
| `<unk>`       | 3  |
| `<mask>`      | 4  |
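
The special-token ids above determine how a batch is framed and padded. The following sketch shows that bookkeeping in plain Python, using the ids from the table; right-side padding is an assumption, and the helper names `add_special_tokens` and `pad_batch` are hypothetical. In practice, calling the tokenizer on a list of strings with `padding=True` should handle this for you.

```python
# Special-token ids taken from the table above.
BOS_ID, PAD_ID, EOS_ID = 0, 1, 2


def add_special_tokens(ids):
    """Wrap a sequence of token ids in BOS ... EOS."""
    return [BOS_ID, *ids, EOS_ID]


def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad sequences to a common length and build attention masks."""
    width = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (width - len(seq)) for seq in sequences]
    attention_mask = [
        [1] * len(seq) + [0] * (width - len(seq)) for seq in sequences
    ]
    return input_ids, attention_mask


# Ids of the aspirin example without BOS/EOS, plus a short illustrative sequence.
aspirin_ids = [334, 335, 370, 333, 333, 333, 338, 377, 511, 6]
short_ids = [334, 6]

batch = [add_special_tokens(aspirin_ids), add_special_tokens(short_ids)]
input_ids, attention_mask = pad_batch(batch)
print(input_ids[1])
print(attention_mask[1])
```

The attention mask marks padded positions with 0 so that a downstream encoder can ignore them.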

## How to Get Started

```python
from transformers import AutoTokenizer

# The tokenizer ships custom code, so trust_remote_code is required.
tokenizer = AutoTokenizer.from_pretrained(
    "HauserGroup/ApeTokenizer-SELFIES",
    trust_remote_code=True,
    use_fast=False,
)

# A SELFIES string (here, aspirin).
selfies = "[C][C][=Branch1][C][=O][O][C][=C][C][=C][C][=C][Ring1][=Branch1][C][=Branch1][C][=O][O]"

tokens = tokenizer.tokenize(selfies)
print(tokens)
# ['[C][C]', '[=Branch1][C]', '[=O][O]', '[C][=C]', '[C][=C]', '[C][=C]', '[Ring1][=Branch1]', '[C][=Branch1]', '[C][=O]', '[O]']

inputs = tokenizer(selfies, return_tensors="pt")
print(inputs["input_ids"])
# tensor([[ 0, 334, 335, 370, 333, 333, 333, 338, 377, 511, 6, 2]])
```

If you start from SMILES, convert first:

```python
import selfies

smi = "CC(=O)Oc1ccccc1C(=O)O"
sf = selfies.encoder(smi)  # '[C][C][=Branch1][C][=O][O][C]...'

# `tokenizer` is the AutoTokenizer instance loaded in the previous example.
inputs = tokenizer(sf, return_tensors="pt")
```

### Using with ModernMolBERT models

This tokenizer is shared by all four ModernMolBERT checkpoints. Load it from
the model repo using `subfolder="ape_tokenizer"` to avoid routing
`AutoTokenizer` to the built-in fast ModernBERT tokenizer:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "HauserGroup/ModernMolBERT-small",
    subfolder="ape_tokenizer",
    trust_remote_code=True,
    use_fast=False,
)
```

Or load this standalone repo directly as shown above; both produce identical
tokenizations.

## Citation

```bibtex
@article{madsen_modernmolbert,
  title = {ModernMolBERT: A ModernBERT Encoder Family for SELFIES Molecular Language Modeling},
  author = {Madsen, Jakob S. and Angelucci, Sara and Hauser, Alexander S.},
  year = {2026}
}
```

The APE algorithm follows Leon et al., *Comparing SMILES and SELFIES
tokenization for enhanced chemical language modeling*, Sci. Rep. 14, 25016 (2024).