---
license: mit
library_name: transformers
tags:
- chemistry
- molecules
- selfies
- ape-tokenizer
- tokenizer
---

# ApeTokenizer-SELFIES

ApeTokenizer-SELFIES is the **Atom Pair Encoding (APE)** tokenizer used by [ModernMolBERT](https://github.com/HauserGroup/ModernMolBERT), a family of compact encoder-only transformer models for small-molecule representation learning pre-trained on SELFIES strings from ChEMBL 36.

APE is a byte-pair-style merging scheme applied directly to SELFIES bracket tokens, so every token boundary aligns with a chemically valid SELFIES primitive. The vocabulary is derived from ~2M unique SELFIES strings from ChEMBL 36.

## Tokenizer Details

- **Developed by:** Hauser Group, Department of Drug Design and Pharmacology, University of Copenhagen
- **Input representation:** SELFIES (convert SMILES first; see below)
- **Algorithm:** Atom Pair Encoding (APE): pair merging over SELFIES bracket tokens
- **Vocabulary size:** 631
- **Max merge pieces:** 2
- **Min merge frequency:** 3000
- **Training corpus size:** 2M unique SELFIES (ChEMBL 36)
- **License:** MIT
- **Repository:** https://github.com/HauserGroup/ModernMolBERT

| special token | id |
|---------------|----|
| `` (BOS) | 0 |
| `` | 1 |
| `` (EOS) | 2 |
| `` | 3 |
| `` | 4 |

## How to Get Started

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "HauserGroup/ApeTokenizer-SELFIES",
    trust_remote_code=True,
    use_fast=False,
)

# A SELFIES string (here, aspirin).
selfies = "[C][C][=Branch1][C][=O][O][C][=C][C][=C][C][=C][Ring1][=Branch1][C][=Branch1][C][=O][O]"

tokens = tokenizer.tokenize(selfies)
print(tokens)
# ['[C][C]', '[=Branch1][C]', '[=O][O]', '[C][=C]', '[C][=C]', '[C][=C]', '[Ring1][=Branch1]', '[C][=Branch1]', '[C][=O]', '[O]']

inputs = tokenizer(selfies, return_tensors="pt")
print(inputs["input_ids"])
# tensor([[ 0, 334, 335, 370, 333, 333, 333, 338, 377, 511, 6, 2]])
```

If you start from SMILES, convert first:

```python
import selfies

smi = "CC(=O)Oc1ccccc1C(=O)O"
sf = selfies.encoder(smi)  # '[C][C][=Branch1][C][=O][O][C]...'
inputs = tokenizer(sf, return_tensors="pt")
```

### Using with ModernMolBERT models

This tokenizer is shared by all four ModernMolBERT checkpoints. Load it from the model repo using `subfolder="ape_tokenizer"` to avoid routing `AutoTokenizer` to the built-in fast ModernBERT tokenizer:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "HauserGroup/ModernMolBERT-small",
    subfolder="ape_tokenizer",
    trust_remote_code=True,
    use_fast=False,
)
```

Or load this standalone repo directly as shown above; both produce identical tokenizations.

## Citation

```bibtex
@article{madsen_modernmolbert,
  title  = {ModernMolBERT: A ModernBERT Encoder Family for SELFIES Molecular Language Modeling},
  author = {Madsen, Jakob S. and Angelucci, Sara and Hauser, Alexander S.},
  year   = {2026}
}
```

The APE algorithm follows Leon et al., *Comparing SMILES and SELFIES tokenization for enhanced chemical language modeling*, Sci. Rep. 14, 25016 (2024).
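
To make the idea of pair merging over SELFIES bracket tokens concrete, here is a minimal, self-contained sketch. It is not the implementation shipped in this repository: the helper names (`split_selfies`, `learn_pair_merges`, `tokenize`), the single counting pass, and the greedy left-to-right matching are all assumptions made for illustration, and the toy corpus and `min_freq=1` stand in for the real ChEMBL 36 corpus and the 3000 threshold listed under Tokenizer Details. Merged tokens are capped at two bracket tokens, mirroring the "Max merge pieces: 2" setting.

```python
import re
from collections import Counter

# Aspirin, the same SELFIES string used in "How to Get Started".
ASPIRIN = (
    "[C][C][=Branch1][C][=O][O][C][=C][C][=C][C][=C]"
    "[Ring1][=Branch1][C][=Branch1][C][=O][O]"
)


def split_selfies(s):
    """Split a SELFIES string into its bracket tokens, e.g. '[C][=O]' -> ['[C]', '[=O]']."""
    return re.findall(r"\[[^\]]*\]", s)


def learn_pair_merges(corpus, min_freq):
    """Hypothetical trainer: keep every adjacent pair seen at least min_freq times."""
    counts = Counter()
    for s in corpus:
        toks = split_selfies(s)
        counts.update(zip(toks, toks[1:]))
    return {a + b for (a, b), n in counts.items() if n >= min_freq}


def tokenize(s, merges):
    """Greedy left-to-right matching: emit a merged pair when it is known, else one token."""
    toks = split_selfies(s)
    out, i = [], 0
    while i < len(toks):
        if i + 1 < len(toks) and toks[i] + toks[i + 1] in merges:
            out.append(toks[i] + toks[i + 1])
            i += 2
        else:
            out.append(toks[i])
            i += 1
    return out


# Toy setup (assumption): a one-molecule corpus with min_freq=1, so every pair is merged.
merges = learn_pair_merges([ASPIRIN], min_freq=1)
tokens = tokenize(ASPIRIN, merges)
print(tokens)
```

Because merges only ever join whole bracket tokens, concatenating the output always reproduces the input string, and no token splits a SELFIES primitive. In this toy setup the sketch yields ten tokens for aspirin (nine pairs plus a trailing `[O]`), but that agreement with the example output above comes from the all-pairs toy vocabulary and should not be read as confirmation of how the real tokenizer is trained or applied.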