---
license: mit
library_name: transformers
tags:
- chemistry
- molecules
- smiles
- ape-tokenizer
- tokenizer
---

# ApeTokenizer-SMILES

ApeTokenizer-SMILES is an **Atom Pair Encoding (APE)** tokenizer for SMILES strings, trained on ~2M unique canonical SMILES from ChEMBL 36. It is the SMILES counterpart to [ApeTokenizer-SELFIES](https://huggingface.co/HauserGroup/ApeTokenizer-SELFIES), released alongside [ModernMolBERT](https://github.com/HauserGroup/ModernMolBERT).

APE is a byte-pair-style merging scheme applied to SMILES symbol pieces (atoms, bonds, ring and branch symbols). Merges are frequency-driven, so a single token may span structural boundaries (for example `(=O)N(`), matching the APE behaviour described in Leon et al.; token boundaries are not guaranteed to be balanced sub-structures.

## Tokenizer Details

- **Developed by:** Hauser Group, Department of Drug Design and Pharmacology, University of Copenhagen
- **Input representation:** SMILES (canonical)
- **Algorithm:** Atom Pair Encoding (APE), i.e. pair merging over SMILES symbol pieces
- **Vocabulary size:** 1386
- **Max merge pieces:** 6
- **Min merge frequency:** 3000
- **Training corpus size:** 2M unique canonical SMILES (ChEMBL 36)
- **License:** MIT
- **Repository:** https://github.com/HauserGroup/ModernMolBERT

| special token | id |
|---------------|----|
| `` (BOS) | 0 |
| `` | 1 |
| `` (EOS) | 2 |
| `` | 3 |
| `` | 4 |

## How to Get Started

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "HauserGroup/ApeTokenizer-SMILES",
    trust_remote_code=True,
    use_fast=False,
)

# A canonical SMILES (here, aspirin).
smiles = "CC(=O)Oc1ccccc1C(=O)O"
tokens = tokenizer.tokenize(smiles)
print(tokens)
# ['CC(=O)', 'Oc1cc', 'ccc1', 'C(=O)O']

inputs = tokenizer(smiles, return_tensors="pt")
print(inputs["input_ids"])
```

Feed SMILES directly; no SELFIES conversion is needed.
This is a standalone alternative tokenizer and is **not** the tokenizer used by the released SELFIES-based ModernMolBERT checkpoints.

## Citation

```bibtex
@article{madsen_modernmolbert,
  title  = {ModernMolBERT: A ModernBERT Encoder Family for SELFIES Molecular Language Modeling},
  author = {Madsen, Jakob S. and Angelucci, Sara and Hauser, Alexander S.},
  year   = {2026}
}
```

The APE algorithm follows Leon et al., *Comparing SMILES and SELFIES tokenization for enhanced chemical language modeling*, Sci. Rep. 14, 25016 (2024).
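
As a rough illustration of the frequency-driven pair merging that APE performs, the toy sketch below learns merges over SMILES symbol pieces from a tiny corpus. It is not the released tokenizer's implementation. The names `PIECE_RE`, `split_pieces`, `merge_pair`, `learn_merges` and `tokenize` are hypothetical, the piece pattern is deliberately simplified, and the reading of "min merge frequency" (a pair must occur at least that often) and "max merge pieces" (a merged token may span at most that many symbol pieces) is an assumption.

```python
import re
from collections import Counter

# Simplified piece pattern (an assumption, not the released tokenizer's rule):
# bracket atoms, two-letter halogens, %NN ring closures, then any single character.
PIECE_RE = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def split_pieces(smiles):
    """Split a SMILES string into atom, bond, ring and branch symbols."""
    return PIECE_RE.findall(smiles)


def merge_pair(seq, pair):
    """Replace each non-overlapping occurrence of `pair` in `seq` with one token."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def learn_merges(corpus, min_freq=2, max_pieces=6):
    """Greedily learn merges, taking the most frequent adjacent pair each round."""
    seqs = [split_pieces(s) for s in corpus]
    merges = []
    while True:
        counts = Counter()
        for seq in seqs:
            counts.update(zip(seq, seq[1:]))
        # Keep pairs that are frequent enough and whose merged token
        # would not span more than `max_pieces` symbol pieces.
        candidates = [
            (freq, pair)
            for pair, freq in counts.items()
            if freq >= min_freq
            and len(split_pieces(pair[0] + pair[1])) <= max_pieces
        ]
        if not candidates:
            return merges
        _, best = max(candidates)  # ties fall back to string order of the pair
        merges.append(best)
        seqs = [merge_pair(seq, best) for seq in seqs]


def tokenize(smiles, merges):
    """Apply the learned merges, in training order, to a new SMILES string."""
    seq = split_pieces(smiles)
    for pair in merges:
        seq = merge_pair(seq, pair)
    return seq


# Toy corpus; the thresholds are far smaller than those listed for the real tokenizer.
corpus = ["CC(=O)O", "CC(=O)N", "CC(=O)Cl"]
merges = learn_merges(corpus, min_freq=3, max_pieces=6)
print(tokenize("CC(=O)O", merges))
```

Because merges are chosen purely by frequency, nothing in this procedure keeps brackets or rings balanced inside a token, which is why a token such as `(=O)N(` can arise.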