---
license: mit
library_name: transformers
tags:
- chemistry
- molecules
- smiles
- ape-tokenizer
- tokenizer
---
# ApeTokenizer-SMILES

ApeTokenizer-SMILES is an **Atom Pair Encoding (APE)** tokenizer for SMILES
strings, trained on ~2M unique canonical SMILES from ChEMBL 36. It is
the SMILES counterpart to
[ApeTokenizer-SELFIES](https://huggingface.co/HauserGroup/ApeTokenizer-SELFIES),
released alongside
[ModernMolBERT](https://github.com/HauserGroup/ModernMolBERT).

APE is a byte-pair-style merging scheme applied to SMILES symbol pieces (atoms,
bonds, ring and branch symbols). Merges are frequency-driven, so a single token
may span structural boundaries (for example `(=O)N(`), matching the APE
behaviour described in Leon et al. Token boundaries are therefore not guaranteed
to be balanced sub-structures.
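The merging scheme described above can be illustrated with a small toy trainer. This is a minimal sketch, not the released training code: the regex used to split SMILES into symbol pieces, the tie-breaking between equally frequent pairs, and the names `split_pieces`, `apply_merge` and `learn_merges` are all assumptions made for illustration. It also assumes that "max merge pieces" caps the number of symbol pieces inside one token.

```python
import re
from collections import Counter

# Assumed pre-tokenization pattern (not taken from the released code):
# bracket atoms, two-letter halogens, %nn ring closures, then any single character.
SMILES_PIECE = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def split_pieces(smiles):
    """Split a SMILES string into symbol pieces (atoms, bonds, ring and branch symbols)."""
    return SMILES_PIECE.findall(smiles)


def apply_merge(seq, pair):
    """Replace each adjacent occurrence of `pair` in `seq` with one merged token."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])  # tuple concatenation keeps the pieces
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def learn_merges(corpus, min_freq=2, max_pieces=6):
    """Greedily merge the most frequent adjacent pair until none reaches `min_freq`."""
    # Each token is a tuple of pieces, so len(token) is its piece count.
    seqs = [[(p,) for p in split_pieces(s)] for s in corpus]
    merges = []
    while True:
        counts = Counter()
        for seq in seqs:
            for a, b in zip(seq, seq[1:]):
                if len(a) + len(b) <= max_pieces:
                    counts[(a, b)] += 1
        if not counts:
            break
        pair, freq = counts.most_common(1)[0]
        if freq < min_freq:
            break
        merges.append(pair)
        seqs = [apply_merge(seq, pair) for seq in seqs]
    return merges, seqs


# Tiny made-up corpus; the real tokenizer was trained on ~2M SMILES.
corpus = ["CC(=O)O", "CC(=O)N", "CC(=O)Cl", "c1ccccc1C(=O)O"]
merges, seqs = learn_merges(corpus, min_freq=2, max_pieces=6)
print(["".join(tok) for tok in seqs[0]])
```

Because merges are chosen purely by frequency, nothing in this loop stops a token from ending on an open parenthesis or in the middle of a ring, which is the behaviour the paragraph above describes.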
## Tokenizer Details

- **Developed by:** Hauser Group, Department of Drug Design and Pharmacology, University of Copenhagen
- **Input representation:** SMILES (canonical)
- **Algorithm:** Atom Pair Encoding (APE), i.e. pair merging over SMILES symbol pieces
- **Vocabulary size:** 1386
- **Max merge pieces:** 6
- **Min merge frequency:** 3000
- **Training corpus size:** ~2M unique canonical SMILES (ChEMBL 36)
- **License:** MIT
- **Repository:** https://github.com/HauserGroup/ModernMolBERT

| special token | id |
|---------------|----|
| `<s>` (BOS)   | 0  |
| `<pad>`       | 1  |
| `</s>` (EOS)  | 2  |
| `<unk>`       | 3  |
| `<mask>`      | 4  |
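To show how the special-token ids in the table are typically used, here is a minimal pure-Python sketch of wrapping and padding a batch. It assumes sequences are wrapped as `<s> ... </s>` and padded on the right, which is not confirmed by this card. The helper names and the ids 10, 11, 12 and 20 are placeholders, not real vocabulary entries.

```python
# Special-token ids from the table above.
BOS_ID, PAD_ID, EOS_ID, UNK_ID, MASK_ID = 0, 1, 2, 3, 4


def add_special_tokens(ids):
    """Wrap one sequence as <s> ... </s> (assumed layout)."""
    return [BOS_ID] + list(ids) + [EOS_ID]


def pad_batch(batch):
    """Right-pad wrapped sequences to a common length and build attention masks."""
    wrapped = [add_special_tokens(ids) for ids in batch]
    width = max(len(seq) for seq in wrapped)
    input_ids = [seq + [PAD_ID] * (width - len(seq)) for seq in wrapped]
    attention_mask = [[1] * len(seq) + [0] * (width - len(seq)) for seq in wrapped]
    return input_ids, attention_mask


# Placeholder token ids, chosen only for illustration.
input_ids, attention_mask = pad_batch([[10, 11, 12], [20]])
print(input_ids)       # [[0, 10, 11, 12, 2], [0, 20, 2, 1, 1]]
print(attention_mask)  # [[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]]
```

In practice the loaded tokenizer would normally handle this itself when called on a list of SMILES with `padding=True`, provided the custom tokenizer class supports the standard Transformers call interface.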
## How to Get Started

```python
from transformers import AutoTokenizer

# The repository ships custom tokenizer code, hence trust_remote_code=True;
# use_fast=False requests the Python (slow) tokenizer class.
tokenizer = AutoTokenizer.from_pretrained(
    "HauserGroup/ApeTokenizer-SMILES",
    trust_remote_code=True,
    use_fast=False,
)

# A canonical SMILES, here aspirin.
smiles = "CC(=O)Oc1ccccc1C(=O)O"

# Inspect the APE tokens.
tokens = tokenizer.tokenize(smiles)
print(tokens)
# ['CC(=O)', 'Oc1cc', 'ccc1', 'C(=O)O']

# Encode to model inputs; return_tensors="pt" requires PyTorch.
inputs = tokenizer(smiles, return_tensors="pt")
print(inputs["input_ids"])
```

Feed SMILES directly; no SELFIES conversion is needed. This is a standalone
alternative tokenizer and is **not** the tokenizer used by the released
SELFIES-based ModernMolBERT checkpoints.

## Citation

```bibtex
@article{madsen_modernmolbert,
  title  = {ModernMolBERT: A ModernBERT Encoder Family for SELFIES Molecular Language Modeling},
  author = {Madsen, Jakob S. and Angelucci, Sara and Hauser, Alexander S.},
  year   = {2026}
}
```

The APE algorithm follows Leon et al., *Comparing SMILES and SELFIES
tokenization for enhanced chemical language modeling*, Sci. Rep. 14, 25016 (2024).