---
license: mit
library_name: transformers
tags:
- chemistry
- molecules
- smiles
- ape-tokenizer
- tokenizer
---
# ApeTokenizer-SMILES
ApeTokenizer-SMILES is an **Atom Pair Encoding (APE)** tokenizer for SMILES
strings, trained on ~2M unique canonical SMILES from ChEMBL 36. It is
the SMILES counterpart to
[ApeTokenizer-SELFIES](https://huggingface.co/HauserGroup/ApeTokenizer-SELFIES),
released alongside
[ModernMolBERT](https://github.com/HauserGroup/ModernMolBERT).

APE is a byte-pair-style merging scheme applied to SMILES symbol pieces (atoms,
bonds, ring and branch symbols). Merges are frequency-driven, so a single token
may span structural boundaries (for example `(=O)N(`). This matches the APE
behaviour described in Leon et al.: token boundaries are not guaranteed to
correspond to balanced sub-structures.
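
The merging idea can be illustrated with a toy trainer. This is a minimal sketch, not the released implementation: the splitting regex is the widely used atom-level SMILES pattern (an assumption about how pieces are formed), the function names `split_pieces`, `merge_pair` and `train_ape` are hypothetical, and "max merge pieces" is interpreted here as the maximum number of base pieces a merged token may contain.

```python
import re
from collections import Counter

# Widely used atom-level SMILES pattern (assumption: the released
# tokenizer may split symbol pieces differently).
SMILES_PIECE = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+"
    r"|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)


def split_pieces(smiles):
    """Split a SMILES string into base symbol pieces."""
    return SMILES_PIECE.findall(smiles)


def merge_pair(seq, a, b):
    """Replace each adjacent (a, b) in seq with the merged token a + b."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
            out.append(a + b)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def train_ape(corpus, min_freq=2, max_pieces=6, max_merges=100):
    """Greedy frequency-driven pair merging (hypothetical toy trainer).

    Each token is a tuple of base pieces, so len(token) is the number
    of pieces it spans.
    """
    seqs = [[(p,) for p in split_pieces(s)] for s in corpus]
    merges = []
    for _ in range(max_merges):
        pairs = Counter()
        for seq in seqs:
            for a, b in zip(seq, seq[1:]):
                if len(a) + len(b) <= max_pieces:  # cap merged token size
                    pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), freq = pairs.most_common(1)[0]
        if freq < min_freq:  # stop once no pair is frequent enough
            break
        merges.append("".join(a + b))
        seqs = [merge_pair(seq, a, b) for seq in seqs]
    return merges, seqs


# Toy corpus and thresholds, chosen only for illustration.
toy_corpus = ["CC(=O)Oc1ccccc1C(=O)O", "CC(=O)NC1=CC=C(O)C=C1", "CC(=O)O"]
merges, tokenized = train_ape(toy_corpus, min_freq=2, max_pieces=6)
print(merges)
print([["".join(t) for t in seq] for seq in tokenized])
```

Because merges are chosen purely by frequency, nothing in this loop prevents a merged token from containing an unmatched `(` or ring digit, which is the behaviour described above.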
## Tokenizer Details
- **Developed by:** Hauser Group, Department of Drug Design and Pharmacology, University of Copenhagen
- **Input representation:** SMILES (canonical)
- **Algorithm:** Atom Pair Encoding (APE) — pair merging over SMILES symbol pieces
- **Vocabulary size:** 1386
- **Max merge pieces:** 6
- **Min merge frequency:** 3000
- **Training corpus size:** ~2M unique canonical SMILES (ChEMBL 36)
- **License:** MIT
- **Repository:** https://github.com/HauserGroup/ModernMolBERT

| Special token | ID |
|---------------|----|
| `<s>` (BOS)   | 0  |
| `<pad>`       | 1  |
| `</s>` (EOS)  | 2  |
| `<unk>`       | 3  |
| `<mask>`      | 4  |
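
As a small illustration of how the BOS and EOS ids in the table might be used, here is a minimal sketch. It assumes the tokenizer frames each sequence as BOS, tokens, EOS, which this card does not confirm; check the `input_ids` produced by the released tokenizer before relying on it. The helper names are hypothetical.

```python
BOS_ID = 0  # id listed for BOS in the table
EOS_ID = 2  # id listed for EOS in the table


def add_bos_eos(ids: list[int]) -> list[int]:
    """Wrap raw token ids as BOS ... EOS (assumed framing, not confirmed)."""
    return [BOS_ID, *ids, EOS_ID]


def strip_bos_eos(ids: list[int]) -> list[int]:
    """Drop a leading BOS and a trailing EOS if they are present."""
    start = 1 if ids and ids[0] == BOS_ID else 0
    has_eos = len(ids) > start and ids[-1] == EOS_ID
    end = len(ids) - 1 if has_eos else len(ids)
    return ids[start:end]


framed = add_bos_eos([17, 42, 99])  # 17, 42, 99 are made-up token ids
print(framed)
print(strip_bos_eos(framed))
```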
## How to Get Started
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "HauserGroup/ApeTokenizer-SMILES",
    trust_remote_code=True,  # loads the custom tokenizer code from the repo
    use_fast=False,
)

# A canonical SMILES, here aspirin.
smiles = "CC(=O)Oc1ccccc1C(=O)O"

tokens = tokenizer.tokenize(smiles)
print(tokens)
# ['CC(=O)', 'Oc1cc', 'ccc1', 'C(=O)O']

# return_tensors="pt" requires PyTorch to be installed.
inputs = tokenizer(smiles, return_tensors="pt")
print(inputs["input_ids"])
```
Feed SMILES strings directly; no SELFIES conversion is needed. This is a
standalone alternative tokenizer and is **not** the tokenizer used by the
released SELFIES-based ModernMolBERT checkpoints.
## Citation
```bibtex
@article{madsen_modernmolbert,
  title  = {ModernMolBERT: A ModernBERT Encoder Family for SELFIES Molecular Language Modeling},
  author = {Madsen, Jakob S. and Angelucci, Sara and Hauser, Alexander S.},
  year   = {2026}
}
```
The APE algorithm follows Leon et al., *Comparing SMILES and SELFIES
tokenization for enhanced chemical language modeling*, Sci. Rep. 14, 25016 (2024).