---
license: mit
library_name: transformers
tags:
- chemistry
- molecules
- smiles
- ape-tokenizer
- tokenizer
---

# ApeTokenizer-SMILES

ApeTokenizer-SMILES is an **Atom Pair Encoding (APE)** tokenizer for SMILES
strings, trained on ~2M unique canonical SMILES from ChEMBL 36. It is
the SMILES counterpart to
[ApeTokenizer-SELFIES](https://huggingface.co/HauserGroup/ApeTokenizer-SELFIES),
released alongside
[ModernMolBERT](https://github.com/HauserGroup/ModernMolBERT).

APE is a byte-pair-style merging scheme applied to SMILES symbol pieces (atoms,
bonds, ring and branch symbols). Merges are frequency-driven, so a single token
may span structural boundaries (for example `(=O)N(`). This matches the APE
behaviour described in Leon et al.: token boundaries are not guaranteed to be
balanced sub-structures.
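
To make the merging idea concrete, here is a minimal toy sketch of frequency-driven pair merging over SMILES pieces. It is not the released tokenizer's implementation: the piece-splitting regex, the function names (`split_pieces`, `learn_merges`, `apply_merge`, `encode`), and the four-molecule corpus are all assumptions made for illustration. The `max_pieces` and `min_freq` arguments are meant to play the role of the "Max merge pieces" and "Min merge frequency" settings listed under Tokenizer Details, with a much smaller threshold so the toy corpus produces merges.

```python
import re
from collections import Counter

# Simplified piece pattern (an assumption, not the released rule set):
# bracket atoms, two-letter halogens, two-digit ring closures, then any single character.
PIECE_RE = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def split_pieces(smiles):
    """Split a SMILES string into atom, bond, ring and branch pieces."""
    return PIECE_RE.findall(smiles)


def apply_merge(seq, a, b):
    """Replace every adjacent (a, b) pair in seq with the merged token a + b."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
            out.append(a + b)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def learn_merges(corpus, max_pieces=6, min_freq=2, max_merges=50):
    """Greedily learn merges; each token is a tuple of pieces, so len(token) is its piece count."""
    seqs = [[(p,) for p in split_pieces(s)] for s in corpus]
    merges = []
    while len(merges) < max_merges:
        counts = Counter()
        for seq in seqs:
            for a, b in zip(seq, seq[1:]):
                # Skip pairs whose merged token would exceed the piece limit.
                if len(a) + len(b) <= max_pieces:
                    counts[(a, b)] += 1
        if not counts:
            break
        (a, b), freq = counts.most_common(1)[0]
        if freq < min_freq:
            break
        merges.append((a, b))
        seqs = [apply_merge(seq, a, b) for seq in seqs]
    return merges


def encode(smiles, merges):
    """Tokenize one SMILES string by replaying the learned merges in order."""
    seq = [(p,) for p in split_pieces(smiles)]
    for a, b in merges:
        seq = apply_merge(seq, a, b)
    return ["".join(tok) for tok in seq]


# Tiny illustrative corpus (hypothetical, not the ChEMBL training data).
corpus = ["CC(=O)Oc1ccccc1C(=O)O", "CC(=O)Nc1ccc(O)cc1", "CC(=O)O", "c1ccccc1"]
merges = learn_merges(corpus, max_pieces=6, min_freq=2)
print(encode("CC(=O)Oc1ccccc1C(=O)O", merges))
```

Because merges are chosen purely by pair frequency, nothing in this sketch prevents a token from starting or ending in the middle of a branch or ring, which is the boundary-spanning behaviour noted above.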

## Tokenizer Details

- **Developed by:** Hauser Group, Department of Drug Design and Pharmacology, University of Copenhagen
- **Input representation:** SMILES (canonical)
- **Algorithm:** Atom Pair Encoding (APE) — pair merging over SMILES symbol pieces
- **Vocabulary size:** 1386
- **Max merge pieces:** 6
- **Min merge frequency:** 3000
- **Training corpus size:** ~2M unique canonical SMILES (ChEMBL 36)
- **License:** MIT
- **Repository:** https://github.com/HauserGroup/ModernMolBERT

| special token | id |
|---------------|----|
| `<s>` (BOS) | 0 |
| `<pad>` | 1 |
| `</s>` (EOS) | 2 |
| `<unk>` | 3 |
| `<mask>` | 4 |
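
As a rough illustration of how these ids are typically used, the sketch below wraps id sequences in BOS/EOS and right-pads them by hand. Only the five special-token ids come from the table above; the `pad_batch` helper, the content ids, the BOS/EOS wrapping, and the right-padding convention are assumptions for illustration and may differ from what the tokenizer itself does.

```python
# Special-token ids copied from the table above.
BOS_ID, PAD_ID, EOS_ID, UNK_ID, MASK_ID = 0, 1, 2, 3, 4


def pad_batch(sequences, add_special_tokens=True):
    """Wrap each sequence in BOS/EOS (assumed convention) and right-pad to equal length."""
    if add_special_tokens:
        sequences = [[BOS_ID, *seq, EOS_ID] for seq in sequences]
    width = max(len(seq) for seq in sequences)
    input_ids = [seq + [PAD_ID] * (width - len(seq)) for seq in sequences]
    # 1 marks real tokens, 0 marks padding positions.
    attention_mask = [[1] * len(seq) + [0] * (width - len(seq)) for seq in sequences]
    return input_ids, attention_mask


# Hypothetical content ids, chosen only for illustration.
batch = [[17, 42, 9], [8]]
input_ids, attention_mask = pad_batch(batch)
print(input_ids)       # [[0, 17, 42, 9, 2], [0, 8, 2, 1, 1]]
print(attention_mask)  # [[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]]
```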

## How to Get Started

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "HauserGroup/ApeTokenizer-SMILES",
    trust_remote_code=True,
    use_fast=False,
)

# A canonical SMILES — here aspirin.
smiles = "CC(=O)Oc1ccccc1C(=O)O"

tokens = tokenizer.tokenize(smiles)
print(tokens)
# ['CC(=O)', 'Oc1cc', 'ccc1', 'C(=O)O']

# return_tensors="pt" requires PyTorch to be installed.
inputs = tokenizer(smiles, return_tensors="pt")
print(inputs["input_ids"])
```

Feed SMILES directly — no SELFIES conversion is needed. This is a standalone
alternative tokenizer and is **not** the tokenizer used by the released
SELFIES-based ModernMolBERT checkpoints.

## Citation

```bibtex
@article{madsen_modernmolbert,
  title  = {ModernMolBERT: A ModernBERT Encoder Family for SELFIES Molecular Language Modeling},
  author = {Madsen, Jakob S. and Angelucci, Sara and Hauser, Alexander S.},
  year   = {2026}
}
```

The APE algorithm follows Leon et al., *Comparing SMILES and SELFIES
tokenization for enhanced chemical language modeling*, Sci. Rep. 14, 25016 (2024).