---
license: mit
library_name: transformers
tags:
- chemistry
- molecules
- selfies
- ape-tokenizer
- tokenizer
---

# ApeTokenizer-SELFIES

ApeTokenizer-SELFIES is the **Atom Pair Encoding (APE)** tokenizer used by
[ModernMolBERT](https://github.com/HauserGroup/ModernMolBERT), a family of
compact encoder-only transformer models for small-molecule representation
learning, pre-trained on SELFIES strings from ChEMBL 36.

APE applies byte-pair-style merging to SELFIES bracket tokens (such as `[C]`
or `[=Branch1]`) rather than to individual characters, so every token boundary
falls between complete SELFIES symbols. The vocabulary was learned from ~2M
unique SELFIES strings from ChEMBL 36.
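
To make the merging idea concrete, here is a minimal sketch. It is not the tokenizer's actual implementation: it assumes a greedy left-to-right pass, and `TOY_MERGES` is a small hypothetical merge set chosen for illustration. The real vocabulary and merge rules ship with the tokenizer files.

```python
import re

# Hypothetical merge set for illustration only; the real vocabulary has 631 entries.
TOY_MERGES = {
    "[C][C]", "[=Branch1][C]", "[=O][O]", "[C][=C]",
    "[Ring1][=Branch1]", "[C][=Branch1]", "[C][=O]",
}


def split_selfies(selfies_str):
    """Split a SELFIES string into its bracket symbols, e.g. '[C][=O]' -> ['[C]', '[=O]']."""
    return re.findall(r"\[[^\]]+\]", selfies_str)


def toy_ape_tokenize(selfies_str, merges=TOY_MERGES):
    """Greedily merge adjacent symbol pairs that appear in `merges` (at most 2 symbols per token)."""
    symbols = split_selfies(selfies_str)
    tokens = []
    i = 0
    while i < len(symbols):
        pair = "".join(symbols[i:i + 2])
        if i + 1 < len(symbols) and pair in merges:
            tokens.append(pair)  # merged token made of two symbols
            i += 2
        else:
            tokens.append(symbols[i])  # fall back to the single symbol
            i += 1
    return tokens


aspirin = "[C][C][=Branch1][C][=O][O][C][=C][C][=C][C][=C][Ring1][=Branch1][C][=Branch1][C][=O][O]"
print(toy_ape_tokenize(aspirin))
# ['[C][C]', '[=Branch1][C]', '[=O][O]', '[C][=C]', '[C][=C]', '[C][=C]', '[Ring1][=Branch1]', '[C][=Branch1]', '[C][=O]', '[O]']
```

Because merges only ever join whole bracket symbols, concatenating the tokens always gives back the original SELFIES string.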

## Tokenizer Details

- **Developed by:** Hauser Group, Department of Drug Design and Pharmacology, University of Copenhagen
- **Input representation:** SELFIES (convert SMILES first; see below)
- **Algorithm:** Atom Pair Encoding (APE) — pair merging over SELFIES bracket tokens
- **Vocabulary size:** 631
- **Max merge pieces:** 2
- **Min merge frequency:** 3000
- **Training corpus size:** ~2M unique SELFIES strings (ChEMBL 36)
- **License:** MIT
- **Repository:** https://github.com/HauserGroup/ModernMolBERT

| special token | id |
|---------------|----|
| `<s>` (BOS) | 0 |
| `<pad>` | 1 |
| `</s>` (EOS) | 2 |
| `<unk>` | 3 |
| `<mask>` | 4 |
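
As a rough illustration of how these ids are used, the sketch below wraps a sequence in `<s>` and `</s>` and pads it with `<pad>`. The helper `frame_and_pad` is hypothetical, the input ids are arbitrary examples, and right-side padding is an assumption. In practice, calling the tokenizer handles framing and padding for you.

```python
# Special-token ids from the table above.
BOS_ID, PAD_ID, EOS_ID, UNK_ID, MASK_ID = 0, 1, 2, 3, 4


def frame_and_pad(token_ids, max_length):
    """Add BOS/EOS around `token_ids`, then right-pad to `max_length` (assumed padding side)."""
    ids = [BOS_ID, *token_ids, EOS_ID]
    if len(ids) > max_length:
        raise ValueError("sequence is longer than max_length")
    n_pad = max_length - len(ids)
    attention_mask = [1] * len(ids) + [0] * n_pad  # 0 marks padding positions
    return ids + [PAD_ID] * n_pad, attention_mask


input_ids, attention_mask = frame_and_pad([334, 335, 370], max_length=8)
print(input_ids)       # [0, 334, 335, 370, 2, 1, 1, 1]
print(attention_mask)  # [1, 1, 1, 1, 1, 0, 0, 0]
```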

## How to Get Started

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "HauserGroup/ApeTokenizer-SELFIES",
    trust_remote_code=True,
    use_fast=False,
)

# SELFIES string for aspirin.
selfies_str = "[C][C][=Branch1][C][=O][O][C][=C][C][=C][C][=C][Ring1][=Branch1][C][=Branch1][C][=O][O]"

# Split into APE tokens (no special tokens are added here).
tokens = tokenizer.tokenize(selfies_str)
print(tokens)
# ['[C][C]', '[=Branch1][C]', '[=O][O]', '[C][=C]', '[C][=C]', '[C][=C]', '[Ring1][=Branch1]', '[C][=Branch1]', '[C][=O]', '[O]']

# Encode to ids; <s> (0) and </s> (2) wrap the sequence.
# return_tensors="pt" requires PyTorch.
inputs = tokenizer(selfies_str, return_tensors="pt")
print(inputs["input_ids"])
# tensor([[  0, 334, 335, 370, 333, 333, 333, 338, 377, 511,   6,   2]])
```

If you start from SMILES, convert first:

```python
import selfies  # pip install selfies

smi = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin
sf = selfies.encoder(smi)  # '[C][C][=Branch1][C][=O][O][C]...'

# `tokenizer` is the instance loaded in the previous example.
inputs = tokenizer(sf, return_tensors="pt")

### Using with ModernMolBERT models

This tokenizer is shared by all four ModernMolBERT checkpoints. When loading it
from a model repo, pass `subfolder="ape_tokenizer"` so that `AutoTokenizer` is
not routed to the built-in fast ModernBERT tokenizer:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "HauserGroup/ModernMolBERT-small",
    subfolder="ape_tokenizer",
    trust_remote_code=True,
    use_fast=False,
)
```

Alternatively, load this standalone repo directly as shown above; both routes
produce identical tokenizations.

## Citation

```bibtex
@article{madsen_modernmolbert,
  title  = {ModernMolBERT: A ModernBERT Encoder Family for SELFIES Molecular Language Modeling},
  author = {Madsen, Jakob S. and Angelucci, Sara and Hauser, Alexander S.},
  year   = {2026}
}
```

The APE algorithm follows Leon et al., *Comparing SMILES and SELFIES
tokenization for enhanced chemical language modeling*, Sci. Rep. 14, 25016 (2024).